Description
The function castToDeclContext takes around 2% of the execution time in my test run.
I was profiling clang to see whether I could spot any small mistakes that take a significant amount of time on debug builds. Pretty much everything is either too complicated for me to optimize or already very performant, but I noticed that valgrind was reporting an interesting function at the top of the 'auto' category:
It is being executed 1.8 billion times, and the implementation looks pretty trivial to me:
Looking at the assembly, I expected pretty much a lookup table and an add operation, but it is pretty clear that the resulting code is not ideal:
clang::Decl::castFromDeclContext(clang::DeclContext const*): # @clang::Decl::castFromDeclContext(clang::DeclContext const*)
.L_ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE$local:
movzwl 8(%rdi), %edx
movq %rdi, %rax
andl $127, %edx
leal -1(%rdx), %esi
cmpl $84, %esi
ja .LBB65_7
leaq .LJTI65_0(%rip), %rdi
movq $-40, %rcx
movslq (%rdi,%rsi,4), %rsi
addq %rdi, %rsi
jmpq *%rsi
.LBB65_2:
addq %rcx, %rax
retq
.LBB65_4:
movq $-48, %rcx
addq %rcx, %rax
retq
.LBB65_5:
movq $-64, %rcx
addq %rcx, %rax
retq
.LBB65_6:
movq $-56, %rcx
addq %rcx, %rax
retq
.LBB65_7:
leal -53(%rdx), %esi
movq $-72, %rcx
cmpl $6, %esi
jb .LBB65_2
addl $-34, %edx
xorl %ecx, %ecx
cmpl $5, %edx
setae %cl
shll $4, %ecx
orq $-64, %rcx
addq %rcx, %rax
retq
.LJTI65_0:
<JUMP TABLE>
The PR I'll submit in a few minutes fixes this problem by eliminating the need for the macro DECL_CONTEXT_BASE (it's only used here and in two other analogous functions) and by reordering the AST decl order to prioritize classes that inherit from DeclContext. I also experimented with hand-rolled offset tables, but that approach is far from maintainable, even though it manages to compress 3 lookup tables into one. The resulting assembly is just:
clang::Decl::castFromDeclContext(clang::DeclContext const*): # @clang::Decl::castFromDeclContext(clang::DeclContext const*)
.L_ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE$local:
movzwl 8(%rdi), %ecx
leaq .Lswitch.table._ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE(%rip), %rdx
movq %rdi, %rax
andl $127, %ecx
addq (%rdx,%rcx,8), %rax
retq
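To make the difference concrete, here is a minimal C++ sketch of what the optimized assembly effectively does: load the kind, index a table of pointer adjustments, and add the result. It is not the code from the PR and does not use clang's real classes; `Kind`, `Ctx`, `Base`, `Extra`, `NodeA`, `NodeB`, and the two cast functions are all hypothetical stand-ins. In the actual change the table is presumably emitted by the compiler from the switch, not written by hand.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical miniature hierarchy; these are NOT clang's real classes.
// The kinds are dense and start at zero so they can index a table directly.
enum Kind : unsigned { K_NodeA = 0, K_NodeB = 1, K_Count = 2 };

struct Ctx { // stands in for DeclContext
  explicit Ctx(Kind k) : kind(k) {}
  Kind kind;
};
struct Base { int id = 0; };              // stands in for Decl
struct Extra { double pad[2] = {0, 0}; }; // only there to shift Ctx in NodeB

// Ctx sits at a different offset inside each node type.
struct NodeA : Base, Ctx { NodeA() : Ctx(K_NodeA) {} };
struct NodeB : Base, Extra, Ctx { NodeB() : Ctx(K_NodeB) {} };

// Switch form: each case applies its own pointer adjustment.
inline Base *castFromCtxSwitch(Ctx *c) {
  switch (c->kind) {
  case K_NodeA: return static_cast<NodeA *>(c);
  case K_NodeB: return static_cast<NodeB *>(c);
  default: return nullptr;
  }
}

// Byte distance from the Ctx subobject to the Base subobject of a T,
// measured on a temporary instance.
template <class T> std::ptrdiff_t ctxToBaseOffset() {
  T obj;
  Ctx *c = &obj;
  Base *b = &obj;
  return reinterpret_cast<char *>(b) - reinterpret_cast<char *>(c);
}

// Table form: one load indexed by kind, one add, no branches.
inline Base *castFromCtxTable(Ctx *c) {
  static const std::ptrdiff_t table[K_Count] = {
      ctxToBaseOffset<NodeA>(), ctxToBaseOffset<NodeB>()};
  return reinterpret_cast<Base *>(reinterpret_cast<char *>(c) +
                                  table[c->kind]);
}
```

Both functions return the same pointer for a given `Ctx *`; the table form only works this simply because every kind maps to exactly one offset and the kinds form a dense range.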
And the build-time difference for clang+clang-tools-extra with a debug build:
NonOpt: ninja 19007,02s user 760,01s system 2284% cpu 14:25,23 total
Opt: ninja 18806,18s user 763,33s system 2308% cpu 14:07,74 total
So roughly a 1.02x speedup, or about 0.98 of the previous wall-clock time. Nothing earth-shattering, but in line with what valgrind suggested, and since I already did all the legwork, I might as well send it :)
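For reference, the ratio comes from the two "total" wall-clock readings above (14:25,23 and 14:07,74). A small sketch of the arithmetic, with hypothetical helper names:

```cpp
#include <cassert>

// Convert a "minutes:seconds" wall-clock reading to seconds.
constexpr double toSeconds(int minutes, double seconds) {
  return minutes * 60.0 + seconds;
}

// Totals taken from the two ninja runs quoted above.
constexpr double kNonOptSeconds = toSeconds(14, 25.23); // 865.23 s
constexpr double kOptSeconds = toSeconds(14, 7.74);     // 847.74 s

// Speedup: how many times faster the optimized build ran.
constexpr double speedup() { return kNonOptSeconds / kOptSeconds; }

// Fraction of the original wall-clock time that remains.
constexpr double remainingFraction() { return kOptSeconds / kNonOptSeconds; }
```

This uses wall-clock time only; the user-time figures (19007,02s vs 18806,18s) give a smaller ratio, since they sum CPU time across all parallel jobs.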