
castToDeclContext takes 2% of execution time  #76824

Closed
@Destroyerrrocket

Description


The function castToDeclContext takes around 2% of the execution time in my test run.

I was profiling clang to see if I could spot any small mistakes that take a significant amount of time in debug builds. While pretty much everything is either too complicated for me to optimize or already very performant, valgrind reported an interesting function at the top of the 'auto' category:
[valgrind output: the entry in question]

It is being executed 1.8 billion times, and the implementation looks pretty trivial to me:
[code snippet: the implementation]
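
For context, the body is (roughly) a macro-generated switch over the DeclKind stored in the DeclContext: each case static_casts the pointer back to the concrete declaration type, and each concrete type places its DeclContext subobject at a different offset. A minimal self-contained sketch of that pattern, with hypothetical FooDecl/BarDecl classes standing in for the real hierarchy (this is not the actual Clang source), looks like this:

// Sketch only: switch on the kind tag stored in the DeclContext subobject,
// then static_cast to the concrete type. Each concrete type needs a different
// pointer adjustment; with many cases, as in the real code, this becomes a
// jump table like the one shown below.
#include <cstdio>

enum class Kind { Foo, Bar };

struct DeclContext {
  Kind DeclKind; // the real DeclContext keeps the kind in a bit-field
  explicit DeclContext(Kind K) : DeclKind(K) {}
};

struct Decl {
  int DeclBits = 0;
  static Decl *castFromDeclContext(const DeclContext *DC);
};

// A larger intermediate base pushes DeclContext further into the object,
// mimicking NamedDecl/TagDecl/etc. in the real hierarchy.
struct NamedDecl : Decl {
  void *Name = nullptr;
};

struct FooDecl : Decl, DeclContext {
  FooDecl() : DeclContext(Kind::Foo) {}
};

struct BarDecl : NamedDecl, DeclContext {
  BarDecl() : DeclContext(Kind::Bar) {}
};

Decl *Decl::castFromDeclContext(const DeclContext *DC) {
  switch (DC->DeclKind) {
  case Kind::Foo:
    return static_cast<FooDecl *>(const_cast<DeclContext *>(DC));
  case Kind::Bar:
    return static_cast<BarDecl *>(const_cast<DeclContext *>(DC));
  }
  return nullptr;
}

int main() {
  FooDecl F;
  BarDecl B;
  // Both calls recover the original Decl despite applying different adjustments.
  std::printf("%p %p\n", static_cast<void *>(Decl::castFromDeclContext(&F)),
              static_cast<void *>(Decl::castFromDeclContext(&B)));
}

Because the per-kind adjustments differ (and the kinds sharing an adjustment are not laid out contiguously), the compiler ends up emitting the jump table shown below.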

Looking at the assembly, I expected little more than a lookup table and an add, but the resulting code is clearly not ideal: a range check, an indirect jump through a jump table, and several separate epilogues that each apply a different offset:

clang::Decl::castFromDeclContext(clang::DeclContext const*): # @clang::Decl::castFromDeclContext(clang::DeclContext const*)
.L_ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE$local:
  movzwl 8(%rdi), %edx                 # load the 16-bit word holding DeclKind
  movq %rdi, %rax
  andl $127, %edx                      # extract the 7-bit DeclKind
  leal -1(%rdx), %esi
  cmpl $84, %esi
  ja .LBB65_7                          # kinds not covered by the jump table
  leaq .LJTI65_0(%rip), %rdi
  movq $-40, %rcx
  movslq (%rdi,%rsi,4), %rsi
  addq %rdi, %rsi
  jmpq *%rsi                           # indirect jump through the jump table
.LBB65_2:
  addq %rcx, %rax
  retq
.LBB65_4:
  movq $-48, %rcx
  addq %rcx, %rax
  retq
.LBB65_5:
  movq $-64, %rcx
  addq %rcx, %rax
  retq
.LBB65_6:
  movq $-56, %rcx
  addq %rcx, %rax
  retq
.LBB65_7:
  leal -53(%rdx), %esi
  movq $-72, %rcx
  cmpl $6, %esi
  jb .LBB65_2
  addl $-34, %edx
  xorl %ecx, %ecx
  cmpl $5, %edx
  setae %cl
  shll $4, %ecx
  orq $-64, %rcx
  addq %rcx, %rax
  retq
.LJTI65_0:
<JUMP TABLE>

The PR I'll submit in a few minutes fixes this by eliminating the need for the DECL_CONTEXT_BASE macro (it is only used here and in two other analogous functions) and by reordering the AST decl order so that classes that inherit from DeclContext come first. I also experimented with hand-rolled offset tables, but that approach is far from maintainable, even though it compresses the three lookup tables into one. The resulting assembly is just:

clang::Decl::castFromDeclContext(clang::DeclContext const*): # @clang::Decl::castFromDeclContext(clang::DeclContext const*)
.L_ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE$local:
  movzwl 8(%rdi), %ecx                 # load the 16-bit word holding DeclKind
  leaq .Lswitch.table._ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE(%rip), %rdx
  movq %rdi, %rax
  andl $127, %ecx                      # extract the 7-bit DeclKind
  addq (%rdx,%rcx,8), %rax             # single offset-table load plus one add
  retq
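
For intuition (my own illustration, not the actual patch): once every reachable case boils down to "the incoming pointer plus a compile-time constant", LLVM's switch-to-lookup-table transform can replace the jump table with a single array of offsets, which is what the .Lswitch.table symbol above is. A minimal standalone example of the shape that enables this, with hypothetical kinds and offsets:

// Every case adds a constant to the same pointer; with enough cases, clang at
// -O2 can lower the switch to a load from a constant offset table plus one add.
enum Kind : unsigned { K0, K1, K2, K3, K4 };

char *adjust(char *P, Kind K) {
  switch (K) {
  case K0: return P + 8;
  case K1: return P + 16;
  case K2: return P + 16;
  case K3: return P + 24;
  case K4: return P + 32;
  }
  return P; // unreachable for valid kinds
}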

And the difference when doing a debug build of clang+clang-tools-extra:
NonOpt: ninja 19007,02s user 760,01s system 2284% cpu 14:25,23 total
Opt: ninja 18806,18s user 763,33s system 2308% cpu 14:07,74 total

So that's roughly a 1.02x speedup, i.e. about 0.98x of the previous build time (total wall time went from 14:25,23 to 14:07,74). Nothing earth-shattering, but about what valgrind's ~2% suggested, and since I already did all the legwork, I might as well send it :)


Labels: clang:frontend, performance
