castToDeclContext takes 2% of execution time #76824

The function castToDeclContext takes around 2% of the execution time in my test run.

I was profiling clang to see whether I could spot any small mistakes that were costing a significant amount of time on debug builds. Pretty much everything is either too complicated for me to optimize or already very performant, but I noticed that valgrind was reporting an interesting function at the top of the 'auto' category:
![interesting_entry](https://github.com/llvm/llvm-project/assets/25348040/8b7e5d79-d7c7-4c3e-a0de-2ea63058c5aa)

It is being executed 1.8 billion times, and the implementation looks pretty trivial to me:
![code_snippet](https://github.com/llvm/llvm-project/assets/25348040/a9f87c3f-55da-43b8-b028-db79ad61cf20)

Looking at the assembly (the listing below is for `castFromDeclContext`, as its label shows), I expected pretty much a lookup table and an add operation, but the generated code is clearly not ideal:

```asm
clang::Decl::castFromDeclContext(clang::DeclContext const*): # @clang::Decl::castFromDeclContext(clang::DeclContext const*)
.L_ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE$local:
  movzwl 8(%rdi), %edx
  movq %rdi, %rax
  andl $127, %edx
  leal -1(%rdx), %esi
  cmpl $84, %esi
  ja .LBB65_7
  leaq .LJTI65_0(%rip), %rdi
  movq $-40, %rcx
  movslq (%rdi,%rsi,4), %rsi
  addq %rdi, %rsi
  jmpq *%rsi
.LBB65_2:
  addq %rcx, %rax
  retq
.LBB65_4:
  movq $-48, %rcx
  addq %rcx, %rax
  retq
.LBB65_5:
  movq $-64, %rcx
  addq %rcx, %rax
  retq
.LBB65_6:
  movq $-56, %rcx
  addq %rcx, %rax
  retq
.LBB65_7:
  leal -53(%rdx), %esi
  movq $-72, %rcx
  cmpl $6, %esi
  jb .LBB65_2
  addl $-34, %edx
  xorl %ecx, %ecx
  cmpl $5, %edx
  setae %cl
  shll $4, %ecx
  orq $-64, %rcx
  addq %rcx, %rax
  retq
.LJTI65_0:
<JUMP TABLE>
```

The PR I'll submit in a few minutes fixes this by eliminating the need for the `DECL_CONTEXT_BASE` macro (it is only used here and in two other analogous functions) and by reordering the AST decl kinds so that classes inheriting from `DeclContext` come first. I also experimented with hand-rolled offset tables, but that is far from maintainable, even though it manages to compress three lookup tables into one. The resulting assembly is just:

```asm
clang::Decl::castFromDeclContext(clang::DeclContext const*): # @clang::Decl::castFromDeclContext(clang::DeclContext const*)
.L_ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE$local:
  movzwl 8(%rdi), %ecx
  leaq .Lswitch.table._ZN5clang4Decl19castFromDeclContextEPKNS_11DeclContextE(%rip), %rdx
  movq %rdi, %rax
  andl $127, %ecx
  addq (%rdx,%rcx,8), %rax
  retq
```
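For readers who want to see the shape of the fix without the clang sources at hand, here is a minimal sketch. Everything in it (`Kind`, `Decl`, `Context`, the node classes, `castFromContext`, `roundTrips`, `selfCheck`) is a hypothetical stand-in, not clang's actual code. The assumption it illustrates is that when the kinds with a context base occupy a dense range, a plain `switch` of `static_cast`s leaves the optimizer free to lower the function to one table load plus an add; whether it actually does so depends on the compiler and flags.

```cpp
#include <cassert>

// Hypothetical kinds. Those whose classes have a Context base come first,
// so the switch below covers the dense range 0..2 with no gaps.
enum Kind : unsigned char { FunctionK, RecordK, NamespaceK, LabelK };

struct Decl {
  Kind kind;
  explicit Decl(Kind k) : kind(k) {}
};

struct Context {
  Kind contextKind;  // copy of the owning node's kind
  explicit Context(Kind k) : contextKind(k) {}
};

// Extra bases so the Context subobject typically sits at a different
// offset in each node class (the exact layout is up to the ABI).
struct Named : Decl {
  long name = 0;
  explicit Named(Kind k) : Decl(k) {}
};
struct Redeclarable { void *previous = nullptr; };

struct FunctionNode : Decl, Context {
  FunctionNode() : Decl(FunctionK), Context(FunctionK) {}
};
struct RecordNode : Named, Context {
  RecordNode() : Named(RecordK), Context(RecordK) {}
};
struct NamespaceNode : Named, Redeclarable, Context {
  NamespaceNode() : Named(NamespaceK), Context(NamespaceK) {}
};

// Each case only differs in the constant pointer adjustment that the
// static_cast applies, which is what makes a table lowering possible.
Decl *castFromContext(Context *ctx) {
  switch (ctx->contextKind) {
  case FunctionK:  return static_cast<FunctionNode *>(ctx);
  case RecordK:    return static_cast<RecordNode *>(ctx);
  case NamespaceK: return static_cast<NamespaceNode *>(ctx);
  default:         return nullptr;  // kinds without a Context base
  }
}

// Decl -> Context -> Decl must give back the original Decl pointer.
template <class Node> bool roundTrips(Node &node) {
  Context *ctx = &node;  // implicit upcast adjusts the pointer
  Decl *decl = &node;
  return castFromContext(ctx) == decl;
}

void selfCheck() {
  FunctionNode f;
  RecordNode r;
  NamespaceNode n;
  assert(roundTrips(f) && roundTrips(r) && roundTrips(n));
}
```

Calling `selfCheck()` from a `main` exercises all three casts. Keeping the per-kind adjustment in `static_cast`s means the compiler computes the offsets, so nothing has to be maintained by hand when a class layout changes.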
  
And the build time difference for clang+clang-tools-extra with a debug build:

- NonOpt: `ninja  19007,02s user 760,01s system 2284% cpu 14:25,23 total`
- Opt: `ninja  18806,18s user 763,33s system 2308% cpu 14:07,74 total`

So that is a speedup of roughly 1.02x in wall-clock time (the optimized build takes about 0.98 of the previous time). Nothing earth-shattering, but in line with what valgrind suggested, and since I had already done all the legwork, I might as well send it :)
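The ratios can be re-derived from the two `ninja` timings. A small sketch follows; the function and constant names are made up for illustration, and the only inputs are the figures quoted above, with the decimal commas read as decimal points.

```cpp
#include <cassert>
#include <cmath>

// Convert a "minutes:seconds" wall-clock reading to seconds.
constexpr double toSeconds(int minutes, double seconds) {
  return minutes * 60.0 + seconds;
}

// Timings copied from the two ninja runs quoted in the issue.
constexpr double kBaselineWall = toSeconds(14, 25.23);  // 865.23 s
constexpr double kPatchedWall = toSeconds(14, 7.74);    // 847.74 s
constexpr double kBaselineUser = 19007.02;              // user CPU seconds
constexpr double kPatchedUser = 18806.18;               // user CPU seconds

// ratio(before, after) > 1 means "after" is faster than "before".
constexpr double ratio(double before, double after) {
  return before / after;
}
```

With these inputs the wall-clock ratio comes out at about 1.021 (its inverse at about 0.980), and the user CPU ratio at about 1.011.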
