Tracking GlobalAllocator #9309
Comments
Note that we do not shoot for zero overhead here -- maintaining the linked list requires extra time and space. The idea is that one would use this allocator only when investigating a specific perf problem, by hiding it behind a […]. That being said, the following are nice to have:
Minimizing the tag overhead could be done by using a separate chunk of memory for each tag, I think. If it is fine to round all allocation sizes up to the nearest size class of the inner allocator (which would also account for the inner allocator's overhead due to internal fragmentation), then I don't think any memory overhead is necessary -- just a way to query the data structures used by the inner allocator, which could be done by forking an existing Rust allocator.
If I understand this issue right, the only reason to tag every allocation with extra metadata is to know the category / tag of that memory on deallocation. The whole purpose of this is to have memory stats per tag. Rather than allocating extra memory and storing the tag inline, why not either give each tag its own region of the address space (method 1), or store the tag as part of the heap pointer itself (method 2)?
I think both of these methods would keep the performance / behavior of the code closer to the uninstrumented case.
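A minimal sketch of what method 2 could look like, assuming the upper 16 bits of a 64-bit user-space pointer are free to carry the tag (the helper names are made up for illustration; as noted further down, the tag has to be stripped before the pointer is dereferenced or handed to the kernel):

```rust
// Illustrative only: smuggle a small tag into the otherwise unused upper
// bits of a 64-bit user-space pointer.
const TAG_SHIFT: u32 = 48;                           // bits 48..=63 carry the tag
const ADDR_MASK: usize = (1usize << TAG_SHIFT) - 1;  // low 48 bits hold the address

fn tag_ptr(ptr: *mut u8, tag: u16) -> usize {
    // Assumes the upper bits of a user-space pointer are already zero.
    (ptr as usize) | ((tag as usize) << TAG_SHIFT)
}

fn untag_ptr(tagged: usize) -> (*mut u8, u16) {
    let tag = (tagged >> TAG_SHIFT) as u16;
    // The address must be restored to its canonical form before use.
    ((tagged & ADDR_MASK) as *mut u8, tag)
}
```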
Performance of method 2 would probably be better. Method 1 will probably thrash the TLB more than usual code does (because more page tables are needed).
Genius!
Update: method 2 will probably not work for some things. If e.g. a buffer with a tagged pointer is passed to the kernel in a syscall, the pointer is also checked by the kernel, which means it becomes impossible to use the tags for anything that talks to the kernel. I liked method 1 better anyway ^^
Seems like jemalloc already uses arenas internally to improve heavily multi-threaded code. It might be possible to create them manually and control their place in the address space. I think I will experiment with this for a little while, and maybe try to integrate something into ra if it works out. This will of course defeat the actual purpose of the arenas then ^^.
x86_64 requires that pointers are canonicalized using sign extension. If you try to use a pointer with the upper 16 bits changed, you are guaranteed to get a page fault.
This option would work though, but only on 64-bit systems; 32-bit systems only have 2 GB worth of allocatable address space.
@bjorn3 thanks! Did not know that about x86_64. That is probably also the reason why Linux uses only very limited pointer tagging in the kernel itself.
I implemented the original idea (not one of my crazy optimization ideas) a while back, and kind of forgot about it. It is working well already, but I cannot really say that it gives me much insight into rust-analyzer's memory usage. If you have any idea where I should place the instrumentation calls to get started, tell me. Currently, I have just wrapped the entry points to the parser. It might still be some work to add sensible instrumentation calls throughout the whole of rust-analyzer.
If you are interested, the most problematic thing was the stack for the tags ^^. Turns out you can't use e.g. a `Vec` for that, because it obviously calls the `GlobalAlloc` as well, which results in endless recursion.
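A minimal sketch of an allocation-free alternative, assuming a fixed-capacity array in thread-local storage (the names and the depth limit are made up for illustration):

```rust
use std::cell::RefCell;

// Fixed capacity: pushing or popping a tag never touches the heap, so the
// GlobalAlloc that consults this stack cannot recurse into itself.
const MAX_DEPTH: usize = 64;

thread_local! {
    static TAG_STACK: RefCell<([&'static str; MAX_DEPTH], usize)> =
        RefCell::new(([""; MAX_DEPTH], 0));
}

pub fn push_tag(tag: &'static str) {
    TAG_STACK.with(|s| {
        let mut s = s.borrow_mut();
        let depth = s.1;
        if depth < MAX_DEPTH {
            s.0[depth] = tag;
        }
        // Deeper levels are still counted so pushes and pops stay balanced,
        // even though their tags are not stored.
        s.1 = depth + 1;
    });
}

pub fn pop_tag() {
    TAG_STACK.with(|s| {
        let mut s = s.borrow_mut();
        s.1 = s.1.saturating_sub(1);
    });
}

pub fn current_tag() -> Option<&'static str> {
    TAG_STACK.with(|s| {
        let s = s.borrow();
        match s.1 {
            0 => None,
            depth => Some(s.0[depth.min(MAX_DEPTH) - 1]),
        }
    })
}
```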
Plugging the instrumentation in here should do the trick: we already use profile in most of the interesting places of ra, so piggybacking on the ProfileSpan should give you enough coverage.
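One way the piggybacking could look, sketched under the assumption that a span-style guard simply calls the hypothetical `push_tag`/`pop_tag` helpers from the tag-stack sketch above (this is not the actual rust-analyzer profile API):

```rust
/// Illustrative guard: pushes a heap tag on construction and pops it on drop,
/// mirroring how a profiling span guard is used.
pub struct HeapTagGuard;

pub fn heap_tag(label: &'static str) -> HeapTagGuard {
    push_tag(label);
    HeapTagGuard
}

impl Drop for HeapTagGuard {
    fn drop(&mut self) {
        pop_tag();
    }
}

// Usage mirrors a profiling span: every allocation made while `_t` is alive
// is attributed to "parse".
// let _t = heap_tag("parse");
```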
As described in https://rust-analyzer.github.io/blog/2020/12/04/measuring-memory-usage-in-rust.html, memory profiling long-running applications in Rust is hard, because it's impossible to pause the program and "parse the heap" -- inspect all the currently allocated blocks and their owners.
I think it's possible to recover heap parseability by writing a special global allocator. Specifically, this allocator will maintain an intrusive doubly linked list of currently allocated heap blocks, and will provide an API to iterate the blocks. Internally, it'll just wrap some other allocator, and will over-allocate each requested block to make room for the metadata (including the list pointers) at the start of the block.
Additionally, the allocator will maintain a thread-local stack of active "heap tags", and it will tag each block with the topmost tag. Having tags will allow the user to classify the allocated blocks.
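A rough sketch of the per-block metadata this implies (field names are my own, not part of the proposal): each request is grown by the header size, the header is linked into the global list, and the caller receives a pointer just past it.

```rust
use std::alloc::Layout;

#[repr(C)]
struct BlockHeader {
    prev: *mut BlockHeader, // intrusive doubly linked list of live blocks
    next: *mut BlockHeader,
    size: usize,            // size the caller asked for
    tag: &'static str,      // topmost entry of the thread-local tag stack
}

const HEADER_SIZE: usize = std::mem::size_of::<BlockHeader>();

// Grow the requested layout so the header fits in front of the user data.
// Alignment handling is elided: a real version must keep the user data
// correctly aligned after the header.
fn outer_layout(user: Layout) -> Layout {
    let align = user.align().max(std::mem::align_of::<BlockHeader>());
    Layout::from_size_align(user.size() + HEADER_SIZE, align).expect("layout overflow")
}
```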
Draft API
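A possible shape for that API, sketched from the description above; all names are illustrative and none of this is the actual proposed interface:

```rust
/// Guard returned by `tag`; the real thing would pop the thread-local tag
/// stack when dropped.
pub struct TagGuard;

impl Drop for TagGuard {
    fn drop(&mut self) {
        // pop the thread-local tag stack here
    }
}

pub trait HeapInspect {
    /// Push `tag` onto the current thread's tag stack; it is popped when the
    /// returned guard is dropped. Blocks allocated in between carry `tag`.
    fn tag(&self, tag: &'static str) -> TagGuard;

    /// Walk all currently live blocks, reporting each block's tag and size.
    fn each_live_block(&self, f: &mut dyn FnMut(&'static str, usize));

    /// Total bytes currently attributed to `tag`.
    fn bytes_for(&self, tag: &'static str) -> usize {
        let mut total = 0;
        self.each_live_block(&mut |t, size| {
            if t == tag {
                total += size;
            }
        });
        total
    }
}
```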
The user (rust-analyzer) will then do something like this:
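A minimal sketch of such usage, assuming the hypothetical `HeapInspect` API from above (`build_item_trees` just stands in for real rust-analyzer code):

```rust
fn build_item_trees(alloc: &impl HeapInspect) -> Vec<String> {
    // Everything allocated while `_t` is alive is attributed to "ItemTree".
    let _t = alloc.tag("ItemTree");
    (0..100).map(|i| format!("item tree {i}")).collect()
}

fn report(alloc: &impl HeapInspect) {
    // Can be called at any point while the program is still running.
    println!("ItemTree: {} bytes", alloc.bytes_for("ItemTree"));
}
```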
That is, the user can answer questions like "how much memory is occupied by all ItemTrees?" and "how much memory is occupied by the ItemTrees created while parsing?"