
Tracking GlobalAllocator #9309

Open · matklad opened this issue Jun 17, 2021 · 12 comments
Labels: E-hard, fun (A technically challenging issue with high impact)

matklad commented Jun 17, 2021

As described in https://rust-analyzer.github.io/blog/2020/12/04/measuring-memory-usage-in-rust.html, memory profiling of long-running applications in Rust is hard, because it's impossible to pause the program and "parse the heap" -- inspect all the currently allocated blocks and their owners.

I think it's possible to recover heap parseability by writing a special global allocator. Specifically, this allocator will maintain an intrusive doubly linked list of currently allocated heap blocks, and will provide an API to iterate the blocks. Internally, it'll just wrap some other allocator, and will over-allocate each requested block to stash the metadata, including the list pointers, at the start of the block.

Additionally, the allocator will maintain a thread-local stack of active "heap tags", and it will tag each block with the topmost tag. Having tags will allow the user to classify the allocated blocks.

Draft API

pub struct TrackAlloc<A: GlobalAlloc> {
    a: A,
    first_allocated_block: Option<*const ()>,
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for TrackAlloc<A> { ... }

thread_local! {
    static TAGS: RefCell<Vec<AllocTag>> = RefCell::new(Vec::new());
}

pub struct AllocTag(pub NonZeroUsize);

impl AllocTag {
    pub fn push(self) -> impl Drop { ... }
}

pub struct MemoryBlock { ... }

impl MemoryBlock {
    pub fn for_each_allocated(f: impl FnMut(&MemoryBlock)) { ... }
    pub fn size(&self) -> usize { ... }
    pub fn tag(&self) -> Option<AllocTag> { ... }
}
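
To illustrate the over-allocation trick, here is a simplified, compilable sketch: it wraps the inner allocator, prepends a small header to every block, and tracks only a live-bytes counter. The real implementation would also store the prev/next pointers and the current tag in the header and link it into the intrusive list.

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Per-block header stored in the over-allocated prefix.
#[repr(C)]
struct Header {
    size: usize,
}

pub struct TrackAlloc<A> {
    inner: A,
    live_bytes: AtomicUsize,
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for TrackAlloc<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Prepend the header while preserving the caller's alignment.
        let (full, offset) = match Layout::new::<Header>().extend(layout) {
            Ok(v) => v,
            Err(_) => return std::ptr::null_mut(),
        };
        let raw = self.inner.alloc(full.pad_to_align());
        if raw.is_null() {
            return raw;
        }
        raw.cast::<Header>().write(Header { size: layout.size() });
        self.live_bytes.fetch_add(layout.size(), Ordering::Relaxed);
        // Hand the caller the address just past the header.
        raw.add(offset)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Recompute the same extended layout to find the real block start.
        let (full, offset) = Layout::new::<Header>().extend(layout).unwrap();
        self.live_bytes.fetch_sub(layout.size(), Ordering::Relaxed);
        self.inner.dealloc(ptr.sub(offset), full.pad_to_align());
    }
}

#[global_allocator]
static ALLOC: TrackAlloc<System> =
    TrackAlloc { inner: System, live_bytes: AtomicUsize::new(0) };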

The user (rust-analyzer) will then do something like this:

pub fn item_tree_query(file_id: FileId) -> Arc<ItemTree> {
    let _a = AllocTag(NonZeroUsize::new("ItemTree".as_ptr() as usize).unwrap()).push();
    // allocate stuff
}

pub fn parse_query(file_id: FileId) -> ParseTree {
    let _a = AllocTag(NonZeroUsize::new("ParseTree".as_ptr() as usize).unwrap()).push();
    // allocate stuff
}

That is, the user can answer questions like "how much memory is occupied by all ItemTrees?" and "how much memory is occupied by all ParseTrees?".
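
Concretely, answering those questions is a single pass over the list (a sketch; it assumes AllocTag additionally derives PartialEq, Eq, and Hash):

use std::collections::HashMap;

// Sum live bytes per tag by walking the intrusive list of blocks.
fn heap_usage_by_tag() -> HashMap<Option<AllocTag>, usize> {
    let mut totals = HashMap::new();
    MemoryBlock::for_each_allocated(|block| {
        *totals.entry(block.tag()).or_insert(0) += block.size();
    });
    totals
}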

matklad commented Jun 17, 2021

Note that we do not shoot for zero overhead here -- maintaining the linked list requires extra time and space. The idea is that one would use this allocator only when investigating a specific perf problem, hiding it behind a cfg flag.

That being said, the following are nice to have:

  • minimize slow-down and memory overhead (can we use an XOR linked list here?)
  • make it possible to run this "in production", by adding some kind of env variable or some such to toggle the tracking on and off (a sketch follows below)
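
For the second point, something like this could work (the RA_TRACK_HEAP variable name is made up): read the environment once at startup into an atomic flag, so the allocator's hot path only pays for a relaxed load.

use std::sync::atomic::{AtomicBool, Ordering};

static TRACKING: AtomicBool = AtomicBool::new(false);

// Call once, early in main, before the first tag is pushed.
pub fn init_tracking_from_env() {
    if std::env::var_os("RA_TRACK_HEAP").is_some() {
        TRACKING.store(true, Ordering::Relaxed);
    }
}

#[inline]
pub fn tracking_enabled() -> bool {
    TRACKING.load(Ordering::Relaxed)
}

Note that flipping the flag mid-run would be subtle: dealloc has to know whether a given block was allocated with or without the tracking header, so reading the flag only once at startup is the safe variant.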

bjorn3 commented Jun 17, 2021

Minimizing the tag overhead could be done by using a separate chunk of memory for each tag, I think. If it is fine to round all allocation sizes up to the nearest size class of the inner allocator (which would also account for the inner allocator's overhead due to internal fragmentation), then I don't think any memory overhead is necessary -- just a way to query the data structures used by the inner allocator, which could be done by forking an existing Rust allocator.

ruabmbua commented Sep 6, 2021

If I understand this issue right, the only reason to tag every allocation with some extra metadata is to be able to know the category / tag of the memory on deallocation. The whole purpose of this is to have memory stats per tag.

Rather than allocating extra memory and storing the tag inline, why not store the tag as part of the heap pointers themselves? Devs usually run 64-bit systems like x86_64, where we have massive 48-bit virtual address spaces. That would give us two ways to store the tag information in the pointer:

  1. Build / modify an allocator to use several distinct memory mappings in the virtual address space, each containing its own heap. Each tag would essentially only claim memory from a very specific section of the virtual address space, so pointers can be quickly classified in dealloc. Massive virtual address spaces, and mmap with MAP_ANONYMOUS | MAP_PRIVATE, give us the necessary API to control the address space, at least on Linux.

  2. Just use the few bits in pointers which are not touched by address translation -- I think 16 bits on x86_64. This would be somewhat architecture-specific but simpler. I have no idea if this can cause undefined behavior, but it should be fine, since the tagging would be done inside the GlobalAlloc implementation.

I think both of these methods would keep the performance / behavior of the code closer to the uninstrumented use case. (A sketch of method 1 follows below.)
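
A sketch of method 1's address-space side (assumptions: Linux, the libc crate, and one fixed-size reservation per tag; a real allocator would carve individual allocations out of each region):

use std::ptr;

const REGION_SIZE: usize = 1 << 40; // 1 TiB of address space per tag

struct TagRegion {
    base: *mut u8,
}

impl TagRegion {
    // Reserve address space only; physical pages are committed lazily on
    // first touch, so an empty region costs almost nothing.
    fn reserve() -> TagRegion {
        let base = unsafe {
            libc::mmap(
                ptr::null_mut(),
                REGION_SIZE,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_ANONYMOUS | libc::MAP_PRIVATE | libc::MAP_NORESERVE,
                -1,
                0,
            )
        };
        assert_ne!(base, libc::MAP_FAILED);
        TagRegion { base: base as *mut u8 }
    }

    // Classification in dealloc: does this pointer belong to the tag's region?
    fn contains(&self, p: *const u8) -> bool {
        let base = self.base as usize;
        (base..base + REGION_SIZE).contains(&(p as usize))
    }
}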

ruabmbua commented Sep 6, 2021

Performance of method 2 would probably be better. Method 1 will probably thrash the TLB more than usual code does (because more page tables are needed).

matklad commented Sep 6, 2021

Genius!

ruabmbua commented Sep 6, 2021

Update: method 2 will probably not work for some things. If e.g. a buffer with a tagged pointer is passed to the kernel in a syscall, the pointer is also checked by the kernel, which means it becomes impossible to use the tags for anything that talks to the kernel.

I liked method 1 better anyway ^^

ruabmbua commented Sep 6, 2021

Seems like jemalloc already uses arenas internally to improve heavily multi-threaded code. It might be possible to create them manually and control their place in the address space. I think I will experiment with this for a little while, and maybe try to integrate something into ra if it works out.

This will of course defeat the actual purpose of the arenas then ^^.
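
Something like this could be the starting point (a sketch, assuming the tikv-jemalloc-sys bindings; the flag value mirrors jemalloc's MALLOCX_ARENA macro, and error handling is minimal):

use std::ptr;

use tikv_jemalloc_sys::{mallctl, mallocx};

// Create a fresh jemalloc arena and return its index.
unsafe fn create_arena() -> u32 {
    let mut arena: u32 = 0;
    let mut len = std::mem::size_of::<u32>();
    let rc = mallctl(
        b"arenas.create\0".as_ptr().cast(),
        (&mut arena as *mut u32).cast(),
        &mut len,
        ptr::null_mut(),
        0,
    );
    assert_eq!(rc, 0);
    arena
}

// Allocate from a specific arena -- one arena per tag.
unsafe fn alloc_in_arena(size: usize, arena: u32) -> *mut u8 {
    // ((a) + 1) << 20 is jemalloc's MALLOCX_ARENA(a) encoding.
    mallocx(size, ((arena + 1) << 20) as i32).cast()
}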

bjorn3 commented Sep 6, 2021

> Just use the few bits in pointers which are not touched by address translation -- I think 16 bits on x86_64. This would be somewhat architecture-specific but simpler. I have no idea if this can cause undefined behavior, but it should be fine, since the tagging would be done inside the GlobalAlloc implementation.

x86_64 requires that pointers are canonical: the upper 16 bits must be a sign extension of bit 47. If you try to use a pointer with the upper 16 bits changed, you are guaranteed to get a general-protection fault.

> Build / modify an allocator to use several distinct memory mappings in the virtual address space, each containing its own heap. Each tag would essentially only claim memory from a very specific section of the virtual address space, so pointers can be quickly classified in dealloc. Massive virtual address spaces, and mmap with MAP_ANONYMOUS | MAP_PRIVATE, give us the necessary API to control the address space, at least on Linux.

This option would work, though only on 64-bit systems; 32-bit systems only have 2 GB worth of allocatable address space.

ruabmbua commented Sep 6, 2021

@bjorn3 thanks! I did not know that about x86_64. That is probably also the reason why Linux uses only very limited pointer tagging in the kernel itself.

ruabmbua commented

I implemented the original idea (not one of my crazy optimization ideas) a while back, and kind of forgot.

It is working well already, but I can not really say that it gives me much insight into rust-analyzer's memory usage. If you have any ideas where I should place the instrumentation calls to get started, tell me. Currently, I just wrapped the entry points to the parser. It might still be some work to add sensible instrumentation calls throughout the entire rust-analyzer.

ruabmbua commented

If you are interested, the most problematic thing was the stack for the tags ^^. Turns out you can't use e.g. a Vec for that, because growing it obviously calls the GlobalAlloc as well, which results in endless recursion.
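
One way around that is a fixed-capacity stack stored directly in a thread-local, so pushing a tag never touches the heap (a sketch; the capacity of 64 is an arbitrary assumption):

use std::cell::RefCell;

const MAX_DEPTH: usize = 64; // assumption: tag scopes don't nest deeper

struct TagStack {
    tags: [usize; MAX_DEPTH],
    len: usize,
}

thread_local! {
    static TAGS: RefCell<TagStack> =
        RefCell::new(TagStack { tags: [0; MAX_DEPTH], len: 0 });
}

// Pushing never allocates: the storage is a plain array in the thread-local.
fn push_tag(tag: usize) {
    TAGS.with(|s| {
        let mut s = s.borrow_mut();
        assert!(s.len < MAX_DEPTH, "tag stack overflow");
        let i = s.len;
        s.tags[i] = tag;
        s.len += 1;
    });
}

fn pop_tag() {
    TAGS.with(|s| {
        let mut s = s.borrow_mut();
        assert!(s.len > 0, "tag stack underflow");
        s.len -= 1;
    });
}

// The allocator reads the topmost tag, if any.
fn current_tag() -> Option<usize> {
    TAGS.with(|s| {
        let s = s.borrow();
        s.len.checked_sub(1).map(|i| s.tags[i])
    })
}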

matklad commented Oct 23, 2021

Plugging instrumentation here:

https://github.com/rust-analyzer/rust-analyzer/blob/a75353e8acc63d660b2bcb2e70f496ddc0567789/crates/profile/src/hprof.rs#L61

should do the trick. We already use profile in most of the interesting places of ra, so piggybacking on the ProfileSpan should give you enough coverage.
