
Tracking GlobalAllocator #9309

Open · matklad opened this issue Jun 17, 2021 · 12 comments
Labels: E-hard, fun (A technically challenging issue with high impact)

matklad commented Jun 17, 2021

As described in https://rust-analyzer.github.io/blog/2020/12/04/measuring-memory-usage-in-rust.html, memory profiling of long-running applications in Rust is hard, because it's impossible to pause the program and "parse the heap" -- inspect all the currently allocated blocks and their owners.

I think it's possible to recover heap parseability by writing a special global allocator. Specifically, this allocator will maintain an intrusive doubly linked list of currently allocated heap blocks, and will provide an API to iterate the blocks. Internally, it'll just wrap some other allocator, and will over-allocate each requested block to stash the metadata, including the list pointers, at the start of the block.

Additionally, the allocator will maintain a thread-local stack of active "heap tags", and it will tag each block with the topmost tag. Having tags will allow the user to classify the allocated blocks.

Draft API

pub struct TrackAlloc<A: GlobalAlloc> {
    a: A,
    first_allocated_block: Option<*const ()>,
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for TrackAlloc<A> { ... }

thread_local! {
    static TAGS: RefCell<Vec<AllocTag>> = RefCell::new(Vec::new());
}

pub struct AllocTag(pub NonZeroUsize);

impl AllocTag {
    pub fn push(self) -> impl Drop { ... }
}

pub struct MemoryBlock { ... }

impl MemoryBlock {
    pub fn for_each_allocated(f: impl FnMut(&MemoryBlock)) { ... }
    pub fn size(&self) -> usize { ... }
    pub fn tag(&self) -> Option<AllocTag> { ... }
}
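
To illustrate the over-allocation trick, here is a simplified, compilable sketch: it wraps the inner allocator, prepends a small header to every block, and tracks only a live-bytes counter. The real implementation would also store the prev/next pointers and the current tag in the header and link it into the intrusive list.

use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Per-block header stored in the over-allocated prefix.
#[repr(C)]
struct Header {
    size: usize,
}

pub struct TrackAlloc<A> {
    inner: A,
    live_bytes: AtomicUsize,
}

unsafe impl<A: GlobalAlloc> GlobalAlloc for TrackAlloc<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Prepend the header while preserving the caller's alignment.
        let (full, offset) = match Layout::new::<Header>().extend(layout) {
            Ok(v) => v,
            Err(_) => return std::ptr::null_mut(),
        };
        let raw = self.inner.alloc(full.pad_to_align());
        if raw.is_null() {
            return raw;
        }
        raw.cast::<Header>().write(Header { size: layout.size() });
        self.live_bytes.fetch_add(layout.size(), Ordering::Relaxed);
        // Hand the caller the address just past the header.
        raw.add(offset)
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Recompute the same extended layout to find the real block start.
        let (full, offset) = Layout::new::<Header>().extend(layout).unwrap();
        self.live_bytes.fetch_sub(layout.size(), Ordering::Relaxed);
        self.inner.dealloc(ptr.sub(offset), full.pad_to_align());
    }
}

#[global_allocator]
static ALLOC: TrackAlloc<System> =
    TrackAlloc { inner: System, live_bytes: AtomicUsize::new(0) };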

The user (rust-analyzer) will then do something like this:

pub fn item_tree_query(file_id: FileId) -> Arc<ItemTree> {
    let _a = AllocTag(NonZeroUsize::new("ItemTree".as_ptr() as usize).unwrap()).push();
    // allocate stuff
}

pub fn parse_query(file_id: FileId) -> ParseTree {
    let _a = AllocTag(NonZeroUsize::new("ParseTree".as_ptr() as usize).unwrap()).push();
    // allocate stuff
}

That is, the user can answer questions like "how much memory is occupied by all ItemTrees?" and "how much memory is occupied by all ParseTrees?".
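
Concretely, answering those questions is a single pass over the list (a sketch; it assumes AllocTag additionally derives PartialEq, Eq, and Hash):

use std::collections::HashMap;

// Sum live bytes per tag by walking the intrusive list of blocks.
fn heap_usage_by_tag() -> HashMap<Option<AllocTag>, usize> {
    let mut totals = HashMap::new();
    MemoryBlock::for_each_allocated(|block| {
        *totals.entry(block.tag()).or_insert(0) += block.size();
    });
    totals
}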

matklad commented Jun 17, 2021

Note that we do not shoot for zero overhead here -- maintaining the linked list requires extra time and space. The idea is that one would use this allocator only when investigating a specific perf problem, hiding it behind a cfg flag.

That being said, the following are nice to have:

  • minimize slow-down and memory overhead (can we use an XOR linked list here?)
  • make it possible to run this "in production", by adding some kind of env variable or some such to toggle the tracking on and off (a sketch follows below)
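
For the second point, something like this could work (the RA_TRACK_HEAP variable name is made up): read the environment once at startup into an atomic flag, so the allocator's hot path only pays for a relaxed load.

use std::sync::atomic::{AtomicBool, Ordering};

static TRACKING: AtomicBool = AtomicBool::new(false);

// Call once, early in main, before the first tag is pushed.
pub fn init_tracking_from_env() {
    if std::env::var_os("RA_TRACK_HEAP").is_some() {
        TRACKING.store(true, Ordering::Relaxed);
    }
}

#[inline]
pub fn tracking_enabled() -> bool {
    TRACKING.load(Ordering::Relaxed)
}

Note that flipping the flag mid-run would be subtle: dealloc has to know whether a given block was allocated with or without the tracking header, so reading the flag only once at startup is the safe variant.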

bjorn3 commented Jun 17, 2021

Minimizing the tag overhead could be done by using a separate chunk of memory for each tag, I think. If it is fine to round all allocation sizes up to the nearest size class of the inner allocator (which would also account for the inner allocator's overhead due to internal fragmentation), then I don't think any memory overhead is necessary -- just a way to query the data structures used by the inner allocator, which could be done by forking an existing Rust allocator.

ruabmbua commented Sep 6, 2021

If I understand this issue right, the only reason to tag every allocation with some extra metadata is to be able to know the category / tag of the memory on deallocation. The whole purpose of this is to have memory stats per tag.

Rather than allocating extra memory and storing the tag inline, why not store the tag as part of the heap pointers themselves? Devs usually run 64-bit systems like x86_64, where we have massive 48-bit virtual address spaces. That would give us two ways to store the tag information in the pointer:

  1. Build / modify an allocator to use several distinct memory mappings in the virtual address space, each containing its own heap. Each tag would essentially only claim memory from a very specific section of the virtual address space, so pointers can be quickly classified in dealloc. Massive virtual address spaces, and mmap with MAP_ANONYMOUS | MAP_PRIVATE, give us the necessary API to control the address space, at least on Linux.

  2. Just use the few bits in pointers which are not touched by address translation -- I think 16 bits on x86_64. This would be somewhat architecture-specific but simpler. I have no idea if this can cause undefined behavior, but it should be fine, since the tagging would be done inside the GlobalAlloc implementation.

I think both of these methods would keep the performance / behavior of the code closer to the uninstrumented use case. (A sketch of method 1 follows below.)
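
A sketch of method 1's address-space side (assumptions: Linux, the libc crate, and one fixed-size reservation per tag; a real allocator would carve individual allocations out of each region):

use std::ptr;

const REGION_SIZE: usize = 1 << 40; // 1 TiB of address space per tag

struct TagRegion {
    base: *mut u8,
}

impl TagRegion {
    // Reserve address space only; physical pages are committed lazily on
    // first touch, so an empty region costs almost nothing.
    fn reserve() -> TagRegion {
        let base = unsafe {
            libc::mmap(
                ptr::null_mut(),
                REGION_SIZE,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_ANONYMOUS | libc::MAP_PRIVATE | libc::MAP_NORESERVE,
                -1,
                0,
            )
        };
        assert_ne!(base, libc::MAP_FAILED);
        TagRegion { base: base as *mut u8 }
    }

    // Classification in dealloc: does this pointer belong to the tag's region?
    fn contains(&self, p: *const u8) -> bool {
        let base = self.base as usize;
        (base..base + REGION_SIZE).contains(&(p as usize))
    }
}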

ruabmbua commented Sep 6, 2021

Performance of method 2 would probably be better. Method 1 will probably thrash the TLB more than usual code does (because more page tables are needed).

matklad commented Sep 6, 2021

Genius!

ruabmbua commented Sep 6, 2021

Update: method 2 will probably not work for some things. If e.g. a buffer with a tagged pointer is passed to the kernel in a syscall, the pointer is also checked by the kernel, which means it becomes impossible to use the tags for anything that talks to the kernel.

I liked method 1 better anyway ^^

ruabmbua commented Sep 6, 2021

Seems like jemalloc already uses arenas internally to improve heavily multi-threaded code. It might be possible to create them manually and control their place in the address space. I think I will experiment with this for a little while, and maybe try to integrate something into ra if it works out.

This will of course defeat the actual purpose of the arenas then ^^.
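
Something like this could be the starting point (a sketch, assuming the tikv-jemalloc-sys bindings; the flag value mirrors jemalloc's MALLOCX_ARENA macro, and error handling is minimal):

use std::ptr;

use tikv_jemalloc_sys::{mallctl, mallocx};

// Create a fresh jemalloc arena and return its index.
unsafe fn create_arena() -> u32 {
    let mut arena: u32 = 0;
    let mut len = std::mem::size_of::<u32>();
    let rc = mallctl(
        b"arenas.create\0".as_ptr().cast(),
        (&mut arena as *mut u32).cast(),
        &mut len,
        ptr::null_mut(),
        0,
    );
    assert_eq!(rc, 0);
    arena
}

// Allocate from a specific arena -- one arena per tag.
unsafe fn alloc_in_arena(size: usize, arena: u32) -> *mut u8 {
    // ((a) + 1) << 20 is jemalloc's MALLOCX_ARENA(a) encoding.
    mallocx(size, ((arena + 1) << 20) as i32).cast()
}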

bjorn3 commented Sep 6, 2021

> Just use the few bits in pointers which are not touched by address translation -- I think 16 bits on x86_64. This would be somewhat architecture-specific but simpler. I have no idea if this can cause undefined behavior, but it should be fine, since the tagging would be done inside the GlobalAlloc implementation.

x86_64 requires that pointers are canonical: the upper 16 bits must be a sign extension of bit 47. If you try to use a pointer with the upper 16 bits changed, you are guaranteed to get a general-protection fault.

> Build / modify an allocator to use several distinct memory mappings in the virtual address space, each containing its own heap. Each tag would essentially only claim memory from a very specific section of the virtual address space, so pointers can be quickly classified in dealloc. Massive virtual address spaces, and mmap with MAP_ANONYMOUS | MAP_PRIVATE, give us the necessary API to control the address space, at least on Linux.

This option would work, though only on 64-bit systems; 32-bit systems only have 2 GB worth of allocatable address space.

ruabmbua commented Sep 6, 2021

@bjorn3 thanks! I did not know that about x86_64. That is probably also the reason why Linux uses only very limited pointer tagging in the kernel itself.

ruabmbua commented

I implemented the original idea (not one of my crazy optimization ideas) a while back, and kind of forgot.

It is working well already, but I can not really say that it gives me much insight into rust-analyzer's memory usage. If you have any ideas where I should place the instrumentation calls to get started, tell me. Currently, I just wrapped the entry points to the parser. It might still be some work to add sensible instrumentation calls throughout the entire rust-analyzer.

ruabmbua commented

If you are interested, the most problematic thing was the stack for the tags ^^. Turns out you can't use e.g. a Vec for that, because growing it obviously calls the GlobalAlloc as well, which results in endless recursion.
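
One way around that is a fixed-capacity stack stored directly in a thread-local, so pushing a tag never touches the heap (a sketch; the capacity of 64 is an arbitrary assumption):

use std::cell::RefCell;

const MAX_DEPTH: usize = 64; // assumption: tag scopes don't nest deeper

struct TagStack {
    tags: [usize; MAX_DEPTH],
    len: usize,
}

thread_local! {
    static TAGS: RefCell<TagStack> =
        RefCell::new(TagStack { tags: [0; MAX_DEPTH], len: 0 });
}

// Pushing never allocates: the storage is a plain array in the thread-local.
fn push_tag(tag: usize) {
    TAGS.with(|s| {
        let mut s = s.borrow_mut();
        assert!(s.len < MAX_DEPTH, "tag stack overflow");
        let i = s.len;
        s.tags[i] = tag;
        s.len += 1;
    });
}

fn pop_tag() {
    TAGS.with(|s| {
        let mut s = s.borrow_mut();
        assert!(s.len > 0, "tag stack underflow");
        s.len -= 1;
    });
}

// The allocator reads the topmost tag, if any.
fn current_tag() -> Option<usize> {
    TAGS.with(|s| {
        let s = s.borrow();
        s.len.checked_sub(1).map(|i| s.tags[i])
    })
}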

matklad commented Oct 23, 2021

Plugging instrumentation here:

https://github.com/rust-analyzer/rust-analyzer/blob/a75353e8acc63d660b2bcb2e70f496ddc0567789/crates/profile/src/hprof.rs#L61

should do the trick. We already use profile in most of the interesting places of ra, so piggybacking on the ProfileSpan should give you enough coverage.
