Memory optimization for loading weights for no_std mode #2871
Comments
Yes, I agree there is a lot of room for improvement in loading weights. We haven't focused on this initially, but I think this is the perfect time. Since you're in the weeds of it, it would be great if you tried using more efficient Rust APIs to consume the existing preallocated memory without duplicating it (albeit temporarily). I know there are some. We will review your PR. |
I have the exact same issues, and I noticed the same things with the Raspberry Pi Pico. I would be willing to tackle this issue as well with a team that I'm working with. Is there an active PR for this, or should I create one? |
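To illustrate the kind of Rust API usage being suggested, here is a minimal (hypothetical, not Burn-specific) sketch contrasting cloning a preallocated weight buffer with moving it. `Weights`, `load_cloning`, and `load_moving` are made-up names for illustration only:

```rust
// Hypothetical model state holding a flat weight buffer.
struct Weights {
    data: Vec<f32>,
}

fn load_cloning(buf: &[f32]) -> Weights {
    // Allocates a second buffer of the same size: peak memory roughly doubles.
    Weights { data: buf.to_vec() }
}

fn load_moving(buf: Vec<f32>) -> Weights {
    // Takes ownership of the existing allocation: no extra allocation at all.
    Weights { data: buf }
}
```

Passing ownership through the loading pipeline (instead of borrowing and cloning) is the cheapest way to avoid the temporary duplication, since the same heap allocation is reused end to end.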
@BjornTheProgrammer I just pushed some experiments to #2881. On my ESP32s3 it does not panic because of allocation failures. |
This is what I'm seeing using
I've been trying to add zero-copy serialization like rkyv, but it's not that straightforward, as rkyv requires |
I agree, and that's exactly where I got stuck (and please note that I've never used rkyv before, so I might be going the wrong way), because |
and similar in its counterpart. Is there a way to use rkyv without that? I'm almost tempted to try to grab the weight tensors directly from my model and serialize and deserialize them myself as a quick hack. |
Oh, and also note that I'm not claiming all of those allocations I posted come from bincode. I haven't looked closely at what happens after the bytes are deserialized by bincode, and there are probably more allocations/deallocations there. |
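The "quick hack" mentioned above could look roughly like the following sketch: a flat f32 weight tensor encoded as fixed-width little-endian bytes, with no serialization crate at all. Both helper names are hypothetical, not part of Burn's API:

```rust
// Serialize a flat f32 weight tensor as little-endian bytes.
fn serialize_weights(weights: &[f32]) -> Vec<u8> {
    let mut out = Vec::with_capacity(weights.len() * 4);
    for w in weights {
        out.extend_from_slice(&w.to_le_bytes());
    }
    out
}

// Deserialize the same fixed-width encoding back into f32 values.
fn deserialize_weights(bytes: &[u8]) -> Vec<f32> {
    bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes(c.try_into().unwrap()))
        .collect()
}
```

The obvious limitation is that this handles only a single flat tensor and no metadata (shapes, names, dtypes), which is exactly what a real format like rkyv or bincode provides.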
My PR #2892 has been merged! The memory savings weren't quite what I expected. I believe most of the optimization will probably come from the executor backend. I'm going to try to create some tooling or process for inspecting memory usage in depth, to really discover where the best savings can come from. |
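One common starting point for that kind of tooling is a tracking allocator: wrap the system allocator and record live and peak heap usage. This is a generic sketch (not anything from the PR above); on an embedded target you would wrap the target's allocator instead of `System`:

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts currently live heap bytes and the peak ever observed.
static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

struct TrackingAlloc;

unsafe impl GlobalAlloc for TrackingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let p = System.alloc(layout);
        if !p.is_null() {
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        p
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout);
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

#[global_allocator]
static ALLOC: TrackingAlloc = TrackingAlloc;
```

Reading `PEAK` before and after a model load gives a quick upper bound on how much transient memory the loading path really needs.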
Hey folks,
Thanks for building a genius framework! From a recent issue, you probably remember that I tried running the SqueezeNet example on an ESP32. I switched to an ESP32-S3 with 8 MB PSRAM. After failing to run it there due to allocation failures, I started a discussion in the #esp-rs:matrix.org chat, which turned out to be super fruitful (BIG shoutout!).
A few key findings and questions that I will try to summarize from the thread:
- Can Burn operate directly off the EMBEDDED_STATES, i.e. without copying the model to RAM?
- Ideally, they should split their model into "read-only" stuff and "read-write" stuff, and then the "read-only" stuff is used as-is, i.e. flashed un-decoded.
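Operating directly off flash-resident bytes could be sketched as follows. This is a hypothetical helper, not a Burn API: it reinterprets a byte slice (e.g. from `include_bytes!`) as an `&[f32]` without copying, provided the slice is suitably aligned. It assumes the stored bytes match the target's endianness (little-endian on ESP32/RP2040):

```rust
// Zero-copy view of raw weight bytes as f32 values.
// Returns None if the slice is misaligned or has a ragged length.
fn as_f32_slice(bytes: &[u8]) -> Option<&[f32]> {
    if bytes.len() % core::mem::size_of::<f32>() != 0
        || bytes.as_ptr() as usize % core::mem::align_of::<f32>() != 0
    {
        return None;
    }
    // Safety: length and alignment were checked above, and every
    // 4-byte bit pattern is a valid f32.
    Some(unsafe {
        core::slice::from_raw_parts(bytes.as_ptr() as *const f32, bytes.len() / 4)
    })
}
```

Note that `include_bytes!` does not guarantee 4-byte alignment on its own, so in practice the embedded data would need to be wrapped in an aligned static; the runtime check above catches that.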
Apparently there is room for some improvement to run models in no_std, which I guess will also be beneficial when running with std.
Looking forward to hearing your thoughts.