
feat: unbloat, reduce image size #70


Closed
wants to merge 3 commits into from

MQ37
Contributor

@MQ37 MQ37 commented Mar 31, 2025

Closes #72

These changes bring image size from ~1.4 GB to ~600 MB.

@MQ37 MQ37 changed the title feat: unbload, reduce image size feat: unbloat, reduce image size Mar 31, 2025
@MQ37 MQ37 requested review from jirispilka, matyascimbulka and metalwarrior665 and removed request for jirispilka March 31, 2025 13:45
@matyascimbulka
Collaborator

@MQ37 Could you give us links to the build as well as some runs (STANDBY and STANDALONE) using it?

@MQ37
Contributor Author

MQ37 commented Mar 31, 2025

> @MQ37 Could you give us links to the build as well as some runs (STANDBY and STANDALONE) using it?

Sure 👍

Build link: https://console.apify.com/actors/B2VM9FhWyxLEMb7tm/builds/1.0.25/log
Normal run: https://console.apify.com/view/runs/osLeK8ehOOosBGuot
Standby run: https://console.apify.com/view/runs/UQmDACw95ugfnTSDm

Collaborator

@matyascimbulka matyascimbulka left a comment

Wow, this is really cool. Reducing the image size by 1GB without reducing functionality is impressive. Thank you.

@metalwarrior665
Member

Have you checked if this makes sense to do for the Apify Docker base images?

@MQ37
Contributor Author

MQ37 commented Mar 31, 2025

> Have you checked if this makes sense to do for the Apify Docker base images?

Since it uses a distroless image, where we have to copy in every dependency ourselves, I don't think it is viable to make this a base image for Actors: they may need specific libraries or different versions.

@jirispilka
Collaborator

Yeah, this is great. It will speed up startup significantly (when the image is not cached).

I think it could make sense to use it as a base image for certain use cases — or to include it in the templates as an example.

Assuming there isn’t some catch I’m not seeing right now? @metalwarrior665 can you think of any?
Or discuss this with someone from tooling?

We definitely need to test it properly before releasing it in the RAG Web Browser.

@metalwarrior665
Member

metalwarrior665 commented Apr 1, 2025

This should definitely go through tooling/platform review as the base Dockerfiles are cached/preloaded in some way.

Btw: When we dropped the image size in Google Maps from 2 GB to 400 MB (browser -> cheerio), we saw no improvement in startup time. I think the caching just takes care of that. So I would measure the benefit we actually want, which is usually startup time and build time.

@MQ37
Contributor Author

MQ37 commented Apr 2, 2025

Just found out that I probably tested locally with Playwright, which does not use the Docker image, and when copying from the builder image I forgot to copy the browser with its requirements 🤦 When including the browser and all the libs, we can realistically save at most ~300 MB, and I don't know whether that is worth the overhead (we would have to solve the loading of the dynamically linked libraries). But we can look into distroless images for other, simpler use cases, such as Node-only Actors.

@MQ37
Contributor Author

MQ37 commented Apr 2, 2025

> Just found out that I probably tested locally with Playwright, which does not use the Docker image, and when copying from the builder image I forgot to copy the browser with its requirements 🤦 When including the browser and all the libs, we can realistically save at most ~300 MB, and I don't know whether that is worth the overhead (we would have to solve the loading of the dynamically linked libraries). But we can look into distroless images for other, simpler use cases, such as Node-only Actors.

Taking this back: I hacked it to work (it's really hacky), the image is ~600 MB, and Playwright works 👍 More than 300 MB, but way less than the original 1.3 GB.

@metalwarrior665
Member

metalwarrior665 commented Apr 2, 2025

Cool! Could you still make a post in #product-dev-tools and present this achievement before we merge? We can take this Actor as a PoC of the distroless approach, but it should be approved by the Tooling team. There might be some assumptions we have about our base Dockerfiles.

@metalwarrior665
Member

@MQ37 If you look at the first 2 log lines of your runs, the Docker pull takes 6+ seconds, which is terrible. Usually it takes 1-2 seconds. You can try a few more tests, but I think this is a cache miss, and unless we make your distroless approach a new standard, we have to pause this.

@jirispilka
Collaborator

Yes, it would mean that the image is not cached. If you ran it more often, the time should improve.

Since the RAG Web Browser gets 2-5k runs per day, we need to be cautious with non-standard changes.
That is also why I'm holding off on the media-blocking PR #54: I need more time to test it properly.

@MQ37
Contributor Author

MQ37 commented Apr 2, 2025

> @MQ37 If you look at the first 2 log lines of your runs, the Docker pull takes 6+ seconds, which is terrible. Usually it takes 1-2 seconds. You can try a few more tests, but I think this is a cache miss, and unless we make your distroless approach a new standard, we have to pause this.

Did a few more tests, and the pull is way faster when I hit the cache. But I noticed that the startup time, i.e. the time from "Starting Docker container." to the printed system info in the logs, is slower and less stable (higher std), and that is what we want to optimize for.

Original RAG Browser:
Pull to Start - Mean: 0.173s, Median: 0.179s, Std: 0.029s
Start to Sys - Mean: 1.607s, Median: 1.586s, Std: 0.177s

RAG Browser Distroless:
Pull to Start - Mean: 4.545s, Median: 4.028s, Std: 4.744s
Start to Sys - Mean: 2.123s, Median: 1.597s, Std: 0.981s
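
For reference, summary figures like the ones above can be derived from a list of per-run durations. This is only a minimal sketch: `summarize` and `Summary` are hypothetical names, and it assumes the sample standard deviation (n - 1 denominator), which may differ from what was actually used for the numbers in this comment.

```typescript
interface Summary {
    mean: number;
    median: number;
    std: number;
}

// Computes mean, median and sample standard deviation of durations in seconds.
function summarize(samples: number[]): Summary {
    if (samples.length === 0) throw new Error('need at least one sample');
    const n = samples.length;
    const mean = samples.reduce((sum, x) => sum + x, 0) / n;

    // Sort a copy so the caller's array is left untouched.
    const sorted = [...samples].sort((a, b) => a - b);
    const mid = Math.floor(n / 2);
    const median = n % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;

    // Assumption: sample variance (n - 1); a single sample yields 0.
    const variance = n > 1
        ? samples.reduce((sum, x) => sum + (x - mean) ** 2, 0) / (n - 1)
        : 0;

    return { mean, median, std: Math.sqrt(variance) };
}
```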

@metalwarrior665
Member

We need to optimize the time from Actor start to system info, but these parts depend on multiple teams.

The Platform team has been working on optimizing the base Dockerfiles for years, so you will definitely need to talk with them. The problem is a bit of a catch-22, as they will not want to pre-cache images that are not the base for everybody. And even if we changed it for everyone, there would still be old versions and people not upgrading. But I'm just speculating; the best thing is to talk to them.

And then the Tooling team can probably influence how fast the Node.js or Python process starts.

@MQ37
Contributor Author

MQ37 commented Apr 5, 2025

> We need to optimize the time from Actor start to system info, but these parts depend on multiple teams.
>
> The Platform team has been working on optimizing the base Dockerfiles for years, so you will definitely need to talk with them. The problem is a bit of a catch-22, as they will not want to pre-cache images that are not the base for everybody. And even if we changed it for everyone, there would still be old versions and people not upgrading. But I'm just speculating; the best thing is to talk to them.
>
> And then the Tooling team can probably influence how fast the Node.js or Python process starts.

Talked to @jirimoravcik, and the first thing is we need more test runs (thousands) to get statistically significant results. Jirka tried a few runs, and they seem okay to him - https://apify.slack.com/archives/CD0SF6KD4/p1743774919039049?thread_ts=1743622813.549859&cid=CD0SF6KD4. I already have this on my to-do list, so I will run the tests when I have time and report the results.

@github-actions github-actions bot added the t-ai Issues owned by the AI team. label May 6, 2025
@MQ37
Contributor Author

MQ37 commented May 6, 2025

Wrote an Actor runtime benchmark and tested current apify/rag-web-browser master build (id mYEmhSzwMdjILx279) vs the distroless one from this branch. Benchmark script: https://github.com/apify/rag-web-browser/blob/a9d93437fe4698d7fd033791e29cf1a00d5cab83/benchmark/runtime.ts

Executed 500 + 2000 runs for each Actor (master and distroless) and measured:

  • pull to start time: from "Pulling Docker image of build" to "Starting Docker container"
  • start to system time: from "Starting Docker container" to "System info"
  • total run time: the overall Actor run time in seconds from the API run details
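
The two log-based measurements can be sketched roughly as follows. This is not the linked benchmark script, just an illustration: the three marker strings are the ones named in this comment, while the assumption that every log line starts with an ISO 8601 timestamp followed by a space, and the names `timestampOf` and `startupTimes`, are hypothetical.

```typescript
// Marker strings as named in the comment; everything else is illustrative.
const MARKERS = {
    pull: 'Pulling Docker image of build',
    start: 'Starting Docker container',
    system: 'System info',
} as const;

interface StartupTimes {
    pullToStartSecs: number;
    startToSystemSecs: number;
}

// Returns the timestamp (ms since epoch) of the first line containing `marker`,
// or null when the marker or a parseable timestamp is missing.
// Assumption: each line begins with an ISO 8601 timestamp followed by a space.
function timestampOf(lines: string[], marker: string): number | null {
    const line = lines.find((l) => l.includes(marker));
    if (line === undefined) return null;
    const end = line.indexOf(' ');
    const stamp = end === -1 ? line : line.slice(0, end);
    const ms = Date.parse(stamp);
    return Number.isNaN(ms) ? null : ms;
}

// Derives both durations from a raw run log; null if any marker is absent.
function startupTimes(log: string): StartupTimes | null {
    const lines = log.split('\n');
    const pull = timestampOf(lines, MARKERS.pull);
    const start = timestampOf(lines, MARKERS.start);
    const system = timestampOf(lines, MARKERS.system);
    if (pull === null || start === null || system === null) return null;
    return {
        pullToStartSecs: (start - pull) / 1000,
        startToSystemSecs: (system - start) / 1000,
    };
}
```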

Based on the results, the distroless image performs a bit better in start to system time (the most important measurement in this benchmark); the other measurements are comparable. The key improvement is that the distroless build is only ~600 MB, compared to ~1.3 GB for the master build.

Actor run settings:

  • Memory: 1 GB
  • Input:
{
    query: 'apify ai',
    maxResults: 1,
}

Master build results:

Completed 500 runs
------------------------------------------------------
Average pull to start time: 0.21 seconds
Median pull to start time: 0.17 seconds
Min pull to start time: 0.12 seconds
Max pull to start time: 4.24 seconds
Standard deviation of pull to start times: 0.20 seconds
------------------------------------------------------
Average start to system time: 1.93 seconds
Median start to system time: 1.83 seconds
Min start to system time: 1.41 seconds
Max start to system time: 4.75 seconds
Standard deviation of start to system times: 0.40 seconds
------------------------------------------------------
Average total run time: 9.62 seconds
Median total run time: 7.53 seconds
Min total run time: 4.66 seconds
Max total run time: 50.32 seconds
Standard deviation of total run times: 5.80 seconds

Completed 2000 runs
------------------------------------------------------
Average pull to start time: 0.22 seconds
Median pull to start time: 0.17 seconds
Min pull to start time: 0.11 seconds
Max pull to start time: 5.67 seconds
Standard deviation of pull to start times: 0.30 seconds
------------------------------------------------------
Average start to system time: 1.90 seconds
Median start to system time: 1.80 seconds
Min start to system time: 1.34 seconds
Max start to system time: 5.18 seconds
Standard deviation of start to system times: 0.40 seconds
------------------------------------------------------
Average total run time: 11.01 seconds
Median total run time: 7.67 seconds
Min total run time: 4.45 seconds
Max total run time: 98.41 seconds
Standard deviation of total run times: 8.52 seconds

Distroless build results:

Completed 500 runs
------------------------------------------------------
Average pull to start time: 0.30 seconds
Median pull to start time: 0.18 seconds
Min pull to start time: 0.12 seconds
Max pull to start time: 10.41 seconds
Standard deviation of pull to start times: 0.82 seconds
------------------------------------------------------
Average start to system time: 1.78 seconds
Median start to system time: 1.69 seconds
Min start to system time: 1.23 seconds
Max start to system time: 5.07 seconds
Standard deviation of start to system times: 0.42 seconds
------------------------------------------------------
Average total run time: 9.20 seconds
Median total run time: 7.12 seconds
Min total run time: 4.55 seconds
Max total run time: 39.85 seconds
Standard deviation of total run times: 5.97 seconds

Completed 2000 runs
------------------------------------------------------
Average pull to start time: 0.25 seconds
Median pull to start time: 0.18 seconds
Min pull to start time: 0.11 seconds
Max pull to start time: 13.05 seconds
Standard deviation of pull to start times: 0.59 seconds
------------------------------------------------------
Average start to system time: 1.83 seconds
Median start to system time: 1.75 seconds
Min start to system time: 1.19 seconds
Max start to system time: 3.96 seconds
Standard deviation of start to system times: 0.38 seconds
------------------------------------------------------
Average total run time: 10.98 seconds
Median total run time: 7.69 seconds
Min total run time: 4.66 seconds
Max total run time: 72.11 seconds
Standard deviation of total run times: 8.36 seconds

@jirispilka
Collaborator

@MQ37 Thanks for the detailed benchmark!

It would be great if you could summarize the results a bit more clearly. Going through all those tables takes extra effort for anyone just trying to get the main takeaway :)

I don’t think we need to show all 500 runs, right? This looks like just a subset of the full 2k runs.
Also, min/max values tend to be quite noisy; using p95 or p99 would give a more meaningful picture.
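
As an illustration of the p95/p99 suggestion, here is a minimal sketch using the nearest-rank definition of a percentile. The function name `percentile` is hypothetical, and other percentile definitions (for example, with interpolation) would give slightly different values.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p percent
// of all samples are less than or equal to it. `p` must be in (0, 100].
function percentile(samples: number[], p: number): number {
    if (samples.length === 0) throw new Error('need at least one sample');
    if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
    // Sort a copy numerically; the default sort would compare as strings.
    const sorted = [...samples].sort((a, b) => a - b);
    // Multiply before dividing to keep integer inputs exact.
    const rank = Math.ceil((p * sorted.length) / 100);
    return sorted[rank - 1];
}
```

Called as `percentile(startToSystemTimes, 95)`, where `startToSystemTimes` stands for a hypothetical array of per-run durations, this would replace the max value in a report.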

That said, when looking at the data, the key metric seems to be start to system time and its average with standard deviation is all we need to make a decision. So it boils down to this:

| Metric | Master build | Distroless build |
| --- | --- | --- |
| Average start to system time | 1.90 (±0.40) sec | 1.83 (±0.38) sec |

If we wanted to be rigorous, we’d run some statistical tests like a t-test, but based on the averages and standard deviations, I don’t see a compelling reason to introduce a new image and diverge from the standard Apify base image. Sorry.
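
For anyone who wants to run such a test, Welch's two-sample t statistic can be computed from summary statistics alone. The sketch below assumes the reported values are sample means and sample standard deviations from 2000 independent runs per build; `GroupStats` and `welchT` are hypothetical names, not part of the benchmark script.

```typescript
interface GroupStats {
    mean: number;
    std: number;
    n: number;
}

// Welch's t statistic and Welch-Satterthwaite degrees of freedom,
// computed from per-group mean, standard deviation and sample size.
function welchT(a: GroupStats, b: GroupStats): { t: number; df: number } {
    const va = a.std ** 2 / a.n; // squared standard error of group a
    const vb = b.std ** 2 / b.n; // squared standard error of group b
    const t = (a.mean - b.mean) / Math.sqrt(va + vb);
    const df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1));
    return { t, df };
}

// Start to system time from the 2000-run batches quoted above.
const master: GroupStats = { mean: 1.90, std: 0.40, n: 2000 };
const distroless: GroupStats = { mean: 1.83, std: 0.38, n: 2000 };
const result = welchT(master, distroless);
console.log(`t = ${result.t.toFixed(2)}, df = ${result.df.toFixed(0)}`);
```

For the figures quoted above this gives a t statistic of roughly 5.7. With this many runs, even a small gap between means produces a large statistic, so it does not by itself say whether a difference of about 0.07 s matters in practice.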

It would be nice to see the full distribution and p95/p99, but based on the code, it looks like you'd need to rerun the entire experiment. I don't think it is worth it.

@metalwarrior665
Member

metalwarrior665 commented May 10, 2025

The important part is to space these test runs an hour or more apart so that you are not hitting the same Actor run worker with already cached images. Because we pre-cache the standard Dockerfiles, master will be much better cached than any non-standard build (try building an Actor in Rust and watch it take 20 seconds to start the container :) ). So my assumption is that the numbers will get significantly worse for distroless, which is something we saw in earlier tests.
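
One way to plan such spaced runs is sketched below. It only builds a schedule and does not start anything; `planSpacedRuns` and `PlannedRun` are hypothetical names, and alternating the builds so that both see similar platform conditions is an assumption of this sketch, not something prescribed in the thread.

```typescript
interface PlannedRun {
    build: string;
    startAt: Date;
}

// Produces one run per build per round, with consecutive runs `gapMinutes` apart.
// Builds alternate within each round so neither gets a systematically warmer cache.
function planSpacedRuns(
    builds: string[],
    rounds: number,
    gapMinutes: number,
    first: Date,
): PlannedRun[] {
    const plan: PlannedRun[] = [];
    const gapMs = gapMinutes * 60 * 1000;
    let slot = 0;
    for (let round = 0; round < rounds; round++) {
        for (const build of builds) {
            plan.push({ build, startAt: new Date(first.getTime() + slot * gapMs) });
            slot++;
        }
    }
    return plan;
}
```

For example, `planSpacedRuns(['master', 'distroless'], 24, 60, new Date())` yields 48 start times one hour apart.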

Also, the image size does not seem to be a determining factor for real runs. When I tested Cheerio (± 400 MB) vs browser (± 2 GB), I didn't see any difference in startup time. So it has much more to do with how our platform optimizes these than with the images themselves. Our average start used to be 5+ seconds (about 3 years ago), and almost all of the improvement to the current <2 sec comes from optimizing the platform (I think mostly caching the right things, plus some filesystem changes) rather than the images.

I think you make a very good case to eventually use this as a base image for most Actors. At that point we would pre-cache it and the problem with cold starts I mentioned above would not exist. But that will be a long road so you should definitely go through this with the platform & tooling teams.

@MQ37
Contributor Author

MQ37 commented May 12, 2025

After discussing with @jirispilka we decided it is not worth it now - maybe in the future. I will keep this as my personal TODO to discuss with the platform and tooling team.

@MQ37 MQ37 closed this May 12, 2025
@metalwarrior665
Member

Yeah, this is just one Actor, but your approach could be a significant improvement for all of Apify, so let's not throw it out.

Labels
t-ai Issues owned by the AI team.
Development

Successfully merging this pull request may close these issues.

Benchmark the distroless image build
4 participants