feat: unbloat, reduce image size #70

MQ37 · 2025-03-31T13:45:10Z

Closes #72

These changes bring image size from ~1.4 GB to ~600 MB.

matyascimbulka · 2025-03-31T14:16:52Z

@MQ37 Could you give us links to the build as well as some runs (STANDBY and STANDALONE) using it?

MQ37 · 2025-03-31T14:33:03Z

> @MQ37 Could you give us links to the build as well as some runs (STANDBY and STANDALONE) using it?

Sure 👍

Build link: https://console.apify.com/actors/B2VM9FhWyxLEMb7tm/builds/1.0.25/log
Normal run: https://console.apify.com/view/runs/osLeK8ehOOosBGuot
Standby run: https://console.apify.com/view/runs/UQmDACw95ugfnTSDm

matyascimbulka

Wow, this is really cool. Reducing the image size by 1GB without reducing functionality is impressive. Thank you.

metalwarrior665 · 2025-03-31T17:56:46Z

Have you checked if this makes sense to do for the Apify Docker base images?

MQ37 · 2025-03-31T18:58:22Z

> Have you checked if this makes sense to do for the Apify Docker base images?

Since it uses a distroless image, where we need to copy in all the requirements ourselves, I don't think it makes sense or is viable to make this a base image for Actors: they may use specific libs or different versions.

jirispilka · 2025-03-31T19:10:21Z

Yeah, this is great. It will speed up start significantly (when image is not cached)

I think it could make sense to use it as a base image for certain use cases — or to include it in the templates as an example.

Assuming there isn’t some catch I’m not seeing right now? @metalwarrior665 can you think of any?
Or discuss this with someone from tooling?

We definitely need to test it properly before releasing it in the RAG Web Browser.

metalwarrior665 · 2025-04-01T04:39:47Z

This should definitely go through tooling/platform review as the base Dockerfiles are cached/preloaded in some way.

Btw: When we dropped image size in Google Maps from 2 GB to 400 MB (browser -> cheerio), we saw no improvement in startup time. I think the caching just takes care of that. So I would measure what benefit we actually want, which is usually startup times and build times.

MQ37 · 2025-04-02T13:37:35Z

Just found out that when I tested, I probably tested locally with Playwright, which does not use the Docker image, and when copying from the builder image I forgot to copy the browser with its requirements 🤦 When including the browser and all the libs, we can realistically save at most ~300 MB, which I don't know is worth the overhead (we would have to solve loading of the dynamically linked libraries). But we can look into the distroless images for other, simpler use cases for Node-only Actors.

MQ37 · 2025-04-02T14:25:05Z

> Just found out that when I tested, I probably tested locally with Playwright, which does not use the Docker image, and when copying from the builder image I forgot to copy the browser with its requirements 🤦 When including the browser and all the libs, we can realistically save at most ~300 MB, which I don't know is worth the overhead (we would have to solve loading of the dynamically linked libraries). But we can look into the distroless images for other, simpler use cases for Node-only Actors.

Taking this back: I hacked it to work (it's really hacky), the image is ~600 MB and Playwright works 👍 More than 300 MB but way less than the original 1.3 GB.

MQ37 · 2025-04-02T15:04:52Z

Build: https://console.apify.com/actors/B2VM9FhWyxLEMb7tm/builds/1.0.37/log

Runs
Cheerio normal: https://console.apify.com/view/runs/ZgqKfUbfwrpJGrPlN
Playwright Firefox: https://console.apify.com/view/runs/HLqHFA2BmTWjVK5IO
Standby both: https://console.apify.com/view/runs/uHWmjNsJG1lI5A2gE

metalwarrior665 · 2025-04-02T17:08:21Z

Cool! Could you still make a post in #product-dev-tools and present this achievement before we merge? We can take this Actor as PoC of the distroless approach but it should be approved by the Tooling team. There might be some assumptions we have about our base Dockerfiles

metalwarrior665 · 2025-04-02T17:14:29Z

@MQ37 If you look at the first 2 log lines of your runs, the Docker pull takes 6+ seconds, which is terrible. Usually it takes between 1 and 2 seconds. You can try a few more tests, but I think this is a cache miss, and unless we make your distroless approach a new standard, we have to pause this.

jirispilka · 2025-04-02T19:21:23Z

Yes, it would mean that the image is not cached. If you ran it more often, the time should improve.

Since the RAG Web Browser gets 2-5k runs per day, we need to be cautious with non-standard changes.
Therefore, I'm holding off on the media blocking PR #54; I need more time to test it properly.

MQ37 · 2025-04-02T19:29:24Z

> @MQ37 If you look at the first 2 log lines of your runs, the Docker pull takes 6+ seconds which is terrible. Usually it takes between 1-2 seconds. You can try a few more tests but I think this is cache miss and unless we make your distroless approach a new standard, we have to pause this.

Did a few more tests and the pull is way faster when I hit the cache. But I noticed that the startup time (the time from "Starting Docker container." to the printed system info in the logs), which is what we want to optimize for, is slower and less stable (higher std).

Original RAG Browser:
Pull to Start - Mean: 0.173s, Median: 0.179s, Std: 0.029s
Start to Sys - Mean: 1.607s, Median: 1.586s, Std: 0.177s

RAG Browser Distroless:
Pull to Start - Mean: 4.545s, Median: 4.028s, Std: 4.744s
Start to Sys - Mean: 2.123s, Median: 1.597s, Std: 0.981s
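For readers who want to reproduce figures of this kind, the following is a minimal sketch of how mean, median and standard deviation could be computed from a list of measured durations. It is not the script used in this thread; the `summarize` helper, the `Summary` type and the sample values are hypothetical, and the choice of population standard deviation (dividing by n) is an assumption.

```typescript
// Summary statistics for a list of durations in seconds.
// Hypothetical helper, not taken from the PR's benchmark script.
interface Summary {
    mean: number;
    median: number;
    std: number;
}

function summarize(samples: number[]): Summary {
    if (samples.length === 0) throw new Error('no samples');
    const n = samples.length;
    const mean = samples.reduce((acc, x) => acc + x, 0) / n;

    // Sort a copy so the caller's array is left untouched.
    const sorted = [...samples].sort((a, b) => a - b);
    const mid = Math.floor(n / 2);
    const median = n % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];

    // Population standard deviation (divide by n); this is an assumption,
    // a sample standard deviation would divide by n - 1 instead.
    const variance = samples.reduce((acc, x) => acc + (x - mean) ** 2, 0) / n;
    return { mean, median, std: Math.sqrt(variance) };
}

// Made-up durations with one slow outlier, only to show the call.
const example = summarize([1.5, 1.6, 1.7, 4.0]);
console.log(example);
```

A single slow run pulls the mean and the standard deviation up while leaving the median almost unchanged, which is the pattern visible in the distroless numbers above.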

metalwarrior665 · 2025-04-04T12:05:41Z

We need to optimize from Actor start to system info, but these parts will depend on multiple teams.

The Platform team has been working on optimizing the base Dockerfiles for years, so you will definitely need to talk with them. The problem is a bit of a catch-22, as they will not want to pre-cache images that are not the base for everybody. And even if we changed it for everyone, there would still be old versions and people not upgrading. But I'm just speculating; the best thing is to talk to them.

And then the Tooling team can probably influence how fast the Node.js or Python process starts.

MQ37 · 2025-04-05T11:17:54Z

> We need to optimize from Actor start to system info but these parts will depend on multiple teams.
>
> Platform team has been working on optimizing the base Dockerfiles for years so you will definitely need to talk with them. The problem is a bit of a catch-22 as they will not want pre-cache images that are not the base for everybody. And even if we would change it for everyone, there will still be old versions and people not upgrading. But I'm just speculating, the best is to talk to them.
>
> And then tooling team can probably influence how fast Node.js or Python process starts

Talked to @jirimoravcik, and the first thing is we need more test runs (thousands) to get statistically significant results. Jirka tried a few runs, and they seem okay to him - https://apify.slack.com/archives/CD0SF6KD4/p1743774919039049?thread_ts=1743622813.549859&cid=CD0SF6KD4. I already have this on my to-do list, so I will run the tests when I have time and report the results.

MQ37 · 2025-05-06T13:07:18Z

Wrote an Actor runtime benchmark and tested current apify/rag-web-browser master build (id mYEmhSzwMdjILx279) vs the distroless one from this branch. Benchmark script: https://github.com/apify/rag-web-browser/blob/a9d93437fe4698d7fd033791e29cf1a00d5cab83/benchmark/runtime.ts

Executed 500 + 2000 runs for each Actor (master and distroless) and measured times from Pulling Docker image of build to Starting Docker container (named pull to start time), from Starting Docker container to System info (named start to system time) and then overall Actor run time in secs from the API run details.
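The two log-based intervals described above can be illustrated with a short sketch. This is not the linked benchmark script: the marker strings come from the description above, while the assumption that each log line begins with an ISO 8601 timestamp followed by a space, the `startupTimes` and `findTimestamp` helpers, and the sample log excerpt are all hypothetical.

```typescript
// Marker substrings as named in the benchmark description.
const MARKERS = {
    pull: 'Pulling Docker image of build',
    start: 'Starting Docker container',
    system: 'System info',
} as const;

interface StartupTimes {
    pullToStartSecs: number;
    startToSystemSecs: number;
}

// Returns the timestamp (ms since epoch) of the first line containing `marker`.
// Assumes the line layout "<ISO timestamp> <message>"; this layout is an assumption.
function findTimestamp(lines: string[], marker: string): number {
    const line = lines.find((l) => l.includes(marker));
    if (!line) throw new Error(`marker not found: ${marker}`);
    const ms = Date.parse(line.split(' ')[0]);
    if (Number.isNaN(ms)) throw new Error(`no timestamp on line: ${line}`);
    return ms;
}

function startupTimes(log: string): StartupTimes {
    const lines = log.split('\n');
    const pull = findTimestamp(lines, MARKERS.pull);
    const start = findTimestamp(lines, MARKERS.start);
    const system = findTimestamp(lines, MARKERS.system);
    return {
        pullToStartSecs: (start - pull) / 1000,
        startToSystemSecs: (system - start) / 1000,
    };
}

// Hypothetical log excerpt, invented for illustration only.
const sampleLog = [
    '2025-05-06T13:00:00.000Z ACTOR: Pulling Docker image of build.',
    '2025-05-06T13:00:00.180Z ACTOR: Starting Docker container.',
    '2025-05-06T13:00:01.930Z INFO  System info',
].join('\n');

const times = startupTimes(sampleLog);
console.log(times);
```

The third measurement, overall run time, comes from the API run details rather than the log, so it is left out of this sketch.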

Based on the results the distroless image performs a bit better in start to system time (which is the most important in this benchmark), other measurements are comparable. Key improvement is that the distroless build is only ~600 MB compared to master build ~1.3 GB.

Actor run settings:

Memory: 1 GB
Input:

{
    query: 'apify ai',
    maxResults: 1,
}

Master build results:

Completed 500 runs
------------------------------------------------------
Average pull to start time: 0.21 seconds
Median pull to start time: 0.17 seconds
Min pull to start time: 0.12 seconds
Max pull to start time: 4.24 seconds
Standard deviation of pull to start times: 0.20 seconds
------------------------------------------------------
Average start to system time: 1.93 seconds
Median start to system time: 1.83 seconds
Min start to system time: 1.41 seconds
Max start to system time: 4.75 seconds
Standard deviation of start to system times: 0.40 seconds
------------------------------------------------------
Average total run time: 9.62 seconds
Median total run time: 7.53 seconds
Min total run time: 4.66 seconds
Max total run time: 50.32 seconds
Standard deviation of total run times: 5.80 seconds

Completed 2000 runs
------------------------------------------------------
Average pull to start time: 0.22 seconds
Median pull to start time: 0.17 seconds
Min pull to start time: 0.11 seconds
Max pull to start time: 5.67 seconds
Standard deviation of pull to start times: 0.30 seconds
------------------------------------------------------
Average start to system time: 1.90 seconds
Median start to system time: 1.80 seconds
Min start to system time: 1.34 seconds
Max start to system time: 5.18 seconds
Standard deviation of start to system times: 0.40 seconds
------------------------------------------------------
Average total run time: 11.01 seconds
Median total run time: 7.67 seconds
Min total run time: 4.45 seconds
Max total run time: 98.41 seconds
Standard deviation of total run times: 8.52 seconds

Distroless build results:

Completed 500 runs
------------------------------------------------------
Average pull to start time: 0.30 seconds
Median pull to start time: 0.18 seconds
Min pull to start time: 0.12 seconds
Max pull to start time: 10.41 seconds
Standard deviation of pull to start times: 0.82 seconds
------------------------------------------------------
Average start to system time: 1.78 seconds
Median start to system time: 1.69 seconds
Min start to system time: 1.23 seconds
Max start to system time: 5.07 seconds
Standard deviation of start to system times: 0.42 seconds
------------------------------------------------------
Average total run time: 9.20 seconds
Median total run time: 7.12 seconds
Min total run time: 4.55 seconds
Max total run time: 39.85 seconds
Standard deviation of total run times: 5.97 seconds

Completed 2000 runs
------------------------------------------------------
Average pull to start time: 0.25 seconds
Median pull to start time: 0.18 seconds
Min pull to start time: 0.11 seconds
Max pull to start time: 13.05 seconds
Standard deviation of pull to start times: 0.59 seconds
------------------------------------------------------
Average start to system time: 1.83 seconds
Median start to system time: 1.75 seconds
Min start to system time: 1.19 seconds
Max start to system time: 3.96 seconds
Standard deviation of start to system times: 0.38 seconds
------------------------------------------------------
Average total run time: 10.98 seconds
Median total run time: 7.69 seconds
Min total run time: 4.66 seconds
Max total run time: 72.11 seconds
Standard deviation of total run times: 8.36 seconds

jirispilka · 2025-05-07T09:45:17Z

@MQ37 Thanks for the detailed benchmark!

It would be great if you could summarize the results a bit more clearly; going through all those tables takes extra effort for anyone just trying to get the main takeaway :)

I don’t think we need to show all 500 runs, right? This looks like just a subset of the full 2k runs.
Also, min/max values tend to be quite noisy, using p95 or p99 would give a more meaningful picture.

That said, when looking at the data, the key metric seems to be start to system time and its average with standard deviation is all we need to make a decision. So it boils down to this:

| Metric | Master build | Distroless build |
| --- | --- | --- |
| Average start to system time | 1.90 (±0.40) sec | 1.83 (±0.38) sec |

If we wanted to be rigorous, we’d run some statistical tests like a t-test, but based on the averages and standard deviations, I don’t see a compelling reason to introduce a new image and diverge from the standard Apify base image. Sorry.
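As a rough illustration of the t-test mentioned above, Welch's t statistic can be computed from summary figures alone. The sketch below plugs in the rounded averages and standard deviations of the 2000-run batches reported in this thread; the `welchT` helper and the `GroupSummary` type are hypothetical, and the test assumes independent runs, which may not hold if consecutive runs land on the same cached worker.

```typescript
// Summary of one group of measurements (values in seconds).
interface GroupSummary {
    mean: number;
    std: number;
    n: number;
}

// Welch's unequal-variance t statistic and Welch-Satterthwaite degrees of freedom.
function welchT(a: GroupSummary, b: GroupSummary): { t: number; df: number } {
    const va = a.std ** 2 / a.n; // squared standard error of group a
    const vb = b.std ** 2 / b.n; // squared standard error of group b
    const t = (a.mean - b.mean) / Math.sqrt(va + vb);
    const df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1));
    return { t, df };
}

// Rounded "start to system" figures from the 2000-run batches in this thread.
const master: GroupSummary = { mean: 1.90, std: 0.40, n: 2000 };
const distroless: GroupSummary = { mean: 1.83, std: 0.38, n: 2000 };

const result = welchT(master, distroless);
console.log(result);
```

With samples this large, even a small difference in means can produce a large t statistic, so the size of the effect (about 0.07 s here) still has to be judged separately from its statistical significance.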

It would be nice to see the full distribution and p95, p99, but based on the code, it looks like you'd need to rerun the entire experiment again. I don't think it is worth it.
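If the raw per-run durations were kept, p95 and p99 could be derived without much extra code. A minimal sketch, assuming linear interpolation between the closest ranks (one of several common percentile definitions); the `percentile` helper and the sample durations are hypothetical.

```typescript
// Percentile with linear interpolation between the two closest ranks.
// p is given in the range 0..100.
function percentile(samples: number[], p: number): number {
    if (samples.length === 0) throw new Error('no samples');
    if (p < 0 || p > 100) throw new Error('p must be between 0 and 100');
    const sorted = [...samples].sort((a, b) => a - b);
    const rank = (p / 100) * (sorted.length - 1);
    const lo = Math.floor(rank);
    const hi = Math.ceil(rank);
    // When rank is a whole number, lo === hi and the sample itself is returned.
    return sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo);
}

// Made-up durations 0.0, 0.1, ..., 10.0 seconds, only to show the call.
const durations = Array.from({ length: 101 }, (_, i) => i / 10);
const p95 = percentile(durations, 95);
const p99 = percentile(durations, 99);
console.log({ p95, p99 });
```

Unlike min and max, these values are not decided by a single outlier run, which is why they are suggested above as a less noisy picture of the tail.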

metalwarrior665 · 2025-05-10T07:25:21Z

The important part is to test these runs spaced an hour or longer apart so you are not hitting the same Actor run worker with already cached images. Because we pre-cache the standard Dockerfiles, master will be much more cached than any non-standard build (try to build an Actor in Rust and watch it take 20 seconds to start the container :)), so my assumption is that the numbers will get significantly worse for distroless (something we saw in earlier tests).

Also, the image size does not seem to be a determining factor for real runs. When I did a test of Cheerio (± 400 MB) vs browser (± 2 GB), I didn't see any difference in startup time. So it has much more to do with how our platform optimizes these than with the images themselves. Our average start used to be 5+ seconds (about 3 years ago), and almost all the improvement to the current <2 sec comes from optimizing the platform (I think mostly caching the right things, some filesystem changes too) rather than the images.

I think you make a very good case to eventually use this as a base image for most Actors. At that point we would pre-cache it and the problem with cold starts I mentioned above would not exist. But that will be a long road so you should definitely go through this with the platform & tooling teams.

MQ37 · 2025-05-12T07:26:10Z

After discussing with @jirispilka we decided it is not worth it now - maybe in the future. I will keep this as my personal TODO to discuss with the platform and tooling team.

metalwarrior665 · 2025-05-12T10:36:43Z

Yeah, this is just one Actor, but your approach could be a significant improvement for all of Apify, so let's not throw it out.

Update Dockerfile to use distroless image and streamline build process

8c03adc

MQ37 changed the title ~~feat: unbload, reduce image size~~ feat: unbloat, reduce image size Mar 31, 2025

MQ37 requested review from jirispilka, matyascimbulka and metalwarrior665 and removed request for jirispilka March 31, 2025 13:45

matyascimbulka approved these changes Mar 31, 2025

fix dockerfile

a8330b0

MQ37 mentioned this pull request Apr 28, 2025

Benchmark the distroless image build #72

Closed

add runtime benchmark

a9d9343

github-actions bot assigned MQ37 May 6, 2025

github-actions bot added the t-ai Issues owned by the AI team. label May 6, 2025

MQ37 closed this May 12, 2025
