refactor: Introduce new Apify storage client #470
Conversation
The branch was force-pushed several times during review (d27c080 → 82efd3e, 067b793 → 104a168, dc7f0a7 → a3d68a2).
```diff
@@ -11,14 +11,14 @@ async def main() -> None:
     await dataset.export_to(
         content_type='csv',
         key='data.csv',
-        to_key_value_store_name='my-cool-key-value-store',
+        to_kvs_name='my-cool-key-value-store',
```
Is this BC break worth it?
let's evaluate all the potential BCs at the end
Sure. I thought we were nearing that now 😁
```python
            view=view,
        )
        result = DatasetItemsListPage.model_validate(vars(response))
        await self._update_metadata()
```
Doing this after every method call might be costly.
Well, there are two API calls instead of one, and they have to be called consecutively. Not ideal, but since it is `Dataset.get_data` (which retrieves results), it should not be a crawling performance bottleneck. What would be a better approach anyway? Probably the only way is to make it lazy and update the metadata only when someone accesses it. That would require making the metadata getter async, which would result in another significant restructure in all storage clients (including Crawlee, of course). I wouldn't go for it.
But... the metadata was originally (before the storages refactor) retrieved via `get_info`, right?
That's correct... And maybe it's a good point and we should do it. It won't help us with the current local storage clients, but it will help here, and maybe in other potential cloud/DB-based storage clients, where you don't have to maintain the metadata manually.
Then I guess the question is whether keeping the metadata up to date is worth the possible performance hit. Since, in principle, nothing keeps the storages from being accessed by multiple clients, the metadata may become stale at virtually any point. So on-demand fetching actually makes a lot of sense to me. But I'm ready to hear other opinions.
I'm probably for it (it will be some work, of course)... And maybe I would just name it `async def get_metadata()`.
@Pijukatel, what do you think?
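For illustration, a minimal sketch of the on-demand approach discussed above. The class shape, the `api_client` attribute, and its `get()` call are assumptions made up for this sketch, not the actual SDK API:

```python
from __future__ import annotations

from typing import Any


class ApifyDatasetClient:
    """Sketch only: metadata is fetched lazily, when someone asks for it."""

    def __init__(self, api_client: Any) -> None:
        # `api_client` stands in for the underlying Apify API dataset client.
        self._api_client = api_client

    async def get_metadata(self) -> dict[str, Any]:
        # One API call, made only when the caller actually needs the metadata,
        # instead of a consecutive refresh call after every data retrieval.
        metadata = await self._api_client.get()
        if metadata is None:
            raise RuntimeError('Dataset does not exist or was deleted.')
        return metadata
```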
```python
    name: str | None,
    configuration: Configuration,
) -> ApifyDatasetClient:
    token = getattr(configuration, 'token', None)
```
We do `getattr` even though we know we want to use only the `Configuration` from Apify, but we type-check it as the `Configuration` from Crawlee. That looks like a workaround, and we are losing static type safety and need to do runtime checks.
What is the best approach? I guess that making `class DatasetClient(Generic[TConfiguration]):` and `ApifyDatasetClient(DatasetClient[ApifyConfiguration])` would solve this, but would that open a whole range of new problems?
(Same for the other clients.)
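A minimal sketch of the generic variant proposed above, with hypothetical stand-in configuration classes (the real `crawlee.Configuration` and `apify.Configuration` are not imported here):

```python
from __future__ import annotations

from typing import Generic, TypeVar


class Configuration:  # stand-in for crawlee.Configuration
    pass


class ApifyConfiguration(Configuration):  # stand-in for apify.Configuration
    token: str | None = None


TConfiguration = TypeVar('TConfiguration', bound=Configuration)


class DatasetClient(Generic[TConfiguration]):
    def __init__(self, configuration: TConfiguration) -> None:
        self._configuration = configuration


class ApifyDatasetClient(DatasetClient[ApifyConfiguration]):
    @property
    def token(self) -> str | None:
        # `self._configuration` is statically typed as ApifyConfiguration,
        # so `token` is reachable without getattr or isinstance checks.
        return self._configuration.token
```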
Making stuff generic like this feels like overkill, yeah. Can we just do an `isinstance(configuration, apify.Configuration)` check every time we want to use something that's not in the Crawlee config?
Or we could accept `apify.Configuration` (or pull it from the service locator) in `ApifyStorageClient` and pass it down in the `create_*_client` methods.
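And a sketch of the `isinstance` alternative, again with hypothetical stand-in classes; the point is that type checkers narrow the type inside the branch, so no `getattr` is needed:

```python
from __future__ import annotations


class Configuration:  # stand-in for crawlee.Configuration
    pass


class ApifyConfiguration(Configuration):  # stand-in for apify.Configuration
    token: str | None = None


def resolve_token(configuration: Configuration) -> str | None:
    # Type checkers understand isinstance narrowing, so accessing `token`
    # inside the branch is statically safe, unlike getattr().
    if isinstance(configuration, ApifyConfiguration):
        return configuration.token
    raise TypeError('An Apify Configuration is required here.')
```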
Also, maybe we could remove `open` from the `DatasetClient`, `KeyValueStoreClient`, and `RequestQueueClient` interfaces and leave defining that to the implementer.