
refactor: Introduce new Apify storage client #470


Open · vdusek wants to merge 15 commits into master from new-apify-storage-clients

Conversation

vdusek (Contributor) commented May 10, 2025

Description

Issues

Testing

  • The current test set covers the changes.

@vdusek vdusek self-assigned this May 10, 2025
@github-actions github-actions bot added this to the 114th sprint - Tooling team milestone May 10, 2025
@github-actions github-actions bot added the t-tooling label (issues with this label are in the ownership of the tooling team) May 10, 2025
@vdusek vdusek changed the title from "New apify storage clients" to "refactor: Introduce new Apify storage client" May 10, 2025
@vdusek vdusek force-pushed the new-apify-storage-clients branch from d27c080 to 82efd3e on June 12, 2025 12:44
@github-actions github-actions bot added the tested label (temporary label used only programmatically for some analytics) Jun 18, 2025
@vdusek vdusek force-pushed the new-apify-storage-clients branch 2 times, most recently from 067b793 to 104a168 on June 23, 2025 09:12
@vdusek vdusek marked this pull request as ready for review June 26, 2025 13:04
@vdusek vdusek requested a review from Pijukatel June 26, 2025 13:05
@janbuchar janbuchar self-requested a review June 26, 2025 13:27
```diff
@@ -11,14 +11,14 @@ async def main() -> None:
     await dataset.export_to(
         content_type='csv',
         key='data.csv',
-        to_key_value_store_name='my-cool-key-value-store',
+        to_kvs_name='my-cool-key-value-store',
```
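For reference, the documentation example after the rename might read roughly as follows; a minimal sketch assuming the usual Actor.open_dataset() entry point, with an illustrative store name:

```python
# Sketch of the updated docs example; the store name is illustrative.
from apify import Actor


async def main() -> None:
    async with Actor:
        dataset = await Actor.open_dataset()
        # Export the dataset's items as CSV into a named key-value store,
        # using the renamed keyword argument.
        await dataset.export_to(
            content_type='csv',
            key='data.csv',
            to_kvs_name='my-cool-key-value-store',
        )
```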
Contributor:
Is this BC break worth it?

Contributor Author (vdusek):
let's evaluate all the potential BCs at the end

Contributor:
Sure. I thought we were nearing that now 😁

```python
        view=view,
    )
    result = DatasetItemsListPage.model_validate(vars(response))
    await self._update_metadata()
```
Contributor:
Doing this after every method call might be costly.

Contributor Author (vdusek):
Well, there are two API calls instead of one, and they have to be made consecutively. Not ideal, but since it is Dataset.get_data (which retrieves results), it should not be a crawling performance bottleneck. What would be a better approach, anyway? Probably the only way is to make it lazy and update the metadata only when someone accesses it. That would require making the metadata getter async, which would mean another significant restructure of all storage clients (including Crawlee, of course). I wouldn't go for it.

Contributor:
But... the metadata were originally (before the storages refactor) retrieved via get_info, right?

vdusek (Contributor Author) commented Jun 27, 2025:
That's correct... And maybe it's a good point and we should do it: it won't help us with the current local storage clients, but it will help here and maybe in other potential cloud/DB-based storage clients, where you don't have to maintain the metadata manually.

Contributor:

Then I guess the question is whether keeping the metadata up to date is worth the possible performance hit. Since, in principle, nothing keeps the storages from being accessed by multiple clients, the metadata may become stale at virtually any point. So on-demand fetching actually makes a lot of sense to me. But I'm ready to hear other opinions.

Contributor Author (vdusek):

I'm probably for it (it will be some work of course)... And maybe I would just name it async def get_metadata().

@Pijukatel what do you think?
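A minimal sketch of the on-demand get_metadata() idea discussed above; the method name comes from the comment, while DatasetMetadata and the API interface are simplified stand-ins for illustration, not the actual crawlee/apify types:

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Protocol


@dataclass
class DatasetMetadata:
    id: str
    name: str | None
    item_count: int


class DatasetApi(Protocol):
    async def get(self) -> dict[str, Any]: ...


class ApifyDatasetClient:
    def __init__(self, api: DatasetApi) -> None:
        self._api = api

    async def get_metadata(self) -> DatasetMetadata:
        # One API round-trip, made only when the caller actually asks for
        # metadata, instead of a second call after every data operation.
        # Since multiple clients may touch the same storage, cached metadata
        # can go stale at any point anyway.
        info = await self._api.get()
        return DatasetMetadata(
            id=info['id'],
            name=info.get('name'),
            item_count=info.get('itemCount', 0),
        )
```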

```python
    name: str | None,
    configuration: Configuration,
) -> ApifyDatasetClient:
    token = getattr(configuration, 'token', None)
```
Contributor:
We do getattr even though we know we want to use only the Configuration from Apify, but we type-check it as the Configuration from Crawlee. That looks like a workaround: we are losing static type safety and need to do runtime checks.

What is the best approach? I guess that making class DatasetClient(Generic[TConfiguration]) and ApifyDatasetClient(DatasetClient[ApifyConfiguration]) would solve this, but will that open a whole range of new problems?

(Same for the other clients.)
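For illustration, a sketch of the generic-client idea from this comment; all class names are simplified stand-ins for the crawlee and apify types:

```python
from __future__ import annotations

from typing import Generic, TypeVar


class Configuration:  # stand-in for crawlee's Configuration
    pass


class ApifyConfiguration(Configuration):  # stand-in for apify's Configuration
    token: str | None = None


TConfiguration = TypeVar('TConfiguration', bound=Configuration)


class DatasetClient(Generic[TConfiguration]):
    def __init__(self, configuration: TConfiguration) -> None:
        self.configuration = configuration


class ApifyDatasetClient(DatasetClient[ApifyConfiguration]):
    @property
    def token(self) -> str | None:
        # Statically typed: configuration is known to be ApifyConfiguration,
        # so no getattr() or runtime check is needed.
        return self.configuration.token
```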

Contributor:
Making stuff generic like this feels like overkill, yeah. Can we just do an isinstance(configuration, apify.Configuration) check every time we want to use something that's not in the crawlee config?

Or we could accept apify.Configuration (or pull it from the service locator) in ApifyStorageClient and pass it down in the create_*_client methods.
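And a sketch of the isinstance-narrowing alternative, reusing the illustrative Configuration classes from the previous sketch; inside the branch, static checkers narrow the parameter to the Apify subclass:

```python
def resolve_token(configuration: Configuration) -> str | None:
    # isinstance() narrows the type: within this branch, configuration is
    # treated as ApifyConfiguration, so accessing .token is type-safe.
    if isinstance(configuration, ApifyConfiguration):
        return configuration.token
    return None
```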

Contributor:
Also, maybe we could remove open from the DatasetClient, KeyValueStoreClient and RequestQueueClient interfaces and leave defining that to the implementer.
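A sketch of what dropping open() from the shared interface could look like; the names and signatures here are assumptions, not the actual crawlee API:

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from typing import Any


class DatasetClient(ABC):
    # Only the data operations live in the shared interface; no open().
    @abstractmethod
    async def push_data(self, data: list[dict[str, Any]]) -> None: ...


class ApifyDatasetClient(DatasetClient):
    @classmethod
    async def open(cls, *, name: str | None, token: str) -> ApifyDatasetClient:
        # Each implementation defines open() with whatever parameters it
        # needs (a token here), rather than fitting a shared signature.
        raise NotImplementedError('sketch only')

    async def push_data(self, data: list[dict[str, Any]]) -> None:
        raise NotImplementedError('sketch only')
```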

Successfully merging this pull request may close these issues.

Introduce new Apify storage client