refactor: Introduce new Apify storage client #470
Conversation
The branch was force-pushed several times during review (d27c080 → 82efd3e, 067b793 → 104a168, dc7f0a7 → a3d68a2).
```diff
@@ -11,14 +11,14 @@ async def main() -> None:
     await dataset.export_to(
         content_type='csv',
         key='data.csv',
-        to_key_value_store_name='my-cool-key-value-store',
+        to_kvs_name='my-cool-key-value-store',
```
Is this BC break worth it?
let's evaluate all the potential BCs at the end
Sure. I thought we were nearing that now 😁
```python
            view=view,
        )
        result = DatasetItemsListPage.model_validate(vars(response))
        await self._update_metadata()
```
Doing this after every method call might be costly.
Well, there are two API calls instead of one, and they have to be called consecutively. Not ideal, but since it is `Dataset.get_data` (which retrieves results), it should not be a crawling performance bottleneck. What would be a better approach anyway? Probably the only way is to make it lazy and update the metadata only when someone accesses it. That would require making the metadata getter async, which would result in another significant restructure in all storage clients (including Crawlee, of course). I wouldn't go for it.
But... the metadata was originally (before the storages refactor) retrieved via `get_info`, right?
That's correct... And maybe it's a good point and we should do it. It won't help us with the current local storage clients, but it will help here, and maybe in other potential cloud/DB-based storage clients, where you don't have to maintain the metadata manually.
Then I guess the question is whether keeping the metadata up to date is worth the possible performance hit. Since, in principle, nothing keeps the storages from being accessed by multiple clients, the metadata may become stale at virtually any point. So on-demand fetching actually makes a lot of sense to me. But I'm ready to hear other opinions.
I'm probably for it (it will be some work, of course)... And maybe I would just name it `async def get_metadata()`.
@Pijukatel, what do you think?
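For illustration, a minimal sketch of the on-demand approach discussed above. The class shape, the `api_client` attribute, and its `get()` call are assumptions made up for this sketch, not the actual SDK API:

```python
from __future__ import annotations

from typing import Any


class ApifyDatasetClient:
    """Sketch only: metadata is fetched lazily, when someone asks for it."""

    def __init__(self, api_client: Any) -> None:
        # `api_client` stands in for the underlying Apify API dataset client.
        self._api_client = api_client

    async def get_metadata(self) -> dict[str, Any]:
        # One API call, made only when the caller actually needs the metadata,
        # instead of a consecutive refresh call after every data retrieval.
        metadata = await self._api_client.get()
        if metadata is None:
            raise RuntimeError('Dataset does not exist or was deleted.')
        return metadata
```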
```python
    name: str | None,
    configuration: Configuration,
) -> ApifyDatasetClient:
    token = getattr(configuration, 'token', None)
```
We do `getattr` even though we know we want to use only the `Configuration` from Apify, but we type-check it as the `Configuration` from Crawlee. That looks like a workaround, and we are losing static type safety and need to do runtime checks.
What is the best approach? I guess that making `class DatasetClient(Generic[TConfiguration]):` and `ApifyDatasetClient(DatasetClient[ApifyConfiguration])` would solve this, but would that open a whole range of new problems?
(Same for the other clients.)
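A minimal sketch of the generic variant proposed above, with hypothetical stand-in configuration classes (the real `crawlee.Configuration` and `apify.Configuration` are not imported here):

```python
from __future__ import annotations

from typing import Generic, TypeVar


class Configuration:  # stand-in for crawlee.Configuration
    pass


class ApifyConfiguration(Configuration):  # stand-in for apify.Configuration
    token: str | None = None


TConfiguration = TypeVar('TConfiguration', bound=Configuration)


class DatasetClient(Generic[TConfiguration]):
    def __init__(self, configuration: TConfiguration) -> None:
        self._configuration = configuration


class ApifyDatasetClient(DatasetClient[ApifyConfiguration]):
    @property
    def token(self) -> str | None:
        # `self._configuration` is statically typed as ApifyConfiguration,
        # so `token` is reachable without getattr or isinstance checks.
        return self._configuration.token
```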
Making stuff generic like this feels like overkill, yeah. Can we just do an `isinstance(configuration, apify.Configuration)` check every time we want to use something that's not in the Crawlee config?
Or we could accept `apify.Configuration` (or pull it from the service locator) in `ApifyStorageClient` and pass it down in the `create_*_client` methods.
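And a sketch of the `isinstance` alternative, again with hypothetical stand-in classes; the point is that type checkers narrow the type inside the branch, so no `getattr` is needed:

```python
from __future__ import annotations


class Configuration:  # stand-in for crawlee.Configuration
    pass


class ApifyConfiguration(Configuration):  # stand-in for apify.Configuration
    token: str | None = None


def resolve_token(configuration: Configuration) -> str | None:
    # Type checkers understand isinstance narrowing, so accessing `token`
    # inside the branch is statically safe, unlike getattr().
    if isinstance(configuration, ApifyConfiguration):
        return configuration.token
    raise TypeError('An Apify Configuration is required here.')
```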
Also, maybe we could remove `open` from the `DatasetClient`, `KeyValueStoreClient`, and `RequestQueueClient` interfaces and leave defining that to the implementer.