feat: Add specialized ApifyRequestQueueClientSimple
#573
base: master
Migrate most actor based tests to normal force cloud rq tests (for future parametrization of the Apify clients)
ApifyRequestQueueClientSimple
high-level things
- Only one client is consuming the request queue at the time.
- Multiple producers can put requests to the queue, but their forefront requests are not guaranteed to be handled
  so quickly as this client does not aggressively fetch the forefront and relies on local head estimation.
- Requests are only added to the queue, never deleted. (Marking as handled is ok.)
Requests are only added to the queue, never deleted. (Marking as handled is ok.)
? I don't get it.
Well, the API has a delete endpoint. We do not expose it in RQ, but if someone calls it while we use this client to work on that RQ, the behavior will be unpredictable.
https://docs.apify.com/api/v2/request-queue-request-delete
It is not a normal use case, but better to be explicit about it
- Multiple producers can put requests to the queue, but their forefront requests are not guaranteed to be handled
  so quickly as this client does not aggressively fetch the forefront and relies on local head estimation.
- Requests are only added to the queue, never deleted. (Marking as handled is ok.)
- Other producers can add new requests, but not modify existing ones (otherwise caching can miss the updates)
Other producers can add new requests, but not modify existing ones (otherwise caching can miss the updates)
Modify existing ones? What do you mean by that?
The API has an update endpoint. If other producers update existing requests that this client has already cached locally, the client will use the outdated request.
https://docs.apify.com/api/v2/request-queue-request-put
It is not a normal use case, but better to be explicit about it
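The two constraints discussed above (no deletes and no updates by other producers) both come down to the client trusting a local cache. A minimal sketch of the update case follows. It uses a hypothetical in-memory `FakeRemoteQueue` and `CachingClient` invented for illustration, not the real Apify API or the client from this PR:

```python
class FakeRemoteQueue:
    """Hypothetical stand-in for the remote request queue (not the Apify API)."""

    def __init__(self) -> None:
        self.requests = {}

    def add(self, request_id: str, data: dict) -> None:
        self.requests[request_id] = dict(data)

    def update(self, request_id: str, data: dict) -> None:
        self.requests[request_id] = dict(data)

    def get(self, request_id: str) -> dict:
        return dict(self.requests[request_id])


class CachingClient:
    """Hypothetical single-consumer client that caches requests locally."""

    def __init__(self, remote: FakeRemoteQueue) -> None:
        self._remote = remote
        self._cache = {}

    def get(self, request_id: str) -> dict:
        # Only the first read hits the remote; later reads are served from the cache.
        if request_id not in self._cache:
            self._cache[request_id] = self._remote.get(request_id)
        return self._cache[request_id]


remote = FakeRemoteQueue()
remote.add('r1', {'url': 'https://example.com', 'method': 'GET'})

client = CachingClient(remote)
first = client.get('r1')

# Another producer updates the request behind the client's back.
remote.update('r1', {'url': 'https://example.com', 'method': 'POST'})

# The client still returns the cached (now outdated) version.
stale = client.get('r1')
print(stale['method'], remote.get('r1')['method'])
```

In this toy model the client keeps seeing the old method while the remote already holds the new one, which is the kind of missed update the documented constraint rules out.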
```python
from apify.storages import RequestQueue
from apify.storage_clients import ApifyStorageClient


async def main():
    # Full client
    rq_full = await RequestQueue.open(storage_client=ApifyStorageClient(simple_request_queue=False))
    # Default optimized client
    rq_simple = await RequestQueue.open(storage_client=ApifyStorageClient())
```
We should use an example with Actor here.
I thought about that, but it is just wrapping it with Actor:
I can add it, but it felt like pointless clutter
# Reset the Actor class state.
apify._actor.Actor.__wrapped__.__class__._is_any_instance_initialized = False  # type: ignore[attr-defined]
apify._actor.Actor.__wrapped__.__class__._is_rebooting = False  # type: ignore[attr-defined]
Why do we need this now?
We don't need it, but I saw a warning log in some tests and realized that we do not isolate the tests well, because _is_any_instance_initialized and _is_rebooting were leaking from previous tests.
This warning could be observed in tests:
WARN Repeated Actor initialization detected - this is non-standard usage, proceed with care
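The leak described above can be reproduced in miniature. The sketch below uses a hypothetical `FakeActor` class with one class-level flag. It is not the real `Actor` implementation, only an illustration of why class-level state needs resetting between tests:

```python
class FakeActor:
    """Hypothetical stand-in for a class that tracks initialization in class-level state."""

    _is_any_instance_initialized = False

    def init(self) -> str:
        cls = type(self)
        message = 'repeated initialization' if cls._is_any_instance_initialized else 'first initialization'
        cls._is_any_instance_initialized = True
        return message


def reset_actor_state() -> None:
    # In a pytest suite this reset would typically live in an autouse fixture.
    FakeActor._is_any_instance_initialized = False


# Without a reset, the second "test" sees state left behind by the first.
leaky = [FakeActor().init(), FakeActor().init()]

# With a reset before each "test", every one starts from a clean state.
reset_actor_state()
isolated = [FakeActor().init()]
reset_actor_state()
isolated.append(FakeActor().init())

print(leaky)
print(isolated)
```

The second entry of `leaky` corresponds to the situation in which the repeated-initialization warning would be logged, while every entry of `isolated` is a first initialization.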
def __init__(self, *, simple_request_queue: bool = True) -> None:
    """Initialize the Apify storage client.

    Args:
        simple_request_queue: If True, the `create_rq_client` will always return `ApifyRequestQueueClientSimple`,
            if false it will return `ApifyRequestQueueClientFull`. Simple client is suitable for single consumer
            scenarios and makes less API calls. Full client is suitable for multiple consumers scenarios at the
            cost of higher API usage
    """
As we discussed on Slack, I'm more for:
def __init__(self, *, access: Literal['single', 'shared'] = 'single') -> None:
also taking into account potential RQ v3...
Done
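A rough sketch of how a literal `access` parameter could select the request queue client. `StorageClientSketch`, `SingleClient` and `SharedClient` are hypothetical placeholders, and the selection logic is an assumption for illustration, not the code merged in this PR:

```python
from typing import Literal, Union


class SingleClient:
    """Hypothetical placeholder for the single-consumer client."""


class SharedClient:
    """Hypothetical placeholder for the multi-consumer client."""


class StorageClientSketch:
    """Illustrative only; not the actual ApifyStorageClient implementation."""

    def __init__(self, *, access: Literal['single', 'shared'] = 'single') -> None:
        # Validate at runtime as well, since type hints are not enforced.
        if access not in ('single', 'shared'):
            raise ValueError(f'Unsupported access mode: {access!r}')
        self._access = access

    def create_rq_client(self) -> Union[SingleClient, SharedClient]:
        # Map the access mode to the matching client class.
        if self._access == 'single':
            return SingleClient()
        return SharedClient()


default_client = StorageClientSketch().create_rq_client()
shared_client = StorageClientSketch(access='shared').create_rq_client()
print(type(default_client).__name__, type(shared_client).__name__)
```

Compared with a boolean flag, a string literal leaves room for further modes later (for example, a potential RQ v3) without changing the signature.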
from ._request_queue_client_full import ApifyRequestQueueClientFull
from ._request_queue_client_simple import ApifyRequestQueueClientSimple
And the naming of these internal classes...
If we go with the access literal parameter, maybe we can use:
ApifyRequestQueueSingleClient
ApifyRequestQueueSharedClient
Done
tests/integration/conftest.py
Outdated
@pytest.fixture
async def request_queue_force_cloud(apify_token: str, monkeypatch: pytest.MonkeyPatch) -> AsyncGenerator[RequestQueue]:
@pytest.fixture(params=[False, True])
async def default_request_queue_apify(
Why not just request_queue_apify?
Ok
Wait, you are removing the Actor tests (!?).
With the current framework, the integration tests are full end-to-end tests running on the platform as Actors.
What you're proposing changes that (!).
I think there are several levels of integration tests:
1. Level: API
   - These are, for example, RequestQueue-specific tests that do not mock the API and make real API calls.
2. Level: Platform
   - Actor tests that cover the integration of the Apify platform and also the API.
3. Level: Crawlee on Platform
   - A crawler running in an Actor. Tests the integration of Crawlee with the Apify platform and API.
What I have done with most RQ-specific tests is move them from levels 2 and 3 to level 1. I think we should have more tests on lower levels and fewer tests on higher levels.
Description
Create two variants of the ApifyRequestQueueClient:
- ApifyRequestQueueClientFull - current version that supports multiple producers/consumers and locking of requests. More Apify API calls, higher API usage -> more expensive, slower.
- ApifyRequestQueueClientSimple - new constrained client for a single consumer and multiple constrained producers (detailed constraints in the docs). Fewer Apify API calls, lower API usage -> cheaper, faster.
Most of the ApifyRequestQueueClient tests were moved away from actor-based tests, so that they can be parametrized for both versions of the ApifyRequestQueueClient and to make local debugging easier.
Usage:
RequestQueue with full client:
await RequestQueue.open(storage_client=ApifyStorageClient(simple_request_queue=False))
RequestQueue with simple(default) client:
await RequestQueue.open(storage_client=ApifyStorageClient())
Stats difference:
The full client makes significantly more API calls. In terms of API usage, it does 50% more RequestQueue writes and also more RequestQueue reads.
Example rq related stats for crawler started with 1000 requests:
ApifyRequestQueueClientFull:
API calls: 2123
API usage: {'readCount': 1000, 'writeCount': 3000, 'deleteCount': 0, 'headItemReadCount': 0, 'storageBytes': 104035}
ApifyRequestQueueClientSimple:
API calls: 1059
API usage: {'readCount': 3, 'writeCount': 2000, 'deleteCount': 0, 'headItemReadCount': 14, 'storageBytes': 103826}
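As a quick check of the figures above, the relative differences can be computed directly from the reported numbers. The dictionary layout and variable names below are only for illustration; the values are copied from the stats in this description:

```python
# Values copied from the stats reported above (crawler started with 1000 requests).
full = {'api_calls': 2123, 'readCount': 1000, 'writeCount': 3000}
simple = {'api_calls': 1059, 'readCount': 3, 'writeCount': 2000}

# Relative increase in writes when using the full client instead of the simple one.
write_increase = (full['writeCount'] - simple['writeCount']) / simple['writeCount']

# How many times more API calls the full client makes.
call_ratio = full['api_calls'] / simple['api_calls']

print(f'Extra writes with the full client: {write_increase:.0%}')
print(f'API call ratio (full / simple): {call_ratio:.2f}')
```

For this single run, the full client makes roughly twice as many API calls and 50% more writes than the simple client.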
Issues