@Pijukatel Pijukatel commented Aug 27, 2025

Description

  • Create two variants of the ApifyRequestQueueClient:

    • ApifyRequestQueueClientFull - current version that supports multiple producers/consumers and locking of requests. More Apify API calls, higher API usage -> more expensive, slower.
    • ApifyRequestQueueClientSimple - new constrained client for a single consumer and multiple constrained producers (detailed constraints are in the docs). Fewer Apify API calls, lower API usage -> cheaper, faster.
  • Most of the ApifyRequestQueueClient tests were moved away from actor-based tests so that they can be parametrized for both variants of the ApifyRequestQueueClient and to make local debugging easier.

Usage:

RequestQueue with full client:
await RequestQueue.open(storage_client=ApifyStorageClient(simple_request_queue=False))
RequestQueue with simple (default) client:
await RequestQueue.open(storage_client=ApifyStorageClient())

Stats difference:

The full client makes significantly more API calls. In terms of API usage, it performs 50% more RequestQueue writes and also more RequestQueue reads.

Example RQ-related stats for a crawler started with 1000 requests:
ApifyRequestQueueClientFull:
API calls: 2123
API usage: {'readCount': 1000, 'writeCount': 3000, 'deleteCount': 0, 'headItemReadCount': 0, 'storageBytes': 104035}

ApifyRequestQueueClientSimple:
API calls: 1059
API usage: {'readCount': 3, 'writeCount': 2000, 'deleteCount': 0, 'headItemReadCount': 14, 'storageBytes': 103826}
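
The relative differences can be checked with a few lines of arithmetic. This is only a sketch: the figures are copied from the stats above, while the dictionary layout and the helper name `percent_more` are illustrative and not part of the PR.

```python
# Figures copied from the stats above (crawler started with 1000 requests).
full = {"api_calls": 2123, "readCount": 1000, "writeCount": 3000}
simple = {"api_calls": 1059, "readCount": 3, "writeCount": 2000}


def percent_more(a: int, b: int) -> float:
    """Return how many percent larger `a` is than `b` (hypothetical helper)."""
    return (a - b) / b * 100


# Relative write overhead of the full client compared to the simple one.
write_overhead = percent_more(full["writeCount"], simple["writeCount"])
# Ratio of total API calls between the two clients.
call_ratio = full["api_calls"] / simple["api_calls"]

print(f"Full client writes: {write_overhead:.0f}% more")
print(f"Full client API calls: {call_ratio:.2f}x")
```

With these numbers the write overhead comes out at 50%, and the full client makes roughly twice as many API calls.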

Issues

@github-actions github-actions bot added this to the 122nd sprint - Tooling team milestone Aug 27, 2025
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Aug 27, 2025
@Pijukatel Pijukatel changed the title from "No locking queue" to "feat: No locking queue" Aug 27, 2025
Migrate most actor based tests to normal force cloud rq tests (for future parametrization of the Apify clients)
@Pijukatel Pijukatel changed the title from "feat: No locking queue" to "feat: Add specialized ApifyRequestQueueClientSimple" Aug 28, 2025
@Pijukatel Pijukatel requested a review from vdusek August 28, 2025 13:09
@github-actions github-actions bot added the tested Temporary label used only programatically for some analytics. label Aug 28, 2025
@Pijukatel Pijukatel requested a review from janbuchar August 28, 2025 13:15
@Pijukatel Pijukatel marked this pull request as ready for review September 19, 2025 08:57
@vdusek vdusek left a comment

high-level things

- Only one client is consuming the request queue at a time.
- Multiple producers can put requests to the queue, but their forefront requests are not guaranteed to be handled
as quickly, because this client does not aggressively fetch the forefront and relies on local head estimation.
- Requests are only added to the queue, never deleted. (Marking as handled is ok.)
Contributor

Requests are only added to the queue, never deleted. (Marking as handled is ok.)

? I don't get it.

Contributor Author

@Pijukatel Pijukatel Sep 19, 2025

Well, the API has a delete endpoint. We do not expose it in RQ, but if someone calls it while we use this client to work on that RQ, the behavior will be unpredictable.
https://docs.apify.com/api/v2/request-queue-request-delete

It is not a normal use case, but better to be explicit about it

- Multiple producers can put requests to the queue, but their forefront requests are not guaranteed to be handled
as quickly, because this client does not aggressively fetch the forefront and relies on local head estimation.
- Requests are only added to the queue, never deleted. (Marking as handled is ok.)
- Other producers can add new requests, but not modify existing ones (otherwise caching can miss the updates)
Contributor

Other producers can add new requests, but not modify existing ones (otherwise caching can miss the updates)

Modify existing ones? What do you mean by that?

Contributor Author

The API has an update endpoint. If other producers update existing requests that this client has already cached locally, the client will use the outdated request.
https://docs.apify.com/api/v2/request-queue-request-put

It is not a normal use case, but better to be explicit about it
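
The stale-cache scenario described in this reply can be illustrated with a toy model. This is a minimal sketch, not the Apify client: `RemoteQueue` and `CachingConsumer` are hypothetical stand-ins, and the assumed caching strategy (cache on first read, never refresh) is a simplification.

```python
class RemoteQueue:
    """Hypothetical stand-in for the platform-side request queue."""

    def __init__(self) -> None:
        self._requests: dict[str, dict] = {}

    def add(self, request_id: str, url: str) -> None:
        self._requests[request_id] = {"id": request_id, "url": url}

    def update(self, request_id: str, url: str) -> None:
        # Models another producer calling the update endpoint.
        self._requests[request_id]["url"] = url

    def get(self, request_id: str) -> dict:
        # Return a copy so callers cannot mutate server-side state.
        return dict(self._requests[request_id])


class CachingConsumer:
    """Hypothetical consumer that caches a request after the first read."""

    def __init__(self, remote: RemoteQueue) -> None:
        self._remote = remote
        self._cache: dict[str, dict] = {}

    def get(self, request_id: str) -> dict:
        if request_id not in self._cache:
            self._cache[request_id] = self._remote.get(request_id)
        return self._cache[request_id]


remote = RemoteQueue()
remote.add("r1", "https://example.com/old")
consumer = CachingConsumer(remote)

first = consumer.get("r1")  # cached locally from here on
remote.update("r1", "https://example.com/new")  # update by another producer
second = consumer.get("r1")  # served from the local cache, so it is stale
```

In this model `second` still carries the old URL even though the remote copy has changed, which is the kind of missed update the constraint is meant to rule out.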

Comment on lines +63 to +72
```python
from apify.storages import RequestQueue
from apify.storage_clients import ApifyStorageClient


async def main() -> None:
    # Full client
    rq_full = await RequestQueue.open(storage_client=ApifyStorageClient(simple_request_queue=False))
    # Default optimized client
    rq_simple = await RequestQueue.open(storage_client=ApifyStorageClient())
```
Contributor

We should use an example with Actor here.

Contributor Author

I thought about that, but it is just wrapping it with Actor:
I can add it, but it felt like pointless clutter

Comment on lines +54 to +56
# Reset the Actor class state.
apify._actor.Actor.__wrapped__.__class__._is_any_instance_initialized = False # type: ignore[attr-defined]
apify._actor.Actor.__wrapped__.__class__._is_rebooting = False # type: ignore[attr-defined]
Contributor

Why do we need this now?

Contributor Author

We don't strictly need it, but I saw a warning log in some tests and realized that we do not isolate the tests well enough, because _is_any_instance_initialized and _is_rebooting were leaking from previous tests.

This warning could be observed in tests
WARN Repeated Actor initialization detected - this is non-standard usage, proceed with care
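
The leak mechanism can be reproduced with a toy class. This is a sketch under stated assumptions: `FakeActor` and `reset_class_state` are hypothetical and only mimic a class-level flag; they are not the SDK's internals.

```python
import logging

logger = logging.getLogger("sketch")


class FakeActor:
    """Hypothetical stand-in with class-level state shared by all instances."""

    _is_any_instance_initialized = False

    def init(self) -> bool:
        """Initialize; return True if a repeated initialization was detected."""
        cls = type(self)
        repeated = cls._is_any_instance_initialized
        if repeated:
            logger.warning("Repeated Actor initialization detected")
        cls._is_any_instance_initialized = True
        return repeated


def reset_class_state() -> None:
    """What a per-test setup step would do: clear the leaked flag."""
    FakeActor._is_any_instance_initialized = False


# Without a reset, the second "test" sees the flag set by the first one.
leaked_before_reset = (FakeActor().init(), FakeActor().init())

# With a reset between tests, the next "test" starts from a clean state.
reset_class_state()
clean_after_reset = FakeActor().init()
```

Because the flag lives on the class rather than on the instance, creating a fresh instance per test does not help; the state has to be cleared explicitly.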

Comment on lines 29 to 37
def __init__(self, *, simple_request_queue: bool = True) -> None:
    """Initialize the Apify storage client.

    Args:
        simple_request_queue: If True, `create_rq_client` will always return `ApifyRequestQueueClientSimple`;
            if False, it will return `ApifyRequestQueueClientFull`. The simple client is suitable for
            single-consumer scenarios and makes fewer API calls. The full client is suitable for
            multiple-consumer scenarios at the cost of higher API usage.
    """
Contributor

As we discussed on Slack, I'm more for:

def __init__(self, *, access: Literal['single', 'shared'] = 'single') -> None:

also taking into account potential RQ v3...

Contributor Author

Done

Comment on lines 11 to 12
from ._request_queue_client_full import ApifyRequestQueueClientFull
from ._request_queue_client_simple import ApifyRequestQueueClientSimple
Contributor

And the naming of these internal classes...

If we go with the access literal parameter, maybe we can use:

  • ApifyRequestQueueSingleClient
  • ApifyRequestQueueSharedClient

Contributor Author

Done
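
One way the two suggestions in this review (an access literal and the Single/Shared class names) could fit together is sketched below. Everything here is an assumption for illustration: `StorageClientSketch` and `request_queue_client_class` are hypothetical, the two client classes are empty placeholders, and the parameter name in the merged code may differ from `access`.

```python
from typing import Literal


class ApifyRequestQueueSingleClient:
    """Placeholder for the single-consumer client (fewer API calls)."""


class ApifyRequestQueueSharedClient:
    """Placeholder for the multi-consumer client with request locking."""


class StorageClientSketch:
    """Hypothetical storage client that picks the RQ client from `access`."""

    def __init__(self, *, access: Literal["single", "shared"] = "single") -> None:
        # Literal is only checked statically, so validate at runtime as well.
        if access not in ("single", "shared"):
            raise ValueError(f"Unknown access mode: {access!r}")
        self._access = access

    def request_queue_client_class(self) -> type:
        # Map the access mode to the client class.
        if self._access == "shared":
            return ApifyRequestQueueSharedClient
        return ApifyRequestQueueSingleClient


default_cls = StorageClientSketch().request_queue_client_class()
shared_cls = StorageClientSketch(access="shared").request_queue_client_class()
```

Compared to a boolean flag, a string literal leaves room for further access modes later without changing the signature.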

@pytest.fixture
async def request_queue_force_cloud(apify_token: str, monkeypatch: pytest.MonkeyPatch) -> AsyncGenerator[RequestQueue]:
@pytest.fixture(params=[False, True])
async def default_request_queue_apify(
Contributor

why not just request_queue_apify?

Contributor Author

Ok

Contributor

Wait, you are removing the Actor tests (!?).

With the current framework, the integration tests are full end-to-end tests running on the platform as Actors.

What you're proposing changes that (!).

Contributor Author

@Pijukatel Pijukatel Sep 19, 2025

I think there are several levels of integration tests.

  1. level - API
  • These are, for example, RequestQueue-specific tests that do not mock the API and make real API calls.
  2. level - Platform
  • Actor tests that cover the integration of the Apify platform and the API.
  3. level - Crawlee on Platform
  • Crawler running in an Actor. Tests the integration of Crawlee with the Apify platform and the API.

What I have done with most RQ-specific tests is move them from levels 2 and 3 to level 1. I think we should have more tests on lower levels and fewer tests on higher levels.

@Pijukatel Pijukatel requested a review from vdusek September 19, 2025 14:58