Conversation

vdusek
Contributor

@vdusek vdusek commented Jan 29, 2025

@vdusek vdusek added this to the 107th sprint - Tooling team milestone Jan 29, 2025
@vdusek vdusek requested a review from janbuchar January 29, 2025 08:54
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Jan 29, 2025
@@ -78,6 +78,11 @@ async def process_request(self, request: Request, spider: Spider) -> None:
Raises:
ValueError: If username and password are not provided in the proxy URL.
"""
# Do not use proxy for robots.txt, as it causes 403 Forbidden.
Contributor

Like... universally, everywhere? I don't mind it, it just seems weird.

Contributor Author

Maybe it is a problem with Apify proxies, I don't know, but it results in the following:

[scrapy.downloadermiddlewares.robotstxt] ERROR Error downloading <GET https://console.apify.com/robots.txt>: Could not open CONNECT tunnel with proxy proxy.apify.com:8000 [{'status': 403, 'reason': b'Forbidden'}] ({"spider": "<TitleSpider 'title_spider' at 0x7f2bc3aee660>"})
      Traceback (most recent call last):
        File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/twisted/internet/defer.py", line 2013, in _inlineCallbacks
          result = context.run(
              cast(Failure, result).throwExceptionIntoGenerator, gen
          )
        File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/twisted/python/failure.py", line 467, in throwExceptionIntoGenerator
          return g.throw(self.value.with_traceback(self.tb))
                 ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/vdusek/Projects/apify-sdk-python/.venv/lib/python3.13/site-packages/scrapy/core/downloader/middleware.py", line 68, in process_request
          return (yield download_func(request, spider))
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      scrapy.core.downloader.handlers.http11.TunnelError: Could not open CONNECT tunnel with proxy proxy.apify.com:8000 [{'status': 403, 'reason': b'Forbidden'}]

Contributor

Humph. But the connect call should happen way before the path part of the URL matters, right?
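
To illustrate the point about CONNECT: for an HTTPS URL, the tunnel request sent to the proxy names only the target host and port, so the path should not be visible to the proxy at that stage. A minimal sketch using only the standard library (the helper name `connect_request_line` is hypothetical and is not part of Scrapy or the Apify SDK):

```python
from urllib.parse import urlsplit


def connect_request_line(url: str) -> str:
    """Build the request line a client would send to open a CONNECT tunnel.

    Hypothetical helper for illustration only. The CONNECT target is
    "host:port"; the path of the final request is not part of it.
    """
    parts = urlsplit(url)
    # Fall back to the scheme's default port when the URL does not specify one.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return f"CONNECT {parts.hostname}:{port} HTTP/1.1"


robots_line = connect_request_line("https://console.apify.com/robots.txt")
other_line = connect_request_line("https://console.apify.com/some/other/page")
```

Both calls produce the same request line, which is why a 403 that appears only for robots.txt is surprising.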

Contributor Author

Yep, that's strange. I'm not sure why we can't connect when it comes to robots.txt, while other URLs work. I've reverted the changes and kept only the storage client fix.
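
For context, a check that skips the proxy for robots.txt requests could look roughly like the following. This is a sketch under assumptions, not the reverted code from this PR: the helper name `is_robots_txt_url` is hypothetical, and it assumes robots.txt is always served from the root path.

```python
from urllib.parse import urlsplit


def is_robots_txt_url(url: str) -> bool:
    """Return True if the URL points at a site's root robots.txt.

    Hypothetical helper; assumes the file lives at exactly "/robots.txt".
    """
    return urlsplit(url).path == "/robots.txt"


skip_proxy = is_robots_txt_url("https://console.apify.com/robots.txt")
use_proxy = not is_robots_txt_url("https://console.apify.com/actors")
```

In a Scrapy downloader middleware, `process_request` could return `None` early when such a check is true, leaving the request without proxy settings.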

# Use the ApifyStorageClient if the Actor is running on the Apify platform,
# otherwise use the MemoryStorageClient.
storage_client = (
ApifyStorageClient.from_config(config) if config.is_at_home else MemoryStorageClient.from_config(config)
Contributor

This is supposed to happen in Actor.init, right? Why duplicate it here?

Contributor Author

@vdusek vdusek Jan 29, 2025

Because of the nested event loop. Otherwise, it results in:

RuntimeError: <asyncio.locks.Event object at 0x7c2d640c8fc0 [unset]> is bound to a different event loop

when using the Apify client.

@vdusek vdusek merged commit 3363478 into master Jan 29, 2025
27 checks passed
@vdusek vdusek deleted the fixing-scrapy branch January 29, 2025 14:37
@honzajavorek
Contributor

Thanks!
