safe_url_string: escape additional characters #203

Merged 3 commits on Nov 24, 2022

Conversation

Gallaecio
Member

@Gallaecio Gallaecio commented Nov 23, 2022

Changes:

  • Make safe_url_string percent-encode any character that is not considered safe by every URL standard we know to be in use by modern servers:

    • RFC 2396 + RFC 2732, as interpreted by Java 8’s java.net.URI class
    • RFC 3986
    • The URL living standard

    As a result, :;= are now percent-encoded in userinfo, |[] in paths, queries and fragments, and, following the URL living standard, ' is also percent-encoded in the query depending on the URL scheme.

    The only exception is %, which we should probably encode as %25 when not followed by 2 hexadecimal digits, but doing so would require major changes to the current safe_url_string implementation that are out of the scope of this change.

  • Add extra tests for safe_url_string from #201 (Make safe_url_string safer), which highlight pending issues in the current implementation, to be addressed separately.

Fixes #80.
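
To make the changes listed above more concrete, here is a minimal standard-library sketch, not the code in this PR. The safe set `USERINFO_SAFE` and the helper names `encode_userinfo` and `escape_stray_percent` are hypothetical, chosen only for illustration. The second helper sketches the `%` handling that is described above as out of scope for this change.

```python
import re
from urllib.parse import quote

# Hypothetical safe set for illustration only; w3lib's real tables may differ.
# ":", ";" and "=" are deliberately left out, so quote() percent-encodes them.
# "%" is kept so that existing escapes are not double-encoded.
USERINFO_SAFE = "!$&'()*+,%"


def encode_userinfo(username: str, password: str) -> str:
    """Percent-encode each userinfo part, then join them with a literal ':'."""
    user = quote(username, safe=USERINFO_SAFE)
    secret = quote(password, safe=USERINFO_SAFE)
    return f"{user}:{secret}"


def escape_stray_percent(text: str) -> str:
    """Turn '%' into '%25' unless it already starts a valid %XX escape."""
    return re.sub(r"%(?![0-9A-Fa-f]{2})", "%25", text)


print(encode_userinfo("us:er", "p;w=d"))  # us%3Aer:p%3Bw%3Dd
print(escape_stray_percent("100%/%7C"))   # 100%25/%7C
```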

@Gallaecio Gallaecio changed the title from "Make safe_url_string safer" to "safe_url_string: escape additional characters" Nov 23, 2022

codecov bot commented Nov 23, 2022

Codecov Report

Merging #203 (0994e08) into master (e2c7b62) will increase coverage by 0.16%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #203      +/-   ##
==========================================
+ Coverage   95.98%   96.14%   +0.16%     
==========================================
  Files           6        8       +2     
  Lines         473      493      +20     
  Branches       90       92       +2     
==========================================
+ Hits          454      474      +20     
  Misses          9        9              
  Partials       10       10              

Impacted Files     Coverage Δ
w3lib/_infra.py    100.00% <100.00%> (ø)
w3lib/_url.py      100.00% <100.00%> (ø)
w3lib/url.py       98.68% <100.00%> (+0.05%) ⬆️

@Gallaecio Gallaecio mentioned this pull request Nov 23, 2022
10 tasks
@kmike
Member

kmike commented Nov 23, 2022

Hey @Gallaecio!

Make safe_url_string percent-encode any character that is not considered safe on any of the URL standards we know to be in use by modern servers

Could you please confirm that my understanding is correct?

  1. When a user clicks a link on a web page, the URL is serialized, i.e. converted to an ASCII string according to the rules described in the URL living standard (percent-escaped, etc.), and then sent to the remote web server.
  2. The goal of safe_url_string is the same: convert a URL from some arbitrary representation to a safe representation that all servers would understand.
  3. Eventually we aim to use the Living Standard algorithm in safe_url_string, to ensure that the behavior is 100% compatible with the web browser behavior.
  4. Changes in this PR bring safe_url_string implementation closer to https://url.spec.whatwg.org/, even if it doesn't implement the complete algorithm.

@Gallaecio
Member Author

  1. When a user clicks a link on a web page, the URL is serialized, i.e. converted to an ASCII string according to the rules described in the URL living standard (percent-escaped, etc.), and then sent to the remote web server.

Correct. Same when the URL is written by the user directly into the browser address bar. And even when the address bar shows the URL with e.g. non-ASCII characters, in the browser developer tools, on the network tab, you can see the actual URL sent, which includes additional percent-encoding.
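
For instance, the extra percent-encoding of a non-ASCII path can be reproduced with the standard library. This is only a sketch of the path component (host names are handled differently), and `serialize_path` is a hypothetical helper name:

```python
from urllib.parse import quote


def serialize_path(path: str) -> str:
    """Percent-encode the UTF-8 bytes of non-ASCII characters in a path."""
    # "/" stays a separator and "%" keeps existing escapes intact.
    return quote(path, safe="/%")


print(serialize_path("/café"))  # /caf%C3%A9
```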

  2. The goal of safe_url_string is the same: convert a URL from some arbitrary representation to a safe representation that all servers would understand.

The goal of safe_url_string is indeed the one you describe.

But I would argue that it is not exactly the goal of the URL living standard. The URL living standard does not seem to try to support servers that follow older standards, and instead seems to expect servers to adapt.

Servers that require some characters to be escaped, even though the standard does not require escaping them, must, for example, make sure that all URLs with those characters on their website are escaped, so that users follow the escaped URLs. So, if your server does not support unescaped pipes in URL paths, you can still have URLs with pipes, but they must appear encoded so that, when users follow them, browsers send the pipe percent-encoded. And because the URL standard expects encoded characters to remain encoded, even if they are safe characters according to the standard, things should work.

For example, if you enter http://example.com/%7C on the Firefox address bar, it will send a request for http://example.com/%7C, but the address bar will show http://example.com/|. If you refresh the page, http://example.com/%7C will be used again. If you copy the URL, http://example.com/%7C will be copied. But if you click on the address bar and press Enter, http://example.com/| will be sent (this seems like a bug). And if you paste http://example.com/|, http://example.com/| will be sent to the server, in line with the standard not requiring the escaping of |.

Chrome seems to play it safer and always encodes |, just as safe_url_string does after this change. That is not what the URL living standard prescribes, although it is compatible with the standard (more on this below). Note that Firefox is not following the standard either: the standard URL rendering rules say nothing about percent-decoding safe characters, yet Firefox does it.

So safe_url_string diverges from the URL living standard by building URLs that are safe by additional standards, apparently in line with Chrome.
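
One way to picture this: the set of characters left unescaped is the intersection of what each standard considers safe. The sketch below does that for paths only. Both sets are written by hand for illustration, the living-standard one is an approximation rather than a copy of the spec, and `encode_path` is a hypothetical helper, not w3lib code.

```python
import string

# Printable ASCII without the space character.
PRINTABLE = {chr(code) for code in range(0x21, 0x7F)}

# RFC 3986 path characters: unreserved, sub-delims, ":", "@", "/", plus "%"
# so that existing escapes survive.
RFC3986_PATH_SAFE = set(string.ascii_letters + string.digits + "-._~!$&'()*+,;=:@/%")

# Approximation of what the URL living standard leaves unescaped in paths.
LIVING_PATH_SAFE = PRINTABLE - set('"#<>?`{}')

# A character stays unescaped only if every standard considers it safe.
SAFE_EVERYWHERE = RFC3986_PATH_SAFE & LIVING_PATH_SAFE


def encode_path(path: str) -> str:
    """Percent-encode every character that is not safe by all standards."""
    return "".join(
        ch if ch in SAFE_EVERYWHERE
        else "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        for ch in path
    )


print("|" in LIVING_PATH_SAFE)   # True
print("|" in SAFE_EVERYWHERE)    # False
print(encode_path("/a|b[1]"))    # /a%7Cb%5B1%5D
```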

URLs where we percent-encode additional characters (to make URLs safe by the definition of additional, older standards) are still compatible with the URL living standard, since its URL parsing does not percent-decode what is already percent-encoded, even if it is a safe character. Our current implementation does percent-decode safe characters, but changing that is not trivial and should be handled separately, e.g. through #203.
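
A small sketch of the difference between keeping and decoding existing escapes, using only the standard library. `LENIENT_SAFE` is a hypothetical safe set that treats | as safe, and neither helper is taken from w3lib:

```python
from urllib.parse import quote, unquote

# Hypothetical lenient safe set: "|" is considered safe, "%" keeps escapes.
LENIENT_SAFE = "/|%"


def reencode_preserving(path: str) -> str:
    """Encode unsafe characters but leave existing %XX escapes untouched."""
    return quote(path, safe=LENIENT_SAFE)


def reencode_decoding(path: str) -> str:
    """Decode first, then encode: escapes of safe characters are lost."""
    return quote(unquote(path), safe=LENIENT_SAFE)


print(reencode_preserving("/%7C"))  # /%7C
print(reencode_decoding("/%7C"))    # /|
```

With the decode-first approach, a server that needs the pipe escaped would receive it unescaped, which is the situation the paragraph above warns about.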

  3. Eventually we aim to use the Living Standard algorithm in safe_url_string, to ensure that the behavior is 100% compatible with the web browser behavior.

I would say we aim for behavior that is 100% compatible with the URL living standard, but with safe_url_string our aim is compatibility with both web browser behavior and known server limitations. We aim to parse URLs based on the URL living standard, but to serialize them with extra rules that make them safe for servers that do not follow the URL living standard.
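
The parse-then-serialize split could look roughly like this. It is a sketch under loud assumptions: `urlsplit` is not a URL living standard parser, the three safe sets are hypothetical and deliberately strict, and `serialize_strict` is an illustrative name, not part of w3lib.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Hypothetical strict safe sets: "|", "[" and "]" are absent on purpose,
# and "%" is present so existing escapes are kept as they are.
PATH_SAFE = "/!$&'()*+,;=:@%"
QUERY_SAFE = "/?!$&()*+,;=:@%"
FRAGMENT_SAFE = QUERY_SAFE


def serialize_strict(url: str) -> str:
    """Split a URL, encode each component with its own safe set, rejoin."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,  # host and userinfo are left untouched in this sketch
        quote(parts.path, safe=PATH_SAFE),
        quote(parts.query, safe=QUERY_SAFE),
        quote(parts.fragment, safe=FRAGMENT_SAFE),
    ))


print(serialize_strict("http://example.com/a|b?x=[1]#f|g"))
# http://example.com/a%7Cb?x=%5B1%5D#f%7Cg
```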

As servers move on and stop relying on older standards, we can move on and remove support for those, getting closer to the URL living standard serialization rules. However:

  • It may be hard to decide when it is time to drop support for an old standard. But maybe we can write a broad crawl spider to detect incompatibilities, run it against top domains, and decide based on that.
  • It is also possible, if the URL living standard stops percent-encoding certain characters in the future, that we need to make safe_url_string keep percent-encoding those until we are confident enough that no servers out there require those characters to be percent-encoded.

  4. Changes in this PR bring safe_url_string implementation closer to https://url.spec.whatwg.org/, even if it doesn't implement the complete algorithm.

Yes and no.

The characters that we escape in userinfo bring us closer to the URL living standard. The escaping of single quotes depending on the URL scheme does the same.

However, the extra characters that we escape in path, query and fragment do not require escaping in the URL living standard; we escape them to align with restrictions from older standards that we know are still used by servers. Incidentally, this also brings us closer to Chrome, which seems to have a similar aim to safe_url_string.
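
As an illustration of the scheme-dependent single quote, here is a minimal sketch. The list of special schemes is the one in the URL living standard; `QUERY_SAFE` and `encode_query` are hypothetical and only meant to show the idea, not how this PR implements it.

```python
from urllib.parse import quote

# Special schemes as listed by the URL living standard.
SPECIAL_SCHEMES = {"ftp", "file", "http", "https", "ws", "wss"}

# Hypothetical base safe set for query strings, for illustration only.
QUERY_SAFE = "/?!$&()*+,;=:@%"


def encode_query(query: str, scheme: str) -> str:
    """Encode "'" in the query only when the scheme is a special one."""
    safe = QUERY_SAFE if scheme in SPECIAL_SCHEMES else QUERY_SAFE + "'"
    return quote(query, safe=safe)


print(encode_query("q='a'", "https"))  # q=%27a%27
print(encode_query("q='a'", "foo"))    # q='a'
```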

@kmike kmike merged commit 17191b8 into scrapy:master Nov 24, 2022
@kmike
Member

kmike commented Nov 24, 2022

Thanks @Gallaecio, I learned a lot about URLs today!

Successfully merging this pull request may close these issues.

Pipe symbol ("|") is not percent encoded