safe_url_string: escape additional characters #203

Gallaecio · 2022-11-23T12:18:15Z

Changes:

Make safe_url_string percent-encode any character that is not considered safe on any of the URL standards we know to be in use by modern servers:
- RFC 2396 + RFC 2732, as interpreted by Java 8’s java.net.URI class
- RFC 3986
- The URL living standard
As a result, the characters :;= are now percent-encoded in userinfo, and |[] in paths, queries and fragments. Following the URL living standard, ' is also percent-encoded in the query, depending on the URL scheme.
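
To make the effect concrete, here is a minimal sketch of per-component escaping built on the standard library's urllib.parse.quote. The safe-character sets and helper names are assumptions chosen for illustration; they are not the tables or functions that w3lib actually uses.

```python
from urllib.parse import quote

# Hypothetical safe sets, for illustration only (not w3lib's tables).
# quote() always leaves ASCII letters, digits and "_.-~" untouched.
USERINFO_SAFE = "!$&'()*+,"       # note: no ":", ";" or "="
PATH_SAFE = "/!$&'()*+,;=:@"      # note: no "|", "[" or "]"


def encode_userinfo_part(value: str) -> str:
    """Percent-encode a user name or password (hypothetical helper)."""
    return quote(value, safe=USERINFO_SAFE)


def encode_path(path: str) -> str:
    """Percent-encode a URL path (hypothetical helper)."""
    return quote(path, safe=PATH_SAFE)


print(encode_userinfo_part("us:er;na=me"))  # us%3Aer%3Bna%3Dme
print(encode_path("/a|b[c]"))               # /a%7Cb%5Bc%5D
```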

The only exception is %, which we should probably encode as %25 when not followed by 2 hexadecimal digits, but doing so would require major changes to the current safe_url_string implementation that are out of the scope of this change.
Add extra tests for safe_url_string from #201 (Make safe_url_string safer), which highlight pending issues in the current implementation, to be addressed separately.
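
The % handling described above as out of scope could look roughly like the following sketch. It is not part of this change; the function name and the regular expression are assumptions made for illustration.

```python
import re

# A "%" that is NOT followed by two hexadecimal digits.
_BARE_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def escape_bare_percent(text: str) -> str:
    """Encode stray "%" as "%25", leaving valid escapes alone (hypothetical helper)."""
    return _BARE_PERCENT.sub("%25", text)


print(escape_bare_percent("/100%"))   # /100%25
print(escape_bare_percent("/a%7Cb"))  # /a%7Cb (already a valid escape)
```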

Fixes #80.

codecov · 2022-11-23T12:23:34Z

Codecov Report

Merging #203 (0994e08) into master (e2c7b62) will increase coverage by 0.16%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #203      +/-   ##
==========================================
+ Coverage   95.98%   96.14%   +0.16%     
==========================================
  Files           6        8       +2     
  Lines         473      493      +20     
  Branches       90       92       +2     
==========================================
+ Hits          454      474      +20     
  Misses          9        9              
  Partials       10       10

Impacted Files	Coverage Δ
w3lib/_infra.py	`100.00% <100.00%> (ø)`
w3lib/_url.py	`100.00% <100.00%> (ø)`
w3lib/url.py	`98.68% <100.00%> (+0.05%)`	⬆️

kmike · 2022-11-23T16:36:30Z

Hey @Gallaecio!

Make safe_url_string percent-encode any character that is not considered safe on any of the URL standards we know to be in use by modern servers

Could you please confirm that my understanding is correct?

When user clicks a link on a web page, the URL is serialized, i.e. converted to ASCII string according to the rules described in the URL living standard (percent-escaped, etc.), and then sent to the remote web server.
The goal of safe_url_string is the same: convert URL from some arbitrary representation to the safe representation which all servers would understand.
Eventually we aim to use the Living Standard algorithm in safe_url_string, to ensure that the behavior is 100% compatible with the web browser behavior.
Changes in this PR bring safe_url_string implementation closer to https://url.spec.whatwg.org/, even if it doesn't implement the complete algorithm.

Gallaecio · 2022-11-24T10:12:43Z

When user clicks a link on a web page, the URL is serialized, i.e. converted to ASCII string according to the rules described in the URL living standard (percent-escaped, etc.), and then sent to the remote web server.

Correct. Same when the URL is written by the user directly into the browser address bar. And even when the address bar shows the URL with e.g. non-ASCII characters, in the browser developer tools, on the network tab, you can see the actual URL sent, which includes additional percent-encoding.

The goal of safe_url_string is the same: convert URL from some arbitrary representation to the safe representation which all servers would understand.

The goal of safe_url_string is indeed the one you describe.

But I would argue that it is not exactly the goal of the URL living standard. The URL living standard does not seem to try to support servers that follow older standards, and instead seems to expect servers to adapt.

Servers that require some characters to be escaped, even though the standard does not require escaping them, must, for example, make sure that every URL on their website that contains those characters is escaped, so that users follow the escaped URLs. So, if your server does not support unescaped pipes in URL paths, you can still have URLs with pipes, but they must appear encoded so that, when users follow them, browsers send the pipe percent-encoded. And because the URL standard expects encoded characters to remain encoded, even if they are safe characters according to the standard, things should work.

For example, if you enter http://example.com/%7C on the Firefox address bar, it will send a request for http://example.com/%7C, but the address bar will show http://example.com/|. If you refresh the page, http://example.com/%7C will be used again. If you copy the URL, http://example.com/%7C will be copied. But if you click on the address bar and press Enter, http://example.com/| will be sent (this seems like a bug). And if you paste http://example.com/|, http://example.com/| will be sent to the server, in line with the standard not requiring the escaping of |.

Chrome seems to try to play it safer and always encodes |, just as safe_url_string does after this change. That is not what the URL living standard prescribes, although it is compatible with the standard (more on this below). Note that Firefox is not following the standard either, as the standard URL rendering rules say nothing about percent-decoding safe characters, and Firefox does it.

So safe_url_string diverges from the URL living standard by building URLs that are safe by additional standards, apparently in line with Chrome.

URLs where we percent-encode additional characters (to make URLs safe by the definition of additional, older standards) are still compatible with the URL living standard, since its URL parsing does not percent-decode what is already percent-encoded, even if it is a safe character. (Our current implementation does percent-decode safe characters, but changing that is not trivial and should be handled separately, e.g. through #203.)
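
The difference between leaving an existing escape alone and decoding it can be shown with the standard library. This is only an illustration using urllib.parse, not the w3lib code path; treating | as safe in the second step is an assumption that mimics a standard where the pipe needs no escaping.

```python
from urllib.parse import quote, unquote, urlsplit

# Parsing alone does not percent-decode: the escape survives.
path = urlsplit("http://example.com/%7C").path
print(path)  # /%7C

# Decoding and then re-encoding with "|" treated as safe loses the escape,
# which is a problem for servers that need the pipe percent-encoded.
lossy = quote(unquote(path), safe="/|")
print(lossy)  # /|
```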

Eventually we aim to use the Living Standard algorithm in safe_url_string, to ensure that the behavior is 100% compatible with the web browser behavior.

I would say we aim to have a behavior 100% compatible with the URL living standard, but our aim with safe_url_string is compatibility with web browser behavior and known server limitations. We aim to parse URLs based on the URL living standard, but to serialize them with extra rules to make them safe for servers that do not follow the URL living standard.
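
A rough sketch of that split between parsing and conservative serialization, using only the standard library. urlsplit is not a URL living standard parser, and the safe sets and function name below are assumptions for illustration rather than what safe_url_string does internally.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Assumed safe sets. "%" is kept safe so existing escapes are not
# double-encoded; "|", "[" and "]" are deliberately left out.
PATH_SAFE = "/%!$&'()*+,;=:@"
QUERY_SAFE = PATH_SAFE + "?"
FRAGMENT_SAFE = QUERY_SAFE


def serialize_conservatively(url: str) -> str:
    """Parse a URL, then re-serialize it with extra escaping (hypothetical helper)."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        quote(parts.path, safe=PATH_SAFE),
        quote(parts.query, safe=QUERY_SAFE),
        quote(parts.fragment, safe=FRAGMENT_SAFE),
    ))


print(serialize_conservatively("http://example.com/a|b?q=[1]#x|y"))
# http://example.com/a%7Cb?q=%5B1%5D#x%7Cy
```

Already-escaped input such as http://example.com/%7C passes through unchanged, because % is in every safe set.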

As servers move on and stop relying on older standards, we can move on and remove support for those, getting closer to the URL living standard serialization rules. However:

- It may be hard to decide when it is time to drop support for an old standard. But maybe we can write a broad crawl spider to detect incompatibilities, run it against top domains, and decide based on that.
- It is also possible, if the URL living standard stops percent-encoding certain characters in the future, that we need to make safe_url_string keep percent-encoding those until we are confident enough that no servers out there require those characters to be percent-encoded.

Changes in this PR bring safe_url_string implementation closer to https://url.spec.whatwg.org/, even if it doesn't implement the complete algorithm.

Yes and no.

The characters that we escape in userinfo bring us closer to the URL living standard. The escaping of single quotes depending on the URL scheme does the same.
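
In the URL living standard, the single quote is added to the query percent-encode set only for special schemes (ftp, file, http, https, ws, wss). A minimal sketch of that rule follows; the base safe set and the helper name are assumptions for illustration, not w3lib's implementation.

```python
from urllib.parse import quote

# Special schemes as listed by the URL living standard.
SPECIAL_SCHEMES = frozenset({"ftp", "file", "http", "https", "ws", "wss"})

# Assumed base safe set for queries; "'" is intentionally absent.
QUERY_SAFE = "/%!$&()*+,;=:@?"


def encode_query(query: str, scheme: str) -> str:
    """Encode a query, escaping "'" only for special schemes (hypothetical helper)."""
    safe = QUERY_SAFE if scheme in SPECIAL_SCHEMES else QUERY_SAFE + "'"
    return quote(query, safe=safe)


print(encode_query("q='a'", "http"))    # q=%27a%27
print(encode_query("q='a'", "mailto"))  # q='a'
```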

However, the extra characters that we escape in path, query and fragment do not require escaping in the URL living standard; we escape them to align with restrictions from older standards that we know are still used by servers. Incidentally, this also brings us closer to Chrome, which seems to have a similar aim to safe_url_string.

kmike · 2022-11-24T16:19:29Z

Thanks @Gallaecio, I learned a lot about URLs today!

Make safe_url_string safer

4537ff5

Gallaecio changed the title ~~Make safe_url_string safer~~ safe_url_string: escape additional characters Nov 23, 2022

Gallaecio mentioned this pull request Nov 23, 2022

Make safe_url_string safer #201

Closed

10 tasks

kmike reviewed Nov 23, 2022

bytes.translate: specify the delete parameter name for readability

7d96536

Co-authored-by: Mikhail Korobov <[email protected]>

.flake8: move unsupported inline comment

0994e08

kmike merged commit 17191b8 into scrapy:master Nov 24, 2022
