Skip to content

urlparse goes wrong with IP:port without scheme #38644

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kruegi mannequin opened this issue Jun 13, 2003 · 19 comments
Closed

urlparse goes wrong with IP:port without scheme #38644

kruegi mannequin opened this issue Jun 13, 2003 · 19 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@kruegi
Copy link
Mannequin

kruegi mannequin commented Jun 13, 2003

BPO 754016
Nosy @birkenfeld, @birkenfeld, @facundobatista, @orsenthil, @miss-islington
PRs
  • bpo-27657: Fix urlparse() with numeric paths #661
  • [3.7] bpo-27657: Fix urlparse() with numeric paths (GH-661) #16837
  • [3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) #16839
  • Files
  • issue754016-py26.patch
  • 754016.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/orsenthil'
    closed_at = <Date 2010-08-04.04:56:30.468>
    created_at = <Date 2003-06-13.15:15:34.000>
    labels = ['type-bug', 'library']
    title = 'urlparse goes wrong with IP:port without scheme'
    updated_at = <Date 2019-10-18.15:23:21.979>
    user = 'https://bugs.python.org/kruegi'

    bugs.python.org fields:

    activity = <Date 2019-10-18.15:23:21.979>
    actor = 'orsenthil'
    assignee = 'orsenthil'
    closed = True
    closed_date = <Date 2010-08-04.04:56:30.468>
    closer = 'orsenthil'
    components = ['Library (Lib)']
    creation = <Date 2003-06-13.15:15:34.000>
    creator = 'kruegi'
    dependencies = []
    files = ['10752', '18362']
    hgrepos = []
    issue_num = 754016
    keywords = ['patch']
    message_count = 19.0
    messages = ['16377', '16378', '16379', '16380', '16381', '16382', '67890', '68526', '68533', '68535', '68841', '69095', '85877', '110334', '112725', '112758', '354892', '354897', '354906']
    nosy_count = 9.0
    nosy_names = ['georg.brandl', 'georg.brandl', 'facundobatista', 'jjlee', 'sjones', 'kruegi', 'orsenthil', 'dstanek', 'elachuni', 'miss-islington']
    pr_nums = ['661', '16837', '16839']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue754016'
    versions = ['Python 3.1', 'Python 2.7', 'Python 3.2']

    @kruegi
    Copy link
    Mannequin Author

    kruegi mannequin commented Jun 13, 2003

    urlparse doesnt work if IP and port are given without
    scheme:

    >>> urlparse.urlparse('1.2.3.4:80','http')
    ('1.2.3.4', '', '80', '', '', '')

    should be:

    >>> urlparse.urlparse('1.2.3.4:80','http')
    ('http', '1.2.3.4', '80', '', '', '')

    @kruegi kruegi mannequin assigned birkenfeld Jun 13, 2003
    @kruegi kruegi mannequin added the stdlib Python modules in the Lib dir label Jun 13, 2003
    @kruegi kruegi mannequin assigned birkenfeld Jun 13, 2003
    @kruegi kruegi mannequin added the stdlib Python modules in the Lib dir label Jun 13, 2003
    @sjones
    Copy link
    Mannequin

    sjones mannequin commented Jun 14, 2003

    Logged In: YES
    user_id=589306

    urlparse.urlparse takes a url of the format:
    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>

    And returns a 6-tuple of the format:
    (scheme, netloc, path, params, query, fragment).

    An example from the library refrence takes:

    @sjones
    Copy link
    Mannequin

    sjones mannequin commented Jun 14, 2003

    Logged In: YES
    user_id=589306

    Sorry, previous comment got cut off...

    urlparse.urlparse takes a url of the format:
    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>

    And returns a 6-tuple of the format:
    (scheme, netloc, path, params, query, fragment).

    An example from the library refrence takes:
    urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')

    And produces:
    ('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '',
    '', '')

    --------------------------------

    Note that there isn't a field for the port number in the
    6-tuple. Instead, it is included in the netloc. Urlparse
    should handle your example as:

    >>> urlparse.urlparse('1.2.3.4:80','http') 
    ('http', '1.2.3.4:80', '', '', '', '')

    Instead, it gives the incorrect output as you indicated.

    @sjones
    Copy link
    Mannequin

    sjones mannequin commented Jun 14, 2003

    Logged In: YES
    user_id=589306

    Ok, I researched this a bit, and the situation isn't as
    simple as it first appears. The RFC that urlparse tries to
    follow is at http://www.faqs.org/rfcs/rfc1808.html and
    notice specifically sections 2.1 and 2.2.

    It seems to me that the source code follows rfc1808
    religiously, and in that sense it does the correct thing.
    According to the RFC, the netloc should begin with a '//',
    and since your example didn't include one then it technical
    was an invalid URL. Here is another example where it seems
    Python fails to do the right thing:

    >>> urlparse.urlparse('python.org')
    ('', '', 'python.org', '', '', '')
    >>> urlparse.urlparse('python.org', 'http')
    ('http', '', 'python.org', '', '', '')

    Note that it is putting 'python.org' as the path and not the
    netloc. So the problem isn't limited to just when you use a
    scheme parameter and/or a port number. Now if we put '//' at
    the beginning, we get:

    >>> urlparse.urlparse('//python.org')
    ('', 'python.org', '', '', '', '')
    >>> urlparse.urlparse('//python.org', 'http')
    ('http', 'python.org', '', '', '', '')

    So here it does the correct thing.

    There are two problems though. First, it is common for
    browsers and other software to just take a URL without a
    scheme and '://' and assume it is http for the user. While
    the URL is technically not correct, it is still common
    usage. Also, urlparse does take a scheme parameter. It seems
    as though this parameter should be used with a scheme-less
    URL to give it a default one like web browsers do.

    So somebody needs to make a decision. Should urlparse follow
    the RFC religiously and require '//' in front of netlocs? If
    so, I think the documentation should give an example showing
    this and showing how to use the 'scheme' parameter. Or
    should urlparse use the more commonly used form of a URL
    where '//' is omitted when the scheme is omitted? If so,
    urlparse.py will need to be changed. Or maybe another
    fuction should be added to cover whichever behaviour
    urlparse doesn't cover.

    In any case, you can temporarily solve your problem by
    making sure that URL's without a scheme have '//' at the
    front. So your example becomes:

    >>> urlparse.urlparse('//1.2.3.4:80', 'http')
    ('http', '1.2.3.4:80', '', '', '', '')

    @facundobatista
    Copy link
    Member

    Logged In: YES
    user_id=752496

    The problem is still present in Py2.3.4.

    IMO, it should support dirs without the "http://" or raise
    an error if it's not present (never fail silently!).

    @birkenfeld
    Copy link
    Member

    Logged In: YES
    user_id=1188172

    Will look into it. Should be easy to fix.

    @orsenthil
    Copy link
    Member

    Attaching the patch to fix this issue. I deliberated upon this for a
    while and came up with the approach to:

    1. fix the port issue, wherein urlparse should technically recognize the
      ':' separator for port from ':' after scheme.

    2. And Doc fix wherein, it is advised that in the absence of a scheme,
      use the net_loc as //net_loc (following RCF 1808).

    If we go for any other fix, like internally pre-pending // when user has
    not specified the scheme (like in many pratical purpose), then we stand
    at chance of breaking a number of tests ( cases where url is 'g'(path
    only),';x' (path with params) and cases where relative url is g:h)

    Let me know your thoughts on this.

    >>> urlparse('1.2.3.4:80')
    ParseResult(scheme='', netloc='', path='1.2.3.4:80', params='',
    query='', fragment='')
    >>> urlparse('http://www.python.org:80/~guido/foo?query#fun')
    ParseResult(scheme='http', netloc='www.python.org:80',
    path='/~guido/foo', params='', query='query', fragment='fun')
    >>>

    @facundobatista
    Copy link
    Member

    Senthil, your patch is wrong, see:

    >>> import urlparse
    >>> urlparse.urlparse('1.2.3.4:80','http')
    ParseResult(scheme='http', netloc='', path='1.2.3.4:80', params='',
    query='', fragment='')

    The netloc should be "1.2.3.4:80", note the composition of an URL:

    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>

    Please fix it and test it applying the patch to the test I'm submitting
    here...

    @elachuni
    Copy link
    Mannequin

    elachuni mannequin commented Jun 21, 2008

    I agree with facundobatista that the patch is bad, but for a different
    reason: it now breaks with:

    >>> import urlparse
    >>> urlparse.urlparse ('http:')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/anthony/svn/python26/Lib/urlparse.py", line 108, in urlparse
    tuple = urlsplit(url, scheme, allow_fragments)
      File "/home/anthony/svn/python26/Lib/urlparse.py", line 148, in urlsplit
    if i > 0 and not url[i+1].isdigit():
    IndexError: string index out of range

    I'm afraid that it it's not evident that the expected behavior isn't
    evident.

    Take for example:

    >>> import urlparse
    >>> urlparse.urlparse('some.com', 'http')
    ParseResult(scheme='http', netloc='', path='some.com', params='',
    query='', fragment='')

    Is the url referring to the some.com domain or to a windows executable file?

    If you're using urlparse to parse only absolute urls then probably you
    want the first component to be considered a net_loc, but not if you're
    thinking of accepting also relative urls.

    It would probably be better to be explicit and raise an exception if the
    url is invalid, so that the user can prepend a '//' and resubmit if
    needed. Also we'd probably stop seeing bugreports about this issue :)

    @facundobatista
    Copy link
    Member

    I agree with Anthony here, because if you let people write without the
    "//" at the beginning, you'll never know if they're entering a net
    location or a relative path.

    So, the better behaviour to be as explicit as possible should be:

    >>> urlparse.urlparse('1.2.3.4:80','http')
    Traceback!!! ValueError(<nice message here>)
    
    >>> urlparse.urlparse('//1.2.3.4:80','http')
    ('http', '1.2.3.4:80', '', '', '', '')

    So, to close this issue, we should fix the code to behave like indicated
    in the first case.

    What do you think?

    @orsenthil
    Copy link
    Member

    I am attaching the modified patch, which addresses the port issue
    properly and handles 'http:', 'https:' only URLS. Also included the
    tests for it.

    Facundo, I gave sufficient thought on raising an Exception for URLS not
    staring with '//', and I am -1 for it.

    As urlparse module is used for handling both absolute URLs as well as
    relative URLS, this suggestion IMHO, would break the urlparse handling
    of all relative urls. For e.g, cases which are mentioned in the RFC 1808
    (Section 5.1 Normal Examples).

    The way to inform the users to use '//net_loc' when they want net_loc,
    would be Docs/Help message (included in the patch) and otherwise
    urlparse following RFC1808, will treat it as the path.

    This case may seem absurd when 'www.python.org' is treated as path but
    perfect for parsing relative urls like just 'a'. More over this makes
    sense when we have relative urls with parameters and query, for
    e.g.'g:h','?x'

    Another way to handle this would be split urlparse into two methods:
    urlparse.absparse()
    urlparse.relparse()
    and let the user decide what he wants to do.
    I am passing a message to Web-SIG to discuss this further.

    Irrespective of this, if the patch looks okay for "handling the port
    issue" for 2.6 with the "Doc/Help message", then we should close this
    bug and take the discussion further in Web-SIG. (I shall provide the
    patch for 3.0 as well)

    Comments Please.

    @facundobatista
    Copy link
    Member

    I think this last patch is ok, but the third case that was raised in the
    web-sig should be addressed:

    """
    There's even a 3rd case: HTTP's Request-URI. For example, '//path' must
    be treated as an abs_path consisting of two path_segments ['', 'path'],
    not a net_loc, since the Request_URI must be one of ("*" | absoluteURI |
    abs_path | authority).
    """

    Please, address this new detail, and I'd commit this. Thanks!

    @devdanzin devdanzin mannequin added type-bug An unexpected behavior, bug, or error labels Feb 13, 2009
    @orsenthil
    Copy link
    Member

    Facundo, I re-looked at this issue (after a long time; sorry for that)
    and I think that the patch is good to be applied as it is for this issue.

    The Web-SIG discussion, which you pointed to in the comment
    (http://mail.python.org/pipermail/web-sig/2008-June/003458.html)
    suggests a more "general case" of thinking and parsing of the urls.

    The suggestion is specifically for "//path" kind of urls, which is will
    be parsed into authority (as per RFC2396) or into netloc as in the patch.

    My suggestion is to fix this issue with patch and if the corner case
    like "//path" comes up, then I shall look the modifications required as
    there is No RFC Specifications and Tests defining those cases.

    Your comments?

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Jul 14, 2010

    The patch will need to be reworked for the 2.7, 3.1 and 3.2 branches.

    @dstanek
    Copy link
    Mannequin

    dstanek mannequin commented Aug 3, 2010

    I've reworked the patch so that it applied against the py3k branch. It's been attached to this issue and is also available here: http://codereview.appspot.com/1910044.

    @orsenthil
    Copy link
    Member

    Fixed in revision 83700 (release27-maint). r83701(py3k) and r83702(release31-maint).

    David, thanks for reworking on the patch. Couple of comments

    • I made change to the original patch where I checked 'https:' and 'http:' kind of url with url.endswith(':') instead of len(url) == i+1 // I had a hard time to figure out why I had this way in the first place.

    • In py3k, the urlparse.urlparse is changed to urllib.parse.urlparse. So that changes were required in the tests.

    @orsenthil orsenthil assigned orsenthil and unassigned facundobatista Aug 4, 2010
    @orsenthil
    Copy link
    Member

    New changeset 5a88d50 by Senthil Kumaran (Tim Graham) in branch 'master':
    bpo-27657: Fix urlparse() with numeric paths (#661)
    5a88d50

    @miss-islington
    Copy link
    Contributor

    New changeset 82b5f6b by Miss Islington (bot) in branch '3.7':
    bpo-27657: Fix urlparse() with numeric paths (GH-661)
    82b5f6b

    @orsenthil
    Copy link
    Member

    New changeset 0f3187c by Senthil Kumaran in branch '3.8':
    [3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (bpo-16839)
    0f3187c

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants