Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

Commit 9c2ff15

Browse filesBrowse files
stratakismiss-islingtonillia-vgpshead
authored
[3.8] gh-102153: Start stripping C0 control and space chars in urlsplit (GH-102508) (GH-104575) (GH-104592) (#104593) (#104895)
`urllib.parse.urlsplit` has already been respecting the WHATWG spec a bit GH-25595. This adds more sanitizing to respect the "Remove any leading C0 control or space from input" [rule](https://url.spec.whatwg.org/GH-url-parsing:~:text=Remove%20any%20leading%20and%20trailing%20C0%20control%20or%20space%20from%20input.) in response to [CVE-2023-24329](https://nvd.nist.gov/vuln/detail/CVE-2023-24329). I simplified the docs by eliding the state of the world explanatory paragraph in this security release only backport. (people will see that in the mainline /3/ docs) (cherry picked from commit d7f8a5f) (cherry picked from commit 2f630e1) (cherry picked from commit 610cc0a) (cherry picked from commit f48a96a) Co-authored-by: Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com> Co-authored-by: Illia Volochii <illia.volochii@gmail.com> Co-authored-by: Gregory P. Smith [Google] <greg@krypto.org>
1 parent c43c50e commit 9c2ff15
Copy full SHA for 9c2ff15

File tree

4 files changed

+111
-3
lines changed
Filter options

4 files changed

+111
-3
lines changed

‎Doc/library/urllib.parse.rst

Copy file name to clipboardExpand all lines: Doc/library/urllib.parse.rst
+36-2Lines changed: 36 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -147,6 +147,10 @@ or on combining URL components into a URL string.
147147
ParseResult(scheme='http', netloc='www.cwi.nl:80', path='/%7Eguido/Python.html',
148148
params='', query='', fragment='')
149149

150+
.. warning::
151+
152+
:func:`urlparse` does not perform validation. See :ref:`URL parsing
153+
security <url-parsing-security>` for details.
150154

151155
.. versionchanged:: 3.2
152156
Added IPv6 URL parsing capabilities.
@@ -312,8 +316,14 @@ or on combining URL components into a URL string.
312316
``#``, ``@``, or ``:`` will raise a :exc:`ValueError`. If the URL is
313317
decomposed before parsing, no error will be raised.
314318

315-
Following the `WHATWG spec`_ that updates RFC 3986, ASCII newline
316-
``\n``, ``\r`` and tab ``\t`` characters are stripped from the URL.
319+
Following some of the `WHATWG spec`_ that updates RFC 3986, leading C0
320+
control and space characters are stripped from the URL. ``\n``,
321+
``\r`` and tab ``\t`` characters are removed from the URL at any position.
322+
323+
.. warning::
324+
325+
:func:`urlsplit` does not perform validation. See :ref:`URL parsing
326+
security <url-parsing-security>` for details.
317327

318328
.. versionchanged:: 3.6
319329
Out-of-range port numbers now raise :exc:`ValueError`, instead of
@@ -326,6 +336,9 @@ or on combining URL components into a URL string.
326336
.. versionchanged:: 3.8.10
327337
ASCII newline and tab characters are stripped from the URL.
328338

339+
.. versionchanged:: 3.8.17
340+
Leading WHATWG C0 control and space characters are stripped from the URL.
341+
329342
.. _WHATWG spec: https://url.spec.whatwg.org/#concept-basic-url-parser
330343

331344
.. function:: urlunsplit(parts)
@@ -402,6 +415,27 @@ or on combining URL components into a URL string.
402415
or ``scheme://host/path``). If *url* is not a wrapped URL, it is returned
403416
without changes.
404417

418+
.. _url-parsing-security:
419+
420+
URL parsing security
421+
--------------------
422+
423+
The :func:`urlsplit` and :func:`urlparse` APIs do not perform **validation** of
424+
inputs. They may not raise errors on inputs that other applications consider
425+
invalid. They may also succeed on some inputs that might not be considered
426+
URLs elsewhere. Their purpose is for practical functionality rather than
427+
purity.
428+
429+
Instead of raising an exception on unusual input, they may instead return some
430+
component parts as empty strings. Or components may contain more than perhaps
431+
they should.
432+
433+
We recommend that users of these APIs where the values may be used anywhere
434+
with security implications code defensively. Do some verification within your
435+
code before trusting a returned component part. Does that ``scheme`` make
436+
sense? Is that a sensible ``path``? Is there anything strange about that
437+
``hostname``? etc.
438+
405439
.. _parsing-ascii-encoded-bytes:
406440

407441
Parsing ASCII Encoded Bytes

‎Lib/test/test_urlparse.py

Copy file name to clipboardExpand all lines: Lib/test/test_urlparse.py
+60-1Lines changed: 60 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -660,14 +660,73 @@ def test_urlsplit_remove_unsafe_bytes(self):
660660
self.assertEqual(p.scheme, "https")
661661
self.assertEqual(p.geturl(), "https://www.python.org/javascript:alert('msg')/?query=something#fragment")
662662

663+
def test_urlsplit_strip_url(self):
664+
noise = bytes(range(0, 0x20 + 1))
665+
base_url = "http://User:Pass@www.python.org:080/doc/?query=yes#frag"
666+
667+
url = noise.decode("utf-8") + base_url
668+
p = urllib.parse.urlsplit(url)
669+
self.assertEqual(p.scheme, "http")
670+
self.assertEqual(p.netloc, "User:Pass@www.python.org:080")
671+
self.assertEqual(p.path, "/doc/")
672+
self.assertEqual(p.query, "query=yes")
673+
self.assertEqual(p.fragment, "frag")
674+
self.assertEqual(p.username, "User")
675+
self.assertEqual(p.password, "Pass")
676+
self.assertEqual(p.hostname, "www.python.org")
677+
self.assertEqual(p.port, 80)
678+
self.assertEqual(p.geturl(), base_url)
679+
680+
url = noise + base_url.encode("utf-8")
681+
p = urllib.parse.urlsplit(url)
682+
self.assertEqual(p.scheme, b"http")
683+
self.assertEqual(p.netloc, b"User:Pass@www.python.org:080")
684+
self.assertEqual(p.path, b"/doc/")
685+
self.assertEqual(p.query, b"query=yes")
686+
self.assertEqual(p.fragment, b"frag")
687+
self.assertEqual(p.username, b"User")
688+
self.assertEqual(p.password, b"Pass")
689+
self.assertEqual(p.hostname, b"www.python.org")
690+
self.assertEqual(p.port, 80)
691+
self.assertEqual(p.geturl(), base_url.encode("utf-8"))
692+
693+
# Test that trailing space is preserved as some applications rely on
694+
# this within query strings.
695+
query_spaces_url = "https://www.python.org:88/doc/?query= "
696+
p = urllib.parse.urlsplit(noise.decode("utf-8") + query_spaces_url)
697+
self.assertEqual(p.scheme, "https")
698+
self.assertEqual(p.netloc, "www.python.org:88")
699+
self.assertEqual(p.path, "/doc/")
700+
self.assertEqual(p.query, "query= ")
701+
self.assertEqual(p.port, 88)
702+
self.assertEqual(p.geturl(), query_spaces_url)
703+
704+
p = urllib.parse.urlsplit("www.pypi.org ")
705+
# That "hostname" gets considered a "path" due to the
706+
# trailing space and our existing logic... YUCK...
707+
# and re-assembles via geturl aka unurlsplit into the original.
708+
# django.core.validators.URLValidator (at least through v3.2) relies on
709+
# this, for better or worse, to catch it in a ValidationError via its
710+
# regular expressions.
711+
# Here we test the basic round trip concept of such a trailing space.
712+
self.assertEqual(urllib.parse.urlunsplit(p), "www.pypi.org ")
713+
714+
# with scheme as cache-key
715+
url = "//www.python.org/"
716+
scheme = noise.decode("utf-8") + "https" + noise.decode("utf-8")
717+
for _ in range(2):
718+
p = urllib.parse.urlsplit(url, scheme=scheme)
719+
self.assertEqual(p.scheme, "https")
720+
self.assertEqual(p.geturl(), "https://www.python.org/")
721+
663722
def test_attributes_bad_port(self):
664723
"""Check handling of invalid ports."""
665724
for bytes in (False, True):
666725
for parse in (urllib.parse.urlsplit, urllib.parse.urlparse):
667726
for port in ("foo", "1.5", "-1", "0x10"):
668727
with self.subTest(bytes=bytes, parse=parse, port=port):
669728
netloc = "www.example.net:" + port
670-
url = "http://" + netloc
729+
url = "http://" + netloc + "/"
671730
if bytes:
672731
netloc = netloc.encode("ascii")
673732
url = url.encode("ascii")

‎Lib/urllib/parse.py

Copy file name to clipboardExpand all lines: Lib/urllib/parse.py
+12Lines changed: 12 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -25,6 +25,10 @@
2525
scenarios for parsing, and for backward compatibility purposes, some
2626
parsing quirks from older RFCs are retained. The testcases in
2727
test_urlparse.py provides a good indicator of parsing behavior.
28+
29+
The WHATWG URL Parser spec should also be considered. We are not compliant with
30+
it either due to existing user code API behavior expectations (Hyrum's Law).
31+
It serves as a useful guide when making changes.
2832
"""
2933

3034
import re
@@ -77,6 +81,10 @@
7781
'0123456789'
7882
'+-.')
7983

84+
# Leading and trailing C0 control and space to be stripped per WHATWG spec.
85+
# == "".join([chr(i) for i in range(0, 0x20 + 1)])
86+
_WHATWG_C0_CONTROL_OR_SPACE = '\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f '
87+
8088
# Unsafe bytes to be removed per WHATWG spec
8189
_UNSAFE_URL_BYTES_TO_REMOVE = ['\t', '\r', '\n']
8290

@@ -431,6 +439,10 @@ def urlsplit(url, scheme='', allow_fragments=True):
431439
url, scheme, _coerce_result = _coerce_args(url, scheme)
432440
url = _remove_unsafe_bytes_from_url(url)
433441
scheme = _remove_unsafe_bytes_from_url(scheme)
442+
# Only lstrip url as some applications rely on preserving trailing space.
443+
# (https://url.spec.whatwg.org/#concept-basic-url-parser would strip both)
444+
url = url.lstrip(_WHATWG_C0_CONTROL_OR_SPACE)
445+
scheme = scheme.strip(_WHATWG_C0_CONTROL_OR_SPACE)
434446
allow_fragments = bool(allow_fragments)
435447
key = url, scheme, allow_fragments, type(url), type(scheme)
436448
cached = _parse_cache.get(key, None)
+3Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
:func:`urllib.parse.urlsplit` now strips leading C0 control and space
2+
characters following the specification for URLs defined by WHATWG in
3+
response to CVE-2023-24329. Patch by Illia Volochii.

0 commit comments

Comments
0 (0)
Morty Proxy This is a proxified and sanitized view of the page, visit original site.