gh-54873: Add support for namespaces prefixes to xml.sax.expatreader#118317
gh-54873: Add support for namespaces prefixes to xml.sax.expatreader#118317ukarroum wants to merge 10 commits intopython:mainpython/cpython:mainfrom
xml.sax.expatreader#118317Conversation
…ader from PyXml (0.8.4)
picnixz
left a comment
There was a problem hiding this comment.
Thanks, but as this is a new feature:
- Better tests must be written. Why are we using a new file?
- A What's New entry must be added in addition to the NEWS entry.
- Documentation must be added.
- PyXML is no longer maintained. What about other XML parsers?
Lib/test/test_xml_expatreader.py
Outdated
| self.parser.setContentHandler(h) | ||
| self.parser.feed("<Q:E xmlns:Q='http://example.org/testuri'/>") | ||
| self.parser.close() | ||
| print("self.assertEqual") |
There was a problem hiding this comment.
A shamelessly forgot debug print.
I removed it
Lib/test/test_xml_expatreader.py
Outdated
| self.parser.feed("<Q:E xmlns:Q='http://example.org/testuri'/>") | ||
| self.parser.close() | ||
| print("self.assertEqual") | ||
| self.assertFalse(h.qname is None) |
There was a problem hiding this comment.
This is not sufficient. We should match against the exact qname.
Lib/test/test_xml_expatreader.py
Outdated
| @@ -0,0 +1,24 @@ | ||
| import unittest |
There was a problem hiding this comment.
Why are we having a new test file? there should already be a testfile for expat reader.
There was a problem hiding this comment.
I assumed all xml unit tests were in the test_xml_* files. Looks like there is a test_sax.py.
Moved test there.
| @@ -0,0 +1,2 @@ | ||
| Backported namespaces prefixes support for xml.sax.expatreader from PyXml |
There was a problem hiding this comment.
A What's New entry should be written; this is a new feature. Also, I don't think we should indicate PyXML either. Please look at other changelog entries involving pyexpat to have an idea of how to write them.
There was a problem hiding this comment.
Done.
reason why I added the "from PyXml" is that I didn't write the code from scratch.
I backported / copied it from the existing PyXml implementation.
Not sure how to correctly credit the original library.
| elif name == feature_namespace_prefixes: | ||
| if state: | ||
| raise SAXNotSupportedException( | ||
| "expat does not report namespace prefixes") |
There was a problem hiding this comment.
Was it a limitation from the C extension module? was it a Expat version limitation?
There was a problem hiding this comment.
I did a quick git blame, and looks like this specific check was added in this commit: 18476a3 (from 2002).
and in this commit Expat version used is 1.95 (which should support namespace prefixes), so I would say the limitation was probably just in the python wrapper.
| pair = name.split() | ||
| if len(pair) == 1: | ||
| # no namespace | ||
| elem_qname = name |
There was a problem hiding this comment.
I have no idea whether this is correct or not here. Is there some specs that we could follow for the implementation?
There was a problem hiding this comment.
if you're referring specificaly to elem_qname, there is the spec: https://www.w3.org/TR/xml-names/#ns-qualnames.
There was a problem hiding this comment.
I also added another test to test the "else" branch.
|
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase And if you don't make the requested changes, you will be poked with soft cushions! |
|
|
||
| from xml.sax import make_parser, ContentHandler, \ | ||
| SAXException, SAXReaderNotAvailable, SAXParseException | ||
| SAXException, SAXReaderNotAvailable, SAXParseException, handler |
There was a problem hiding this comment.
Why import handler when ContentHandler is already imported? why not use the latter directly?
There was a problem hiding this comment.
I replaced the handler.ContentHandler, with ContentHandler.
but I still need to keep the handler import for the namespace/namespace features.
There was a problem hiding this comment.
How about?:
| SAXException, SAXReaderNotAvailable, SAXParseException, handler | |
| SAXException, SAXReaderNotAvailable, SAXParseException | |
| from xml.sax.handler import feature_namespaces, feature_namespace_prefixes |
| parser.feed("<E xmlns='http://example.org/testuri'/>") | ||
| parser.close() | ||
| self.assertEqual(h.qname, "E") | ||
|
|
There was a problem hiding this comment.
Don't leave 3 blank lines, only 2 is sufficient.
Lib/test/test_sax.py
Outdated
| parser.feed("<E xmlns='http://example.org/testuri'/>") | ||
| parser.close() |
There was a problem hiding this comment.
Done, but I didn't add the True, it seems that the parse method only accept one argument.
There was a problem hiding this comment.
Oh. Wait, @hartwork should we do feed() + close() or parse()?
There was a problem hiding this comment.
@picnixz hi!
My understanding is that:
create_parserisxml.sax.expatreader.create_parserand it creates anxml.sax.xmlreader.XMLReader, first sentence in the docsxml.sax.xmlreader.XMLReaderprovides.parsethat parses in one single go, one single parameter, no booleanisFinalxml.sax.xmlreader.XMLReaderdoes not provide.feedor.closebecause that needsIncrementalParser, not a plainXMLReader.- (
xml.sax.expatreader.create_parsercreatesxml.sax.expatreader.ExpatParserthat is an instance of bothIncrementalParserandXMLReaderfor me in practice.) - I therefore consider
parser.parse("<E xmlns='http://example.org/testuri'/>")to be better suited here because availability of.feedand.closeis not guaranteed on interface level. - (If you want to stick to
.feedhere or elsewhere maybe addself.assertIsInstance(parser, IncrementalParser)prior to a call to communicate that expectation and have the test fail meaningfully for regressions in the future.)
What do you think?
Lib/xml/sax/expatreader.py
Outdated
| self._entity_stack = [] | ||
| self._external_ges = 0 | ||
| self._interning = None | ||
| self._namespace_prefixes = 1 |
There was a problem hiding this comment.
Changed to use a new namespacePrefixesHandling argument.
| pair = name.split() | ||
| if len(pair) == 1: | ||
| # no namespace | ||
| elem_qname = name |
Doc/whatsnew/3.15.rst
Outdated
|
|
||
| .. _billion laughs: https://en.wikipedia.org/wiki/Billion_laughs_attack | ||
|
|
||
| * Add support for namespace prefixes. |
There was a problem hiding this comment.
This should be also documented in the expat docs.
There was a problem hiding this comment.
do you mean I should add a link to the C libexpat doc ?
EDIT: I think you probably meant to document this here: https://docs.python.org/3/library/pyexpat.html
|
Sorry, I meant: keep the NEWS entry and add the What's New entry (IOW, both of them must be added). |
|
cc @hartwork as the maintainer of Expat |
xml.sax.expatreader
hartwork
left a comment
There was a problem hiding this comment.
I believe I understand the ideas behind this pull request now. I have compared with what PyXML 0.8.4 did and will also attach my local play code here (that is derived from the test case in this pull request):
- Download: sax_debug.py
I am optimistic that this pull request will be in mergable shape soon. Here is what I found:
| * Add support for `namespace prefixes <https://www.w3.org/TR/xml-names/#dt-prefix>`_. | ||
| (Contributed by Yassir Karroum in :gh:`118317`.) | ||
|
|
There was a problem hiding this comment.
This is adding to section xml.parsers.expat while the changes in here are patching code elsewhere. (I'm unsure when/if news files should be written to directly. I have only seen things go through news files myself.)
|
|
||
| from xml.sax import make_parser, ContentHandler, \ | ||
| SAXException, SAXReaderNotAvailable, SAXParseException | ||
| SAXException, SAXReaderNotAvailable, SAXParseException, handler |
There was a problem hiding this comment.
How about?:
| SAXException, SAXReaderNotAvailable, SAXParseException, handler | |
| SAXException, SAXReaderNotAvailable, SAXParseException | |
| from xml.sax.handler import feature_namespaces, feature_namespace_prefixes |
| def startElementNS(self, name, qname, attrs): | ||
| self.qname = qname | ||
|
|
||
| for xml_s, expected_qname in zip(["<Q:E xmlns:Q='http://example.org/testuri'/>", "<E xmlns='http://example.org/testuri'/>", "<E />"], ["Q:E", "E", "E"]): |
There was a problem hiding this comment.
I would consider this variant much more readable:
| for xml_s, expected_qname in zip(["<Q:E xmlns:Q='http://example.org/testuri'/>", "<E xmlns='http://example.org/testuri'/>", "<E />"], ["Q:E", "E", "E"]): | |
| for xml_s, expected_qname in ( | |
| ("<Q:E xmlns:Q='http://example.org/testuri'/>", "Q:E"), | |
| ("<E xmlns='http://example.org/testuri'/>", "E"), | |
| ("<E />", "E"), | |
| ): |
| """SAX driver for the pyexpat C module.""" | ||
|
|
||
| def __init__(self, namespaceHandling=0, bufsize=2**16-20): | ||
| def __init__(self, namespaceHandling=0, namespacePrefixesHandling=0, bufsize=2**16-20): |
There was a problem hiding this comment.
Just thinking aloud: if setFeature can toggle this after initialization and PyXML 0.8.4 did not have this, maybe we do not need this direct channel and can just initialize with = 0 below. So my vote for simplification.
| elif name == feature_namespace_prefixes: | ||
| self._namespace_prefixes = state |
There was a problem hiding this comment.
My vote for moving this one further down — right after elif name == feature_external_pes: — to match the left side of the diff.
| elem_qname = name | ||
| pair = (None, name) | ||
| elif len(pair) == 3: | ||
| elem_qname = "%s:%s" % (pair[2], pair[1]) |
There was a problem hiding this comment.
Note to self: This is the key value provider in this pull request. pair[2] was previously not exposed.
| pair = name.split() | ||
| if len(pair) == 1: | ||
| # no namespace | ||
| elem_qname = name |
| @@ -0,0 +1,2 @@ | ||
| Backported namespaces prefixes support for xml.sax.expatreader from PyXml |
There was a problem hiding this comment.
| Backported namespaces prefixes support for xml.sax.expatreader from PyXml | |
| Backported namespaces prefixes support for xml.sax.expatreader from PyXML |
| qnames[apair] = qname | ||
|
|
||
| self._cont_handler.startElementNS(pair, None, | ||
| self._cont_handler.startElementNS(pair, elem_qname, |
There was a problem hiding this comment.
This will never pass None even with support for namespace prefixes disabled.
The following patch is arguably a bit silly, but it keeps things backwards compatible (and demos the idea):
| self._cont_handler.startElementNS(pair, elem_qname, | |
| if not self._namespace_prefixes: | |
| elem_qname = None | |
| self._cont_handler.startElementNS(pair, elem_qname, |
| for xml_s, expected_qname in zip(["<Q:E xmlns:Q='http://example.org/testuri'/>", "<E xmlns='http://example.org/testuri'/>", "<E />"], ["Q:E", "E", "E"]): | ||
| parser = create_parser() | ||
| parser.setFeature(handler.feature_namespaces, 1) | ||
| parser.setFeature(handler.feature_namespace_prefixes, 1) |
There was a problem hiding this comment.
I believe there should also be a test for the behavior with this disabled where all QNames returned are None as previously.

Uh oh!
There was an error while loading. Please reload this page.