Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

[DomCrawler] Support classes from the new DOM extension #54383

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: 7.3
Choose a base branch
Loading
from

Conversation

alexandre-daubois
Copy link
Member

@alexandre-daubois alexandre-daubois commented Mar 23, 2024

Q A
Branch? 7.2
Bug fix? no
New feature? yes
Deprecations? no
Issues Fix #53666
License MIT

New classes using the native parser are located in NativeCrawler\ subnamespace. Existing classes and new ones have mutualized code in each corresponding trait to avoid duplication where possible.

All tests covering the current classes have been transposed and adapted to new classes.

Remaining failing tests will be solved in the next PHP 8.4 nightly.

@alexandre-daubois alexandre-daubois force-pushed the dom-crawler-84 branch 3 times, most recently from a326cbc to 414de14 Compare March 29, 2024 09:41
src/Symfony/Component/DomCrawler/AbstractUriElement.php Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/CrawlerInterface.php Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/DomCrawler.php Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/DomCrawler.php Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/DomCrawler.php Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/Crawler.php Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/CrawlerInterface.php Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/Tests/ImageTest.php Outdated Show resolved Hide resolved
@javiereguiluz javiereguiluz added the ❄️ Feature Freeze Important Pull Requests to finish before the next Symfony "feature freeze" label Apr 2, 2024
@alexandre-daubois alexandre-daubois force-pushed the dom-crawler-84 branch 6 times, most recently from 36b8982 to cc6945b Compare April 4, 2024 08:13
@nielsdos
Copy link

nielsdos commented Apr 4, 2024

(Commented on the GBK test case too, but I know that GH notifications can be weird so just pointing it out so the comment isn't lost).

The intention with discoverNamespace is to check what namespace URI corresponds to which namespace prefix, correct?
In that case, the current code (even for the "old DOM") doesn't make much sense to me to be honest.
Namespaces are scoped, but you seem to want to list all namespaces in the document, and only keep the last one of that list. So that means on a document like this, only nestedchild2 can be returned because the last namespace uri of prefix a is urn:not_a:

<root>
  <child1 xmlns:a="urn:a">
    <a:nestedchild1/>
  </child1>
  <child1 xmlns:a="urn:not_a">
    <a:nestedchild2/>
  </child1>
</root>

That's why the default registerNodeNamespaces property of (DOM\|DOM)XPath is scoped to the context node.
If it were scoped to a context node, you could use the builtin lookupNamespaceURI function of DOM\Element to get the uri from a prefix.

Anyway, to answer the question of "why doesn't it work?", I wrote about this in the following two threads: veewee/xml#74 (comment) and nielsdos/php-src#93 (comment)
The TL;DR is that it's not defined by spec what should be returned for the namespace axis query. Modern implementations such as browsers just return the empty list, so I did too. Reintroducing a return value of DomNameSpaceNode is wrong imo because it's not a node, and has caused many problems with confusion for users and confusion for static analysis before. There are potential replacement functions on the way in a final (yet-to-be-written) DOM RFC for 8.4.

@alexandre-daubois
Copy link
Member Author

Thanks @nielsdos for this complete explanation! 🚀

I think waiting for nielsdos/php-src#93 would help a lot in what's missing now in this PR. In the meantime, I'll have a look at what's possible with the current API.

@fabpot fabpot removed the ❄️ Feature Freeze Important Pull Requests to finish before the next Symfony "feature freeze" label May 2, 2024
@fabpot fabpot modified the milestones: 7.1, 7.2 May 2, 2024
@nielsdos
Copy link

nielsdos commented Jun 4, 2024

https://wiki.php.net/rfc/dom_additions_84 is under discussion which contains the features you need to finish this, as well as some extras such as css selector support.

I'd like to point out one thing that may bite later about this PR: you're using the NO_DEFAULT_NS compatibility flag. This flag exists to ease migration and won't put elements in their proper namespace, it will put them in no namespace to be exact. Presumably you do this because your CSS selector translation does not take into account namespaces? Note that not having the proper namespace set on elements has quite a few subtle consequences. Depending on if something is inside the html namespace the behaviour of some methods alter (as defined by spec). For example automatic lowercasing of attribute names will not happen with the compatibility flag on because the element holding the attribute is no longer in the html namespace. (See https://dom.spec.whatwg.org/#concept-element-attributes-get-by-name step 1 for example)

@alexandre-daubois
Copy link
Member Author

Amazing, thank you for keeping me updated! I'll have a look at the NO_DEFAULT_NS thing. I remember adding it to be consistent with the old behavior, but as the new parser would be opt-in, maybe this type of break could be introduced. I'll have a closer look at it. Good luck with your RFC, no doubt it will go smoothly!

@alexandre-daubois alexandre-daubois force-pushed the dom-crawler-84 branch 2 times, most recently from f27f754 to ae36c49 Compare June 30, 2024 10:40
@alexandre-daubois
Copy link
Member Author

alexandre-daubois commented Jul 16, 2024

I extracted common parts in traits for every class as you advised @derrabus, as well as tests. It really simplifies the whole thing for sure. I think it is ready for review now, even if I'm not sure how a review should be done in a PR that big.

All classes have now an equivalent in the NativeCrawler sub namespace. They only handle modern nodes, whereas "old" classes only support legacy nodes.

@alexandre-daubois alexandre-daubois force-pushed the dom-crawler-84 branch 5 times, most recently from fa487e7 to 17ccdbb Compare July 17, 2024 08:36
@stof
Copy link
Member

stof commented Aug 9, 2024

Creating totally separate classes independent of the existing ones means that we cannot use the new crawler in the BrowserKit component (as that would be a BC break) and this also implies that we cannot deprecate the old one.

@alexandre-daubois
Copy link
Member Author

I understand, thanks for pointing this out. Which approach would you advise to solve this problem?

@fabpot fabpot modified the milestones: 7.2, 7.3 Nov 20, 2024
@alexandre-daubois alexandre-daubois force-pushed the dom-crawler-84 branch 2 times, most recently from d934f0b to 6fde3a6 Compare January 22, 2025 08:09
@alexandre-daubois
Copy link
Member Author

Hello @derrabus @stof, getting back to this PR. What could be the next steps, or who can we bring into the conversation to go forward on the topic? Thanks!

@Bilge
Copy link
Contributor

Bilge commented Jan 22, 2025

I'm going to assume that, since you created a new DomCrawler class (to replace Crawler), that it was not possible to build support into the existing class. In that case, in what ways is DomCrawler not a drop-in replacement for Crawler?

@stof
Copy link
Member

stof commented Jan 22, 2025

@alexandre-daubois please provide info about which parts make it impossible to add support in Crawler directly. Simplicity of the implementation is not a good argument given the cost of forcing to migrate to a new class in the whole ecosystem (see my previous comment about the impact on BrowserKit)

@alexandre-daubois
Copy link
Member Author

The problem is that the new classes are completely different from the old ones and totally unrelated. Because of this, most method signatures from the existing Crawler cannot be updated to support a union of both old and new classes without breaking BC (thus creating new classes)

@Bilge
Copy link
Contributor

Bilge commented Jan 22, 2025

Why can't the new types just be added as a union with the old types?

@stof
Copy link
Member

stof commented Jan 22, 2025

@alexandre-daubois Creating new classes also breaks BC for BrowserKit if we make it switch to the new class (and if we keep BrowserKit on the old class, the feature is almost useless and won't help us remove the usage of masterminds/html5 in Symfony 8).
Can you list exactly the methods for which we would have a BC break ?

@stof
Copy link
Member

stof commented Jan 22, 2025

@Bilge expanding a parameter type for a non-final method is a BC break, and the Crawler is not final.

@Bilge
Copy link
Contributor

Bilge commented Jan 22, 2025

Would CrawlerTrait not make more sense as a base class than as a trait?

@alexandre-daubois
Copy link
Member Author

alexandre-daubois commented Jan 22, 2025

There are a lot of them. Actually, all methods put in common traits are the untouched one, whereas all methods of classes under the "NativeCrawler" namespace (for example) are the methods that need to be updated to support \DOM\* objects. NativeCrawler\FormField would be a good example.

@stof
Copy link
Member

stof commented Jan 22, 2025

@alexandre-daubois I don't think this is actually true. For instance, addHtmlContent has no BC break at all (the internal implementation is different regarding parsing HTML, but not its API. Same for many other methods.
I'm not asking which methods need to be updated internally, but which methods need BC breaks.

@alexandre-daubois
Copy link
Member Author

alexandre-daubois commented Jan 22, 2025

I'm not asking which methods need to be updated internally, but which methods need BC breaks.

Yes, for example AnstractUriElement::setNode() is protected and would require a BC break. This one's widely used in children classes.

@fabpot
Copy link
Member

fabpot commented Mar 29, 2025

What's the status of this PR @alexandre-daubois?

@alexandre-daubois
Copy link
Member Author

I need to rebase it, but if I remember correctly, the implementation works. We still don't have decided what's the best approach. Currently, traits are used for common parts between the current and the new native crawlers. However, @stof raised that in #54383 (comment) that it will cause problems with the migration of Browserkit.

Actually, some methods requires BC breaks because the PHP 8.4 native parser introduces new classes, incompatible with the old classes.

Maybe the best option is to introduce BC breaks, allowing us to widen types in protected and public methods that requires to (i.e. \DOMElement to \DOMElement|\DOM\Element), and keep only one implementation instead of two with traits for common parts. Then we'll deprecate using legacy classes like \DOMElement in the first Symfony version that requires PHP 8.4.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Leverage PHP 8.4's native parsing of HTML5
8 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.