Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

[DomCrawler] Give choice of used parser #49121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Mar 6, 2023
Merged

[DomCrawler] Give choice of used parser #49121

merged 1 commit into from
Mar 6, 2023

Conversation

victor-prdh
Copy link
Contributor

Q A
Branch? 6.3
Bug fix? no
New feature? yes
Deprecations? no
Tickets #48950
License MIT
Doc PR symfony/symfony-docs#...

Hi,

This first commit is more like than a POC and possible implementation of the feature. I would like to have some feedback before start adding tests and doc for this new feature.

Thanks !

@carsonbot
Copy link

Hey!

To help keep things organized, we don't allow "Draft" pull requests. Could you please click the "ready for review" button or close this PR and open a new one when you are done?

Note that a pull request does not have to be "perfect" or "ready for merge" when you first open it. We just want it to be ready for a first review.

Cheers!

Carsonbot

@victor-prdh victor-prdh marked this pull request as ready for review January 26, 2023 17:28
@carsonbot carsonbot added this to the 6.3 milestone Jan 26, 2023
@stof
Copy link
Member

stof commented Jan 27, 2023

The name isHtml5Parser is bad. Even when this is true, the HTML5 parser might not be used if the page content being crawled uses HTML4 or XHTML.

src/Symfony/Component/DomCrawler/Crawler.php Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/Crawler.php Outdated Show resolved Hide resolved
src/Symfony/Component/BrowserKit/AbstractBrowser.php Outdated Show resolved Hide resolved
@victor-prdh
Copy link
Contributor Author

victor-prdh commented Jan 27, 2023

The name isHtml5Parser is bad. Even when this is true, the HTML5 parser might not be used if the page content being crawled uses HTML4 or XHTML.

Oh yes, didn't think about this case ! What’s about isHtml5ParserUsable ?

Thanks for your proofreading.

Copy link
Member

@nicolas-grekas nicolas-grekas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some CS changes suggested by fabbot are false positives and should be ignored.
Can you please also add some test cases?

src/Symfony/Component/BrowserKit/AbstractBrowser.php Outdated Show resolved Hide resolved
src/Symfony/Component/BrowserKit/AbstractBrowser.php Outdated Show resolved Hide resolved
src/Symfony/Component/BrowserKit/AbstractBrowser.php Outdated Show resolved Hide resolved
src/Symfony/Component/BrowserKit/AbstractBrowser.php Outdated Show resolved Hide resolved
src/Symfony/Component/BrowserKit/CHANGELOG.md Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/Crawler.php Outdated Show resolved Hide resolved
src/Symfony/Component/DomCrawler/Crawler.php Outdated Show resolved Hide resolved
Copy link
Member

@nicolas-grekas nicolas-grekas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes. Here is my review as a diff:

diff --git a/src/Symfony/Component/BrowserKit/AbstractBrowser.php b/src/Symfony/Component/BrowserKit/AbstractBrowser.php
index b1d0f0d21b..ec3a7e3523 100644
--- a/src/Symfony/Component/BrowserKit/AbstractBrowser.php
+++ b/src/Symfony/Component/BrowserKit/AbstractBrowser.php
@@ -37,7 +37,7 @@ abstract class AbstractBrowser
     protected $internalResponse;
     protected $response;
     protected $crawler;
-    protected bool $isHtml5ParserUsable = true;
+    protected bool $useHtml5Parser = true;
     protected $insulated = false;
     protected $redirect;
     protected $followRedirects = true;
@@ -209,13 +209,11 @@ abstract class AbstractBrowser
     }
 
     /**
-     * @param bool $isHtml5ParserUsable set usage of masterminds/html5 for crawler
-     *
      * @return $this
      */
-    public function useHtml5Parser(bool $isHtml5ParserUsable): static
+    public function useHtml5Parser(bool $useHtml5Parser): static
     {
-        $this->isHtml5ParserUsable = $isHtml5ParserUsable;
+        $this->useHtml5Parser = $useHtml5Parser;
 
         return $this;
     }
@@ -510,7 +508,7 @@ abstract class AbstractBrowser
             return null;
         }
 
-        $crawler = new Crawler(null, $uri, null, $this->isHtml5ParserUsable());
+        $crawler = new Crawler(null, $uri, null, $this->useHtml5Parser);
         $crawler->addContent($content, $type);
 
         return $crawler;
diff --git a/src/Symfony/Component/DomCrawler/CHANGELOG.md b/src/Symfony/Component/DomCrawler/CHANGELOG.md
index 61a4531704..be1c0ba143 100644
--- a/src/Symfony/Component/DomCrawler/CHANGELOG.md
+++ b/src/Symfony/Component/DomCrawler/CHANGELOG.md
@@ -4,8 +4,7 @@ CHANGELOG
 6.3
 ---
 
- * Add `isHtml5ParserUsable` properties to `Crawler`
-   to check if `Masterminds\HTML5` is usable
+ * Add `$useHtml5Parser` argument to `Crawler`
  * Add `CrawlerSelectorCount` test constraint
  * Add argument `$normalizeWhitespace` to `Crawler::innerText()`
  * Make `Crawler::innerText()` return the first non-empty text
diff --git a/src/Symfony/Component/DomCrawler/Crawler.php b/src/Symfony/Component/DomCrawler/Crawler.php
index 35a31cbe2d..2719340472 100644
--- a/src/Symfony/Component/DomCrawler/Crawler.php
+++ b/src/Symfony/Component/DomCrawler/Crawler.php
@@ -58,28 +58,16 @@ class Crawler implements \Countable, \IteratorAggregate
      */
     private bool $isHtml = true;
 
-    /**
-     * Whether the Crawler use masterminds/html5 or not.
-     */
-    private bool $isHtml5ParserUsable = true;
-
-    private HTML5 $html5Parser;
+    private ?HTML5 $html5Parser = null;
 
     /**
      * @param \DOMNodeList|\DOMNode|\DOMNode[]|string|null $node A Node to use as the base for the crawling
-     * @param $isHtml5ParserUsable Set usage of masterminds/html5 or not
      */
-    public function __construct(\DOMNodeList|\DOMNode|array|string $node = null, string $uri = null, string $baseHref = null, bool $useHtml5Parser = null)
+    public function __construct(\DOMNodeList|\DOMNode|array|string $node = null, string $uri = null, string $baseHref = null, bool $useHtml5Parser = true)
     {
-        $this->isHtml5ParserUsable = $useHtml5Parser ?? $this->isHtml5ParserUsable;
-        
-        if (!class_exists(HTML5::class)) {
-            throw new \LogicException('To parse data with HTML5 parser, install the masterminds/html5 component ("composer require masterminds/html5").');
-        }
-
         $this->uri = $uri;
         $this->baseHref = $baseHref ?: $uri;
-        $this->html5Parser = new HTML5(['disable_html_ns' => true]);
+        $this->html5Parser = $useHtml5Parser ? new HTML5(['disable_html_ns' => true]) : null;
         $this->cachedNamespaces = new \ArrayObject();
 
         $this->add($node);
@@ -591,7 +579,7 @@ class Crawler implements \Countable, \IteratorAggregate
 
     /**
      * Returns only the inner text that is the direct descendent of the current node, excluding any child nodes.
-     * 
+     *
      * @param bool $normalizeWhitespace Whether whitespaces should be trimmed and normalized to single spaces
      */
     public function innerText(/* bool $normalizeWhitespace = true */): string
@@ -633,7 +621,7 @@ class Crawler implements \Countable, \IteratorAggregate
         $node = $this->getNode(0);
         $owner = $node->ownerDocument;
 
-        if ('<!DOCTYPE html>' === $owner->saveXML($owner->childNodes[0])) {
+        if ($this->html5Parser && '<!DOCTYPE html>' === $owner->saveXML($owner->childNodes[0])) {
             $owner = $this->html5Parser;
         }
 
@@ -654,7 +642,7 @@ class Crawler implements \Countable, \IteratorAggregate
         $node = $this->getNode(0);
         $owner = $node->ownerDocument;
 
-        if ('<!DOCTYPE html>' === $owner->saveXML($owner->childNodes[0])) {
+        if ($this->html5Parser && '<!DOCTYPE html>' === $owner->saveXML($owner->childNodes[0])) {
             $owner = $this->html5Parser;
         }
 
@@ -1227,7 +1215,7 @@ class Crawler implements \Countable, \IteratorAggregate
 
     private function canParseHtml5String(string $content): bool
     {
-        if (false === $this->isHtml5ParserUsable) {
+        if (!$this->html5Parser) {
             return false;
         }
 

@victor-prdh
Copy link
Contributor Author

Thanks for the review, i will add the tests ASAP !

@victor-prdh
Copy link
Contributor Author

Hello, I come back to you because I don't know how I can test this :/
I just searched today but I still have no idea how to test what I added ? It's doesn't change the functioning of Crawler, it's just gave to users capabilities to use or not the HTML5Parser so i'm bit perplex about wich tests can i add ?

Thanks for your help

@nicolas-grekas
Copy link
Member

nicolas-grekas commented Feb 24, 2023

We should figure out some HTML string that is parsed differently between both parsers I suppose.

@victor-prdh
Copy link
Contributor Author

Hmm i see, so we can take back HTML of this test and put it on 2 Crawler , 1 with HML5Parser, 1 without and expect a difference about the result !

@victor-prdh
Copy link
Contributor Author

victor-prdh commented Mar 1, 2023

Hello,

just tried to remove the commit Merge branch 'symfony:6.3' into feature/choose-crawler as asked by fabbot, i think it's good but i have now some conflicts :/

Not sure about the request review, i didn't ask anythings but thanks for the review if you read this !

Edit --

It's not good :/ not sure about how i'm supposed to rebase this

@stof
Copy link
Member

stof commented Mar 1, 2023

git revert does not remove the merge commit. It adds a new commit that undo the merge.

See https://symfony.com/doc/current/contributing/code/pull_requests.html#rebase-your-pull-request for the instructions about rebasing your PR.

@victor-prdh
Copy link
Contributor Author

git revert does not remove the merge commit. It adds a new commit that undo the merge.

See https://symfony.com/doc/current/contributing/code/pull_requests.html#rebase-your-pull-request for the instructions about rebasing your PR.

Thanks a lot for all your help, i think i'm good now :)

@nicolas-grekas
Copy link
Member

Thank you @victor-prdh.

@Bilge
Copy link
Contributor

Bilge commented Mar 6, 2023

Anything that defaults to true is a poor default in my (pedantic) view. This might be better inverted to useXmlParser, which defaults to false (and thus uses the HTML 5 parser by default, in keeping BC).

@@ -46,6 +46,20 @@ public function testHtml5ParserWithInvalidHeadedContent(string $content)
self::assertEmpty($crawler->filterXPath('//h1')->text(), '->addHtmlContent failed as expected');
}

public function testHtml5ParserNotSameAsNativeParserForSpecificHtml()
{
// Html who create a bug specific to the DOM extension (see https://github.com/symfony/symfony/issues/28596)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ A test that relies on the presence of a bug to pass is extremely brittle. If the purpose of the test is to ensure the parser is different then that is what the assertion should be checking, even if you have to use reflection to achieve it.

nicolas-grekas added a commit that referenced this pull request Mar 13, 2023
This PR was squashed before being merged into the 6.3 branch.

Discussion
----------

[DomCrawler] Improve html5Parser tests

| Q             | A
| ------------- | ---
| Branch?       | 6.3
| Bug fix?      | no
| New feature?  | no
| Deprecations? | no
| Tickets       |  /
| License       | MIT
| Doc PR        |  /

Hi !
As mentioned by `@Bilge` in #49121 (comment), test to ensure the new html5Parser strategy wasn't the best ! So i worked a little on this subject and i come with this new proposal !

Thanks

Commits
-------

1231d75 [DomCrawler] Improve html5Parser tests
@fabpot fabpot mentioned this pull request May 1, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.