-
-
Notifications
You must be signed in to change notification settings - Fork 9.6k
[DomCrawler] Give choice of used parser #49121
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hey! To help keep things organized, we don't allow "Draft" pull requests. Could you please click the "ready for review" button or close this PR and open a new one when you are done? Note that a pull request does not have to be "perfect" or "ready for merge" when you first open it. We just want it to be ready for a first review. Cheers! Carsonbot |
The name |
Oh yes, didn't think about this case ! What’s about Thanks for your proofreading. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some CS changes suggested by fabbot are false positives and should be ignored.
Can you please also add some test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the changes. Here is my review as a diff:
diff --git a/src/Symfony/Component/BrowserKit/AbstractBrowser.php b/src/Symfony/Component/BrowserKit/AbstractBrowser.php
index b1d0f0d21b..ec3a7e3523 100644
--- a/src/Symfony/Component/BrowserKit/AbstractBrowser.php
+++ b/src/Symfony/Component/BrowserKit/AbstractBrowser.php
@@ -37,7 +37,7 @@ abstract class AbstractBrowser
protected $internalResponse;
protected $response;
protected $crawler;
- protected bool $isHtml5ParserUsable = true;
+ protected bool $useHtml5Parser = true;
protected $insulated = false;
protected $redirect;
protected $followRedirects = true;
@@ -209,13 +209,11 @@ abstract class AbstractBrowser
}
/**
- * @param bool $isHtml5ParserUsable set usage of masterminds/html5 for crawler
- *
* @return $this
*/
- public function useHtml5Parser(bool $isHtml5ParserUsable): static
+ public function useHtml5Parser(bool $useHtml5Parser): static
{
- $this->isHtml5ParserUsable = $isHtml5ParserUsable;
+ $this->useHtml5Parser = $useHtml5Parser;
return $this;
}
@@ -510,7 +508,7 @@ abstract class AbstractBrowser
return null;
}
- $crawler = new Crawler(null, $uri, null, $this->isHtml5ParserUsable());
+ $crawler = new Crawler(null, $uri, null, $this->useHtml5Parser);
$crawler->addContent($content, $type);
return $crawler;
diff --git a/src/Symfony/Component/DomCrawler/CHANGELOG.md b/src/Symfony/Component/DomCrawler/CHANGELOG.md
index 61a4531704..be1c0ba143 100644
--- a/src/Symfony/Component/DomCrawler/CHANGELOG.md
+++ b/src/Symfony/Component/DomCrawler/CHANGELOG.md
@@ -4,8 +4,7 @@ CHANGELOG
6.3
---
- * Add `isHtml5ParserUsable` properties to `Crawler`
- to check if `Masterminds\HTML5` is usable
+ * Add `$useHtml5Parser` argument to `Crawler`
* Add `CrawlerSelectorCount` test constraint
* Add argument `$normalizeWhitespace` to `Crawler::innerText()`
* Make `Crawler::innerText()` return the first non-empty text
diff --git a/src/Symfony/Component/DomCrawler/Crawler.php b/src/Symfony/Component/DomCrawler/Crawler.php
index 35a31cbe2d..2719340472 100644
--- a/src/Symfony/Component/DomCrawler/Crawler.php
+++ b/src/Symfony/Component/DomCrawler/Crawler.php
@@ -58,28 +58,16 @@ class Crawler implements \Countable, \IteratorAggregate
*/
private bool $isHtml = true;
- /**
- * Whether the Crawler use masterminds/html5 or not.
- */
- private bool $isHtml5ParserUsable = true;
-
- private HTML5 $html5Parser;
+ private ?HTML5 $html5Parser = null;
/**
* @param \DOMNodeList|\DOMNode|\DOMNode[]|string|null $node A Node to use as the base for the crawling
- * @param $isHtml5ParserUsable Set usage of masterminds/html5 or not
*/
- public function __construct(\DOMNodeList|\DOMNode|array|string $node = null, string $uri = null, string $baseHref = null, bool $useHtml5Parser = null)
+ public function __construct(\DOMNodeList|\DOMNode|array|string $node = null, string $uri = null, string $baseHref = null, bool $useHtml5Parser = true)
{
- $this->isHtml5ParserUsable = $useHtml5Parser ?? $this->isHtml5ParserUsable;
-
- if (!class_exists(HTML5::class)) {
- throw new \LogicException('To parse data with HTML5 parser, install the masterminds/html5 component ("composer require masterminds/html5").');
- }
-
$this->uri = $uri;
$this->baseHref = $baseHref ?: $uri;
- $this->html5Parser = new HTML5(['disable_html_ns' => true]);
+ $this->html5Parser = $useHtml5Parser ? new HTML5(['disable_html_ns' => true]) : null;
$this->cachedNamespaces = new \ArrayObject();
$this->add($node);
@@ -591,7 +579,7 @@ class Crawler implements \Countable, \IteratorAggregate
/**
* Returns only the inner text that is the direct descendent of the current node, excluding any child nodes.
- *
+ *
* @param bool $normalizeWhitespace Whether whitespaces should be trimmed and normalized to single spaces
*/
public function innerText(/* bool $normalizeWhitespace = true */): string
@@ -633,7 +621,7 @@ class Crawler implements \Countable, \IteratorAggregate
$node = $this->getNode(0);
$owner = $node->ownerDocument;
- if ('<!DOCTYPE html>' === $owner->saveXML($owner->childNodes[0])) {
+ if ($this->html5Parser && '<!DOCTYPE html>' === $owner->saveXML($owner->childNodes[0])) {
$owner = $this->html5Parser;
}
@@ -654,7 +642,7 @@ class Crawler implements \Countable, \IteratorAggregate
$node = $this->getNode(0);
$owner = $node->ownerDocument;
- if ('<!DOCTYPE html>' === $owner->saveXML($owner->childNodes[0])) {
+ if ($this->html5Parser && '<!DOCTYPE html>' === $owner->saveXML($owner->childNodes[0])) {
$owner = $this->html5Parser;
}
@@ -1227,7 +1215,7 @@ class Crawler implements \Countable, \IteratorAggregate
private function canParseHtml5String(string $content): bool
{
- if (false === $this->isHtml5ParserUsable) {
+ if (!$this->html5Parser) {
return false;
}
Thanks for the review, i will add the tests ASAP ! |
Hello, I come back to you because I don't know how I can test this :/ Thanks for your help |
We should figure out some HTML string that is parsed differently between both parsers I suppose. |
Hmm i see, so we can take back HTML of this test and put it on 2 |
Hello, just tried to remove the commit Not sure about the request review, i didn't ask anythings but thanks for the review if you read this ! Edit -- It's not good :/ not sure about how i'm supposed to rebase this |
See https://symfony.com/doc/current/contributing/code/pull_requests.html#rebase-your-pull-request for the instructions about rebasing your PR. |
Thanks a lot for all your help, i think i'm good now :) |
Thank you @victor-prdh. |
Anything that defaults to |
@@ -46,6 +46,20 @@ public function testHtml5ParserWithInvalidHeadedContent(string $content) | ||
self::assertEmpty($crawler->filterXPath('//h1')->text(), '->addHtmlContent failed as expected'); | ||
} | ||
|
||
public function testHtml5ParserNotSameAsNativeParserForSpecificHtml() | ||
{ | ||
// Html who create a bug specific to the DOM extension (see https://github.com/symfony/symfony/issues/28596) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR was squashed before being merged into the 6.3 branch. Discussion ---------- [DomCrawler] Improve html5Parser tests | Q | A | ------------- | --- | Branch? | 6.3 | Bug fix? | no | New feature? | no | Deprecations? | no | Tickets | / | License | MIT | Doc PR | / Hi ! As mentioned by `@Bilge` in #49121 (comment), test to ensure the new html5Parser strategy wasn't the best ! So i worked a little on this subject and i come with this new proposal ! Thanks Commits ------- 1231d75 [DomCrawler] Improve html5Parser tests
Hi,
This first commit is more like than a POC and possible implementation of the feature. I would like to have some feedback before start adding tests and doc for this new feature.
Thanks !