@author jackzhenguo
@desc 
@date 2019/7/21

96 Greedy and non-greedy capture

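Capturing is very useful, but using it well requires distinguishing greedy capture from non-greedy capture. Greedy capture returns as many characters as possible while still satisfying the pattern; non-greedy capture returns as few characters as possible while still satisfying it.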
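As a minimal sketch of this distinction, the snippet below applies a greedy and a non-greedy quantifier to the same short string. The sample string and the variable names are made up for illustration and do not come from the original example.

```python
import re

# Hypothetical sample text, chosen only to illustrate the two behaviours.
text = "a1b2b3b"

# Greedy: .* extends as far as it can while still leaving a final "b" to match.
greedy = re.findall(r"a(.*)b", text)

# Non-greedy: .*? stops at the first "b" that lets the pattern succeed.
lazy = re.findall(r"a(.*?)b", text)

print(greedy)  # ['1b2b3']
print(lazy)    # ['1']
```

The only difference between the two patterns is the `?` after `*`, which switches the quantifier from "as much as possible" to "as little as possible".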

Let's construct an idealized example:

htmlContent = """
        <div><div><h2>这是二级标题</h2></div><div><p> 这是一个段落>/p></div></div>
"""

Greedy capture uses (.*), as shown here:

import re

pat = r"<div>(.*)</div>"

result = re.findall(pat, htmlContent)

print(result)

The result is shown here. The capture is as long as possible and does not stop at the first </div>:

['<div><h2>这是二级标题</h2></div><div><p> 这是一个段落>/p></div>']

The non-greedy form of the regular expression is <div>(.*?)</div>, used as follows:

pat = r"<div>(.*?)</div>"

result = re.findall(pat, htmlContent)

print(result)

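The result has two elements. The first match ends at the first </div>, and the search then continues and captures the second substring: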

['<div><h2>这是二级标题</h2>', 
  '<p> 这是一个段落>/p>']
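For readers who want to run the comparison in one go, here is a self-contained sketch that applies both patterns to the same sample string. The names `html_content`, `greedy` and `lazy` are assumptions introduced for this sketch; the string itself is copied unchanged from the example above.

```python
import re

# Sample string copied from the example above (variable name is an assumption).
html_content = """
        <div><div><h2>这是二级标题</h2></div><div><p> 这是一个段落>/p></div></div>
"""

# Greedy: one long capture that runs up to the last </div> on the line.
greedy = re.findall(r"<div>(.*)</div>", html_content)

# Non-greedy: each capture ends at the nearest </div>.
lazy = re.findall(r"<div>(.*?)</div>", html_content)

print(len(greedy))  # 1
print(len(lazy))    # 2

# The greedy capture still contains an inner </div>; the non-greedy ones do not.
print("</div>" in greedy[0])                 # True
print(any("</div>" in part for part in lazy))  # False
```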

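The example above only demonstrates the difference between the two. Real HTML contains newlines and other complications, so the situation is far more complex than shown here, and the greedy and non-greedy forms may not produce different results at all. It is still important to understand how they differ.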
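One concrete reason newlines matter: by default, `.` in Python's `re` module does not match a newline, so neither form can span lines unless the `re.DOTALL` flag is passed. The sketch below uses a made-up multi-line snippet (`multiline_html` and the other names are hypothetical) to show both forms returning the same empty result without the flag, and diverging again once it is set.

```python
import re

# Hypothetical multi-line HTML, made up for this illustration.
multiline_html = "<div>\n<p>first</p>\n</div>\n<div>\n<p>second</p>\n</div>"

# Without DOTALL, "." stops at "\n", so neither form reaches a closing tag.
no_flag_greedy = re.findall(r"<div>(.*)</div>", multiline_html)
no_flag_lazy = re.findall(r"<div>(.*?)</div>", multiline_html)

# With DOTALL, "." also matches "\n" and the two forms differ again.
greedy = re.findall(r"<div>(.*)</div>", multiline_html, re.DOTALL)
lazy = re.findall(r"<div>(.*?)</div>", multiline_html, re.DOTALL)

print(no_flag_greedy)  # []
print(no_flag_lazy)    # []
print(len(greedy))     # 1
print(lazy)            # ['\n<p>first</p>\n', '\n<p>second</p>\n']
```

This is only an illustration of the flag's effect; for real-world HTML a dedicated parser is usually a more robust choice than regular expressions.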

[Previous example](95.md) [Next example](97.md)