Commit 67b14ec

added extract links from pdf tutorial
1 parent 4d3c786 commit 67b14ec
7 files changed: 44 additions and 0 deletions
‎README.md‎

+1 line changed: 1 addition & 0 deletions
@@ -88,6 +88,7 @@ This is a repository of all the tutorials of [The Python Code](https://www.thepy
 - [How to Extract and Submit Web Forms from a URL using Python](https://www.thepythoncode.com/article/extracting-and-submitting-web-page-forms-in-python). ([code](web-scraping/extract-and-fill-forms))
 - [How to Get Domain Name Information in Python](https://www.thepythoncode.com/article/extracting-domain-name-information-in-python). ([code](web-scraping/get-domain-info))
 - [How to Extract YouTube Comments in Python](https://www.thepythoncode.com/article/extract-youtube-comments-in-python). ([code](web-scraping/youtube-comments-extractor))
+- [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python). ([code](web-scraping/pdf-url-extractor))
 
 - ### [Python Standard Library](https://www.thepythoncode.com/topic/python-standard-library)
 - [How to Transfer Files in the Network using Sockets in Python](https://www.thepythoncode.com/article/send-receive-files-using-sockets-python). ([code](general/transfer-files/))
Binary file not shown (5.09 MB).
Binary file not shown (757 KB).
web-scraping/pdf-url-extractor/README.md
+4 lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
+# [How to Extract All PDF Links in Python](https://www.thepythoncode.com/article/extract-pdf-links-with-python)
+To run this:
+- `pip3 install -r requirements.txt`
+- Use `pdf_link_extractor.py` to get clickable links, and `pdf_link_extractor_regex.py` to get links that are in text form.
web-scraping/pdf-url-extractor/pdf_link_extractor.py
+15 lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+import pikepdf # pip3 install pikepdf
+
+file = "1810.04805.pdf"
+# file = "1710.05006.pdf"
+pdf_file = pikepdf.Pdf.open(file)
+urls = []
+# iterate over PDF pages
+for page in pdf_file.pages:
+    for annots in page.get("/Annots") or []:  # a page without annotations has no /Annots entry
+        uri = (annots.get("/A") or {}).get("/URI")  # not every annotation carries an /A action
+        if uri is not None:
+            print("[+] URL Found:", uri)
+            urls.append(uri)
+
+print("[*] Total URLs extracted:", len(urls))
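Pages without an `/Annots` array, and annotations that have no `/A` action or no `/URI` entry (internal cross-reference links, for instance), are common, so the traversal has to tolerate missing keys. The following is a minimal sketch of that guarded traversal, not the committed script: plain Python dicts stand in for pikepdf page and annotation objects so that it runs without a PDF, and the helper name `collect_uris` and the sample data are hypothetical.

```python
def collect_uris(pages):
    """Return every /URI found in the link annotations of `pages`.

    `pages` is any iterable of mapping-like objects exposing .get();
    that is an assumption made so the sketch runs on plain dicts.
    """
    uris = []
    for page in pages:
        # a page with no annotations has no /Annots key at all
        for annot in page.get("/Annots") or []:
            action = annot.get("/A")
            if action is None:
                continue  # e.g. a link that uses /Dest instead of an action
            uri = action.get("/URI")
            if uri is not None:
                uris.append(str(uri))
    return uris


# plain dicts standing in for PDF page dictionaries (hypothetical data)
sample_pages = [
    {"/Annots": [{"/A": {"/URI": "https://example.com/a"}}, {"/Dest": "section.1"}]},
    {},  # page without annotations
    {"/Annots": [{"/A": {"/S": "/GoTo"}}]},  # action without a URI
]
print(collect_uris(sample_pages))
```

Only the first annotation on the first page carries a URI, so that is the only entry collected; the other two cases are skipped instead of raising an error.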
web-scraping/pdf-url-extractor/pdf_link_extractor_regex.py
+22 lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+import fitz # pip install PyMuPDF
+import re
+
+# a regular expression of URLs
+url_regex = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"
+# extract raw text from pdf
+# file = "1710.05006.pdf"
+file = "1810.04805.pdf"
+# open the PDF file
+with fitz.open(file) as pdf:
+    text = ""
+    for page in pdf:
+        # extract text of each PDF page
+        text += page.get_text()  # current PyMuPDF name for the deprecated getText()
+urls = []
+# extract all urls using the regular expression
+for match in re.finditer(url_regex, text):
+    url = match.group()
+    print("[+] URL Found:", url)
+    urls.append(url)
+print("[*] Total URLs extracted:", len(urls))
+
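The regex step can be tried on its own, without PyMuPDF or a PDF file. The sketch below applies the same pattern as the script to a hard-coded string; the helper name `extract_urls`, the sample text, and the order-preserving de-duplication are additions for illustration and are not part of the commit.

```python
import re

# same pattern as in the script
URL_REGEX = r"https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)"


def extract_urls(text):
    """Return the URLs found in `text`, in order of first appearance."""
    # match.group() is the whole match; the capture groups are ignored
    matches = (m.group() for m in re.finditer(URL_REGEX, text))
    # dict.fromkeys drops repeats while keeping insertion order
    return list(dict.fromkeys(matches))


# hypothetical text standing in for what was extracted from a PDF
sample_text = (
    "Code is at https://github.com/google-research/bert and "
    "mirrored at http://example.org (see https://github.com/google-research/bert again)"
)
print(extract_urls(sample_text))
```

Because the trailing character class of the pattern includes `.`, a URL that ends a sentence would be captured together with the full stop, so matches from real PDF text may need trailing punctuation stripped.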
web-scraping/pdf-url-extractor/requirements.txt
+2 lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
+pikepdf
+PyMuPDF
