Commit faae04c (1 parent: 2a9c732): Added scenario about web scraping using lxml
docs/scenarios/scrape.rst (+82 lines, 82 additions & 0 deletions)

HTML Scraping
=============

Web Scraping
------------

Web sites are written using HTML, which means that each web page is a
structured document. Sometimes it would be great to obtain some data from
them and preserve the structure while we're at it, but this isn't always
easy - web sites don't often provide their data in convenient formats
such as `.csv`.

This is where web scraping comes in. Web scraping is the practice of using
a computer program to sift through a web page and gather the data that you
need in a format most useful to you.

lxml
----

`lxml <http://lxml.de/>`_ is an extensive library for parsing XML and
HTML documents, which you can easily install using `pip`. We will be using
its `html` module to get data from this web page:
`econpy <http://econpy.pythonanywhere.com/ex/001.html>`_.

First we shall import the required modules:

.. code-block:: python

    from lxml import html
    from urllib2 import urlopen

We will use `urllib2.urlopen` to retrieve the web page with our data and
parse it using the `html` module:

.. code-block:: python

    page = urlopen('http://econpy.pythonanywhere.com/ex/001.html')
    tree = html.fromstring(page.read())

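Note that the snippet above targets Python 2, where `urlopen` lives in `urllib2`. On Python 3 that module was merged into `urllib.request`; a small sketch of a version-tolerant import (an assumption on our part, nothing else in the scraping code needs to change):

```python
# urllib2 exists only on Python 2; on Python 3 urlopen moved to
# urllib.request. This try/except picks whichever is available.
try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3

print(callable(urlopen))  # True
```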
`tree` now contains the whole HTML file in a nice tree structure which
we can go over in many different ways, one of which is XPath. XPath is a
way of locating information in structured documents such as HTML or XML
pages. A good introduction to XPath is available
`here <http://www.w3schools.com/xpath/default.asp>`_.
There are also various tools for obtaining the XPath of elements, such as
FireBug for Firefox; in Chrome you can right-click an element, choose
'Inspect element', highlight the code, then right-click again and choose
'Copy XPath'.

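As a quick, self-contained illustration of XPath-style queries, here is a sketch using the standard library's `xml.etree.ElementTree`, whose `findall` supports a limited XPath subset - handy for experimenting when lxml isn't installed. The snippet data is made up and only mimics the structure of the real page:

```python
import xml.etree.ElementTree as ET

# A tiny stand-in document (hypothetical data; the real page is the
# econpy URL above).
snippet = """<html><body>
  <div title="buyer-name">Alice</div>
  <span class="item-price">$9.99</span>
  <div title="buyer-name">Bob</div>
  <span class="item-price">$4.50</span>
</body></html>"""

root = ET.fromstring(snippet)
# ElementTree understands simple predicates like [@attrib='value'].
names = [el.text for el in root.findall(".//div[@title='buyer-name']")]
print(names)  # ['Alice', 'Bob']
```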

After a quick analysis, we see that in our page the data is contained in
two elements - one is a div with title 'buyer-name' and the other is a
span with class 'item-price'. Knowing this we can create the correct XPath
query and use the lxml `xpath` function like this:

.. code-block:: python

    # This will create a list of buyers:
    buyers = tree.xpath('//div[@title="buyer-name"]/text()')
    # This will create a list of prices:
    prices = tree.xpath('//span[@class="item-price"]/text()')

Let's see what we got exactly:

.. code-block:: python

    print 'Buyers: ', buyers
    print 'Prices: ', prices

::

    Buyers:  ['Carson Busses', 'Earl E. Byrd', 'Patty Cakes',
    'Derri Anne Connecticut', 'Moe Dess', 'Leda Doggslife', 'Dan Druff',
    'Al Fresco', 'Ido Hoe', 'Howie Kisses', 'Len Lease', 'Phil Meup',
    'Ira Pent', 'Ben D. Rules', 'Ave Sectomy', 'Gary Shattire',
    'Bobbi Soks', 'Sheila Takya', 'Rose Tattoo', 'Moe Tell']

    Prices:  ['$29.95', '$8.37', '$15.26', '$19.25', '$19.25',
    '$13.99', '$31.57', '$8.49', '$14.47', '$15.86', '$11.11',
    '$15.98', '$16.27', '$7.50', '$50.85', '$14.26', '$5.68',
    '$15.00', '$114.07', '$10.09']

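The price strings are easy to turn into numbers for further analysis - strip the leading dollar sign and convert. A sketch using a short sample of the list above:

```python
# Hypothetical follow-up: turn price strings into floats for analysis.
prices = ['$29.95', '$8.37', '$15.26']   # sample of the scraped list

amounts = [float(p.lstrip('$')) for p in prices]
total = sum(amounts)
print(amounts)           # [29.95, 8.37, 15.26]
print(round(total, 2))   # 53.58
```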

Congratulations! We have successfully scraped all the data we wanted from
a web page using lxml, and we have it stored in memory as two lists. Now
we can either continue working on it, analyzing it with Python, or we can
export it to a file and share it with friends.
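
Exporting could look like the following sketch, which pairs buyers with prices and writes them out as CSV using only the standard library (the two short lists are a sample of the scraped data above; an in-memory buffer stands in for a real file):

```python
import csv
import io

# Sample of the scraped data shown above (hypothetical truncation).
buyers = ['Carson Busses', 'Earl E. Byrd']
prices = ['$29.95', '$8.37']

# Pair each buyer with the matching price, row by row.
rows = list(zip(buyers, prices))

buf = io.StringIO()  # an open file object would work the same way
writer = csv.writer(buf)
writer.writerow(['buyer', 'price'])
writer.writerows(rows)

print(buf.getvalue())
```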
