@@ -14,27 +14,29 @@ This is where web scraping comes in. Web scraping is the practice of using
 computer program to sift through a web page and gather the data that you need
 in a format most useful to you.
 
-lxml
-----
+lxml and Requests
+-----------------
 
 `lxml <http://lxml.de/>`_ is a pretty extensive library written for parsing
-XML and HTML documents, which you can easily install using ``pip``. We will
-be using its ``html`` module to get example data from this web page: `econpy.org <http://econpy.pythonanywhere.com/ex/001.html>`_.
+XML and HTML documents, and it is very fast at it; it even handles messed-up
+tags. We will also be using the `Requests <http://docs.python-requests.org/en/latest/>`_ module instead of the
+built-in ``urllib2``, due to improvements in speed and readability. You can
+easily install both using ``pip install lxml`` and ``pip install requests``.
 
-First we shall import the required modules:
+Let's start with the imports:
 
 .. code-block:: python
 
     from lxml import html
-    from urllib2 import urlopen
+    import requests
 
-We will use ``urllib2.urlopen`` to retrieve the web page with our data and
-parse it using the ``html`` module:
+Next we will use ``requests.get`` to retrieve the web page with our data,
+parse it using the ``html`` module, and save the result in ``tree``:
 
 .. code-block:: python
 
-    page = urlopen('http://econpy.pythonanywhere.com/ex/001.html')
-    tree = html.fromstring(page.read())
+    page = requests.get('http://econpy.pythonanywhere.com/ex/001.html')
+    tree = html.fromstring(page.text)
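As a quick offline check of the ``html.fromstring`` step above, here is a minimal sketch that parses a small inline snippet instead of fetching the live page; the snippet and its ``div`` contents are hypothetical stand-ins, not the real econpy.org data:

```python
from lxml import html

# Hypothetical stand-in for a downloaded page (not the real econpy.org content)
snippet = "<html><body><div title='buyer-name'>Carson Busses</div></body></html>"

# html.fromstring parses a string of HTML into an element tree,
# just like it does with page.text from Requests
tree = html.fromstring(snippet)

# The tree can then be queried, e.g. with an XPath-style find
div = tree.find('.//div')
print(div.get('title'))  # prints: buyer-name
print(div.text)          # prints: Carson Busses
```

The same ``tree`` object supports both XPath queries and CSS selectors, which is what the rest of the tutorial builds on.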
 ``tree`` now contains the whole HTML file in a nice tree structure which
 we can go over two different ways: XPath and CSSSelect. In this example, I