Commit 81154c5: Add files via upload (1 parent: 1db3451; 6 files changed, +360 −0 lines)
‎images/comparison-chart.png

427 KB

‎images/output_normal_lru.png

9.43 KB

‎images/output_normal_memoized.png

10.4 KB

‎readme.md

# **Python Cache: How to Speed Up Your Code with Effective Caching**

This article will show you how to use caching in Python with your web scraping tasks. You can read the [<u>full article</u>](https://oxylabs.io/blog/python-cache-how-to-use-effectively) on our blog, where we delve deeper into the different caching strategies.

## **How to implement a cache in Python**

There are different ways to implement caching in Python for different caching strategies. Here we’ll see two methods of Python caching for a simple web scraping example. If you’re new to web scraping, take a look at our [<u>step-by-step Python web scraping guide</u>](https://oxylabs.io/blog/python-web-scraping).

### **Install the required libraries**

We’ll use the [<u>requests library</u>](https://pypi.org/project/requests/) to make HTTP requests to a website. Install it with [<u>pip</u>](https://pypi.org/project/pip/) by entering the following command in your terminal:

```
python -m pip install requests
```

The other libraries we’ll use in this project, time and functools, ship with Python (we used Python 3.11.2), so you don’t have to install them.
### **Method 1: Python caching using a manual decorator**

A [<u>decorator</u>](https://peps.python.org/pep-0318/) in Python is a function that accepts another function as an argument and outputs a new function. We can alter the behavior of the original function using a decorator without changing its source code.

One common use case for decorators is to implement caching: the decorator creates a dictionary to store the function’s results and serves them from that cache on future calls.
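To make the idea concrete, here is a minimal decorator sketch (a hypothetical `shout` example, not part of the scraping code) showing how a wrapper changes a function’s behavior without touching its source:

```python
# Hypothetical example: a decorator that uppercases whatever the
# wrapped function returns.
def shout(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    return f"hello, {name}"

print(greet("world"))  # HELLO, WORLD
```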
Let’s start by creating a simple function that takes a URL as a function argument, requests that URL, and returns the response text:

```python
def get_html_data(url):
    response = requests.get(url)
    return response.text
```

Now, let’s move toward creating a memoized version of this function:

```python
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text
```

Here, the memoize decorator generates a cache dictionary to hold the results of previous function calls. The wrapper function checks whether the current input arguments have been previously cached and, if so, returns the cached result; otherwise, it calls the original function and caches the result before returning it.

By adding @memoize above the function definition, we use the memoize decorator to enhance the get_html_data function. This generates a new memoized function that we’ve called get_html_data_cached. It only makes a single network request for a URL and then stores the response in the cache for further requests.
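You can verify that the wrapper really short-circuits repeat calls with a small self-contained check (the decorator is repeated here so the snippet runs on its own, and a counter stands in for the network request):

```python
# Self-contained check of the memoize pattern: the wrapped function's
# body runs only once per distinct argument tuple.
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        result = func(*args)
        cache[args] = result
        return result

    return wrapper

calls = 0

@memoize
def fake_fetch(url):
    global calls
    calls += 1
    return f'<html>{url}</html>'

fake_fetch('https://books.toscrape.com/')
fake_fetch('https://books.toscrape.com/')
print(calls)  # 1 -- the second call was served from the cache
```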
Let’s use the time module to compare the execution speeds of the get_html_data function and the memoized get_html_data_cached function:

```python
import time

start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```

Here’s what the complete code looks like:
```python
# Import the required modules
import time
import requests

# Function to get the HTML Content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoize function to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function to store the data in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper

# Memoized function to get the HTML Content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text

# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```

Here’s the output:

![Output of the normal and memoized (manual decorator) functions](images/output_normal_memoized.png)
Notice the time difference between the two functions. Both take almost the same time here, because the advantage of caching only shows up when the data is accessed again.

Since we’re making only one request, the memoized function still has to fetch the data over the network the first time. Therefore, with our example, a significant time difference in execution isn’t expected. However, if you increase the number of calls to these functions, the time difference will significantly increase (see [<u>Performance Comparison</u>](#performance-comparison)).
### **Method 2: Python caching using LRU cache decorator**

Another method to implement caching in Python is to use the built-in @lru_cache decorator from functools. This decorator implements a cache using the least recently used (LRU) caching strategy: the cache has a fixed size, and when it fills up, the entries that haven’t been used recently are discarded first.

To use the @lru_cache decorator, we can create a new function for extracting HTML content and place the decorator name at the top. Make sure to import the functools module before using the decorator:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text
```
In the above example, the get_html_data_lru function is memoized using the @lru_cache decorator. Because the maxsize option is set to None, the cache can grow indefinitely.
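If you instead want a bounded cache, pass a number as maxsize; the decorator also exposes a cache_info() method for inspecting hits, misses, and the current size. A small sketch with a toy function (not the scraper):

```python
from functools import lru_cache

# With maxsize=2, the least recently used entry is evicted as soon as
# a third distinct argument arrives.
@lru_cache(maxsize=2)
def square(n):
    return n * n

square(1)
square(2)   # cache now holds results for 1 and 2
square(3)   # evicts 1, the least recently used entry
square(1)   # a miss again, so it is recomputed (and 2 is evicted)
print(square.cache_info())  # hits=0, misses=4, maxsize=2, currsize=2
```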
With the decorator in place above the get_html_data_lru function, here’s the complete code sample:
```python
# Import the required modules
from functools import lru_cache
import time
import requests

# Function for getting HTML Content
def get_html_data(url):
    response = requests.get(url)
    return response.text

# Memoized using LRU Cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text

# Getting time for Normal function to extract HTML content
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Getting time for Memoized function (LRU cache) to extract HTML content
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):', time.time() - start_time)
```

This produced the following output:

![Output of the normal and memoized (LRU cache) functions](images/output_normal_lru.png)
### **Performance comparison**

In the following table, we’ve measured the execution times of all three functions for different numbers of requests:

| **No. of requests** | **Time taken by normal function** | **Time taken by memoized function (manual decorator)** | **Time taken by memoized function (lru_cache decorator)** |
|---------------------|-----------------------------------|--------------------------------------------------------|-----------------------------------------------------------|
| 1                   | 2.1 seconds                       | 2.0 seconds                                            | 1.7 seconds                                               |
| 10                  | 17.3 seconds                      | 2.1 seconds                                            | 1.8 seconds                                               |
| 20                  | 32.2 seconds                      | 2.2 seconds                                            | 2.1 seconds                                               |
| 30                  | 57.3 seconds                      | 2.22 seconds                                           | 2.12 seconds                                              |
As the number of requests to the functions increases, you can see a significant reduction in execution times using the caching strategy. The following comparison chart depicts these results:

![Comparison chart of execution times](images/comparison-chart.png)
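The timings above come from calling each function repeatedly in a loop, roughly like the sketch below. Here a time.sleep stands in for the real network request, so the absolute numbers are illustrative only; the cached version pays the cost once and serves every later call from memory:

```python
import time
from functools import lru_cache

def slow_fetch(url):
    time.sleep(0.01)  # stand-in for a real network request
    return f'<html>{url}</html>'

@lru_cache(maxsize=None)
def slow_fetch_cached(url):
    return slow_fetch(url)

N = 20

# Time N calls to the uncached function.
start = time.time()
for _ in range(N):
    slow_fetch('https://books.toscrape.com/')
normal = time.time() - start

# Time N calls to the cached function: only the first call is slow.
start = time.time()
for _ in range(N):
    slow_fetch_cached('https://books.toscrape.com/')
cached = time.time() - start

print(f'normal: {normal:.3f}s, cached: {cached:.3f}s')
```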
The comparison results clearly show that using a caching strategy in your code can significantly improve overall performance and speed.

Feel free to visit our [<u>blog</u>](https://oxylabs.io/blog) for an array of intriguing web scraping topics that will keep you hooked!

‎src/lru_caching.py

```python
# Import the required modules
from functools import lru_cache
import time
import requests


# Function to get the HTML Content
def get_html_data(url):
    response = requests.get(url)
    return response.text


# Memoized using LRU Cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text


# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (LRU cache)
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):', time.time() - start_time)
```

‎src/manual_decorator_caching.py

```python
# Import the required modules
from functools import lru_cache
import time
import requests


# Function to get the HTML Content
def get_html_data(url):
    response = requests.get(url)
    return response.text


# Memoize function to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function to store the data in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper


# Memoized function to get the HTML Content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text


# Get the time it took for a normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Get the time it took for a memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):', time.time() - start_time)
```
