# **Python Cache: How to Speed Up Your Code with Effective Caching**

This article will show you how to use caching in Python with your web scraping tasks. You can read the [<u>full article</u>](https://oxylabs.io/blog/python-cache-how-to-use-effectively) on our blog, where we delve deeper into the different caching strategies.

## **How to implement a cache in Python**

There are different ways to implement caching in Python for different caching strategies. Here we’ll see two methods of Python caching for a simple web scraping example. If you’re new to web scraping, take a look at our [<u>step-by-step Python web scraping guide</u>](https://oxylabs.io/blog/python-web-scraping).

### **Install the required libraries**

We’ll use the [<u>requests library</u>](https://pypi.org/project/requests/) to make HTTP requests to a website. Install it with [<u>pip</u>](https://pypi.org/project/pip/) by entering the following command in your terminal:

```bash
python -m pip install requests
```

The other libraries we’ll use in this project, time and functools, are part of the Python standard library (we tested with Python 3.11.2), so you don’t have to install them.

### **Method 1: Python caching using a manual decorator**

A [<u>decorator</u>](https://peps.python.org/pep-0318/) in Python is a function that accepts another function as an argument and returns a new function. Using a decorator, we can alter the behavior of the original function without changing its source code.

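To see the pattern in isolation, here’s a minimal sketch of a decorator; the `announce` and `add` names are our own illustrative choices, not part of the scraping example:

```python
import functools


def announce(func):
    # functools.wraps copies over the wrapped function's name and docstring
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        print(f'Calling {func.__name__}...')
        return func(*args, **kwargs)
    return wrapper


@announce
def add(a, b):
    return a + b


print(add(2, 3))  # prints "Calling add..." and then 5
```

The decorated `add` still computes the same result; the decorator only wraps extra behavior around the call.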
One common use case for decorators is implementing caching: the decorator creates a dictionary that stores the function’s results, then serves them from the cache on future calls.

Let’s start by creating a simple function that takes a URL as a function argument, requests that URL, and returns the response text:

```python
import requests


def get_html_data(url):
    response = requests.get(url)
    return response.text
```

Now, let’s create a memoized version of this function:

```python
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper


@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text
```

Here, we define a memoize decorator that creates a cache dictionary to hold the results of previous function calls. The wrapper function checks whether the current arguments have been cached before and, if so, returns the cached result. If not, it calls the original function and caches the result before returning it.

By adding @memoize above the function definition, we apply the memoize decorator to produce a new memoized function, which we’ve called get_html_data_cached. It makes only a single network request for a given URL and then serves the stored response from the cache on subsequent requests.

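As a quick sanity check of the pattern (using a toy function of our own in place of a network request), the wrapped function only runs once per distinct argument:

```python
def memoize(func):
    cache = {}

    def wrapper(*args):
        if args in cache:
            return cache[args]
        result = func(*args)
        cache[args] = result
        return result

    return wrapper


call_count = 0


@memoize
def slow_square(n):
    # Stand-in for an expensive operation such as an HTTP request
    global call_count
    call_count += 1
    return n * n


print(slow_square(4))  # computed on the first call: 16
print(slow_square(4))  # served from the cache: 16
print(call_count)      # the underlying function ran only once: 1
```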
Let’s use the time module to compare the execution speeds of the get_html_data function and the memoized get_html_data_cached function:

```python
import time

start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):',
      time.time() - start_time)
```

Here’s what the complete code looks like:

```python
# Import the required modules
import time

import requests


# Function to get the HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text


# Memoize decorator to cache the data
def memoize(func):
    cache = {}

    # Inner wrapper function that stores results in the cache
    def wrapper(*args):
        if args in cache:
            return cache[args]
        else:
            result = func(*args)
            cache[args] = result
            return result

    return wrapper


# Memoized function to get the HTML content
@memoize
def get_html_data_cached(url):
    response = requests.get(url)
    return response.text


# Time the normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Time the memoized function (manual decorator)
start_time = time.time()
get_html_data_cached('https://books.toscrape.com/')
print('Time taken (memoized function using manual decorator):',
      time.time() - start_time)
```

Here’s the output:

Notice the time difference between the two functions: both take almost the same time, because the real benefit of caching only shows up when the data is accessed again.

Since we’re making only one request, the memoized function still has to fetch the data over the network on this first call, so a significant difference in execution time isn’t expected in our example. However, if you increase the number of calls to these functions, the time difference will grow significantly (see [<u>Performance Comparison</u>](#performance-comparison)).

### **Method 2: Python caching using LRU cache decorator**

Another method to implement caching in Python is to use the built-in @lru_cache decorator from functools. It implements caching with the least recently used (LRU) strategy: the cache has a fixed size, and when it’s full, the entry that hasn’t been used for the longest time is discarded to make room.

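A small sketch (with an illustrative function of our own) shows the eviction behavior with a deliberately tiny cache:

```python
from functools import lru_cache

evaluations = []  # records every time the function body actually runs


@lru_cache(maxsize=2)
def double(n):
    evaluations.append(n)
    return n * 2


double(1)
double(2)
double(3)  # cache is full, so the least recently used entry (1) is evicted
double(1)  # recomputed, because 1 was evicted above
print(evaluations)  # [1, 2, 3, 1]
```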
To use the @lru_cache decorator, we can create a new function for extracting HTML content and place the decorator name at the top. Make sure to import functools before using the decorator:

```python
from functools import lru_cache

import requests


@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text
```

In the above example, the get_html_data_lru method is memoized using the @lru_cache decorator. With the maxsize option set to None, the cache can grow without bound.

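The decorator also exposes a cache_info() method for inspecting hits and misses; here’s a brief sketch with a toy function of our own:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def square(n):
    return n * n


square(3)  # miss: computed and cached
square(3)  # hit: served from the cache
square(5)  # miss: computed and cached
print(square.cache_info())  # CacheInfo(hits=1, misses=2, maxsize=None, currsize=2)
```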
To use the @lru_cache decorator, just add it above the get_html_data_lru function. Here’s the complete code sample:

```python
# Import the required modules
from functools import lru_cache
import time

import requests


# Function for getting HTML content
def get_html_data(url):
    response = requests.get(url)
    return response.text


# Memoized using LRU cache
@lru_cache(maxsize=None)
def get_html_data_lru(url):
    response = requests.get(url)
    return response.text


# Time the normal function
start_time = time.time()
get_html_data('https://books.toscrape.com/')
print('Time taken (normal function):', time.time() - start_time)

# Time the memoized function (LRU cache)
start_time = time.time()
get_html_data_lru('https://books.toscrape.com/')
print('Time taken (memoized function with LRU cache):',
      time.time() - start_time)
```

This produced the following output:

### **Performance comparison**

In the following table, we measured the execution times of all three functions for different numbers of requests:

| **No. of requests** | **Time taken by normal function** | **Time taken by memoized function (manual decorator)** | **Time taken by memoized function (lru_cache decorator)** |
|---------------------|-----------------------------------|--------------------------------------------------------|-----------------------------------------------------------|
| 1                   | 2.1 seconds                       | 2.0 seconds                                            | 1.7 seconds                                               |
| 10                  | 17.3 seconds                      | 2.1 seconds                                            | 1.8 seconds                                               |
| 20                  | 32.2 seconds                      | 2.2 seconds                                            | 2.1 seconds                                               |
| 30                  | 57.3 seconds                      | 2.22 seconds                                           | 2.12 seconds                                              |

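Your exact numbers will depend on network conditions, but the shape of this benchmark can be reproduced offline. In this sketch of ours, time.sleep stands in for the network round trip and the function names are illustrative:

```python
import time
from functools import lru_cache


def fetch(url):
    time.sleep(0.05)  # stands in for a network round trip
    return f'<html>{url}</html>'


@lru_cache(maxsize=None)
def fetch_cached(url):
    return fetch(url)


n_requests = 20

start = time.time()
for _ in range(n_requests):
    fetch('https://books.toscrape.com/')
normal_time = time.time() - start

start = time.time()
for _ in range(n_requests):
    fetch_cached('https://books.toscrape.com/')
cached_time = time.time() - start

# The cached loop pays the 0.05 s cost only once; the normal loop pays it every call
print(f'normal: {normal_time:.2f}s, cached: {cached_time:.2f}s')
```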
As the number of requests to the functions increases, you can see a significant reduction in execution times using the caching strategy. The following comparison chart depicts these results:

The comparison results clearly show that using a caching strategy in your code can significantly improve overall performance and speed.

Feel free to visit our [<u>blog</u>](https://oxylabs.io/blog) for an array of intriguing web scraping topics that will keep you hooked!