Using multiprocessing to speed up website requests in Python - python

I am using the requests module to download the content of many websites, which looks something like this:
import requests
for i in range(1000):
    url = base_url + f"{i}.anything"
    r = requests.get(url)
Of course this is simplified, but the base URL is always the same and I only want to download one file per URL, for example an image.
This takes very long because of the number of iterations. The internet connection is not the problem; rather, it is the time it takes to start each request, etc.
So I was thinking about something like multiprocessing, because the task is basically always the same and I imagine it would be a lot faster when run in parallel.
Is this somehow doable?
Thanks in advance!

I would suggest that in this case lightweight threads would be a better fit than processes, since the work is mostly waiting on the network. When I ran the request against a certain URL 5 times, the results were:
Threads: Finished in 0.24 second(s)
MultiProcess: Finished in 0.77 second(s)
Your implementation can be something like this:
import concurrent.futures
import requests
from bs4 import BeautifulSoup
import time
def access_url(url, No):
    print(f"{No}:==> {url}")
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features='lxml')
    return "{} : {}".format(No, str(soup.title)[7:50])
if __name__ == "__main__":
    test_url = "http://bla bla.com/"
    base_url = test_url
    THREAD_MULTI_PROCESSING = True
    start = time.perf_counter()  # start time for the thread run
    url_list = [base_url for i in range(5)]  # arguments as lists so that map can be used
    url_counter = [i for i in range(5)]  # arguments as lists so that map can be used
    if THREAD_MULTI_PROCESSING:
        with concurrent.futures.ThreadPoolExecutor() as executor:  # in this case threads would be better
            results = executor.map(access_url, url_list, url_counter)
            for result in results:
                print(result)
    end = time.perf_counter()  # finish time for the thread run
    print(f'Threads: Finished in {round(end - start,2)} second(s)')
    start = time.perf_counter()
    PROCESS_MULTI_PROCESSING = True
    if PROCESS_MULTI_PROCESSING:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(access_url, url_list, url_counter)
            for result in results:
                print(result)
    end = time.perf_counter()
    print(f'MultiProcess: Finished in {round(end - start,2)} second(s)')
I think you will see better performance in your case.
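Applied to the loop in the question, a thread pool might look like the minimal sketch below. It is not the asker's actual code: `download_all`, `fetch` and the worker count of 16 are assumptions made for illustration, and the standard library's `urlopen` stands in for `requests.get` so the block has no third-party dependency.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL and return its body as bytes (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def download_all(urls, fetch=fetch, max_workers=16):
    """Fetch every URL on a pool of threads; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # map() hands each URL to a worker thread and yields results in order.
        return list(executor.map(fetch, urls))
```

With the question's naming this would be called as `download_all([base_url + f"{i}.anything" for i in range(1000)])`. Note that `executor.map` re-raises the first exception it meets while the results are collected, so a real script would probably catch errors inside `fetch`.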

Related

Multithreading hangs when used with requests module and large number of threads

I have about 2000 urls that I am trying to scrape using the requests module. To speed up the process, I am using the ThreadPoolExecutor from concurrent.futures. The execution hangs in the middle when I run this and the issue is inconsistent too. Sometimes, it finishes smoothly within 2 minutes but other times, it just gets stuck at a point for over 30 mins and I eventually have to kill the process.
# scraper.py
import requests
def get_content(url):
    try:
        res = requests.get(url)
        res = res.content
        return res
    except:
        return ""
# main.py
import concurrent.futures
from scraper import get_content
if __name__ == "__main__":
    # content > an empty list for output
    # urls > a list of urls
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        results = executor.map(get_content, urls)
        for res in results:
            content.append(res)  # list.append returns None, so do not reassign content
    print(content)
I want to understand how to debug this. Why and where is it getting stuck? And also, why is it inconsistent?
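One way to find out where such a run is stuck is to give every request a timeout and to ask the executor which futures are still unfinished after a deadline. The sketch below is only an illustration under assumptions: `find_stuck`, `fetch`, the 32-worker pool and the deadline are hypothetical names and numbers, not part of the question. It relies on the fact that `requests.get` waits indefinitely unless a `timeout` is passed, which is a common cause of this kind of intermittent hang; the default `fetch` here uses the standard library's `urlopen` with an explicit timeout instead.

```python
from concurrent.futures import ThreadPoolExecutor, wait
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one URL; the timeout keeps a silent server from blocking forever."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def find_stuck(urls, fetch=fetch, max_workers=32, deadline=60):
    """Return (results, failed, stuck) after waiting at most `deadline` seconds.

    results maps url -> body, failed maps url -> exception, and stuck lists
    the URLs whose requests were still running or queued at the deadline.
    """
    executor = ThreadPoolExecutor(max_workers=max_workers)
    future_to_url = {executor.submit(fetch, url): url for url in urls}
    done, not_done = wait(future_to_url, timeout=deadline)
    results, failed = {}, {}
    for future in done:
        url = future_to_url[future]
        error = future.exception()
        if error is None:
            results[url] = future.result()
        else:
            failed[url] = error
    stuck = [future_to_url[future] for future in not_done]
    # Do not block on the stragglers; cancel_futures needs Python 3.9 or later.
    executor.shutdown(wait=False, cancel_futures=True)
    return results, failed, stuck
```

Printing `stuck` shows which URLs were still outstanding when the deadline passed, and `failed` shows the errors that the bare `except` in the original code would have hidden.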

How does Django handle multiple requests

This is not a duplicate of this question
I am trying to understand how Django handles multiple requests. According to this answer, Django is supposed to block parallel requests. But I have found this is not exactly true, at least for Django 3.1. I am using Django's built-in server.
In my code (view.py) I have a blocking code block that is only triggered in a particular situation, and the request takes a very long time to complete in that case. This is the code for view.py:
from django.shortcuts import render
import numpy as np
def insertionSort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        j = i-1
        while j >=0 and key < arr[j] :
            arr[j+1] = arr[j]
            j -= 1
        arr[j+1] = key
def home(request):
    a = request.user.username
    print(a)
    id = int(request.GET.get('id',''))
    if id ==1:
        arr = np.arange(100000)
        arr = arr[::-1]
        insertionSort(arr)
        # print ("Sorted array is:")
        # for i in range(len(arr)):
        #     print ("%d" %arr[i])
    return render(request,'home/home.html')
So the blocking code block executes only for id=1; in all other cases the view is supposed to respond normally.
What I found is that if I make two parallel requests, one with id=1 and another with id=2, the second request is not really blocked but takes longer to get data from Django: about 2.5 s while a blocking request is running in parallel, versus about 0.02 s otherwise.
This is my Python code to make the requests:
Malicious request:
import time
from concurrent.futures import as_completed
from pprint import pprint
from requests_futures.sessions import FuturesSession
session = FuturesSession()
futures=[session.get(f'http://127.0.0.1:8000/?id=1') for i in range(3)]
start = time.time()
for future in as_completed(futures):
    resp = future.result()
    # pprint({
    #     'url': resp.request.url,
    #     'content': resp.json(),
    # })
    roundtrip = time.time() - start
    print (roundtrip)
Normal request:
import logging
import threading
import time
import requests
if __name__ == "__main__":
    # start = time.time()
    while(True):
        print(requests.get("http://127.0.0.1:8000/?id=2").elapsed.total_seconds())
        time.sleep(2)
I will be grateful if anyone can explain how Django is serving the parallel requests in this case.
The development server handles each request in its own thread by default; you can pass --nothreading to runserver to turn that off. From what you described, it's possible the blocking task simply finished in about 2 seconds. An easier way to test is to put time.sleep(10) in the view in place of the sort.
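The difference between a threaded and a non-threaded server can be seen without Django at all. The sketch below is an analogy built on the standard library's `http.server`, not Django's own code; `Handler`, `get`, `fast_request_latency` and the one-second sleep are hypothetical stand-ins for the view and the two client scripts in the question.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, HTTPServer, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path.endswith("id=1"):
            time.sleep(1.0)  # stand-in for the slow, blocking view
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def get(port, path):
    """Send one GET request to the local test server and return the body."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    conn.request("GET", path)
    body = conn.getresponse().read()
    conn.close()
    return body


def fast_request_latency(server_class):
    """Time an id=2 request issued while an id=1 request is still running."""
    server = server_class(("127.0.0.1", 0), Handler)  # port 0: pick a free port
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    slow = threading.Thread(target=get, args=(port, "/?id=1"))
    slow.start()
    time.sleep(0.2)  # give the slow request time to reach the server first
    start = time.perf_counter()
    get(port, "/?id=2")
    elapsed = time.perf_counter() - start
    slow.join()
    server.shutdown()
    server.server_close()
    return elapsed
```

`fast_request_latency(ThreadingHTTPServer)` should come back almost immediately, while `fast_request_latency(HTTPServer)` should take most of the remaining sleep, because the single-threaded server cannot start the second request until the first has finished. In the question the slow view is CPU-bound rather than sleeping, so even with threads both requests share one interpreter, which may be why the fast request slows down instead of either blocking completely or staying at 0.02 s.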

Multicore processing on scraper function

I was hoping to speed up my scraper by using multiple cores so multiple cores could scrape from the URLs in a list I have using a predefined function scrape. How would I do this?
Here is my current code:
for x in URLs['identifier'][1:365]:
    test= scrape(x)
    results = test.get_results
    results['identifier'] = x
    final= final.append(results)
Something like this (or you could also use Scrapy).
It will easily let you make a lot of requests in parallel, provided the server can handle it as well:
# thread_map is just a wrapper around concurrent.futures.ThreadPoolExecutor with a nice tqdm progress bar!
from tqdm.contrib.concurrent import thread_map, process_map  # for multi-threading and multi-processing respectively
def chunk_list(lst, size):
    for i in range(0, len(lst), size):
        yield lst[i:i + size]
# huge_list, which_func_to_call and your_cpu_cores are placeholders to fill in
for idx, my_chunk in enumerate(chunk_list(huge_list, size=2**12)):
    for response in thread_map(which_func_to_call, my_chunk, max_workers=your_cpu_cores + 6):
        # which_func_to_call -> wrap the returned response json obj in this, etc
        # do something with the response now..
        # make sure to cache the chunk results as well (in case you are having lot of them)
        pass
OR
Using Pool from the multiprocessing module in Python:
from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup
base_url = 'http://quotes.toscrape.com/page/'
all_urls = list()
def generate_urls():
    # better to yield them as well if you already have the URL's list etc..
    for i in range(1,11):
        all_urls.append(base_url + str(i))
def scrape(url):
    res = requests.get(url)
    print(res.status_code, res.url)
if __name__ == "__main__":  # guard needed where worker processes are spawned (e.g. Windows)
    generate_urls()
    p = Pool(10)
    p.map(scrape, all_urls)
    p.terminate()
    p.join()

How to send multiple 'GET' requests using get function of requests library? [duplicate]

This question already has answers here:
What is the fastest way to send 100,000 HTTP requests in Python?
(21 answers)
Closed 2 years ago.
I want to fetch data (JSON files only) from multiple URLs using requests.get(). The URLs are saved in a pandas dataframe column and I am saving the response in JSON files locally.
import json
import requests
from time import time
i=0
start = time()
for url in pd_url['URL']:
    time_1 = time()
    r_1 = requests.get(url, headers = headers).json()
    filename = './jsons1/'+str(i)+'.json'
    with open(filename, 'w') as f:
        json.dump(r_1, f)
    i+=1
time_taken = time()-start
print('time taken:', time_taken)
Currently, I fetch the data from each URL one by one using a for loop, as shown above. However, that code takes too much time to execute. Is there any way to send multiple requests at once and make this run faster?
Also, what are the possible factors that are delaying the responses?
I have an internet connection with low latency and enough speed to 'theoretically' execute the above operation in less than 20 seconds. Still, the above code takes 145-150 seconds every time I run it. My target is to complete this execution in at most 30 seconds. Please suggest workarounds.
It sounds like you want multi-threading so use the ThreadPoolExecutor in the standard library. This can be found in the concurrent.futures package.
import concurrent.futures
import json
import requests
def make_request(url, headers):
    resp = requests.get(url, headers=headers).json()
    return resp
with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = (executor.submit(make_request, url, headers) for url in pd_url['URL'])
    # note: idx counts completions, so file numbers follow completion order, not URL order
    for idx, future in enumerate(concurrent.futures.as_completed(futures)):
        try:
            data = future.result()
        except Exception as exc:
            print(f"Generated an exception: {exc}")
            continue  # nothing to write for a failed request
        with open(f"./jsons1/{idx}.json", 'w') as f:
            json.dump(data, f)
You can increase or decrease the number of threads, specified as max_workers, as you see fit.
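If the file number has to match the position of the URL in the column, as it does in the question's serial loop, one option is to remember which index each future belongs to. This is a minimal sketch under assumptions: `save_all` and `fetch_json` are hypothetical names, and `urlopen` stands in for `requests.get(...).json()` so the block needs only the standard library.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download one URL and parse its body as JSON (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def save_all(urls, out_dir, fetch=fetch_json, max_workers=5):
    """Write the JSON from urls[i] to out_dir/i.json; return {index: exception}."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    errors = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which position each future belongs to before they complete.
        future_to_index = {
            executor.submit(fetch, url): index for index, url in enumerate(urls)
        }
        for future in as_completed(future_to_index):
            index = future_to_index[future]
            try:
                data = future.result()
            except Exception as exc:  # keep going; report the failure to the caller
                errors[index] = exc
                continue
            with open(out_dir / f"{index}.json", "w") as f:
                json.dump(data, f)
    return errors
```

With the question's data this might be called as `save_all(list(pd_url['URL']), './jsons1')`; the returned dictionary lists the positions that failed so they can be retried.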
You can make use of multiple threads to parallelize your fetching. This article presents one possible way of doing that using the ThreadPoolExecutor class from the concurrent.futures module.
It looks like @gold_cy posted pretty much the same answer while I was working on this, but for posterity, here's my example. I've taken your code and modified it to use the executor, and I've modified it slightly to run locally despite not having handy access to a list of JSON URLs.
I'm using a list of 100 URLs; it takes about 125 seconds to fetch the list serially and about 27 seconds using 10 workers. I added a timeout on requests to prevent broken servers from holding everything up, and I added some code to handle error responses.
import json
import pandas
import requests
import time
from concurrent.futures import ThreadPoolExecutor
def fetch_url(data):
    index, url = data
    print('fetching', url)
    try:
        r = requests.get(url, timeout=10)
    except requests.exceptions.Timeout:  # covers both connect and read timeouts
        return
    if r.status_code != 200:
        return
    filename = f'./data/{index}.json'
    with open(filename, 'w') as f:
        json.dump(r.text, f)
pd_url = pandas.read_csv('urls.csv')
start = time.time()
with ThreadPoolExecutor(max_workers=10) as runner:
    for _ in runner.map(fetch_url, enumerate(pd_url['URL'])):
        pass
# leaving the with block already shuts the executor down, so no explicit shutdown() is needed
time_taken = time.time()-start
print('time taken:', time_taken)
Also, what are the possible factors that are delaying the responses?
The response time of the remote server is going to be the major bottleneck.

Processing Result outside For Loop in Python

I have this simple code which fetches page via urllib:
browser_list= ['Chrome','Mozilla','Safari','Internet Explorer','Opera']
user_string_url="http://www.useragentstring.com/pages/"
for eachBrowser in browser_list:
    result= urllib2.urlopen(urljoin(user_string_url,eachBrowser))
Now I can read the result via result.read() but I was wondering if all this functionality can be done outside the for loop. Because other URLs to be fetched will wait until all the result has been processed.
I want to process result outside the for loop. Can this be done?
One way to do this may be to store the results in a dictionary. What you can do is:
result = {}
for eachBrowser in browser_list:
    result[eachBrowser]= urllib2.urlopen(urljoin(user_string_url,eachBrowser))
and use result[BrowserName] outside the loop.
Hope this helps.
If you simply want to access all the results outside the loop, just append them to a list or dictionary as in the answer above.
Or, if you are trying to speed up your task, try multithreading.
import threading
class myThread (threading.Thread):
    def __init__(self, result):
        threading.Thread.__init__(self)
        self.result=result
    def run(self):
        # process your result (as self.result) here
        pass
browser_list= ['Chrome','Mozilla','Safari','Internet Explorer','Opera']
user_string_url="http://www.useragentstring.com/pages/"
for eachBrowser in browser_list:
    result= urllib2.urlopen(urljoin(user_string_url,eachBrowser))
    myThread(result).start() # it will start processing result on another thread and continue loop without any waiting
It's a simple way of multithreading. It may break depending on your result processing. Consider reading the documentation and some examples before you try.
You can use threads for this:
import threading
import urllib2
from urlparse import urljoin
def worker(url):
    res = urllib2.urlopen(url)
    data = res.read()
    res.close()
browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
user_string_url='http://www.useragentstring.com/'
for browser in browser_list:
    url = urljoin(user_string_url, browser)
    threading.Thread(target=worker,args=[url]).start()
# wait for everyone to complete
for thread in threading.enumerate():
    if thread == threading.current_thread(): continue
    thread.join()
Are you using Python 3? If so, you can use futures for this task:
from urllib.request import urlopen
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor
browser_list = ['Chrome','Mozilla','Safari','Internet+Explorer','Opera']
user_string_url = "http://www.useragentstring.com/pages/"
def process_request(url, future):
    print("Processing:", url)
    print("Reading data")
    print(future.result().read())
with ThreadPoolExecutor(max_workers=10) as executor:
    submit = executor.submit
    for browser in browser_list:
        url = urljoin(user_string_url, browser) + '/'
        submit(process_request, url, submit(urlopen, url))
You could also do this with yield:
def collect_browsers():
    browser_list= ['Chrome','Mozilla','Safari','Internet Explorer','Opera']
    user_string_url="http://www.useragentstring.com/pages/"
    for eachBrowser in browser_list:
        yield eachBrowser, urllib2.urlopen(urljoin(user_string_url,eachBrowser))
def process_browsers():
    for browser, result in collect_browsers():
        do_something(result)
This is still a synchronous call (browser 2 will not fire until browser 1 is processed), but you can keep the logic for dealing with the results separate from the logic managing the connections. You could of course also use threads to handle the processing asynchronously, with or without yield.
Edit
Just re-read OP and should repeat that yield doesn't provide multi-threaded, asynchronous execution in case that was not clear in my first answer!
