I'm working on a Python script that iterates over a list of URLs pointing to .mp3 files; the aim is to extract the Content-Length header from each URL through a HEAD request, using the requests library.
However, I noticed that the HEAD requests slow the script down significantly; isolating the piece of code involved, I get an execution time of about 1.5 minutes for 200 URLs/requests:
import requests
import time
print("start\n\n")
t1 = time.time()
for n in range(200):
    response = requests.head("url.mp3")
    print(response, "\n")
t2 = time.time()
print("\n\nend\n\n")
print("time: ",t2-t1,"s")
A good solution for you could be grequests:
import grequests

rs = (grequests.get('http://127.0.0.1/%i.mp3' % i) for i in range(200))
for response in grequests.map(rs):
    print('Status code %i' % response.status_code)
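If you would rather avoid an extra dependency, a similar speedup is possible with the standard library's thread pool; a rough sketch (the URL list and worker count are placeholders to adapt) that also pulls the Content-Length header the question is after:
import concurrent.futures

import requests

def head_length(url):
    # one HEAD request; Content-Length holds the file size in bytes, when the server sends it
    response = requests.head(url, timeout=10)
    return url, response.headers.get("Content-Length")

urls = ["http://127.0.0.1/%i.mp3" % i for i in range(200)]  # placeholder URLs

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for url, length in executor.map(head_length, urls):
        print(url, length)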
Related
I have about 2000 URLs that I am trying to scrape using the requests module. To speed up the process, I am using ThreadPoolExecutor from concurrent.futures. When I run this, the execution hangs partway through, and the issue is inconsistent too. Sometimes it finishes smoothly within 2 minutes, but other times it gets stuck at a point for over 30 minutes and I eventually have to kill the process.
# scraper.py
import requests

def get_content(url):
    try:
        res = requests.get(url)
        res = res.content
        return res
    except:
        return ""

# main.py
import concurrent.futures

from scraper import get_content

if __name__ == "__main__":
    # content > an empty list for output
    # urls > a list of urls
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        results = executor.map(get_content, urls)
        for res in results:
            content.append(res)  # note: list.append returns None, so don't reassign it
    print(content)
I want to understand how to debug this. Why and where is it getting stuck? And also, why is it inconsistent?
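One thing worth checking (a guess, since the hang is not reproduced here): requests.get has no timeout by default, so a single unresponsive server can hold a worker thread indefinitely, which would explain both the hang and its inconsistency; max_workers=1000 is also far more threads than 2000 URLs can use well. A defensive sketch along those lines, with an arbitrary 10-second timeout and a smaller pool (urls is assumed to be the same list as in the question):
# scraper.py (defensive variant; the timeout value is an arbitrary choice)
import requests

def get_content(url):
    try:
        res = requests.get(url, timeout=10)  # raise instead of waiting forever
        return res.content
    except requests.RequestException:
        return ""

# main.py
import concurrent.futures

from scraper import get_content

if __name__ == "__main__":
    content = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
        for res in executor.map(get_content, urls):
            content.append(res)
    print(len(content))
If it still stalls, logging each URL just before the request inside get_content will show exactly where the pool is stuck.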
I am using the requests module to download the content of many websites, which looks something like this:
import requests

for i in range(1000):
    url = base_url + f"{i}.anything"
    r = requests.get(url)
Of course this is simplified, but basically the base URL is always the same; I only want to download an image, for example.
This takes very long due to the number of iterations. The internet connection is not the problem; rather, it is the time it takes to start each request, etc.
So I was thinking about something like multiprocessing, because this task is basically always the same, and I could imagine it being a lot faster when multiprocessed.
Is this somehow doable?
Thanks in advance!
I would suggest that in this case, lightweight threads would be better. When I ran the request on a certain URL 5 times, the result was:
Threads: Finished in 0.24 second(s)
MultiProcess: Finished in 0.77 second(s)
Your implementation can be something like this:
import concurrent.futures
import time

import requests
from bs4 import BeautifulSoup

def access_url(url, No):
    print(f"{No}:==> {url}")
    response = requests.get(url)
    soup = BeautifulSoup(response.text, features='lxml')
    return "{} : {}".format(No, str(soup.title)[7:50])

if __name__ == "__main__":
    test_url = "http://bla bla.com/"
    base_url = test_url
    THREAD_MULTI_PROCESSING = True
    start = time.perf_counter()  # calculate the start time
    url_list = [base_url for i in range(5)]  # parameters passed as lists so map() can be used
    url_counter = [i for i in range(5)]
    if THREAD_MULTI_PROCESSING:
        with concurrent.futures.ThreadPoolExecutor() as executor:  # in this case threads are the better fit
            results = executor.map(access_url, url_list, url_counter)
            for result in results:
                print(result)
    end = time.perf_counter()  # calculate the finish time
    print(f'Threads: Finished in {round(end - start, 2)} second(s)')

    start = time.perf_counter()
    PROCESS_MULTI_PROCESSING = True
    if PROCESS_MULTI_PROCESSING:
        with concurrent.futures.ProcessPoolExecutor() as executor:
            results = executor.map(access_url, url_list, url_counter)
            for result in results:
                print(result)
    end = time.perf_counter()
    print(f'MultiProcess: Finished in {round(end - start, 2)} second(s)')
I think you will see better performance in your case.
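One extra tweak beyond the answer above: reusing a single requests.Session lets the pool keep connections to the host alive, which trims exactly the per-request setup time the question mentions. Sharing a Session across threads is common practice but not formally documented as thread-safe, so treat this as a rough sketch; base_url is assumed to be the same placeholder as in the question:
import concurrent.futures

import requests

session = requests.Session()  # keep-alive connections are reused across requests

def fetch(i):
    url = base_url + f"{i}.anything"  # base_url as in the question
    return session.get(url, timeout=10).content

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    contents = list(executor.map(fetch, range(1000)))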
This is not a duplicate of this question.
I am trying to understand how Django handles multiple requests. According to this answer, Django is supposed to block parallel requests. But I have found this is not exactly true, at least for Django 3.1. I am using the Django built-in server.
So, in my code (view.py) I have a blocking code block that is only triggered in a particular situation. It takes a very long time to complete the request in that case. This is the code for view.py:
from django.shortcuts import render
import numpy as np

def insertionSort(arr):
    for i in range(1, len(arr)):
        key = arr[i]
        j = i - 1
        while j >= 0 and key < arr[j]:
            arr[j + 1] = arr[j]
            j -= 1
        arr[j + 1] = key

def home(request):
    a = request.user.username
    print(a)
    id = int(request.GET.get('id', ''))
    if id == 1:
        arr = np.arange(100000)
        arr = arr[::-1]
        insertionSort(arr)
        # print("Sorted array is:")
        # for i in range(len(arr)):
        #     print("%d" % arr[i])
    return render(request, 'home/home.html')
So only for id=1 will it execute the blocking code block; for other cases it is supposed to work normally.
Now, what I found is that if I make two parallel requests, one with id=1 and another with id=2, the second request does not really get blocked but takes longer to get data from Django. It takes ~2.5 s to complete if there is another parallel blocking request; otherwise it takes ~0.02 s to get the data.
These are my Python scripts to make the requests:
Malicious request:
import time
from concurrent.futures import as_completed
from pprint import pprint

from requests_futures.sessions import FuturesSession

session = FuturesSession()
futures = [session.get('http://127.0.0.1:8000/?id=1') for i in range(3)]
start = time.time()

for future in as_completed(futures):
    resp = future.result()
    # pprint({
    #     'url': resp.request.url,
    #     'content': resp.json(),
    # })
    roundtrip = time.time() - start
    print(roundtrip)
Normal request:
import logging
import threading
import time

import requests

if __name__ == "__main__":
    # start = time.time()
    while True:
        print(requests.get("http://127.0.0.1:8000/?id=2").elapsed.total_seconds())
        time.sleep(2)
I will be grateful if anyone can explain how Django is serving the parallel requests in this case.
There is an option to use --nothreading when you start the development server. From what you described, it's possible the blocking task simply finished in about 2 seconds. An easier way to test is to use time.sleep(10) in the view for testing purposes.
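For example, a throwaway view along these lines (the view name and sleep length are arbitrary) makes the behaviour easy to observe: start the development server with python manage.py runserver --nothreading and a second request should wait for the sleep to finish, while the default threaded server answers it right away.
# views.py (hypothetical test view)
import time

from django.http import HttpResponse

def slow(request):
    time.sleep(10)  # hold this worker for 10 seconds to simulate the blocking branch
    return HttpResponse("done")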
This question already has answers here:
What is the fastest way to send 100,000 HTTP requests in Python?
(21 answers)
Closed 2 years ago.
I want to fetch data (JSON files only) from multiple URLs using requests.get(). The URLs are saved in a pandas dataframe column and I am saving the response in JSON files locally.
import json
from time import time

import requests

# pd_url is the dataframe holding the URL column and headers is the request-headers dict, both defined elsewhere
i = 0
start = time()

for url in pd_url['URL']:
    time_1 = time()
    r_1 = requests.get(url, headers=headers).json()
    filename = './jsons1/' + str(i) + '.json'
    with open(filename, 'w') as f:
        json.dump(r_1, f)
    i += 1

time_taken = time() - start
print('time taken:', time_taken)
Currently, I have written code to get the data from each URL one by one using a for loop, as shown above. However, that code takes too much time to execute. Is there any way to send multiple requests at once and make this run faster?
Also, what are the possible factors that are delaying the responses?
I have an internet connection with low latency and enough speed to 'theoretically' execute the above operation in less than 20 seconds. Still, the above code takes 145-150 seconds every time I run it. My target is to complete this execution in at most 30 seconds. Please suggest workarounds.
It sounds like you want multi-threading, so use the ThreadPoolExecutor from the standard library's concurrent.futures package.
import concurrent.futures
import json

import requests

def make_request(url, headers):
    resp = requests.get(url, headers=headers).json()
    return resp

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = (executor.submit(make_request, url, headers) for url in pd_url['URL'])
    # as_completed yields futures in completion order, so idx reflects that order
    for idx, future in enumerate(concurrent.futures.as_completed(futures)):
        try:
            data = future.result()
        except Exception as exc:
            print(f"Generated an exception: {exc}")
            continue  # skip writing a file when the request failed
        with open(f"./jsons1/{idx}.json", 'w') as f:
            json.dump(data, f)
You can increase or decrease the number of threads, specified as max_workers, as you see fit.
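One caveat about the snippet above: as_completed yields futures in completion order, so the idx used in the file name will not match the dataframe's row order the way the original serial loop did. If that order matters, executor.map preserves input order instead; a small variant sketch:
import functools

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    # map() returns results in the same order as pd_url['URL'], so idx matches the row order
    results = executor.map(functools.partial(make_request, headers=headers), pd_url['URL'])
    for idx, data in enumerate(results):
        with open(f"./jsons1/{idx}.json", 'w') as f:
            json.dump(data, f)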
You can make use of multiple threads to parallelize your fetching. This article presents one possible way of doing that using the ThreadPoolExecutor class from the concurrent.futures module.
It looks like #gold_cy posted pretty much the same answer while I was working on this, but for posterity, here's my example. I've taken your code and modified it to use the executor, and I've adjusted it slightly to run locally despite not having handy access to a list of JSON URLs.
I'm using a list of 100 URLs, and it takes about 125 seconds to fetch the list serially and about 27 seconds using 10 workers. I added a timeout on requests to prevent broken servers from holding everything up, and I added some code to handle error responses.
import json
import time
from concurrent.futures import ThreadPoolExecutor

import pandas
import requests

def fetch_url(data):
    index, url = data
    print('fetching', url)
    try:
        r = requests.get(url, timeout=10)
    except requests.exceptions.ConnectTimeout:
        return
    if r.status_code != 200:
        return
    filename = f'./data/{index}.json'
    with open(filename, 'w') as f:
        json.dump(r.text, f)

pd_url = pandas.read_csv('urls.csv')

start = time.time()
with ThreadPoolExecutor(max_workers=10) as runner:
    for _ in runner.map(fetch_url, enumerate(pd_url['URL'])):
        pass
    runner.shutdown()

time_taken = time.time() - start
print('time taken:', time_taken)
Also, what are the possible factors that are delaying the responses?
The response time of the remote server is going to be the major bottleneck.
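A quick way to check that is to compare, for one URL, the time requests reports waiting on the server with the total wall-clock time of the iteration; if the two are close, the remote side dominates and overlapping requests client-side is the main lever. A small sketch, with url and headers taken to be the same as in the question:
import time

import requests

t0 = time.time()
r = requests.get(url, headers=headers, timeout=10)
total = time.time() - t0

# r.elapsed measures the time from sending the request until the response headers arrive
print(f"server wait: {r.elapsed.total_seconds():.3f}s, total: {total:.3f}s")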
I'm building a thing that gathers data from a site. Sometimes it has to go through more than 10,000 pages, and opening each one with urllib2.urlopen() takes time. I'm not very hopeful about this, but does anyone know of a faster way to get HTML from a site?
My code is this:
import urllib, json, time
import requests
##########################
start_time = time.time()
##########################
query = "hill"
queryEncode = urllib.quote(query)
url = 'https://www.googleapis.com/customsearch/v1?key={{MY API KEY}}&cx={{cxKey}}:omuauf_lfve&fields=queries(request(totalResults))&q='+queryEncode
response = urllib.urlopen(url)
data = json.loads(str(response.read()))
##########################
elapsed_time = time.time() - start_time
print " url to json time : " + str(elapsed_time)
##########################
And the output is
url to json time : 4.46600008011
[Finished in 4.7s]
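Most of that time is spent waiting on the remote API rather than in Python, so for thousands of pages the usual remedy is to overlap the requests. A rough Python 3 sketch with requests and a thread pool (the URL list is a placeholder and the worker count is only a starting point to tune):
import concurrent.futures

import requests

def fetch(url):
    try:
        return requests.get(url, timeout=10).text
    except requests.RequestException:
        return None  # skip pages that fail or time out

urls = ['https://example.com/page/%d' % i for i in range(10000)]  # placeholder URLs

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    pages = list(executor.map(fetch, urls))

print(sum(1 for p in pages if p is not None), 'pages fetched')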