Asynchronously get and store images in Python

The following code is a sample of non-asynchronous code; is there any way to get the images asynchronously?
import urllib
for x in range(0, 10):
    urllib.urlretrieve("http://test.com/file %s.png" % (x), "temp/file %s.png" % (x))
I have also seen the grequests library, but I couldn't figure out from its documentation whether this is possible or how to do it.

You don't need any third-party library. Just create a thread for every request, start the threads, and then either wait for all of them to finish or continue your application while the images are downloaded in the background.
import threading
import urllib

results = []

def getter(url, dest):
    results.append(urllib.urlretrieve(url, dest))

threads = []
for x in range(0, 10):
    t = threading.Thread(target=getter, args=('http://test.com/file %s.png' % x,
                                              'temp/file %s.png' % x))
    t.start()
    threads.append(t)

# wait for all threads to finish.
# You can continue doing whatever you want and
# join the threads when you finally need the results.
# They will fetch your urls in the background without
# blocking your main application.
for t in threads:
    t.join()
Optionally, you can create a thread pool that pulls the URLs and destinations from a queue.
If you're using Python 3, this is already implemented for you in the concurrent.futures module, as in the sketch below.
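For illustration only (not part of the original answer), here is a minimal Python 3 sketch of that thread-pool approach, using urllib.request.urlretrieve and the placeholder URLs from the question:
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def download(url, dest):
    # urlretrieve blocks, so each call runs in its own worker thread
    return urllib.request.urlretrieve(url, dest)

with ThreadPoolExecutor(max_workers=10) as pool:
    futures = [pool.submit(download, "http://test.com/file %s.png" % x,
                           "temp/file %s.png" % x) for x in range(10)]
    results = [f.result() for f in futures]  # blocks until every download finishes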

Something like this should help you
import grequests

urls = ['url1', 'url2', ....]  # this should be the list of urls
requests = (grequests.get(u) for u in urls)
responses = grequests.map(requests)

for response in responses:
    if 199 < response.status_code < 400:
        name = generate_file_name()  # generate some name for your image file with extension like example.jpg
        with open(name, 'wb') as f:  # or save to S3 or something like that
            f.write(response.content)
Here, only the downloading of the images is parallel; writing each image's content to a file is still sequential, so you can create threads (or do something similar) to make that part parallel or asynchronous as well, as sketched below.
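One possible way to do that (a sketch, not part of the original answer; generate_file_name is still the hypothetical naming helper from the snippet above) is to hand each response from grequests.map to a small thread pool:
from concurrent.futures import ThreadPoolExecutor

def save(response):
    # runs in a worker thread so slow disk writes overlap
    if 199 < response.status_code < 400:
        name = generate_file_name()  # hypothetical helper, as above
        with open(name, 'wb') as f:
            f.write(response.content)

with ThreadPoolExecutor(max_workers=8) as pool:
    pool.map(save, responses)  # responses comes from grequests.map(requests)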

Related

Multithreading hangs when used with the requests module and a large number of threads

I have about 2000 URLs that I am trying to scrape using the requests module. To speed up the process, I am using ThreadPoolExecutor from concurrent.futures. When I run this, the execution hangs in the middle, and the issue is inconsistent too: sometimes it finishes smoothly within 2 minutes, but other times it just gets stuck at a point for over 30 minutes and I eventually have to kill the process.
# scraper.py
import requests

def get_content(url):
    try:
        res = requests.get(url)
        res = res.content
        return res
    except:
        return ""

# main.py
import concurrent.futures
from scraper import get_content

if __name__ == "__main__":
    # content > an empty list for output
    # urls > a list of urls
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        results = executor.map(get_content, urls)
        for res in results:
            content = content.append(res)
    print(content)
I want to understand how to debug this. Why and where is it getting stuck? And also, why is it inconsistent?
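No answer is recorded here for this question, but as a hedged editorial aside: requests.get has no timeout by default, so a single stalled connection can block a worker forever, which matches the symptom of an inconsistent hang. A minimal way to make the hang visible (assumptions of mine: a much smaller pool, a per-request timeout, and logging the failing URL):
import concurrent.futures
import requests

def get_content(url):
    try:
        # a timeout turns a silently stuck request into a visible exception
        res = requests.get(url, timeout=10)
        return res.content
    except requests.RequestException as exc:
        print(f"failed {url}: {exc}")  # shows which URL misbehaved
        return ""

with concurrent.futures.ThreadPoolExecutor(max_workers=50) as executor:
    content = list(executor.map(get_content, urls))  # urls as in the question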

Multicore processing on scraper function

I was hoping to speed up my scraper by using multiple cores, so that several cores could scrape from a list of URLs I have using a predefined function scrape. How would I do this?
Here is my current code:
for x in URLs['identifier'][1:365]:
    test = scrape(x)
    results = test.get_results
    results['identifier'] = x
    final = final.append(results)
Something like this (or you can also use Scrapy) will easily allow you to make a lot of requests in parallel, provided the server can handle it as well:
# it's just a wrapper around concurrent.futures ThreadPoolExecutor with a nice tqdm progress bar!
from tqdm.contrib.concurrent import thread_map, process_map  # for multi-threading, multi-processing respectively

def chunk_list(lst, size):
    for i in range(0, len(lst), size):
        yield lst[i:i + size]

for idx, my_chunk in enumerate(chunk_list(huge_list, size=2**12)):
    for response in thread_map(<which_func_to_call>, my_chunk, max_workers=your_cpu_cores + 6):
        # which_func_to_call -> wrap the returned response json obj in this, etc
        # do something with the response now..
        # make sure to cache the chunk results as well (in case you have a lot of them)
        pass
OR
Using Pool from the multiprocessing module in Python:
from multiprocessing import Pool
import requests
from bs4 import BeautifulSoup

base_url = 'http://quotes.toscrape.com/page/'
all_urls = list()

def generate_urls():
    # better to yield them as well if you already have the URL's list etc..
    for i in range(1, 11):
        all_urls.append(base_url + str(i))

def scrape(url):
    res = requests.get(url)
    print(res.status_code, res.url)

generate_urls()
p = Pool(10)
p.map(scrape, all_urls)
p.terminate()
p.join()

How to send multiple 'GET' requests using get function of requests library? [duplicate]

This question already has answers here: What is the fastest way to send 100,000 HTTP requests in Python? (21 answers). Closed 2 years ago.
I want to fetch data (JSON files only) from multiple URLs using requests.get(). The URLs are saved in a pandas dataframe column and I am saving the response in JSON files locally.
import json
import requests
from time import time

i = 0
start = time()
for url in pd_url['URL']:
    time_1 = time()
    r_1 = requests.get(url, headers=headers).json()
    filename = './jsons1/' + str(i) + '.json'
    with open(filename, 'w') as f:
        json.dump(r_1, f)
    i += 1
time_taken = time() - start
print('time taken:', time_taken)
Currently, I have written code to get the data one by one from each URL using a for loop, as shown above. However, that code is taking too much time to execute. Is there any way to send multiple requests at once and make this run faster?
Also, what are the possible factors that are delaying the responses?
I have an internet connection with low latency and enough speed to 'theoretically' execute the above operation in less than 20 seconds. Still, the above code takes 145-150 seconds every time I run it. My target is to complete this execution in at most 30 seconds. Please suggest workarounds.
It sounds like you want multithreading, so use ThreadPoolExecutor from the standard library's concurrent.futures package.
import concurrent.futures
import json
import requests

def make_request(url, headers):
    resp = requests.get(url, headers=headers).json()
    return resp

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    futures = (executor.submit(make_request, url, headers) for url in pd_url['URL'])
    for idx, future in enumerate(concurrent.futures.as_completed(futures)):
        try:
            data = future.result()
        except Exception as exc:
            print(f"Generated an exception: {exc}")
            continue  # skip writing a file for failed requests
        # note: idx reflects completion order, not the original URL order
        with open(f"./jsons1/{idx}.json", 'w') as f:
            json.dump(data, f)
You can increase or decrease the number of threads, specified as max_workers, as you see fit.
You can make use of multiple threads to parallelize your fetching. This article presents one possible way of doing that using the ThreadPoolExecutor class from the concurrent.futures module.
It looks like #gold_cy posted pretty much the same answer while I was working on this, but for posterity, here's my example. I've taken your code and modified it to use the executor, with slight changes so that it runs locally despite my not having handy access to a list of JSON URLs.
I'm using a list of 100 URLs, and it takes about 125 seconds to fetch the list serially, and about 27 seconds using 10 workers. I added a timeout on requests to prevent broken servers from holding everything up, and I added some code to handle error responses.
import json
import pandas
import requests
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_url(data):
    index, url = data
    print('fetching', url)
    try:
        r = requests.get(url, timeout=10)
    except requests.exceptions.ConnectTimeout:
        return
    if r.status_code != 200:
        return
    filename = f'./data/{index}.json'
    with open(filename, 'w') as f:
        json.dump(r.text, f)

pd_url = pandas.read_csv('urls.csv')

start = time.time()
with ThreadPoolExecutor(max_workers=10) as runner:
    for _ in runner.map(fetch_url, enumerate(pd_url['URL'])):
        pass
    runner.shutdown()
time_taken = time.time() - start
print('time taken:', time_taken)
Also, what are the possible factors that are delaying the responses?
The response time of the remote server is going to be the major bottleneck.

Run different tasks using multiprocessing Pool

I'm writing code in Python that is supposed to get all the links found at one URL, save them as key:value pairs (url: [links]), and then go over those links and do the same thing over and over until I have enough keys.
I've already done this with a list of threads, removing them from the list when they finished running, but I want to use a thread pool for easier maintenance.
To do this I made two functions: one to get the content from the URL and return it, and one to extract the links from the content and return them.
Now I want to manage those tasks with a thread pool, but I don't know how to do it properly because I don't know how to control the flow.
I can extract the links only after the GET request has returned the HTML page.
These are the functions I will use:
import re
import requests

def extract_links(response):
    arr_starts = [m.start() for m in re.finditer('href="https://', response.content)]
    arr_ends = []
    links = []
    for start in arr_starts:
        end_index = response.content.find('"', start + 6)
        arr_ends.append(end_index)
    for i in range(len(arr_starts)):
        link = response.content[arr_starts[i] + 6:arr_ends[i]]
        links.append(link)
    return links

def get_page(url):
    return requests.get(url)
and this is the code I used the first time:
first_url = r'https://blablabla'
hash_links = {}
thread_list = []

web_crawl(first_url, hash_links)
while len(hash_links.keys()) < 30:
    if len(thread_list) < MAX_THREAD_COUNT:
        for urls in hash_links.values():
            for url in urls:
                if url not in hash_links:
                    new_thread = threading.Thread(target=web_crawl, args=(url, hash_links))
                    thread_list.append(new_thread)
                    new_thread.start()
                    new_thread.join()
    else:
        for t in thread_list:
            if not t.isAlive():
                t.handled = True
        thread_list = [t for t in thread_list if not t.handled]

for key in hash_links.keys():
    print key + ':'
    for link in hash_links[key]:
        print '----' + link
Your problem seems to be one of producing content from a URL, processing the links found at that URL as keys, and also scheduling those links for processing, all in parallel using a thread pool and semaphore objects.
If that is the case, I would point you to this article for semaphore objects and the thread pool.
Also, your problem sounds to me a lot like something that would benefit from a producer-consumer architecture, so I would also recommend this article.
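As a rough illustration of that producer-consumer idea (my sketch, not from either article; it reuses the question's get_page, extract_links, first_url and MAX_THREAD_COUNT names as assumptions): fetcher threads consume URLs from one queue and push responses onto another, while the parser consumes responses, stores the url: [links] pairs, and feeds newly found links back in.
import queue
import threading

url_q = queue.Queue()
resp_q = queue.Queue()
hash_links = {}

def fetcher():
    while True:
        url = url_q.get()
        resp_q.put((url, get_page(url)))          # get_page() from the question
        url_q.task_done()

def parser(max_keys=30):
    while len(hash_links) < max_keys:
        url, response = resp_q.get()
        hash_links[url] = extract_links(response)  # extract_links() from the question
        for link in hash_links[url]:
            if link not in hash_links:
                url_q.put(link)                    # schedule newly found links
        resp_q.task_done()

for _ in range(MAX_THREAD_COUNT):                  # MAX_THREAD_COUNT as in the question
    threading.Thread(target=fetcher, daemon=True).start()

url_q.put(first_url)
parser()                                           # run the consumer in the main thread
The daemon fetchers simply die with the main thread once parser() has collected enough keys, which is the flow control the question was asking about.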

Fastest way to read and process 100,000 URLs in Python

I have a file with 100,000 URLs that I need to request and then process. The processing takes a non-negligible amount of time compared to the request, so simply using multithreading seems to give me only a partial speed-up. From what I have read, I think using the multiprocessing module, or something similar, would offer a more substantial speed-up because I could use multiple cores. I'm guessing I want to use multiple processes, each with multiple threads, but I'm not sure how to do that.
Here is my current code, using threading (based on What is the fastest way to send 100,000 HTTP requests in Python?):
from threading import Thread
from Queue import Queue
import requests
from bs4 import BeautifulSoup
import sys

concurrent = 100

def worker():
    while True:
        url = q.get()
        html = get_html(url)
        process_html(html)
        q.task_done()

def get_html(url):
    try:
        html = requests.get(url, timeout=5, headers={'Connection':'close'}).text
        return html
    except:
        print "error", url
        return None

def process_html(html):
    if html == None:
        return
    soup = BeautifulSoup(html)
    text = soup.get_text()
    # do some more processing
    # write the text to a file

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

try:
    for url in open('text.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
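As an editorial sketch of the processes-with-threads idea from the question (Python 3 shown, not an answer from the original thread; the chunk count and worker sizes are arbitrary guesses): split the URL list into chunks, give each chunk to a worker process, and let each process run its own thread pool for the I/O-bound fetching while the CPU-bound parsing benefits from the extra cores.
import concurrent.futures
import requests
from bs4 import BeautifulSoup

def fetch_and_process(url):
    try:
        html = requests.get(url, timeout=5).text
    except requests.RequestException:
        return None
    text = BeautifulSoup(html, "html.parser").get_text()
    # do some more processing / write the text to a file
    return len(text)

def handle_chunk(urls):
    # each worker process runs its own thread pool for the network calls
    with concurrent.futures.ThreadPoolExecutor(max_workers=25) as threads:
        return list(threads.map(fetch_and_process, urls))

if __name__ == "__main__":
    urls = [line.strip() for line in open("text.txt")]
    chunks = [urls[i::4] for i in range(4)]  # one chunk per process
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as procs:
        for results in procs.map(handle_chunk, chunks):
            pass  # aggregate per-chunk results here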
If the file isn't bigger than your available memory, then instead of iterating over the file object returned by open(), map it with mmap (https://docs.python.org/3/library/mmap.html). That gives you roughly the same speed as working with memory rather than a file.
with open("test.txt") as f:
mmap_file = mmap.mmap(f.fileno(), 0)
# code that does what you need
mmap_file.close()
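For example, the "code that does what you need" placeholder could feed the URL queue from the question directly from the mapping (a sketch only; mmap's readline returns bytes, hence the decode):
for line in iter(mmap_file.readline, b""):
    q.put(line.decode().strip())  # q is the Queue from the question's code
q.join()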
