Fastest way to read and process 100,000 URLs in Python

I have a file with 100,000 URLs that I need to request and then process. The processing takes a non-negligible amount of time compared to the request, so simply using multithreading seems to give me only a partial speed-up. From what I have read, I think using the multiprocessing module, or something similar, would offer a more substantial speed-up because I could use multiple cores. I'm guessing I want to use multiple processes, each with multiple threads, but I'm not sure how to do that.
Here is my current code, using threading (based on What is the fastest way to send 100,000 HTTP requests in Python?):
from threading import Thread
from Queue import Queue
import requests
from bs4 import BeautifulSoup
import sys

concurrent = 100

def worker():
    while True:
        url = q.get()
        html = get_html(url)
        process_html(html)
        q.task_done()

def get_html(url):
    try:
        html = requests.get(url, timeout=5, headers={'Connection':'close'}).text
        return html
    except:
        print "error", url
        return None

def process_html(html):
    if html == None:
        return
    soup = BeautifulSoup(html)
    text = soup.get_text()
    # do some more processing
    # write the text to a file

q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=worker)
    t.daemon = True
    t.start()

try:
    for url in open('text.txt'):
        q.put(url.strip())
    q.join()
except KeyboardInterrupt:
    sys.exit(1)
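Here is a rough, untested sketch of the multi-process-with-threads layout I have in mind (Python 3 with concurrent.futures; the worker counts are guesses and the helpers just mirror the code above). I'm not sure whether this is the right way to structure it:

from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

import requests
from bs4 import BeautifulSoup

THREADS_PER_PROCESS = 25

def get_html(url):
    try:
        return requests.get(url, timeout=5, headers={'Connection': 'close'}).text
    except requests.RequestException:
        print("error", url)
        return None

def process_html(html):
    if html is None:
        return
    text = BeautifulSoup(html, 'html.parser').get_text()
    # do some more processing / write the text to a file

def handle_chunk(urls):
    # Each worker process runs its own thread pool: the threads overlap the
    # network waits, while the separate processes spread the CPU-bound
    # parsing across cores.
    with ThreadPoolExecutor(max_workers=THREADS_PER_PROCESS) as pool:
        for html in pool.map(get_html, urls):
            process_html(html)

if __name__ == '__main__':
    with open('text.txt') as f:
        urls = [line.strip() for line in f]
    n_processes = 8  # roughly the number of cores
    chunks = [urls[i::n_processes] for i in range(n_processes)]
    with ProcessPoolExecutor(max_workers=n_processes) as procs:
        # map() blocks until every chunk has been fetched and processed
        list(procs.map(handle_chunk, chunks))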

If the file isn't bigger than your available memory, then instead of reading it through a plain open() call, use mmap (https://docs.python.org/3/library/mmap.html). It maps the file into memory, so you can work with it at roughly the speed of an in-memory buffer rather than a file.
import mmap

with open("test.txt", "rb") as f:
    mmap_file = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # code that does what you need
    mmap_file.close()

Related

Multithreading hangs when used with requests module and large number of threads

I have about 2000 URLs that I am trying to scrape using the requests module. To speed up the process, I am using ThreadPoolExecutor from concurrent.futures. The execution hangs in the middle when I run this, and the issue is inconsistent too. Sometimes it finishes smoothly within 2 minutes, but other times it gets stuck at a point for over 30 minutes and I eventually have to kill the process.
# scraper.py
import requests

def get_content(url):
    try:
        res = requests.get(url)
        res = res.content
        return res
    except:
        return ""

# main.py
import concurrent.futures
from scraper import get_content

if __name__ == "__main__":
    # content > an empty list for output
    # urls > a list of urls
    with concurrent.futures.ThreadPoolExecutor(max_workers=1000) as executor:
        results = executor.map(get_content, urls)
        for res in results:
            content.append(res)
    print(content)
I want to understand how to debug this. Why and where is it getting stuck? And also, why is it inconsistent?
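One hedged guess, for comparison: requests.get() is called without a timeout, so a single unresponsive server can block a worker thread forever and make the whole pool look hung, and 1000 workers is far more than most machines or remote servers handle well. A sketch with illustrative, untuned numbers:

import concurrent.futures
import requests

def get_content(url):
    try:
        # A (connect, read) timeout guarantees every call eventually returns
        # instead of blocking its thread indefinitely.
        res = requests.get(url, timeout=(5, 30))
        return res.content
    except requests.RequestException:
        return ""

def fetch_all(urls, max_workers=50):
    content = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        for res in executor.map(get_content, urls):
            content.append(res)
    return content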

Upload and download using OneDrive for Business at the same time with multiple simultaneous requests takes a long time

I want to convert my docx to PDF using OneDrive, so I upload my docx to OneDrive and download it in the same function. I am using a Python Django web server.
from requests_oauthlib import OAuth2Session

def myfunctionname(token, filecontent):
    headers = {"Content-Type": "text/plain"}
    txt = filecontent
    graph_client = OAuth2Session(token=token)
    drive_url = "mywholeurl"
    upload = graph_client.put(drive_url, data=txt, headers=headers)
    download = graph_client.get(drive_url + '?format=pdf')
    return download.url
It takes about 5 seconds to upload and download for one request, but when I make 20 requests at the same time it takes around 40 seconds to complete them all, and for 50 concurrent requests it takes around 80 seconds.
I was expecting to get all results in the same 5 seconds for any number of requests. Can you explain what I am doing wrong?
A few points you can consider while implementing functionality like this:
1) Do not download the file immediately after uploading it.
2) First have an operation for uploading the files, and use a queue to hold the URLs of the uploaded files, like below:
import sys
import os
import urllib
import threading
from Queue import Queue

class DownloadThread(threading.Thread):
    def __init__(self, queue, destfolder):
        super(DownloadThread, self).__init__()
        self.queue = queue
        self.destfolder = destfolder
        self.daemon = True

    def run(self):
        while True:
            url = self.queue.get()
            try:
                self.download_url(url)
            except Exception, e:
                print " Error: %s" % e
            self.queue.task_done()

    def download_url(self, url):
        # change it to a different way if you require
        name = url.split('/')[-1]
        dest = os.path.join(self.destfolder, name)
        print "[%s] Downloading %s -> %s" % (self.ident, url, dest)
        urllib.urlretrieve(url, dest)

def download(urls, destfolder, numthreads=4):
    queue = Queue()
    for url in urls:
        queue.put(url)
    for i in range(numthreads):
        t = DownloadThread(queue, destfolder)
        t.start()
    queue.join()

if __name__ == "__main__":
    download(sys.argv[1:], "/tmp")
3) Last and most importantly, implement multi-threading while downloading the files; multi-threading needs to be implemented while uploading the files too (a rough sketch of the upload side follows below).
Check this link for multi-threading in Python.
Alternatively, try this.
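As a rough illustration of the upload side, here is a sketch that reuses the OAuth2Session/Graph URL setup from the question; upload_one(), the (drive_url, text) pairs and the worker count are assumptions, not a tested implementation:

from concurrent.futures import ThreadPoolExecutor

def upload_one(graph_client, drive_url, text):
    # One upload, same PUT call as in the question's code.
    headers = {"Content-Type": "text/plain"}
    resp = graph_client.put(drive_url, data=text, headers=headers)
    return drive_url if resp.ok else None

def upload_many(graph_client, items, max_workers=10):
    # items: iterable of (drive_url, text) pairs; uploads run in parallel
    # threads so slow individual uploads overlap instead of queuing up.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(upload_one, graph_client, url, text)
                   for url, text in items]
        return [f.result() for f in futures]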
Reference:
http://dag.wiee.rs/home-made/unoconv/
Hope this helps.

Effectively requesting and processing multiple HTML files with Python

I am writing a tool which fetches multiple HTML files and processes them as text:
import requests

for url in url_list:
    url_response = requests.get(url)
    text = url_response.text
    # Process text here (put in database, search, etc)
The problem is that this is pretty slow. If I just needed a simple response I could have used grequests, but since I need the content of the HTML file, that does not seem to be an option. How can I speed this up?
Thanks in advance!
import requests
from multiprocessing import Pool

def process_html(url):
    url_response = requests.get(url)
    text = url_response.text
    print(text[:500])
    print('-' * 30)

urls = [
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
    'http://www.apple.com',
    'http://www.yahoo.com',
    'http://www.google.com',
]

with Pool(None) as p:          # None => use os.cpu_count() processes
    p.map(process_html, urls)  # blocks until all calls to process_html() have returned
Use a thread for each request:
import threading
import requests

url_list = ["url1", "url2"]

def fetch_url(url):
    url_response = requests.get(url)
    text = url_response.text

threads = [threading.Thread(target=fetch_url, args=(url,)) for url in url_list]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
You need to use threading, putting each requests.get(...) call in a different thread so the URLs are fetched in parallel.
See these two answers on SO for examples and usage:
Python - very simple multithreading parallel URL fetching (without queue)
Multiple requests using urllib2.urlopen() at the same time
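For reference, here is a minimal sketch of the same threaded-fetch idea using concurrent.futures (Python 3; illustrative only, not taken from the linked answers):

from concurrent.futures import ThreadPoolExecutor
import requests

url_list = ["url1", "url2"]

def fetch_url(url):
    text = requests.get(url, timeout=10).text
    # Process text here (put in database, search, etc.)
    return text

with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(fetch_url, url_list))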

Asynchronously get and store images in python

The following code is a sample of non-asynchronous code, is there any way to get the images asynchronously?
import urllib

for x in range(0, 10):
    urllib.urlretrieve("http://test.com/file %s.png" % (x), "temp/file %s.png" % (x))
I have also looked at the grequests library, but I couldn't figure out from the documentation whether this is possible or how to do it.
You don't need any third-party library. Just create a thread for every request, start the threads, and then either wait for all of them to finish or continue running your application while the images download in the background.
import urllib
import threading

results = []

def getter(url, dest):
    results.append(urllib.urlretrieve(url, dest))

threads = []
for x in range(0, 10):
    t = threading.Thread(target=getter, args=('http://test.com/file %s.png' % x,
                                              'temp/file %s.png' % x))
    t.start()
    threads.append(t)

# Wait for all threads to finish.
# You can continue doing whatever you want and
# join the threads when you finally need the results.
# They will fetch your urls in the background without
# blocking your main application.
map(lambda t: t.join(), threads)
Optionally, you can create a thread pool that gets URLs and destinations from a queue.
If you're using Python 3, this is already implemented for you in the concurrent.futures module.
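A minimal sketch of that thread-pool idea with concurrent.futures (Python 3); the URL pattern and file names just mirror the question and are placeholders:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

def getter(x):
    # Download one image; URL/file pattern mirrors the question's example.
    return urlretrieve("http://test.com/file %s.png" % x, "temp/file %s.png" % x)

with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(getter, range(10)))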
Something like this should help you:
import grequests

urls = ['url1', 'url2']  # ... this should be the full list of urls

requests = (grequests.get(u) for u in urls)
responses = grequests.map(requests)

for response in responses:
    if 199 < response.status_code < 400:
        name = generate_file_name()  # generate some name for your image file, with an extension like example.jpg
        with open(name, 'wb') as f:  # or save to S3 or something like that
            f.write(response.content)
Here only the downloading of the images is parallel; writing each image's content to a file is sequential, so you could create a thread, or do something else, to make that part parallel or asynchronous as well.
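One hedged way to make the file writes concurrent as well is to hand each response to a small thread pool; generate_file_name() here is the hypothetical helper from the snippet above, not a real library function:

from concurrent.futures import ThreadPoolExecutor
import grequests

def save(response):
    # Skip failed requests (grequests returns None for them) and bad statuses.
    if response is not None and 199 < response.status_code < 400:
        name = generate_file_name()  # hypothetical helper from the answer above
        with open(name, 'wb') as f:
            f.write(response.content)

urls = ['url1', 'url2']
responses = grequests.map(grequests.get(u) for u in urls)

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(save, responses))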

Processing Result outside For Loop in Python

I have this simple code which fetches pages via urllib2:
import urllib2
from urlparse import urljoin

browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
user_string_url = "http://www.useragentstring.com/pages/"

for eachBrowser in browser_list:
    result = urllib2.urlopen(urljoin(user_string_url, eachBrowser))
Now I can read the result via result.read(), but I was wondering whether all this functionality can be done outside the for loop, because otherwise the other URLs to be fetched have to wait until the current result has been processed.
I want to process the results outside the for loop. Can this be done?
One of the ways to do this may be to store the results in a dictionary. What you can do is:
result = {}
for eachBrowser in browser_list:
    result[eachBrowser] = urllib2.urlopen(urljoin(user_string_url, eachBrowser))

and use result[BrowserName] outside the loop.
Hope this helps.
If you simply want to access all results outside the loop, just append them to a list or dictionary as in the answer above.
Or, if you are trying to speed up your task, try multithreading.
import threading
import urllib2
from urlparse import urljoin

class myThread(threading.Thread):
    def __init__(self, result):
        threading.Thread.__init__(self)
        self.result = result

    def run(self):
        # process your result (as self.result) here
        pass

browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
user_string_url = "http://www.useragentstring.com/pages/"

for eachBrowser in browser_list:
    result = urllib2.urlopen(urljoin(user_string_url, eachBrowser))
    # start processing the result on another thread and continue the loop without waiting
    myThread(result).start()
It's a simple way of multithreading. It may break depending on your result processing, so consider reading the documentation and some examples before you try it.
You can use threads for this:
import threading
import urllib2
from urlparse import urljoin

def worker(url):
    res = urllib2.urlopen(url)
    data = res.read()
    res.close()

browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
user_string_url = 'http://www.useragentstring.com/'

for browser in browser_list:
    url = urljoin(user_string_url, browser)
    threading.Thread(target=worker, args=[url]).start()

# wait for everyone to complete
for thread in threading.enumerate():
    if thread == threading.current_thread():
        continue
    thread.join()
Are you using Python 3? If so, you can use futures for this task:
from urllib.request import urlopen
from urllib.parse import urljoin
from concurrent.futures import ThreadPoolExecutor

browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet+Explorer', 'Opera']
user_string_url = "http://www.useragentstring.com/pages/"

def process_request(url, future):
    print("Processing:", url)
    print("Reading data")
    print(future.result().read())

with ThreadPoolExecutor(max_workers=10) as executor:
    submit = executor.submit
    for browser in browser_list:
        url = urljoin(user_string_url, browser) + '/'
        submit(process_request, url, submit(urlopen, url))
You could also do this with yield:
def collect_browsers():
    browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
    user_string_url = "http://www.useragentstring.com/pages/"
    for eachBrowser in browser_list:
        yield eachBrowser, urllib2.urlopen(urljoin(user_string_url, eachBrowser))

def process_browsers():
    for browser, result in collect_browsers():
        do_something(result)
This is still a synchronous call (browser 2 will not fire until browser 1 has been processed), but it lets you keep the logic for dealing with the results separate from the logic managing the connections. You could of course also use threads to handle the processing asynchronously, with or without yield.
Edit
Just re-read the OP; I should repeat that yield does not provide multi-threaded, asynchronous execution, in case that was not clear from my first answer!
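As a rough sketch of the "threads plus yield" variant mentioned above (Python 2 style to match the snippets in this thread; illustrative only, with the URLs still opened sequentially in the generator and only the processing handed to threads):

import threading
import urllib2
from urlparse import urljoin

def collect_browsers():
    browser_list = ['Chrome', 'Mozilla', 'Safari', 'Internet Explorer', 'Opera']
    user_string_url = "http://www.useragentstring.com/pages/"
    for each_browser in browser_list:
        yield each_browser, urllib2.urlopen(urljoin(user_string_url, each_browser))

def process(browser, result):
    data = result.read()  # do_something(data) goes here
    result.close()

threads = []
for browser, result in collect_browsers():
    t = threading.Thread(target=process, args=(browser, result))
    t.start()
    threads.append(t)
for t in threads:
    t.join()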
