Best way to download files simultaneously with Python? - python

I'm trying to send simultaneous get requests with the Python requests module.
While searching for a solution I've come across lots of different approaches, including grequests, gevent.monkey, requests-futures, threading, multiprocessing...
I'm a little overwhelmed and not sure which one to pick in terms of speed and code readability.
The task is to download < 400 files as fast as possible, all from the same server. Ideally it should output the status of the downloads in the terminal, e.g. print an error or success message per request.

import threading

import requests

def download(webpage):
    requests.get(webpage)
    # Whatever else you need to do to download your resource, put it in here

urls = ['https://www.example.com', 'https://www.google.com', 'https://yahoo.com']  # Populate with resources you wish to download
threads = {}

if __name__ == '__main__':
    for i in urls:
        print(i)
        threads[i] = threading.Thread(target=download, args=(i,))
    for i in threads:
        threads[i].start()
    for i in threads:
        threads[i].join()
    print('successfully done.')
The above code contains a function called download that represents whatever code you have to run to download the resource you're looking for. Then a list is populated with the URLs you wish to download - change these values as you please. The URLs are assembled into a dictionary that holds the threads, so that you can have as many URLs in the list as you want and a separate thread is created for each of them. The threads are each started, then joined.

I would use threading, as it is not necessary to run the downloads on multiple cores the way multiprocessing does.
So write a function that contains the requests.get() call and start it as a thread.
But remember that your internet connection has to be fast enough, otherwise it won't be worth it.
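For reference, here is a minimal sketch of that idea using requests with a ThreadPoolExecutor, which also prints a success or error message per URL as asked. The URL list, the worker count and the downloads directory are placeholders, not part of the original answer.

import os
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

urls = ['https://www.example.com/file1.zip',
        'https://www.example.com/file2.zip']  # replace with your ~400 URLs
dest_dir = 'downloads'  # placeholder output directory

def download(session, url):
    # Stream the response to disk and return the URL so the caller
    # can report which download this result belongs to.
    local_name = os.path.join(dest_dir, url.rsplit('/', 1)[-1] or 'index.html')
    with session.get(url, timeout=30, stream=True) as resp:
        resp.raise_for_status()
        with open(local_name, 'wb') as fh:
            for chunk in resp.iter_content(chunk_size=8192):
                fh.write(chunk)
    return url

if __name__ == '__main__':
    os.makedirs(dest_dir, exist_ok=True)
    # A single Session reuses TCP connections to the (single) server.
    with requests.Session() as session, ThreadPoolExecutor(max_workers=20) as pool:
        futures = {pool.submit(download, session, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                fut.result()
                print(f'success: {url}')
            except Exception as exc:
                print(f'error: {url} ({exc})')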

Related

Scraping multiple webpages at once with Selenium

I am using selenium and Python to do a big project. I have to go through 320,000 webpages (320K) one by one, scrape details, sleep for a second and move on.
Like below:
links = ["https://www.thissite.com/page=1","https://www.thissite.com/page=2", "https://www.thissite.com/page=3"]
for link in links:
browser.get(link )
scrapedinfo = browser.find_elements_by_xpath("*//div/productprice").text
open("file.csv","a+").write(scrapedinfo)
time.sleep(1)
The greatest problem: it's too slow!
With this script it will take days or maybe weeks.
Is there a way to increase the speed, such as by visiting multiple links at the same time and scraping them all at once?
I have spent hours searching for answers on Google and Stack Overflow and only found material about multiprocessing.
But I am unable to apply it in my script.
Threading approach
You should start with threading.Thread; it will give you a considerable performance boost (explained here). Threads are also lighter than processes. You can use a futures.ThreadPoolExecutor with each thread using its own webdriver. Consider also adding the headless option for your webdriver. Example below using a Chrome webdriver:
from concurrent import futures

from selenium import webdriver

def selenium_work(url):
    chromeOptions = webdriver.ChromeOptions()
    chromeOptions.add_argument("--headless")
    driver = webdriver.Chrome(options=chromeOptions)
    driver.get(url)
    # <actual work that needs to be done by selenium>
    driver.quit()

# The default number of threads is derived from the CPU count,
# but you can set it explicitly with `max_workers`,
# e.g. `futures.ThreadPoolExecutor(max_workers=...)`.
with futures.ThreadPoolExecutor() as executor:
    # store the url for each thread in a dict, so we know which thread fails
    future_results = {url: executor.submit(selenium_work, url) for url in links}
    for url, future in future_results.items():
        try:
            future.result()  # `timeout` can be passed to wait a maximum number of seconds per thread
        except Exception as exc:  # a thread may raise an exception
            print('url {0} generated an exception: {1}'.format(url, exc))
Consider also storing the chrome-driver instance initialized on each thread using threading.local(). From here they reported a reasonable performance improvement.
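A rough sketch of that threading.local() idea (the helper names here are illustrative, not from the linked report): each worker thread lazily creates one headless Chrome driver and reuses it for every URL it handles, instead of starting a new browser per task.

import threading
from concurrent import futures

from selenium import webdriver

thread_local = threading.local()

def get_driver():
    # Reuse the driver already created for this thread, if any.
    driver = getattr(thread_local, 'driver', None)
    if driver is None:
        options = webdriver.ChromeOptions()
        options.add_argument('--headless')
        driver = webdriver.Chrome(options=options)
        thread_local.driver = driver
    return driver

def selenium_work(url):
    driver = get_driver()
    driver.get(url)
    # <actual scraping work goes here>
    return url

# links = [...]  # your list of URLs
# with futures.ThreadPoolExecutor(max_workers=4) as executor:
#     executor.map(selenium_work, links)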
Consider whether using BeautifulSoup directly on the page source from Selenium can give you some additional speed-up. It is a very fast and well-established package. Example: something like driver.get(url) ... soup = BeautifulSoup(driver.page_source, "lxml") ... result = soup.find('a')
Other approaches
Although I personally have not seen much benefit from using concurrent.futures.ProcessPoolExecutor(), you could experiment with it. In fact, it was slower than threads in my experiments on Windows. On Windows you also have many limitations for Python processes.
Consider whether your use case can be satisfied by arsenic, an asynchronous WebDriver client built on asyncio. It sounds really promising, though it has many limitations.
Consider whether Requests-HTML solves your problem with JavaScript-rendered content, since it claims full JavaScript support. In that case you could combine it with BeautifulSoup in a standard data-scraping workflow, as sketched below.
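For illustration, a small sketch of that Requests-HTML plus BeautifulSoup combination; the URL and the productprice selector are placeholders taken from the question, and render() is only needed when the content is built by JavaScript.

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.thissite.com/page=1')
r.html.render()  # executes the page's JavaScript (downloads Chromium on first use)

soup = BeautifulSoup(r.html.html, 'lxml')
prices = [div.get_text(strip=True) for div in soup.find_all('div', class_='productprice')]
print(prices)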
You can use parallel execution. Divide the list of sites into, e.g., ten test classes that all use the same code, just with different method names (method1, method2, method3, ...). You will increase the speed. The number of browsers you can run depends on your hardware.
See more on https://www.guru99.com/sessions-parallel-run-and-dependency-in-selenium.html
The main thing is to use TestNG and edit the .xml file to set how many threads you want to use, like this:
<suite name="TestSuite" thread-count="10" parallel="methods" >
If you are not scraping a website with heavy bot protection, it is better to use Requests; it will reduce your time from days to a couple of hours. Implement multi-threading combined with multi-processing. The steps are too long to go over in full, but here is the idea:
import concurrent.futures
from multiprocessing import Process

def threader_run(data):
    # `scrapper` is your own scraping function, defined elsewhere
    futures = []
    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
        for i in data:
            futures.append(executor.submit(scrapper, i))
        for future in concurrent.futures.as_completed(futures):
            print(future.result())

data = {}
data['process1'] = []
data['process2'] = []
data['process3'] = []

if __name__ == "__main__":
    jobs = []
    for x in data:
        p = Process(target=threader_run, args=(data[x],))
        jobs.append(p)
        p.start()
        print(f'Started - {x}')
Basically, what this does is first compile all the links, then split them into three arrays so that three processes run simultaneously (you could run more processes depending on your CPU cores and how data-intensive these jobs are). After that, you could split those arrays further, into more than 10 or even 100 pieces depending on your project size. Each process runs a thread pool with a maximum of 8 workers, which then calls your final function.
With 3 processes and 8 workers each you are looking at roughly a 24x speed boost. However, using the Requests library is necessary; if you use Selenium for this, a normal computer/laptop will freeze, because it would mean 24 browsers running simultaneously.

Python: running multiple web requests in parallel

I'm new to Python and I have a basic question but I'm struggling to find an answer online, because a lot of the examples online seem to refer to deprecated APIs, so sorry if this has been asked before.
I'm looking for a way to execute multiple (similar) web requests in parallel, and retrieve the result in a list.
The synchronous version I have right now is something like:
from requests import get

urls = ['http://example1.org', 'http://example2.org', '...']

def getResult(urls):
    result = []
    for url in urls:
        result.append(get(url).json())
    return result
I'm looking for the asynchronous equivalent (where all the requests are made in parallel, but I then wait for all of them to be finished before returning the global result).
From what I saw I have to use async/await and aiohttp but the examples seemed way too complicated for the simple task I'm looking for.
Thanks
I am going to try to explain the simplest possible way to achieve what you want. I'm sure there are cleaner/better ways to do this, but here it goes.
You can perform what you want using the Python "threading" library. You can use it to create a separate thread for each request, run all the threads concurrently, and collect the answers.
Since you are new to Python, to simplify things further I am using a global list called RESULTS to store the results of get(url) rather than returning them from the function.
import threading

from requests import get

RESULTS = []  # List to store the results

# Request a single URL and store the result in the global RESULTS
def getSingleResult(url):
    global RESULTS
    RESULTS.append((url, get(url).json()))

# Your original function
def getResult(urls):
    ths = []
    for url in urls:
        th = threading.Thread(target=getSingleResult, args=(url,))  # Create a thread
        th.start()  # Start it
        ths.append(th)  # Add it to a thread list
    for th in ths:
        th.join()  # Wait for all threads to finish
Using the global RESULTS list is just to make things easier, rather than collecting the results from the threads directly. If you wish to do that, you can check out this answer: How to get the return value from a thread in Python?
One thing to note: multi-threading in Python doesn't provide true parallelism but rather concurrency, especially on the standard Python implementation, due to what is known as the Global Interpreter Lock.
However, for your I/O-bound use case it will still provide the speed-up you need.
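If you do want the return values without a global list, a minimal sketch using concurrent.futures (assuming get comes from requests, as in your snippet) could look like this; map() preserves the order of the input URLs:

from concurrent.futures import ThreadPoolExecutor

from requests import get

def getResult(urls):
    # Each URL is fetched in its own worker thread; results come back in order.
    with ThreadPoolExecutor(max_workers=10) as pool:
        return list(pool.map(lambda url: get(url).json(), urls))

# result = getResult(['http://example1.org', 'http://example2.org'])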

speeding up urllib.urlretrieve

I am downloading pictures from the internet, and as it turns out, I need to download lots of pictures. I am using a version of the following code fragment (in practice looping through the links I intend to download and downloading each picture):
import urllib
urllib.urlretrieve(link, filename)
I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow given the number of pictures I need to download.
For efficiency, I set a timeout of 5 seconds (still, many downloads take much longer):
import socket
socket.setdefaulttimeout(5)
Besides running a job on a computer cluster to parallelize downloads, is there a way to make the picture download faster / more efficient?
My code above was very naive, as it did not take advantage of multi-threading. It obviously takes time for URL requests to be answered, but there is no reason why the computer cannot make further requests while waiting for the server to respond.
With the following adjustments, you can improve efficiency by 10x - and there are further ways to improve efficiency, with packages such as Scrapy.
To add multi-threading, do something like the following, using the multiprocessing package:
1) Encapsulate the URL retrieval in a function:
import urllib.request

def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass  # skip links that fail to download
2) Then create a collection with all the URLs, as well as the names you want for the downloaded pictures:
urls = [url1,url2,url3,urln]
names = [i for i in range(0,len(urls))]
3) Import the Pool class from the multiprocessing package and create an object of that class (obviously, in a real program you would put all imports at the top of your code):
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)
Then use the pool.starmap() method and pass it the function and the arguments of the function:
results = pool.starmap(geturl, zip(urls, names))
Note: pool.starmap() only works in Python 3.
When a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request (this is called a context switch) and is not resumed until the I/O operation is completed.
Context switching is quite a heavy operation. It requires us to save the state of our program (losing any sort of caching we had at the CPU level) and give up the use of the CPU. Later, when we are allowed to run again, we must spend time reinitializing our program on the motherboard and getting ready to resume (of course, all this happens behind the scenes).
With concurrency, on the other hand, we typically have a thing called an “event loop” running that manages what gets to run in our program, and when. In essence, an event loop is simply a list of functions that need to be run. The function at the top of the list gets run, then the next, etc.
The following shows a simple example of an event loop:
from Queue import Queue
from functools import partial

eventloop = None

class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()
            function()

def do_hello():
    global eventloop
    print "Hello"
    eventloop.put(do_world)

def do_world():
    global eventloop
    print "world"
    eventloop.put(do_hello)

if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()
If the above seems like something you might use, and you'd also like to see how gevent, tornado, and asyncio can help with your issue, then head out to your (university) library, check out High Performance Python by Micha Gorelick and Ian Ozsvald, and read pp. 181-202.
Note: above code and text are from the book mentioned.

Looping through a paginated api asynchronously

I'm currently ingesting data through an API that returns close to 100,000 documents in a paginated fashion (100 per page). I currently have some code that roughly functions as follows:
while c <= limit:
    if not api_url:
        break
    req = urllib2.Request(api_url)
    opener = urllib2.build_opener()
    f = opener.open(req)
    response = simplejson.load(f)
    for item in response['documents']:
        # DO SOMETHING HERE
        pass
    if 'more_url' in response:
        api_url = response['more_url']
    else:
        api_url = None
        break
    c += 1
Downloading the data this way is really slow and I was wondering if there is any way to loop through the pages in an async way. I have been recommended to take a look at twisted, but I am not entirely sure how to proceed.
What you have here is that you do not know up front what will be read next unless you actually call the API. Think of it like this: what can you do in parallel?
I do not know how much you can do in parallel, or which tasks exactly, but let's try...
Some assumptions:
- you can retrieve data from the API without penalties or limits
- data processing of one page/batch can be done independently of the others
What is slow is the I/O, so you can immediately split your code into two tasks running in parallel: one that reads data, puts it into a queue, and keeps reading until it hits the limit or an empty response (pausing if the queue is full),
and a second task that takes data from the queue and does something with it (see the sketch below).
So one task can feed the other.
Another approach is to have one task call the other immediately after its data is read, so their execution runs in parallel but slightly shifted.
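A bare-bones sketch of that reader/processor split, using plain threads and a bounded queue rather than Celery (the more_url and documents fields are taken from your snippet; the processing step is a placeholder):

import queue
import threading

import requests

page_queue = queue.Queue(maxsize=10)  # the reader pauses when the queue is full

def reader(start_url):
    url = start_url
    while url:
        data = requests.get(url).json()
        page_queue.put(data)
        url = data.get('more_url')
    page_queue.put(None)  # sentinel: nothing more to read

def processor():
    while True:
        data = page_queue.get()
        if data is None:
            break
        for item in data.get('documents', []):
            pass  # DO SOMETHING HERE

# t1 = threading.Thread(target=reader, args=(api_url,))
# t2 = threading.Thread(target=processor)
# t1.start(); t2.start(); t1.join(); t2.join()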
How would I implement it? As Celery tasks, and yes, with requests.
For example, here is the second approach:
@task
def do_data_process(data):
    # do something with data
    pass

@task
def parse_one_page(url):
    response = requests.get(url)
    data = response.json()
    if 'more_url' in data:
        parse_one_page.delay(data['more_url'])
    # and here do the data processing in this task
    do_data_process(data)
    # or call a worker and try to do it in another process
    # do_data_process.delay(data)
It is up to you how many tasks you run in parallel. If you add limits to your code, you can even have workers on multiple machines and separate queues for parse_one_page and do_data_process.
Why this approach, and not twisted or asyncio?
Because you have CPU-bound data processing (parsing the JSON, then the data itself), and for that it is better to have separate processes, and Celery works perfectly with them.

Python Socket and Thread pooling, how to get more performance?

I am trying to implement a basic lib to issue HTTP GET requests. My goal is to receive data through socket connections - a minimalistic design to improve performance - for use with threads and thread pool(s).
I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:
hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...
I intend to use sockets because of performance issues. I intend to use a number of sockets that stay connected (if possible, and it usually is) and issue HTTP GET requests over them. The idea came from urllib's low performance on consecutive requests; then I met urllib3, then I realized it uses httplib, and then I decided to try sockets. So here's what I have accomplished so far:
GETSocket class, SocketPool class, ThreadPool and Worker classes
GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
So, I use these classes like that:
sp = Comm.SocketPool(host, size=self.poolsize, timeout=5)
for link in linklist:
    pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
    self.count += 1
pool.wait_completion()
pass
The __get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to the results list. I am using a pool of 5 threads which shares a socket pool of 5 GETSocket instances.
What I wonder is: is there any other way I can improve the performance of this system?
I've read about asyncore here, but I couldn't figure out how to reuse the same socket connection with the class HTTPClient(asyncore.dispatcher) it provides.
Another point: I don't know whether I'm using a blocking or a non-blocking socket, which of the two would be better for performance, or how to implement either one.
Please be specific about your experiences. I don't intend to import another library just to do HTTP GET, so I want to code my own tiny library.
Any help appreciated, thanks.
Do this.
Use multiprocessing. http://docs.python.org/library/multiprocessing.html.
Write a worker Process which puts all of the URLs into a Queue.
Write a worker Process which gets a URL from a Queue and does a GET, saving a file and putting the File information into another Queue. You'll probably want multiple copies of this Process. You'll have to experiment to find how many is the correct number.
Write a worker Process which reads file information from a Queue and does whatever it is that you're trying do.
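A skeletal version of that three-stage pipeline, with urllib.request standing in for the GET step and placeholder URLs; tune the number of fetcher processes experimentally as suggested:

import multiprocessing
import urllib.request

def feeder(urls, url_queue, n_fetchers):
    # Stage 1: put all URLs into a queue, then one stop signal per fetcher.
    for url in urls:
        url_queue.put(url)
    for _ in range(n_fetchers):
        url_queue.put(None)

def fetcher(url_queue, info_queue):
    # Stage 2: get a URL, do the GET, save a file, pass the file info along.
    while True:
        url = url_queue.get()
        if url is None:
            break
        filename = url.rsplit('/', 1)[-1] or 'index.html'
        urllib.request.urlretrieve(url, filename)
        info_queue.put((url, filename))
    info_queue.put(None)

def consumer(info_queue, n_fetchers):
    # Stage 3: read file information and do whatever you're trying to do.
    finished = 0
    while finished < n_fetchers:
        item = info_queue.get()
        if item is None:
            finished += 1
            continue
        url, filename = item
        print('done:', url, '->', filename)

if __name__ == '__main__':
    urls = ['http://hostname1.com/a', 'http://hostname2.org/b']  # placeholder
    n_fetchers = 4
    url_q = multiprocessing.Queue()
    info_q = multiprocessing.Queue()
    procs = [multiprocessing.Process(target=feeder, args=(urls, url_q, n_fetchers)),
             multiprocessing.Process(target=consumer, args=(info_q, n_fetchers))]
    procs += [multiprocessing.Process(target=fetcher, args=(url_q, info_q))
              for _ in range(n_fetchers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()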
I finally found a path that solved my problem. I was using Python 3 for my project and my only option was to use pycurl, so I had to port my project back to the Python 2.7 series.
Using pycurl, I gained:
- Consistent responses to my requests (my script actually has to deal with a minimum of 10k URLs)
- With the ThreadPool class, I receive responses as fast as my system can handle (the received data is processed later, so multiprocessing is not much of a possibility here)
I tried httplib2 first, but I realized that it does not behave as solidly on Python 3 as it does on Python 2; by switching to pycurl I lost caching support.
Final conclusion: when it comes to HTTP communication, one may need a tool like (py)curl at their disposal. It is a lifesaver, especially when dealing with loads of URLs (try it sometimes for fun: you will get lots of weird responses from them).
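For readers landing here, a rough sketch of what a pycurl fetch driven by a thread pool can look like (this is not the author's actual code; the URLs, timeout and pool size are placeholders):

from io import BytesIO
from multiprocessing.dummy import Pool as ThreadPool

import pycurl

def fetch(url):
    # One Curl handle per call; the body is collected into an in-memory buffer.
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEDATA, buf)
    c.setopt(pycurl.FOLLOWLOCATION, True)
    c.setopt(pycurl.TIMEOUT, 30)
    try:
        c.perform()
        return url, c.getinfo(pycurl.RESPONSE_CODE), buf.getvalue()
    finally:
        c.close()

# urls = [...]  # the 10k+ URLs mentioned above
# pool = ThreadPool(20)
# for url, status, body in pool.imap_unordered(fetch, urls):
#     print(status, url, len(body))
# pool.close(); pool.join()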
Thanks for the replies, folks.
