Parallelizing loop for downloading data - python
I'm new to Python. I want to run a simple script in Google App Engine that retrieves many files into an object as quickly as possible. Would parallelization be a smart option and how would I go about doing it? Thanks in advance for the brainstorming
import requests
...
theData=[]
for q in range(0, len(theURLs)):
    r = requests.get(theURLs[q])
    theData.insert(q, r.text)
In "regular" Python this is pretty simple.
from multiprocessing.pool import ThreadPool
import requests
responses = ThreadPool(10).map(requests.get, urls)
Replace 10 with the number of threads that gives the best results for you.
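If you need the results lined up with the URLs, as in the question, a minimal sketch (assuming theURLs is the list from the question; the URLs here are placeholders) could look like this:

from multiprocessing.pool import ThreadPool
import requests

theURLs = ["http://example.com/a", "http://example.com/b"]  # placeholder URLs

# map() preserves input order, so theData lines up with theURLs
responses = ThreadPool(10).map(requests.get, theURLs)
theData = [r.text for r in responses]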
However, you specified GAE, which has restrictions on spawning threads/processes and its own async approach, built on the asynchronous functions of the URL Fetch service, something along these lines (untested):
rpcs = [urlfetch.create_rpc() for url in urls]
for rpc, url in zip(rpcs, urls):
    urlfetch.make_fetch_call(rpc, url)
results = [rpc.get_result() for rpc in rpcs]
You will need to add error handling...
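As a rough, untested sketch of that error handling (assuming the Python 2 GAE runtime), rpc.get_result() raises urlfetch exceptions such as urlfetch.DownloadError when a fetch fails, so the collection step could be wrapped like this:

from google.appengine.api import urlfetch

results = []
for rpc, url in zip(rpcs, urls):
    try:
        results.append(rpc.get_result())  # raises if the fetch failed
    except urlfetch.DownloadError:
        results.append(None)  # or log/retry the failing url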
You should make your code more Pythonic by using list comprehensions:
# A list of tuples
theData = [(q, requests.get(url).text) for q, url in enumerate(theURLs)]
# ... or ...
# A list of lists
theData = [[q, requests.get(url).text] for q, url in enumerate(theURLs)]
If you want to retrieve the files concurrently, use the threading library. This website has some good examples and might be good practice:
http://www.tutorialspoint.com/python/python_multithreading.htm
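For a rough idea of what that looks like with plain threading.Thread objects (a sketch only; theURLs stands in for your real list), each thread writes its result into its own preallocated slot, so no lock is needed for the list itself:

import threading
import requests

theURLs = ["http://example.com/a", "http://example.com/b"]  # placeholder URLs
theData = [None] * len(theURLs)

def fetch(i, url):
    theData[i] = requests.get(url).text  # each thread writes only to its own slot

threads = [threading.Thread(target=fetch, args=(i, url)) for i, url in enumerate(theURLs)]
for t in threads:
    t.start()
for t in threads:
    t.join()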
I seriously doubt it. Parallelization can really only speed up calculations, while the bottleneck here is data transfer.
Related
How to execute requests.get without attachment Python
Right now I am trying to execute asynchronous requests without any tie-in to each other, similar to how FTP can upload/download more than one file at once. I am using the following code:

rec = requests.get("https://url", stream=True)

with rec.raw.read() to get the responses. But I would like to be able to execute this same piece of code much faster, without waiting for the server to respond, which takes about 2 seconds each time.
The easiest way to do something like that is to use threads. Here is a rough example of one of the ways you might do this.

import requests
from multiprocessing.dummy import Pool  # the exact import depends on your python version

pool = Pool(4)  # the number represents how many jobs you want to run in parallel

def get_url(url):
    rec = requests.get(url, stream=True)
    return rec.raw.read()

for result in pool.map(get_url, ["http://url/1", "http://url/2"]):
    do_things(result)
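If you want to start handling responses as soon as they arrive rather than waiting for the whole batch, the same thread-based Pool also offers imap_unordered. A sketch (the URLs are the same placeholders as above):

import requests
from multiprocessing.dummy import Pool  # thread-based Pool

def get_url(url):
    rec = requests.get(url, stream=True)
    return rec.raw.read()

pool = Pool(4)
# yields results in completion order instead of submission order
for result in pool.imap_unordered(get_url, ["http://url/1", "http://url/2"]):
    print(len(result))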
Lost HTTPS requests with parallel processing
I use the two following class methods to request information from the Questrade API (http://www.questrade.com/api/documentation/rest-operations/market-calls/markets-quotes-id). I have over 11,000 stock symbols, and I query the Questrade API in batches of 100 symbols.

import requests
from joblib import Parallel, delayed

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    result.extend(response.json().get(key))

Parallel(n_jobs=-1, backend="threading")(
    delayed(self.parallel_request)(elem, self.symbol_ids_list, self.uri, 'symbols')
    for elem in self.batch_result
)

If I make over 110 HTTPS requests with the Parallel class, then instead of getting 11,000 outputs I get 10,500 or 10,600. So I lose data with parallel processing. Be aware that I used two Python modules here, i.e. joblib (https://github.com/joblib/joblib/issues/651) and requests (https://github.com/requests/requests).

The following for loop worked perfectly, so I know my problem is with the Parallel class.

for elem in self.batch_result:
    response = requests.get(''.join((self.uri, elem)), headers=self.headers)
    self.symbol_ids_list.extend(response.json().get('symbols'))

How could I increase the performance of the last for loop without losing data?

UPDATE

A sample of self.batch_result (simplified result) could be

['AAME,ABAC,ABIL,ABIO,ACERW,ACHN,ACHV,ACRX,ACST,ACTG,ADMA,ADMP,ADOM,ADXS,ADXSW,AEHR,AEMD,AETI,AEY,AEZS,AFMD,AGFSW,AGRX,AGTC,AHPAW,AHPI,AIPT,AKER,AKTX,ALIM,ALJJ,ALQA,ALSK,ALT,AMCN,AMDA,AMMA,AMRH,AMRHW,AMRN,AMRWW,AMTX,ANDAR,ANDAW,ANTH,ANY,APDN,APDNW,APOPW,APPS,APRI,APTO,APVO,APWC,AQB,AQMS,ARCI,ARCW,ARDM,AREX,ARGS,ARLZ,ARQL,ARTW,ARTX,ASFI,ASNA,ASRV,ASTC,ATACR,ATEC,ATHX,ATLC,ATOS,ATRS,AUTO,AVEO,AVGR,AVID,AVXL,AWRE,AXAS,AXON,AXSM,AYTU,AZRX,BASI,BBOX,BBRG,BCACR,BCACW,BCLI,BDSI,BHACR,BHACW,BIOC,BIOL,BIOS,BKEP,BKYI',
 'BLDP,BLIN,BLNK,BLNKW,BLPH,BLRX,BMRA,BNSO,BNTC,BNTCW,BOSC,BOXL,BPTH,BRACR,BRACW,BRPAR,BRPAW,BSPM,BSQR,BUR,BURG,BVSN,BVXVW,BWEN,BYFC,CAAS,CADC,CALI,CAPR,CARV,CASI,CASM,CATB,CATS,CBAK,CBLI,CCCL,CCCR,CCIH,CDMO,CDTI,CELGZ,CERCW,CETV,CETX,CETXW,CFBK,CFMS,CFRX,CGEN,CGIX,CGNT,CHCI,CHEK,CHEKW,CHFS,CHKE,CHMA,CHNR,CIDM,CJJD,CKPT,CLDC,CLDX,CLIR,CLIRW,CLNE,CLRB,CLRBW,CLRBZ,CLSN,CLWT,CMSSR,CMSSW,CNACR,CNACW,CNET,CNIT,CNTF,CODA,CODX,COGT,CPAH,CPLP,CPRX,CPSH,CPSS,CPST,CREG,CRIS,CRME,CRNT,CSBR,CTHR,CTIB,CTIC,CTRV,CTXR,CTXRW,CUI',
 'CUR,CVONW,CXDC,CXRX,CYCC,CYHHZ,CYRN,CYTR,CYTX,CYTXW,DARE,DCAR,DCIX,DELT,DEST,DFBG,DFFN,DGLY,DHXM,DLPN,DLPNW,DMPI,DOGZ,DOTAR,DOTAW,DRAD,DRIO,DRIOW,DRRX,DRYS,DSKEW,DSWL,DTEA,DTRM,DXLG,DXYN,DYNT,DYSL,EACQW,EAGLW,EARS,EASTW,EBIO,EDAP,EFOI,EGLT,EKSO,ELECW,ELGX,ELON,ELSE,ELTK,EMITF,EMMS,ENG,ENPH,ENT,EPIX,ESEA,ESES,ESTRW,EVEP,EVGN,EVK,EVLV,EVOK,EXFO,EXXI,EYEG,EYEGW,EYES,EYESW,FCEL,FCRE,FCSC,FFHL,FLGT,FLL,FMCIR,FMCIW,FNJN,FNTEW,FORD,FORK,FPAY,FRAN,FRED,FRSX,FSACW,FSNN,FTD,FTEK,FTFT,FUV,FVE,FWP,GALT,GASS,GCVRZ,GEC']

and self.uri is simply 'https://api01.iq.questrade.com/v1/symbols?names=', as seen in the above Questrade API link.

UPDATE 2

Marat's answer was a good try but didn't give me a better result. The first test gave me 31,356 (or 10,452 if I divide that result by 3) instead of 10,900. The second test just gave me 0, or the process blocked completely. I found out that the maximum allowed requests per second is 20. Link: http://www.questrade.com/api/documentation/rate-limiting. How could I increase the performance of the last for loop without losing data, considering that new information?
If you are not stuck with using joblib, you could try some standard library parallel processing modules. In Python 2/3, multiprocessing.Pool is available and provides functions for mapping a task across parallel worker processes. A simplified version would look like this:

from multiprocessing import Pool
import requests

HEADERS = {}  # define headers here

def parallel_request(symbols):
    response = requests.get('https://api01.iq.questrade.com/v1/symbols?names={}'.format(symbols), headers=HEADERS)
    return response.json()

if __name__ == '__main__':
    p = Pool()
    batch_result = ['AAME,ABAC,ABIL,...',
                    'BLDP,BLIN,BLNK,...',
                    'CUR,CVONW,CXDC,...',
                    ...]
    # will return a list of len(batch_result) responses
    p.map(parallel_request, batch_result)

There are asynchronous and iterable versions of map that you would probably want for larger jobs, and of course you could add parameters to your parallel_request task to avoid hard-coding things like I did. A caveat with using Pool is that any arguments passed to it have to be picklable.

In Python 3 the concurrent.futures module actually has a nice example of multithreaded URL retrieval in the docs. With a little effort you could replace load_url in that example with your parallel_request function. There is a version of concurrent.futures backported to Python 2 as the futures module as well.

These might require a bit more work in refactoring, so if there is a solution that sticks with joblib, feel free to prefer that. On the off-chance that your problem is a bug in joblib, there are plenty of ways you could do this in a multithreaded fashion with the standard library (albeit with some added boilerplate).
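For reference, a sketch of the concurrent.futures variant mentioned above (HEADERS, the base URI, and the batch strings are placeholders taken from the question; the fetch name is made up here):

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

HEADERS = {}  # define headers here
URI = 'https://api01.iq.questrade.com/v1/symbols?names='
batch_result = ['AAME,ABAC,ABIL', 'BLDP,BLIN,BLNK']  # placeholder batches

def fetch(symbols):
    response = requests.get(URI + symbols, headers=HEADERS)
    return response.json().get('symbols')

symbol_ids_list = []
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(fetch, batch) for batch in batch_result]
    for future in as_completed(futures):
        symbol_ids_list.extend(future.result())  # re-raises here if the request failed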
Most likely, it happens because some of the HTTP calls fail due to network load. To test, change parallel_request:

def parallel_request(self, elem, result, url, key):
    for i in range(3):  # 3 retries
        try:
            response = requests.get(''.join((url, elem)), headers=self.headers)
        except IOError:
            continue
        result.extend(response.json().get(key))
        return

Much less likely: list.extend is not thread safe. If the snippet above didn't help, try guarding extend with a lock:

import threading
...

lock = threading.Lock()

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    lock.acquire()
    result.extend(response.json().get(key))
    lock.release()
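Given the 20 requests per second limit mentioned in the question's second update, another thing worth trying (a sketch only, not tested against the Questrade API, and not part of the original answer) is throttling the requests so the retries above are rarely needed:

import threading
import time
import requests

RATE_LIMIT = 20  # max requests per second, per the Questrade rate-limiting docs
_lock = threading.Lock()
_next_slot = [time.time()]

def throttled_get(url, headers=None):
    # Each caller reserves the next free time slot, then sleeps until it arrives,
    # so concurrent threads never exceed RATE_LIMIT requests per second overall.
    with _lock:
        slot = max(_next_slot[0], time.time())
        _next_slot[0] = slot + 1.0 / RATE_LIMIT
    time.sleep(max(0.0, slot - time.time()))
    return requests.get(url, headers=headers)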
Multiprocessing Queue vs. Pool
I'm having the hardest time trying to figure out the difference in usage between multiprocessing.Pool and multiprocessing.Queue. To help, this bit of code is a barebones example of what I'm trying to do.

def update():
    def _hold(url):
        soup = BeautifulSoup(url)
        return soup

    def _queue(url):
        soup = BeautifulSoup(url)
        li = [l for l in soup.find('li')]
        return True if li else False

    url = 'www.ur_url_here.org'
    _hold(url)
    _queue(url)

I'm trying to run _hold() and _queue() at the same time. I'm not trying to have them communicate with each other, so there is no need for a Pipe. update() is called every 5 seconds. I can't really wrap my head around the difference between creating a pool of workers and creating a queue of functions. Can anyone assist me?

The real _hold() and _queue() functions are much more elaborate than the example, so concurrent execution actually is necessary; I just thought this example would suffice for asking the question.
The Pool and the Queue belong to two different levels of abstraction. The Pool of Workers is a concurrent design paradigm which aims to abstract away a lot of logic you would otherwise need to implement yourself when using processes and queues. The multiprocessing.Pool actually uses a Queue internally to operate.

If your problem is simple enough, you can easily rely on a Pool. In more complex cases, you might need to deal with processes and queues yourself.

For your specific example, the following code should do the trick.

def hold(url):
    ...
    return soup

def queue(url):
    ...
    return bool(li)

def update(url):
    with multiprocessing.Pool(2) as pool:
        hold_job = pool.apply_async(hold, args=[url])
        queue_job = pool.apply_async(queue, args=[url])

        # block until hold_job is done
        soup = hold_job.get()

        # block until queue_job is done
        li = queue_job.get()

I'd also recommend you take a look at the concurrent.futures module. As the name suggests, that is the more future-proof implementation of the Pool of Workers paradigm in Python. You can easily re-write the example above with that library, as what really changes is just the API names.
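A possible concurrent.futures rewrite of the same idea (a sketch; hold and queue here are trivial stand-ins for the real BeautifulSoup work, and the return values are made up):

from concurrent.futures import ProcessPoolExecutor

def hold(url):
    return 'soup for %s' % url  # stand-in for the real parsing

def queue(url):
    return bool(url)            # stand-in for the real <li> check

def update(url):
    with ProcessPoolExecutor(max_workers=2) as executor:
        hold_job = executor.submit(hold, url)
        queue_job = executor.submit(queue, url)
        soup = hold_job.result()   # blocks until hold is done
        li = queue_job.result()    # blocks until queue is done
    return soup, li

if __name__ == '__main__':
    print(update('www.ur_url_here.org'))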
How to download multiple files simultaneously and join them in Python?
I have some split files on a remote server. I have tried downloading them one by one and joining them, but it takes a lot of time. I googled and found that simultaneous downloads might speed things up. The script is in Python. My pseudocode is like this:

url1 = something
url2 = something
url3 = something
data1 = download(url1)
data2 = download(url2)
data3 = download(url3)
wait for all downloads to complete
join all data and save

Could anyone point me to a direction by which I can download all the files simultaneously and wait until they are done?

I have tried creating a class, but again I can't figure out how to wait until all are complete. I am more interested in the Threading and Queue features, and I can import them on my platform. I have tried Thread and Queue with an example found on this site (pastebin.com/KkiMLTqR), but it does not wait, or waits forever... not sure.
There are 2 ways to do things simultaneously. Or, really, 2-3/4 or so:

- Multiple threads
- Or multiple processes, especially if the "things" take a lot of CPU power
- Or coroutines or greenlets, especially if there are thousands of "things"
- Or pools of one of the above
- Event loops (coded manually)
- Or hybrid greenlet/event loop systems like gevent.

If you have 1000 URLs, you probably don't want to do 1000 requests at the same time. For example, web browsers typically only do something like 8 requests at a time. A pool is a nice way to do only 8 things at a time, so let's do that.

And, since you're only doing 8 things at a time, and those things are primarily I/O bound, threads are perfect.

I'll implement it with futures. (If you're using Python 2.x, or 3.0-3.1, you will need to install the backport, futures.)

import concurrent.futures

urls = ['http://example.com/foo', 'http://example.com/bar']

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    result = b''.join(executor.map(download, urls))

with open('output_file', 'wb') as f:
    f.write(result)

Of course you need to write the download function, but that's exactly the same function you'd write if you were doing these one at a time. For example, using urlopen (if you're using Python 2.x, use urllib2 instead of urllib.request):

import urllib.request

def download(url):
    with urllib.request.urlopen(url) as f:
        return f.read()

If you want to learn how to build a thread pool executor yourself, the source is actually pretty simple, and multiprocessing.pool is another nice example in the stdlib.

However, both of those have a lot of excess code (handling weak references to improve memory usage, shutting down cleanly, offering different ways of waiting on the results, propagating exceptions properly, etc.) that may get in your way.

If you look around PyPI and ActiveState, you will find simpler designs like threadpool that you may find easier to understand.

But here's the simplest joinable threadpool:

import queue
import threading

class ThreadPool(object):
    def __init__(self, max_workers):
        self.queue = queue.Queue()
        self.workers = [threading.Thread(target=self._worker)
                        for _ in range(max_workers)]
    def start(self):
        for worker in self.workers:
            worker.start()
    def stop(self):
        for _ in range(len(self.workers)):
            self.queue.put(None)  # one sentinel per worker
        for worker in self.workers:
            worker.join()
    def submit(self, job):
        self.queue.put(job)
    def _worker(self):
        while True:
            job = self.queue.get()
            if job is None:
                break
            job()

Of course the downside of a dead-simple implementation is that it's not as friendly to use as concurrent.futures.ThreadPoolExecutor:

import functools

urls = ['http://example.com/foo', 'http://example.com/bar']
results = [None] * len(urls)
results_lock = threading.Lock()

def download(url, i):
    with urllib.request.urlopen(url) as f:
        result = f.read()
    with results_lock:
        results[i] = result

pool = ThreadPool(max_workers=8)
pool.start()
for i, url in enumerate(urls):
    pool.submit(functools.partial(download, url, i))
pool.stop()

result = b''.join(results)
with open('output_file', 'wb') as f:
    f.write(result)
You can use an async framework like Twisted. Alternatively, this is one thing that Python's threads do OK at, since you are mostly I/O bound.
Problems with Speed during web-crawling (Python)
I would love to have this program improve a lot in speed. It reads +- 12,000 pages in 10 minutes. I was wondering if there is something that would help the speed a lot? I hope you guys know some tips. I am supposed to read +- millions of pages... so that would take way too long :( Here is my code:

from eventlet.green import urllib2
import httplib
import time
import eventlet

# Create the URLS in groups of 400 (+- max for eventlet)
def web_CreateURLS():
    print str(str(time.asctime( time.localtime(time.time()) )).split(" ")[3])
    for var_indexURLS in xrange(0, 2000000, 400):
        var_URLS = []
        for var_indexCRAWL in xrange(var_indexURLS, var_indexURLS+400):
            var_URLS.append("http://www.nu.nl")
        web_ScanURLS(var_URLS)

# Return the HTML Source per URL
def web_ReturnHTML(url):
    try:
        return [urllib2.urlopen(url[0]).read(), url[1]]
    except urllib2.URLError:
        time.sleep(10)
        print "UrlError"
        web_ReturnHTML(url)

# Analyse the HTML Source
def web_ScanURLS(var_URLS):
    pool = eventlet.GreenPool()
    try:
        for var_HTML in pool.imap(web_ReturnHTML, var_URLS):
            pass  # do something etc..
    except TypeError:
        pass

web_CreateURLS()
I like using greenlets.. but I often benefit from using multiple processes spread over lots of systems.. or just one single system letting the OS take care of all the checks and balances of running multiple processes.

Check out ZeroMQ at http://zeromq.org/ for some good examples on how to make a dispatcher with a TON of listeners that do whatever the dispatcher says. Alternatively check out execnet for a method of quickly getting started with executing remote or local tasks in parallel.

I also use http://spread.org/ a lot and have LOTS of systems listening to a common spread daemon.. it's a very useful message bus where results can be pooled back to and dispatched from a single thread pretty easily.

And then of course there is always redis pub/sub or sync. :)

"Share the load"
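As a very small taste of that dispatcher/listener pattern (a sketch using pyzmq's PUSH/PULL sockets; the port number is arbitrary and the fetching/parsing is left as a comment), the dispatcher pushes URLs and any number of workers pull from it:

import zmq

def dispatcher(urls, port=5557):
    ctx = zmq.Context()
    sender = ctx.socket(zmq.PUSH)
    sender.bind("tcp://*:%d" % port)
    for url in urls:
        sender.send_string(url)  # fans work out to whichever worker is free

def worker(port=5557):
    ctx = zmq.Context()
    receiver = ctx.socket(zmq.PULL)
    receiver.connect("tcp://localhost:%d" % port)
    while True:
        url = receiver.recv_string()
        # fetch/parse url here, then push results onward or into a message bus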