How to execute requests.get without waiting for each response in Python

Right now I am trying to execute asynchronous requests that are not tied to each other, similar to how FTP can upload / download more than one file at once.
I am using the following code:
rec = requests.get("https://url", stream=True)
with
rec.raw.read()
to get the responses.
But I want to be able to execute the same code much faster, without waiting about 2 seconds each time for the server to respond.

The easiest way to do something like that is to use threads.
Here is a rough example of one of the ways you might do this.
import requests
from multiprocessing.dummy import Pool  # thread-based pool; the exact import depends on your python version

pool = Pool(4)  # the number represents how many jobs you want to run in parallel

def get_url(url):
    rec = requests.get(url, stream=True)
    return rec.raw.read()

for result in pool.map(get_url, ["http://url/1", "http://url/2"]):
    do_things(result)
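If you are on Python 3.2 or later, concurrent.futures gives you the same pattern from the standard library; a minimal sketch under the same assumptions (placeholder URLs and a do_things consumer defined elsewhere):

import requests
from concurrent.futures import ThreadPoolExecutor

def get_url(url):
    rec = requests.get(url, stream=True)
    return rec.raw.read()

with ThreadPoolExecutor(max_workers=4) as executor:
    # executor.map behaves like pool.map: results come back in input order
    for result in executor.map(get_url, ["http://url/1", "http://url/2"]):
        do_things(result)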

Related

Python Multithreading Rest API

I download data over a REST API and wrote a module for it. The download takes, let's say, 10 seconds. During this time, the rest of the script in 'main' and in the module does not run until the download is finished.
How can I change this, e.g. by processing the download on another core?
I tried this code but it does not do the trick (same lag). Then I tried to implement this approach and it just gives me errors; I suspect 'map' does not work with 'wget.download'?
My code from the module:
import os
import urllib.parse
from multiprocessing.dummy import Pool as ThreadPool

import wget

# define the needed data
function = 'TIME_SERIES_INTRADAY_EXTENDED'
symbol = 'IBM'
interval = '1min'
slice = 'year1month1'
adjusted = 'true'
apikey = key[0].rstrip()  # 'key' is read earlier in the module

# create the URL
SCHEME = os.environ.get("API_SCHEME", "https")
NETLOC = os.environ.get("API_NETLOC", "www.alphavantage.co")
PATH = os.environ.get("API_PATH", "query")
query = urllib.parse.urlencode(dict(function=function, symbol=symbol, interval=interval,
                                    slice=slice, adjusted=adjusted, apikey=apikey))
url = urllib.parse.urlunsplit((SCHEME, NETLOC, PATH, query, ''))
# this is my original code to download the data (working, but slow, and it stops the rest of the script)
wget.download(url, 'C:\\Users\\x\\Desktop\\Tool\\RAWdata\\test.csv')

# this is my attempt to speed things up via multithreading
pool = ThreadPool(4)

if __name__ == '__main__':
    futures = []
    for x in range(1):
        futures.append(pool.apply_async(wget.download,
                                        (url, 'C:\\Users\\x\\Desktop\\Tool\\RAWdata\\test.csv')))
    # futures is now a list of AsyncResult objects
    for future in futures:
        print(future.get())
Any suggestions, or do you see the error I am making?
OK, I figured it out, so I will leave it here in case someone else needs it.
I made a module called APIcall which has a function APIcall() that uses wget.download() to download my data.
In main, I create a function (called threaded_APIfunc) which calls the APIcall() function in my module APIcall:
import threading
import APIcall

def threaded_APIfunc():
    APIcall.APIcall(function, symbol, interval, slice, adjusted, apikey)
    print("Data Download complete for ${}".format(symbol))
and then I run threaded_APIfunc within a thread, like so:
threading.Thread(target=threaded_APIfunc).start()
print ('Start Downloading Data for ${}'.format(symbol))
What happens is that the .csv file gets downloaded in the background, while the main loop does not wait until the download is completed; it runs the code that comes after the threading call right away.
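If some later step does need the finished .csv, a small variation is to keep a handle to the thread and join it only at that point; a sketch reusing the names above:

download_thread = threading.Thread(target=threaded_APIfunc)
download_thread.start()
print('Start Downloading Data for ${}'.format(symbol))
# ... code that does not depend on the file runs immediately ...
download_thread.join()  # block here only once the data is actually required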

Lost HTTPS requests with parallel processing

I use the two following class methods to request information from the Questrade API (http://www.questrade.com/api/documentation/rest-operations/market-calls/markets-quotes-id). I have over 11,000 stock symbols, which I request from the Questrade API in batches of 100 symbols.
import requests
from joblib import Parallel, delayed

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    result.extend(response.json().get(key))

Parallel(n_jobs=-1, backend="threading")(
    delayed(self.parallel_request)(elem, self.symbol_ids_list, self.uri, 'symbols')
    for elem in self.batch_result
)
If I make over 110 HTTPS requests with the Parallel class, then instead of 11,000 outputs I get 10,500 or 10,600, so I lose data with parallel processing. Be aware that I use two Python modules here, i.e. joblib (https://github.com/joblib/joblib/issues/651) and requests (https://github.com/requests/requests).
The following for loop worked perfectly, so I know my problem is with the Parallel class.
for elem in self.batch_result:
    response = requests.get(''.join((self.uri, elem)), headers=self.headers)
    self.symbol_ids_list.extend(response.json().get('symbols'))
How could I increase the performance of the last for loop without losing data?
UPDATE
A sample of self.batch_result (simplified result) could be ['AAME,ABAC,ABIL,ABIO,ACERW,ACHN,ACHV,ACRX,ACST,ACTG,ADMA,ADMP,ADOM,ADXS,ADXSW,AEHR,AEMD,AETI,AEY,AEZS,AFMD,AGFSW,AGRX,AGTC,AHPAW,AHPI,AIPT,AKER,AKTX,ALIM,ALJJ,ALQA,ALSK,ALT,AMCN,AMDA,AMMA,AMRH,AMRHW,AMRN,AMRWW,AMTX,ANDAR,ANDAW,ANTH,ANY,APDN,APDNW,APOPW,APPS,APRI,APTO,APVO,APWC,AQB,AQMS,ARCI,ARCW,ARDM,AREX,ARGS,ARLZ,ARQL,ARTW,ARTX,ASFI,ASNA,ASRV,ASTC,ATACR,ATEC,ATHX,ATLC,ATOS,ATRS,AUTO,AVEO,AVGR,AVID,AVXL,AWRE,AXAS,AXON,AXSM,AYTU,AZRX,BASI,BBOX,BBRG,BCACR,BCACW,BCLI,BDSI,BHACR,BHACW,BIOC,BIOL,BIOS,BKEP,BKYI', 'BLDP,BLIN,BLNK,BLNKW,BLPH,BLRX,BMRA,BNSO,BNTC,BNTCW,BOSC,BOXL,BPTH,BRACR,BRACW,BRPAR,BRPAW,BSPM,BSQR,BUR,BURG,BVSN,BVXVW,BWEN,BYFC,CAAS,CADC,CALI,CAPR,CARV,CASI,CASM,CATB,CATS,CBAK,CBLI,CCCL,CCCR,CCIH,CDMO,CDTI,CELGZ,CERCW,CETV,CETX,CETXW,CFBK,CFMS,CFRX,CGEN,CGIX,CGNT,CHCI,CHEK,CHEKW,CHFS,CHKE,CHMA,CHNR,CIDM,CJJD,CKPT,CLDC,CLDX,CLIR,CLIRW,CLNE,CLRB,CLRBW,CLRBZ,CLSN,CLWT,CMSSR,CMSSW,CNACR,CNACW,CNET,CNIT,CNTF,CODA,CODX,COGT,CPAH,CPLP,CPRX,CPSH,CPSS,CPST,CREG,CRIS,CRME,CRNT,CSBR,CTHR,CTIB,CTIC,CTRV,CTXR,CTXRW,CUI', 'CUR,CVONW,CXDC,CXRX,CYCC,CYHHZ,CYRN,CYTR,CYTX,CYTXW,DARE,DCAR,DCIX,DELT,DEST,DFBG,DFFN,DGLY,DHXM,DLPN,DLPNW,DMPI,DOGZ,DOTAR,DOTAW,DRAD,DRIO,DRIOW,DRRX,DRYS,DSKEW,DSWL,DTEA,DTRM,DXLG,DXYN,DYNT,DYSL,EACQW,EAGLW,EARS,EASTW,EBIO,EDAP,EFOI,EGLT,EKSO,ELECW,ELGX,ELON,ELSE,ELTK,EMITF,EMMS,ENG,ENPH,ENT,EPIX,ESEA,ESES,ESTRW,EVEP,EVGN,EVK,EVLV,EVOK,EXFO,EXXI,EYEG,EYEGW,EYES,EYESW,FCEL,FCRE,FCSC,FFHL,FLGT,FLL,FMCIR,FMCIW,FNJN,FNTEW,FORD,FORK,FPAY,FRAN,FRED,FRSX,FSACW,FSNN,FTD,FTEK,FTFT,FUV,FVE,FWP,GALT,GASS,GCVRZ,GEC']
and self.uri is simply 'https://api01.iq.questrade.com/v1/symbols?names=' as seen in the above Questrade API link.
UPDATE 2
Marat's answer was a good try but didn't give me a better result. The first test gave me 31,356 results (or 10,452 if I divide that result by 3) instead of 10,900. The second test just gave me 0, or the process blocked completely.
I found out that the maximum allowed rate is 20 requests per second (see http://www.questrade.com/api/documentation/rate-limiting). How could I increase the performance of the last for loop without losing data, considering that new information?
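Given that cap, one option is to throttle the worker calls themselves so the threaded version never exceeds 20 requests per second. A rough, Questrade-agnostic sketch: a semaphore hands out at most 20 slots and each slot is returned one second after it was taken.

import threading
import requests

RATE_LIMIT = 20  # documented maximum requests per second
rate_slots = threading.BoundedSemaphore(RATE_LIMIT)

def rate_limited_get(url, headers):
    rate_slots.acquire()                              # wait for a free slot in the current window
    threading.Timer(1.0, rate_slots.release).start()  # return the slot one second later
    return requests.get(url, headers=headers)

Calling rate_limited_get instead of requests.get inside parallel_request would keep the parallel version under the limit without giving up concurrency.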
If you are not stuck with joblib you could try some standard-library parallel processing modules. In both Python 2 and 3, multiprocessing.Pool is available and provides functions for mapping a task across a pool of parallel workers. A simplified version would look like this:
from multiprocessing import Pool
import requests

HEADERS = {}  # define headers here

def parallel_request(symbols):
    response = requests.get('https://api01.iq.questrade.com/v1/symbols?names={}'.format(symbols), headers=HEADERS)
    return response.json()

if __name__ == '__main__':
    p = Pool()
    batch_result = ['AAME,ABAC,ABIL,...',
                    'BLDP,BLIN,BLNK,...',
                    'CUR,CVONW,CXDC,...',
                    ...]
    p.map(parallel_request, batch_result)  # will return a list of len(batch_result) responses
There are asynchronous and iterable versions of map that you would probably want for larger jobs, and of course you could add parameters to your parallel_request task to avoid hard-coding things like I did. A caveat with using Pool is that any arguments passed to it have to be picklable.
In python3 the concurrent.futures module actually has a nice example of multithreaded url retrieval in the docs. With a little effort you could replace load_url in that example with your parallel_request function. There is a version of concurrent.futures backported to python2 as the futures module, as well.
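Roughly, that docs example adapted to this case could look like the sketch below (HEADERS and batch_result as in the Pool snippet above; fetch_symbols is just a renamed variant of parallel_request):

from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

def fetch_symbols(symbols):
    response = requests.get('https://api01.iq.questrade.com/v1/symbols?names={}'.format(symbols),
                            headers=HEADERS)
    return response.json().get('symbols')

symbol_ids_list = []
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(fetch_symbols, chunk) for chunk in batch_result]
    for future in as_completed(futures):
        # results are collected in the main thread, so the shared list is never touched concurrently
        symbol_ids_list.extend(future.result())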
These might require a bit more refactoring work, so if there is a solution that sticks with joblib, feel free to prefer that. On the off chance that your problem is a bug in joblib, there are plenty of ways to do this in a multithreaded fashion with the standard library (albeit with some added boilerplate).
Most likely, this happens because some of the HTTP calls fail due to network load. To test, change parallel_request:
def parallel_request(self, elem, result, url, key):
    for i in range(3):  # 3 retries
        try:
            response = requests.get(''.join((url, elem)), headers=self.headers)
        except IOError:
            continue
        result.extend(response.json().get(key))
        return
Much less likely: list.extend is not thread safe. If the snippet above didn't help, try guarding extend with a lock:
import threading
...
lock = threading.Lock()

def parallel_request(self, elem, result, url, key):
    response = requests.get(''.join((url, elem)), headers=self.headers)
    lock.acquire()
    result.extend(response.json().get(key))
    lock.release()

Handling multiple HTTP requests in Python

I am mining data from a website through data scraping in Python. I am using the requests package for sending the parameters.
Here is the code snippet in Python:
for param in paramList:
    data = get_url_data(param)

def get_url_data(param):
    post_data = get_post_data(param)
    headers = {}
    headers["Content-Type"] = "text/xml; charset=UTF-8"
    headers["Content-Length"] = str(len(post_data))
    headers["Connection"] = 'Keep-Alive'
    headers["Cache-Control"] = 'no-cache'
    page = requests.post(url, data=post_data, headers=headers, timeout=10)
    data = parse_page(page.content)
    return data
The variable paramList is a list of more than 1000 elements and the endpoint url stays the same. I was wondering if there is a better and faster way to do this?
Thanks
As there is a significant amount of networking I/O involved, threading should improve the overall performance significantly.
You can try using a ThreadPool; test and tweak the number of threads to the one that is best suited to the situation and shows the highest overall performance.
from multiprocessing.pool import ThreadPool

# Remove the 'for param in paramList' iteration

def get_url_data(param):
    ...  # rest of the code from the question here

if __name__ == '__main__':
    pool = ThreadPool(15)
    pool.map(get_url_data, paramList)  # will split the load between the threads nicely
    pool.close()
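Since every request hits the same endpoint, connection reuse tends to matter almost as much as the thread count; a hedged variant of the snippet above that gives each worker thread its own requests.Session (get_post_data, parse_page, url and paramList are the names from the question):

import threading
from multiprocessing.pool import ThreadPool
import requests

thread_local = threading.local()

def get_session():
    # one Session (and therefore one keep-alive connection pool) per worker thread
    if not hasattr(thread_local, 'session'):
        thread_local.session = requests.Session()
    return thread_local.session

def get_url_data(param):
    post_data = get_post_data(param)
    headers = {"Content-Type": "text/xml; charset=UTF-8"}
    page = get_session().post(url, data=post_data, headers=headers, timeout=10)
    return parse_page(page.content)

if __name__ == '__main__':
    pool = ThreadPool(15)
    results = pool.map(get_url_data, paramList)
    pool.close()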
I need to make 1000 POST requests to the same domain, I was wondering if
there is a better and faster way to do this?
It depends. If it's a static asset or a servlet whose behaviour you know, and the same parameters return the same response each time, you can implement LRU or some other caching mechanism (see the sketch below); if not, 1K POST requests to some servlet are still 1K requests, even if they share the same domain.
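A minimal version of that caching idea with functools.lru_cache, assuming identical parameters really do return identical responses (url, get_post_data and parse_page are the question's own names, and param must be hashable):

from functools import lru_cache
import requests

@lru_cache(maxsize=1024)
def get_url_data(param):
    # repeated params are answered from the cache instead of re-sending the POST
    response = requests.post(url, data=get_post_data(param), timeout=10)
    return parse_page(response.content)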
There is an answer using multiprocessing with the ThreadPool interface, which actually uses the main process with 15 threads. Does it run on a 15-core machine? Because a core can only run one thread at a time (hyper-threaded ones aside; does it run on 8 hyper-cores?).
The ThreadPool interface lives inside a library with a misleading name, multiprocessing, while Python also has a threading module, which is confusing as f#ck. Let's benchmark some lower-level code:
import psutil
from multiprocessing.pool import ThreadPool
from time import sleep

def get_url_data(param):
    print(param)  # just for convenience
    sleep(1)  # let's assume it will take one second each time

if __name__ == '__main__':
    paramList = [i for i in range(100)]  # 100 urls
    pool = ThreadPool(psutil.cpu_count())  # each core can run one thread (hyper.. not now)
    pool.map(get_url_data, paramList)  # splitting the jobs
    pool.close()
The code above will use the main process with 4 threads in my case, because my laptop has 4 CPUs. Benchmark result:
$ time python3_5_2 test.py
real 0m28.127s
user 0m0.080s
sys 0m0.012s
Let's try spawning processes with multiprocessing:
import psutil
import multiprocessing
from time import sleep
import numpy

def get_url_data(urls):
    for url in urls:
        print(url)
        sleep(1)  # let's assume it will take one second each time

if __name__ == "__main__":
    jobs = []
    # Split the URLs into chunks, one per CPU
    chunks = numpy.array_split(range(100), psutil.cpu_count())
    # Pass each chunk into a process
    for url_chunk in chunks:
        jobs.append(multiprocessing.Process(target=get_url_data, args=(url_chunk, )))
    # Start the processes
    for j in jobs:
        j.start()
    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
Benchmark result: about 3 seconds less
$ time python3_5_2 test2.py
real 0m25.208s
user 0m0.260s
sys 0m0.216s
If you execute ps -aux | grep "test.py" you will see 5 processes, because one is the main process which manages the others.
There are some drawbacks:
You did not explain in depth what your code is doing, but if you are doing work which needs to be synchronized, you need to know that multiprocessing is NOT thread-safe.
Spawning extra processes introduces I/O overhead, as data has to be shuffled around between processes.
Assuming the data is restricted to each process, it is possible to gain a significant speedup; be aware of Amdahl's Law.
If you reveal what your code does afterwards (save it into a file? a database? stdout?) it will be easier to give a better answer/direction. A few ideas come to mind, like immutable infrastructure with Bash or Java to handle synchronization, or, if it is a memory-bound issue, an object pool to process the JSON responses... it might even be a job for fault-tolerant Elixir.

Parallelizing loop for downloading data

I'm new to Python. I want to run a simple script in Google App Engine that retrieves many files into an object as quickly as possible. Would parallelization be a smart option, and how would I go about doing it? Thanks in advance for the brainstorming.
import requests
...
theData = []
for q in range(0, len(theURLs)):
    r = requests.get(theURLs[q])
    theData.insert(q, r.text)
In "regular" Python this is pretty simple.
from multiprocessing.pool import ThreadPool
import requests
responses = ThreadPool(10).map(requests.get, urls)
Replace 10 with the number of threads that produces the best results for you.
However, you specified GAE, which has restrictions on spawning threads/processes and its own async approach, which consists of using the async functions from the URL Fetch service, something along these lines (untested):
rpcs = [urlfetch.create_rpc() for url in urls]
for (rpc, url) in zip(rpcs, urls):
    urlfetch.make_fetch_call(rpc, url)
results = [rpc.get_result() for rpc in rpcs]
You will need to add error handling...
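A rough idea of what that error handling could look like, assuming the standard google.appengine.api.urlfetch module (DownloadError is its fetch-failure exception; adapt to whatever failure modes you care about):

from google.appengine.api import urlfetch

results = []
for rpc in rpcs:
    try:
        result = rpc.get_result()
        # treat non-200 responses as missing data instead of crashing the whole batch
        results.append(result.content if result.status_code == 200 else None)
    except urlfetch.DownloadError:
        results.append(None)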
You should make your code more Pythonic by using list comprehensions:
# A list of tuples
theData = [(q,requests.get(theURLs[q]).text) for q in range(0, len(theURLs))]
# ... or ...
# A list of lists
theData = [[q,requests.get(theURLs[q]).text] for q in range(0, len(theURLs))]
If you want to retrieve the files concurrently, use the threading library; this website has some good examples and might be good practice:
http://www.tutorialspoint.com/python/python_multithreading.htm
I seriously doubt it. Parallelization can really only speed up calculations, while the bottleneck here is data transfer.

Problems with Speed during web-crawling (Python)

I would love to improve this program's speed a lot. It reads +- 12,000 pages in 10 minutes. I was wondering if there is something that would help the speed a lot? I hope you guys know some tips. I am supposed to read +- millions of pages... so that would take way too long :( Here is my code:
from eventlet.green import urllib2
import httplib
import time
import eventlet

# Create the URLs in groups of 400 (+- max for eventlet)
def web_CreateURLS():
    print str(str(time.asctime(time.localtime(time.time()))).split(" ")[3])
    for var_indexURLS in xrange(0, 2000000, 400):
        var_URLS = []
        for var_indexCRAWL in xrange(var_indexURLS, var_indexURLS + 400):
            var_URLS.append("http://www.nu.nl")
        web_ScanURLS(var_URLS)

# Return the HTML source per URL
def web_ReturnHTML(url):
    try:
        return [urllib2.urlopen(url[0]).read(), url[1]]
    except urllib2.URLError:
        time.sleep(10)
        print "UrlError"
        web_ReturnHTML(url)

# Analyse the HTML source
def web_ScanURLS(var_URLS):
    pool = eventlet.GreenPool()
    try:
        for var_HTML in pool.imap(web_ReturnHTML, var_URLS):
            # do something etc..
            pass
    except TypeError:
        pass

web_CreateURLS()
I like using greenlets, but I often benefit from using multiple processes spread over lots of systems, or just one single system, letting the OS take care of all the checks and balances of running multiple processes.
Check out ZeroMQ at http://zeromq.org/ for some good examples on how to make a dispatcher with a TON of listeners that do whatever the dispatcher says. Alternatively check out execnet for a method of quickly getting started with executing remote or local tasks in parallel.
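As a taste of the ZeroMQ route, here is a minimal PUSH/PULL sketch where one dispatcher hands URLs to however many workers you start, locally or on other machines (host name and port are placeholders):

# dispatcher.py - pushes URLs to any connected workers
import zmq

context = zmq.Context()
sender = context.socket(zmq.PUSH)
sender.bind("tcp://*:5557")
for url in ["http://www.nu.nl"] * 400:
    sender.send_string(url)

# worker.py - start as many of these as you like
import zmq

context = zmq.Context()
receiver = context.socket(zmq.PULL)
receiver.connect("tcp://dispatcher-host:5557")
while True:
    url = receiver.recv_string()
    # fetch and process the page here, e.g. with requests or eventlet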
I also use http://spread.org/ a lot and have LOTS of systems listening to a common spread daemon; it's a very useful message bus where results can be pooled back and dispatched from a single thread pretty easily.
And then of course there is always redis pub/sub or sync. :)
"Share the load"
