I've been trying to use concurrent.futures together with requests in order to send several different direct messages from several different users. The purpose of the app I am designing is to send these direct messages as fast as possible, and sending each request individually was taking too long.
The code below is something I've been working on, but I have found that futures will not run requests stored in a list.
Any suggestions on how to go about doing this would be greatly appreciated.
from concurrent import futures
import requests
from requests_oauthlib import OAuth1
import json
from datetime import datetime

startTime = datetime.now()

URLS = ['https://api.twitter.com/1.1/direct_messages/new.json'] * 1

def get_oauth():
    oauth = OAuth1("xxxxxx",
                   client_secret="zzzxxxx",
                   resource_owner_key="xxxxxxxxxxxxxxxxxx",
                   resource_owner_secret="xxxxxxxxxxxxxxxxxxxx")
    return oauth

oauth = get_oauth()
req = []

def load_url(url, timeout):
    req.append(requests.post(url, data={'screen_name': 'vancephuoc', 'text': 'hello pasdfasasdfdasdfasdffpls 1 2 3 4 5'}, auth=oauth, stream=True, timeout=timeout))
    req.append(requests.post(url, data={'screen_name': 'vancephuoc', 'text': 'hello this is tweetnumber2 1 2 3 4 5 7'}, auth=oauth, stream=True, timeout=timeout))

with futures.ThreadPoolExecutor(max_workers=100) as executor:
    future_to_url = dict((executor.submit(req, url, 60), url)
                         for url in URLS)
    for future in futures.as_completed(future_to_url):
        url = future_to_url[future]
        print("DM SENT IN")
        print(datetime.now() - startTime)
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
It may be worth taking a look at some existing libraries that try to simplify using concurrency with requests.
From: http://docs.python-requests.org/en/latest/user/advanced/#blocking-or-non-blocking
[..] there are lots of projects out there that combine Requests with one of Python’s asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
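For instance, requests-futures wraps requests in a concurrent.futures thread pool, which is close to what the question attempts. A minimal sketch only: the messages below are placeholders, and oauth is assumed to be the OAuth1 object from get_oauth() above.

from concurrent.futures import as_completed
from requests_futures.sessions import FuturesSession  # pip install requests-futures

session = FuturesSession(max_workers=20)  # thread pool shared by all requests
url = 'https://api.twitter.com/1.1/direct_messages/new.json'
messages = [
    {'screen_name': 'user_one', 'text': 'hello 1'},
    {'screen_name': 'user_two', 'text': 'hello 2'},
]

# each post() returns a Future immediately; the pool sends the requests concurrently
future_requests = [session.post(url, data=msg, auth=oauth) for msg in messages]

for future in as_completed(future_requests):
    response = future.result()  # raises if the request failed
    print(response.status_code)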
Related
I'm working on a project that parses data from a lot of websites. Most of my code is done, so I'm looking forward to using asyncio to eliminate that I/O waiting, but I still wanted to test how threading would perform, better or worse. To do that, I wrote some simple code to make requests to 100 websites. By the way, I'm using the requests_html library for this; fortunately it supports asynchronous requests as well.
The asyncio code looks like:
import requests
import time
from requests_html import AsyncHTMLSession

aio_session = AsyncHTMLSession()
urls = [...] # 100 urls

async def fetch(url):
    try:
        response = await aio_session.get(url, timeout=5)
        status = 200
    except requests.exceptions.ConnectionError:
        status = 404
    except requests.exceptions.ReadTimeout:
        status = 408
    if status == 200:
        return {
            'url': url,
            'status': status,
            'html': response.html
        }
    return {
        'url': url,
        'status': status
    }

def extract_html(urls):
    tasks = []
    for url in urls:
        tasks.append(lambda url=url: fetch(url))
    websites = aio_session.run(*tasks)
    return websites

if __name__ == "__main__":
    start_time = time.time()
    websites = extract_html(urls)
    print(time.time() - start_time)
Execution time (multiple tests):
13.466366291046143
14.279950618743896
12.980706453323364
BUT
If I run an example with threading:
from queue import Queue
import requests
from requests_html import HTMLSession
from threading import Thread
import time

num_fetch_threads = 50
enclosure_queue = Queue()
html_session = HTMLSession()
urls = [...] # 100 urls

def fetch(i, q):
    while True:
        url = q.get()
        try:
            response = html_session.get(url, timeout=5)
            status = 200
        except requests.exceptions.ConnectionError:
            status = 404
        except requests.exceptions.ReadTimeout:
            status = 408
        q.task_done()

if __name__ == "__main__":
    for i in range(num_fetch_threads):
        worker = Thread(target=fetch, args=(i, enclosure_queue,))
        worker.setDaemon(True)
        worker.start()

    start_time = time.time()
    for url in urls:
        enclosure_queue.put(url)

    enclosure_queue.join()
    print(time.time() - start_time)
Execution time (multiple tests):
7.476433515548706
6.786043643951416
6.717151403427124
The thing I don't understand: both libraries are meant to address I/O-bound problems, so why are the threads faster? The more I increase the number of threads, the more resources it uses, but it's also a lot faster. Can someone please explain why threads are faster than asyncio in my example?
Thanks in advance.
It turns out requests-html uses a pool of threads for running the requests. The default number of threads is the number of cores on the machine multiplied by 5. This probably explains the difference in performance you noticed.
You might want to try the experiment again using aiohttp instead. In the case of aiohttp, the underlying socket for the HTTP connection is actually registered in the asyncio event loop, so no threads should be involved here.
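A rough aiohttp version of the experiment above might look like this (a sketch only: the urls list is a placeholder and error handling is reduced to the essentials):

import asyncio
import time
import aiohttp

urls = [...] # 100 urls

async def fetch(session, url):
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=5)) as resp:
            html = await resp.text()
            return {'url': url, 'status': resp.status, 'html': html}
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return {'url': url, 'status': None}

async def main():
    # one ClientSession shared by all requests; every socket lives in the event loop
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, url) for url in urls))

if __name__ == "__main__":
    start_time = time.time()
    websites = asyncio.run(main())
    print(time.time() - start_time)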
Is there a way to speed up my code using the multiprocessing interface? The problem is that this interface uses a map function, which works with only one function, but my code has three functions. I tried to combine my functions into one, but didn't succeed. My script reads a site's URL from a file and performs the three functions on it. The for loop makes it very slow, because I have a lot of URLs.
import requests

def Login(url):  # Log in
    payload = {
        'UserName_Text': 'user',
        'UserPW_Password': 'pass',
        'submit_ButtonOK': 'return buttonClick;'
    }
    try:
        p = session.post(url + '/login.jsp', data=payload, timeout=10)
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print "site is DOWN! :", url[8:]
        session.cookies.clear()
        session.close()
    else:
        print 'OK: ', p.url

def Timer(url):  # Measure request time
    try:
        timer = requests.get(url + '/login.jsp').elapsed.total_seconds()
    except (requests.exceptions.ConnectionError):
        print 'Request time: None'
        print '-----------------------------------------------------------------'
    else:
        print 'Request time:', round(timer, 2), 'sec'

def Logout(url):  # Log out
    try:
        logout = requests.get(url + '/logout.jsp', params={'submit_ButtonOK': 'true'}, cookies=session.cookies)
    except (requests.exceptions.ConnectionError):
        pass
    else:
        print 'Logout '  # , logout.url
        print '-----------------------------------------------------------------'
        session.cookies.clear()
        session.close()

for line in open('text.txt').read().splitlines():
    session = requests.session()
    Login(line)
    Timer(line)
    Logout(line)
Yes, you can use multiprocessing.
from multiprocessing import Pool

def f(line):
    session = requests.session()
    Login(session, line)
    Timer(session, line)
    Logout(session, line)

if __name__ == '__main__':
    urls = open('text.txt').read().splitlines()
    p = Pool(5)
    print(p.map(f, urls))
The requests session cannot be global and shared between workers, every worker should use its own session.
You write that you already "tried to combine my functions into one, but didn't get success". What exactly didn't work?
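For instance, Login would take the session as a parameter instead of relying on a global (Timer and Logout would be changed the same way); a sketch based on the question's code:

import requests

def Login(session, url):  # Log in, using the session created by the worker
    payload = {
        'UserName_Text': 'user',
        'UserPW_Password': 'pass',
        'submit_ButtonOK': 'return buttonClick;'
    }
    try:
        p = session.post(url + '/login.jsp', data=payload, timeout=10)
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print "site is DOWN! :", url[8:]
        session.cookies.clear()
        session.close()
    else:
        print 'OK: ', p.url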
There are many ways to accomplish your task, but multiprocessing is not needed at that level; it will just add complexity, imho.
Take a look at gevent, greenlets and monkey patching, instead!
Once your code is ready, you can wrap a main function into a gevent loop, and if you applied the monkey patches, the gevent framework will run N jobs concurrently (you can create a jobs pool, set the limits of concurrency, etc.)
This example should help:
#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.
"""Spawn multiple workers and wait for them to complete"""
from __future__ import print_function
import sys
urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']
import gevent
from gevent import monkey
# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()
if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib2 import urlopen

def print_head(url):
    print('Starting %s' % url)
    data = urlopen(url).read()
    print('%s: %s bytes: %r' % (url, len(data), data[:50]))

jobs = [gevent.spawn(print_head, url) for url in urls]
gevent.wait(jobs)
You can find more here and in the GitHub repository this example comes from.
P.S.
Greenlets will work with requests as well; you don't need to change your code.
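For example, a jobs pool with a concurrency limit, using requests directly once the monkey patches are applied (a sketch only; the URLs and the limit of 20 are placeholders):

from gevent import monkey
monkey.patch_all()

import requests
from gevent.pool import Pool

def check(url):
    r = requests.get(url, timeout=10)
    print('%s -> %s' % (url, r.status_code))

urls = ['http://www.google.com', 'http://www.python.org']  # placeholders
pool = Pool(20)  # at most 20 greenlets run at the same time
for url in urls:
    pool.spawn(check, url)
pool.join()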
I am trying to upload 100,000 data points to a web service backend. If I run it one at a time, it will take ~12 hours. The service supports 20 API calls simultaneously. How can I run these POSTs concurrently so I can speed up the import?
def AddPushTokens():
    import requests
    import csv
    import json

    count = 0
    tokenList = []
    apikey = "12345"
    restkey = "12345"
    URL = "https://api.web.com/1/install/"
    headers = {'content-type': 'application/json', 'Application-Id': apikey, 'REST-API-Key': restkey}

    with open('/Users/name/Desktop/push-new.csv', 'rU') as csvfile:
        deviceTokens = csv.reader(csvfile, delimiter=',')
        for token in deviceTokens:
            deviceToken = token[0].replace("/", "")
            deviceType = "ios"
            pushToken = "pushtoken_" + deviceToken
            payload = {"deviceType": deviceType, "deviceToken": deviceToken, "channels": ["", pushToken]}
            r = requests.post(URL, data=json.dumps(payload), headers=headers)
            count = count + 1
            print "Count: " + str(count)
            print r.content
Edit: I am trying to use concurrent.futures. Where I am confused is how to set this up so that it pulls the token from the CSV and passes it to load_url. Also, I want to make sure that it runs the requests for the first 20 tokens, then picks up at 21 and runs the next set of 20.
import concurrent.futures
import requests
URLS = ['https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/',
'https://api.web.com/1/installations/']
apikey="12345"
restkey="12345"
URL="https://api.web.com/1/installations/"
headers={'content-type': 'application/json','X-web-Application-Id': apikey,'X-web-REST-API-Key':restkey}
with open('/Users/name/Desktop/push-new.csv','rU') as csvfile:
deviceTokens=csv.reader(csvfile, delimiter=',')
for token in deviceTokens:
deviceToken=token[0].replace("/","")
deviceType="ios"
pushToken="pushtoken_"+deviceToken
payload={"deviceType": deviceType,"deviceToken":deviceToken,"channels":["",pushToken]}
r = requests.post(URL, data=json.dumps(payload), headers=headers)
# Retrieve a single page and report the url and contents
def load_url(token):
URL='https://api.web.com/1/installations/'
deviceToken=token[0].replace("/","")
deviceType="ios"
pushToken="pushtoken_"+deviceToken
payload={"deviceType": deviceType,"deviceToken":deviceToken,"channels":["",pushToken]}
r = requests.post(URL, data=json.dumps(payload), headers=headers)
count=count+1
print "Count: " + str(count)
print r.content
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
# Start the load operations and mark each future with its URL
future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
else:
print('%r page is %d bytes' % (url, len(data)))
Edit: Updated based on Comments Below
import concurrent.futures
import requests
import csv
import json

apikey = "ldy0eSCqPz9PsyOLAt35M2b0XrfDZT1NBW69Z7Bw"
restkey = "587XASjEYdQwH2UHruA1yeZfT0oX7uAUJ8kWTmE3"
URL = "https://api.parse.com/1/installations/"
headers = {'content-type': 'application/json', 'X-Parse-Application-Id': apikey, 'X-Parse-REST-API-Key': restkey}

with open('/Users/jgurwin/Desktop/push/push-new.csv', 'rU') as csvfile:
    deviceTokens = csv.reader(csvfile, delimiter=',')
    for device in deviceTokens:
        token = device[0].replace("/", "")

# Retrieve a single page and report the url and contents
def load_url(token):
    count = 0
    deviceType = "ios"
    pushToken = "pushtoken_" + token
    payload = {"deviceType": deviceType, "deviceToken": token, "channels": ["", pushToken]}
    r = requests.post(URL, data=json.dumps(payload), headers=headers)
    count = count + 1
    print "Count: " + str(count)
    print r.content

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    # Start the load operations and mark each future with its URL
    future_to_token = {executor.submit(load_url, token, 60): token for token in deviceTokens}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
The easy way to do this is with threads. The nearly-as-easy way is with gevent or a similar library (and grequests even ties gevent and requests together so you don't have to figure out how to do so). The hard way is building an event loop (or, better, using something like Twisted or Tulip) and multiplexing the requests yourself.
Let's do it the easy way.
You don't want to run 100000 threads at once. Besides the fact that it would take hundreds of GB of stack space, and your CPU would spend more time context-switching than running actual code, the service only supports 20 connections at once. So, you want 20 threads.
So, how do you run 100000 tasks on 20 threads? With a thread pool executor (or a bare thread pool).
The concurrent.futures docs have an example which is almost identical to what you want to do, except doing GETs instead of POSTs and using urllib instead of requests. Just change the load_url function to something like this:
def load_url(token):
    deviceToken = token[0].replace("/", "")
    # … your original code here …
    r = requests.post(URL, data=json.dumps(payload), headers=headers)
    return r.content
… and the example will work as-is.
Since you're using Python 2.x, you don't have the concurrent.futures module in the stdlib; you'll need the backport, futures.
In Python (at least CPython), only one thread at a time can do any CPU work. If your tasks spend a lot more time downloading over the network (I/O work) than building requests and parsing responses (CPU work), that's not a problem. But if that isn't true, you'll want to use processes instead of threads. Which only requires replacing the ThreadPoolExecutor in the example with a ProcessPoolExecutor.
If you want to do this entirely in the 2.7 stdlib, it's nearly as trivial with the thread and process pools built into multiprocessing. See Using a pool of workers and the Process Pools API, then see multiprocessing.dummy if you want to use threads instead of processes.
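For instance, a minimal sketch with the 2.7 stdlib thread pool (multiprocessing.dummy), assuming the load_url(token) function from above and the CSV path from the question:

import csv
from multiprocessing.dummy import Pool  # thread pool behind the familiar Pool API

with open('/Users/name/Desktop/push-new.csv', 'rU') as csvfile:
    tokens = list(csv.reader(csvfile, delimiter=','))

pool = Pool(20)                       # matches the 20-connection limit
results = pool.map(load_url, tokens)  # blocks until every POST has finished
pool.close()
pool.join()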
It could be overkill, but you may want to have a look at Celery.
Tutorial
tasks.py could be:
from celery import Celery
import requests

app = Celery('tasks', broker='amqp://guest@localhost//')

apikey = "12345"
restkey = "12345"
URL = "https://api.web.com/1/install/"
headers = {'content-type': 'application/json', 'Application-Id': apikey, 'REST-API-Key': restkey}

f = open('upload_data.log', 'a+')

@app.task
def upload_data(data, count):
    r = requests.post(URL, data=data, headers=headers)
    f.write("Count: %d\n%s\n\n" % (count, r.content))
Start the Celery worker with:
$ celery -A tasks worker --loglevel=info -c 20
Then in another script:
import tasks
def AddPushTokens():
import csv
import json
count=0
tokenList=[]
with open('/Users/name/Desktop/push-new.csv','rU') as csvfile:
deviceTokens=csv.reader(csvfile, delimiter=',')
for token in deviceTokens:
deviceToken=token[0].replace("/","")
deviceType="ios"
pushToken="pushtoken_"+deviceToken
payload={"deviceType": deviceType,"deviceToken":deviceToken,"channels":["",pushToken]}
r = tasks.upload_data.delay(json.dumps(payload), count)
count=count+1
NOTE: The above code is a sample. You may have to modify it for your requirements.
I am using gevent to download some html pages.
Some websites are way too slow, and some stop serving requests after a period of time. That is why I had to limit the total time for a group of requests I make. For that I use gevent "Timeout".
timeout = Timeout(10)
timeout.start()

def downloadSite():
    # code to download site's urls one by one
    url1 = downloadUrl()
    url2 = downloadUrl()
    url3 = downloadUrl()

try:
    gevent.spawn(downloadSite).join()
except Timeout:
    print 'Lost state here'
But the problem with this is that I lose all the state when the exception fires.
Imagine I crawl the site 'www.test.com'. I have managed to download 10 urls right before the site admins decided to switch the webserver for maintenance. In such a case I will lose the information about the crawled pages when the exception fires.
The question is: how do I save state and process the data even if a Timeout happens?
Why not try something like:
timeout = Timeout(10)

def downloadSite(url):
    with Timeout(10):
        downloadUrl(url)

urls = ["url1", "url2", "url3"]
workers = []
limit = 5
counter = 0
for i in urls:
    # limit to 5 URL requests at a time
    if counter < limit:
        workers.append(gevent.spawn(downloadSite, i))
        counter += 1
    else:
        gevent.joinall(workers)
        workers = [gevent.spawn(downloadSite, i)]
        counter = 1
gevent.joinall(workers)
You could also save a status in a dict or something for every URL, or append the ones that fail to a different list, to retry later.
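A sketch of that bookkeeping, reusing the downloadUrl placeholder from the question (the results and failed names are only illustrative):

from gevent import Timeout

results = {}  # url -> True (downloaded) / False (timed out)
failed = []   # urls to retry later

def downloadSite(url):
    try:
        with Timeout(10):
            downloadUrl(url)
        results[url] = True
    except Timeout:
        results[url] = False
        failed.append(url)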
A self-contained example:
import gevent
from gevent import monkey
from gevent import Timeout

gevent.monkey.patch_all()
import urllib2

def get_source(url):
    req = urllib2.Request(url)
    data = None
    with Timeout(2):
        response = urllib2.urlopen(req)
        data = response.read()
    return data

N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]

print contents[5]
It implements one timeout for each request. In this example, contents contains 10 times the HTML source of google.com, each retrieved in an independent request. If one of the requests had timed out, the corresponding element in contents would be None. If you have questions about this code, don't hesitate to ask in the comments.
I saw your last comment. Defining one timeout per request is definitely not wrong from a programming point of view. If you need to throttle traffic to the website, then just don't spawn 100 greenlets simultaneously. Spawn 5, and wait until they have returned. Then you can possibly wait for a given amount of time and spawn the next 5 (already shown in the other answer by Gabriel Samfira, as I see now). For my code above, this would mean that you would have to repeatedly call
N = 10
urls = ['http://google.com' for _ in xrange(N)]
getlets = [gevent.spawn(get_source, url) for url in urls]
gevent.joinall(getlets)
contents = [g.get() for g in getlets]
whereas N should not be too high.
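Wrapped in a loop over batches, that repetition could look roughly like this (a sketch reusing get_source from the example above; the batch size and pause are arbitrary):

import gevent

def crawl_in_batches(urls, batch_size=5, pause=1.0):
    contents = []
    for start in xrange(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        getlets = [gevent.spawn(get_source, url) for url in batch]
        gevent.joinall(getlets)          # wait for this batch to finish
        contents.extend(g.get() for g in getlets)
        gevent.sleep(pause)              # optional pause before the next batch
    return contents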
What is the best way to make an asynchronous call appear synchronous? E.g. something like this, but how do I coordinate the calling thread and the async reply thread? In Java I might use a CountDownLatch() with a timeout, but I can't find a definitive solution for Python.
def getDataFromAsyncSource():
    asyncService.subscribe(callback=functionToCallbackTo)
    # wait for data
    return dataReturned

def functionToCallbackTo(data):
    dataReturned = data
There is a module you can use
import concurrent.futures
Check this post for sample code and module download link: Concurrent Tasks Execution in Python
You can put the executor's results into futures and then retrieve them; here is the sample code from http://pypi.python.org:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib.request.urlopen(url, timeout=timeout).read()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = dict((executor.submit(load_url, url, 60), url)
                         for url in URLS)
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        if future.exception() is not None:
            print('%r generated an exception: %s' % (url, future.exception()))
        else:
            print('%r page is %d bytes' % (url, len(future.result())))
A common solution would be the usage of a synchronized Queue and passing it to the callback function. See http://docs.python.org/library/queue.html.
So for your example this could look like (I'm just guessing the API to pass additional arguments to the callback function):
from Queue import Queue

def async_call():
    q = Queue()
    asyncService.subscribe(callback=callback, args=(q,))
    data = q.get()
    return data

def callback(data, q):
    q.put(data)
This is a solution using the threading module internally so it might not work depending on your async library.
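If you also need the timeout behaviour of Java's CountDownLatch, Queue.get accepts one; a sketch along the same lines (again just guessing the subscribe API):

from Queue import Queue, Empty

def async_call(timeout=5.0):
    q = Queue()
    asyncService.subscribe(callback=callback, args=(q,))
    try:
        return q.get(timeout=timeout)  # block for at most `timeout` seconds
    except Empty:
        return None                    # or raise, depending on your needs

def callback(data, q):
    q.put(data)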