I am trying to upload 100,000 data points to a web service backend. If I run it one at a time, it will take ~12 hours. The service supports 20 simultaneous API calls. How can I run these POSTs concurrently to speed up the import?
def AddPushTokens():
    import requests
    import csv
    import json

    count = 0
    tokenList = []
    apikey = "12345"
    restkey = "12345"
    URL = "https://api.web.com/1/install/"
    headers = {'content-type': 'application/json', 'Application-Id': apikey, 'REST-API-Key': restkey}

    with open('/Users/name/Desktop/push-new.csv', 'rU') as csvfile:
        deviceTokens = csv.reader(csvfile, delimiter=',')
        for token in deviceTokens:
            deviceToken = token[0].replace("/", "")
            deviceType = "ios"
            pushToken = "pushtoken_" + deviceToken
            payload = {"deviceType": deviceType, "deviceToken": deviceToken, "channels": ["", pushToken]}
            r = requests.post(URL, data=json.dumps(payload), headers=headers)
            count = count + 1
            print "Count: " + str(count)
            print r.content
Edit: I am trying to use concurrent.futures. Where I am confused is how to set this up so it pulls each token from the CSV and passes it to load_url. Also, I want to make sure that it runs the first 20 requests, then picks up at 21 and runs the next set of 20.
import concurrent.futures
import requests

URLS = ['https://api.web.com/1/installations/'] * 20  # 20 copies of the same endpoint
apikey="12345"
restkey="12345"
URL="https://api.web.com/1/installations/"
headers={'content-type': 'application/json','X-web-Application-Id': apikey,'X-web-REST-API-Key':restkey}
with open('/Users/name/Desktop/push-new.csv','rU') as csvfile:
deviceTokens=csv.reader(csvfile, delimiter=',')
for token in deviceTokens:
deviceToken=token[0].replace("/","")
deviceType="ios"
pushToken="pushtoken_"+deviceToken
payload={"deviceType": deviceType,"deviceToken":deviceToken,"channels":["",pushToken]}
r = requests.post(URL, data=json.dumps(payload), headers=headers)
# Retrieve a single page and report the url and contents
def load_url(token):
URL='https://api.web.com/1/installations/'
deviceToken=token[0].replace("/","")
deviceType="ios"
pushToken="pushtoken_"+deviceToken
payload={"deviceType": deviceType,"deviceToken":deviceToken,"channels":["",pushToken]}
r = requests.post(URL, data=json.dumps(payload), headers=headers)
count=count+1
print "Count: " + str(count)
print r.content
# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
# Start the load operations and mark each future with its URL
future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
for future in concurrent.futures.as_completed(future_to_url):
url = future_to_url[future]
try:
data = future.result()
except Exception as exc:
print('%r generated an exception: %s' % (url, exc))
else:
print('%r page is %d bytes' % (url, len(data)))
Edit: Updated based on Comments Below
import concurrent.futures
import requests
import csv
import json

apikey = "ldy0eSCqPz9PsyOLAt35M2b0XrfDZT1NBW69Z7Bw"
restkey = "587XASjEYdQwH2UHruA1yeZfT0oX7uAUJ8kWTmE3"
URL = "https://api.parse.com/1/installations/"
headers = {'content-type': 'application/json', 'X-Parse-Application-Id': apikey, 'X-Parse-REST-API-Key': restkey}

with open('/Users/jgurwin/Desktop/push/push-new.csv', 'rU') as csvfile:
    deviceTokens = csv.reader(csvfile, delimiter=',')
    for device in deviceTokens:
        token = device[0].replace("/", "")

# Retrieve a single page and report the url and contents
def load_url(token):
    count = 0
    deviceType = "ios"
    pushToken = "pushtoken_" + token
    payload = {"deviceType": deviceType, "deviceToken": token, "channels": ["", pushToken]}
    r = requests.post(URL, data=json.dumps(payload), headers=headers)
    count = count + 1
    print "Count: " + str(count)
    print r.content

# We can use a with statement to ensure threads are cleaned up promptly
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    # Start the load operations and mark each future with its URL
    future_to_token = {executor.submit(load_url, token, 60): token for token in deviceTokens}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
The easy way to do this is with threads. The nearly-as-easy way is with gevent or a similar library (and grequests even ties gevent and requests together so you don't have to figure out how to do so). The hard way is building an event loop (or, better, using something like Twisted or Tulip) and multiplexing the requests yourself.
Let's do it the easy way.
You don't want to run 100000 threads at once. Besides the fact that it would take hundreds of GB of stack space, and your CPU would spend more time context-switching than running actual code, the service only supports 20 connections at once. So, you want 20 threads.
So, how do you run 100000 tasks on 20 threads? With a thread pool executor (or a bare thread pool).
The concurrent.futures docs have an example which is almost identical to what you want to do, except doing GETs instead of POSTs and using urllib instead of requests. Just change the load_url function to something like this:
def load_url(token):
    deviceToken = token[0].replace("/", "")
    # … your original code here …
    r = requests.post(URL, data=json.dumps(payload), headers=headers)
    return r.content
… and the example will work as-is.
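To make the wiring concrete, here is a minimal sketch of how it could fit together, reading every token up front and letting a 20-worker pool keep at most 20 POSTs in flight at a time. The URL, header names, placeholder keys, and CSV path are copied from your code above, so adjust them as needed; error handling is deliberately minimal.

import csv
import json
import concurrent.futures  # on Python 2.x: pip install futures
import requests

apikey = "12345"
restkey = "12345"
URL = "https://api.web.com/1/installations/"
headers = {'content-type': 'application/json',
           'X-web-Application-Id': apikey,
           'X-web-REST-API-Key': restkey}

def load_url(token):
    deviceToken = token[0].replace("/", "")
    payload = {"deviceType": "ios",
               "deviceToken": deviceToken,
               "channels": ["", "pushtoken_" + deviceToken]}
    r = requests.post(URL, data=json.dumps(payload), headers=headers)
    return r.content

with open('/Users/name/Desktop/push-new.csv', 'rU') as csvfile:
    tokens = list(csv.reader(csvfile, delimiter=','))  # read every row up front

# 20 worker threads means at most 20 requests are in flight at any moment;
# as soon as one finishes, the next token is picked up automatically.
with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    future_to_token = {executor.submit(load_url, token): token for token in tokens}
    for future in concurrent.futures.as_completed(future_to_token):
        token = future_to_token[future]
        try:
            print(future.result())
        except Exception as exc:
            print('%r generated an exception: %s' % (token, exc))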
Since you're using Python 2.x, you don't have the concurrent.futures module in the stdlib; you'll need the backport, futures.
In Python (at least CPython), only one thread at a time can do any CPU work. If your tasks spend a lot more time downloading over the network (I/O work) than building requests and parsing responses (CPU work), that's not a problem. But if that isn't true, you'll want to use processes instead of threads. Which only requires replacing the ThreadPoolExecutor in the example with a ProcessPoolExecutor.
If you want to do this entirely in the 2.7 stdlib, it's nearly as trivial with the thread and process pools built into the multiprocessing module. See Using a pool of workers and the Process Pools API, then see multiprocessing.dummy if you want to use threads instead of processes.
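For example, a rough sketch of the same upload on 2.7 with multiprocessing.dummy (a thread-backed Pool with the same API as the process Pool), reusing the load_url function from above:

from multiprocessing.dummy import Pool as ThreadPool  # threads, not processes
import csv

with open('/Users/name/Desktop/push-new.csv', 'rU') as csvfile:
    tokens = list(csv.reader(csvfile, delimiter=','))

pool = ThreadPool(20)                 # 20 worker threads
results = pool.map(load_url, tokens)  # blocks until every token has been POSTed
pool.close()
pool.join()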
Could be overkill, but you may like to have a look at Celery.
Tutorial
tasks.py could be:
from celery import Celery
import requests

app = Celery('tasks', broker='amqp://guest@localhost//')

apikey = "12345"
restkey = "12345"
URL = "https://api.web.com/1/install/"
headers = {'content-type': 'application/json', 'Application-Id': apikey, 'REST-API-Key': restkey}

f = open('upload_data.log', 'a+')

@app.task
def upload_data(data, count):
    r = requests.post(URL, data=data, headers=headers)
    f.write("Count: %d\n%s\n\n" % (count, r.content))
Start the Celery worker with:
$ celery -A tasks worker --loglevel=info -c 20
Then in another script:
import tasks

def AddPushTokens():
    import csv
    import json

    count = 0
    tokenList = []

    with open('/Users/name/Desktop/push-new.csv', 'rU') as csvfile:
        deviceTokens = csv.reader(csvfile, delimiter=',')
        for token in deviceTokens:
            deviceToken = token[0].replace("/", "")
            deviceType = "ios"
            pushToken = "pushtoken_" + deviceToken
            payload = {"deviceType": deviceType, "deviceToken": deviceToken, "channels": ["", pushToken]}
            r = tasks.upload_data.delay(json.dumps(payload), count)
            count = count + 1
Note: the above code is a sample; you may have to modify it for your requirements.
Related
I am trying to open multiple web sessions and save the data into CSV. I have written my code using a for loop and requests.get, but it takes too long to access 90 web locations. Can anyone let me know how to run the whole process in parallel for loc_var?
The code works fine; the only issue is that it runs loc_var one by one, which takes a long time.
I want to access all the loc_var URLs in the for loop in parallel and perform the CSV write operations.
Below is the code:
import pandas as pd
import numpy as np
import os
import requests
import datetime
import zipfile

t = datetime.date.today() - datetime.timedelta(2)
server = [("A", "web1", ":5000", "username=usr&password=p7Tdfr")]

'''List of all web_ips'''
web_1 = ["Web1", "Web2", "Web3", "Web4", "Web5", "Web6", "Web7", "Web8", "Web9", "Web10", "Web11", "Web12", "Web13", "Web14", "Web15"]
'''List of all locations'''
loc_var = ["post1", "post2", "post3", "post4", "post5", "post6", "post7", "post8", "post9", "post10", "post11", "post12", "post13", "post14", "post15", "post16", "post17", "post18"]

for s, web, port, usr in server:
    login_url = 'http://' + web + port + '/api/v1/system/login/?' + usr
    print(login_url)
    s = requests.session()
    login_response = s.post(login_url)
    print("login Response", login_response)

    # Start accessing the web for each loc_var
    for mkt in loc_var:
        # output is a CSV file
        com_actions_url = 'http://' + web + port + '/api/v1/3E+date(%5C%22' + str(t) + '%5C%22)and+location+%3D%3D+%27' + mkt + '%27%22&page_size=-1&format=%22csv%22'
        print("com_action_url", com_actions_url)
        r = s.get(com_actions_url)
        print("action", r)
        if r.ok == True:
            with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                f.write(r.content)
        # If loc is not accessible, try with another web from the web_1 list
        if r.ok == False:
            while r.ok == False:
                for web_2 in web_1:
                    login_url = 'http://' + web_2 + port + '/api/v1/system/login/?' + usr
                    com_actions_url = 'http://' + web_2 + port + '/api/v1/3E+date(%5C%22' + str(t) + '%5C%22)and+location+%3D%3D+%27' + mkt + '%27%22&page_size=-1&format=%22csv%22'
                    login_response = s.post(login_url)
                    print("login Response", login_response)
                    print("com_action_url", com_actions_url)
                    r = s.get(com_actions_url)
                    if r.ok == True:
                        with open(os.path.join("/home/Reports_DC/", "relation_%s.csv" % mkt), 'wb') as f:
                            f.write(r.content)
                        break
There are multiple approaches that you can take to make concurrent HTTP requests. Two that I've used are (1) multiple threads with concurrent.futures.ThreadPoolExecutor, or (2) sending the requests asynchronously using asyncio/aiohttp.
To use a thread pool to send your requests in parallel, you would first generate a list of URLs that you want to fetch in parallel (in your case generate a list of login_urls and com_action_urls), and then you would request all of the URLs concurrently as follows:
from concurrent.futures import ThreadPoolExecutor
import requests

def fetch(url):
    page = requests.get(url)
    return page.text
    # Catch HTTP errors/exceptions here

pool = ThreadPoolExecutor(max_workers=5)
urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']  # Create a list of urls

for page in pool.map(fetch, urls):
    # Do whatever you want with the results ...
    print(page[0:100])
Using asyncio/aiohttp is generally faster than the threaded approach above, but the learning curve is more complicated. Here is a simple example (Python 3.7+):
import asyncio
import aiohttp

urls = ['http://www.google.com', 'http://www.yahoo.com', 'http://www.bing.com']

async def fetch(session, url):
    async with session.get(url) as resp:
        return await resp.text()
        # Catch HTTP errors/exceptions here

async def fetch_concurrent(urls):
    loop = asyncio.get_event_loop()
    async with aiohttp.ClientSession() as session:
        tasks = []
        for u in urls:
            tasks.append(loop.create_task(fetch(session, u)))

        for result in asyncio.as_completed(tasks):
            page = await result
            # Do whatever you want with results
            print(page[0:100])

asyncio.run(fetch_concurrent(urls))
But unless you are going to be making a huge number of requests, the threaded approach will likely be sufficient (and way easier to implement).
Problem: Check a listing of over 1000 urls and get the url return code (status_code).
The script I have works, but it is very slow.
I am thinking there has to be a better, more Pythonic (more beautiful) way of doing this, where I can spawn 10 or 20 threads to check the urls and collect the responses.
(e.g.:
200 -> www.yahoo.com
404 -> www.badurl.com
...)
Input file: url10.txt
www.example.com
www.yahoo.com
www.testsite.com
....
import requests

with open("url10.txt") as f:
    urls = f.read().splitlines()

print(urls)

for url in urls:
    url = 'http://' + url  # Add http:// to each url (there has to be a better way to do this)
    try:
        resp = requests.get(url, timeout=1)
        print(len(resp.content), '->', resp.status_code, '->', resp.url)
    except Exception as e:
        print("Error", url)
Challenges:
Improve speed with multiprocessing.
With multiprocessing, it is not working.
I get the following error (note: I am not sure if I have even implemented this correctly):
AttributeError: Can't get attribute 'checkurl' on <module '__main__' (built-in)>
--
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(url):
    for url in urls:
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
In this case your task is I/O bound and not processor bound: it takes longer for a website to reply than it does for your CPU to loop once through your script (not including the TCP request). This means that you won't get any speedup from doing this task in multiple processes (which is what multiprocessing does). What you want is multi-threading. The way to achieve this is with the little-documented, perhaps poorly named, multiprocessing.dummy:
import requests
from multiprocessing.dummy import Pool as ThreadPool

urls = ['https://www.python.org',
        'https://www.python.org/about/']

def get_status(url):
    r = requests.get(url)
    return r.status_code

if __name__ == "__main__":
    pool = ThreadPool(4)  # Make the Pool of workers
    results = pool.map(get_status, urls)  # Open the urls in their own threads
    pool.close()  # Close the pool and wait for the work to finish
    pool.join()
See here for examples of multiprocessing vs multithreading in Python.
In the checkurlconnection function, the parameter must be urls, not url.
Otherwise, in the for loop, urls will point to the global variable, which is not what you want.
import requests
from multiprocessing import Pool

with open("url10.txt") as f:
    urls = f.read().splitlines()

def checkurlconnection(urls):
    for url in urls:
        url = 'http://' + url
        try:
            resp = requests.get(url, timeout=1)
            print(len(resp.content), '->', resp.status_code, '->', resp.url)
        except Exception as e:
            print("Error", url)

if __name__ == "__main__":
    p = Pool(processes=4)
    result = p.map(checkurlconnection, urls)
Is there a possible way to speed up my code using the multiprocessing interface? The problem is that this interface uses a map function, which works only with one function, but my code has three functions. I tried to combine my functions into one, but didn't succeed. My script reads a site's URL from a file and performs three functions on it. The for loop makes it very slow, because I have a lot of URLs.
import requests

def Login(url):  # Log in
    payload = {
        'UserName_Text': 'user',
        'UserPW_Password': 'pass',
        'submit_ButtonOK': 'return buttonClick;'
    }
    try:
        p = session.post(url + '/login.jsp', data=payload, timeout=10)
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print "site is DOWN! :", url[8:]
        session.cookies.clear()
        session.close()
    else:
        print 'OK: ', p.url

def Timer(url):  # Measure request time
    try:
        timer = requests.get(url + '/login.jsp').elapsed.total_seconds()
    except (requests.exceptions.ConnectionError):
        print 'Request time: None'
        print '-----------------------------------------------------------------'
    else:
        print 'Request time:', round(timer, 2), 'sec'

def Logout(url):  # Log out
    try:
        logout = requests.get(url + '/logout.jsp', params={'submit_ButtonOK': 'true'}, cookies=session.cookies)
    except (requests.exceptions.ConnectionError):
        pass
    else:
        print 'Logout '  # , logout.url
        print '-----------------------------------------------------------------'
        session.cookies.clear()
        session.close()

for line in open('text.txt').read().splitlines():
    session = requests.session()
    Login(line)
    Timer(line)
    Logout(line)
Yes, you can use multiprocessing.
from multiprocessing import Pool

def f(line):
    session = requests.session()
    Login(session, line)
    Timer(session, line)
    Logout(session, line)

if __name__ == '__main__':
    urls = open('text.txt').read().splitlines()
    p = Pool(5)
    print(p.map(f, urls))
The requests session cannot be global and shared between workers; every worker should use its own session.
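Note that this assumes Login, Timer, and Logout are changed to take the session as their first argument instead of relying on the global one. For example, Login would become roughly this (the body is otherwise the question's original code):

def Login(session, url):  # Log in using the per-worker session passed in by f()
    payload = {
        'UserName_Text': 'user',
        'UserPW_Password': 'pass',
        'submit_ButtonOK': 'return buttonClick;'
    }
    try:
        p = session.post(url + '/login.jsp', data=payload, timeout=10)
    except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
        print "site is DOWN! :", url[8:]
        session.cookies.clear()
        session.close()
    else:
        print 'OK: ', p.url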
You write that you already "tried to combine my functions into one, but didn't succeed". What exactly didn't work?
There are many ways to accomplish your task, but multiprocessing is not needed at that level; it will just add complexity, imho.
Take a look at gevent, greenlets and monkey patching, instead!
Once your code is ready, you can wrap a main function into a gevent loop, and if you applied the monkey patches, the gevent framework will run N jobs concurrently (you can create a jobs pool, set the limits of concurrency, etc.)
This example should help:
#!/usr/bin/python
# Copyright (c) 2009 Denis Bilenko. See LICENSE for details.
"""Spawn multiple workers and wait for them to complete"""
from __future__ import print_function
import sys

urls = ['http://www.google.com', 'http://www.yandex.ru', 'http://www.python.org']

import gevent
from gevent import monkey

# patches stdlib (including socket and ssl modules) to cooperate with other greenlets
monkey.patch_all()

if sys.version_info[0] == 3:
    from urllib.request import urlopen
else:
    from urllib2 import urlopen

def print_head(url):
    print('Starting %s' % url)
    data = urlopen(url).read()
    print('%s: %s bytes: %r' % (url, len(data), data[:50]))

jobs = [gevent.spawn(print_head, url) for url in urls]

gevent.wait(jobs)
You can find more here and in the GitHub repository this example comes from.
P.S.
Greenlets work with requests as well; you don't need to change your code.
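For instance, here is a rough sketch of the asker's loop driven by a gevent pool capped at 20 concurrent jobs. It assumes it lives in the same script as the question's Login/Timer/Logout functions, adapted to take the session as an argument as in the answer above; process_site is just an illustrative wrapper name.

from gevent import monkey
monkey.patch_all()  # patch sockets/ssl before requests is imported

from gevent.pool import Pool
import requests

def process_site(url):
    # hypothetical wrapper around the question's three steps
    session = requests.session()   # one session per job; don't share it between greenlets
    Login(session, url)
    Timer(session, url)
    Logout(session, url)

urls = open('text.txt').read().splitlines()
pool = Pool(20)                  # run at most 20 jobs concurrently
pool.map(process_site, urls)     # blocks until every URL has been processed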
I've been trying to use concurrent.futures in addition to requests in order to send several DIFFERENT direct messages from several DIFFERENT users. The purpose of the app I am designing is to send these direct messages as fast as possible and sending each request individually was taking too long.
The code below is something that I've been working on, but I have found that futures will not run requests stored in an array.
Any suggestions on how to go about doing this would be greatly appreciated.
from concurrent import futures
import requests
from requests_oauthlib import OAuth1
import json
from datetime import datetime

startTime = datetime.now()

URLS = ['https://api.twitter.com/1.1/direct_messages/new.json'] * 1

def get_oauth():
    oauth = OAuth1("xxxxxx",
                   client_secret="zzzxxxx",
                   resource_owner_key="xxxxxxxxxxxxxxxxxx",
                   resource_owner_secret="xxxxxxxxxxxxxxxxxxxx")
    return oauth

oauth = get_oauth()
req = []

def load_url(url, timeout):
    req.append(requests.post(url, data={'screen_name': 'vancephuoc', 'text': 'hello pasdfasasdfdasdfasdffpls 1 2 3 4 5'}, auth=oauth, stream=True, timeout=timeout))
    req.append(requests.post(url, data={'screen_name': 'vancephuoc', 'text': 'hello this is tweetnumber2 1 2 3 4 5 7'}, auth=oauth, stream=True, timeout=timeout))

with futures.ThreadPoolExecutor(max_workers=100) as executor:
    future_to_url = dict((executor.submit(req, url, 60), url)
                         for url in URLS)
    for future in futures.as_completed(future_to_url):
        url = future_to_url[future]
        print("DM SENT IN")
        print(datetime.now() - startTime)
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
It may be worth taking a look at some existing libraries that try to simplify using concurrency with requests.
From: http://docs.python-requests.org/en/latest/user/advanced/#blocking-or-non-blocking
[..] there are lots of projects out there that combine Requests with one of Python’s asynchronicity frameworks. Two excellent examples are grequests and requests-futures.
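For example, with requests-futures the change is small: FuturesSession is a requests.Session backed by a thread pool, and every .post() returns a concurrent.futures.Future. A minimal sketch (the recipient names and message texts are placeholders, and get_oauth() is the helper from the question):

from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=20)  # thread-pool-backed requests.Session
oauth = get_oauth()

url = 'https://api.twitter.com/1.1/direct_messages/new.json'
messages = [('user_one', 'hello 1'), ('user_two', 'hello 2')]  # placeholder recipients/texts

# Each .post() returns immediately with a Future; the requests run on the pool's threads.
futures_list = [session.post(url, data={'screen_name': name, 'text': text}, auth=oauth)
                for name, text in messages]

for f in futures_list:
    response = f.result()        # blocks until that particular request has finished
    print(response.status_code)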
What is the best way to make an asynchronous call appear synchronous? E.g., something like this, but how do I coordinate the calling thread and the async reply thread? In Java I might use a CountDownLatch with a timeout, but I can't find a definitive solution for Python.
def getDataFromAsyncSource():
    asyncService.subscribe(callback=functionToCallbackTo)
    # wait for data
    return dataReturned

def functionToCallbackTo(data):
    dataReturned = data
There is a module you can use
import concurrent.futures
Check this post for sample code and module download link: Concurrent Tasks Execution in Python
You can put executor results into futures and then retrieve them; here is the sample code from http://pypi.python.org:
import concurrent.futures
import urllib.request

URLS = ['http://www.foxnews.com/',
        'http://www.cnn.com/',
        'http://europe.wsj.com/',
        'http://www.bbc.co.uk/',
        'http://some-made-up-domain.com/']

def load_url(url, timeout):
    return urllib.request.urlopen(url, timeout=timeout).read()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = dict((executor.submit(load_url, url, 60), url)
                         for url in URLS)
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        if future.exception() is not None:
            print('%r generated an exception: %s' % (url, future.exception()))
        else:
            print('%r page is %d bytes' % (url, len(future.result())))
A common solution is to use a synchronized Queue and pass it to the callback function. See http://docs.python.org/library/queue.html.
So for your example this could look like (I'm just guessing the API to pass additional arguments to the callback function):
from Queue import Queue

def async_call():
    q = Queue()
    asyncService.subscribe(callback=callback, args=(q,))
    data = q.get()
    return data

def callback(data, q):
    q.put(data)
This is a solution using the threading module internally so it might not work depending on your async library.
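If you also want the timeout behaviour of Java's CountDownLatch, Queue.get accepts a timeout and raises Queue.Empty when it expires. A small sketch of that variant, still using the hypothetical asyncService API from the question:

from Queue import Queue, Empty

def async_call(timeout=5.0):
    q = Queue()
    asyncService.subscribe(callback=callback, args=(q,))
    try:
        return q.get(timeout=timeout)  # block for at most `timeout` seconds
    except Empty:
        return None                    # no data arrived in time

def callback(data, q):
    q.put(data)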