I am trying to send a large number of messages (tens of millions) to Azure using the Python azure.storage.queue library, but it is taking a very long time. The code I am using is below:
from azure.storage.queue import (
    QueueClient,
    BinaryBase64EncodePolicy,
    BinaryBase64DecodePolicy
)

messages = [example list of messages]
connectionString = "example connection string"
queueName = "example-queue-name"
queueClient = QueueClient.from_connection_string(connectionString, queueName)
for message in messages:
    queueClient.send_message(message)
Currently it takes around 3 hours to send about 70,000 messages, which is far too slow given the number of messages that need to be sent.
I have looked through the documentation for a batch option, but none seems to exist: https://learn.microsoft.com/en-us/python/api/azure-storage-queue/azure.storage.queue.queueclient?view=azure-python
I also wondered whether anyone has experience using the asyncio library to speed this process up and could suggest how to use it.
Try this:
from azure.storage.queue import (
    QueueClient,
    BinaryBase64EncodePolicy,
    BinaryBase64DecodePolicy
)
from concurrent.futures import ProcessPoolExecutor
import time

messages = []  # fill in the messages to send

# Split the messages into two halves, one per worker process.
messagesP1 = messages[:len(messages)//2]
messagesP2 = messages[len(messages)//2:]
print(len(messagesP1))
print(len(messagesP2))

connectionString = "<conn str>"
queueName = "<queue name>"
queueClient = QueueClient.from_connection_string(connectionString, queueName)

def pushThread(messages):
    for message in messages:
        queueClient.send_message(message)

def callback_function(future):
    print('Callback with the following result', future.result())

tic = time.perf_counter()

def main():
    with ProcessPoolExecutor(max_workers=2) as executor:
        future = executor.submit(pushThread, messagesP1)
        future.add_done_callback(callback_function)
        future2 = executor.submit(pushThread, messagesP2)
        while True:
            if(future.running()):
                print("Task 1 running")
            if(future2.running()):
                print("Task 2 running")
            if(future.done() and future2.done()):
                print(future.result(), future2.result())
                break
            time.sleep(1)  # poll once per second instead of spinning

if __name__ == '__main__':
    main()
    toc = time.perf_counter()
    print(f"spent {toc - tic:0.4f} seconds")
As you can see, I split the message list into two parts and use two tasks to push data into the queue concurrently. In my test with about 800 messages, pushing them all sequentially took 94 s.
With the approach above, it took 48 s.
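Since the question also asks about asyncio, here is a minimal sketch of the same idea with a fixed number of coroutine workers instead of two processes. The helper name send_all and the workers value are made up for illustration. The usage shown in the comment assumes the async client from azure.storage.queue.aio and has not been measured against a real queue.

```python
import asyncio


async def send_all(send, messages, workers=50):
    """Await send(message) for every message using `workers` concurrent workers."""
    pending = iter(messages)  # shared iterator: each message is taken exactly once

    async def worker():
        # next() on the iterator never awaits, so workers cannot grab the same item.
        for message in pending:
            await send(message)

    await asyncio.gather(*(worker() for _ in range(workers)))


# Hypothetical usage with the SDK's async client (an assumption, not executed here):
#   from azure.storage.queue.aio import QueueClient
#   async with QueueClient.from_connection_string(conn_str, queue_name) as client:
#       await send_all(client.send_message, messages)
```

Because the workers pull from one iterator, only `workers` requests are in flight at any time and the full message list is never turned into millions of tasks at once. The best worker count depends on the service's throttling limits and would need to be found by experiment.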
I've constructed the following little program for getting phone numbers using Google's Places API, but it's pretty slow. When I test with 6 items it takes anywhere from 1.99s to 4.86s, and I'm not sure why the time varies so much. I'm very new to APIs, so I'm not even sure which things can or cannot be sped up, which are left to the web server servicing the API, and which I can change myself.
import requests,json,time
searchTerms = input("input places separated by comma")
start_time = time.time() #timer
searchTerms = searchTerms.split(',')
for i in searchTerms:
    r1 = requests.get('https://maps.googleapis.com/maps/api/place/textsearch/json?query='+ i +'&key=MY_KEY')
    a = r1.json()
    pid = a['results'][0]['place_id']
    r2 = requests.get('https://maps.googleapis.com/maps/api/place/details/json?placeid='+pid+'&key=MY_KEY')
    b = r2.json()
    phone = b['result']['formatted_phone_number']
    name = b['result']['name']
    website = b['result']['website']
    print(phone+' '+name+' '+website)
print("--- %s seconds ---" % (time.time() - start_time))
You may want to send requests in parallel. Python provides the multiprocessing module, which is suitable for tasks like this.
Sample code:
from multiprocessing import Pool

import requests

def get_data(i):
    r1 = requests.get('https://maps.googleapis.com/maps/api/place/textsearch/json?query='+ i +'&key=MY_KEY')
    a = r1.json()
    pid = a['results'][0]['place_id']
    r2 = requests.get('https://maps.googleapis.com/maps/api/place/details/json?placeid='+pid+'&key=MY_KEY')
    b = r2.json()
    phone = b['result']['formatted_phone_number']
    name = b['result']['name']
    website = b['result']['website']
    return ' '.join((phone, name, website))

if __name__ == '__main__':
    terms = input("input places separated by comma").split(",")
    # Five worker processes each handle one search term at a time.
    with Pool(5) as p:
        print(p.map(get_data, terms))
Use sessions to enable persistent HTTP connections (so you don't have to establish a new connection every time)
Docs: Requests Advanced Usage - Session Objects
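A rough sketch of what that could look like for the lookup in the question. The function name lookup_place is made up, and the session argument is assumed to be a requests.Session (anything with a compatible get() works, which keeps the sketch testable offline).

```python
TEXTSEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"
DETAILS_URL = "https://maps.googleapis.com/maps/api/place/details/json"


def lookup_place(session, term, key):
    """Return (phone, name, website) for the first text-search hit.

    Both requests go through the same session, so the second one can
    reuse the connection opened by the first.
    """
    found = session.get(TEXTSEARCH_URL, params={"query": term, "key": key}).json()
    place_id = found["results"][0]["place_id"]
    details = session.get(DETAILS_URL, params={"placeid": place_id, "key": key}).json()
    result = details["result"]
    # .get() avoids a KeyError when a place has no phone number or website.
    return (
        result.get("formatted_phone_number", ""),
        result["name"],
        result.get("website", ""),
    )
```

With the real library you would create one session up front, for example `with requests.Session() as session:`, and call `lookup_place(session, term, "MY_KEY")` for every search term inside that block.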
Most of the time isn't spent computing your request. The time is spent in communication with the server. That is a thing you cannot control.
However, you may be able to speed it along using parallelization. Create a separate thread for each request as a start.
from threading import Thread

def request_search_terms(*args):
    #your logic for a request goes here
    pass

#...
threads = []
for st in searchTerms:
    threads.append(Thread(target=request_search_terms, args=(st,)))
    threads[-1].start()
for t in threads:
    t.join()
Then use a thread pool as the number of requests grows; this avoids the overhead of repeated thread creation.
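For the thread pool step, a minimal sketch using the standard library's concurrent.futures. The helper name run_in_pool and the default of 8 workers are arbitrary choices for illustration, and fetch stands for whatever function performs one request.

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_pool(fetch, search_terms, max_workers=8):
    """Call fetch(term) for every term on a fixed pool of worker threads.

    Results are returned in the same order as search_terms.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, search_terms))
```

The pool creates at most max_workers threads and reuses them for all terms, so a list of thousands of search terms does not start thousands of threads.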
It's a matter of latency between the client and the servers; you can't change that unless you use multiple server locations (so that the server nearest to the client handles the request).
In terms of performance, you can build a multithreaded system that can handle multiple requests at once.
There is no need to do multithreading yourself. grequests provides a quick drop-in replacement for requests.
I have the following python code that fetches data from a remote json file. The processing of the remote json file can sometimes be quick or sometimes take a little while. So I put the please wait print message before the post request. This works fine. However, I find that for the requests that are quick, the please wait is pointless. Is there a way I can display the please wait message if request is taking longer than x seconds?
try:
    print("Please wait")
    r = requests.post(url = "http://localhost/test.php")
    r_data = r.json()
You can do it using multiple threads as follows:
import threading
from time import sleep

import requests

isDone = False  # variable to track the request status

def th():
    sleep(2)  # if the request takes more than 2 seconds
    if not isDone:
        print("Please wait...")

dl_thread = threading.Thread(target=th)  # create a new thread that executes th when it is started
dl_thread.start()  # start the thread
r = requests.post(url="http://localhost/test.php")
isDone = True
r_data = r.json()
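A variation on the same idea, sketched with threading.Timer so that no shared flag is needed. The helper name call_with_notice and its parameters are made up for illustration; task stands for any callable that performs the request.

```python
import threading


def call_with_notice(task, delay=2.0, notify=lambda: print("Please wait...")):
    """Run task(); if it is still running after `delay` seconds, call notify()."""
    timer = threading.Timer(delay, notify)
    timer.start()
    try:
        return task()
    finally:
        # Cancelling has no effect if the timer has already fired.
        timer.cancel()
```

In the question's case the call could look like `r = call_with_notice(lambda: requests.post(url="http://localhost/test.php"))`. If the request returns within the delay, the timer is cancelled and nothing is printed.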
My task is to send 30-100 POST requests to one URL at one precise moment, for example at 13:00:00.550, with an accuracy of several milliseconds.
The requests differ from each other (there are about 10 types), and each type must be sent 5 times.
I have a problem sending HTTP requests quickly. What is the fastest way to send 30-100 POST requests in minimal time?
I tried to use asyncio and httpx.AsyncClient to do it.
Here is the relevant part of the code:
from datetime import datetime
import asyncio
import logging

import httpx

logger = logging.getLogger(__name__)

async def async_post(request_data):
    time_to_sleep = 0.005
    action_time = '13:00:00'
    time_microseconds = 550000
    async with httpx.AsyncClient(cookies=request_data['cookies']) as client:
        # Coarse wait until the target second is reached.
        while True:
            now_time_second = datetime.now().strftime('%H:%M:%S')
            if action_time==now_time_second:
                break
            await asyncio.sleep(0.05)
        # Fine wait until the target microsecond within that second.
        while True:
            # Compare integers; strftime('%f') would return a string.
            now_time_microsecond = datetime.now().microsecond
            if now_time_microsecond >= time_microseconds:
                break
            await asyncio.sleep(0.003)
        for _ in range(5):
            response = await client.post(request_data['url'],
                                         headers = request_data['headers'],
                                         params = request_data['params'],
                                         data = request_data['data'],
                                         timeout = 60)
            logger.info('Time: ' + str(datetime.now().strftime('%H:%M:%S.%f')))
            logger.info('Text: ' + str(response.text))
            logger.info('Response time: ' + str(response.headers['Date']))
            await asyncio.sleep(time_to_sleep)

def main():
    loop = asyncio.get_event_loop()
    loop.run_until_complete(
        asyncio.gather(*[async_post(request_data) for request_data in all_requests_data]))
all_requests_data is a list of all request types.
request_data is a dict that contains the data of one request.
As a result, the time between requests can reach 70-200 ms. That is too much for my purposes.
It is not server lag: I tried another application and could see that the server can answer in a few milliseconds, so the problem is not on the server side.
How can I send the requests faster?
I'm trying to use Python in an async manner in order to speed up my requests to a server. The server has a slow response time (often several seconds, but also sometimes faster than a second), but works well in parallel. I have no access to this server and can't change anything about it. So, I have a big list of URLs (in the code below, pages) which I know beforehand, and want to speed up their loading by making NO_TASKS=5 requests at a time. On the other hand, I don't want to overload the server, so I want a minimum pause between every request of 1 second (i. e. a limit of 1 request per second).
So far I have successfully implemented the semaphore part (five requests at a time) using a Trio queue.
import asks
import time
import trio

NO_TASKS = 5

asks.init('trio')
asks_session = asks.Session()
queue = trio.Queue(NO_TASKS)
next_request_at = 0
results = []

pages = [
    'https://www.yahoo.com/',
    'http://www.cnn.com',
    'http://www.python.org',
    'http://www.jython.org',
    'http://www.pypy.org',
    'http://www.perl.org',
    'http://www.cisco.com',
    'http://www.facebook.com',
    'http://www.twitter.com',
    'http://www.macrumors.com/',
    'http://arstechnica.com/',
    'http://www.reuters.com/',
    'http://abcnews.go.com/',
    'http://www.cnbc.com/',
]

async def async_load_page(url):
    global next_request_at
    sleep = next_request_at
    next_request_at = max(trio.current_time() + 1, next_request_at)
    await trio.sleep_until(sleep)
    next_request_at = max(trio.current_time() + 1, next_request_at)
    print('start loading page {} at {} seconds'.format(url, trio.current_time()))
    req = await asks_session.get(url)
    results.append(req.text)

async def producer(url):
    await queue.put(url)

async def consumer():
    while True:
        if queue.empty():
            print('queue empty')
            return
        url = await queue.get()
        await async_load_page(url)

async def main():
    async with trio.open_nursery() as nursery:
        for page in pages:
            nursery.start_soon(producer, page)
        await trio.sleep(0.2)
        for _ in range(NO_TASKS):
            nursery.start_soon(consumer)

start = time.time()
trio.run(main)
However, I'm missing the implementation of the limiting part, i. e. the implementation of max. 1 request per second. You can see above my attempt to do so (first five lines of async_load_page), but as you can see when you execute the code, this is not working:
start loading page http://www.reuters.com/ at 58097.12261669573 seconds
start loading page http://www.python.org at 58098.12367392373 seconds
start loading page http://www.pypy.org at 58098.12380622773 seconds
start loading page http://www.macrumors.com/ at 58098.12389389973 seconds
start loading page http://www.cisco.com at 58098.12397854373 seconds
start loading page http://arstechnica.com/ at 58098.12405119873 seconds
start loading page http://www.facebook.com at 58099.12458010273 seconds
start loading page http://www.twitter.com at 58099.37738939873 seconds
start loading page http://www.perl.org at 58100.37830828273 seconds
start loading page http://www.cnbc.com/ at 58100.91712723473 seconds
start loading page http://abcnews.go.com/ at 58101.91770178373 seconds
start loading page http://www.jython.org at 58102.91875295573 seconds
start loading page https://www.yahoo.com/ at 58103.91993155273 seconds
start loading page http://www.cnn.com at 58104.48031027673 seconds
queue empty
queue empty
queue empty
queue empty
queue empty
I've spent some time searching for answers but couldn't find any.
One of the ways to achieve your goal would be using a mutex acquired by a worker before sending a request and released in a separate task after some interval:
from typing import Iterator

import asks
import trio

async def fetch_urls(urls: Iterator, responses, n_workers, throttle):
    # Using binary `trio.Semaphore` to be able
    # to release it from a separate task.
    mutex = trio.Semaphore(1)

    async def tick():
        await trio.sleep(throttle)
        mutex.release()

    async def worker():
        # `urls` is a shared iterator, so each url is taken by one worker only.
        for url in urls:
            await mutex.acquire()
            nursery.start_soon(tick)
            response = await asks.get(url)
            responses.append(response)

    async with trio.open_nursery() as nursery:
        for _ in range(n_workers):
            nursery.start_soon(worker)
If a worker gets response sooner than after throttle seconds, it will block on await mutex.acquire(). Otherwise the mutex will be released by the tick and another worker will be able to acquire it.
This is similar to how the leaky bucket algorithm works:
Workers waiting for the mutex are like water in a bucket.
Each tick is like a bucket leaking at a constant rate.
If you add a bit of logging just before sending a request you should get an output similar to this:
0.00169 started
0.001821 n_workers: 5
0.001833 throttle: 1
0.002152 fetching https://httpbin.org/delay/4
1.012 fetching https://httpbin.org/delay/2
2.014 fetching https://httpbin.org/delay/2
3.017 fetching https://httpbin.org/delay/3
4.02 fetching https://httpbin.org/delay/0
5.022 fetching https://httpbin.org/delay/2
6.024 fetching https://httpbin.org/delay/2
7.026 fetching https://httpbin.org/delay/3
8.029 fetching https://httpbin.org/delay/0
9.031 fetching https://httpbin.org/delay/0
10.61 finished
Using trio.current_time() for this is much too complicated IMHO.
The easiest way to do rate limiting is a rate limiter, i.e. a separate task that basically does this:
async def ratelimit(queue, tick, task_status=trio.TASK_STATUS_IGNORED):
    with trio.open_cancel_scope() as scope:
        task_status.started(scope)
        while True:
            # Accept one item, then wait a tick before accepting the next.
            await queue.get()
            await trio.sleep(tick)
Example use:
async with trio.open_nursery() as nursery:
    q = trio.Queue(0) # can use >0 for burst modes
    limiter = await nursery.start(ratelimit, q, 1)
    while whatever:
        await q.put(None) # will return at most once per second
        do_whatever()
    limiter.cancel()
In other words, you start that task with
q = trio.Queue(0)
limiter = await nursery.start(ratelimit, q, 1)
and then you can be sure that at most one call of
await q.put(None)
per second will return, as the zero-length queue acts as a rendezvous point. When you're done, call
limiter.cancel()
to stop the rate limiting task, otherwise your nursery won't exit.
If your use case includes starting sub-tasks which you need to finish before the limiter gets cancelled, the easiest way to do that is to run them in another nursery, i.e. instead of
while whatever:
    await q.put(None) # will return at most once per second
    do_whatever()
limiter.cancel()
you'd use something like
async with trio.open_nursery() as inner_nursery:
    await start_tasks(inner_nursery, q)
limiter.cancel()
which would wait for the tasks to finish before touching the limiter.
NB: You can easily adapt this for "burst" mode, i.e. allow a certain number of requests before the rate limiting kicks in, by simply increasing the queue's length.
Motivation and origin of this solution
Some months have passed since I asked this question. Python has improved since then, so has trio (and my knowledge of them). So I thought it was time for a little update using Python 3.6 with type annotations and trio-0.10 memory channels.
I developed my own improvement of the original version, but after reading @Roman Novatorov's great solution, I adapted it again, and this is the result. Kudos to him for the main structure of the function (and the idea to use httpbin.org for illustration purposes). I chose to use memory channels instead of a mutex so that I could take any token re-release logic out of the worker.
Explanation of solution
I can rephrase the original problem like this:
I want to have a number of workers that start the request independently of each other (thus, they will be realized as asynchronous functions).
There is zero or one token released at any point; any worker starting a request to the server consumes a token, and the next token will not be issued until a minimum time has passed. In my solution, I use trio's memory channels to coordinate between the token issuer and the token consumers (workers).
In case you're not familiar with memory channels and their syntax, you can read about them in the trio docs. I think the logic of async with memory_channel and memory_channel.clone() can be confusing at first.
from typing import List, Iterator

import asks
import trio

asks.init('trio')

links: List[str] = [
    'https://httpbin.org/delay/7',
    'https://httpbin.org/delay/6',
    'https://httpbin.org/delay/4'
] * 3

async def fetch_urls(urls: List[str], number_workers: int, throttle_rate: float):

    async def token_issuer(token_sender: trio.abc.SendChannel, number_tokens: int):
        async with token_sender:
            for _ in range(number_tokens):
                await token_sender.send(None)
                await trio.sleep(1 / throttle_rate)

    async def worker(url_iterator: Iterator, token_receiver: trio.abc.ReceiveChannel):
        async with token_receiver:
            for url in url_iterator:
                await token_receiver.receive()
                print(f'[{round(trio.current_time(), 2)}] Start loading link: {url}')
                response = await asks.get(url)
                # print(f'[{round(trio.current_time(), 2)}] Loaded link: {url}')
                responses.append(response)

    responses = []
    url_iterator = iter(urls)
    token_send_channel, token_receive_channel = trio.open_memory_channel(0)

    async with trio.open_nursery() as nursery:
        async with token_receive_channel:
            nursery.start_soon(token_issuer, token_send_channel.clone(), len(urls))
            for _ in range(number_workers):
                nursery.start_soon(worker, url_iterator, token_receive_channel.clone())

    return responses

responses = trio.run(fetch_urls, links, 5, 1.)
Example of logging output:
As you see, the minimum time between all page requests is one second:
[177878.99] Start loading link: https://httpbin.org/delay/7
[177879.99] Start loading link: https://httpbin.org/delay/6
[177880.99] Start loading link: https://httpbin.org/delay/4
[177881.99] Start loading link: https://httpbin.org/delay/7
[177882.99] Start loading link: https://httpbin.org/delay/6
[177886.20] Start loading link: https://httpbin.org/delay/4
[177887.20] Start loading link: https://httpbin.org/delay/7
[177888.20] Start loading link: https://httpbin.org/delay/6
[177889.44] Start loading link: https://httpbin.org/delay/4
Comments on the solution
As is common for asynchronous code, this solution does not maintain the original order of the requested urls. One way to solve this is to associate an id with each original url, e.g. with a tuple structure, put the responses into a response dictionary keyed by that id, and later read the responses out one after the other into a response list (this saves sorting and has linear complexity).
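The id bookkeeping described above could be sketched like this, independent of trio. The helper names tag_urls and restore_order are hypothetical; in the solution above, the worker would iterate over the tagged pairs and store each response under its id instead of appending to a list.

```python
from typing import Dict, List, Tuple


def tag_urls(urls: List[str]) -> List[Tuple[int, str]]:
    """Pair every url with its position in the original list."""
    return list(enumerate(urls))


def restore_order(responses_by_id: Dict[int, str], count: int) -> List[str]:
    """Rebuild the response list by id: one pass, no sorting."""
    return [responses_by_id[i] for i in range(count)]


# Simulate workers finishing in reverse order.
urls = ['https://httpbin.org/delay/7', 'https://httpbin.org/delay/4']
responses_by_id = {}
for url_id, url in reversed(tag_urls(urls)):
    responses_by_id[url_id] = 'response for ' + url  # stand-in for a real response

ordered = restore_order(responses_by_id, len(urls))
```

Here ordered starts with the response for the first url even though that response was stored last.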
You need to increment next_request_at by 1 every time you come into async_load_page. Try using next_request_at = max(trio.current_time() + 1, next_request_at + 1). Also I think you only need to set it once. You may get into trouble if you're setting it around awaits, where you're giving the opportunity for other tasks to change it before examining it again.
I am trying to get live data in Python 2.7.13 from Poloniex through the push API.
I read many posts (including How to connect to poloniex.com websocket api using a python library) and arrived at the following code:
from autobahn.twisted.wamp import ApplicationSession
from autobahn.twisted.wamp import ApplicationRunner
from twisted.internet.defer import inlineCallbacks
import six

class PoloniexComponent(ApplicationSession):
    def onConnect(self):
        self.join(self.config.realm)

    @inlineCallbacks
    def onJoin(self, details):
        def onTicker(*args):
            print("Ticker event received:", args)

        try:
            yield self.subscribe(onTicker, 'ticker')
        except Exception as e:
            print("Could not subscribe to topic:", e)

def main():
    runner = ApplicationRunner(six.u("wss://api.poloniex.com"), six.u("realm1"))
    runner.run(PoloniexComponent)

if __name__ == "__main__":
    main()
Now, when I run the code, it looks like it's running successfully, but I don't know where I am getting the data. I have two questions:
I would really appreciate it if someone could walk me through the process of subscribing and getting ticker data, which I will then process in Python, from step 0: I am running the program in Spyder on Windows. Am I supposed to activate Crossbar somehow?
How do I close the connection? I simply killed the process with Ctrl+C, and now when I try to run it again, I get the error: ReactorNotRestartable.
I ran into a lot of issues using Poloniex with Python2.7 but finally came to a solution that hopefully helps you.
I found that Poloniex has pulled support for the original WAMP socket endpoint so I would probably stray from this method altogether. Maybe this is the entirety of the answer you need but if not here is an alternate way to get ticker information.
The code that ended up working best for me is actually from the post you linked to above but there was some info regarding currency pair ids I found elsewhere.
import websocket
import thread
import time
import json

def on_message(ws, message):
    print(message)

def on_error(ws, error):
    print(error)

def on_close(ws):
    print("### closed ###")

def on_open(ws):
    print("ONOPEN")
    def run(*args):
        # ws.send(json.dumps({'command':'subscribe','channel':1001}))
        ws.send(json.dumps({'command':'subscribe','channel':1002}))
        # ws.send(json.dumps({'command':'subscribe','channel':1003}))
        # ws.send(json.dumps({'command':'subscribe','channel':'BTC_XMR'}))
        while True:
            time.sleep(1)
        ws.close()
        print("thread terminating...")
    thread.start_new_thread(run, ())

if __name__ == "__main__":
    websocket.enableTrace(True)
    ws = websocket.WebSocketApp("wss://api2.poloniex.com/",
                                on_message = on_message,
                                on_error = on_error,
                                on_close = on_close)
    ws.on_open = on_open
    ws.run_forever()
I commented out the lines that pull data you don't seem to want, but for reference here is some more info from that previous post:
1001 = trollbox (you will get nothing but a heartbeat)
1002 = ticker
1003 = base coin 24h volume stats
1010 = heartbeat
'MARKET_PAIR' = market order books
Now you should get some data that looks something like this:
[121,"2759.99999999","2759.99999999","2758.00000000","0.02184376","12268375.01419869","4495.18724321",0,"2767.80020000","2680.10000000"]]
This is also annoying because the "121" at the beginning is the currency pair id, which is undocumented and also unanswered in the other Stack Overflow question referred to here.
However, if you visit this url: https://poloniex.com/public?command=returnTicker it seems the id is shown as the first field, so you could create your own mapping of id->currency pair or parse the data by the ids you want from this.
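A small sketch of such a mapping. It assumes the returnTicker response is a dict keyed by currency pair whose values contain an "id" field, as described above. The helper names and the sample data are made up for illustration and are not real market values.

```python
def build_id_map(ticker):
    """Map numeric pair id -> currency pair name from a returnTicker-style dict."""
    return {fields["id"]: pair for pair, fields in ticker.items()}


def label_update(update, id_map):
    """Replace the leading numeric id of a ticker update with the pair name.

    Unknown ids are left unchanged.
    """
    pair_id = update[0]
    return [id_map.get(pair_id, pair_id)] + list(update[1:])


# Made-up sample in the assumed shape of the returnTicker response.
sample_ticker = {
    "PAIR_A": {"id": 7, "last": "0.5"},
    "PAIR_B": {"id": 121, "last": "2759.9"},
}
id_map = build_id_map(sample_ticker)
labelled = label_update([121, "2759.99999999", "2758.00000000"], id_map)
```

In this sample, labelled begins with "PAIR_B" instead of 121. With real data you would build id_map once from the returnTicker response and apply label_update to each ticker message received over the websocket.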
Alternatively, something as simple as:
import urllib
import urllib2
import json
ret = urllib2.urlopen(urllib2.Request('https://poloniex.com/public?command=returnTicker'))
print json.loads(ret.read())
will return to you the data that you want, but you'll have to put it in a loop to get constantly updating information. Not sure of your needs once the data is received so I will leave the rest up to you.
Hope this helps!
I made, with the help of other posts, the following code to get the latest data using Python 3.x. I hope this helps you:
#TO SAVE THE HISTORICAL DATA (X MINUTES/HOURS) OF EVERY CRYPTOCURRENCY PAIR IN POLONIEX:
from poloniex import Poloniex
import pandas as pd
from time import time
import os

api = Poloniex(jsonNums=float)

#Obtains the pairs of cryptocurrencies traded in poloniex
pairs = [pair for pair in api.returnTicker()]

i = 0
while i < len(pairs):
    #Available candle periods: 5min(300), 15min(900), 30min(1800), 2hr(7200), 4hr(14400), and 24hr(86400)
    raw = api.returnChartData(pairs[i], period=86400, start=time()-api.YEAR*10)
    df = pd.DataFrame(raw)

    # adjust dates format and set dates as index
    df['date'] = pd.to_datetime(df["date"], unit='s')
    df.set_index('date', inplace=True)

    # Saves the historical data of every pair in a csv file
    path=r'C:\x\y\Desktop\z\folder_name'
    df.to_csv(os.path.join(path,r'%s.csv' % pairs[i]))

    i += 1