Python asyncio read file and execute another activity at intervals - python

I admit to being very lazy: I need to do this fairly quickly and cannot get my head round the Python3 asyncio module. (Funnily, I found the boost one fairly intuitive.)
I need to readline a file object (a pipe) that will block from time to time. During this, I want to be able to fire off another activity at set intervals (say every 30 minutes), regardless of whether there is anything to read from the file.
Can anyone help me with a skeleton to do this using python3 asyncio? (I cannot install a third-party module such as twisted.)

asyncio (as well as other asynchronous libraries like twisted and tornado) doesn't support non-blocking IO for files; only sockets and pipes are processed asynchronously.
The main reason is that Unix systems have no good way to process files asynchronously: on Linux, for example, any read from a regular file is a blocking operation.
See also https://groups.google.com/forum/#!topic/python-tulip/MvpkQeetWZA
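If you do need to read an ordinary on-disk file without stalling the event loop, a standard workaround (not part of this answer, just a common asyncio technique) is to push the blocking read into a thread with loop.run_in_executor. A minimal sketch, where example.txt is a placeholder path:

import asyncio

@asyncio.coroutine
def read_whole_file(loop, path):
    with open(path) as f:
        # the blocking f.read() runs in the default thread pool executor,
        # so it does not stall the event loop
        data = yield from loop.run_in_executor(None, f.read)
    return data

loop = asyncio.get_event_loop()
text = loop.run_until_complete(read_whole_file(loop, "example.txt"))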
Update:
For scheduling a periodic activity I suggest using asyncio.Task:
import asyncio

@asyncio.coroutine
def periodic(reader, delay):
    while True:
        data = yield from reader.readexactly(100)  # read 100 bytes
        yield from asyncio.sleep(delay)

task = asyncio.Task(periodic(reader, 30 * 60))
The snippet assumes reader is an asyncio.StreamReader instance.
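For completeness, here is a rough skeleton for the question as asked (not part of the original answer): it wires the pipe to an asyncio.StreamReader via connect_read_pipe and runs a periodic task alongside the reader. handle_line and do_periodic_activity are placeholders for your own code, and sys.stdin stands in for the pipe being read.

import asyncio
import sys

@asyncio.coroutine
def read_pipe(reader):
    while True:
        line = yield from reader.readline()
        if not line:              # EOF: the writer closed the pipe
            break
        handle_line(line)         # placeholder for your own line processing

@asyncio.coroutine
def periodic(delay):
    while True:
        yield from asyncio.sleep(delay)
        do_periodic_activity()    # placeholder for the every-30-minutes job

loop = asyncio.get_event_loop()
reader = asyncio.StreamReader()
protocol = asyncio.StreamReaderProtocol(reader)
# sys.stdin stands in for the pipe you are reading from
loop.run_until_complete(loop.connect_read_pipe(lambda: protocol, sys.stdin))
asyncio.Task(periodic(30 * 60))
loop.run_until_complete(read_pipe(reader))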

Related

Python 3.5 non blocking functions

I have a fairly large python package that interacts synchronously with a third party API server and carries out various operations with the server. Additionally, I am now also starting to collect some of the data for future analysis by pickling the JSON responses. After profiling several serialisation/database methods, using pickle was the fastest in my case. My basic pseudo-code is:
while True:
    do_existing_api_stuff()...

    # additional data pickling
    data = {'info': []}  # there are multiple keys in real version!
    if pickle_file_exists:
        data = unpickle_file()
    data['info'].append(new_data)
    pickle_data(data)
    if len(data['info']) >= 100:  # file size limited for read/write speed
        create_new_pickle_file()

    # intensive section...
    # move files from "wip" (Work In Progress) dir to "complete"
    if number_of_pickle_files >= 100:
        compress_pickle_files()  # with lzma
        move_compressed_files_to_another_dir()
My main issue is that the compressing and moving of the files takes several seconds to complete and is therefore slowing my main loop. What is the easiest way to call these functions in a non-blocking way without any major modifications to my existing code? I do not need any return from the function; however, it will raise an error if anything fails.
Another "nice to have" would be for the pickle.dump() to also be non-blocking. Again, I am not interested in the return beyond "did it raise an error?".
I am aware that unpickle/append/re-pickle every loop is not particularly efficient; however, it does avoid data loss when the API drops out due to connection issues, server errors, etc.
I have zero knowledge on threading, multiprocessing, asyncio, etc and after much searching, I am currently more confused than I was 2 days ago!
FYI, all of the file related functions are in a separate module/class, so that could be made asynchronous if necessary.
EDIT:
There may be multiple calls to the above functions, so I guess some sort of queuing will be required?
The easiest solution is probably the threading standard library module. This will allow you to spawn a thread to do the compression while your main loop continues.
There is almost certainly quite a bit of 'dead time' in your existing loop waiting for the API to respond, and conversely quite a bit of time is spent doing the compression when you could usefully be making another API call. For this reason I'd suggest separating these two aspects. There are lots of good tutorials on threading, so I'll just describe a pattern you could aim for (a sketch follows the steps below):
1) Keep the API call and the pickling in the main loop, but add a step which passes the file path of each pickle to a queue after it is written.
2) Write a function which takes the queue as its input and works through the file paths, performing the compression.
3) Before starting the main loop, start a thread with the new function as its target.
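A minimal sketch of that pattern, assuming (hypothetically) that pickle_data() can be made to return the path of the file it just wrote, and that compress_and_move(path) stands in for the compression and file-moving functions from the question (batching by 100 files, as in the original loop, can live inside the worker):

import queue
import threading

def compression_worker(path_queue):
    # work through file paths from the queue, compressing and moving each one
    while True:
        path = path_queue.get()
        if path is None:             # sentinel value: shut the worker down
            break
        try:
            compress_and_move(path)  # placeholder for the question's compression/move steps
        finally:
            path_queue.task_done()

path_queue = queue.Queue()
worker = threading.Thread(target=compression_worker, args=(path_queue,), daemon=True)
worker.start()

while True:
    do_existing_api_stuff()          # the existing main loop, unchanged
    path = pickle_data(data)         # assumed to return the path of the pickle it wrote
    path_queue.put(path)             # hand the finished file path to the worker thread

Note that an exception raised inside the worker thread will not propagate to the main loop; if you need to see failures, catch them in the worker and log them or push them onto a second queue.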

speeding up urllib.urlretrieve

I am downloading pictures from the internet, and as it turns out, I need to download lots of pictures. I am using a version of the following code fragment (actually looping through the links I intend to download and downloading the pictures):
import urllib
urllib.urlretrieve(link, filename)
I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow based on the number of pictures I need to download.
For efficiency, I set a timeout of 5 seconds (still, many downloads take much longer):
import socket
socket.setdefaulttimeout(5)
Besides running a job on a computer cluster to parallelize downloads, is there a way to make the picture download faster / more efficient?
My code above was very naive, as I did not take advantage of multi-threading. It obviously takes time for URL requests to be answered, but there is no reason why the computer cannot make further requests while the proxy server responds.
By making the following adjustments, you can improve efficiency by 10x, and there are further ways to improve efficiency with packages such as scrapy.
To add multi-threading, do something like the following, using the multiprocessing package:
1) Encapsulate the URL retrieval in a function:
import urllib.request

def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass  # skip pictures that fail to download
2) Then create a collection with all the URLs, as well as the names you want for the downloaded pictures:
urls = [url1,url2,url3,urln]
names = [i for i in range(0,len(urls))]
3) Import the Pool class from the multiprocessing package and create a Pool object (obviously you would include all imports at the top of your code in a real program):
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)
4) Then use the pool.starmap() method and pass it the function and the function's arguments:
results = pool.starmap(geturl, zip(urls, names))
Note: pool.starmap() is available only in Python 3.
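Putting the pieces together, a complete sketch might look like this (the URLs here are placeholders):

import urllib.request
from multiprocessing.dummy import Pool as ThreadPool

def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass  # skip pictures that fail to download

urls = ["http://example.com/a.jpg", "http://example.com/b.jpg"]  # placeholder links
names = list(range(len(urls)))

pool = ThreadPool(100)
pool.starmap(geturl, zip(urls, names))
pool.close()
pool.join()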
When a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request (this is called a context switch) and is not resumed until the I/O operation is completed.
Context switching is quite a heavy operation. It requires us to save the state of our program (losing any sort of caching we had at the CPU level) and give up the use of the CPU. Later, when we are allowed to run again, we must spend time reinitializing our program on the motherboard and getting ready to resume (of course, all this happens behind the scenes).
With concurrency, on the other hand, we typically have a thing called an “event loop” running that manages what gets to run in our program, and when. In essence, an event loop is simply a list of functions that need to be run. The function at the top of the list gets run, then the next, etc.
The following shows a simple example of an event loop:
from Queue import Queue
from functools import partial

eventloop = None

class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()
            function()

def do_hello():
    global eventloop
    print "Hello"
    eventloop.put(do_world)

def do_world():
    global eventloop
    print "world"
    eventloop.put(do_hello)

if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()
If the above seems like something you might use, and you'd also like to see how gevent, tornado, and asyncio can help with your issue, then head to your (university) library, check out High Performance Python by Micha Gorelick and Ian Ozsvald, and read pp. 181-202.
Note: the above code and text are from the book mentioned.

python 3.4 multiprocessing

This question is asking for advice as well as assistance with some code.
I am currently learning Python with 3.4.
I have built a basic network checking tool. I import items from a text file, and for each of them I want Python to check DNS (using pydns) and ping the IP (using subprocess to call the OS-native ping).
Currently I am checking 5000 to 9000 IP addresses and it is taking a number of hours, approximately 4, to return all the results.
I am wondering if I can use multiprocessing or threading to speed this up, but still return the output to a list so that the rows can be written to a CSV file in bulk at the very end of the script.
I am new to Python, so please also tell me if I have overlooked something I should have considered.
Main code
http://pastebin.com/ZS23XrdE
Class
http://pastebin.com/kh65hYhG
You could use multiple threads to run child processes (ping in your case) and collect their output, but it is not necessary. Here's a code example of how to make multiple HTTP requests using a thread pool, and here's code that uses concurrent.futures to make DNS requests concurrently.
You don't need multiple threads/processes to check 5000-9000 IPs (DNS, ICMP).
You could use gevent, twisted, or asyncio to make the network connections in the same process.
As most of the work seems IO-bound, you can easily rely on threads.
Take a look at the Executor.map() function in concurrent.futures:
https://docs.python.org/3/library/concurrent.futures.html
You can pass it the list of IPs and the function you want to run against each element; the return value is, effectively, the list of results of the given function.
In your specific case you can wrap the two worker methods (check_dns_ip and os_ping) in a single function and pass it to the ThreadPoolExecutor.map function, as sketched below.
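A minimal sketch of that approach, assuming check_dns_ip and os_ping each take an IP string and return some result value:

from concurrent.futures import ThreadPoolExecutor

def check_host(ip):
    # wrap the two worker methods mentioned above into a single callable
    return ip, check_dns_ip(ip), os_ping(ip)

ips = ["10.0.0.1", "10.0.0.2"]   # placeholder: the addresses read from the text file

with ThreadPoolExecutor(max_workers=50) as executor:
    rows = list(executor.map(check_host, ips))

# rows is now a list of (ip, dns_result, ping_result) tuples,
# ready to be written to the CSV file in bulk at the end of the script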

coroutine before Python 3.4

I want to confirm the following questions:
There is no native coroutine decorator in Python 2. It would be provided by something like [1].
Prior to Python 3.4, all other Python 3 releases require pip install asyncio in order to use asyncio.coroutine [2].
trollius is the reference port of asyncio and tulip to Python 2 (and the community thinks that's the one to use)?
Thank you.
I have used Python generators as coroutines, using only built-in methods. I have not used coroutines in any other environment, so my approach may be misinformed.
Here's some starter code I wrote that uses generators to send and receive data in a coroutine capacity, without even using Python 3's yield from syntax:
def sleep(timer, action=None):
    ''' wait for time to elapse past a certain timer '''
    yield  # first yield cannot accept a message
    now = then = yield
    while now - then < timer:
        now = yield
    if action:
        action()
    else:
        yield "timer finished"

def buttonwait():
    ''' yields button presses '''
    yield
    yield
    while True:
        c = screen.getch()
        if c:
            yield c
        yield
Next, the wait function, which manages the coroutines, sending the current time and listening for data:
def wait(processes):
    start = time.time()
    for process in processes:
        process.next()
        process.send(start)
    while True:
        now = time.time()
        for process in processes:
            value = process.send(now)
            if value:
                return value
Last, an example that uses them:
def main():
    processes = []
    processes.append(sleep(5))
    processes.append(buttonwait())
    wait(processes)
I used this on a Raspberry Pi with a 2x16 LCD screen to:
respond to button presses
turn off the backlight after a timeout
scroll long text
move motors
send serial commands
It's a bit complicated to get started, knowing where to put the yields and whatnot, but it seems reasonably functional once it is up and running.
Judging from your question, I think you've conflated two things: co-routines and async I/O-run loops. They do not depend upon each other.
Because in Python one can send values into a generator, you can create code implementing the co-routine pattern. All that is needed is generator.send(). Your link to David Beazley's code is a good example of how to create a co-routine. It is just a pattern. The yield from construct, new in Python 3.3, allows this pattern to be used even more flexibly. David has an excellent talk on all of the things you can do with co-routines in a synchronous computing world.
Python's asyncio module depends upon the yield from construct. The asyncio developers packaged this capability into coroutines and associated them with a run loop. The goal, I believe, is to allow folks to easily build run-loop-oriented applications. There is still a role for the Twisteds and Tornados of the world.
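As a minimal illustration of what that looks like in Python 3.4 (the coroutine below is just a toy example, not from the answer):

import asyncio

@asyncio.coroutine
def greet(name):
    # yield from hands control back to the run loop while we wait
    yield from asyncio.sleep(1)
    return "hello " + name

loop = asyncio.get_event_loop()
print(loop.run_until_complete(greet("world")))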
(I myself am wary of projects like Trollius. They are a kind of "Dancing Bear" project. They miss the point. Asyncio is about bringing a straightforward run loop implementation to Python 3 as a standard service. Python 2 already has two or more excellent async I/O libraries, albeit complex ones. IOW, if you are starting with Python 3 and your needs are straightforward, then use asyncio; otherwise use Tornado. [Twisted isn't ported yet.] If you are starting with Python 2, then Twisted or Tornado is probably where you should start. Trollius? A version whose code is incompatible with Python 3? An excellent "Dancing Bear".)
In my book, asyncio is an excellent reason to move your code to Python 3. We live in an asynchronous world and run loops are too important a feature to be project specific.
Anon,
Andrew

Python Socket and Thread pooling, how to get more performance?

I am trying to implement a basic lib to issue HTTP GET requests. My target is to receive data through socket connections, with a minimalistic design to improve performance, for usage with threads and thread pool(s).
I have a bunch of links which I group by their hostnames, so here's a simple demonstration of input URLs:
hostname1.com - 500 links
hostname2.org - 350 links
hostname3.co.uk - 100 links
...
I intend to use sockets because of performance issues. I intend to use a number of sockets which stay connected (if possible, and usually they do) and issue HTTP GET requests. The idea came from urllib's low performance on continuous requests; then I met urllib3, then I realized it uses httplib, and then I decided to try sockets. So here's what I have accomplished so far:
GETSocket class, SocketPool class, ThreadPool and Worker classes
GETSocket class is a minified, "HTTP GET only" version of Python's httplib.
So, I use these classes like that:
sp = Comm.SocketPool(host, size=self.poolsize, timeout=5)
for link in linklist:
    pool.add_task(self.__get_url_by_sp, self.count, sp, link, results)
    self.count += 1
pool.wait_completion()
pass
The __get_url_by_sp function is a wrapper which calls sp.urlopen and saves the result to the results list. I am using a pool of 5 threads which has a socket pool of 5 GETSocket instances.
What I wonder is, is there any other possible way that I can improve performance of this system?
I've read about asyncore here, but I couldn't figure out how to use the same socket connection with the provided class HTTPClient(asyncore.dispatcher).
Another point: I don't know whether I'm using a blocking or a non-blocking socket, which of the two would be better for performance, or how to implement either one.
Please be specific about your experiences; I don't intend to import another library just to do HTTP GET, so I want to code my own tiny library.
Any help appreciated, thanks.
Do this.
Use multiprocessing. http://docs.python.org/library/multiprocessing.html.
Write a worker Process which puts all of the URLs into a Queue.
Write a worker Process which gets a URL from the Queue and does a GET, saving a file and putting the file information into another Queue. You'll probably want multiple copies of this Process. You'll have to experiment to find how many is the correct number.
Write a worker Process which reads file information from a Queue and does whatever it is that you're trying to do; a rough sketch of the whole pipeline follows.
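A rough sketch of that pipeline; http_get_and_save and process_file are placeholders for the poster's own GETSocket logic and downstream processing, and the URLs are made up:

import multiprocessing

def feeder(url_queue, urls, n_fetchers):
    # put every URL into the queue, followed by one sentinel per fetcher
    for url in urls:
        url_queue.put(url)
    for _ in range(n_fetchers):
        url_queue.put(None)

def fetcher(url_queue, file_queue):
    # get a URL, do the GET, save a file, pass the file info on
    while True:
        url = url_queue.get()
        if url is None:
            file_queue.put(None)
            break
        file_queue.put(http_get_and_save(url))   # placeholder fetch-and-save step

def consumer(file_queue, n_fetchers):
    # read file information and do whatever it is you're trying to do
    finished = 0
    while finished < n_fetchers:
        info = file_queue.get()
        if info is None:
            finished += 1
        else:
            process_file(info)                   # placeholder processing step

if __name__ == "__main__":
    urls = ["http://hostname1.com/page1", "http://hostname2.org/page2"]  # placeholders
    n_fetchers = 4          # experiment to find the correct number
    url_queue = multiprocessing.Queue()
    file_queue = multiprocessing.Queue()

    procs = [multiprocessing.Process(target=feeder, args=(url_queue, urls, n_fetchers))]
    procs += [multiprocessing.Process(target=fetcher, args=(url_queue, file_queue))
              for _ in range(n_fetchers)]
    procs += [multiprocessing.Process(target=consumer, args=(file_queue, n_fetchers))]

    for p in procs:
        p.start()
    for p in procs:
        p.join()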
I finally found a good path to solve my problems. I was using Python 3 for my project and my only option was to use pycurl, so I had to port my project back to the Python 2.7 series.
Using pycurl, I gained:
- Consistent responses to my requests (my script actually has to deal with a minimum of 10k URLs)
- With the use of the ThreadPool class, I am receiving responses as fast as my system can handle them (the received data is processed later, so multiprocessing is not much of a possibility here)
I tried httplib2 first, but realized that it does not behave as solidly on Python 3 as it does on Python 2; by switching to pycurl I lost caching support.
Final conclusion: when it comes to HTTP communication, one may need a tool like (py)curl at one's disposal. It is a lifesaver, especially when dealing with loads of URLs (try it sometime for fun: you will get lots of weird responses from them).
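For reference, a basic GET with pycurl looks roughly like this (a generic sketch with a placeholder URL, not the poster's actual code):

import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "http://example.com/")   # placeholder URL
c.setopt(pycurl.WRITEFUNCTION, buf.write)     # collect the response body
c.setopt(pycurl.TIMEOUT, 5)
c.perform()
status = c.getinfo(pycurl.HTTP_CODE)
c.close()

body = buf.getvalue()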
Thanks for the replies, folks.
