I am downloading pictures from the internet, and as it turns out, I need to download lots of pictures. I am using a version of the following code fragment (in practice, looping through the links I intend to download and downloading each picture):
import urllib
urllib.urlretrieve(link, filename)
I am downloading roughly 1000 pictures every 15 minutes, which is awfully slow given the number of pictures I need to download.
For efficiency, I set a timeout of 5 seconds (even so, many downloads take much longer):
import socket
socket.setdefaulttimeout(5)
Besides running a job on a computer cluster to parallelize downloads, is there a way to make the picture download faster / more efficient?
My code above was very naive, as I did not take advantage of multi-threading. It obviously takes time for URL requests to be answered, but there is no reason why the computer cannot make further requests while the proxy server responds.
By making the following adjustments, you can improve efficiency by 10x, and there are further ways of improving it, with packages such as scrapy.
To add multi-threading, do something like the following, using the multiprocessing package:
1) Encapsulate the URL retrieval in a function:
import urllib.request

def geturl(link, i):
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass
2) Then create a collection with all the URLs, as well as the names you want for the downloaded pictures:
urls = [url1,url2,url3,urln]
names = [i for i in range(0,len(urls))]
3) Import the Pool class from the multiprocessing package and create an object of that class (obviously you would put all imports at the top of your code in a real program):
from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(100)
4) Then use the pool.starmap() method and pass it the function and the function's arguments:
results = pool.starmap(geturl, zip(urls, names))
Note: pool.starmap() is available only in Python 3.
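Putting the pieces together, a minimal end-to-end sketch might look like the following (the URL list is a hypothetical placeholder, and the pool size of 100 is just the figure used above):

import urllib.request
from multiprocessing.dummy import Pool as ThreadPool

def geturl(link, i):
    # download one picture; failures are deliberately swallowed
    try:
        urllib.request.urlretrieve(link, str(i) + ".jpg")
    except Exception:
        pass

if __name__ == "__main__":
    # hypothetical links - replace with the ones you loop through
    urls = ["http://example.com/a.jpg", "http://example.com/b.jpg"]
    names = range(len(urls))
    pool = ThreadPool(100)
    pool.starmap(geturl, zip(urls, names))
    pool.close()
    pool.join()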
When a program enters I/O wait, the execution is paused so that the kernel can perform the low-level operations associated with the I/O request (this is called a context switch) and is not resumed until the I/O operation is completed.
Context switching is quite a heavy operation. It requires us to save the state of our program (losing any sort of caching we had at the CPU level) and give up the use of the CPU. Later, when we are allowed to run again, we must spend time reinitializing our program and getting ready to resume (of course, all this happens behind the scenes).
With concurrency, on the other hand, we typically have a thing called an “event loop” running that manages what gets to run in our program, and when. In essence, an event loop is simply a list of functions that need to be run. The function at the top of the list gets run, then the next, etc.
The following shows a simple example of an event loop:
from Queue import Queue
from functools import partial

eventloop = None

class EventLoop(Queue):
    def start(self):
        while True:
            function = self.get()
            function()

def do_hello():
    global eventloop
    print "Hello"
    eventloop.put(do_world)

def do_world():
    global eventloop
    print "world"
    eventloop.put(do_hello)

if __name__ == "__main__":
    eventloop = EventLoop()
    eventloop.put(do_hello)
    eventloop.start()
If the above seems like something you might use, and you'd also like to see how gevent, tornado, and asyncio can help with your issue, then head to your (university) library, check out High Performance Python by Micha Gorelick and Ian Ozsvald, and read pp. 181-202.
Note: above code and text are from the book mentioned.
I have a library that makes calls to smart contracts on the Ethereum chain to read data.
So for simplicity, my code is like this:
import library

items = [
    "address1",
    "address2",
    "address3",
]

def main():
    for item in items:
        data = library.get_smartcontractinfo(item)
        print(data)

if __name__ == '__main__':
    main()
I am new to concurrency, and this is a topic I need to explore further; there are many options for doing concurrency, but asyncio seems to be the one most people go for.
The library I am using is not built with asyncio or any sort of concurrency in mind. This means that each time I call the library.get_smartcontractinfo() function, I need to wait until it completes the query before the next iteration can start, which is what makes it slow.
Let's say that I cannot modify the library (although maybe I will in the future), but I want to get something done ASAP with the existing code.
What would be the easiest way to do simultaneous queries so I can get the info as fast as I can in an efficient way?
What about being rate limited? And would it be possible to group these calls into one without rewriting the library code?
Thank you.
Assuming that library.get_smartcontractinfo() does a lot of network I/O, you could use a ThreadPoolExecutor from concurrent.futures to run more of them in parallel.
The documentation has a good example.
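A minimal sketch of that approach might look like this (assuming the library module and items list from the question; the max_workers value is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

import library  # the blocking library from the question

items = ["address1", "address2", "address3"]

with ThreadPoolExecutor(max_workers=10) as executor:
    # the blocking calls run in worker threads; map() yields
    # results in the same order as the inputs
    for data in executor.map(library.get_smartcontractinfo, items):
        print(data)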
Assuming the function library.get_smartcontractinfo() is I/O bound, you have multiple options with asyncio. If you want to use pure asyncio, you can go with something like:
import asyncio

async def main():
    loop = asyncio.get_running_loop()
    all_runs = [loop.run_in_executor(None, library.get_smartcontractinfo, item) for item in items]
    results = await asyncio.gather(*all_runs)

asyncio.run(main())
Basically, this runs the sync function in a thread. To run the calls concurrently, you first create all the futures without awaiting them, and finally pass them into gather.
If you want to use an additional library, I can recommend anyio or asyncer, which is basically a nice wrapper around anyio. With asyncer, you basically change the one line where you turn a sync function into an async one to:
from asyncer import asyncify
...
all_runs = [asyncify(library.get_smartcontractinfo)(item) for item in items]
the rest stays the same.
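Since the question also asks about rate limiting: neither variant limits concurrency by itself. One common pattern (a sketch of my own, again assuming the library module and items list from the question) is to cap the number of in-flight queries with an asyncio.Semaphore:

import asyncio
from asyncer import asyncify

async def limited_call(sem, item):
    async with sem:
        # asyncify runs the blocking call in a worker thread
        return await asyncify(library.get_smartcontractinfo)(item)

async def main():
    sem = asyncio.Semaphore(5)  # at most 5 queries in flight; tune to your rate limit
    results = await asyncio.gather(*(limited_call(sem, item) for item in items))
    print(results)

asyncio.run(main())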
I have a fairly large python package that interacts synchronously with a third party API server and carries out various operations with the server. Additionally, I am now also starting to collect some of the data for future analysis by pickling the JSON responses. After profiling several serialisation/database methods, using pickle was the fastest in my case. My basic pseudo-code is:
while True:
    do_existing_api_stuff()...

    # additional data pickling
    data = {'info': []}  # there are multiple keys in real version!
    if pickle_file_exists:
        data = unpickle_file()
    data['info'].append(new_data)
    pickle_data(data)
    if len(data['info']) >= 100:  # file size limited for read/write speed
        create_new_pickle_file()

    # intensive section...
    # move files from "wip" (Work In Progress) dir to "complete"
    if number_of_pickle_files >= 100:
        compress_pickle_files()  # with lzma
        move_compressed_files_to_another_dir()
My main issue is that the compressing and moving of the files takes several seconds to complete and is therefore slowing my main loop. What is the easiest way to call these functions in a non-blocking way without any major modifications to my existing code? I do not need any return from the function; however, it will raise an error if anything fails.
Another "nice to have" would be for the pickle.dump() to also be non-blocking. Again, I am not interested in the return beyond "did it raise an error?".
I am aware that unpickle/append/re-pickle on every loop is not particularly efficient; however, it does avoid data loss when the API drops out due to connection issues, server errors, etc.
I have zero knowledge on threading, multiprocessing, asyncio, etc and after much searching, I am currently more confused than I was 2 days ago!
FYI, all of the file related functions are in a separate module/class, so that could be made asynchronous if necessary.
EDIT:
There may be multiple calls to the above functions, so I guess some sort of queuing will be required?
Easiest solution is probably the threading standard library package. This will allow you to spawn a thread to do the compression while your main loop continues.
There is almost certainly quite a bit of 'dead time' in your existing loop waiting for the API to respond and conversely there is quite a bit of time spent doing the compression when you could be usefully making another API call. For this reason I'd suggest separating these two aspects. There are lots of good tutorials on threading so I'll just describe a pattern which you could aim for
1) Keep the API call and the pickling in the main loop, but add a step which passes the file path of each pickle to a queue after it is written.
2) Write a function which takes the queue as its input and works through the file paths, performing the compression.
3) Before starting the main loop, start a thread with the new function as its target (see the sketch below).
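A minimal sketch of that pattern, assuming the lzma compression from the question (the file paths and the inline compress helper are illustrative, not your actual module):

import lzma
import os
import queue
import shutil
import threading

def compress_pickle_file(path):
    # illustrative stand-in for your compression step: path -> path + '.xz'
    with open(path, 'rb') as src, lzma.open(path + '.xz', 'wb') as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)

def compression_worker(work_queue):
    # background thread: blocks until a file path arrives, then compresses it
    while True:
        path = work_queue.get()
        try:
            compress_pickle_file(path)
        finally:
            work_queue.task_done()

work_queue = queue.Queue()
threading.Thread(target=compression_worker, args=(work_queue,), daemon=True).start()

# in your main loop, after writing a pickle file:
#     work_queue.put(path_to_pickle)
# the loop then carries on immediately while the thread compresses in the background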
I've been tasked with learning Twisted.
I am also somewhat new to Python in general, but have used other modern programming languages.
In reading over the Twisted documentation, I keep running into examples that:
- are not complete, executable examples
- run in one thread
Coming from other languages, when I use some asynchronous mechanism, there is usually another thread of execution while I carry out some manner of work, then I am notified when that work is completed, and I react to its results.
I do see that it has some built in asynchronous mechanisms, but none of them provide the user with a means to create custom CPU bound asynchronous tasks akin to 'Tasks' in C# or 'work' with boost::asio in C++ that would run in parallel to the main thread.
I see that Twisted provides a means to asynchronously wait on IO and do things on the same thread while waiting, if we are waiting on:
- network reads and writes
- keyboard input
It also shows me how to:
- do some manner of integration with GUI toolkits to make use of their event loop, but doesn't go into detail
- schedule tasks using the reactor on a timer, but doesn't do that task in parallel to anything else
It talks about async/await, but that is for python 3 only, and I am using python 2.7
I figured that some manner of thread pooling must be built into the reactor, but then when I read about the reactor, it says that everything runs on the main thread in reactor.run().
So, I am left confused.
What is the point of deferreds, creating a callback chain and reacting to the results, if we aren't running anything in parallel?
If we are running asynchronous code, how are we making our own custom asynchronous functions? (see keyword async in C#)
In other languages, I might create an async task to count from 1 to 10, while on the main thread I count from 'a' to 'z' at the same time. When the task is complete, I get notified via a callback on a thread from a thread pool. I'd have the option to sync up, if I wanted to, by calling some 'wait' method. While the definition of "asynchronous" only involves the posting of the task, the getting of the result, and the callback when it's done, I've never seen it used without doing things in parallel.
I'll address your questions (and statements that seem confusing) one-by-one:
"Examples that are not complete"
Restating what I posted in the comments: see my two previous answers for complete examples ( https://stackoverflow.com/a/30399317/3334178 & https://stackoverflow.com/a/23274411/3334178 ) and go through Krondo's Twisted Introduction
You said you are discounting these because "The examples are the network code in twisted, which has the asynchronisity built in and hidden.". I disagree with that assertion and will explain this in the next section.
"Examples are not asynchronous"
When you're talking about "asynchronous programming" in the vein of Python's twisted/tornado/asyncio (or Node.js, or C's select/poll/epoll), you're talking about a model/pattern of programming that allows the programmer to shape their code so that parts of it can run while other parts are blocked (in almost all cases, the blocking is caused by a part of the program having to wait for IO).
These libraries/languages certainly have ways to do threading and/or multiprocessing, but those are layers grafted on top of the async design - and if that's genuinely what you need (i.e. you have an exclusively CPU-bound workload), the async systems are going to be a bad choice.
Let's use your "hidden away" comment to get into this a bit more
"Network examples are asych, but the asynchronousity is built in and hidden away"
The fundamental element of the async design is that you write your code so it should never block for IO - You've been calling out network but really we are talking about network/disk/keyboard/mouse/sound/serial - anything that (for whatever reason) can run slower than the CPU (and that the OS has a file-descriptor for).
Also, there isn't anything really "hidden away" about how it functions - async programming always uses non-blocking (status checking / call-back) calls for any IO channel it can operate on. If you dig enough in the twisted codebase all the async logic is in plain sight (Krondo's tutorial is really good for giving examples of this)
Let me use the keyboard as an example.
In sync code, you would use an input or a read - and the program would pause waiting for that line (or key) to be typed.
In async code (at least in featureful implementations like twisted) you fetch the file-descriptor for "input" and register it with the OS-level async engine (select, poll, epoll, etc.), along with a call-back function to be called when the file-descriptor changes.
The act of doing that registration, which takes almost no time, LETS YOU run other logic while the keyboard logic waits for the keyboard event to happen (see the stdio.StandardIO(keyboardobj, sys.stdin.fileno()) line near the end of my example code in https://stackoverflow.com/a/30399317/3334178).
"[This] leads me to believe there is some other means to use deferreds with asynchronous"
Deferreds aren't magic. They are just clever lists of function callbacks. There are numerous clever ways they can be chained together, but in the end they are just a tool to help you take advantage of the logic above.
"It also talks about async/await, that is for python 3 only, and I am using python 2.7"
async and await are just the Python 3 way of doing what was done in Python 2 with @defer.inlineCallbacks and yield. These systems are shortcuts that rewire code so that, to the reader, the code looks and acts like sync code, but when it's run the code is morphed into a "register a callback and move on" flow.
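For illustration, a minimal Python 2 sketch of that style (fetch_page here is a hypothetical function that returns a Deferred; it is not part of twisted itself):

from twisted.internet import defer

@defer.inlineCallbacks
def fetch_and_print(fetch_page):
    # yield suspends this function until the Deferred fires,
    # without ever blocking the reactor
    body = yield fetch_page("http://example.com")
    print body
    defer.returnValue(body)  # the inlineCallbacks way of returning a value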
"when I read about the reactor, it says that everything runs on the main thread in reactor.run()"
Yes, because (as above) async is about not waiting for IO - it's not about threading or multiprocessing.
Your last few questions "point of deferreds" and "how do you make asynchronous" feel like I answered them above - but if not, let me know in the comments, and I'll spell them out.
Also, your comment requesting "an example where we count from 1 to 10 in some deferred function while we count from a to z in the main thread" doesn't make sense when talking about async (both because you talk about a "thread", which is a different construct, and because those are both (likely) CPU tasks), but I will give you a different example that counts while watching for keyboard input (which is something that definitely DOES make sense when talking about async):
#!/usr/bin/env python
#
# Frankenstein-esk amalgam of example code
# Key of which comes from the Twisted "Chat" example
# (such as: http://twistedmatrix.com/documents/12.0.0/core/examples/chatserver.py)

import sys                          # so I can get at stdin
import os                           # for isatty
import termios, tty                 # access to posix IO settings
from twisted.internet import reactor
from twisted.internet import stdio  # the stdio equiv of listenXXX
from twisted.protocols import basic # for lineReceiver for keyboard
from twisted.internet import task

class counter(object):
    runs = 0

def runEverySecond():
    counter.runs += 1
    print "async counting demo: " + str(counter.runs)

# to set keyboard into cbreak mode - so keys don't require a CR before causing an event
class Cbreaktty(object):
    org_termio = None
    my_termio = None

    def __init__(self, ttyfd):
        if(os.isatty(ttyfd)):
            self.org_termio = (ttyfd, termios.tcgetattr(ttyfd))
            tty.setcbreak(ttyfd)
            print ' Set cbreak mode'
            self.my_termio = (ttyfd, termios.tcgetattr(ttyfd))
        else:
            raise IOError # Not something I can set cbreak on!

    def retToOrgState(self):
        (ttyfd, org) = self.org_termio
        print ' Restoring terminal settings'
        termios.tcsetattr(ttyfd, termios.TCSANOW, org)

class KeyEater(basic.LineReceiver):
    def __init__(self):
        self.setRawMode() # Switch from line mode to "however much I got" mode

    def rawDataReceived(self, data):
        key = str(data).lower()[0]
        if key == 'q':
            reactor.stop()
        else:
            print "--------------"
            print "Press:"
            print "  q - to cleanly shutdown"
            print "---------------"

# Custom tailored example for SO:56013998
#
# This code is a mishmash of styles and techniques. Both to provide different examples of how
# something can be done and because I'm lazy. It's been built and tested on OSX and linux,
# it should be portable (other than perhaps terminal cbreak mode). If you want to ask
# questions about this code contact me directly via mail to mike at partialmesh.com
#
# Once running press any key in the window where the script was run and it will give
# instructions.

def main():
    try:
        termstate = Cbreaktty(sys.stdin.fileno())
    except IOError:
        sys.stderr.write("Error: " + sys.argv[0] + " only for use on interactive ttys\n")
        sys.exit(1)
    keyboardobj = KeyEater()
    l = task.LoopingCall(runEverySecond)
    l.start(1.0) # call every second
    stdio.StandardIO(keyboardobj, sys.stdin.fileno())
    reactor.run()
    termstate.retToOrgState()

if __name__ == '__main__':
    main()
(I know technically I didn't use a deferred, but I ran out of time, and this case is a bit too simple to really need one (I don't have a chain of callbacks anywhere, which is what deferreds are for).)
I've built some websites using Flask before, including one which used websockets, but this time I'm not sure how to begin.
I currently have an endless loop in Python which gets sensor data from a ZeroMQ socket. It roughly looks like this:
import zeromq
socket = zeromq.create_socket()
while True:
    data_dict = socket.receive_json()
    print data_dict  # {'temperature': 34.6, 'speed': 12.8, etc.}
I now want to create a dashboard showing the incoming sensor data in real time in some nice charts. Since it's in Python and I'm familiar with Flask and websockets I would like to use that.
The websites I built before were basic request/reply based ones though. How on earth would I create a Flask website from a continuous loop?
The web page will only be interested in the latest value within a reasonable interval from the user's point of view, say 3 seconds, so you can retrieve values in the background using a separate thread.
This is an example of how to use the threading module to update a latest value in the background:
import threading
import random
import time

_last_value = None

def get_last_value():
    return _last_value

def retrieve_value():
    global _last_value
    while True:
        _last_value = random.randint(1, 100)
        time.sleep(3)

threading.Thread(target=retrieve_value, daemon=True).start()

for i in range(20):
    print(i, get_last_value())
    time.sleep(1)
In your case, it would be something like:
import threading
import zeromq

_socket = zeromq.create_socket()
_last_data_dict = {}

def get_latest_data():
    return _last_data_dict

def retrieve_value():
    global _last_data_dict
    while True:
        _last_data_dict = _socket.receive_json()

threading.Thread(target=retrieve_value, daemon=True).start()
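To serve that from Flask, a minimal sketch might look like this (the route name and port are arbitrary choices of mine; the page could poll this endpoint, or you could push updates over websockets instead):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/latest')
def latest():
    # return the most recent sensor reading as JSON
    return jsonify(get_latest_data())

if __name__ == '__main__':
    app.run(port=5000)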
Basically, what you need is some form of storage two processes can access at the same time.
If you don't want to leave the comfort of a single python executable, you should look into threading:
https://docs.python.org/2/library/thread.html
Otherwise, you could write two different python scripts (one for sensor readout, one for Flask), let one write into a file and the other read from it (or use a pipe on Linux; no idea what Windows offers), and run both processes at the same time, letting your OS handle the "threading".
The second approach has the advantage of your OS taking care of performance, but you lose a lot of freedom in locking and reading the file. There may be some weird behavior if your server reads at the instant your sensor script writes, but I did similar things without problems, and I dimly recall that an OS should take care of consistent file states whenever a file is read or written.
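One common way to sidestep the read-during-write worry (my own sketch, assuming Python 3 for os.replace) is to write to a temporary file and atomically rename it, so the reader only ever sees a complete file:

import json
import os

def write_latest(data_dict, path="latest.json"):
    # write to a temp file first, then atomically swap it into place;
    # os.replace() is atomic on both POSIX and Windows
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(data_dict, f)
    os.replace(tmp_path, path)

def read_latest(path="latest.json"):
    with open(path) as f:
        return json.load(f)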
I have a python program that needs to generate several guids and hand them back with some other data to a client over the network. It may be hit with a lot of requests in a short time period and I would like the latency to be as low as reasonably possible.
Ideally, rather than generating new guids on the fly as the client waits for a response, I would rather be bulk-generating a list of guids in the background that is continually replenished so that I always have pre-generated ones ready to hand out.
I am using the uuid module in Python on Linux. I understand that this is using the uuidd daemon to get uuids. Does uuidd already take care of pre-generating uuids so that it always has some ready? From the documentation, it appears that it does not.
Is there some setting in python or with uuidd to get it to do this automatically? Is there a more elegant approach then manually creating a background thread in my program that maintains a list of uuids?
Are you certain that the uuid module would in fact be too slow to handle the requests you expect in a timely manner? I would be very surprised if UUID generation accounted for a bottleneck in your application.
I would first build the application to simply use the uuid module and then if you find that this module is in fact slowing things down you should investigate a way to keep a pre-generated list of UUIDs around.
I have tested the performance of the uuid module for generating uuids:
>>> import timeit
>>> timer=timeit.Timer('uuid.uuid1()','import uuid')
>>> timer.repeat(3, 10000)
[0.84600019454956055, 0.8469998836517334, 0.84400010108947754]
How many do you need? Is 10000 per second not enough?
Suppose you have a thread to keep topping up a pool of uuid's.
Here is a very simple version
import uuid, threading, time

class UUID_Pool(threading.Thread):
    pool_size = 10000

    def __init__(self):
        super(UUID_Pool, self).__init__()
        self.daemon = True
        self.uuid_pool = set(uuid.uuid1() for x in range(self.pool_size))

    def run(self):
        while True:
            while len(self.uuid_pool) < self.pool_size:
                self.uuid_pool.add(uuid.uuid1())
            time.sleep(0.01)  # top up the pool 100 times/sec

uuid_pool = UUID_Pool()
uuid_pool.start()
get_uuid = uuid_pool.uuid_pool.pop  # make a local binding
new_uuid = get_uuid()  # ~60x faster than uuid.uuid1() on my computer
                       # (named new_uuid so it doesn't shadow the uuid module)
You'd also need to handle the case where your burst empties the pool by using uuid's faster than the thread can generate them.
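A simple sketch of one way to handle that (my addition, not part of the original answer): fall back to direct generation when the pool is momentarily empty, since set.pop() raises KeyError on an empty set:

def safe_get_uuid():
    # pop from the pre-generated pool if possible,
    # otherwise generate one directly
    try:
        return uuid_pool.uuid_pool.pop()
    except KeyError:  # pool emptied by a burst of requests
        return uuid.uuid1()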