I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using:
...
import multiprocessing as mp

def getImages(val):
    # Download images
    try:
        url= # preprocess the url from the input val
        local= #Filename Generation From Global Variables And Rand Stuffs...
        urllib.request.urlretrieve(url,local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url )
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print ("tempw")
It often gets stuck halfway through the list (it prints DONE or CAN'T DOWNLOAD for the half of the list it has processed, but I don't know what is happening with the rest). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance
Ok, I have found an answer.
A possible culprit was that the script got stuck while connecting to or downloading from a URL. So I added a socket timeout to limit the time spent connecting to the server and downloading the image.
And now, the issue no longer bothers me.
Here is my complete code:
...
import multiprocessing as mp
import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):
    # Download images
    try:
        url= # preprocess the url from the input val
        local= #Filename Generation From Global Variables And Rand Stuffs...
        urllib.request.urlretrieve(url,local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url )
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print ("tempw")
Hope this solution helps others who are facing the same issue
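As an alternative to a process-wide default, the timeout can also be passed per request. The following is a minimal sketch rather than the code from this answer: it assumes Python 3, and the helper name download and its parameters are made up for illustration. Note that the timeout applies to each blocking socket operation (connecting, each read), not to the download as a whole.

```python
import http.client
import shutil
import urllib.request


def download(url, dest, timeout=20):
    """Fetch url into the file dest; return True on success, False on failure.

    Hypothetical helper for illustration. timeout is in seconds and is
    applied to each blocking socket operation, not to the whole transfer.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp, \
                open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)
        return True
    except (OSError, ValueError, http.client.HTTPException) as exc:
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError covers malformed URLs.
        print("CAN'T DOWNLOAD - {} ({})".format(url, exc))
        return False
```

A function like this can be handed to pool.map in the same way as getImages. If a transfer fails midway, a partial file may be left at dest, so you may want to delete it in the except branch.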
It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically prevents Python from running more than one thread of Python code at the same time within a single process.
The multiprocessing module actually launches separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all of these instances: each of them tries to lock the I/O process, and the one that succeeds (i.e. comes first) gets you the result, while the others (trying to lock an already locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here: Multiprocessing useless with urllib2?
And more info about the GIL here: What is a global interpreter lock (GIL)?
Related
import wget

with open('downloadhlt.txt') as file:
    urls = file.read()
    for line in urls.split('\n'):
        wget.download(line, 'localfolder')
For some reason the post wouldn't work, so I put the code above.
What I'm trying to do is work through a text file that has ~2 million lines like these:
http://halitereplaybucket.s3.amazonaws.com/1475594084-2235734685.hlt
http://halitereplaybucket.s3.amazonaws.com/1475594100-2251426701.hlt
http://halitereplaybucket.s3.amazonaws.com/1475594119-2270812773.hlt
I want to grab each line and request it so that the files download in groups of more than 10 at a time. Currently, what I have downloads one item at a time, which is very time-consuming.
I tried looking at Ways to read/edit multiple lines in python but the iteration seems to be for editing while mine is for multiple executions of wget.
I have not tried other methods simply because this is the first time I have ever needed to make over 2 million download calls.
This should work fine. I'm a total newbie, so I can't really advise you on the number of threads to start, lol.
These are my 2 cents anyway; I hope it helps somehow.
I tried timing yours and mine over 27 downloads:
(base) MBPdiFrancesco:stack francesco$ python3 old.py
Elapsed Time: 14.542160034179688
(base) MBPdiFrancesco:stack francesco$ python3 new.py
Elapsed Time: 1.9618661403656006
And here is the code. You have to create a "downloads" folder first:
import wget
from multiprocessing.pool import ThreadPool
from time import time as timer

s = timer()
thread_num = 8

def download(url):
    try:
        wget.download(url, 'downloads/')
    except Exception as e:
        print(e)

if __name__ == "__main__":
    with open('downloadhlt.txt') as file:
        # splitlines() avoids an empty trailing "URL" when the file ends with a newline
        urls = file.read().splitlines()
    # Use thread_num rather than repeating the literal 8
    results = ThreadPool(thread_num).imap_unordered(download, urls)
    c = 0
    for i in results:
        c += 1
        print("Downloaded {} file{} so far".format(c, "" if c == 1 else "s"))
    print("Elapsed Time: {} seconds\nDownloaded {} files".format(timer() - s, c))
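One thing the loop above cannot tell you is how many downloads actually failed, because download() swallows the exception and every result is counted. Here is a minimal sketch of one way to separate successes from failures, assuming Python 3. The names run_all, func and max_workers are mine, and func stands in for whatever does the real download (for example a thin wrapper around wget.download that lets exceptions propagate).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(func, items, max_workers=8):
    """Call func(item) for every item on a thread pool.

    Returns (succeeded, failed): succeeded is a list of items, failed is a
    list of (item, exception) pairs. Order follows completion, not input.
    """
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Remember which item each future belongs to.
        futures = {pool.submit(func, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                future.result()  # re-raises whatever func raised
            except Exception as exc:
                failed.append((item, exc))
            else:
                succeeded.append(item)
    return succeeded, failed
```

The failed list can then be retried or written to a file. With a couple of million URLs this creates one future per URL up front, so feeding the list in chunks is probably kinder to memory.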
Can anyone help me with downloading multiple files? After a while, the script stops with an IOError telling me that a connection attempt failed. I tried using time.sleep to sleep for a random number of seconds, but it doesn't help. When I re-run the code, it starts downloading files again. Any solutions?
import urllib
import time
import random

index_list=["index#1","index#2",..."index#n"]
for n in index_list:
    u=urllib.urlopen("url_address"+str(n)+".jpg")
    data=u.read()
    f=open("tm"+str(n)+".jpg","wb")
    f.write(data)
    t=random.uniform(0,1)*10
    print "system sleep time is ", t, " seconds"
    time.sleep(t)
It is very likely that the error is caused by not closing the connection properly (should I call close() after urllib.urlopen()?).
It is also better practice to close files, so you should close f as well.
You could also use Python's with statement.
import urllib
import time
import random

index_list = ["index#1", "index#2", ..."index#n"]
for n in index_list:
    # The str() function call isn't necessary, since it's a list of strings
    u = urllib.urlopen("url_address" + n + ".jpg")
    data = u.read()
    u.close()
    with open("tm" + n + ".jpg", "wb") as f:
        f.write(data)
    t = random.uniform(0, 1) * 10
    print "system sleep time is ", t, " seconds"
    time.sleep(t)
If the problem still occurs and you can't provide further information, you may try urllib.urlretrieve
Maybe you are not closing the connections properly, so the server sees too many open connections? Try to do a u.close() after reading the data in the loop.
I'm trying to download a lot of data using multiple threads from Yahoo Finance. I'm using concurrent.futures.ThreadPoolExecutor to speed things up. Everything goes well until I consume all the available file descriptors (1024 by default).
When urllib.request.urlopen() raises an exception the file descriptor is left open (no matter what timeout for socket I use). Normally this file descriptor is reused if I run stuff only from a single (main) thread so this problem doesn't occur. But when these exceptional urlopen() calls are made from ThreadPoolExecutor threads these file descriptors are left open. The only solution I have come up with so far is to use either processes (ProcessPoolExecutor) which is very cumbersome and inefficient or increase the number of allowed file descriptors to something really big (not all the potential users of my library are going to do this anyway). There must be a smarter way to deal with this problem.
And also I wonder whether this is a bug in Python libraries or am I just doing something wrong...
I'm running Python 3.4.1 on Debian (testing, kernel 3.10-3-amd64).
This is an example code that demonstrates this behaviour:
import concurrent
import concurrent.futures
import urllib.request
import os
import psutil
from time import sleep

def fetchfun(url):
    urllib.request.urlopen(url)

def main():
    print(os.getpid())
    p = psutil.Process(os.getpid())
    print(p.get_num_fds())
    # this url doesn't exist
    test_url = 'http://ichart.finance.yahoo.com/table.csv?s=YHOOxyz' + \
               '&a=00&b=01&c=1900&d=11&e=31&f=2019&g=d'
    with concurrent.futures.ThreadPoolExecutor(1) as executor:
        futures = []
        for i in range(100):
            futures.append(executor.submit(fetchfun, test_url))
        count = 0
        for future in concurrent.futures.as_completed(futures):
            count += 1
            print("{}: {} (ex: {})".format(count, p.get_num_fds(), future.exception()))
    print(os.getpid())
    sleep(60)

if __name__ == "__main__":
    main()
When the HTTPError is raised, it saves a reference to the HTTPResponse object for the request as the fp attribute of the HTTPError. The exception, and with it that reference, is stored on the future in your futures list, which isn't destroyed until your program ends. That means there's a reference to the HTTPResponse being kept alive for your entire program. As long as that reference exists, the socket used in the HTTPResponse stays open. One way you can work around this is by explicitly closing the HTTPResponse when you handle the exception:
with concurrent.futures.ThreadPoolExecutor(1) as executor:
    futures = []
    for i in range(100):
        futures.append(executor.submit(fetchfun, test_url))
    count = 0
    for future in concurrent.futures.as_completed(futures):
        count += 1
        exc = future.exception()
        print("{}: {} (ex: {})".format(count, p.get_num_fds(), exc))
        exc.fp.close()  # Close the HTTPResponse
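In the demo every request fails with an HTTPError, so exc.fp is always there. In real code some futures succeed (exc is None) and some fail with exceptions that carry no response, so a guarded version may be safer. This is only a sketch: close_error_response and count_failures are names I made up, and it relies on nothing beyond the fp attribute described above.

```python
import concurrent.futures
import urllib.error


def close_error_response(exc):
    """Close the response attached to an exception, if there is one.

    HTTPError keeps the server's response in its fp attribute; other
    exceptions (and None) have nothing to close. Returns True if a
    response was closed.
    """
    fp = getattr(exc, "fp", None)
    if fp is None:
        return False
    fp.close()
    return True


def count_failures(futures):
    """Wait for all futures, close any error responses, return the failure count."""
    failures = 0
    for future in concurrent.futures.as_completed(futures):
        exc = future.exception()
        if exc is not None:
            failures += 1
            close_error_response(exc)
    return failures
```

Another option along the same lines is to catch the HTTPError inside fetchfun itself and close it there, so the open response never reaches the futures list.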
Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service ( 321cheese.com ) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct results (which is 121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115 photos, etc, but never gives out the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread

def getName(input):
    l = input.split(".com/")
    m = l[1]
    return m

def parseMessages():
    theFile = open('messages.html', 'r')
    theLines = theFile.readlines()
    theFile.close()
    theNewFile = open('new321.txt','w')
    for z in theLines:
        if "321cheese" in z:
            theNewFile.write(z)
    theNewFile.close()

def downloadImage(inputURL):
    urllib.urlretrieve (inputURL, "./grabNew/" + d)

parseMessages()
f = open('new321.txt', 'r')
lines = f.readlines()
f.close()
g = open('output.txt', 'w')
for x in lines:
    a = x.split("<a href=\"")
    b = a[1].split("\"")
    c = b[0]
    if ".png" in c:
        d = getName(c)
        g.write(c+"\n")
        thread.start_new_thread( downloadImage, (c,) )
        ##downloadImage(c)
g.close()
There are multiple issues in your code.
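The main issue is that the global name d is used from multiple threads: the main loop can rebind d before a thread that was started earlier reads it, so several downloads may end up written to the same filename. To fix it, pass the name explicitly as an argument to downloadImage().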
The easy way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as futures) or multiprocessing.Pool:
#!/usr/bin/env python
import os
import urllib
from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit

download_dir = "grabNew"

def url2filename(url):
    return basename(unquote(urlsplit(url).path).decode('utf-8'))

def download_image(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e

def main():
    pool = Pool(processes=10)
    # get_urls() is not defined here; it should return the image URLs to download
    for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
        pass # do something with the downloaded file or handle an error

if __name__ == "__main__":
    main()
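For Python 3, the same two ideas (derive the filename from the URL and hand it to the worker explicitly instead of sharing a global, and cap the number of concurrent downloads) could look roughly like this. It is a sketch under assumptions: download_all and its fetch parameter are hypothetical names, and fetch defaults to urllib.request.urlretrieve. It uses threads rather than processes, which is usually enough for network-bound work.

```python
import os
import posixpath
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import unquote, urlsplit
from urllib.request import urlretrieve


def url2filename(url):
    """Return the last path component of url, percent-decoded."""
    return posixpath.basename(unquote(urlsplit(url).path))


def download_all(urls, download_dir, fetch=urlretrieve, max_workers=10):
    """Download every URL into download_dir using at most max_workers threads.

    Each task builds its own filename from its own URL, so no state is
    shared between threads. Returns a list of (url, filename, error)
    tuples in input order, where error is None on success.
    """
    def task(url):
        filename = os.path.join(download_dir, url2filename(url))
        try:
            fetch(url, filename)
            return url, filename, None
        except Exception as exc:
            return url, filename, exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(task, urls))
```

Leaving the with block waits for every download to finish, which start_new_thread on its own does not do.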
Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally... threads in Python are FAKE as far as CPU parallelism goes, because the GIL lets only one thread run Python code at a time. Use the multiprocessing module if you want real parallelism. But since the images are probably all on the same server, if you open one hundred connections to it at the same time, its firewall will probably start dropping your connections.
I'm using python boto and threading to download many files from S3 rapidly. I use this several times in my program and it works great. However, there is one time when it doesn't work. In that step, I try to download 3,000 files on a 32 core machine (Amazon EC2 cc2.8xlarge).
The code below actually succeeds in downloading every file (except sometimes there is an httplib.IncompleteRead error that doesn't get fixed by the retries). However, only 10 or so of the 32 threads actually terminate and the program just hangs. Not sure why this is. All the files have been downloaded and all the threads should have exited. They do on other steps when I download fewer files. I've been reduced to downloading all these files with a single thread (which works but is super slow). Any insights would be greatly appreciated!
from boto.ec2.connection import EC2Connection
from boto.s3.connection import S3Connection
from boto.s3.key import Key
from boto.exception import BotoClientError
from socket import error as socket_error
from httplib import IncompleteRead
import multiprocessing
from time import sleep
import os
import Queue
import threading

def download_to_dir(keys, dir):
    """
    Given a list of S3 keys and a local directory filepath,
    downloads the files corresponding to the keys to the local directory.
    Returns a list of filenames.
    """
    filenames = [None for k in keys]

    class DownloadThread(threading.Thread):
        def __init__(self, queue, dir):
            # call to the parent constructor
            threading.Thread.__init__(self)
            # create a connection to S3
            connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
            self.conn = connection
            self.dir = dir
            self.__queue = queue

        def run(self):
            while True:
                key_dict = self.__queue.get()
                print self, key_dict
                if key_dict is None:
                    print "DOWNLOAD THREAD FINISHED"
                    break
                elif key_dict == 'DONE': #last job for last worker
                    print "DOWNLOADING DONE"
                    break
                else: #still work to do!
                    index = key_dict.get('idx')
                    key = key_dict.get('key')
                    bucket_name = key.bucket.name
                    bucket = self.conn.get_bucket(bucket_name)
                    k = Key(bucket) #clone key to use new connection
                    k.key = key.key
                    filename = os.path.join(dir, k.key)
                    #make dirs if don't exist yet
                    try:
                        f_dirname = os.path.dirname(filename)
                        if not os.path.exists(f_dirname):
                            os.makedirs(f_dirname)
                    except OSError: #already written to
                        pass
                    #inspired by: http://code.google.com/p/s3funnel/source/browse/trunk/scripts/s3funnel?r=10
                    RETRIES = 5 #attempt at most 5 times
                    wait = 1
                    for i in xrange(RETRIES):
                        try:
                            k.get_contents_to_filename(filename)
                            break
                        except (IncompleteRead, socket_error, BotoClientError), e:
                            if i == RETRIES-1: #failed final attempt
                                raise Exception('FAILED TO DOWNLOAD %s, %s' % (k, e))
                                break
                            wait *= 2
                            sleep(wait)
                    #put filename in right spot!
                    filenames[index] = filename

    num_cores = multiprocessing.cpu_count()
    q = Queue.Queue(0)
    for i, k in enumerate(keys):
        q.put({'idx': i, 'key':k})
    for i in range(num_cores-1):
        q.put(None) # add end-of-queue markers
    q.put('DONE') #to signal absolute end of job
    #Spin up all the workers
    workers = [DownloadThread(q, dir) for i in range(num_cores)]
    for worker in workers:
        worker.start()
    #Block main thread until completion
    for worker in workers:
        worker.join()
    return filenames
Upgrade to AWS SDK version 1.4.4.0 or newer, or stick to exactly 2 threads. Older versions have a limit of at most 2 simultaneous connections. This means that your code will work well if you launch 2 threads; if you launch 3 or more, you are bound to see incomplete reads and exhausted timeouts.
You will see that while 2 threads can boost your throughput greatly, more than 2 does not change much because your network card is busy all the time anyway.
S3Connection uses httplib.py, and that library is not thread-safe, so ensuring each thread has its own connection is critical. It looks like you are doing that.
Boto already has its own retry mechanism, but you are layering one on top of that to handle certain other errors. I wonder if it would be advisable to create a new S3Connection object inside the except block. It just seems like the underlying HTTP connection could be in an unusual state at that point, and it might be best to start with a fresh connection.
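To make the idea concrete, here is a rough sketch of a retry loop that throws the connection away after every failure. It is not boto code and I haven't tried it against S3: make_connection and action are placeholders (for example, creating an S3Connection and calling get_contents_to_filename on a key bound to it), the function name is made up, and it is written for Python 3.

```python
import time


def retry_with_fresh_connection(make_connection, action, retries=5,
                                retry_on=(OSError,), sleep=time.sleep):
    """Run action(connection), rebuilding the connection after each failure.

    make_connection() returns a new connection object and action(conn)
    does the work. Waits 1, 2, 4, ... seconds between attempts and
    re-raises the last error once `retries` attempts have failed.
    """
    conn = make_connection()
    wait = 1
    for attempt in range(retries):
        try:
            return action(conn)
        except retry_on:
            if attempt == retries - 1:
                raise  # out of attempts: let the caller see the error
            sleep(wait)
            wait *= 2
            # The old connection may be in an unknown state; replace it.
            conn = make_connection()
```

The retry_on tuple would need to list whatever the download can raise, such as the IncompleteRead, socket error and BotoClientError from the question.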
Just a thought.