Can anyone help me with an issue downloading multiple files? After a while, it stops me with an IOError telling me the connection attempt failed. I tried using time.sleep() to sleep for a random number of seconds, but it doesn't help. And when I re-run the code, it starts to download files again. Any solutions?
import urllib
import time
import random
index_list=["index#1","index#2",..."index#n"]
for n in index_list:
u=urllib.urlopen("url_address"+str(n)+".jpg")
data=u.read()
f=open("tm"+str(n)+".jpg","wb")
f.write(data)
t=random.uniform(0,1)*10
print "system sleep time is ", t, " seconds"
time.sleep(t)
It is very likely that the error is caused by not closing the connections properly: you should call close() on the object returned by urllib.urlopen() after reading the data.
It is also better practice to close the file, so you should close f as well.
You could also use Python's with statement, which closes the file for you:
import urllib
import time
import random
index_list = ["index#1", "index#2", ..."index#n"]
for n in index_list:
# The str() function call isn't necessary, since it's a list of strings
u = urllib.urlopen("url_address" + n + ".jpg")
data = u.read()
u.close()
with open("tm" + n + ".jpg", "wb") as f:
f.write(data)
t = random.uniform(0, 1) * 10
print "system sleep time is ", t, " seconds"
time.sleep(t)
If the problem still occurs and you can't provide further information, you may try urllib.urlretrieve.
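For example, a minimal sketch of the same loop using urlretrieve (Python 2, keeping the question's placeholder URL and filenames):

import urllib

index_list = ["index#1", "index#2"]  # hypothetical indices, as in the question

for n in index_list:
    # urlretrieve opens the connection, saves the file and closes both for you
    urllib.urlretrieve("url_address" + n + ".jpg", "tm" + n + ".jpg")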
Maybe you are not closing the connections properly, so the server sees too many open connections? Try calling u.close() after reading the data in the loop.
import wget

with open('downloadhlt.txt') as file:
    urls = file.read()

for line in urls.split('\n'):
    wget.download(line, 'localfolder')
(For some reason the post wouldn't format properly, so I put the code above.)
What I'm trying to do is work through a text file that has ~2 million lines like these:
http://halitereplaybucket.s3.amazonaws.com/1475594084-2235734685.hlt
http://halitereplaybucket.s3.amazonaws.com/1475594100-2251426701.hlt
http://halitereplaybucket.s3.amazonaws.com/1475594119-2270812773.hlt
I want to grab each line and request the URL, downloading them in groups of more than 10 at a time. Currently what I have downloads one item at a time, which is very time-consuming.
I tried looking at "Ways to read/edit multiple lines in python", but the iteration there seems to be for editing, while mine is for multiple executions of wget.
I have not tried other methods simply because this is the first time I have ever needed to make over 2 million download calls.
This should work fine. I'm a total newbie, so I can't really advise you on the number of threads to start, lol.
These are my 2 cents anyway; hope it somehow helps.
I tried timing yours and mine over 27 downloads:
(base) MBPdiFrancesco:stack francesco$ python3 old.py
Elapsed Time: 14.542160034179688
(base) MBPdiFrancesco:stack francesco$ python3 new.py
Elapsed Time: 1.9618661403656006
And here is the code; you have to create a "downloads" folder first:
import wget
from multiprocessing.pool import ThreadPool
from time import time as timer

s = timer()
thread_num = 8

def download(url):
    try:
        wget.download(url, 'downloads/')
    except Exception as e:
        print(e)

if __name__ == "__main__":
    with open('downloadhlt.txt') as file:
        urls = file.read().split("\n")
    # use thread_num here instead of a hard-coded pool size
    results = ThreadPool(thread_num).imap_unordered(download, urls)
    c = 0
    for i in results:
        c += 1
        print("Downloaded {} file{} so far".format(c, "" if c == 1 else "s"))
    print("Elapsed Time: {} seconds\nDownloaded {} files".format(timer() - s, c))
I am trying to download images from a list of URLs using Python. To make the process faster, I used the multiprocessing library.
The problem I am facing is that the script often hangs/freezes on its own, and I don't know why.
Here is the code that I am using
...
import multiprocessing as mp

def getImages(val):
    # Download images
    try:
        url =  # preprocess the url from the input val
        local =  # filename generation from global variables and rand stuffs...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
It often gets stuck halfway through the list (it prints DONE or CAN'T DOWNLOAD for half of the list it has processed, but I don't know what is happening with the rest). Has anyone faced this problem? I have searched for similar problems (e.g. this link) but found no answer.
Thanks in advance
OK, I have found an answer.
A possible culprit was that the script was getting stuck connecting to or downloading from a URL, so I added a socket timeout to limit the time it can spend connecting and downloading an image.
And now the issue no longer bothers me.
Here is my complete code
...
import multiprocessing as mp
import socket

# Set the default timeout in seconds
timeout = 20
socket.setdefaulttimeout(timeout)

def getImages(val):
    # Download images
    try:
        url =  # preprocess the url from the input val
        local =  # filename generation from global variables and rand stuffs...
        urllib.request.urlretrieve(url, local)
        print("DONE - " + url)
        return 1
    except Exception as e:
        print("CAN'T DOWNLOAD - " + url)
        return 0

if __name__ == '__main__':
    files = "urls.txt"
    lst = list(open(files))
    lst = [l.replace("\n", "") for l in lst]
    pool = mp.Pool(processes=4)
    res = pool.map(getImages, lst)
    print("tempw")
Hope this solution helps others who are facing the same issue
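As an alternative to the process-wide socket default, urllib.request.urlopen takes a per-request timeout argument; here is a minimal sketch (the fetch() helper name is mine, not part of the original code):

import urllib.request

def fetch(url, local, timeout=20):
    # the timeout applies to this request only, not to every socket in the process
    with urllib.request.urlopen(url, timeout=timeout) as resp, open(local, "wb") as f:
        f.write(resp.read())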
It looks like you're facing a GIL issue: the Python Global Interpreter Lock basically forbids Python from doing more than one task at the same time.
The multiprocessing module really launches separate instances of Python to get the work done in parallel.
But in your case, urllib is called in all these instances: each of them tries to lock the IO process; the one that succeeds (e.g. comes first) gets you the result, while the others (trying to lock an already locked process) fail.
This is a very simplified explanation, but here are some additional resources:
You can find another way to parallelize requests here : Multiprocessing useless with urllib2?
And more info about the GIL here : What is a global interpreter lock (GIL)?
I have written a small script to fetch instant stock prices.
# script to get stock data
from __future__ import print_function
import urllib
import lxml.html
from datetime import datetime
import sys
import time

stocks = ["stock1", "stock2", "stock3", "stock4", "stock5"]

while True:
    f = open('./out.txt', 'a+')
    for x in stocks:
        url = "http://someurltofetchdata/" + x
        code = urllib.urlopen(url).read()
        html = lxml.html.fromstring(code)
        result = html.xpath('//td[@class="LastValue"][position() = 1]')
        result = [el.text_content() for el in result]
        f.write(datetime.now().strftime("%Y-%m-%d %H:%M:%S") + ' ' + x + ' ' + result[0])
        f.write("\n")
    f.close()
I want the code to fetch data only during the hours the stock market is open, i.e. trading hours (09:00 to 12:30 and 13:30 to 17:30).
Could you please suggest a way to perform the scheduling inside the code? (Not at the OS level.)
If you cannot use cron (which is the simplest way to accomplish the task), you can add this to your code. It will download data if the current time is within the given ranges, sleep for 60 seconds, and then run again.
while True:
    now = datetime.now().strftime('%H%M')
    if '0900' <= now <= '1230' or '1330' <= now <= '1730':
        # your code, starting with f = open('./out.txt', 'a+')
    time.sleep(60)
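If you prefer not to compare strings, datetime.time objects compare naturally; a small sketch of the same check (the helper name is mine):

from datetime import datetime, time

def market_open(now=None):
    # True during 09:00-12:30 and 13:30-17:30
    t = (now or datetime.now()).time()
    return time(9, 0) <= t <= time(12, 30) or time(13, 30) <= t <= time(17, 30)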
Have a look at APScheduler
from apscheduler.scheduler import Scheduler

sched = Scheduler()

@sched.interval_schedule(hours=3)
def some_job():
    print "Decorated job"

sched.configure(options_from_ini_file)
sched.start()
You can also specify a date and time:
job = sched.add_date_job(my_job, datetime(2009, 11, 6, 16, 30, 5), ['text'])
Obviously you'll have to write some code to turn these on and off (sched.start() / sched.stop()) at the relevant times, but then it will go and fetch the data as often as you have set on the decorator, automatically. You could even schedule the scheduler!
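For the trading-hours requirement specifically, the 2.x API also had a cron-style decorator (sched.cron_schedule); a rough sketch, assuming that API is available in your version (note that cron-style fields can't express the 12:30/13:30 half-hour boundaries exactly, so treat this as an approximation):

@sched.cron_schedule(day_of_week='mon-fri', hour='9-17', minute='*/5')
def fetch_stocks_job():
    # runs every 5 minutes during (roughly) trading hours
    print "Fetching stock data"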
If you want to schedule this script on Windows, use the Task Scheduler.
It has a GUI for configuration and is pretty easy. For Linux, crontab is better. Most importantly, you don't need to modify your code, and it is much more stable for long-term running.
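For the crontab route, entries approximating the trading windows might look like the sketch below (the script path is hypothetical, and cron's minute/hour fields need several lines to hit the half-hour boundaries):

# every 5 minutes, Mon-Fri, approximating 09:00-12:30 and 13:30-17:30
*/5     9-11  * * 1-5  python /usr/bin/fetch-stocks.py
0-30/5  12    * * 1-5  python /usr/bin/fetch-stocks.py
30-55/5 13    * * 1-5  python /usr/bin/fetch-stocks.py
*/5     14-16 * * 1-5  python /usr/bin/fetch-stocks.py
0-30/5  17    * * 1-5  python /usr/bin/fetch-stocks.py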
Excuse the unhelpful variable names and unnecessarily bloated code, but I just quickly whipped this together and haven't had time to optimise or tidy up yet.
I wrote this program to dump all the images my friend and I had sent to each other using a webcam photo sharing service ( 321cheese.com ) by parsing a message log for the URLs. The problem is that my multithreading doesn't seem to work.
At the bottom of my code, you'll see my commented-out non-multithreaded download method, which consistently produces the correct results (which is 121 photos in this case). But when I try to send this action to a new thread, the program sometimes downloads 112 photos, sometimes 90, sometimes 115 photos, etc, but never gives out the correct result.
Why would this create a problem? Should I limit the number of simultaneous threads (and how)?
import urllib
import thread

def getName(input):
    l = input.split(".com/")
    m = l[1]
    return m

def parseMessages():
    theFile = open('messages.html', 'r')
    theLines = theFile.readlines()
    theFile.close()
    theNewFile = open('new321.txt', 'w')
    for z in theLines:
        if "321cheese" in z:
            theNewFile.write(z)
    theNewFile.close()

def downloadImage(inputURL):
    urllib.urlretrieve(inputURL, "./grabNew/" + d)

parseMessages()

f = open('new321.txt', 'r')
lines = f.readlines()
f.close()

g = open('output.txt', 'w')
for x in lines:
    a = x.split("<a href=\"")
    b = a[1].split("\"")
    c = b[0]
    if ".png" in c:
        d = getName(c)
        g.write(c + "\n")
        thread.start_new_thread(downloadImage, (c,))
        ##downloadImage(c)
g.close()
There are multiple issues in your code.
The main issue is the use of the global name d across multiple threads. To fix it, pass the name explicitly as an argument to downloadImage().
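For instance, a minimal version of that fix against the question's code (sketch):

def downloadImage(inputURL, name):
    # each thread receives its own copy of the filename
    urllib.urlretrieve(inputURL, "./grabNew/" + name)

# ... and inside the loop:
thread.start_new_thread(downloadImage, (c, d))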
The easy way (code-wise) to limit the number of concurrent downloads is to use concurrent.futures (available on Python 2 as futures) or multiprocessing.Pool:
#!/usr/bin/env python
import os
import urllib
from multiprocessing import Pool
from posixpath import basename
from urllib import unquote
from urlparse import urlsplit

download_dir = "grabNew"

def url2filename(url):
    return basename(unquote(urlsplit(url).path).decode('utf-8'))

def download_image(url):
    filename = None
    try:
        filename = os.path.join(download_dir, url2filename(url))
        return urllib.urlretrieve(url, filename), None
    except Exception as e:
        return (filename, None), e

def main():
    pool = Pool(processes=10)
    for (filename, headers), error in pool.imap_unordered(download_image, get_urls()):
        pass  # do something with the downloaded file or handle an error

if __name__ == "__main__":
    main()
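For comparison, a minimal sketch of the same pattern with concurrent.futures (the futures backport on Python 2); it assumes the download_image() and get_urls() from above:

from concurrent.futures import ThreadPoolExecutor

def main():
    with ThreadPoolExecutor(max_workers=10) as executor:
        for (filename, headers), error in executor.map(download_image, get_urls()):
            pass  # handle the result or the error, as with the Pool version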
Did you make sure your parsing is working correctly?
Also, you are launching too many threads.
And finally... threads in Python are FAKE! Use the multiprocessing module if you want real parallelism, but since the images are probably all from the same server, if you open one hundred connections at the same time to the same server, its firewall will probably start dropping your connections.
I'm trying to download a daily backup file from my server to my local storage server, but I've got some problems.
I wrote this code (removed the useless parts, such as the email function):
import os
from time import strftime
from ftplib import FTP
import smtplib
from email.MIMEMultipart import MIMEMultipart
from email.MIMEBase import MIMEBase
from email.MIMEText import MIMEText
from email import Encoders
day = strftime("%d")
today = strftime("%d-%m-%Y")
link = FTP(ftphost)
link.login(passwd = ftp_pass, user = ftp_user)
link.cwd(file_path)
link.retrbinary('RETR ' + file_name, open('/var/backups/backup-%s.tgz' % today, 'wb').write)
link.delete(file_name) #delete the file from online server
link.close()
mail(user_mail, "Download database %s" % today, "Database successfully downloaded: %s" % file_name)
exit()
And I run this with a crontab entry like:
40 23 * * * python /usr/bin/backup-transfer.py >> /var/log/backup-transfer.log 2>&1
It works with small files, but with the backup files (about 1.7 GB) it freezes; the downloaded file reaches about 1.2 GB and then never grows (I waited about a day), and the log file is empty.
Any ideas?
P.S.: I'm using Python 2.6.5.
Sorry to answer my own question, but I found the solution.
I tried ftputil with no success, so I tried many approaches and finally this works:
def ftp_connect(path):
    link = FTP(host='example.com', timeout=5)  # keep a low timeout
    link.login(passwd='ftppass', user='ftpuser')
    debug("%s - Connected to FTP" % strftime("%d-%m-%Y %H.%M"))
    link.cwd(path)
    return link

downloaded = open('/local/path/to/file.tgz', 'wb')

def debug(txt):
    print txt

link = ftp_connect(path)
file_size = link.size(filename)

max_attempts = 5  # I don't want infinite loops

while file_size != downloaded.tell():
    try:
        debug("%s while > try, run retrbinary\n" % strftime("%d-%m-%Y %H.%M"))
        if downloaded.tell() != 0:
            # resume from where the previous attempt stopped (rest must be a keyword
            # argument; the third positional argument of retrbinary is blocksize)
            link.retrbinary('RETR ' + filename, downloaded.write, rest=downloaded.tell())
        else:
            link.retrbinary('RETR ' + filename, downloaded.write)
    except Exception as myerror:
        if max_attempts != 0:
            debug("%s while > except, something going wrong: %s\n\tfile length is: %i > %i\n" %
                  (strftime("%d-%m-%Y %H.%M"), myerror, file_size, downloaded.tell())
                  )
            link = ftp_connect(path)
            max_attempts -= 1
        else:
            break

debug("Done with file, attempt to download md5sum")

[...]
In my log file I found:
01-12-2011 23.30 - Connected to FTP
01-12-2011 23.30 while > try, run retrbinary
02-12-2011 00.31 while > except, something going wrong: timed out
file length is: 1754695793 > 1754695793
02-12-2011 00.31 - Connected to FTP
Done with file, attempt to download md5sum
Sadly, I have to reconnect to the FTP server even if the file has been fully downloaded, which in my case is not a problem, because I have to download the md5sum too.
As you can see, I wasn't able to detect the timeout and resume the connection; when I get a timeout, I simply reconnect. If someone knows how to reconnect without creating a new ftplib.FTP instance, let me know ;)
You might try setting the timeout. From the docs:
# timeout in seconds
link = FTP(host=ftp_host, user=ftp_user, passwd=ftp_pass, acct='', timeout=3600)
I implemented code with ftplib which can monitor connection, reconnect and redownload file in case of failure. Details here: How to download big file in python via ftp (with monitoring & reconnect)?