My code for downloading files from URLs with requests stalls for no apparent reason. When I start the script it downloads several hundred files, but then it just stops somewhere. If I try the URL manually in the browser, the image loads without a problem. I also tried urllib.urlretrieve, but had the same problem. I use Python 2.7.5 on OS X.
Below you will find:
the code I use,
the dtruss output captured while the program is stalling, and
the traceback that is printed when I Ctrl-C the process after nothing has happened for 10 minutes.
Code:
def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = requests.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
Stacktrace (dtruss):
My-desk:~ Me$ sudo dtruss -p 708
SYSCALL(args) = return
Traceback:
318 http://farm1.static.flickr.com/32/47394454_10e6d7fd6d.jpg
Traceback (most recent call last):
File "slow_download.py", line 71, in <module>
if final_path == '':
File "slow_download.py", line 34, in download_photos_from_urls
download_path = concept+'/'+url.split('/')[-1]
File "slow_download.py", line 21, in download_from_url
with open(download_path, 'wb') as handle:
File "/Library/Python/2.7/site-packages/requests/models.py", line 638, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 256, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 186, in read
data = self._fp.read(amt)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
KeyboardInterrupt
So, just to unify all the comments, and propose a potential solution: There are a couple of reasons why your downloads are failing after a few hundred - it may be internal to Python, such as hitting the maximum number of open file handles, or it may be an issue with the server blocking you for being a robot.
You didn't share all of your code, so it's a bit difficult to say, but at least with what you've shown you're using the with context manager when opening the files to write to, so you shouldn't run into problems there. There's the possibility that the request objects are not getting closed properly after exiting the loop, but I'll show you how to deal with that below.
The default requests User-Agent is (on my machine):
python-requests/2.4.1 CPython/3.4.1 Windows/8
so it's not inconceivable that the server(s) you're requesting from screen for UAs like this and limit the number of connections. The reason the code also worked with urllib.urlretrieve for a while is that its UA differs from requests', so the server allowed it to continue for roughly the same number of requests before shutting it down, too.
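If you want to confirm what your own install sends, requests exposes its default User-Agent string (a quick check; the exact output depends on your requests and Python versions):

import requests

# Print the default User-Agent this installation would send.
print(requests.utils.default_user_agent())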
To get around these issues, I suggest altering your download_from_url() function to something like this:
import requests
from time import sleep

def download_from_url(url, download_path, delay=5):
    headers = {'Accept-Encoding': 'identity, deflate, compress, gzip',
               'Accept': '*/*',
               'Connection': 'keep-alive',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0'}
    with open(download_path, 'wb') as handle:
        response = requests.get(url, headers=headers)  # no stream=True, that could be an issue
        handle.write(response.content)
        response.close()
    sleep(delay)
Instead of using stream=True, we use the default value of False to immediately download the full content of the request. The headers dict contains a few default values, as well as the all-important 'User-Agent' value, which in this example happens to be my UA, determined by using What'sMyUserAgent. Feel free to change it to the one returned by your preferred browser.

Instead of messing around with iterating through the content in 1 KB blocks, here I just write the entire content to disk at once, eliminating extraneous code and some potential sources of errors - for example, if there were a hiccup in your network connectivity, you could temporarily get empty blocks and break out of the loop in error. I also explicitly close the request, just in case.

Finally, I added an extra parameter to your function, delay, to make the function sleep for a certain number of seconds before returning. I gave it a default value of 5; you can make it whatever you want (it also accepts floats for fractional seconds).
I don't happen to have a large list of image URLs lying around to test this, but it should work as expected. Good luck!
Perhaps the lack of connection pooling is causing too many connections. Try something like this (using a session):
import requests

session = requests.Session()

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = session.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
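As an extra precaution against the open-handle/connection exhaustion mentioned in the other answer, you could also make sure each streamed response is closed once you are done with it. This is only a sketch of that variation (using contextlib.closing), not something from the original answers:

import contextlib
import requests

session = requests.Session()

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        # closing() guarantees the response (and its connection) is released
        # back to the session's pool, even if we break out of the loop early.
        with contextlib.closing(session.get(url, stream=True)) as response:
            for block in response.iter_content(1024):
                if not block:
                    break
                handle.write(block)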
Related
I'm building an HTTP client that reads a stream from a server. Right now I am using the requests module, but I am having trouble with response.iter_lines(): every few iterations I lose data.
Python Ver. 3.7
requests Ver. 2.21.0
I tried different approaches, including generators (which for some reason raise StopIteration after very few iterations). I also tried setting chunk_size=None to prevent losing data, but the problem still occurs.
response = requests.get(url, headers=headers, stream=True, timeout=60 * 10)
gen = response.iter_lines(chunk_size=None)
try:
    for line in gen:
        json_data = json.loads(line)
        yield json_data
except StopIteration:
    return

def http_parser():
    json_list = []
    response = requests.get(url, headers=headers, stream=True, timeout=60 * 10)
    for line in response.iter_lines():
        json_data = json.loads(line)
        json_list.append(json_data)
    return json_list
Both functions cause loss of data.
In the requests documentation it is mentioned as a warning that iter_lines() may cause loss of data.
Does anyone have a recommendation of another module that has a similar ability that won't cause any loss of data?
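One possible workaround (not from the original post; a minimal sketch that assumes the server sends newline-delimited JSON) is to skip iter_lines() and buffer raw chunks from iter_content() yourself, emitting a record only once a complete line has arrived:

import json
import requests

def iter_json_stream(url, headers=None):
    # Sketch: split newline-delimited JSON out of the raw chunks ourselves,
    # so no partial line is dropped between iterations.
    response = requests.get(url, headers=headers, stream=True, timeout=60 * 10)
    buf = b''
    for chunk in response.iter_content(chunk_size=8192):
        if not chunk:
            continue
        buf += chunk
        while b'\n' in buf:
            line, buf = buf.split(b'\n', 1)
            if line.strip():
                yield json.loads(line)
    if buf.strip():
        yield json.loads(buf)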
1) A script that I had working for many weeks broke a few days ago. I can't parse the JSON properly now. So this is not brand-new code; it's something that has been in operation for months.
2) Something changed in the servicing website and it's making the JSON non-compliant, and I have been trying to work around the issue with no success. I think it may be an extra space or something, but I can't change the information returned by the servicing website.
3) I know the JSON is not compliant because I fed the URL of the service (with my credentials and the required format) into a validator (https://jsonformatter.curiousconcept.com/). I get results back, but validation fails with "Invalid encoding, expecting UTF-8, UTF-16 or UTF-32. [Code 29, Structure 0]". There is a way to tell the validator not to validate, and then the JSON looks fine, but Python will have nothing to do with it. When I run my script it reports:
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0).
4) Below are the URL I enter manually and the script. I have obfuscated all sensitive and personal information, so the URL won't work if you try it, but with the non-obfuscated version I do get a JSON response.
5) Manual URL (obfuscated):
https://mystuff.mydevices.com/Membership/SomeOtherURLrelated?appId=BB8pQgg123450WHahgl12345nAkkX67890q2HrHD7H1nabcde5KqtN654321LB%2fi&securityToken=null&username=myemail#somedomain.com&password=mypassword&culture=en
6) If I manually opened a browser and put the previous real URL (unmodified), the browser responds with json. An example (obfuscated):
{"UserId":0,"SecurityToken":"abcdb8c3-1ef1-1110-1234-402a914f52aa","ReturnCode":"0","ErrorMessage":"","BrandId":2,"BrandName":"Mydevicebrandname","RegionId":1}
7) What can I do to overcome this? Any suggestions? I have been reading and testing, but no luck!
8) Now the script (obfuscated), which basically builds the previous URL and extracts from the JSON a one-time security token that I can then use for other purposes in a much bigger application:
import json, requests
import sys  # needed for sys.exit() below

APPID = 'BB8pQgg123450WHahgl12345nAkkX67890q2HrHD7H1nabcde5KqtN654321LB%2fi'
USERNAME = 'myemail#somedomain.com'
PASSWORD = 'mypassword'
CULTURE = 'en'
SERVICE = 'https://mystuff.mydevices.com'

def get_token_formydevices():
    payload = {'appId': APPID,
               'securityToken': 'null',
               'username': USERNAME,
               'password': PASSWORD,
               'culture': CULTURE}
    login_url = SERVICE + '/Membership/SomeOtherURLrelated'
    try:
        r = requests.get(login_url, params=payload)
    except requests.exceptions.RequestException as err:
        return
    data = r.json()
    if data['ReturnCode'] != '0':
        print(data['ErrorMessage'])
        sys.exit(1)
    return data['SecurityToken']

tokenneeded = get_token_formydevices()
print tokenneeded
9) When I run the previous code this is what I get back:
Traceback (most recent call last):
File "testtoken.py", line 33, in <module>
tokenneeded = get_token_formydevices()
File "testtoken.py", line 26, in get_token_formydevices
data = r.json()
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/requests/models.py", line 826, in json
return complexjson.loads(self.text, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/simplejson/__init__.py", line 516, in loads
return _default_decoder.decode(s)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/simplejson/decoder.py", line 370, in decode
obj, end = self.raw_decode(s)
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/simplejson/decoder.py", line 400, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
simplejson.scanner.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
I found a solution and want to share it.
I was very puzzled by the fact that I could open the servicing URL in a web browser and get some JSON back, but I couldn't do it from my script or even with cURL. I kept getting "request denied" even though the request worked from the browser, so it had to be something else.
So I started experimenting and sent User-Agent information with the request in my script, and voilà! The code below works, although I have obfuscated the original URL and my credentials for protection.
I should also explain that I was doing this because the servicing URL returns a one-time token that I can then use to trigger another action. I need this routine to run as many times as I have specific actions to carry out, so all I wanted was to retrieve the token from the JSON returned by that URL. Hopefully this makes more sense with the code below.
import json,urllib2
url='https://mystuff.mydevices.com/Membership/SomeOtherURLrelated?appId=BB8pQgg123450WHahgl12345nAkkX67890q2HrHD7H1nabcde5KqtN654321LB%2fi&securityToken=null&username=myemail#somedomain.com&password=mypassword&culture=en'
request = urllib2.Request(url)
#request.add_header('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36') # <--- this works too
request.add_header('User-Agent', 'Mozilla/5.0')
data = json.loads(str(urllib2.urlopen(request).read()))
token = data["SecurityToken"]
print token
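For comparison only (not part of the original answer), the same User-Agent fix can be applied to the requests-based script from the question; this sketch reuses the obfuscated constants from above, so it won't run as-is:

import requests

APPID = 'BB8pQgg123450WHahgl12345nAkkX67890q2HrHD7H1nabcde5KqtN654321LB%2fi'
USERNAME = 'myemail#somedomain.com'
PASSWORD = 'mypassword'

login_url = 'https://mystuff.mydevices.com/Membership/SomeOtherURLrelated'
payload = {'appId': APPID, 'securityToken': 'null',
           'username': USERNAME, 'password': PASSWORD, 'culture': 'en'}
# The browser-like User-Agent is what makes the service respond with JSON.
headers = {'User-Agent': 'Mozilla/5.0'}

r = requests.get(login_url, params=payload, headers=headers)
data = r.json()
print data['SecurityToken']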
I'm on Python 3.5.1, using requests, the relevant part of the code is as follows:
req = requests.post(self.URL, data={"username": username, "password": password})
self.cookies = {"MOODLEID1_": req.cookies["MOODLEID1_"], "MoodleSession": req.cookies["MoodleSession"]}
self.URL points to the correct page, and the POST works as intended; I added some prints to check that, and it passed.
My output:
Traceback (most recent call last):
File "D:/.../main.py", line 14, in <module>
m.login('first.last', 'pa$$w0rd!')
File "D:\...\moodle2.py", line 14, in login
self.cookies = {"MOODLEID1_": req.cookies["MOODLEID1_"], "MoodleSession": req.cookies["MoodleSession"]}
File "D:\...\venv\lib\site-packages\requests\cookies.py", line 287, in __getitem__
return self._find_no_duplicates(name)
File "D:\...\venv\lib\site-packages\requests\cookies.py", line 345, in _find_no_duplicates
raise KeyError('name=%r, domain=%r, path=%r' % (name, domain, path))
KeyError: "name='MOODLEID1_', domain=None, path=None"
I'm trying to debug at runtime to check what req.cookies contains. What I get is surprising, at least to me: if I put a breakpoint on self.cookies = {...} and run [(c.name, c.value, c.domain) for c in req.cookies], I get an empty list, as if there were no cookies in there.
The site does send cookies; checking with a Chrome extension, I found two, "MOODLEID1_" and "MoodleSession". So why am I not getting them?
The response doesn't appear to contain any cookies. Look for one or more Set-Cookie headers in req.headers.
Cookies stored in a browser are there because a response included a Set-Cookie header for each of those cookies. You'll have to find what response the server sets those cookies with; apparently it is not this response.
If you need to retain those cookies (once set) across requests, do use a requests.Session() object; this'll retain any cookies returned by responses and send them out again as appropriate with new requests.
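A minimal sketch of that approach (LOGIN_URL, USERNAME, PASSWORD, and PROFILE_URL below are placeholders for the question's own values and a hypothetical follow-up page):

import requests

session = requests.Session()

# The session records any Set-Cookie headers from this response and from any
# redirects it follows along the way.
resp = session.post(LOGIN_URL, data={"username": USERNAME, "password": PASSWORD})

# Inspect which cookies have actually been collected so far.
print(session.cookies.get_dict())

# Later requests through the same session send those cookies automatically.
profile = session.get(PROFILE_URL)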
I'm writing a Python script that uses the Requests lib to fire off a request to a remote web service. Here is my code (test.py):
import logging.config
from requests import Request, Session

logging.config.fileConfig('../../resources/logging.conf')
logr = logging.getLogger('pyLog')

url = 'https://158.74.36.11:7443/hqu/hqapi1/user/get.hqu'
token01 = 'hqstatus_python'
token02 = 'ytJFRyV7g'
response_length = 351

def main():
    try:
        logr.info('start SO example')
        s = Session()
        prepped = Request('GET', url, auth=(token01, token02), params={'name': token01}).prepare()
        response = s.send(prepped, stream=True, verify=False)
        logr.info('status: ' + str(response.status_code))
        logr.info('elapsed: ' + str(response.elapsed))
        logr.info('headers: ' + str(response.headers))
        logr.info('content: ' + response.raw.read(response_length).decode())
    except Exception:
        logr.exception("Exception")
    finally:
        logr.info('stop')

if __name__ == '__main__':
    main()
I get the following successful output when I run this:
INFO test - start SO example
INFO test - status: 200
INFO test - elapsed: 0:00:00.532053
INFO test - headers: CaseInsensitiveDict({'server': 'Apache-Coyote/1.1', 'set-cookie': 'JSESSIONID=8F87A69FB2B92F3ADB7F8A73E587A10C; Path=/; Secure; HttpOnly', 'content-type': 'text/xml;charset=UTF-8', 'transfer-encoding': 'chunked', 'date': 'Wed, 18 Sep 2013 06:34:28 GMT'})
INFO test - content: <?xml version="1.0" encoding="utf-8"?>
<UserResponse><Status>Success</Status> .... </UserResponse>
INFO test - stop
As you can see, there is this weird variable 'response_length' that I need to pass to response.raw.read() (as an optional argument) to be able to read the content. It has to be set to a numeric value equal to the length of the content, which obviously means I need to know the response content length beforehand, and that is unreasonable.
If I don't pass that variable, or set it to a value greater than the content length, I get the following error:
Traceback (most recent call last):
File "\Python33\lib\http\client.py", line 590, in _readall_chunked
chunk_left = self._read_next_chunk_size()
File "\Python33\lib\http\client.py", line 562, in _read_next_chunk_size
return int(line, 16)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 0: invalid start byte
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "test.py", line 22, in main
logr.info('content: ' + response.raw.read().decode())
File "\Python33\lib\site-packages\requests\packages\urllib3\response.py", line 167, in read
data = self._fp.read()
File "\Python33\lib\http\client.py", line 509, in read
return self._readall_chunked()
File "\Python33\lib\http\client.py", line 594, in _readall_chunked
raise IncompleteRead(b''.join(value))
http.client.IncompleteRead: IncompleteRead(351 bytes read)
How do I make this work without this 'response_length' variable?
Also, are there any better options than 'Requests' lib?
PS: this code is an independent script, and does not run in the Django framework.
Use the public API instead of internals and leave worrying about content length and reading to the library:
import requests
s = requests.Session()
s.verify = False
s.auth = (token01, token02)
resp = s.get(url, params={'name': token01}, stream=True)
content = resp.content
or, since stream=True is set, you can stream the response as it arrives:
for line in resp.iter_lines():
    # process a line
or
for chunk in resp.iter_content():
    # process a chunk
If you must have a file-like object, then resp.raw can be used (provided stream=True is set on the request, as done above), but then just use .read() calls without a length to read to EOF.
If, however, you are not querying a resource that requires streaming (anything but a large file, a requirement to test headers first, or a web service explicitly documented as a streaming service), just leave off stream=True and use resp.content or resp.text for byte or unicode response data.
In the end, however, it appears your server is sending chunked responses that are malformed or incomplete; a chunked transfer encoding includes length information for each chunk and the server appears to be lying about a chunk length or sending too little data for a given chunk. The decode error is merely the result of incomplete data having been sent.
The server you are requesting from uses "chunked" transfer encoding, so there is no Content-Length header. A raw response in chunked transfer encoding contains not only the actual content but also the chunk framing: each chunk is prefixed by its length in hex followed by "\r\n", and that framing will always trip up an XML or JSON parser.
Try using:
response.raw.read(decode_content=True)
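In context, that suggestion might look like the sketch below (reusing the url/auth variables from the question; decode_content=True asks urllib3 to undo any Content-Encoding such as gzip):

import requests

resp = requests.get(url, auth=(token01, token02), params={'name': token01},
                    stream=True, verify=False)
# Read the raw urllib3 stream to EOF, letting it decode any content-encoding.
body = resp.raw.read(decode_content=True)
print(body.decode('utf-8'))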
How do I seek to a particular position on a remote (HTTP) file so I can download only that part?
Let's say the bytes in a remote file were: 1234567890
I want to seek to 4 and download 3 bytes from there, so I would have: 456
And also, how do I check whether a remote file exists?
I tried os.path.isfile(), but it returns False when I pass it a remote file URL.
If you are downloading the remote file through HTTP, you need to set the Range header.
Check this example of how it can be done; it looks like this:
myUrlclass.addheader("Range","bytes=%s-" % (existSize))
EDIT: I just found a better implementation. This class is very simple to use, as can be seen in the docstring.
class HTTPRangeHandler(urllib2.BaseHandler):
    """Handler that enables HTTP Range headers.

    This was extremely simple. The Range header is a HTTP feature to
    begin with so all this class does is tell urllib2 that the
    "206 Partial Content" response from the HTTP server is what we
    expected.

    Example:
        import urllib2
        import byterange

        range_handler = byterange.HTTPRangeHandler()
        opener = urllib2.build_opener(range_handler)

        # install it
        urllib2.install_opener(opener)

        # create Request and set Range header
        req = urllib2.Request('http://www.python.org/')
        req.headers['Range'] = 'bytes=30-50'
        f = urllib2.urlopen(req)
    """

    def http_error_206(self, req, fp, code, msg, hdrs):
        # 206 Partial Content Response: this is what we asked for, so wrap it
        # up as a normal successful response instead of treating it as an error.
        r = urllib.addinfourl(fp, hdrs, req.get_full_url())
        r.code = code
        r.msg = msg
        return r

    def http_error_416(self, req, fp, code, msg, hdrs):
        # HTTP's Range Not Satisfiable error
        # (RangeError is defined elsewhere in byterange.py)
        raise RangeError('Requested Range Not Satisfiable')
Update: The "better implementation" has moved to github: excid3/urlgrabber in the byterange.py file.
I highly recommend using the requests library. It is easily the best HTTP library I have ever used. In particular, to accomplish what you have described, you would do something like:
import requests
url = "http://www.sffaudio.com/podcasts/ShellGameByPhilipK.Dick.pdf"
# Retrieve bytes between offsets 3 and 5 (inclusive).
r = requests.get(url, headers={"range": "bytes=3-5"})
# If a 4XX client error or a 5XX server error is encountered, we raise it.
r.raise_for_status()
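If the server honors the Range header, the status code will be 206 (Partial Content) and the body contains only the requested bytes; continuing the snippet above:

# A 206 status means only the requested byte range was returned;
# a plain 200 would mean the server ignored the Range header and sent everything.
if r.status_code == 206:
    print(r.content)  # the three bytes at offsets 3-5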
AFAIK, this is not possible using fseek() or similar. You need to use the HTTP Range header to achieve this. This header may or may not be supported by the server, so your mileage may vary.
import urllib2
myHeaders = {'Range':'bytes=0-9'}
req = urllib2.Request('http://www.promotionalpromos.com/mirrors/gnu/gnu/bash/bash-1.14.3-1.14.4.diff.gz',headers=myHeaders)
partialFile = urllib2.urlopen(req)
s2 = (partialFile.read())
EDIT: This is of course assuming that by remote file you mean a file stored on an HTTP server...
If the file you want is on an FTP server, FTP only allows you to specify a start offset, not a range. If this is what you want, then the following code should do it (not tested!):
import ftplib

fileToRetrieve = 'somefile.zip'
fromByte = 15

ftp = ftplib.FTP('ftp.someplace.net')
ftp.login()  # anonymous login; needed before RETR (missing from the untested original)
outFile = open('partialFile', 'wb')
ftp.retrbinary('RETR ' + fileToRetrieve, outFile.write, rest=str(fromByte))
outFile.close()
You can use httpio to access remote HTTP files as if they were local:
pip install httpio
import zipfile
import httpio
url = "http://some/large/file.zip"
with httpio.open(url) as fp:
zf = zipfile.ZipFile(fp)
print(zf.namelist())