How to read a stream with Python 3? (not with requests module) - python

I'm building an HTTP client that reads a stream from a server. Right now I am using the requests module, but I am having trouble with response.iter_lines(): every few iterations I lose data.
Python version: 3.7
requests version: 2.21.0
I tried different methods, including the use of generators (which for some reason raise StopIteration after a very small number of iterations). I also tried setting chunk_size=None to prevent losing data, but the problem still occurs.
response = requests.get(url, headers=headers, stream=True, timeout=60 * 10)
gen = response.iter_lines(chunk_size=None)
try:
    for line in gen:
        json_data = json.loads(line)
        yield json_data
except StopIteration:
    return
def http_parser():
    json_list = []
    response = requests.get(url, headers=headers, stream=True, timeout=60 * 10)
    for line in response.iter_lines():
        json_data = json.loads(line)
        json_list.append(json_data)
    return json_list
Both functions cause loss of data.
The requests documentation warns that iter_lines() may cause data loss.
Can anyone recommend another module with similar streaming functionality that does not lose data?
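One workaround worth trying (this is an assumption, not something from the requests docs or the original thread): skip iter_lines() entirely and split lines yourself on top of iter_content(), keeping any partial trailing line in a buffer so nothing is dropped between chunks. A minimal sketch, assuming the server sends newline-delimited JSON:
import json
import requests

def stream_json_lines(url, headers=None):
    # Split lines manually instead of relying on iter_lines(); any partial
    # line at the end of a chunk stays in the buffer until it is completed.
    with requests.get(url, headers=headers, stream=True, timeout=60 * 10) as response:
        buffer = b""
        for chunk in response.iter_content(chunk_size=8192):
            buffer += chunk
            while b"\n" in buffer:
                line, buffer = buffer.split(b"\n", 1)
                if line.strip():
                    yield json.loads(line)
        if buffer.strip():  # the stream may not end with a newline
            yield json.loads(buffer)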

Related

Python memory issue uploading multiple files to API

I'm running a script to upload 20k+ XML files to an API. About 18k in, I get a memory error. I was looking into it and found the memory is just continually climbing until it reaches the limit and errors out (seemingly on the post call). Anyone know why this is happening or a fix? Thanks. I have tried the streaming uploads found here. The empty strings are due to sensitive data.
def upload(self, oauth_token, full_file_path):
    file_name = os.path.basename(full_file_path)
    upload_endpoint = {'': ''}
    params = {'': '', '': ''}
    headers = {'': '', '': ''}
    handler = None
    try:
        handler = open(full_file_path, 'rb')
        response = requests.post(url=upload_endpoint[''], params=params, data=handler, headers=headers,
                                 auth=oauth_token, verify=False, allow_redirects=False, timeout=600)
        status_code = response.status_code
        # status checking
        return status_code
    finally:
        if handler:
            handler.close()

def push_data(self):
    oauth_token = self.get_oauth_token()
    files = os.listdir(f_dir)
    for file in files:
        status = self.upload(oauth_token, file_to_upload)
What version of Python are you using? It looks like there is a bug in Python 3.4 causing memory leaks related to network requests. See here for a similar issue: https://github.com/psf/requests/issues/5215
It may help to update Python.
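Another thing worth trying, as an assumption rather than a confirmed fix: reuse a single Session for all uploads and release each response explicitly, so connection pools and response bodies don't accumulate across 20k+ requests. A rough sketch (upload_url, params and headers stand in for the redacted values above):
import requests

session = requests.Session()  # one connection pool reused for every upload

def upload(oauth_token, full_file_path, upload_url, params, headers):
    # Stream the file from disk and release the response as soon as the
    # status code has been read, so memory stays flat across many uploads.
    with open(full_file_path, 'rb') as handler:
        response = session.post(url=upload_url, params=params, data=handler,
                                headers=headers, auth=oauth_token, verify=False,
                                allow_redirects=False, timeout=600)
    try:
        return response.status_code
    finally:
        response.close()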

python requests randomly breaks with JSONDecodeError

I have been debugging for hours, trying to figure out why my code randomly breaks with this error: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
This is the code I have:
while True:
    try:
        submissions = requests.get('http://reymisterio.net/data-dump/api.php/submission?filter[]=form,cs,'+client+'&filter[]=date,cs,'+since).json()['submission']['records']
        break
    except requests.exceptions.ConnectionError:
        time.sleep(100)
I've been debugging by printing requests.get(url) and requests.get(url).text, and I have encountered the following "special" cases:
Case 1: requests.get(url) returns a successful 200 response and requests.get(url).text returns HTML. I have read online that requests.get(url).json() should fail here, because it cannot parse the HTML, but somehow it doesn't break. Why is this?
Case 2: requests.get(url) returns a successful 200 response and requests.get(url).text is in JSON format. I don't understand why the requests.get(url).json() line then breaks with the JSONDecodeError.
The exact value of requests.get(url).text for case 2 is:
{
    "submission": {
        "columns": [
            "pk",
            "form",
            "date",
            "ip"
        ],
        "records": [
            [
                "21197",
                "mistico-form-contacto-form",
                "2018-09-21 09:04:41",
                "186.179.71.106"
            ]
        ]
    }
}
Looking at the documentation for this API it seems the only responses are in JSON format, so receiving HTML is strange. To increase the likelihood of receiving a JSON response, you can set the 'Accept' header to 'application/json'.
I tried querying this API many times with parameters and did not encounter a JSONDecodeError. This error is likely the result of an occasional error on the server side. To handle it, catch json.decoder.JSONDecodeError in addition to the ConnectionError you currently catch, and handle it the same way as the ConnectionError.
Here is an example with all that in mind:
import requests, json, time, random

def get_submission_records(client, since, try_number=1):
    url = 'http://reymisterio.net/data-dump/api.php/submission?filter[]=form,cs,'+client+'&filter[]=date,cs,'+since
    headers = {'Accept': 'application/json'}
    try:
        response = requests.get(url, headers=headers).json()
    except (requests.exceptions.ConnectionError, json.decoder.JSONDecodeError):
        time.sleep(2**try_number + random.random()*0.01)  # exponential backoff
        return get_submission_records(client, since, try_number=try_number+1)
    else:
        return response['submission']['records']
I've also wrapped this logic in a recursive function rather than using a while loop, because I think it is semantically clearer. The function also waits before retrying, using exponential backoff (waiting twice as long after each failure).
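Called, for example, like this (the argument values are just placeholders taken from the sample response above):
records = get_submission_records('mistico-form-contacto-form', '2018-09-21')
print(records)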
Edit: For Python 2.7, the error raised when trying to parse bad JSON is a ValueError, not a JSONDecodeError:
import requests, time, random

def get_submission_records(client, since, try_number=1):
    url = 'http://reymisterio.net/data-dump/api.php/submission?filter[]=form,cs,'+client+'&filter[]=date,cs,'+since
    headers = {'Accept': 'application/json'}
    try:
        response = requests.get(url, headers=headers).json()
    except (requests.exceptions.ConnectionError, ValueError):
        time.sleep(2**try_number + random.random()*0.01)  # exponential backoff
        return get_submission_records(client, since, try_number=try_number+1)
    else:
        return response['submission']['records']
So just change that except line to catch ValueError instead of json.decoder.JSONDecodeError.
Try this, it might work:
while True:
    try:
        response = requests.get('http://reymisterio.net/data-dump/api.php/submission?filter[]=form,cs,'+client+'&filter[]=date,cs,'+since)
        sub = response.json()['submission']['records']
        print(sub)
        break
    except requests.exceptions.ConnectionError:
        time.sleep(100)

Urllib2 Python - Reconnecting and Splitting Response

I am moving to Python from another language and I am not sure how to properly tackle this. Using the urllib2 library it is quite easy to set up a proxy and get data from a site:
import urllib2
req = urllib2.Request('http://www.voidspace.org.uk')
response = urllib2.urlopen(req)
the_page = response.read()
The problem I have is that the text file being retrieved is very large (hundreds of MB) and the connection is often problematic. The code also needs to catch connection, server and transfer errors (it will be part of a small, extensively used pipeline).
Could anyone suggest how to modify the code above so that it automatically reconnects n times (for example 100 times), and perhaps splits the response into chunks so the data is downloaded faster and more reliably?
I have already split the requests as much as I could, so now I need to make sure that the retrieval code is as good as it can be. Solutions based on core Python libraries are ideal.
Perhaps the library is already doing the above, in which case: is there any way to improve the downloading of large files? I am using UNIX and need to deal with a proxy.
Thanks for your help.
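Since the question explicitly mentions a proxy and prefers core libraries, here is a rough sketch (not from the answers below) of configuring a proxy with urllib2 and retrying a plain download up to n times; the proxy address and retry count are placeholders:
import urllib2

# build an opener that routes requests through the proxy
proxy_handler = urllib2.ProxyHandler({'http': 'http://myproxy.example.com:8080'})
opener = urllib2.build_opener(proxy_handler)

def fetch_with_retries(url, max_retries=100):
    # Retry on connection/server errors up to max_retries times.
    for attempt in range(max_retries):
        try:
            response = opener.open(url, timeout=60)
            return response.read()
        except urllib2.URLError as e:
            print("attempt %d failed: %s" % (attempt + 1, e))
    raise IOError("giving up after %d attempts" % max_retries)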
I'm putting up an example of how you might want to do this with the python-requests library. The script below checks whether the destination file already exists. If a partial destination file exists, it is assumed to be a partially downloaded copy of the file and the script tries to resume the download. If the server claims to support HTTP partial requests (i.e. the response to a HEAD request contains an Accept-Ranges header), the script resumes based on the size of the partially downloaded file; otherwise it just does a regular download and discards the parts that are already downloaded. I think it should be fairly straightforward to convert this to use just urllib2 if you don't want to use python-requests; it will probably just be much more verbose.
Note that resuming a download may corrupt the file if the file on the server was modified between the initial download and the resume. This can be detected if the server supports a strong HTTP ETag header, so the downloader can check whether it is resuming the same file.
I make no claim that it is bug-free.
You should probably add checksum logic around this script to detect download errors and retry from scratch if the checksum doesn't match (a sketch of such a check follows the script below).
import logging
import os
import re
import requests

CHUNK_SIZE = 5*1024  # 5KB

logging.basicConfig(level=logging.INFO)

def stream_download(input_iterator, output_stream):
    for chunk in input_iterator:
        output_stream.write(chunk)

def skip(input_iterator, output_stream, bytes_to_skip):
    # Discard the first bytes_to_skip bytes from the iterator; only the part
    # of a chunk that extends past the skip boundary is written out.
    total_read = 0
    while total_read < bytes_to_skip:
        chunk = next(input_iterator)
        total_read += len(chunk)
        if total_read > bytes_to_skip:
            output_stream.write(chunk[bytes_to_skip - total_read:])
    assert total_read == output_stream.tell()
    return input_iterator

def resume_with_range(url, output_stream):
    dest_size = output_stream.tell()
    headers = {'Range': 'bytes=%s-' % dest_size}
    resp = requests.get(url, stream=True, headers=headers)
    input_iterator = resp.iter_content(CHUNK_SIZE)
    if resp.status_code != requests.codes.partial_content:
        logging.warn('server does not agree to do partial request, skipping instead')
        input_iterator = skip(input_iterator, output_stream, output_stream.tell())
        return input_iterator
    rng_unit, rng_start, rng_end, rng_size = re.match(r'(\w+) (\d+)-(\d+)/(\d+|\*)', resp.headers['Content-Range']).groups()
    rng_start, rng_end, rng_size = map(int, [rng_start, rng_end, rng_size])
    assert rng_start <= dest_size
    if rng_start != dest_size:
        logging.warn('server returned different Range than requested')
        output_stream.seek(rng_start)
    return input_iterator

def download(url, dest):
    ''' Download `url` to `dest`, resuming if `dest` already exists

        If `dest` already exists it is assumed to be a partially
        downloaded file for the url.
    '''
    output_stream = open(dest, 'ab+')
    output_stream.seek(0, os.SEEK_END)
    dest_size = output_stream.tell()

    if dest_size == 0:
        logging.info('STARTING download from %s to %s', url, dest)
        resp = requests.get(url, stream=True)
        input_iterator = resp.iter_content(CHUNK_SIZE)
        stream_download(input_iterator, output_stream)
        logging.info('FINISHED download from %s to %s', url, dest)
        return

    remote_headers = requests.head(url).headers
    remote_size = int(remote_headers['Content-Length'])
    if dest_size < remote_size:
        logging.info('RESUMING download from %s to %s', url, dest)
        support_range = 'bytes' in [s.strip() for s in remote_headers['Accept-Ranges'].split(',')]
        if support_range:
            logging.debug('server supports Range request')
            logging.debug('downloading "Range: bytes=%s-"', dest_size)
            input_iterator = resume_with_range(url, output_stream)
        else:
            logging.debug('skipping %s bytes', dest_size)
            resp = requests.get(url, stream=True)
            input_iterator = resp.iter_content(CHUNK_SIZE)
            input_iterator = skip(input_iterator, output_stream, bytes_to_skip=dest_size)
        stream_download(input_iterator, output_stream)
        logging.info('FINISHED download from %s to %s', url, dest)
        return

    logging.debug('NOTHING TO DO')
    return

def main():
    TEST_URL = 'http://mirror.internode.on.net/pub/test/1meg.test'
    DEST = TEST_URL.split('/')[-1]
    download(TEST_URL, DEST)

main()
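And, as suggested above, a checksum check around download() could look roughly like this; the expected digest is assumed to come from somewhere out of band (for example a published .sha256 file):
import hashlib
import os

def sha256_of_file(path):
    # Hash the file in blocks so a multi-hundred-MB download never has to fit in memory.
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1024 * 1024), b''):
            digest.update(block)
    return digest.hexdigest()

def download_with_checksum(url, dest, expected_sha256):
    download(url, dest)
    if sha256_of_file(dest) != expected_sha256:
        os.remove(dest)       # discard the corrupt file
        download(url, dest)   # retry once from scratch
        assert sha256_of_file(dest) == expected_sha256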
You can try something like this. It reads the response line by line and appends it to a file, checking along the way that it doesn't write the same line twice. I'll write another script that does it in chunks as well.
import urllib2

file_checker = None
print("Please Wait...")
while True:
    try:
        req = urllib2.Request('http://www.voidspace.org.uk')
        response = urllib2.urlopen(req, timeout=20)
        print("Connected")
        with open("outfile.html", 'w+') as out_data:
            for data in response.readlines():
                file_checker = open("outfile.html")
                if data not in file_checker.readlines():
                    out_data.write(str(data))
        break
    except urllib2.URLError:
        print("Connection Error!")
        print("Connecting again...please wait")
        file_checker.close()
print("done")
Here's how to read the data in chunks instead of by lines
import urllib2

CHUNK = 16 * 1024
file_checker = None
print("Please Wait...")
while True:
    try:
        req = urllib2.Request('http://www.voidspace.org.uk')
        response = urllib2.urlopen(req, timeout=1)
        print("Connected")
        with open("outdata", 'wb+') as out_data:
            while True:
                chunk = response.read(CHUNK)
                file_checker = open("outdata")
                if chunk and chunk not in file_checker.readlines():
                    out_data.write(chunk)
                else:
                    break
        break
    except urllib2.URLError:
        print("Connection Error!")
        print("Connecting again...please wait")
        file_checker.close()
print("done")

Downloading hundreds of files using `request` stalls in the middle.

I have the problem that my code to download files from URLs using requests stalls for no apparent reason. When I start the script it downloads several hundred files, but then it just stops somewhere. If I try the URL manually in the browser, the image loads without problem. I also tried urllib.urlretrieve, but had the same problem. I use Python 2.7.5 on OSX.
Below you will find:
the code I use,
the stack trace (dtruss) captured while the program is stalling, and
the traceback that is printed when I Ctrl-C the process after nothing has happened for 10 minutes.
Code:
def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = requests.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
stacktrace:
My-desk:~ Me$ sudo dtruss -p 708
SYSCALL(args) = return
Traceback:
318 http://farm1.static.flickr.com/32/47394454_10e6d7fd6d.jpg
Traceback (most recent call last):
File "slow_download.py", line 71, in <module>
if final_path == '':
File "slow_download.py", line 34, in download_photos_from_urls
download_path = concept+'/'+url.split('/')[-1]
File "slow_download.py", line 21, in download_from_url
with open(download_path, 'wb') as handle:
File "/Library/Python/2.7/site-packages/requests/models.py", line 638, in generate
for chunk in self.raw.stream(chunk_size, decode_content=True):
File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 256, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/Library/Python/2.7/site-packages/requests/packages/urllib3/response.py", line 186, in read
data = self._fp.read(amt)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/httplib.py", line 567, in read
s = self.fp.read(amt)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
KeyboardInterrupt
So, just to unify all the comments, and propose a potential solution: There are a couple of reasons why your downloads are failing after a few hundred - it may be internal to Python, such as hitting the maximum number of open file handles, or it may be an issue with the server blocking you for being a robot.
You didn't share all of your code, so it's a bit difficult to say, but at least with what you've shown you're using the with context manager when opening the files to write to, so you shouldn't run into problems there. There's the possibility that the request objects are not getting closed properly after exiting the loop, but I'll show you how to deal with that below.
The default requests User-Agent is (on my machine):
python-requests/2.4.1 CPython/3.4.1 Windows/8
so it's not too inconceivable to imagine the server(s) you're requesting from are screening for various UAs like this and limiting their number of connections. The reason you were also able to get the code to work with urllib.urlretrieve is that its UA is different from requests', so the server allowed it to continue for approximately the same number of requests, then shut it down, too.
To get around these issues, I suggest altering your download_from_url() function to something like this:
import requests
from time import sleep

def download_from_url(url, download_path, delay=5):
    headers = {'Accept-Encoding': 'identity, deflate, compress, gzip',
               'Accept': '*/*',
               'Connection': 'keep-alive',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:28.0) Gecko/20100101 Firefox/28.0'}
    with open(download_path, 'wb') as handle:
        response = requests.get(url, headers=headers)  # no stream=True, that could be an issue
        handle.write(response.content)
        response.close()
    sleep(delay)
Instead of using stream=True, we use the default value of False to immediately download the full content of the request. The headers dict contains a few default values, as well as the all-important 'User-Agent' value, which in this example happens to be my UA, determined by using What'sMyUserAgent. Feel free to change this to the one returned by your preferred browser.
Instead of messing around with iterating through the content in 1KB blocks, here I just write the entire content to disk at once, eliminating extraneous code and some potential sources of errors - for example, if there was a hiccup in your network connectivity you could temporarily get empty blocks and break out in error. I also explicitly close the request, just in case.
Finally, I added an extra parameter to your function, delay, to make the function sleep for a certain number of seconds before returning. I gave it a default value of 5; you can make it whatever you want (it also accepts floats for fractional seconds).
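For example, a single call might look like this (the local path is just a placeholder; the URL is the one from your traceback):
download_from_url('http://farm1.static.flickr.com/32/47394454_10e6d7fd6d.jpg',
                  'photos/47394454_10e6d7fd6d.jpg', delay=2)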
I don't happen to have a large list of image URLs lying around to test this, but it should work as expected. Good luck!
Perhaps the lack of pooling might cause too many connections. Try something like this (using a session):
import requests

session = requests.Session()

def download_from_url(url, download_path):
    with open(download_path, 'wb') as handle:
        response = session.get(url, stream=True)
        for block in response.iter_content(1024):
            if not block:
                break
            handle.write(block)

def download_photos_from_urls(urls, concept):
    ensure_path_exists(concept)
    bad_results = list()
    for i, url in enumerate(urls):
        print i, url,
        download_path = concept+'/'+url.split('/')[-1]
        try:
            download_from_url(url, download_path)
            print
        except IOError as e:
            print str(e)
    return bad_results
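Building on the session idea (an addition, not part of the original answer), you can also mount an HTTPAdapter with automatic retries and pass a timeout, which helps when individual connections hang or drop:
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()
# retry each failed connection a few times before giving up
adapter = HTTPAdapter(max_retries=5)
session.mount('http://', adapter)
session.mount('https://', adapter)

response = session.get('http://farm1.static.flickr.com/32/47394454_10e6d7fd6d.jpg',
                       stream=True, timeout=30)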

How can I read exactly one response chunk with python's http.client?

Using http.client in Python 3.3+ (or any other builtin python HTTP client library), how can I read a chunked HTTP response exactly one HTTP chunk at a time?
I'm extending an existing test fixture (written in python using http.client) for a server which writes its response using HTTP's chunked transfer encoding. For the sake of simplicity, let's say that I'd like to be able to print a message whenever an HTTP chunk is received by the client.
My code follows a fairly standard pattern for reading a large response:
conn = http.client.HTTPConnection(...)
conn.request(...)
response = conn.getresponse()

resbody = []
while True:
    chunk = response.read(1024)
    if len(chunk):
        resbody.append(chunk)
    else:
        break

conn.close()
But this reads 1024 byte chunks regardless of whether or not the server is sending 10 byte chunks or 10MiB chunks.
What I'm looking for would be something like the following:
while True:
    chunk = response.readchunk()
    if len(chunk):
        resbody.append(chunk)
    else:
        break
If this is not possible with http.client, is it possible with another builtin Python HTTP client library? If it's not possible with a builtin client library, is it possible with a pip-installable module?
I found it easier to use the requests library, like so:
r = requests.post(url, data=foo, headers=bar, stream=True)
for chunk in r.raw.read_chunked():
    print(chunk)
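requests delegates to urllib3 under the hood, so (as an aside, not part of the original answer) roughly the same thing can be done with urllib3 directly, assuming the response really does use chunked transfer encoding:
import urllib3

http = urllib3.PoolManager()
r = http.request('GET', 'http://localhost/', preload_content=False)
for chunk in r.read_chunked():
    # each iteration yields one chunk exactly as the server framed it
    print(chunk)
r.release_conn()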
Update:
The benefit of chunked transfer encoding is to allow the transmission of dynamically generated content. Whether an HTTP library lets you read individual chunks or not is a separate issue (see RFC 2616 - Section 3.6.1).
I can see how what you are trying to do would be useful, but the standard Python HTTP client libraries don't do what you want without some hackery (see http.client and httplib).
What you are trying to do may be fine for use in your test fixture, but in the wild there are no guarantees. It is possible for the chunking of the data read by your client to be different from the chunking of the data sent by your server; e.g. the data could have been "re-chunked" by a proxy server before it arrived (see RFC 2616 - Section 3.2 - Framing Techniques).
The trick is to tell the response object that it isn't chunked (resp.chunked = False) so that it returns the raw bytes. This allows you to parse the size and data of each chunk as it is returned.
import http.client

conn = http.client.HTTPConnection("localhost")
conn.request('GET', "/")
resp = conn.getresponse()
resp.chunked = False

def get_chunk_size():
    size_str = resp.read(2)
    while size_str[-2:] != b"\r\n":
        size_str += resp.read(1)
    return int(size_str[:-2], 16)

def get_chunk_data(chunk_size):
    data = resp.read(chunk_size)
    resp.read(2)
    return data

respbody = ""
while True:
    chunk_size = get_chunk_size()
    if (chunk_size == 0):
        break
    else:
        chunk_data = get_chunk_data(chunk_size)
        print("Chunk Received: " + chunk_data.decode())
        respbody += chunk_data.decode()

conn.close()
print(respbody)
