Read lines of file over HTTP on demand - python

What I need to do is read a file over HTTP in chunks (iterate over lines, to be specific). I don't want to read the entire file (or a large part of it) and then split it into lines, but rather read a small (<=8kB) chunk and split that into lines. When all the lines in a chunk are consumed, fetch the next chunk.
I have tried the following:
with urllib.request.urlopen(url) as f:
    yield from f
Which didn't work. In Wireshark I see that about 140kB of the total ~220kB are received just by calling urlopen(url).
The next thing I tried was to use requests:
with requests.get(url, stream=True) as req:
    yield from req.iter_lines()
Which also reads about 140kB just by calling get(url, stream=True). According to the documentation this should not happen. Other than that, I did not find any information about this behavior or how to control it. I'm using Requests 2.21.0, CPython 3.7.3, on Windows 10.

According to the docs and docs 2 (and given that the source actually works in chunks), I think you should use iter_content, which accepts a chunk_size parameter that you need to set to None:
with requests.get(url, stream=True) as req:
    yield from req.iter_content(chunk_size=None)
I haven't tried it, but it seems that somewhere in your code something accesses req.content before iter_lines, therefore loading the entire payload.
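For illustration, a minimal sketch (not the asker's actual code) of a line generator that streams the response and never touches req.content:
import requests

def stream_lines(url):
    # stream=True defers the body; touching req.content (or req.text,
    # req.json()) anywhere before iterating would pull the whole payload
    # into memory first
    with requests.get(url, stream=True) as req:
        yield from req.iter_lines()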

Related

How to print "dots" (or other kind of feedback) while writing a file in python?

I am trying to print visible feedback for the user in the terminal while my application downloads a file from the web and writes it to the hard drive, but I could not find out how to do this by reading the documentation or googling it.
This is my code:
res = requests.get(url_to_file)
with open("./downloads/%s" % (file_name), 'wb') as f:
    f.write(res.content)
I was expecting to figure out how to make something like this:
Downloading file ........
# it keeps going until the download is finished and the file is written
Done!
I am really struggling even to get started, because none of the methods returns a "promise" (like in JS).
Any help would be much appreciated!
Thanks!
requests.get by default downloads the entirety of the requested resource before it gets back to you. However, it has an optional argument stream, which allows you to invoke .iter_content or .iter_lines on the Response object. This allows you to take action every N bytes (or as each chunk of data arrives), or at every line, respectively. Something like this:
chunks = []
chunk_size = 16384  # 16 KB chunks
# alternatively
# chunk_size = None  # whenever a chunk arrives

res = requests.get(url_to_file, stream=True)
for chunk in res.iter_content(chunk_size):
    chunks.append(chunk)
    print(".", end="")
data = b''.join(chunks)
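The same kind of feedback can be given per line rather than per chunk; a minimal sketch, using the same hypothetical url_to_file:
import requests

res = requests.get(url_to_file, stream=True)  # url_to_file as in the question
for line in res.iter_lines():
    print(".", end="", flush=True)  # one dot per line received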
This still blocks though, so nothing else will be happening. If you want more of the JavaScript style, per Grismar's comment, you should run under Python's async loop. In that case, I suggest using aiohttp rather than requests, as it is created with async style in mind.
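As a rough, untested sketch of that aiohttp approach (url_to_file is the same hypothetical URL as above):
import asyncio
import aiohttp

async def download_with_dots(url):
    chunks = []
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            # iterate over the body as it arrives, in 16 KB chunks
            async for chunk in resp.content.iter_chunked(16384):
                chunks.append(chunk)
                print(".", end="", flush=True)
    return b"".join(chunks)

# data = asyncio.run(download_with_dots(url_to_file))
While the download awaits network I/O, the event loop is free to run other coroutines, which is the JavaScript-like behaviour the question is after.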
Here's a version that will download the file into a bytearray in a separate thread.
As mentioned in other answers and comments, there are other alternatives that were developed with async operations in mind, so don't read too much into the decision to go with threading; it's just to demonstrate the concept (and it's convenient, since it comes with Python).
In the code below, if the size of the file is known, each . will correspond to 1%. As a bonus, the downloaded and total number of bytes will be printed at the start of the line, like (1234 B / 1234567 B). If the size is not known, the fallback is to have each . represent a chunk.
import requests
import threading


def download_file(url: str):
    headers = {"<some_key>": "<some_value>"}
    data = bytearray()
    with requests.get(url, headers=headers, stream=True) as request:
        if file_size := request.headers.get("Content-Length"):
            file_size = int(file_size)
        else:
            file_size = None

        received = 0
        for chunk in request.iter_content(chunk_size=2**15):
            received += len(chunk)
            data += chunk
            try:
                num_dots = int(received * 100 / file_size)
                print(
                    f"({received} B/{file_size} B) " + "." * num_dots,
                    end="\r",
                )
            except TypeError:
                print(".", end="")
    print("\nDone!")


url = "<some_url>"
thread = threading.Thread(target=download_file, args=(url,))
thread.start()
# Do something in the meantime
thread.join()
Do keep in mind that I've left out the lock that protects against simultaneous access to stdout, to reduce the noise. I've also left out writing the bytearray to a file at the end (or writing the chunks to file as they are received, if the file is large); you may want to use a lock for that as well if you read and/or write the same file in any other part of your script.
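If you do need it, a minimal sketch of such a lock (hypothetical helper, not part of the original answer):
import threading

print_lock = threading.Lock()

def safe_print(*args, **kwargs):
    # serialize writes to stdout across threads
    with print_lock:
        print(*args, **kwargs)
The download thread and the rest of the script would then call safe_print instead of print; the same pattern applies to a lock guarding a shared file.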

Python - read huge online csv through proxy

I have a huuuuuge csv online and I want to read it line by line without downloading it. But this file is behind a proxy.
I wrote this code:
import io

import pandas as pd
import requests
from requests_ntlm import HttpNtlmAuth  # assuming HttpNtlmAuth comes from the requests_ntlm package

cafile = 'mycert.crt'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
auth = HttpNtlmAuth('Username', 'Password')
url = 'http://myurl/ressources.csv'

content = requests.get(url, proxies=proxies, auth=auth, verify=cafile).content
csv_read = pd.read_csv(io.StringIO(content.decode('utf-8')))

pattern = 'mypattern'
for row in csv_read:
    if row[0] == pattern:
        print(row)
        break
The code above works, but the line content = requests.get(...) takes soooo much time because of the size of the csv file.
So my question is: is it possible to read an online csv line by line through a proxy?
Ideally, I would like to read the first row, check if it equals my pattern, if yes break, if not read the second line, and so on.
Thanks for your help.
You can pass stream=True to requests.get to avoid fetching the entire result immediately. In that case you can access a pseudo-file object through response.raw and build your CSV reader on top of that (alternatively, the response object has iter_content and iter_lines methods, but I don't know how easy it is to feed those to a CSV parser).
However, while the stdlib's csv module simply yields a sequence of lists or dicts and can therefore easily be lazy, pandas returns a dataframe, which is not lazy, so you need to pass some special parameters and then it gives you a dataframe per chunk, it looks like.
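A minimal sketch of the first suggestion, assuming the same url, proxies, auth, cert, and pattern as in the question, and a ';'-delimited file as the follow-up code suggests:
import csv
import io

import requests

with requests.get(url, stream=True, proxies=proxies, auth=auth, verify=cafile) as r:
    r.raw.decode_content = True  # let urllib3 undo gzip/deflate transparently
    text_stream = io.TextIOWrapper(r.raw, encoding='utf-8', newline='')
    for row in csv.reader(text_stream, delimiter=';'):
        if row and row[0] == pattern:
            print(row)
            break
For the pandas route, the "special parameters" boil down to chunksize, e.g. pd.read_csv(text_stream, chunksize=10000), which yields one dataframe per chunk instead of one big dataframe.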
The requests.get call will get you the whole file anyway. You'd need to implement your own HTTP code, down to the socket level, to be able to process the content as it comes in with a plain HTTP GET.
The only way of getting partial results and slicing the download is to add HTTP "Range" request headers, if the server providing the file supports them. (requests lets you set these headers.)
Enter requests' advanced usage:
The good news is that requests can do that for you under the hood: you can set the stream=True parameter when calling requests, and it will even let you iterate over the contents line by line. Check the documentation on that part.
Here is more or less what requests does under the hood so that you can get your contents line by line:
It will get reasonably sized chunks of your data, but certainly not request one line at a time (think ~80 bytes versus 100,000 bytes), because otherwise it'd need a new HTTP request for each line, and the overhead of each request is not trivial, even when made over the same TCP connection.
Anyway, since CSV is a text format, neither requests nor any other software could know the size of the lines, let alone the exact size of the "next" line to be read, before setting the range headers accordingly.
So, for this to work, there has to be Python code to (a rough sketch follows below):
- accept a request for a "new line" of the CSV; if there are buffered text lines, yield the next line
- otherwise make an HTTP request for the next 100KB or so
- concatenate the downloaded data to the remainder of the last downloaded line
- split the downloaded data at the last line-feed in the binary data
- save the remainder of the last line
- convert your binary buffer to text (you'd have to take care of multi-byte character boundaries in a multi-byte encoding like utf-8, but cutting at newlines may save you that)
- yield the next text line
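A rough, untested sketch of that approach (hypothetical function name; it assumes the server actually honors Range headers rather than replying with the full body):
import requests

def iter_lines_via_ranges(url, block_size=100_000, encoding='utf-8'):
    pos = 0
    remainder = b''
    while True:
        headers = {'Range': f'bytes={pos}-{pos + block_size - 1}'}
        resp = requests.get(url, headers=headers)
        if resp.status_code == 416 or not resp.content:
            break  # requested range starts past the end of the file
        pos += len(resp.content)
        buffer = remainder + resp.content
        # split at the last line feed; keep the trailing partial line around
        head, sep, remainder = buffer.rpartition(b'\n')
        if sep:
            for raw_line in head.split(b'\n'):
                # decoding only complete lines sidesteps multi-byte boundary issues
                yield raw_line.decode(encoding)
    if remainder:
        yield remainder.decode(encoding)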
According to Masklinn's answer, my code now looks like this:
import requests
from requests_ntlm import HttpNtlmAuth  # assuming HttpNtlmAuth comes from the requests_ntlm package

cafile = 'mycert.crt'
proxies = {"http": "http://ipproxy:port", "https": "http://ipproxy:port"}
auth = HttpNtlmAuth('Username', 'Password')
url = 'http://myurl/ressources.csv'
pattern = 'mypattern'

r = requests.get(url, stream=True, proxies=proxies, auth=auth, verify=cafile)
if r.encoding is None:
    r.encoding = 'ISO-8859-1'
for line in r.iter_lines(decode_unicode=True):
    if line.split(';')[0] == pattern:
        print(line)
        break

Checking my assumption of how generators work in python 3

I'm writing a blog post about generators in the context of screen scraping, or making lots of requests to an API based on the contents of a large-ish text file, and after reading this nifty comic by Julia Evans, I want to check something.
Assume I'm on linux or OS X.
Let's say I'm making a screen scraper with scrapy (knowing scrapy isn't important for this question, but it might be useful context).
If I have an open file like so, I want to be able to return a scrapy.Request for every line I pull out of a largeish csv file.
with open('top-50.csv') as csvfile:
    urls = gen_urls(csvfile)
    for url in urls:
        yield scrapy.Request(url=url, callback=self.parse)
gen_urls is a function that looks like this.
def gen_urls(file_object):
    while True:
        # Read a line from the file, by seeking til you hit something like '\n'
        line = file_object.readline()
        # Drop out if there are no lines left to iterate through
        if not line:
            break
        # turn '1,google.com\n' to just 'google.com'
        domain = line.split(',')[1]
        trimmed_domain = domain.rstrip()
        yield "http://domain/api/{}".format(trimmed_domain)
This works, but I want to understand what's happening under the hood.
When I pass the csvfile to the gen_urls() like so:
urls = gen_urls(csvfile)
In gen_urls my understanding is that it works by pulling out a line at a time in the while loop with file_object.readline(), then yielding with yield "http://domain/api/{}".format(trimmed_domain).
Under the hood, I think file_object is a reference to some file descriptor, and readline() is essentially seeking forwards through the file until it finds the next newline \n character, and the yield basically pauses this function until the next call to __next__() or the builtin next(), at which point it resumes the loop. This next is called implicitly in the loop in the snippet that looks like:
for url in urls:
    yield scrapy.Request(url=url, callback=self.parse)
Because we're only pulling a line at a time from the file descriptor and then 'pausing' the function with yield, we don't end up with loads of stuff in memory. Because scrapy uses an evented model, you can make a bunch of scrapy.Request objects without them all immediately sending off a bajillion HTTP requests and saturating your network. This way, scrapy is also able to do useful things like throttle how quickly they're sent, how many are sent concurrently, and so on.
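That pause-and-resume behaviour can be demonstrated with a toy generator (hypothetical lines standing in for the csv file):
def gen_domains(lines):
    for line in lines:
        print("reading:", line.rstrip())
        yield line.split(',')[1].rstrip()

gen = gen_domains(["1,google.com\n", "2,youtube.com\n"])
print(next(gen))  # runs until the first yield: prints "reading: 1,google.com", then "google.com"
print(next(gen))  # resumes right after the yield, reads the second line, yields "youtube.com"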
This about right?
I'm mainly looking for a mental model that helps me think about using generators in python and explain them to other people, rather than all the gory details, as I've been using them for ages, without thinking through what's happening, and I figured asking here might shed some light.

Python - split files

I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to use stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?
----Adding current code----
if not os.path.exists(output_path):
    os.makedirs(output_path)

memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)

outFile = open('output/tempfile', 'wb')
for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        outFile.write(chunk)

f = open('output/tempfile', 'rb').read().split('\r\n\r\n')
arf = open('output/recording.arf', 'wb')
arf.write(f[3])
os.remove('output/tempfile')
Okay, I was bored and wanted to figure out the best way to do this. It turns out that my initial way in the comments above was overly complicated (unless you're considering some scenario where time is absolutely critical, or memory is severely constrained). A buffer is a much simpler way to achieve this, so long as you take two or more blocks at a time. This code emulates the question's scenario for demonstration.
Note: depending on the regex engine implementation, this is more efficient and requires significantly fewer str/byte conversions, since using a regex requires casting each block of bytes to a string. The approach below requires no string conversions, instead operating solely on the bytes returned from requests.post(), and in turn writing those same bytes to file, without conversions.
from pprint import pprint

someString = '''I currently have a script that requests a file via a requests.post(). The server sends me two files in the same stream. The way I am processing this right now is to save it all as one file, open it again, split the file based on a regex string, save it as a new file, and delete the old one. The file is large enough that I have to stream=True in my requests.post() statement and write it in chunks.
I was hoping that maybe someone knows a better way to issue the post or work with the data coming back so that the files are stored correctly the first time? Or is this the best way to do it?'''

n = 16

# emulate a stream by creating 37 blocks of 16 bytes
byteBlocks = [bytearray(someString[i:i+n], 'utf-8') for i in range(0, len(someString), n)]
pprint(byteBlocks)

# this string is present twice, but both times it is split across two bytearrays
matchBytes = bytearray('requests.post()', 'utf-8')

# our buffer
buff = bytearray()

count = 0
for bb in byteBlocks:
    buff += bb
    count += 1

    # every two blocks
    if (count % 2) == 0:
        if count == 2:
            start = 0
        else:
            start = len(matchBytes)

        # check the bytes starting from block (n - 2 - len(matchBytes)) to (len(buff) - len(matchBytes))
        # this will check all the bytes only once...
        if matchBytes in buff[((count - 2) * n) - start : len(buff) - len(matchBytes)]:
            print('Match starting at index:', buff.index(matchBytes), 'ending at:', buff.index(matchBytes) + len(matchBytes))
Update:
So, given the updated question, this code may remove the need to create a temporary file. I haven't been able to test it exactly, as I don't have a similar response, but you should be able to figure out any bugs yourself.
Since you aren't actually working with a stream directly, i.e. you're given the finished response object from requests.post(), you don't have to worry about using chunks in the networking sense. The "chunks" that requests refers to are really its way of dishing out the bytes, all of which it already has. You can access the bytes directly using r.raw.read(n), but as far as I can tell the response object doesn't allow you to see how many bytes there are in r.raw, so you're more or less forced to use the iter_content method.
Anyway, this code should copy all the bytes from the response object into one bytes object, which you can then search and split as before.
memFile = requests.post(url, data=etree.tostring(etXML), headers=headers, stream=True)

match = b'\r\n\r\n'
data = b''

for chunk in memFile.iter_content(chunk_size=512):
    if chunk:
        data += chunk

f = data.split(match)

arf = open('output/recording.arf', 'wb')
arf.write(f[3])
arf.close()

Too many open files using requests module

I am using the requests module to POST several files to a server, and this works fine most of the time. However, when many files are uploaded (> 256) I get an IOError: [Errno 24] Too many open files. The problem happens because I build a dictionary with many files, which are opened as shown in the code below. Since I do not have a handle to close these open files, we see this error. This leads to the following questions:
Is there a way to close these files in chunks?
Does requests module close the open files automatically?
url = 'http://httpbin.org/post'
#dict with several files > 256
files = {'file1': open('report.xls', 'rb'), 'file2': open('report2.xls', 'rb')}
r = requests.post(url, files=files)
r.text
The workaround I am using right now is files.clear() after uploading < 256 files at a time. I am unsure whether the files get closed doing so, but the error goes away.
Please provide insight on how to handle this situation. Thanks
The simplest solution here is to read the files into memory yourself, then pass them to requests. Note that, as the docs say, "If you want, you can send strings to be received as files". So, do that.
In other words, instead of building a dict like this:
files = {tag: open(pathname, 'rb') for (tag, pathname) in stuff_to_send}
… build it like this:
def read_file(pathname):
    with open(pathname, 'rb') as f:
        return f.read()

files = {tag: read_file(pathname) for (tag, pathname) in stuff_to_send}
Now you've only got one file open at a time, guaranteed.
This may seem wasteful, but it really isn't—requests is just going to read in all of the data from all of your files if you don't.*
But meanwhile, let me answer your actual questions instead of just telling you what to do instead.
Since I do not have a handle to close these open files we see this error.
Sure you do. You have a dict, whose values are these open files.
In fact, if you didn't have a handle to them, this problem would probably occur much less often, because the garbage collector would (usually, but not necessarily robustly/reliably enough to count on) take care of things for you. The fact that it's never doing so implies that you must have a handle to them.
Is there a way to close these files in chunks?
Sure. I don't know how you're doing the chunks, but presumably each chunk is a list of keys or something, and you're passing files = {key: files[key] for key in chunk}, right?
So, after the request, do this:
for key in chunk:
    files[key].close()
Or, if you're building a dict for each chunk like this:
files = {tag: open(filename, 'rb') for (tag, filename) in chunk}
… just do this:
for file in files.values():
    file.close()
Does requests module close the open files automatically?
No. You have to do it manually.
In many use cases, you get away with never doing so because the files variable goes away soon after the request, and once nobody has a reference to the dict, it gets cleaned up soon (immediately, with CPython and if there are no cycles; just "soon" if either of those is not true), meaning all the files get cleaned up soon, at which point the destructor closes them for you. But you shouldn't rely on that. Always close your files explicitly.
And the reason the files.clear() seems to work is that it's doing the same thing as letting files go away: it's forcing the dict to forget all the files, which removes the last reference to each of them, meaning they will get cleaned up soon, etc.
* What if you don't have enough page space to hold them all in memory? Then you can't send them all at once anyway. You'll have to make separate requests, or use the streaming API—which I believe means you have to do the multiparting manually as well. But if you have enough page space, just not enough real RAM, so trying to read them all sends you into swap thrashing hell, you might be able to get around it by concatenating them all on-disk, opening the giant file, mmapping segments of it, and sending those as the strings…
Don't forget about the power of python duck typing!
Just implement a wrapper class for your files:
class LazyFile(object):
    def __init__(self, filename, mode):
        self.filename = filename
        self.mode = mode

    def read(self):
        with open(self.filename, self.mode) as f:
            return f.read()

url = 'http://httpbin.org/post'
# dict with a billion files
files = {'file1': LazyFile('report.xls', 'rb'), 'file2': LazyFile('report2.xls', 'rb')}
r = requests.post(url, files=files)
r.text
In this way, each file is opened, read, and closed one at a time as requests iterates over the dict.
Note that while this answer and abarnert's answer basically do the same thing right now, requests may, in the future, not build the request entirely in memory and then send it, but instead send each file in a stream, keeping memory usage low. At that point this code would be more memory efficient.
