Too many open files using requests module - python

I am using the requests module to POST several files to a server, and this works fine most of the time. However, when many files are uploaded (more than 256), I get IOError: [Errno 24] Too many open files. The problem happens because I build a dictionary with many files, which are opened as shown in the code below. Since I do not have a handle to close these open files, we see this error. This leads to the following questions:
Is there a way to close these files in chunks?
Does requests module close the open files automatically?
url = 'http://httpbin.org/post'
#dict with several files > 256
files = {'file1': open('report.xls', 'rb'), 'file2': open('report2.xls', 'rb')}
r = requests.post(url, files=files)
r.text
The workaround I am using right now is to call files.clear() after uploading fewer than 256 files at a time. I am unsure whether the files get closed when doing so, but the error goes away.
Please provide insight on how to handle this situation. Thanks

The simplest solution here is to read the files into memory yourself, then pass them to requests. Note that, as the docs say, "If you want, you can send strings to be received as files". So, do that.
In other words, instead of building a dict like this:
files = {tag: open(pathname, 'rb') for (tag, pathname) in stuff_to_send}
… build it like this:
def read_file(pathname):
    with open(pathname, 'rb') as f:
        return f.read()
files = {tag: read_file(pathname) for (tag, pathname) in stuff_to_send}
Now you've only got one file open at a time, guaranteed.
This may seem wasteful, but it really isn't—requests is just going to read in all of the data from all of your files if you don't.*
But meanwhile, let me answer your actual questions rather than just telling you what to do instead.
Since I do not have a handle to close these open files we see this error.
Sure you do. You have a dict, whose values are these open files.
In fact, if you didn't have a handle to them, this problem would probably occur much less often, because the garbage collector would (usually, but not necessarily robustly/reliably enough to count on) take care of things for you. The fact that it's never doing so implies that you must have a handle to them.
Is there a way to close these files in chunks?
Sure. I don't know how you're doing the chunks, but presumably each chunk is a list of keys or something, and you're passing files = {key: files[key] for key in chunk}, right?
So, after the request, do this:
for key in chunk:
    files[key].close()
Or, if you're building a dict for each chunk like this:
files = {tag: open(filename, 'rb') for (tag, filename) in chunk}
… just do this:
for file in files.values():
    file.close()
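Putting those pieces together, here is a minimal sketch of the chunked approach, assuming stuff_to_send is the list of (tag, pathname) pairs from above; the chunk size is an illustrative choice, not a hard rule:
import requests

url = 'http://httpbin.org/post'
CHUNK_SIZE = 200  # stay comfortably under the ~256 open-file limit

items = list(stuff_to_send)
for i in range(0, len(items), CHUNK_SIZE):
    chunk = items[i:i + CHUNK_SIZE]
    files = {tag: open(pathname, 'rb') for (tag, pathname) in chunk}
    try:
        r = requests.post(url, files=files)
    finally:
        # Close every handle from this chunk, even if the POST fails.
        for f in files.values():
            f.close()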
Does requests module close the open files automatically?
No. You have to do it manually.
In many use cases, you get away with never doing so because the files variable goes away soon after the request, and once nobody has a reference to the dict, it gets cleaned up soon (immediately, with CPython and if there are no cycles; just "soon" if either of those is not true), meaning all the files get cleaned up soon, at which point the destructor closes them for you. But you shouldn't rely on that. Always close your files explicitly.
And the reason the files.clear() seems to work is that it's doing the same thing as letting files go away: it's forcing the dict to forget all the files, which removes the last reference to each of them, meaning they will get cleaned up soon, etc.
* What if you don't have enough page space to hold them all in memory? Then you can't send them all at once anyway. You'll have to make separate requests, or use the streaming API—which I believe means you have to do the multiparting manually as well. But if you have enough page space, just not enough real RAM, so trying to read them all sends you into swap thrashing hell, you might be able to get around it by concatenating them all on-disk, opening the giant file, mmapping segments of it, and sending those as the strings…

Don't forget about the power of Python duck typing!
Just implement a wrapper class for your files:
class LazyFile(object):
    def __init__(self, filename, mode):
        self.filename = filename
        self.mode = mode

    def read(self):
        with open(self.filename, self.mode) as f:
            return f.read()
url = 'http://httpbin.org/post'
#dict with a billion files
files = {'file1': LazyFile('report.xls', 'rb'), 'file2': LazyFile('report2.xls', 'rb')}
r = requests.post(url, files=files)
r.text
In this way, each file is opened, read, and closed one at a time as requests iterates over the dict.
Note that while this answer and abarnert's answer basically do the same thing right now, requests may, in the future, not build the request entirely in memory and then send it, but instead send each file in a stream, keeping memory usage low. At that point this code would be more memory efficient.

Related

How to avoid file corruption in Python?

I have a pretty basic question; perhaps I didn't know the right keywords, as I couldn't find a previous answer. I use Python scripts to control and gather information for a smarthome environment. I mostly use text files to store and update information within and between the scripts. However, I frequently run into one issue whenever the server crashes or loses power: the file contents tend to get corrupted or vanish while the crash happens.
To write file content, I usually use a structure like this:
try:
    with open(savefile, "r") as file:
        lines = file.readlines()
except:
    lines = []
lines.append(str(time.time()) + ";" + str(value) + "\n")
if len(lines) > MAX_READINGS:
    lines = lines[-MAX_READINGS:]
with open(savefile, "w") as file:
    file.writelines(lines)
In case of partial corruption, such as blank lines between the data points, I often use a line-by-line loop that only accepts lines with the correct structure (such as a timestamp at the beginning, as in the example). However, sometimes a file gets corrupted to the point that it only contains spaces or is empty, making it useless for the scripts depending on the data.
The filesystem's integrity remains intact in crashes, so it's probably not a lower level problem. But what's the suggested workaround to minimize the corruption risk?
Should I use the "a" mode to append a new line and have another way to deal with the file lengths (the MAX_READINGS), or should I make a temporary copy which I'd then use to overwrite the original after the writing is done? Or might there be an external library providing the right functionality?
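For reference, a minimal sketch of the temporary-copy idea mentioned above (the helper name and the fsync call are illustrative): write the new contents to a temp file in the same directory, then atomically swap it into place, so a crash leaves either the old file or the new one, never a half-written mix.
import os
import tempfile

def atomic_write_lines(path, lines):
    # Create the temp file next to the target so the final rename
    # stays on the same filesystem (required for atomicity).
    dirname = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(lines)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)  # atomic swap (Python 3.3+)
    except BaseException:
        os.unlink(tmp_path)
        raise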

Approaches to loading JSON from file to dict using Python?

Using Python I'm loading JSON from a text file and converting it into a dictionary. I thought of two approaches and wanted to know which would be better.
Originally I open the text file, load the JSON, and then close the text file.
import json
# Open file. Load as JSON.
data_file = open(file="fighter_data.txt", mode="r")
fighter_match_data = json.load(data_file)
data_file.close()
Could I do the following instead?
import json
# Open file. Load as JSON.
fighter_match_data = json.load(open(file="fighter_data.txt", mode="r"))
Would I still need to close the file? If so, how? If not, does Python close the file automatically?
Personally, I wouldn't do either. Best practice for opening files is generally to use with.
with open(file="fighter_data.txt", mode="r") as data_file:
    fighter_match_data = json.load(data_file)
That way it automatically closes when you're out of the with statement. It's shorter than the first, and if it throws an error (say, there's an error parsing the json), it'll still close it.
Regarding your actual question about needing to close the file in your one-liner:
From what I understand about file handling and garbage collection, if you're using CPython, since the file isn't referenced anymore it "should" be closed straight away by the garbage collector. However, relying on garbage collection to do your work for you is never the nicest way of writing code. (See the answers to open read and close a file in 1 line of code for information as to why).
Your code as written is valid:
fighter_match_data = json.load(open(file="fighter_data.txt", mode="r"))
Consider this part:
open(file="fighter_data.txt", mode="r")  # 1
versus:
data_file = open(file="fighter_data.txt", mode="r")  # 2
In case #2, if you do not explicitly close the file, it will automatically be closed once no reference to it remains, for example when the variable goes out of scope at the end of the function.
In case #1, since you never bind the file object to a name, the last reference to it disappears on that very line, as soon as json.load returns, and CPython then closes the file automatically.

File Reading Options Enquiry (Python)

I am a programming student for the semester. In class we have been learning about file opening, reading and writing.
We have used a_reader to achieve such tasks for file opening. I have been reading our associated text/s and I have noticed that there is a CSV reader option which I have been using.
I wanted to know if there were anymore possible ways to open/read a file as I am trying to grow my knowledge base in python and its associated contents.
EDIT:
I was referring to CSV more specifically, as that is the type of file we use at the moment. We have learnt about the CSV reader and a_reader, and an example from one of our lectures is shown below.
def main():
    a_reader = open('IDCJAC0016_009225_1800_Data.csv', 'rU')
    file_data = a_reader.read()
    a_reader.close()
    print file_data

main()
It may seem overly broad, but I have no knowledge in this area, which is why I am asking: is there more than just the two ways above? If there is, can someone who knows provide the types so I can read up on and research them?
If you're asking about places to store things, the first interfaces you'll meet are files and sockets (pretend a network connection is like a file, see http://docs.python.org/2/library/socket.html).
If you mean file formats (like csv), there are many! Probably you can think of many yourself, but besides csv there are html files, pictures (png, jpg, gif), archive formats (tar, zip), text files (.txt!), python files (.py). The list goes on.
There are many different ways to read files.
Just plain open will take a filename and open it as a sequence of lines. Or, you can just call read() on it, and it will read the whole file at once into one giant string.
codecs.open will take a filename and a character set, and decode each line to Unicode automatically. Or, again, you can just call read() on it, and it will read and decode the whole file at once into one giant Unicode string.
csv.reader will take a file or file-like object, and read it as a sequence of CSV rows. There's no direct equivalent of read()—but you can turn any sequence into a list by just calling list on it, so list(my_reader) will give you a list of rows (each of which is, itself, a list).
zipfile.ZipFile will take a filename, or a file or file-like object, and read it as a ZIP archive. This doesn't go line by line, of course, but you can go archived file by archived file. Or you can do fancier things, like search for archived files by name.
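To make those concrete, here is a minimal sketch exercising a few of the readers above (the file names are invented for illustration, and the style is Python 2 to match the lecture example):
import csv
import zipfile

# Plain open: iterate line by line, or call read() for one giant string.
with open('notes.txt') as f:
    first_line = f.readline()

# csv.reader: each row comes back as a list of strings.
with open('weather.csv', 'rb') as f:
    rows = list(csv.reader(f))

# zipfile.ZipFile: go archived file by archived file, by name.
with zipfile.ZipFile('archive.zip') as zf:
    names = zf.namelist()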
There are modules for reading JSON and XML documents, different ways of handling binary files, and so on. Some of them work differently—for example, you can search an XML document as a tree with one module, or go element by element with a different one.
Python has a pretty extensive standard library, and you can find the documentation online. Every module that seems like it should be able to work on files, probably can.
And, beyond what comes in the standard library, PyPI, the Python Package Index has thousands of additional modules. Looking for a way to read YAML documents? Search PyPI for yaml and you'll find it.
Finally, Python makes it very easy to add things like this on your own. The skeleton of a function like csv.reader is as simple as this:
def reader(fileobj):
    for line in fileobj:
        yield parse_one_csv_line(line)
You can replace that parse_one_csv_line with anything you want, and you've got a custom reader. For example, here's an uppercase_reader:
def uppercase_reader(fileobj):
    for line in fileobj:
        yield line.upper()
In fact, you can even write the whole thing in one line:
shouts = (line.upper() for line in fileobj)
And the best thing is that, as long as your reader only yields one line at a time, your reader is itself a file-like object, so you can pass uppercase_reader(fileobj) to csv.reader and it works just fine.
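As a quick illustration of that last point, a minimal sketch (the weather.csv filename is made up) that feeds the uppercased lines straight into csv.reader:
import csv

with open('weather.csv', 'rb') as f:
    # uppercase_reader yields one line at a time, so csv.reader can
    # consume it just like a real file object.
    for row in csv.reader(uppercase_reader(f)):
        print row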

How to store many file handles so that we can close it at a later stage

I have written a small program in Python where I need to open many files and close them at a later stage. I have stored all the file handles in a list so that I can refer to them later for closing.
In my program I am storing all the file handles (fout) in the list foutList[]
for cnt in range(count):
    fileName = "file" + `cnt` + ".txt"
    fullFileName = path + fileName
    print "opening file " + fullFileName
    try:
        fout = open(fullFileName, "r")
        foutList.append(fout)
    except IOError as e:
        print "Cannot open file: %s" % e.strerror
        break
Some people suggested to me that I should not store them in a list, but did not give me the reason why. Can anyone explain why it is not recommended to store them in a list, and what is another possible way to do the same thing?
I can't think of any reasons why this is really evil, but possible objections to doing this might include:
It's hard to guarantee that every single file handle will be closed when you're done. Using the file handle with a context manager (see the with open(filename) as file_handle: syntax) always guarantees the file handle is closed, even if something goes wrong.
Keeping lots of files open at the same time may be impolite if you're going to have them open for a long time, and another program is trying to access the files.
That said, why do you want to keep a whole bunch of files open for writing? If you're writing intermittently to a bunch of files, a better way to do this is to open the file, write to it, and then close it until you're ready to write again.
All you have to do is open the file in append mode - open(filename, 'a'). This lets you write to the end of an existing file without erasing what's already there (as the 'w' mode would).
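As a minimal sketch of that pattern (the function name is just illustrative):
def append_line(filename, line):
    # Open in append mode, write, and close right away; unlike 'w',
    # 'a' adds to the end of the file without truncating it.
    with open(filename, 'a') as f:
        f.write(line + '\n')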
Edit(1) I slightly misread your question - I thought you wanted to open these files for writing, not reading. Keeping a bunch of files open for reading isn't too bad.
If you have the files open because you want to monitor the files for changes, try using your platform's equivalent of Linux's inotify, which will tell you when a file has changed (without you having to look at it repeatedly.)
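For example, here is a minimal sketch using the third-party watchdog package (pip install watchdog), which wraps inotify on Linux and the platform equivalents elsewhere; the handler class and path are illustrative:
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class LogChanges(FileSystemEventHandler):
    def on_modified(self, event):
        # Called by the observer thread whenever a watched file changes.
        print "changed:", event.src_path

observer = Observer()
observer.schedule(LogChanges(), path='.', recursive=False)
observer.start()
try:
    while True:
        time.sleep(1)
finally:
    observer.stop()
    observer.join()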
If you don't store them at all, they will eventually be garbage collected, which will close them.
If you really want to close them manually, use weak references to hold them, which will not prevent garbage collection: http://docs.python.org/library/weakref.html

Decompressing a .bz2 file in Python

So, this is a seemingly simple question, but I'm apparently very very dull. I have a little script that downloads all the .bz2 files from a webpage, but for some reason the decompressing of that file is giving me a MAJOR headache.
I'm quite a Python newbie, so the answer is probably quite obvious, please help me.
In this bit of the script, I already have the file, and I just want to read it out to a variable, then decompress that? Is that right? I've tried all sorts of ways to do this; I usually get a "ValueError: couldn't find end of stream" error on the last line in this snippet. I've tried to open up the zipfile and write it out to a string in a zillion different ways. This is the latest.
openZip = open(zipFile, "r")
s = ''
while True:
    newLine = openZip.readline()
    if(len(newLine)==0):
        break
    s += newLine
print s
uncompressedData = bz2.decompress(s)
Hi Alex, I should've listed all the other methods I've tried, as I've tried the read() way.
METHOD A:
print 'decompressing ' + filename
fileHandle = open(zipFile)
uncompressedData = ''
while True:
    s = fileHandle.read(1024)
    if not s:
        break
    print('RAW "%s"', s)
    uncompressedData += bz2.decompress(s)
uncompressedData += bz2.flush()
newFile = open(steamTF2mapdir + filename.split(".bz2")[0], "w")
newFile.write(uncompressedData)
newFile.close()
I get the error:
uncompressedData += bz2.decompress(s)
ValueError: couldn't find end of stream
METHOD B
zipFile = steamTF2mapdir + filename
print 'decompressing ' + filename
fileHandle = open(zipFile)
s = fileHandle.read()
uncompressedData = bz2.decompress(s)
Same error :
uncompressedData = bz2.decompress(s)
ValueError: couldn't find end of stream
Thanks so much for your prompt reply. I'm really banging my head against the wall, feeling inordinately thick for not being able to decompress a simple .bz2 file.
By the by, used 7zip to decompress it manually, to make sure the file isn't wonky or anything, and it decompresses fine.
You're opening and reading the compressed file as if it were a text file made up of lines. DON'T! It's NOT.
uncompressedData = bz2.BZ2File(zipFile).read()
seems to be closer to what you're angling for.
Edit: the OP has shown a few more things he's tried (though I don't see any notes about having tried the best method -- the one-liner I recommend above!) but they seem to all have one error in common, and I repeat the key bits from above:
opening ... the compressed file as if
it was a textfile ... It's NOT.
open(filename) and even the more explicit open(filename, 'r') open, for reading, a text file -- a compressed file is a binary file, so in order to read it correctly you must open it with open(filename, 'rb'). ((my recommended bz2.BZ2File KNOWS it's dealing with a compressed file, of course, so there's no need to tell it anything more)).
In Python 2.*, on Unix-y systems (i.e. every system except Windows), you could get away with a sloppy use of open (but in Python 3.* you can't, as text is Unicode, while binary is bytes -- different types).
In Windows (and before that in DOS) it has always been indispensable to distinguish, as Windows text files are, for historical reasons, peculiar (they use two bytes rather than one to end lines and, at least in some cases, treat a byte worth '\x1A' as meaning a logical end of file), so the low-level reading and writing code must compensate.
So I suspect the OP is using Windows and is paying the price for not carefully using the 'rb' option ("read binary") to the open built-in. (though bz2.BZ2File is still simpler, whatever platform you're using!-).
openZip = open(zipFile, "r")
If you're running on Windows, you may want to say openZip = open(zipFile, "rb") here, since the file is likely to contain CR/LF combinations, and you don't want them to be translated.
newLine = openZip.readline()
As Alex pointed out, this is very wrong, as the concept of "lines" is foreign to a compressed stream.
s = fileHandle.read(1024)
[...]
uncompressedData += bz2.decompress(s)
This is wrong for the same reason. 1024-byte chunks aren't likely to mean much to the decompressor, since it's going to want to work with its own block size.
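If you really did want to read and decompress in fixed-size chunks, the supported way is bz2.BZ2Decompressor, which keeps its own state between arbitrary-sized chunks. A minimal sketch, reusing the question's names:
import bz2

decompressor = bz2.BZ2Decompressor()
uncompressedData = ''
fileHandle = open(zipFile, 'rb')  # note 'rb', as discussed above
while True:
    s = fileHandle.read(1024)
    if not s:
        break
    # Unlike bz2.decompress, the decompressor object tolerates getting
    # the stream in pieces and returns whatever it can decode so far.
    uncompressedData += decompressor.decompress(s)
fileHandle.close()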
s = fileHandle.read()
uncompressedData = bz2.decompress(s)
If that doesn't work, I'd say it's the new-line translation problem I mentioned above.
This was very helpful.
44 of 2300 files gave a missing end-of-stream error with a plain open() on Windows.
Adding the b(inary) flag to open fixed the problem.
for line in bz2.BZ2File(filename, 'rb', 10000000):
works well. (the 10M is the buffering size that works well with the large files involved)
Thanks!
