Is file-input thread-safe in Python?

In Python 2, is it safe to have multiple threads read from a single unchanging disk file, using code such as:
with open( pathname, 'rb' ) as f:
    f.seek( file_position )
    data = f.read( number_of_bytes )
No process has, or will have, write-permission for the file.
Obviously, reading files in this way is not atomic. The Python 2 documentation says nothing (that I could find) about file objects and threads. Here is the documentation for the seek method:
https://docs.python.org/2/library/stdtypes.html?highlight=seek#file-objects
This is a critical issue for my system, so if pointers into the documentation could be provided, that would be reassuring.
Thank you.

If each thread executes the code you've given, each opens the file separately, and this is safe. I'm not sure which documentation to refer you to; this is just a result of a process being allowed to have the same file open more than once. You may not be on a POSIX system, but for reference, POSIX describes an open file description as the thing created by open() (in C, but wrapped by Python) that holds the file offset and other information relevant to accessing the file.
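For illustration, here is a minimal sketch of that pattern (the file name, offsets, and chunk size are made up for the example): each thread opens its own file object, so each has its own independent file offset.
import threading

def read_chunk(pathname, file_position, number_of_bytes, results, index):
    # Each thread opens the file itself, so seeking in one thread
    # cannot disturb the position used by another.
    with open(pathname, 'rb') as f:
        f.seek(file_position)
        results[index] = f.read(number_of_bytes)

results = [None] * 4
threads = [threading.Thread(target=read_chunk,
                            args=('data.bin', i * 1024, 1024, results, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
This works the same way in Python 2 and 3, since each call to open() creates a separate open file description.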

Related

Python Compressed file ended before the end-of-stream marker was reached. But file is not Corrupted

I made a simple requests script that downloads a file from a server:
r = requests.get("https:.../index_en.txt.lzma")
index_en= open('C:\...\index_en.txt.lzma','wb')
index_en.write(r.content)
index_en.close
When I now extract the file manually in the directory with 7zip, everything is fine and the file decrypts as normal.
I tried two ways to do it in a Python program, but since the file ends with .lzma I guess the following one is the better approach:
import lzma
with open('C:\...\index_en.txt.lzma') as compressed:
    print(compressed.readline)
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)
this one gives me the Error: "Compressed file ended before the end-of-stream marker was reached" at the line with the for loop.
the second way i tried was with 7zip, because by hand it worked fine
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
This one gives me the error OSError 22 Invalid Argument at the "with py7zr..." line.
I really don't understand where the problem is. Why does it work by hand but not in Python?
Thanks
You didn't close your file, so data stuck in user mode buffers isn't visible on disk until the file is cleaned up at some undetermined future point (may not happen at all, and may not happen until the program exits even if it does). Because of this, any attempt to access the file by any means other than the single handle you wrote to will not see the unflushed data, which would cause it to appear as if the file was truncated, getting the error you observe.
The minimal solution is to actually call close, changing index_en.close to index_en.close(). But practically speaking, you should use with statements for all files (and locks, and socket-like things, and all other resources that require cleanup), whenever possible, so even when an exception occurs the file is definitely closed; it's most important for files you're writing to (where data might not get flushed to disk without it), but even for files opened for reading, in pathological cases you can end up hitting the open file handle limit.
Rewriting your first block of code to be completely safe gets you:
with requests.get("https:.../index_en.txt.lzma") as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    index_en.write(r.content)
Note: requests.Response objects are also context managers, so I added it to the with to ensure the underlying connection is released back to the pool promptly. I also prefixed your local path with an r to make it a raw string; on Windows, with backslashes in the path, you always want to do this, so that a file or directory beginning with a character that Python recognizes as a string literal escape doesn't get corrupted (e.g. "C:\foo" is actually "C:<form feed>oo", containing neither a backslash nor an f).
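For illustration, here is that escape behavior in a quick interpreter session (a toy path, not one from the question):
>>> "C:\foo"              # \f silently becomes a form feed character
'C:\x0coo'
>>> r"C:\foo"             # the raw string keeps the backslash literally
'C:\\foo'
>>> len("C:\foo"), len(r"C:\foo")
(5, 6)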
You could even optimize it a bit, in case the file is large, by streaming the data into the file (requiring mostly fixed memory overhead, tied to the buffer size of the underlying connection) rather than fetching eagerly (requiring memory proportionate to file size):
# stream=True means underlying file is opened without being immediately
# read into memory
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    # iter_content(None) produces an iterator of chunks of data (of whatever size
    # is available in a single system call)
    # Changing to writelines means the iterator is consumed and written
    # as the data arrives
    index_en.writelines(r.iter_content(None))
Controlling the requests.get with a with statement is more important here (as stream=True mode means the underlying socket isn't consumed and freed immediately).
Also note that print(compressed.readline) is doing nothing (because you didn't call readline). If there is some line of text in the response prior to the raw LZMA data, you failed to skip it. If there is not such a garbage line, and if you'd called readline properly (with print(compressed.readline())), it would have broken decompression because the file pointer would now have skipped the first few (or many) bytes of the file, landing at some mostly random offset.
Lastly,
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
is wrong because you passed it a mode indicating you're opening it for write, when you're clearly attempting to read from it; either omit the 'w' or change it to 'r'.
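Putting the pieces together, here is a hedged sketch of the corrected flow using the standard library's lzma module instead of py7zr (the URL and paths below are the question's elided placeholders, and the text encoding is an assumption):
import lzma
import requests

url = "https:.../index_en.txt.lzma"         # placeholder from the question
archive_path = r'C:\...\index_en.txt.lzma'  # placeholder from the question

# Download and let the with statement close the file, so every byte reaches disk.
with requests.get(url, stream=True) as r, open(archive_path, 'wb') as index_en:
    index_en.writelines(r.iter_content(None))

# lzma.open handles raw .lzma/.xz streams; 'rt' decodes the text as it decompresses.
with lzma.open(archive_path, 'rt', encoding='utf-8') as uncompressed:
    for line in uncompressed:
        print(line.rstrip())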

python write file vs matlab write file, read by another software

I've encountered a strange problem with opening/closing files in Python. I am trying to do the same thing in Python that I was doing successfully in MATLAB, and I am having a problem communicating with some software through text files. I've come up with a strange workaround that solves the problem, but I don't understand why it works.
I have software that communicates with some lab equipment. To communicate with this software, I write a file ('wavefile.txt') to a specific folder, containing the parameters to send to the device. I then write another file named 'request.txt' containing the location of that first file ('wavefile.txt'). The software constantly checks this folder for a file named 'request.txt'; once it finds it, it reads the parameters from the file specified by the text in 'request.txt' and then deletes 'request.txt'. The software/equipment developer instructs you to add a 50 ms delay before closing the 'request.txt' file.
original matlab code that works:
home = cd;
cd \\CREOL-FAST-01\data
fileID = fopen('request.txt', 'wt');
proj = 'C:\\dazzler\\data\\wavefile.txt';
fprintf(fileID, proj);
pause(0.05);
fclose('all');
cd(home);
original python code that does not work:
home = os.getcwd()
os.chdir(r'\\CREOL-FAST-01\data')
with open('request.txt', 'w') as file:
    proj = r'C:\dazzler\data\wavefile.txt'
    file.write(proj)
    time.sleep(0.05)
os.chdir(home)
When it's working with MATLAB, the device program reads 'request.txt' and deletes it immediately after MATLAB closes it. When I run that code with Python it works SOMETIMES; maybe 1 in every 5 tries is successful and the parameters are sent. The 'request.txt' file is always deleted with the Python code above, but the parameters I've input are clearly not sent to my lab device. My guess is that when I write the file in Python, the device program is able to read it before Python writes the text to it, so it's just opening the blank file, not applying any parameters, and then deleting it.
My workaround in python:
home = os.getcwd()
os.chdir(r'\\CREOL-FAST-01\data')
fileh = open('request.txt', 'w+')
proj = r'C:\dazzler\data\wavefile.txt'
fileh.write(proj)
time.sleep(0.05)
print(fileh.read())
time.sleep(0.05)
fileh.close()
This method in Python seems to work 100% of the time. I open the file in w+ mode, and using fileh.read() is absolutely necessary; if I delete that line and still include the extra sleep time, it again works about 1 in 5 tries. This seems really strange to me. Any explanation, or better solutions?
My guess (which could be wrong) is that the file is being read before it is completely flushed. I would try using the flush() method after the write to make sure that the complete data is written to the file. You might also need the os.fsync() method to make sure the data is flushed properly. Try something like this:
home = os.getcwd()
os.chdir(r'\\CREOL-FAST-01\data')
with open('request.txt', 'w') as file:
    proj = r'C:\dazzler\data\wavefile.txt'
    file.write(proj)
    file.flush()
    os.fsync(file.fileno())
    time.sleep(0.05)
os.chdir(home)
Not knowing any details about the particular equipment and other software you are using, it's hard to say. One guess is the difference in buffering on write calls.
From this blog post on Matlab's fwrite: "The default behavior for fprintf and fwrite is to flush the file buffer after each call to either of these functions"
Whereas for Python's open:
When no buffering argument is given, the default buffering policy works as follows:
Binary files are buffered in fixed-size chunks; the size of the buffer is chosen using a heuristic trying to determine the underlying device’s “block size” and falling back on io.DEFAULT_BUFFER_SIZE. On many systems, the buffer will typically be 4096 or 8192 bytes long.
“Interactive” text files (files for which isatty() returns True) use line buffering. Other text files use the policy described above for binary files.
To test this guess, change:
with open('request.txt', 'w') as file:
    proj = r'C:\dazzler\data\wavefile.txt'
to:
with open('request.txt', 'w', buffering=1) as file:
    proj = 'C:\\dazzler\\data\\wavefile.txt\n'
The problem is probably that you are doing the delay while the file is still open and thus not written to disk. Try something like this:
home = os.getcwd()
os.chdir(r'\\CREOL-FAST-01\data')
with open('request.txt', 'w') as file:
    proj = r'C:\dazzler\data\wavefile.txt'
    file.write(proj)
time.sleep(0.05)
os.chdir(home)
The only difference here is that the sleep is done after the file is closed (the file is closed when the with block ends), and thus the delay doesn't happen until after the text is written to disk.
To put it in words, what you are doing is:
Open (and create) file
Write text to a buffer (in memory, not on disk)
Wait 50 ms
Close (and write) the file
What you want to do is:
Open (and create) file
Write text to a buffer (in memory, not on disk)
Close (and write) the file
Wait 50 ms
So what you end up with is a period of at least 50 ms where the text file has been created, but where there is nothing in it because the text is sitting in your computer memory not on disk.
To be honest, I would put as little in the with block as possible, to avoid issues like this. So I would write it like so:
home = os.getcwd()
os.chdir(r'\\CREOL-FAST-01\data')
proj = r'C:\dazzler\data\wavefile.txt'
with open('request.txt', 'w') as file:
    file.write(proj)
time.sleep(0.05)
os.chdir(home)
Also keep in mind that you can't assume the opposite either: that no text is written until you close. For small files like this, that will probably be the case, but exactly when the file is written to disk depends on a lot of factors. When a file is closed it is written, but it may be written before that too.
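If you want to remove the race entirely rather than just shrink it, one option (not from the answers above, requiring Python 3.3+ for os.replace, and assuming the device only looks for the exact name 'request.txt') is to write to a temporary name and rename it into place, so the watcher never sees a partially written file:
import os
import time

target_dir = r'\\CREOL-FAST-01\data'
proj = r'C:\dazzler\data\wavefile.txt'

tmp_path = os.path.join(target_dir, 'request.txt.tmp')
final_path = os.path.join(target_dir, 'request.txt')

with open(tmp_path, 'w') as fh:
    fh.write(proj)                 # flushed and closed when the with block ends

os.replace(tmp_path, final_path)   # 'request.txt' appears only with complete contents
time.sleep(0.05)                   # the 50 ms delay the vendor asks for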

add header to stdout of a subprocess in python

I am merging several dataframes into one and sorting them using unix sort. Before I write the final sorted data I would like to add a prefix/header to that output.
So, my code is something like:
my_cols = '\t'.join(['CHROM', 'POS', "REF" ....])
my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]
with open(output + 'mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')
    subprocess.run(my_cmd, stdout=sort_data)
But the code above puts my_cols at the end of the final output file (i.e. mergedAndSorted.txt).
I also tried substituting:
sort_data=io.StringIO(my_cols)
but this gives me an error, as I had expected.
How can I add that header to the beginning of the subprocess output? I believe this can be achieved by a simple code change.
The problem with your code is a matter of buffering; the tldr is that you can fix it like this:
sort_data.write(my_cols + '\n')
sort_data.flush()
subprocess.run(my_cmd, stdout=sort_data)
If you want to understand why it happens, and how the fix solves it:
When you open a file in text mode, you're opening a buffered file. Writes go into the buffer, and the file object doesn't necessarily flush them to disk immediately. (There's also stream-encoding from Unicode to bytes going on, but that doesn't really add a new problem, it just adds two layers where the same thing can happen, so let's ignore that.)
As long as all of your writes are to the buffered file object, that's fine—they get sequenced properly in the buffer, so they get sequenced properly on the disk.
But if you write to the underlying sort_data.buffer.raw disk file, or to the sort_data.fileno() OS file descriptor, those writes may get ahead of the ones that went to sort_data.
And that's exactly what happens when you use the file as a pipe in subprocess. This doesn't seem to be explained directly, but can be inferred from Frequently Used Arguments:
stdin, stdout and stderr specify the executed program’s standard input, standard output and standard error file handles, respectively. Valid values are PIPE, DEVNULL, an existing file descriptor (a positive integer), an existing file object, and None.
This implies pretty strongly—if you know enough about the way piping works on *nix and Windows—that it's passing the actual file descriptor/handle to the underlying OS functionality. But it doesn't actually say that. To really be sure, you have to check the Unix source and Windows source, where you can see that it is calling fileno or msvcrt.get_osfhandle on the file objects.
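For completeness, here is a self-contained sketch of the fix in context (the column list is truncated as in the question, and the output prefix is a stand-in):
import subprocess

my_cols = '\t'.join(['CHROM', 'POS', 'REF'])          # truncated column list
my_cmd = ["sort", "-k1,2", "-V", "final_merged.txt"]
output = ''                                           # stand-in for the question's prefix

with open(output + 'mergedAndSorted.txt', 'w') as sort_data:
    sort_data.write(my_cols + '\n')   # header goes through Python's buffered file object...
    sort_data.flush()                 # ...so flush it before the fd is handed to sort
    subprocess.run(my_cmd, stdout=sort_data)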

opening and writing a large binary file python

I have a homebrew web-based file system that allows users to download their files as zips; however, I found an issue while dev'ing on my local box that is not present on the production system.
On Linux this is a non-issue (the local dev box is a Windows system).
I have the following code
algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
temp_file = open(temp_file_path, 'wb+')
data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    temp_file.write(dec_data)
    data = file.read(settings.READ_SIZE)
# Takes a dump right here!
# error in cipher operation (wrong final block length)
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()
The above code opens a file, and (using the key for the current file share) decrypts the file and writes it to a temporary location (that will later be stuffed into a zip file).
My issue is with the file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb') line. Since Windows cares a great deal about binary files, if I don't specify 'rb' the file will not read to the end in the data read loop; however, for some reason, since I am also writing to temp_file, it never completely reads to the end of the file... UNLESS I add a + after the b: 'rb+'.
If I change the code to file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb+'), everything works as desired and the code successfully reads the entire binary file and decrypts it. If I do not add the plus, it fails and cannot read the entire file...
Another section of the code (for downloading individual files) reads (and works flawlessly no matter the OS):
algo = CipherType('AES-256', 'CBC')
decrypt = DecryptCipher(algo, cur_share.key[:32], cur_share.key[-16:])
file = open(settings.STORAGE_ROOT + 'f_' + str(cur_file.id), 'rb')
filename = smart_str(cur_file.name, errors='replace')
response = HttpResponse(mimetype='application/octet-stream')
response['Content-Disposition'] = 'attachment; filename="' + filename + '"'
data = file.read(settings.READ_SIZE)
while data:
    dec_data = decrypt.update(data)
    response.write(dec_data)
    data = file.read(settings.READ_SIZE)
# no dumps to be taken when finishing up the decrypt process...
final_data = decrypt.finish()
temp_file.write(final_data)
file.close()
temp_file.close()
Clarification
The cipher error is most likely because the file was not read in its entirety. For example, I have a 500 MB file that I read in 64*1024 bytes at a time, until I receive no more bytes. When I don't specify b on Windows, it cycles through the loop only twice and returns garbage data (because Python thinks it is interacting with a text file, not a binary file).
When I specify b it takes 10-15 seconds to completely read in the file, but it does so successfully and the code completes normally.
When I am concurrently writing to another file as I read from the source file (as in the first example), if I do not specify rb+ it displays the same behavior as not specifying b at all: it only reads a couple of segments from the file before closing the handle and moving on, I end up with an incomplete file, and the decryption fails.
I'm going to take a guess here:
You have some other program that's continually replacing the files you're trying to read.
On linux, this other program works by atomically replacing the file (that is, writing to a temporary file, then moving the temporary file to the path). So, when you open a file, you get the version from 8 seconds ago. A few seconds later, someone comes along and unlinks it from the directory, but that doesn't affect your file handle in any way, so you can read the entire file at your leisure.
On Windows, there is no such thing as atomic replacement. There are a variety of ways to work around that problem, but what many people do is to just rewrite the file in-place. So, when you open a file, you get the version from 8 seconds ago, start reading it… and then suddenly someone else blanks the file to rewrite it. That does affect your file handle, because they've rewritten the same file. So you hit an EOF.
Opening the file in r+ mode doesn't do anything to solve the problem, but it adds a new problem that hides it: You're opening the file with sharing settings that prevent the other program from rewriting the file. So, now the other program is failing, meaning nobody is interfering with this one, meaning this one appears to work.
In fact, it could be even more subtle and annoying than this. Later versions of Windows try to be smart. If I try to open a file while someone else has it locked, instead of failing immediately, it may wait a short time and try again. The rules for exactly how this works depend on the sharing and access you need, and aren't really documented anywhere. And effectively, whenever it works the way you want, it means you're relying on a race condition. That's fine for interactive stuff like dragging a file from Explorer to Notepad (better to succeed 99% of the time instead of 10% of the time), but obviously not acceptable for code that's trying to work reliably (where succeeding 99% of the time just means the problem is harder to debug). So it could easily work differently between r and r+ modes for reasons you will never be able to completely figure out, and wouldn't want to rely on if you could…
Anyway, if any variation of this is your problem, you need to fix that other program, the one that rewrites the file, or possibly both programs in cooperation, to properly simulate atomic file replacement on Windows. There's nothing you can do from just this program to solve it.*
* Well, you could do things like optimistic check-read-check and start over whenever the modtime changes unexpectedly, or use the filesystem notification APIs, or… But it would be much more complicated than fixing it in the right place.
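For reference, here is a hedged sketch of what that replace-by-rename could look like on the writer's side (the helper name and error handling are illustrative, not anything from the answer):
import os
import tempfile

def replace_file(path, new_bytes):
    # Write the new contents next to the target, then rename over it, so
    # readers only ever see a complete old version or a complete new one.
    dir_name = os.path.dirname(path) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    try:
        with os.fdopen(fd, 'wb') as tmp:
            tmp.write(new_bytes)
            tmp.flush()
            os.fsync(tmp.fileno())   # make sure the bytes are on disk first
        # Atomic rename on POSIX; on Windows this is a MoveFileEx-style replace,
        # which can still fail if a reader holds the destination open without
        # delete sharing, so the writer may need to retry.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise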

Python Multiple users append to the same file at the same time

I'm working on a Python script that will be accessed via the web, so there will be multiple users trying to append to the same file at the same time. My worry is that this might cause a race condition: if multiple users write to the same file at the same time, it might corrupt the file.
For example:
#!/usr/bin/env python
g = open("/somepath/somefile.txt", "a")
new_entry = "foobar"
g.write(new_entry)
g.close()
Will I have to use a lockfile for this, as this operation looks risky?
You can use file locking:
import fcntl
new_entry = "foobar"
with open("/somepath/somefile.txt", "a") as g:
    fcntl.flock(g, fcntl.LOCK_EX)
    g.write(new_entry)
    fcntl.flock(g, fcntl.LOCK_UN)
Note that on some systems, locking is not needed if you're only writing small buffers, because appends on these systems are atomic.
If you are doing this operation on Linux, and each write is smaller than 4 KB, the append is atomic and you should be good.
More to read here: Is file append atomic in UNIX?
You didn't state what platform you use, but here is a module you can use that is cross-platform:
File locking in Python
Depending on your platform/filesystem location this may not be doable in a safe manner (e.g. NFS). Perhaps you can write to different files and merge the results afterwards?
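If fcntl isn't available (e.g. on Windows) and you'd rather not pull in a third-party module, here is a minimal lock-file sketch along the lines suggested above (the helper name is made up, and note that a crashed process can leave a stale lock behind):
import errno
import os
import time

def append_with_lockfile(path, text, retry_delay=0.05):
    lock_path = path + '.lock'
    while True:
        try:
            # O_CREAT | O_EXCL fails if the lock file already exists
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            time.sleep(retry_delay)   # someone else holds the lock; wait and retry
    try:
        with open(path, "a") as g:
            g.write(text)
    finally:
        os.close(fd)
        os.unlink(lock_path)

append_with_lockfile("/somepath/somefile.txt", "foobar\n")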
