What's the best way to get the file size? - python

There are actually three ways I have in mind to determine a file's size:
open and read it, then get the size of the string with len()
using os.stat and reading st_size -> which should be the "right" way, because it's handled by the underlying OS
os.path.getsize, which should be the same as above
So what is the actual right way to determine the file size? And what is the worst way?
Or doesn't it even matter, because in the end they are all the same?
(I can imagine the first method having a problem with really large files, while the other two do not.)

The first method would be a waste if you don't need the contents of the file anyway. Either of your other two options is fine. os.path.getsize() uses os.stat().
From genericpath.py
def getsize(filename):
    """Return the size of a file, reported by os.stat()."""
    return os.stat(filename).st_size
Edit:
In case it isn't obvious, os.path.getsize() comes from genericpath.py.
>>> os.path.getsize.__code__
<code object getsize at 0x1d457b0, file "/usr/lib/python2.7/genericpath.py", line 47>

Method 1 is the slowest way possible. Don't use it unless you will need the entire contents of the file as a string later.
Methods 2 and 3 are the fastest, since they don't even have to open the file.
Using f.seek(0, os.SEEK_END) and f.tell() requires opening the file, and might be a bit slower than methods 2 and 3 unless you're going to open the file anyway.
All methods give the same result when no other program is writing to the file. If the file is in the middle of being modified when your code runs, seek+tell can sometimes give you a more up-to-date answer than methods 2 and 3.

no. 1 is definitely the worst. If anything, it's better to seek() and tell(), but that's not as good as the other two.
no. 2 and no. 3 are equally OK IMO. I think no. 3 is a bit clearer to read, but the difference is negligible.
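The options discussed above can be compared side by side. A minimal sketch (the file name is hypothetical, chosen just for the demo):

```python
import os

# Create a small demo file (hypothetical name, 11 bytes of content).
path = "demo.txt"
with open(path, "wb") as f:
    f.write(b"hello world")

# Method 2: os.stat
size_stat = os.stat(path).st_size

# Method 3: os.path.getsize (a thin wrapper around os.stat)
size_getsize = os.path.getsize(path)

# The seek/tell alternative: requires actually opening the file
with open(path, "rb") as f:
    f.seek(0, os.SEEK_END)
    size_seek = f.tell()

print(size_stat, size_getsize, size_seek)  # all three agree: 11 11 11
```

All three report the same size for a quiescent file; the stat-based methods simply avoid opening it.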

Related

Python's os.copy_file_range not working with O_APPEND

I want to copy the content of a file 'from_path' to the end of another file 'to_path'. I wrote the code
fd_from = os.open(from_path, os.O_RDONLY)
fd_to = os.open(to_path, os.O_WRONLY | os.O_APPEND)
os.copy_file_range(fd_from, fd_to, os.path.getsize(from_path))
os.close(fd_from)
os.close(fd_to)
However, I get the following error
OSError: [Errno 9] Bad file descriptor
on the third line.
This (or something similar) was working fine, but now I can't avoid said error, even though (I believe) I haven't changed anything.
I looked around online and found that this error usually happens because a file was not properly opened/closed. However, that should not be the case here.
If we do, for example
fd_to = os.open(to_path, os.O_WRONLY | os.O_APPEND)
os.write(fd_to, b'something')
os.close(fd_to)
Everything works smoothly.
Also, if I write the exact same code as the problematic one, but without O_APPEND, everything works as well.
I am using Python 3.8.13, glibc 2.35 and linux kernel 5.15.0.
Note that efficiency is important in my case, so many of the alternatives I've come across are undesirable.
Some of the alternatives that were found to be slower than this particular method are:
Using subprocess to launch the unix utility cat;
Iterating over the lines of the first file and appending them to the second.
While I had the implementation with copy_file_range working, I found that it was around 2.6 times faster than cat and 14 times faster than iterating over the lines.
I've also read about shutil and other methods, but those don't seem to allow appending of the copied contents.
Can anyone explain the problem? Does this function not work with append mode? Or maybe there is a workaround?
Thank you in advance for your help!
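One likely explanation, per the copy_file_range(2) man page: the kernel returns EBADF when the output file descriptor has O_APPEND set. A possible workaround is to open the destination without O_APPEND and seek to its end instead. A sketch, assuming Linux and Python 3.8+ (append_file is a hypothetical helper name, not from the question):

```python
import os

def append_file(from_path, to_path):
    """Copy from_path onto the end of to_path without O_APPEND.

    copy_file_range(2) rejects an fd_out opened with O_APPEND (EBADF),
    so we open the destination plainly and lseek to its end instead.
    """
    fd_from = os.open(from_path, os.O_RDONLY)
    fd_to = os.open(to_path, os.O_WRONLY)
    try:
        os.lseek(fd_to, 0, os.SEEK_END)  # position at end, emulating append
        remaining = os.path.getsize(from_path)
        while remaining > 0:
            # copy_file_range may copy fewer bytes than requested; loop.
            copied = os.copy_file_range(fd_from, fd_to, remaining)
            if copied == 0:  # source exhausted early
                break
            remaining -= copied
    finally:
        os.close(fd_from)
        os.close(fd_to)
```

This keeps the in-kernel copy path while sidestepping the append-mode restriction; note it is not atomic against concurrent appenders the way O_APPEND would be.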

open(..., encoding="") vs str.encode(encoding="")

Question:
What is the difference between open(<name>, "w", encoding=<encoding>) and open(<name>, "wb") + str.encode(<encoding>)? They seem to (sometimes) produce different outputs.
Context:
While using PyFPDF (version 1.7.2), I subclassed the FPDF class, and, among other things, added my own output method (taking pathlib.Path objects). While looking at the source of the original FPDF.output() method, I noticed almost all of it is argument parsing - the only relevant bits are
#Finish document if necessary
if(self.state < 3):
    self.close()
[...]
f=open(name,'wb')
if(not f):
    self.error('Unable to create output file: '+name)
if PY3K:
    # manage binary data as latin1 until PEP461 or similar is implemented
    f.write(self.buffer.encode("latin1"))
else:
    f.write(self.buffer)
f.close()
Seeing that, my own implementation looked like this:
def write_file(self, file: Path) -> None:
    if self.state < 3:
        # See FPDF.output()
        self.close()
    file.write_text(self.buffer, "latin1", "strict")
This seemed to work - a .pdf file was created at the specified path, and Chrome opened it. But it was completely blank, even though I added images and text. After hours of experimenting, I finally found a version that worked (produced a non-empty pdf file):
def write_file(self, file: Path) -> None:
    if self.state < 3:
        # See FPDF.output()
        self.close()
    # using .write_text(self.buffer, "latin1", "strict") DOES NOT WORK AND I DON'T KNOW WHY
    file.write_bytes(self.buffer.encode("latin1", "strict"))
Looking at the pathlib.Path source, it uses io.open for Path.write_text(). As all of this is Python 3.8, io.open and the built-in open() are the same.
Note:
FPDF.buffer is of type str, but holds binary data (a pdf file) - probably because the library was originally written for Python 2.
Both should be the same (with minor differences).
I like the open way because it is explicit and shorter. On the other hand, if you want to handle encoding errors (e.g. give the user a much better error message), you should use decode/encode yourself (maybe after splitting on '\n' and keeping track of line numbers).
Note: if you use the first method (open with an encoding), you should use plain r or w modes, without b. From your question's title it seems you did this correctly, but note that your example keeps b, which is probably why it encodes manually. The code also seems old, and I think the .encode() was just more natural in a Python 2 mindset.
Note: I would also replace strict with backslashreplace for debugging. You may also want to check and print (maybe just ord) the first few characters of self.buffer in both methods, to see whether there are substantial differences before file.write.
I would also add a file.flush() in both functions. Buffering is one of the differences, and I would make sure the file is closed. Python will do it eventually, but when debugging it is important to see the content of the file as quickly as possible (and also after an exception); the garbage collector cannot guarantee any of this. Maybe you are reading a text file that was not yet flushed.
Aaaand found it: Path.write_bytes() will save the bytes object as is, and str.encode() doesn't touch the line endings.
Path.write_text() will encode the string just like str.encode(), BUT: because the file is opened in text mode, the line endings are normalized after encoding - in my case converting \n to \r\n because I'm on Windows. And PDFs have to use \n, no matter what platform you're on.
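The newline-normalization difference can be reproduced without PyFPDF. A minimal sketch with hypothetical file names:

```python
from pathlib import Path

# Binary payload disguised as str, like FPDF.buffer (latin-1 round-trips all byte values).
buffer = "line1\nline2\n"

p_text = Path("text_mode.bin")
p_bytes = Path("bytes_mode.bin")

# Text mode: opened via io.open(..., "w"), so '\n' is translated to os.linesep on write.
p_text.write_text(buffer, "latin1", "strict")

# Binary mode: bytes are written verbatim, no newline translation.
p_bytes.write_bytes(buffer.encode("latin1"))

# On Windows, text_mode.bin contains b"line1\r\nline2\r\n";
# bytes_mode.bin contains b"line1\nline2\n" on every platform.
print(p_text.read_bytes())
print(p_bytes.read_bytes())
```

For binary payloads such as a PDF body, write_bytes (or open(..., "wb")) is therefore the safe choice.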

python File read buffer boundary

When we use open() to open a file, we can set the buffer size for performance. But suppose we set it to 1024 while the data in the file looks like this:
1999999999 3232344 54354364576 2343243254 6453623453245r3245235 5342453245233333333333333333 534545454364536 4355545...
I don't know whether this will cut off a number: on the first read the buffer might hold 1999999999 3232344 54354364576 2343243254 6453623453245r3245235 53424532,
and on the next read it might hold 45233333333333333333 534545454364536 4355545, and so on.
Or does Python's buffering implementation already solve this problem? Can anyone give me some pointers? Thanks.
If you use the read() method without any arguments, it returns the entire file content. You can pass the size argument if you only want to read part of the file at a time. Note that the buffering argument to open() only controls internal I/O buffering for performance; it never changes what read() returns, so no data is lost or cut off at buffer boundaries.
See the documentation for more info http://docs.python.org/2.7/tutorial/inputoutput.html#reading-and-writing-files
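If you do read in fixed-size chunks yourself (via read(size)), you have to stitch tokens across chunk boundaries at the application level. A minimal sketch (numbers_from_file is a hypothetical helper, not from the question):

```python
def numbers_from_file(path, chunk_size=1024):
    """Yield whitespace-separated tokens from a file, even when a
    chunk boundary falls in the middle of one."""
    leftover = ""
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            chunk = leftover + chunk
            parts = chunk.split()
            # If the chunk doesn't end in whitespace, the last token may
            # be incomplete; carry it over to the next round.
            if not chunk[-1].isspace():
                leftover = parts.pop() if parts else ""
            else:
                leftover = ""
            yield from parts
    if leftover:
        yield leftover
```

Each chunk is prefixed with the unfinished tail of the previous one, so a number split at byte 1024 is reassembled before it is yielded.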

Can I read and write file in one line with Python?

with ruby I can
File.open('yyy.mp4', 'w') { |f| f.write(File.read('xxx.mp4')) }
Can I do this using Python ?
Sure you can:
with open('yyy.mp4', 'wb') as f:
    f.write(open('xxx.mp4', 'rb').read())
Note the binary mode flag there (b): since you are copying mp4 contents, you don't want Python to reinterpret newlines for you.
That'll take a lot of memory if xxx.mp4 is large. Take a look at the shutil.copyfile function for a more memory-efficient option:
import shutil
shutil.copyfile('xxx.mp4', 'yyy.mp4')
Python is not about writing ugly one-liner code.
Check the documentation of the shutil module - in particular the copyfile() function.
http://docs.python.org/library/shutil.html
You want to copy a file, do not manually read then write bytes, use file copy functions which are generally much better and efficient for a number of reasons in this simple case.
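When the file may be large and you still want an explicit read/write pair, shutil.copyfileobj copies between already-open file objects in fixed-size chunks, avoiding the whole-file read. A short sketch reusing the hypothetical file names from the question (the demo creates its own source file first):

```python
import shutil

# Hypothetical demo: create a source file standing in for real video data.
with open('xxx.mp4', 'wb') as f:
    f.write(b'\x00' * 200_000)

# Copy in 64 KiB chunks; memory use stays bounded regardless of file size.
with open('xxx.mp4', 'rb') as src, open('yyy.mp4', 'wb') as dst:
    shutil.copyfileobj(src, dst, length=64 * 1024)
```

shutil.copyfile does this internally (and adds path handling); copyfileobj is the option when you already hold open streams or need to append to an existing file.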
If you want a true one-liner, you can replace the line breaks with semicolons:
import shutil; shutil.copyfile("xxx.mp4","yyy.mp4")
Avoid this! I did it once to speed up an extremely specific case completely unrelated to Python, caused by the presence of line breaks in my python -c "Put 🐍️ code here" command line and the way Meson handles it.

How to cause an error or unwanted behaviour by not closing a file object in python? (code example pls) [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Does a File Object Automatically Close when its Reference Count Hits Zero?
I read that file objects need to be closed but can someone actually provide a very simple example (code example) where an error is caused by not closing a file object?
Are you asking if Python will raise an error if you fail to close a file? Then the answer is "no".
If you are asking if you might lose data, the answer is "yes".
By analogy, will the cops write you a ticket if you leave your keys in the ignition? No.
Does this practice increase the odds that you will "lose" your car? Yes.
Edit:
Ok, you asked for an example, not smart-aleck comments. Here is an example, although a bit contrived, because it's easier to do this than to investigate buffer-size corner cases.
Good:
fh = open("erase_me.txt", "w")
fh.write("Hello there!")
fh.close()
# Writes "Hello there!" to 'erase_me.txt'
# tcsh-13: cat erase_me.txt
# Hello there!tcsh-14:
Bad:
import os
fh = open("erase_me.txt", "w")
fh.write("Hello there!")
# Whoops! Something bad happened and my program bombed!
os._exit(1)
fh.close()
# tcsh-19: cat erase_me.txt
# tcsh-20: ll erase_me.txt
# -rw-r--r-- 1 me us 0 Jul 17 15:41 erase_me.txt
# (Notice file length = 0)
It's something you might observe if you are writing data to a file and at the end your output file doesn't contain all of the data you have written to it because the file wasn't properly closed (and its buffers flushed).
Here's a fairly recent example on SO of just this problem Python not writing full string to file.
Note that this problem isn't unique to Python, you are likely to encounter this with other languages too (e.g., I've run into this more than once with C)
On some operating systems, writing a lot of data to a file and not closing it will cause the data not to be flushed when the libc tears it down, resulting in a truncated file.
I will also add that if you don't close opened files in a long-running process, you can end up hitting the maximum number of open files allowed per process; on a Linux system the default limit can be checked using the command ulimit -aH.
Note: the limit I'm talking about is the limit of file descriptors per process, which includes, besides physical files, things like sockets.
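On Unix-like systems the same limit can be inspected from within Python via the resource module (a sketch, assuming a Unix platform; the module is not available on Windows):

```python
import resource

# RLIMIT_NOFILE is the per-process cap on open file descriptors,
# the same value `ulimit -n` reports for the soft limit.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"per-process fd limit: soft={soft}, hard={hard}")

# Leaking file objects eventually exhausts this limit, e.g.:
#   leaked = [open("some_file") for _ in range(soft + 1)]
# raises OSError: [Errno 24] Too many open files
```

A process that opens files in a loop without closing them will hit the soft limit first; raising it only delays the failure.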
There is not technically an error, but the file will stay open until the garbage collector closes it, which can have a negative effect on other processes. You should always explicitly close your file descriptors.
It is good practice to use the with keyword when dealing with file
objects. This has the advantage that the file is properly closed after
its suite finishes, even if an exception is raised on the way. It is
also much shorter than writing equivalent try-finally blocks:
Using with:
with open('test.txt', 'rb') as f:
    buf = f.readlines()
You may not get any error or exception in trivial cases, and your file may have all its content, but it is prone to catastrophic errors in real-world programs.
One of the principles of Python is "bad behavior should be discouraged but not banned", so I would suggest always closing the file in a "finally" block.
Say you are processing a bunch of files in a directory. Not closing them will take a significant amount of memory, and can even cause your program to run out of file descriptors or some other resource.
We expect CPython to close files when no references to them remain, but that behaviour is not guaranteed, and if someone uses your module on a Python implementation such as Jython that doesn't use reference counting, they may encounter strange bugs or excessively long spurts of garbage collection.
