File size changes after read/write txt file in python

After executing the following code to generate a copy of a text file with Python, newfile.txt doesn't have exactly the same file size as oldfile.txt:
with open('oldfile.txt','r') as a, open('newfile.txt','w') as b:
    content = a.read()
    b.write(content)
While oldfile.txt has e.g. 667 KB, newfile.txt has 681 KB.
Does anyone have an explanation for that?

There are several possible causes.
You are opening the files in text mode, so the bytes of the file are interpreted (decoded) into Python strings and then re-encoded on write. That round trip can change the data.
From the open() documentation (https://docs.python.org/3/library/functions.html#open):
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller.
So, on reading, any \r\n (or lone \r) line endings are translated to \n. When writing the file back, you either no longer have the original \r characters (on Linux or macOS) or every \n becomes \r\n (on Windows, which seems to be your case, because the file grew: any line that originally ended in a bare \n gains an extra byte on write).
Encoding can also change the text. E.g. a BOM could be removed (or added), and potentially (though AFAIK it is not done implicitly) redundant code points could be dropped: Unicode has combining and control codes that change the behaviour of nearby characters, and when several of the same one are stacked, only the last is effective.
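If the goal is a byte-identical copy, a sketch like the following sidesteps the issue entirely by staying in binary mode, so no decoding, re-encoding, or newline translation happens:
import shutil

# Binary mode copies the raw bytes; newlines and encoding are untouched
with open('oldfile.txt', 'rb') as a, open('newfile.txt', 'wb') as b:
    b.write(a.read())

# Equivalent one-liner from the standard library
shutil.copyfile('oldfile.txt', 'newfile.txt')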

I tried this on Linux/Ubuntu. There it works as expected: the file size of both files is exactly equal.
At this point, I'd guess this behaviour is not caused by Python itself; it may depend on your filesystem (compression) or operating system.

Related

Python: Compressed file ended before the end-of-stream marker was reached, but file is not corrupted

I made a simple requests script that downloads a file from a server:
r = requests.get("https:.../index_en.txt.lzma")
index_en= open('C:\...\index_en.txt.lzma','wb')
index_en.write(r.content)
index_en.close
When I now extract the file manually in the directory with 7-Zip, everything is fine and the file decompresses normally.
I tried two ways to do it in a Python program, but since the file ends with .lzma I guess the following one is the better approach:
import lzma
with open('C:\...\index_en.txt.lzma') as compressed:
    print(compressed.readline)
    with lzma.LZMAFile(compressed) as uncompressed:
        for line in uncompressed:
            print(line)
This one gives me the error "Compressed file ended before the end-of-stream marker was reached" at the line with the for loop.
The second way I tried was with py7zr, because by hand it worked fine:
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
This one gives me the error OSError 22 Invalid Argument at the "with py7zr..." line.
I really don't understand where the problem is. Why does it work by hand but not in Python?
Thanks
You didn't close your file, so data stuck in user mode buffers isn't visible on disk until the file is cleaned up at some undetermined future point (may not happen at all, and may not happen until the program exits even if it does). Because of this, any attempt to access the file by any means other than the single handle you wrote to will not see the unflushed data, which would cause it to appear as if the file was truncated, getting the error you observe.
The minimal solution is to actually call close, changing index_en.close to index_en.close(). But practically speaking, you should use with statements for all files (and locks, and socket-like things, and all other resources that require cleanup), whenever possible, so even when an exception occurs the file is definitely closed; it's most important for files you're writing to (where data might not get flushed to disk without it), but even for files opened for reading, in pathological cases you can end up hitting the open file handle limit.
Rewriting your first block of code to be completely safe gets you:
with requests.get("https:.../index_en.txt.lzma") as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    index_en.write(r.content)
Note: requests.Response objects are also context managers, so I added it to the with to ensure the underlying connection is released back to the pool promptly. I also prefixed your local path with an r to make it a raw string; on Windows, with backslashes in the path, you always want to do this, so that a file or directory beginning with a character that Python recognizes as a string literal escape doesn't get corrupted (e.g. "C:\foo" is actually "C:<form feed>oo", containing neither a backslash nor an f).
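A quick way to see the corruption the raw-string prefix prevents (an illustrative snippet, not from the original question):
print(len('C:\foo'))     # 5 -- '\f' was collapsed into a single form-feed character
print(len(r'C:\foo'))    # 6 -- the raw string keeps both the backslash and the 'f'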
You could even optimize it a bit, in case the file is large, by streaming the data into the file (requiring mostly fixed memory overhead, tied to the buffer size of the underlying connection) rather than fetching eagerly (requiring memory proportionate to file size):
# stream=True means the underlying file is opened without being immediately
# read into memory
with requests.get("https:.../index_en.txt.lzma", stream=True) as r, open(r'C:\...\index_en.txt.lzma','wb') as index_en:
    # iter_content(None) produces an iterator of chunks of data (of whatever size
    # is available in a single system call)
    # Changing to writelines means the iterator is consumed and written
    # as the data arrives
    index_en.writelines(r.iter_content(None))
Controlling the requests.get with a with statement is more important here (as stream=True mode means the underlying socket isn't consumed and freed immediately).
Also note that print(compressed.readline) doesn't read anything (because you didn't call readline; it just prints the bound method object). If there is some line of text in the response prior to the raw LZMA data, you failed to skip it. If there is no such garbage line, and you had called readline properly (with print(compressed.readline())), it would have broken decompression, because the file pointer would then have skipped the first few (or many) bytes of the file, landing at some essentially random offset.
Lastly,
with py7zr.SevenZipFile("C:\...\index_en.txt.lzma", 'w') as archive:
    archive.extract(path="C:\...\Json")
is wrong because you passed it a mode indicating you're opening it for write, when you're clearly attempting to read from it; either omit the 'w' or change it to 'r'.
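Putting the corrections together, a rough sketch of the whole flow might look like this (the URL and paths below are placeholders, since the originals are elided; note that lzma.open can read a raw .lzma stream directly):
import lzma
import requests

url = 'https://example.com/index_en.txt.lzma'        # placeholder URL
archive_path = r'C:\somewhere\index_en.txt.lzma'     # placeholder path

# Download in streaming mode, with both resources cleaned up automatically
with requests.get(url, stream=True) as r, open(archive_path, 'wb') as f:
    f.writelines(r.iter_content(None))

# Open in binary mode, with no stray readline() before handing over the file
with lzma.open(archive_path, 'rb') as uncompressed:
    for line in uncompressed:
        print(line)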

python3 for win and cygwin - line endings in buffer

Setup: Python 3.6 for Windows, run under Cygwin (I have to use the Windows build because of functionality introduced in 3.5, and Cygwin's Python is stuck at 3.4).
How do I get \n line endings in buffered (stdout) output from a Python script, instead of \r\n? The output is a list of paths and I want one per line for further processing by other Cygwin/Windows tools.
All the answers I've found so far deal with file writing, and I just want to modify what is written to stdout. So far the only sure way to get rid of the \r is piping the results through sed 's/\\10//', which is awkward.
The weird thing is that even Windows applications fed the script's output don't accept it, giving messages like:
Can't find file <asdf.txt
>
(note newline before >)
Supposedly sys.stdout.write does raw output, but when doing:
sys.stdout.write(line)
I get a list of paths without any separation. If I introduce anything that resembles a newline (\n, \012, etc.) it is automatically converted to CRLF (\r\n). How do I stop this conversion?
You need to write to stdout in binary mode; the default is text mode, which translates everything you write.
According to Issue4571 you can do this by writing directly to the internal buffer used by stdout.
sys.stdout.buffer.write(line)
Note that if you're writing Unicode strings you'll need to encode them to byte strings first.
sys.stdout.buffer.write(line.encode('utf-8')) # or 'mbcs'
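Alternatively, if you would rather keep writing str objects, you can wrap the binary buffer in a fresh text stream with newline translation disabled (a sketch; the example paths are made up):
import io
import sys

# newline='\n' tells the wrapper to pass '\n' through untranslated
out = io.TextIOWrapper(sys.stdout.buffer, newline='\n')
for line in ['foo/bar.txt', 'baz/qux.txt']:   # example paths
    out.write(line + '\n')
out.flush()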

Python pickle weird behavior on using different file modes for the file where pickled object is stored

I am pickling a dictionary with the following statement:
pickle.dump(paramsToSave, open('testvars.txt','wb'))
I am unpickling with the following:
vars = pickle.load(open('testvars.txt','rb'))
Now when I use file mode 'w' in pickling and 'r' in unpickling, it is fine. The same goes for the wb-rb and wb-r combinations.
But when I use the w-rb combination, I get an error:
ValueError: insecure string pickle
Can someone please explain this behavior? And which is the right file mode combination to use?
Edit: I am using Python 2.6.6 on Windows 7
First of all, you should always use binary mode for pickle files. On platforms where this matters (e.g. Windows), opening a file in text mode means that all line terminators are translated; \n becomes \r\n on write, and \r\n becomes \n again on reading.
On Python 2 the default pickle protocol is ASCII-based, but that doesn't mean the contents of the values are unaffected. For your w -> rb example, most likely a value with an embedded \n was written out as \r\n and then read back as \r\n, meaning the length of the data changed; that triggers the error message because quoting expectations were no longer met (the closing quote was not read where the string's recorded length said it would be).
The fact that you didn't run into this specific exception with the other non-binary combinations does not mean you didn't have problems anyway. Values could still end up being corrupted.
All other protocol versions are binary based, meaning you can break the protocol in more creative ways still.
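Whatever protocol you use, the robust pattern is binary mode on both ends plus with statements, so the file is always flushed and closed; a small sketch with made-up data:
import pickle

params = {'name': 'example', 'text': 'a\nb'}   # example payload with an embedded newline

with open('testvars.pkl', 'wb') as f:          # 'wb', never 'w'
    pickle.dump(params, f)

with open('testvars.pkl', 'rb') as f:          # 'rb', never 'r'
    restored = pickle.load(f)

assert restored == params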

what does the ^# sign in a .txt file suggest

I was concurrently manipulating a txt file (some read/write operations) with multiple processes, and I saw traces of special signs like ^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^#^# spread across some lines now and then. What does this suggest, and under what circumstances do these symbols appear? Does it mean some binary content was written, by mistake, where there should be text?
UPDATE
I read through the documentation. Some sources suggest it's due to newline issues across Linux/Windows platforms, while others suggest it's because of big-endian/little-endian issues in a networked environment. The fact is I was running multiple processes on a networked filesystem, all manipulating one common txt file, so I guess the encoding format might be the major reason. Can anyone suggest how to avoid this issue? I don't want to edit the files afterwards (like manually doing text substitution); a clean way of producing the right file without any null characters is preferred.
UPDATE2
This is Python pseudo-code for what my project does; the fcntl.lockf call is there to lock the common manipulated file across the multiple machines that run processes on it.
while manipulatedfile_size != 0:              # pseudo-condition: loop until the file is empty
    with open(manipulatedfile, 'r+') as fh:
        fcntl.lockf(fh, fcntl.LOCK_EX)
        all_lines = fh.readlines()
        listing = all_lines[0:50]             # get the first 50 lines
        rest_lines = all_lines[50:]           # get the remaining lines
        fh.seek(0)
        fh.truncate()
        fh.writelines(rest_lines)             # write the remaining lines back to the file
        fcntl.lockf(fh, fcntl.LOCK_UN)
    listing = map(lambda s: s.strip(), listing)
    do_sth(listing)
Thanks
In ASCII, ^# denotes a binary zero (NUL) character.
Data containing a NUL between each ASCII character can result from text being incorrectly translated from a wide encoding (e.g. UTF-32, 4 bytes per character) to ASCII (1 byte per character), or vice versa.
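A quick illustration of that mistranslation (a made-up snippet; UTF-32 chosen to match the 4-bytes-per-character case above):
text = 'AB'
wide = text.encode('utf-32-le')   # 4 bytes per character
print(repr(wide))                 # b'A\x00\x00\x00B\x00\x00\x00' -- read as ASCII,
                                  # the \x00 bytes show up as caret-escaped NULs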
To remove the ^# characters, run vi file.txt, and enter :%s/ Ctrl+V Ctrl+# //g and hit ↵ Return.
These are "file holes" and contain null characters. The null character (or NUL char) has an ASCII code of 0 and appears as ^# when viewed in vi or less.
I usually see these when I am nearly out of disk space and processes are trying to write to log files.
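A minimal demonstration of how such a zero-filled hole can arise (hypothetical filename, not from the original question):
with open('demo.bin', 'wb') as f:
    f.write(b'start')
    f.seek(100)              # jump past the current end of the file
    f.write(b'end')          # bytes 5..99 were never written; they read back as \x00

with open('demo.bin', 'rb') as f:
    data = f.read()
print(data.count(b'\x00'))   # 95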

Difference between parsing a text file in r and rb mode

What makes parsing a text file in 'r' mode more convenient than parsing it in 'rb' mode?
Especially when the text file in question may contain non-ASCII characters.
This depends a little bit on what version of Python you're using. In Python 2, Chris Drappier's answer applies.
In Python 3, it's a different (and more consistent) story: in text mode ('r'), Python will parse the file according to the text encoding you give it (or, if you don't give one, a platform-dependent default), and read() will give you a str. In binary ('rb') mode, Python does not assume that the file contains things that can reasonably be parsed as characters, and read() gives you a bytes object.
Also, in Python 3, universal newline handling (the translation between '\n' and platform-specific newline conventions, so you don't have to care about them) is available for text-mode files on any platform, not just Windows.
from the documentation:
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
The difference lies in how the end-of-line (EOL) is handled. Different operating systems use different characters to mark EOL: \n in Unix, \r in Mac versions prior to OS X, \r\n in Windows. When a file is opened in text mode and read, Python replaces the OS-specific end-of-line character read from the file with just \n. And vice versa: when you try to write \n to a file opened in text mode, it writes the OS-specific EOL character. You can find your OS's default EOL by checking os.linesep.
When a file is opened in binary mode, no mapping takes place. What you read is what you get. Remember, text mode is the default mode. So if you are handling non-text files (images, video, etc.), make sure you open the file in binary mode, otherwise you’ll end up messing up the file by introducing (or removing) some bytes.
Python also has a universal newline mode. When a file is opened in this mode, Python maps all of the characters \r, \n and \r\n to \n.
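A small illustration of the points above (hypothetical filename; output depends on the platform):
import os

print(repr(os.linesep))                   # '\r\n' on Windows, '\n' on Unix

with open('eol_demo.txt', 'w') as f:
    f.write('line1\n')                    # text mode: '\n' becomes the OS-specific EOL

print(os.path.getsize('eol_demo.txt'))   # 7 on Windows (CRLF), 6 on Unix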
For clarification, and to answer Agostino's comment/question (I don't have sufficient reputation to comment, so bear with me stating this as an answer...):
In Python 2 on non-Windows platforms, no line-end modification happens, in neither text nor binary mode; as stated before, in Python 2 Chris Drappier's answer applies (note that his link nowadays points to the 3.x Python docs, but Chris' quoted text is of course from the Python 2 input and output tutorial).
So no, it is not true that opening a file in text mode with Python 2 on non-Windows does any line-end modification:
0 $ cat data.txt
line1
line2
line3
0 $ file data.txt
data.txt: ASCII text, with CRLF line terminators
0 $ python2.7 -c 'f = open("data.txt"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "r"); print f.readlines()'
['line1\r\n', 'line2\r\n', 'line3\r\n']
0 $ python2.7 -c 'f = open("data.txt", "rb"); print f.readlines()'
It is, however, possible to open the file in universal newline mode in Python 2, which performs exactly said line-end modification:
0 $ python2.7 -c 'f = open("data.txt", "rU"); print f.readlines()'
['line1\n', 'line2\n', 'line3\n']
(the universal newline mode specifier is deprecated as of Python 3.x)
On Python 3, on the other hand, platform-specific line ends do get normalized to '\n' when reading a file in text mode, and '\n' gets converted to the current platform's default line end when writing in text mode (in addition to the bytes<->unicode<->bytes decoding/encoding going on in text mode). E.g. reading a DOS/Windows CRLF-line-ended file on Linux will normalize the line ends to '\n'.
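A short Python 3 sketch of these behaviours (assuming a data.txt with CRLF line endings, like the one in the shell session above):
with open('data.txt', 'rb') as f:
    print(f.read())     # b'line1\r\nline2\r\n...' -- raw bytes, untouched

with open('data.txt', 'r') as f:
    print(f.read())     # 'line1\nline2\n...' -- decoded to str, newlines normalized

with open('data.txt', 'r', newline='') as f:
    print(f.read())     # 'line1\r\nline2\r\n...' -- str, but newline translation disabled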
