I have a CSV file that contains two different newline terminators (\n and \r\n). I want my Python script to use \r\n as the newline terminator and NOT \n. But the problem is that Python's universal newlines feature keeps normalizing everything to \n when I open the file using open().
The strange thing is that it did not normalize my newlines when I wrote this script, which is why I used Python 2.7, and it worked fine. But all of a sudden today it started normalizing everything, and my script no longer works as needed.
How can I disable universal newlines when opening a file using open() (without opening in binary mode)?
You need to open the file in binary mode, as stated in the module documentation:
with open(csvfilename, 'rb') as fileobj:
    reader = csv.reader(fileobj)
From the csv.reader() documentation:
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
In binary mode no line separator translations take place.
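The advice above is for Python 2. In Python 3, where the csv module expects text files, a rough equivalent is to pass newline='' to open(), which still splits lines but leaves the terminators untranslated. Here is a minimal sketch; the helper name and the sample data are made up for illustration:

```python
import os
import tempfile

def read_lines_untranslated(path):
    """Return the lines of a text file with their original terminators intact."""
    # newline='' still splits on \n, \r and \r\n, but does not translate them to \n
    with open(path, 'r', newline='') as f:
        return f.readlines()

# Hypothetical sample: one CRLF-terminated line, one LF-terminated line
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'a,b\r\nc,d\n')
    sample_path = tmp.name

lines = read_lines_untranslated(sample_path)
os.remove(sample_path)
print(lines)
```

Each returned line keeps whichever terminator it had in the file, so the script can tell \r\n and \n apart.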
According to https://docs.python.org/3/library/csv.html
If csvfile is a file object, it should be opened with newline=''.
Why? I have tested it both ways and it seems to work equally well either way. Are there some semi-valid CSV files that will work only if the above instruction is followed?
From the footnote on the page:
If newline='' is not specified, newlines embedded inside quoted fields
will not be interpreted correctly, and on platforms that use \r\n
line endings on write an extra \r will be added. It should always be
safe to specify newline='', since the csv module does its own
(universal) newline handling.
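To see the difference the footnote describes, here is a small sketch (Python 3) that reads the same file with and without newline=''. The helper read_rows and the sample file are hypothetical and only there to illustrate the point:

```python
import csv
import os
import tempfile

def read_rows(path, **open_kwargs):
    """Read all rows of a CSV file, forwarding extra keyword arguments to open()."""
    with open(path, 'r', **open_kwargs) as f:
        return list(csv.reader(f))

# Hypothetical sample: a quoted field containing an embedded CRLF
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'id,"first\r\nsecond"\r\n')
    sample_path = tmp.name

rows_recommended = read_rows(sample_path, newline='')  # csv sees the raw \r\n
rows_default = read_rows(sample_path)                  # \r\n was already translated to \n
os.remove(sample_path)

print(rows_recommended)
print(rows_default)
```

Both calls parse the file, which is probably why a quick test "works either way", but only the newline='' version keeps the embedded \r\n in the quoted field unchanged.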
In a python script, I need to detect the endline terminator of different csv files. These endline terminators could be: '\r' (mac), '\r\n' (windows), '\n' (unix).
I tried with:
dialecto = csv.Sniffer().sniff(csvfile.read(2048), delimiters=",;")
dialecto.lineterminator
But it doesn't work.
How could I do that?
EDIT:
Based on abarnert's response:
def getLineterminator(file):
    with open(file, 'rU') as csvfile:
        next(csvfile)  # read one line so that newlines gets populated
        return csvfile.newlines
You can't use the csv module to auto-detect line terminators this way. The Sniffer that you're using is designed to guess between CSV dialects for use by csv.reader. But, as the docs say, csv.reader actually ignores lineterminator and handles line endings interchangeably, so Sniffer doesn't have any reason to set it.
But really, a CSV file with XXX line terminators is just a text file with XXX line terminators. The fact that it's CSV is irrelevant. Just open the file in text mode, read a line out of it, and check its newlines attribute:
next(file)
file.newlines
In Python 3, as long as you opened the file in text mode (don't use a 'b' in the mode), this will work. In Python 2.x, you may need to specify universal newlines mode (don't use a 'b', and also do use a 'U'). If you're writing code for both versions, you can use universal newlines mode, and older 3.x releases will simply ignore it, but don't do that unless you need it: the 'U' flag was deprecated in 3.4 and removed in 3.11, where it is rejected as an invalid mode.
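Put together for Python 3, the approach could look like the sketch below. The names detect_line_terminator and write_sample, and the sample data, are assumptions for illustration. Note that newlines reports what the decoder has seen so far, so a file with mixed endings may give a tuple rather than a single string:

```python
import os
import tempfile

def detect_line_terminator(path):
    """Return the line terminator(s) Python has seen after reading the first line."""
    with open(path, 'r') as f:  # text mode with the default universal newlines
        f.readline()
        # '\n', '\r', '\r\n', a tuple if several kinds were seen, or None if none
        return f.newlines

def write_sample(data):
    """Write bytes to a temporary file and return its path (illustration only)."""
    with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
        tmp.write(data)
        return tmp.name

detected = []
for sample in (b'a,b\r\nc,d\r\n', b'a,b\nc,d\n', b'a,b\rc,d\re,f'):
    path = write_sample(sample)
    detected.append(detect_line_terminator(path))
    os.remove(path)

print(detected)
```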
I have some templates that all end in LF. I can fill in some terms with format and still get LF files by opening them with "wb".
Those templates are used in a deployment script on a Windows machine to deploy on a Unix server.
The problem is that a lot of people are going to mess with those templates, and I'm 100% sure that some of them will put some CRLF inside.
How could I, using Python, convert all the CRLF to LF?
Convert line endings in-place (with Python 3)
Line endings:
Windows - \r\n, called CRLF
Linux/Unix/MacOS - \n, called LF
Windows to Linux/Unix/MacOS (CRLF ➡ LF)
Here is a short Python script for directly converting Windows line endings to Linux/Unix/MacOS line endings. The script works in-place, i.e., without creating an extra output file.
# replacement strings
WINDOWS_LINE_ENDING = b'\r\n'
UNIX_LINE_ENDING = b'\n'
# relative or absolute file path, e.g.:
file_path = r"c:\Users\Username\Desktop\file.txt"
with open(file_path, 'rb') as open_file:
    content = open_file.read()
# Windows ➡ Unix
content = content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)
# Unix ➡ Windows
# content = content.replace(UNIX_LINE_ENDING, WINDOWS_LINE_ENDING)
with open(file_path, 'wb') as open_file:
    open_file.write(content)
Linux/Unix/MacOS to Windows (LF ➡ CRLF)
To convert in the other direction, from Linux/Unix/MacOS to Windows, simply uncomment the Unix ➡ Windows replacement (remove the # in front of the line).
DO NOT comment out the Windows ➡ Unix replacement, as it ensures a correct conversion. When converting from LF to CRLF, it is important that there are no CRLF line endings already present in the file. Otherwise, those lines would be converted to CRCRLF. Converting lines from CRLF to LF first and then doing the desired conversion from LF to CRLF avoids this issue (thanks #neuralmer for pointing that out).
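The CRCRLF problem is easy to reproduce on a bytes object with mixed endings. The following sketch uses two made-up helper names (to_lf, to_crlf) to compare the naive replacement with the two-step one:

```python
def to_lf(data):
    """Convert CRLF line endings in a bytes object to LF."""
    return data.replace(b'\r\n', b'\n')

def to_crlf(data):
    """Convert LF line endings to CRLF without producing CRCRLF."""
    # Normalize to LF first so that existing CRLF endings are not doubled
    return to_lf(data).replace(b'\n', b'\r\n')

mixed = b'first\r\nsecond\nthird\n'
naive = mixed.replace(b'\n', b'\r\n')  # doubles the CR on the first line
safe = to_crlf(mixed)

print(naive)
print(safe)
```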
Code Explanation
Binary Mode
Important: We need to make sure that we open the file both times in binary mode (mode='rb' and mode='wb') for the conversion to work.
When opening files in text mode (mode='r' or mode='w' without b), the platform's native line endings (\r\n on Windows and \r on old Mac OS versions) are automatically converted to Python's Unix-style line endings: \n. So the call to content.replace() couldn't find any \r\n line endings to replace.
In binary mode, no such conversion is done. Therefore the call to bytes.replace() can do its work.
Binary Strings
In Python 3, if not declared otherwise, string literals are Unicode text (str). But we open our files in binary mode, which gives us bytes, so we need to add b in front of our replacement strings to make them bytes objects, too.
Raw Strings
On Windows the path separator is a backslash \ which we would need to escape in a normal Python string with \\. By adding r in front of the string we create a so-called "raw string" which doesn't need any escaping. So you can directly copy/paste the path from Windows Explorer into your script.
(Hint: Inside Windows Explorer press CTRL+L to automatically select the path from the address bar.)
Alternative solution
We open the file twice to avoid having to reposition the file pointer. We could also have opened the file once with mode='rb+', but then we would have needed to move the pointer back to the start after reading its content (open_file.seek(0)) and truncate its original content before writing the new one (open_file.truncate(0)).
Simply opening the file again in write mode does that automatically for us.
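For completeness, here is a sketch of that single-handle variant. The function name and the temporary sample file are assumptions added for illustration; the seek and truncate calls are the ones described above:

```python
import os
import tempfile

def convert_crlf_to_lf_in_place(path):
    """Replace CRLF with LF in a file, using a single 'rb+' handle."""
    with open(path, 'rb+') as f:
        content = f.read()
        f.seek(0)       # move the pointer back to the start
        f.truncate(0)   # discard the original content
        f.write(content.replace(b'\r\n', b'\n'))

# Hypothetical sample file, created only to exercise the function
with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'one\r\ntwo\r\n')
    sample_path = tmp.name

convert_crlf_to_lf_in_place(sample_path)
with open(sample_path, 'rb') as f:
    converted = f.read()
os.remove(sample_path)

print(converted)
```

It is a little more to get right than opening the file twice, which is why the two-step version is used in the main script.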
Cheers and happy programming,
winklerrr
Python 3:
The default newline type for open is universal, in which case it doesn't mind which sort of newline each line has.
You can also request a specific form of newline with the newline argument for open.
Translating from one form to the other is thus rather simple in Python:
with open('filename.in', 'r') as infile, \
     open('filename.out', 'w', newline='\n') as outfile:
    outfile.writelines(infile.readlines())
Python 2:
The open function supports universal newlines via the 'rU' mode.
Again, translating from one form to the other:
# Python 2's built-in open() has no newline argument; 'wb' writes '\n' as-is
with open('filename.in', 'rU') as infile, \
     open('filename.out', 'wb') as outfile:
    outfile.writelines(infile.readlines())
(In Python 3, mode U is actually deprecated; the equivalent form is newline=None, which is the default)
Why don't you try the following:
text = text.replace('\r\n', '\n')
CRLF => \r\n
LF => \n
It is possible to fix existing templates with messed-up line endings with this code:
# binary mode, so that neither the read nor the write translates newlines
with open('file.tpl', 'rb') as template:
    lines = [line.replace(b'\r\n', b'\n') for line in template]
with open('file.tpl', 'wb') as template:
    template.writelines(lines)
I am trying to write to a Notepad file with binary encodings, each separated by a newline. The gist of the code is as follows:
with open("filedir","ab") as Afile:
    Afile.write(info+"\n")
However, the outputs are just appended one after another, with no line breaks between them.
If you're writing to a binary file (like you say) and you want it to work properly on Windows (I'm assuming you're on Windows since you're talking about notepad), then you need to use the Windows line endings "\r\n". Given that you're trying to write line endings in the proper "encoding" I'd have to ask why you want to use binary mode, given that all it does is disable converting "\n" into "\r\n" on Windows.
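If binary mode really is needed, a minimal sketch of writing the Windows line ending explicitly might look like this. It assumes Python 3 and that each record is already a bytes object; append_record and the sample records are made-up names and data:

```python
import os
import tempfile

def append_record(path, record):
    """Append one bytes record followed by a Windows line ending."""
    with open(path, 'ab') as f:
        # Binary mode does no newline translation, so \r\n is written explicitly
        f.write(record + b'\r\n')

# Hypothetical target file, created only to exercise the function
fd, sample_path = tempfile.mkstemp()
os.close(fd)
append_record(sample_path, b'01000001')
append_record(sample_path, b'01000010')
with open(sample_path, 'rb') as f:
    written = f.read()
os.remove(sample_path)

print(written)
```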
I'm a Python newbie and had a quick question regarding memory usage when reading large text files. I have a ~13GB CSV that I'm trying to read line by line, following the Python documentation and more experienced Python users' advice to not use readlines() in order to avoid loading the entire file into memory.
When trying to read a line from the file I get the error below and am not sure what might be causing it. Besides this error, I also notice my PC's memory usage is excessively high. This was a little surprising since my understanding of the readline function is that it only loads a single line from the file at a time into memory.
For reference, I'm using Continuum Analytics' Anaconda distribution of Python 2.7 and PyScripter as my IDE for debugging and testing. Any help or insight is appreciated.
with open(R'C:\temp\datasets\a13GBfile.csv','r') as f:
    foo = f.readline()  #<-- Err: SystemError: ..\Objects\stringobject.c:3902 bad argument to internal function
UPDATE:
Thank you all for the quick, informative and very helpful feedback, I reviewed the referenced link which is exactly the problem I was having. After applying the documented 'rU' option mode I was able to read lines from the file like normal. I didn't notice this mode mentioned in the documentation link I was referencing initially and neglected to look at the details for the open function first. Thanks again.
Unix text files end each line with \n.
Windows text files end each line with \r\n.
When you open a file in text mode, 'r', Python assumes it has the native line endings for your platform.
So, if you open a Unix text file on Windows, Python will look for \r\n sequences to split the lines. But there won't be any, so it'll treat your whole file as one giant 13-billion-character line. So that readline() call ends up trying to read the whole thing into memory.
The fix for this is to use universal newlines mode, by opening the file in mode rU. As explained in the docs for open:
supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'.
So, instead of searching for \r\n sequences to split the lines, it looks for \r\n, or \n, or \r. And there are millions of \n. So, the problem is solved.
A different way to fix this is to use binary mode, 'rb'. In this mode, Python doesn't do any conversion at all, and assumes all lines end in \n, no matter what platform you're on.
On its own, this is pretty hacky—it means you'll end up with an extra \r on the end of every line in a Windows text file.
But it means you can pass the file on to a higher-level file reader like csv that wants binary files, so it can parse them the way it wants to. On top of magically solving this problem for you, a higher-level library will also probably make the rest of your code a lot simpler and more robust. For example, it might look something like this:
import csv

with open(R'C:\temp\datasets\a13GBfile.csv','rb') as f:
    for row in csv.reader(f):
        pass  # do stuff
Now each row is automatically split on commas, except that commas that are inside quotes or escaped in the appropriate way don't count, and so on, so all you need to deal with is a list of column values.
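The snippet above is for Python 2.7, as used in the question. In Python 3 the csv module wants a text file opened with newline='' instead of 'rb'. A minimal sketch of streaming a file that way, one row at a time, could look like this (count_rows and the small sample file are hypothetical):

```python
import csv
import os
import tempfile

def count_rows(path):
    """Count CSV rows lazily, holding only one row in memory at a time."""
    with open(path, 'r', newline='') as f:
        return sum(1 for _ in csv.reader(f))

# Hypothetical sample with mixed line endings and a quoted comma
with tempfile.NamedTemporaryFile('wb', suffix='.csv', delete=False) as tmp:
    tmp.write(b'a,b\nc,"x,y"\r\ne,f\n')
    sample_path = tmp.name

row_count = count_rows(sample_path)
os.remove(sample_path)

print(row_count)
```

Because csv.reader pulls lines from the file object on demand, memory use stays small no matter how large the file is.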