Get csv file line terminator - python

In a python script, I need to detect the endline terminator of different csv files. These endline terminators could be: '\r' (mac), '\r\n' (windows), '\n' (unix).
I tried with:
dialecto = csv.Sniffer().sniff(csvfile.read(2048), delimiters=",;")
dialecto.lineterminator
But it doesn't work.
How could I do that?
EDIT:
Based on abarnert response:
def getLineterminator(file):
    with open(file, 'rU') as csvfile:
        csvfile.next()
        return csvfile.newlines

You can't use the csv module to auto-detect line terminators this way. The Sniffer that you're using is designed to guess between CSV dialects for use by csv.reader. But, as the docs say, csv.reader actually ignores lineterminator and handles line endings interchangeably, so Sniffer doesn't have any reason to set it.
But really, a CSV file with XXX line terminators is just a text file with XXX line terminators. The fact that it's CSV is irrelevant. Just open the file in text mode, read a line out of it, and check its newlines attribute:
next(file)
file.newlines
In Python 3, as long as you opened the file in text mode (don't use a 'b' in the mode), this will work. In Python 2.x, you may need to specify universal newlines mode (don't use a 'b', and also do use a 'U'). If you're writing code for both versions, you can use universal newlines mode, and older 3.x versions will just ignore it. But don't do that unless you need it: the 'U' flag is deprecated in Python 3 and was removed in Python 3.11, where passing it raises an error.
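As a minimal sketch of the approach above for Python 3 (the function name get_line_terminator is just an illustration, not part of any library): open the file in text mode, read one line, and return the newlines attribute. The result can be None if no line ending has been seen yet, a single string, or a tuple if several kinds have been seen so far.

```python
def get_line_terminator(path):
    """Return the line terminator(s) seen after reading the first line."""
    # newline=None is the default: universal newlines mode, which records
    # the terminators it has translated in the `newlines` attribute.
    with open(path, 'r', newline=None) as f:
        f.readline()
        return f.newlines
```

Note that Python may read ahead in the file, so for a file with mixed line endings the attribute can report more than the terminator of the first line.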

Related

CSV file opened with newline='' - why?

According to https://docs.python.org/3/library/csv.html
If csvfile is a file object, it should be opened with newline=''.
Why? I have tested it both ways and it seems to work equally well either way. Are there some semi-valid CSV files that will work only if the above instruction is followed?
From the footnote on the page:
If newline='' is not specified, newlines embedded inside quoted fields
will not be interpreted correctly, and on platforms that use \r\n
linendings on write an extra \r will be added. It should always be
safe to specify newline='', since the csv module does its own
(universal) newline handling.
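To make the footnote concrete, here is a small sketch (the helper name roundtrip and the sample rows are made up for illustration). It writes a row whose quoted field contains an embedded '\r\n', then reads it back once with newline='' and once with the default newline=None.

```python
import csv
import os
import tempfile


def roundtrip(rows, read_newline):
    """Write rows with newline='' and read them back with the given setting."""
    fd, path = tempfile.mkstemp(suffix='.csv')
    os.close(fd)
    try:
        with open(path, 'w', newline='') as f:
            csv.writer(f).writerows(rows)
        with open(path, 'r', newline=read_newline) as f:
            return list(csv.reader(f))
    finally:
        os.remove(path)


rows = [['id', 'note'], ['1', 'first line\r\nsecond line']]
kept = roundtrip(rows, '')       # embedded '\r\n' survives unchanged
altered = roundtrip(rows, None)  # universal newlines rewrite it to '\n'
```

With newline='' the data comes back identical to what was written. With the default setting the file still parses, which is why simple tests seem to work either way, but the embedded line break inside the field is silently changed.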

How to convert CRLF to LF on a Windows machine in Python

I have some templates that all end in LF. I can fill in some terms with format and still get LF files by opening them with "wb".
These templates are used in a deployment script on a Windows machine to deploy to a Unix server.
The problem is that a lot of people are going to edit those templates, and I'm 100% sure that some of them will introduce CRLF line endings.
How can I convert all the CRLF to LF using Python?
Convert line endings in-place (with Python 3)
Line endings:
Windows - \r\n, called CRLF
Linux/Unix/MacOS - \n, called LF
Windows to Linux/Unix/MacOS (CRLF ➡ LF)
Here is a short Python script for directly converting Windows line endings to Linux/Unix/MacOS line endings. The script works in-place, i.e., without creating an extra output file.
# replacement strings
WINDOWS_LINE_ENDING = b'\r\n'
UNIX_LINE_ENDING = b'\n'

# relative or absolute file path, e.g.:
file_path = r"c:\Users\Username\Desktop\file.txt"

with open(file_path, 'rb') as open_file:
    content = open_file.read()

# Windows ➡ Unix
content = content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)

# Unix ➡ Windows
# content = content.replace(UNIX_LINE_ENDING, WINDOWS_LINE_ENDING)

with open(file_path, 'wb') as open_file:
    open_file.write(content)
Linux/Unix/MacOS to Windows (LF ➡ CRLF)
To switch the conversion to Linux/Unix/MacOS ➡ Windows, simply uncomment the replacement for Unix ➡ Windows (remove the # in front of the line).
DO NOT comment out the command for the Windows ➡ Unix replacement, as it ensures a correct conversion. When converting from LF to CRLF, it is important that there are no CRLF line endings already present in the file. Otherwise, those lines would be converted to CRCRLF. Converting lines from CRLF to LF first and then doing the desired conversion from LF to CRLF avoids this issue (thanks to neuralmer for pointing that out).
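The two-step conversion described above can be sketched as a small helper (the name to_crlf is only illustrative). It works on bytes, matching the binary-mode approach of the script.

```python
def to_crlf(data: bytes) -> bytes:
    """Convert LF (and any existing CRLF) line endings to CRLF."""
    # Step 1: collapse existing CRLF to LF so nothing gets doubled.
    # Step 2: expand every LF to CRLF.
    return data.replace(b'\r\n', b'\n').replace(b'\n', b'\r\n')


mixed = b'one\r\ntwo\nthree'
naive = mixed.replace(b'\n', b'\r\n')  # skips step 1: yields b'one\r\r\ntwo\r\nthree'
safe = to_crlf(mixed)                  # yields b'one\r\ntwo\r\nthree'
```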
Code Explanation
Binary Mode
Important: We need to make sure that we open the file both times in binary mode (mode='rb' and mode='wb') for the conversion to work.
When opening files in text mode (mode='r' or mode='w' without b), line endings are translated automatically: on reading, \r\n and \r are converted to Python's Unix-style \n, and on writing, \n is converted to the platform's native line ending. So the call to content.replace() couldn't find any \r\n line endings to replace.
In binary mode, no such conversion is done. Therefore the call to bytes.replace() can do its work.
Binary Strings
In Python 3, string literals are Unicode text (str) unless declared otherwise. But we open our files in binary mode, which yields bytes, so we need to add b in front of our replacement strings to make them bytes literals, too.
Raw Strings
On Windows the path separator is a backslash \ which we would need to escape in a normal Python string with \\. By adding r in front of the string we create a so called "raw string" which doesn't need any escaping. So you can directly copy/paste the path from Windows Explorer into your script.
(Hint: Inside Windows Explorer press CTRL+L to automatically select the path from the address bar.)
Alternative solution
We open the file twice to avoid the need of repositioning the file pointer. We could also have opened the file once with mode='rb+' but then we would have needed to move the pointer back to start after reading its content (open_file.seek(0)) and truncate its original content before writing the new one (open_file.truncate(0)).
Simply opening the file again in write mode does that automatically for us.
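For comparison, the single-open variant described above might look like the following sketch (the function name crlf_to_lf_in_place is hypothetical). Here the file is truncated after writing rather than before, which has the same effect.

```python
def crlf_to_lf_in_place(file_path):
    """Convert CRLF to LF using a single 'rb+' file handle."""
    with open(file_path, 'rb+') as open_file:
        content = open_file.read()
        open_file.seek(0)  # move the pointer back to the start
        open_file.write(content.replace(b'\r\n', b'\n'))
        open_file.truncate()  # drop leftover bytes from the longer original
```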
Cheers and happy programming,
winklerrr
Python 3:
The default newline type for open is universal, in which case it doesn't mind which sort of newline each line has.
You can also request a specific form of newline with the newline argument for open.
Translating from one form to the other is thus rather simple in Python:
with open('filename.in', 'r') as infile, \
     open('filename.out', 'w', newline='\n') as outfile:
    outfile.writelines(infile.readlines())
Python 2:
The open function supports universal newlines via the 'rU' mode.
Again, translating from one form to the other:
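# The built-in open() in Python 2 has no newline argument;
# binary write mode ('wb') writes '\n' unchanged on every platform.
with open('filename.in', 'rU') as infile, \
     open('filename.out', 'wb') as outfile:
    outfile.writelines(infile.readlines())
(In Python 3, mode U is deprecated and was removed in Python 3.11; the equivalent form is newline=None, which is the default.)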
Why not try the following, where text is the string to convert?
text = text.replace('\r\n', '\n')
CRLF => \r\n
LF => \n
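It is possible to fix existing templates with messed-up line endings with this code (Python 3; newline='' keeps the original terminators when reading, and newline='\n' stops Windows from writing CRLF again):
with open('file.tpl', newline='') as template:
    lines = [line.replace('\r\n', '\n') for line in template]
with open('file.tpl', 'w', newline='\n') as template:
    template.writelines(lines)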

How to disable universal newlines in Python 2.7 when using open()

I have a csv file that contains two different newline terminators (\n and \r\n). I want my Python script to use \r\n as the newline terminator and NOT \n. But the problem is that Python's universal newlines feature keeps normalizing everything to be \n when I open the file using open().
The strange thing is that it never used to normalize my newlines when I wrote this script, that's why I used Python 2.7 and it worked fine. But all of a sudden today it started normalizing everything and my script no longer works as needed.
How can I disable universal newlines when opening a file using open() (without opening in binary mode)?
You need to open the file in binary mode, as stated in the module documentation:
with open(csvfilename, 'rb') as fileobj:
    reader = csv.reader(fileobj)
From the csv.reader() documentation:
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
In binary mode no line separator translations take place.
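As a rough illustration of what "no translations" buys you, the sketch below reads raw bytes and splits records on '\r\n' only, leaving any bare '\n' inside a record untouched. The function name split_on_crlf is made up for this example, and it is not a replacement for csv.reader, only a way to see the untranslated data.

```python
def split_on_crlf(path):
    """Split a file into records on CRLF only, keeping bare LF characters."""
    with open(path, 'rb') as fileobj:
        records = fileobj.read().split(b'\r\n')
    # A file ending in CRLF yields a trailing empty record; drop it.
    if records and records[-1] == b'':
        records.pop()
    return records
```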

trouble with csv reader in python 2

I am having trouble getting Python 2 to loop through a .csv file. The code below throws this error:
>>> import csv
>>> with open('test.csv', 'rb') as f:
...     reader = csv.reader(f)
...     for row in reader:
...         print row
...
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
The python 3 version of this works fine but I need this to run for 2. Any ideas what I am doing wrong?
You need to open using open('test.csv', 'rU')
universal newlines
relevant info from the docs here:
A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use
and here
In addition to the standard fopen() values mode may be 'U' or 'rU'. Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.
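For reference, the Python 3 counterpart (which the question notes already works) could be sketched as follows. It assumes the file is opened with newline='', as the csv documentation recommends. The helper name read_rows is illustrative only.

```python
import csv


def read_rows(path):
    """Read all rows from a CSV file, whatever its line terminator."""
    # newline='' hands line endings to the csv module untranslated;
    # the reader itself accepts '\r', '\n' and '\r\n' as end-of-line.
    with open(path, newline='') as f:
        return list(csv.reader(f))
```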

Removing newline from a csv file

I am trying to process a csv file in Python that has a ^M character in the middle of each row/line, which is a newline. I can't open the file in any mode other than 'rU'.
If I do open the file in the 'rU' mode, it reads in the newline and splits the file (creating a newline) and gives me twice the number of rows.
I want to remove the newline altogether. How?
Note that, as the docs say:
csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called — file objects and list objects are both suitable.
So, you can always stick a filter on the file before handing it to your reader or DictReader. Instead of this:
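with open('myfile.csv', 'rU') as myfile:
    for row in csv.reader(myfile):
        print(row)
Do this:
# Binary mode: 'rU' would already have turned each '\r' into '\n'
# before the filter ran, leaving nothing to strip.
with open('myfile.csv', 'rb') as myfile:
    filtered = (line.replace('\r', '') for line in myfile)
    for row in csv.reader(filtered):
        print(row)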
That '\r' is the Python (and C) way of spelling ^M. So, this just strips all ^M characters out, no matter where they appear, by replacing each one with an empty string.
I guess I want to modify the file permanently as opposed to filtering it.
First, if you want to modify the file before running your Python script on it, why not do that from outside of Python? sed, tr, many text editors, etc. can all do this for you. Here's a GNU sed example:
gsed -i'' 's/\r//g' myfile.csv
But if you want to do it in Python, it's not that much more verbose, and you might find it more readable, so:
First, you can't really modify a file in-place if you want to insert or delete from the middle. The usual solution is to write a new file, and either move the new file over the old one (Unix only) or delete the old one (cross-platform).
The cross-platform version:
import os

os.rename('myfile.csv', 'myfile.csv.bak')
# Binary mode on both sides, so '\r' reaches replace() untranslated.
with open('myfile.csv.bak', 'rb') as infile, open('myfile.csv', 'wb') as outfile:
    for line in infile:
        outfile.write(line.replace(b'\r', b''))
os.remove('myfile.csv.bak')
The less-clunky, but Unix-only, version:
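import os
import tempfile

# Create the temp file in the current directory so the rename stays on one filesystem.
temp = tempfile.NamedTemporaryFile(dir='.', delete=False)
with open('myfile.csv', 'rb') as myfile, temp:
    for line in myfile:
        temp.write(line.replace(b'\r', b''))
os.rename(temp.name, 'myfile.csv')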
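On Python 3.3 and later, os.replace overwrites the destination on both Unix and Windows, so the temp-file approach no longer has to be Unix-only. A minimal sketch, assuming Python 3 (the function name strip_carriage_returns is hypothetical):

```python
import os
import tempfile


def strip_carriage_returns(path):
    """Remove every '\r' byte from a file by rewriting it via a temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Keep the temp file next to the target so the final move stays on
    # one filesystem.
    with open(path, 'rb') as src, tempfile.NamedTemporaryFile(
            'wb', dir=directory, delete=False) as tmp:
        for line in src:
            tmp.write(line.replace(b'\r', b''))
    # Both files are closed here; os.replace overwrites the original.
    os.replace(tmp.name, path)
```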
