How to convert CRLF to LF on a Windows machine in Python - python

I have some templates that all end in LF. I can fill in some terms with format and still get LF files by opening them with "wb".
These templates are used in a deployment script on a Windows machine to deploy to a Unix server.
The problem is that a lot of people are going to edit those templates, and I'm 100% sure that some of them will introduce CRLF line endings.
How can I convert all the CRLF to LF using Python?

Convert line endings in-place (with Python 3)
Line endings:
Windows - \r\n, called CRLF
Linux/Unix/MacOS - \n, called LF
Windows to Linux/Unix/MacOS (CRLF ➡ LF)
Here is a short Python script for directly converting Windows line endings to Linux/Unix/MacOS line endings. The script works in-place, i.e., without creating an extra output file.
# replacement strings
WINDOWS_LINE_ENDING = b'\r\n'
UNIX_LINE_ENDING = b'\n'
# relative or absolute file path, e.g.:
file_path = r"c:\Users\Username\Desktop\file.txt"
with open(file_path, 'rb') as open_file:
    content = open_file.read()
# Windows ➡ Unix
content = content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)
# Unix ➡ Windows
# content = content.replace(UNIX_LINE_ENDING, WINDOWS_LINE_ENDING)
with open(file_path, 'wb') as open_file:
    open_file.write(content)
Linux/Unix/MacOS to Windows (LF ➡ CRLF)
To convert in the other direction, from Linux/Unix/MacOS to Windows, simply comment the replacement for Unix ➡ Windows back in (remove the # in front of the line).
DO NOT comment out the command for the Windows ➡ Unix replacement, as it ensures a correct conversion. When converting from LF to CRLF, it is important that there are no CRLF line endings already present in the file. Otherwise, those lines would be converted to CRCRLF. Converting lines from CRLF to LF first and then doing the desired conversion from LF to CRLF avoids this issue (thanks @neuralmer for pointing that out).
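The normalise-first rule can be wrapped in two small helpers. This is only a sketch: the names to_lf and to_crlf are made up for illustration, and both expect bytes that were read in binary mode.

```python
WINDOWS_LINE_ENDING = b'\r\n'
UNIX_LINE_ENDING = b'\n'


def to_lf(content: bytes) -> bytes:
    """Replace every CRLF with LF; lone LFs are left as they are."""
    return content.replace(WINDOWS_LINE_ENDING, UNIX_LINE_ENDING)


def to_crlf(content: bytes) -> bytes:
    """Normalise to LF first so existing CRLFs do not become CRCRLF."""
    return to_lf(content).replace(UNIX_LINE_ENDING, WINDOWS_LINE_ENDING)


mixed = b'one\r\ntwo\nthree\n'
print(to_lf(mixed))    # b'one\ntwo\nthree\n'
print(to_crlf(mixed))  # b'one\r\ntwo\r\nthree\r\n'
```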
Code Explanation
Binary Mode
Important: We need to make sure that we open the file both times in binary mode (mode='rb' and mode='wb') for the conversion to work.
When opening files in text mode (mode='r' or mode='w' without b), line endings are translated automatically: on reading, \r\n (Windows) and \r (old Mac OS versions) are converted to Python's Unix-style line ending \n. So the call to content.replace() wouldn't find any \r\n line endings to replace.
In binary mode, no such conversion is done. Therefore the call to bytes.replace() can do its work.
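To see the difference, here is a small sketch that writes a throwaway file with CRLF endings and reads it back both ways. The temporary file and the variable names are only for illustration.

```python
import os
import tempfile

# Write a small file with Windows line endings, byte for byte.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'first\r\nsecond\r\n')

# Text mode (default newline=None): every \r\n arrives as \n.
with open(path, 'r', encoding='ascii') as f:
    text = f.read()

# Binary mode: the bytes come back untouched.
with open(path, 'rb') as f:
    raw = f.read()

os.remove(path)

print('\r\n' in text)  # False
print(b'\r\n' in raw)  # True
```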
Binary Strings
In Python 3, string literals are Unicode str objects unless declared otherwise. But we open our files in binary mode, which yields bytes. Therefore we need to add b in front of our replacement strings to make them bytes objects, too.
Raw Strings
On Windows the path separator is a backslash \ which we would need to escape in a normal Python string with \\. By adding r in front of the string we create a so called "raw string" which doesn't need any escaping. So you can directly copy/paste the path from Windows Explorer into your script.
(Hint: Inside Windows Explorer press CTRL+L to automatically select the path from the address bar.)
Alternative solution
We open the file twice to avoid the need of repositioning the file pointer. We could also have opened the file once with mode='rb+' but then we would have needed to move the pointer back to start after reading its content (open_file.seek(0)) and truncate its original content before writing the new one (open_file.truncate(0)).
Simply opening the file again in write mode does that automatically for us.
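For comparison, the single-open variant described above could look roughly like this. It is a sketch only; the helper name convert_in_place is hypothetical and the demo runs on a temporary file.

```python
import os
import tempfile


def convert_in_place(path):
    """Rewrite `path` with LF line endings using a single open() call."""
    with open(path, 'rb+') as open_file:
        content = open_file.read()
        content = content.replace(b'\r\n', b'\n')
        open_file.seek(0)      # move the pointer back to the start
        open_file.truncate(0)  # drop the old, longer content
        open_file.write(content)


# Small self-check on a throwaway file.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'a\r\nb\r\n')
convert_in_place(demo_path)
with open(demo_path, 'rb') as f:
    result = f.read()
os.remove(demo_path)
print(result)  # b'a\nb\n'
```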
Cheers and happy programming,
winklerrr

Python 3:
The default newline type for open is universal, in which case it doesn't mind which sort of newline each line has.
You can also request a specific form of newline with the newline argument for open.
Translating from one form to the other is thus rather simple in Python:
with open('filename.in', 'r') as infile, \
        open('filename.out', 'w', newline='\n') as outfile:
    outfile.writelines(infile.readlines())
Python 2:
The open function supports universal newlines via the 'rU' mode.
Again, translating from one form to the other:
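# Python 2's built-in open() has no newline argument, so write in
# binary mode to keep '\n' from being translated on Windows.
with open('filename.in', 'rU') as infile, \
        open('filename.out', 'wb') as outfile:
    outfile.writelines(infile.readlines())
(In Python 3, mode U is deprecated and was removed in 3.11; the equivalent form is newline=None, which is the default.)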

Why don't you try the following on the content you read:
content = content.replace('\r\n', '\n')
CRLF => \r\n
LF => \n

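It is possible to fix existing templates with messed-up line endings with this code (Python 3). newline='' keeps the original endings when reading, and newline='\n' stops Windows from writing CRLF again:
with open('file.tpl', newline='') as template:
    lines = [line.replace('\r\n', '\n') for line in template]
with open('file.tpl', 'w', newline='\n') as template:
    template.writelines(lines)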
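Since many people will touch the templates, it may be handy to run the conversion over a whole folder before deploying. The following is a minimal sketch under assumptions: the function name normalise_templates and the '*.tpl' pattern are hypothetical, and files are processed as bytes so no encoding has to be guessed.

```python
import tempfile
from pathlib import Path


def normalise_templates(root, pattern='*.tpl'):
    """Convert CRLF to LF in every matching file under `root`.

    Returns the list of files that were actually changed.
    """
    changed = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        original = path.read_bytes()
        converted = original.replace(b'\r\n', b'\n')
        if converted != original:
            path.write_bytes(converted)
            changed.append(path)
    return changed


# Demo on a throwaway directory with one CRLF file and one LF file.
with tempfile.TemporaryDirectory() as demo_dir:
    crlf_file = Path(demo_dir) / 'a.tpl'
    lf_file = Path(demo_dir) / 'b.tpl'
    crlf_file.write_bytes(b'Hello {name}\r\n')
    lf_file.write_bytes(b'Bye {name}\n')
    changed_names = [p.name for p in normalise_templates(demo_dir)]
    fixed = crlf_file.read_bytes()

print(changed_names)  # ['a.tpl']
print(fixed)          # b'Hello {name}\n'
```

Only files that really contain CRLF are rewritten, so untouched templates keep their modification times.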

Related

Get csv file line terminator

In a python script, I need to detect the endline terminator of different csv files. These endline terminators could be: '\r' (mac), '\r\n' (windows), '\n' (unix).
I tried with:
dialecto = csv.Sniffer().sniff(csvfile.read(2048), delimiters=",;")
dialecto.lineterminator
But it doesn't work.
How I could do that?
EDIT:
Based on abarnert response:
def getLineterminator(file):
    with open(file, 'rU') as csvfile:
        csvfile.next()
        return csvfile.newlines
You can't use the csv module to auto-detect line terminators this way. The Sniffer that you're using is designed to guess between CSV dialects for use by csv.reader. But, as the docs say, csv.reader actually ignores lineterminator and handles line endings interchangeably, so Sniffer doesn't have any reason to set it.
But really, a CSV file with XXX line terminators is just a text file with XXX line terminators. The fact that it's CSV is irrelevant. Just open the file in text mode, read a line out of it, and check its newlines property:
next(file)
file.newlines
In Python 3, as long as you opened the file in text mode (don't use a 'b' in the mode), this will work. In Python 2.x, you may need to specify universal newlines mode (don't use a 'b', and do use a 'U'). If you're writing code for both versions, you can use universal newlines mode, which is simply ignored in early 3.x releases. Don't do that unless you need it, though: the 'U' flag is deprecated in 3.x and was removed in Python 3.11, where it raises ValueError.
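A Python 3 version of this check might look like the sketch below. The function name detect_line_terminator and the demo file are assumptions for illustration. Note that the text layer reads ahead in chunks, so newlines reports every kind of ending seen so far, which can be a tuple for files with mixed endings.

```python
import os
import tempfile


def detect_line_terminator(path, encoding='utf-8'):
    """Return the line ending(s) seen after reading the first line.

    The result is LF, CR or CRLF as a str, a tuple of them for mixed
    input, or None if no line ending has been read yet.
    """
    with open(path, 'r', encoding=encoding) as f:  # newline=None by default
        f.readline()
        return f.newlines


# Demo on a throwaway CSV file written with Windows line endings.
fd, csv_path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    f.write(b'a,b\r\n1,2\r\n')
terminator = detect_line_terminator(csv_path)
os.remove(csv_path)
print(repr(terminator))  # '\r\n'
```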

What is os.linesep for?

Python's os module contains a value for a platform specific line separating string, but the docs explicitly say not to use it when writing to a file:
Do not use os.linesep as a line terminator when writing files opened in text mode (the default); use a single '\n' instead, on all platforms.
Docs
Previous questions have explored why you shouldn't use it in this context, but then what context is it useful for? When should you use the line separator, and for what?
the docs explicitly say not to use it when writing to a file
Not exactly. The doc says not to use it in text mode.
os.linesep matters when you iterate through the lines of a text file in text mode. The internal scanner recognises the line separator (in the default universal newlines mode it accepts \r\n, \r and \n, not only os.linesep) and replaces it by a single \n.
For illustration, we write a binary file which contains 3 lines separated by \r\n (Windows delimiter):
import io
filename = "text.txt"
content = b'line1\r\nline2\r\nline3'
with io.open(filename, mode="wb") as fd:
    fd.write(content)
The content of the binary file is:
with io.open(filename, mode="rb") as fd:
    for line in fd:
        print(repr(line))
NB: I used the "rb" mode to read the file as a binary file.
I get:
b'line1\r\n'
b'line2\r\n'
b'line3'
If I read the content of the file using the text mode, like this:
with io.open(filename, mode="r", encoding="ascii") as fd:
    for line in fd:
        print(repr(line))
I get:
'line1\n'
'line2\n'
'line3'
The delimiter is replaced by \n.
The os.linesep is also used in write mode. Any \n character is converted to the system default line separator: \r\n on Windows, \n on POSIX, etc.
With the io.open function you can force the line separator to whatever you want.
Example: how to write a Windows text file:
with io.open(filename, mode="w", encoding="ascii", newline="\r\n") as fd:
    fd.write("one\ntwo\nthree\n")
If you read this file in binary mode like this:
with io.open(filename, mode="rb") as fd:
    content = fd.read()
    print(repr(content))
You get:
b'one\r\ntwo\r\nthree\r\n'
As you know, reading and writing files in text mode in Python converts the platform-specific line separator to '\n' and vice versa. But if you read a file in binary mode, no conversion takes place. You can then convert the line endings explicitly, for example with data.replace(os.linesep.encode(), b'\n') on the bytes you read. This can be useful if a file (or stream or whatever) contains a combination of binary and text data.
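As a rough illustration of that last use, the sketch below normalises the platform's native separator in a chunk of bytes. The sample data is made up; on POSIX the replacement is a no-op because os.linesep is already '\n'.

```python
import os

native = os.linesep.encode('ascii')  # b'\r\n' on Windows, b'\n' on POSIX

# Pretend this came from a file or socket read in binary mode.
raw = b'header' + native + b'body' + native

normalised = raw.replace(native, b'\n')
print(normalised)  # b'header\nbody\n'
```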

disable the automatic change from \r\n to \n in python

I am working on Ubuntu on a Python 3.4 script that takes a file as a parameter (encoded in UTF-8, generated on Windows). I have to go through the file line by line (separated by \r\n), knowing that the "lines" contain some '\n' that I want to keep.
My problem is that Python transforms the file's "\r\n" to "\n" when opening it. I've tried opening it with different modes ("r", "rt", "rU").
The only solution I found is to work in binary mode and not text mode, opening with the "rb" mode.
Is there a way to do it without working in binary mode or a proper way to do it?
Set the newline keyword argument to open() to '\r\n', or perhaps to the empty string:
with open(filename, 'r', encoding='utf-8', newline='\r\n') as f:
This tells Python to only split lines on the \r\n line terminator; \n is left untouched in the output. If you set it to '' instead, \n is also seen as a line terminator but \r\n is not translated to \n.
From the open() function documentation:
newline controls how universal newlines mode works (it only applies to text mode). It can be None, '', '\n', '\r', and '\r\n'. [...] If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated. If it has any of the other legal values, input lines are only terminated by the given string, and the line ending is returned to the caller untranslated.
Bold emphasis mine.
From Martijn Pieters the solution is:
with open(filename, "r", newline='\r\n') as f:
This answer was posted as an edit to the question disable the automatic change from \r\n to \n in python by the OP lu1her under CC BY-SA 3.0.
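The difference between the two newline settings is easiest to see on a tiny file that mixes a bare \n with \r\n terminators. The sketch below uses a temporary file; the variable names are only illustrative.

```python
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'a\nb\r\nc\r\n')

# Only \r\n ends a line; the bare \n stays inside the first line.
with open(path, 'r', encoding='utf-8', newline='\r\n') as f:
    crlf_only = list(f)

# Universal newlines, but nothing is translated.
with open(path, 'r', encoding='utf-8', newline='') as f:
    universal = list(f)

os.remove(path)

print(crlf_only)  # ['a\nb\r\n', 'c\r\n']
print(universal)  # ['a\n', 'b\r\n', 'c\r\n']
```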

How to disable universal newlines in Python 2.7 when using open()

I have a csv file that contains two different newline terminators (\n and \r\n). I want my Python script to use \r\n as the newline terminator and NOT \n. But the problem is that Python's universal newlines feature keeps normalizing everything to be \n when I open the file using open().
The strange thing is that it never used to normalize my newlines when I wrote this script, that's why I used Python 2.7 and it worked fine. But all of a sudden today it started normalizing everything and my script no longer works as needed.
How can I disable universal newlines when opening a file using open() (without opening in binary mode)?
You need to open the file in binary mode, as stated in the module documentation:
with open(csvfilename, 'rb') as fileobj:
    reader = csv.reader(fileobj)
From the csv.reader() documentation:
If csvfile is a file object, it must be opened with the ‘b’ flag on platforms where that makes a difference.
In binary mode no line separator translations take place.
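The answer above is for Python 2.7. For readers on Python 3, the csv documentation recommends text mode with newline='' instead of binary mode. Here is a small sketch, using a made-up one-record file, showing that a CRLF inside a quoted field then survives untouched. It does not change how csv.reader ends records outside of quotes.

```python
import csv
import os
import tempfile

fd, path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    # One record whose quoted second field contains a CRLF.
    f.write(b'id,"first\r\nsecond"\r\n')

# newline='' hands the line endings to the csv module untranslated.
with open(path, 'r', encoding='utf-8', newline='') as fileobj:
    rows = list(csv.reader(fileobj))

os.remove(path)
print(rows)  # [['id', 'first\r\nsecond']]
```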

Don't convert newline when reading a file

I'm reading a text file:
f = open('data.txt')
data = f.read()
However newline in data variable is normalized to LF ('\n') while the file contains CRLF ('\r\n').
How can I instruct Python to read the file as is?
In Python 2.x:
f = open('data.txt', 'rb')
As the docs say:
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading. Thus, when opening a binary file, you should append 'b' to the mode value to open the file in binary mode, which will improve portability. (Appending 'b' is useful even on systems that don’t treat binary and text files differently, where it serves as documentation.)
In Python 3.x, there are three alternatives:
f1 = open('data.txt', 'rb')
This will leave newlines untransformed, but will also return bytes instead of str, which you will have to explicitly decode to Unicode yourself. (Of course the 2.x version also returned bytes that had to be decoded manually if you wanted Unicode, but in 2.x that's what a str object is; in 3.x str is Unicode.)
f2 = open('data.txt', 'r', newline='')
This will return str, and leave newlines untranslated. Unlike the 2.x equivalent, however, readline and friends will treat '\r\n' as a newline, instead of a regular character followed by a newline. Usually this won't matter, but if it does, keep it in mind.
f3 = open('data.txt', 'rb', encoding=locale.getpreferredencoding(False))
This treats newlines exactly the same way as the 2.x code, and returns str using the same encoding you'd get if you just used all of the defaults… but it's no longer valid in current 3.x.
When reading input from the stream, if newline is None, universal newlines mode is enabled. Lines in the input can end in '\n', '\r', or '\r\n', and these are translated into '\n' before being returned to the caller. If it is '', universal newlines mode is enabled, but line endings are returned to the caller untranslated.
The reason you need to specify an explicit encoding for f3 is that opening a file in binary mode means the default changes from "decode with locale.getpreferredencoding(False)" to "don't decode, and return raw bytes instead of str". Again, from the docs:
In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)
However:
'encoding' … should only be used in text mode.
And, at least as of 3.3, this is enforced; if you try it with binary mode, you get ValueError: binary mode doesn't take an encoding argument.
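A quick way to confirm this on a current interpreter is the sketch below. The mode check happens before the file is touched, so 'data.txt' does not need to exist.

```python
import locale

error = None
try:
    # Binary mode combined with an encoding is rejected up front.
    open('data.txt', 'rb', encoding=locale.getpreferredencoding(False))
except ValueError as exc:
    error = exc
    print('rejected:', exc)
```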
So, if you want to write code that works on both 2.x and 3.x, what do you use? If you want to deal in bytes, obviously f and f1 are the same. But if you want to deal in str, as appropriate for each version, the simplest answer is to write different code for each, probably f and f2, respectively. If this comes up a lot, consider writing a wrapper function:
import sys

if sys.version_info >= (3, 0):
    def crlf_open(path, mode):
        return open(path, mode, newline='')
else:
    def crlf_open(path, mode):
        return open(path, mode+'b')
Another thing to watch out for in writing multi-version code is that, if you're not writing locale-aware code, locale.getpreferredencoding(False) almost always returns something reasonable in 3.x, but it will usually just return 'US-ASCII' in 2.x. Using locale.getpreferredencoding(True) is technically incorrect, but may be more likely to be what you actually want if you don't want to think about encodings. (Try calling it both ways in your 2.x and 3.x interpreters to see why—or read the docs.)
Of course if you actually know the file's encoding, that's always better than guessing anyway.
In either case, the 'r' means "read-only". If you don't specify a mode, the default is 'r', so the binary-mode equivalent to the default is 'rb'.
You need to open the file in the binary mode:
f = open('data.txt', 'rb')
data = f.read()
('r' for "read", 'b' for "binary")
Then everything is returned as is; nothing is normalized.
You can use the codecs module to write 'version-agnostic' code:
Underlying encoded files are always opened in binary mode. No automatic conversion of '\n' is done on reading and writing. The mode argument may be any binary mode acceptable to the built-in open() function; the 'b' is automatically added.
import codecs
with codecs.open('foo', mode='r', encoding='utf8') as f:
    # python2: u'foo\r\n'
    # python3: 'foo\r\n'
    f.readline()
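Another option that should behave the same on both versions is io.open with newline=''. io.open is the built-in open() on Python 3 and is also available on Python 2.6+, where it accepts the newline argument too. A minimal sketch on a temporary file (output shown for Python 3):

```python
import io
import os
import tempfile

fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(b'foo\r\nbar\r\n')

# newline='' keeps the line endings exactly as stored in the file.
with io.open(path, mode='r', encoding='utf8', newline='') as f:
    first_line = f.readline()

os.remove(path)
print(repr(first_line))  # 'foo\r\n'
```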
Just request "read binary" in the open:
f = open('data.txt', 'rb')
data = f.read()
Open the file using open('data.txt', 'rb'). See the doc.
