Python handling newline and tab characters when writing to file

I am writing some text (which includes \n and \t characters) taken from one source file to another (text) file; for example:
source file (test.cpp):
/*
* test.cpp
*
* 2013.02.30
*
*/
is taken from the source file and stored in a string variable like so
test_str = "/*\n test.cpp\n *\n *\n *\n\t2013.02.30\n *\n */\n"
which, when I write it to a file using
with open('test.cpp', 'a') as out:
    print(test_str, file=out)
is written with the newline and tab characters converted to actual newlines and tabs (exactly like test.cpp had them), whereas I want them to remain \n and \t exactly as the test_str variable holds them in the first place.
Is there a way to achieve that in Python when writing to a file these 'special characters' without them being translated?

You can use str.encode:
with open('test.cpp', 'a') as out:
    print(test_str.encode('unicode_escape').decode('utf-8'), file=out)
This will escape all of the escape sequences that Python recognises.
Given your example:
>>> test_str = "/*\n test.cpp\n *\n *\n *\n\t2013.02.30\n *\n */\n"
>>> test_str.encode('unicode_escape')
b'/*\\n test.cpp\\n *\\n *\\n *\\n\\t2013.02.30\\n *\\n */\\n'
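If you later need the real newlines and tabs back, the escaping can be reversed with the same codec; this is a minimal sketch, assuming ASCII-only content like the example above:
# The escaped text now holds literal \n and \t sequences (two characters each).
escaped = r"/*\n test.cpp\n *\n *\n *\n\t2013.02.30\n *\n */\n"
# Decode them back into real control characters.
restored = escaped.encode('utf-8').decode('unicode_escape')
assert restored == "/*\n test.cpp\n *\n *\n *\n\t2013.02.30\n *\n */\n"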

Use replace(). Since you need to apply it multiple times, see the sketch after the example below for one way to avoid repeating yourself.
test_str = "/*\n test.cpp\n *\n *\n *\n\t2013.02.30\n *\n */\n"
with open("somefile", "w") as f:
test_str = test_str.replace('\n','\\n')
test_str = test_str.replace('\t','\\t')
f.write(test_str)
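If there are more than these two sequences to escape, one way to avoid repeating yourself is to loop over a mapping of replacements; this is just a sketch of that idea, not part of the original answer:
test_str = "/*\n test.cpp\n *\n *\n *\n\t2013.02.30\n *\n */\n"
# Map each control character to the two-character escape sequence it should become.
replacements = {'\n': '\\n', '\t': '\\t', '\r': '\\r'}
for old, new in replacements.items():
    test_str = test_str.replace(old, new)
with open("somefile", "w") as f:
    f.write(test_str)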

I want them to remain \n and \t exactly like the test_str variable holds them in the first place.
test_str does NOT contain the backslash \ plus t (two characters). It contains the single character ord('\t') == 9 (the same character as in test.cpp). Backslash is special in Python string literals, e.g., u'\U0001f600' is NOT ten characters; it is the single character 😀. Don't confuse a string object in memory at runtime with its text representation as a string literal in Python source code.
JSON could be a better (more portable) alternative to the unicode-escape encoding for storing the text, i.e., use:
import json
with open('test.json', 'w') as file:
    json.dump({'test.cpp': test_str}, file)
instead of test_str.encode('unicode_escape').decode('ascii').
To read the JSON back:
with open('test.json') as file:
    test_str = json.load(file)['test.cpp']

Related

How to replace double quotes (") with apostrophes (') when loading a .txt file with Pipe delimited fields in Pandas?

Problem Summary
I'm trying to load .txt files in Python using Pandas.
The .txt files use a | delimiter between fields.
Each field is captured between double quotes "" as a string, e.g. "i_am_a_string".
The problem is that some fields have apostrophes represented with double quotes, e.g. "I"m_not_a_valid_string" (it should be "I'm_not_a_valid_string").
Sample file
To demonstrate my issue I have created a test file which is as follows when edited in vi:
"Name"|"Surname"|"Address"|"Notes"^M
"Angelo"|""|"Kenton Square 5"|"Note 1"^M
"Angelo"|""|"Kenton’s ^M
Sqr5"|"note2"^M
"Angelo"|""|"Kenton"s ^M
Road"|"Note3"^M
Loading data
To load this file I run the following command in Jupyter notebook:
test = pd.read_csv('test.txt', sep='|')
which loads up the file as in the screenshot (not reproduced here).
Questions
There are two issues I'm looking to address, represented by the "note2" and "Note3" examples in the file:
note2 question
How can I get rid of the ^M when loading the file? i.e. how can I remove the "\r\r\n" from the Address column when it's loaded up in Jupyter. The "note2" example should have loaded as a single value in the Address column.
Should I remove these before loading the file using bash commands or
Should I remove these after I load it in Jupyter using Python?
Can you please suggest the code to do it in each case and which one would you recommend (and why)?
Note3 question
How do I replace the double quote within the string with an apostrophe? Here it breaks the value onto another line, which is incorrect; it should all be loaded into row 2.
The "Note3" example is a compounded one, as it also has the "^M" characters in the string, but here I'm interested in replacing the double quote with an apostrophe so it doesn't break onto another line and corrupt the loading.
Thank you for your help, much appreciated.
Angelo
How do I replace the double quote within the string expression with apostrophe?
If " which are to be converted into ' are always between letters (word characters) you might preprocess your file using regular expression (re) following way
import re
txt = '''"Name"|"Surname"|"Address"|"Notes"
"Angelo"|""|"Kenton Square 5"|"Note 1"
"Angelo"|""|"Kenton’s
Sqr5"|"note2"
"Angelo"|""|"Kenton"s
Road"|"Note3"'''
clean_text = re.sub(r'(?<=\w)"(?=\w)', "'", txt)
print(clean_text)
output
"Name"|"Surname"|"Address"|"Notes"
"Angelo"|""|"Kenton Square 5"|"Note 1"
"Angelo"|""|"Kenton’s
Sqr5"|"note2"
"Angelo"|""|"Kenton's
Road"|"Note3"
Explanation: use zero-length assertions (a lookbehind and a lookahead) to find " characters that come after a word character and before a word character.
If you have the text in a file, first read it in as a text file, i.e.
with open("test.txt", "r") as f:
    txt = f.read()
then clean it
import re
clean_text = re.sub(r'(?<=\w)"(?=\w)', "'", txt)
then load it into a pandas.DataFrame using io.StringIO as follows:
import io
import pandas as pd
test = pd.read_csv(io.StringIO(clean_text), sep='|')
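The ^M part of the question is not addressed by the regex above. One option, which is my own assumption rather than part of this answer, is to clean the affected column after loading, since quoted fields that span lines keep the line-break characters in their values:
# Hypothetical post-load cleanup for the "note2" case: strip stray carriage
# returns and turn embedded line breaks into spaces in the Address column.
test['Address'] = (
    test['Address']
    .str.replace('\r', '', regex=False)
    .str.replace('\n', ' ', regex=False)
)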

Escape commas when writing string to CSV

I need to prepend a comma-containing string to a CSV file using Python. Some say enclosing the string in double quotes escapes the commas within. This does not work. How do I write this string without the commas being recognized as separators?
string = "WORD;WORD 45,90;WORD 45,90;END;"
with open('doc.csv') as f:
    prepended = string + '\n' + f.read()
with open('doc.csv', 'w') as f:
    f.write(prepended)
So, as you point out, you can typically quote the string as below. Is the system that reads these files not recognizing that syntax? If you use Python's csv module it will handle the proper escaping:
import csv

with open('output.csv', 'w', newline='') as f:
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerows(myIterable)
The quoted strings would look like:
"string1","string 2, with, commas"
Note if you have a quote character within your string it will be written as "" (two quote chars in a row):
"string1","string 2, with, commas, and "" a quote"

Python: Converting Binary Literal text file to Normal Text

I have a text file in this format:
b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'
And I want to read those lines and covert them to
Chapter 1 - BlaBla
Boy's Dead.
and replace them on the same file.
I tried encoding and decoding already with print(line.encode("UTF-8", "replace")) and that didn't work
strings = [
    b'Chapter 1 \xe2\x80\x93 BlaBla',
    b'Boy\xe2\x80\x99s Dead.',
]

for string in strings:
    print(string.decode('utf-8', 'ignore'))
--output:--
Chapter 1 – BlaBla
Boy’s Dead.
and replace them on the same file.
There is no computer programming language in the world that can do that. You have to write the output to a new file, delete the old file, and rename the new file to the old file. However, Python's fileinput module can perform that process for you:
import fileinput as fi
import sys
with open('data.txt', 'wb') as f:
    f.write(b'Chapter 1 \xe2\x80\x93 BlaBla\n')
    f.write(b'Boy\xe2\x80\x99s Dead.\n')

with open('data.txt', 'rb') as f:
    for line in f:
        print(line)

with fi.input(
        files='data.txt',
        inplace=True,
        backup='.bak',
        mode='rb') as f:
    for line in f:
        string = line.decode('utf-8', 'ignore')
        print(string, end="")
~/python_programs$ python3.4 prog.py
b'Chapter 1 \xe2\x80\x93 BlaBla\n'
b'Boy\xe2\x80\x99s Dead.\n'
~/python_programs$ cat data.txt
Chapter 1 – BlaBla
Boy’s Dead.
Edit:
import fileinput as fi
import re

pattern = r"""
\\          #Match a literal backslash...
x           #Followed by an x...
[a-f0-9]{2} #Followed by any hex character, 2 times
"""
repl = ''

with open('data.txt', 'w') as f:
    print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)
    print(r"b'Boy\xe2\x80\x99s Dead.'", file=f)

with open('data.txt') as f:
    for line in f:
        print(line.rstrip())  #Output goes to terminal window

with fi.input(
        files='data.txt',
        inplace=True,
        backup='.bak') as f:
    for line in f:
        line = line.rstrip()[2:-1]
        new_line = re.sub(pattern, repl, line, flags=re.X)
        print(new_line)  #Writes to the file, not your terminal window
~/python_programs$ python3.4 prog.py
b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'
~/python_programs$ cat data.txt
Chapter 1 BlaBla
Boys Dead.
Your file does not contain binary data, so you can read it (or write it) in text mode. It's just a matter of escaping things correctly.
Here is the first part:
print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)
Python converts certain backslash escape sequences inside a string to something else. One of the backslash escape sequences that python converts is of the format:
\xNN #=> e.g. \xe2
The backslash escape sequence is four characters long, but python converts the backslash escape sequence into a single character.
However, I need each of the four characters to be written to the sample file I created. To keep python from converting the backslash escape sequence into one character, you can escape the beginning '\' with another '\':
\\xNN
But being lazy, I didn't want to go through your strings and escape each backslash escape sequence by hand, so I used:
r"...."
An r (raw) string leaves all the backslashes alone. As a result, Python writes all four characters of the \xNN sequence to the file.
The next problem is replacing a backslash in a string using a regex--I think that was your problem to begin with. When a file contains a \, python reads that into a string as \\ to represent a literal backslash. As a result, if the file contains the four characters:
\xe2
python reads that into a string as:
"\\xe2"
which when printed looks like:
\xe2
The bottom line is: if you can see a '\' in a string that you print out, then the string really does contain a backslash character. To see what's really inside a string, you should always use repr().
string = "\\xe2"
print(string)
print(repr(string))
--output:--
\xe2
'\\xe2'
Note that if the output has quotes around it, then you are seeing everything in the string. If the output doesn't have quotes around it, then you can't be sure exactly what's in the string.
To construct a regex pattern that matches a literal back slash in a string, the short answer is: you need to use double the amount of back slashes that you would think. With the string:
"\\xe2"
you would think that the pattern would be:
pattern = "\\x"
but based on the doubling rule, you actually need:
pattern = "\\\\x"
And remember r strings? If you use an r string for the pattern, you can write what seems reasonable, and the raw string keeps each backslash literal, which has the same effect as doubling them:
pattern = r"\\x"  #=> equivalent to "\\\\x"

(Python) Parsing tab delimited strings with newline characters

I am trying to read a file that is tab delimited but fields may contain newline characters and I would like to maintain the field that has newlines. My current implementation creates new fields from each "\n".
I have tried the csv module and just splitting on "\t" with no success on what I'm looking for. The following is a sample line from a given file:
Field_1 \t Field_2 \t Field_3 \n Additional Text \n More text \t Field_4
I would like to generate a list of 4 elements from the data above.
*["Field_1", "Field_2", "Field3 \n Additional Text \n More text", "Field_4"]*
Any thoughts or suggestions would be helpful.
Did you try splitting on the tab like this?
data = 'Field_1 \t Field_2 \t Field_3 \n Additional Text \n More text \t Field_4'
print(data.split('\t'))
Replacing fileName with the path to the file you're reading from:
inFile = open(fileName, "r")
rawData = inFile.read() # Entire file's contents as one multiline string (if there's a line break)
data = rawData.split("\t")
inFile.close()
There is also the option (generally recommended) of using the with statement for File I/O:
with open(fileName, "r") as inFile:
    rawData = inFile.read()  # Entire file's contents as one multiline string (if there's a line break)
    data = rawData.split("\t")
    # you can omit the inFile.close() statement.
With the with statement, the opened file stream is automatically closed, even if an error occurs at runtime, though it can be less clear to people learning file I/O how it works.
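As a quick check against the sample line from the question, splitting on '\t' does keep the embedded newlines inside the third field; the literal string below just stands in for the file contents:
data = 'Field_1 \t Field_2 \t Field_3 \n Additional Text \n More text \t Field_4'
fields = data.split('\t')
print(len(fields))       # 4
print(repr(fields[2]))   # ' Field_3 \n Additional Text \n More text ' -- newlines preserved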

Write a list to file containing text and hex values. How?

I need to write a list of values to a text file. Because of Windows, when I need to write a line feed character, Windows writes \r\n and other systems write \n.
It occurred to me that maybe I should write to file in binary.
How to I create a list like the following example and write to file in binary?
output = ['my first line', hex_character_for_line_feed_here, 'my_second_line']
How come the following does not work?
output = ['my first line', '\x0a', 'my second line']
Don't. Open the file in text mode and just let Python handle the newlines for you.
When you use the open() function you can set how Python should handle newlines with the newline keyword parameter:
When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
So the default method is to write the correct line separator for your platform:
with open(outputfilename, 'w') as outputfile:
    outputfile.write('\n'.join(output))
and does the right thing; on Windows \r\n characters are saved instead of \n.
If you specifically want to write \n only and not have Python translate these for you, use newline='':
with open(outputfilename, 'w', newline='') as outputfile:
    outputfile.write('\n'.join(output))
Note that '\x0a' is exactly the same character as \n; \r is \x0d:
>>> '\x0a'
'\n'
>>> '\x0d'
'\r'
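If you want to see exactly which bytes ended up in the file, one way (a small sketch with a made-up filename) is to read it back in binary mode:
# Write with the default newline handling, then inspect the raw bytes.
with open('newline_demo.txt', 'w') as f:
    f.write('my first line\nmy_second_line\n')
with open('newline_demo.txt', 'rb') as f:
    print(f.read())  # b'my first line\r\nmy_second_line\r\n' on Windows, b'...\n...\n' elsewhere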
Create a text file, "myTextFile" in the same directory as your Python script. Then write something like:
# wb opens the file in "Write Binary" mode
myTextFile = open("myTextFile.txt", 'wb')
output = ['my first line', '369as3', 'my_second_line']
for member in output:
    # Encode each string to bytes before writing (use whatever encoding you like =)
    myTextFile.write(member.encode("utf-8") + b"\n")
myTextFile.close()
This outputs a binary text file that looks like:
my first line
369as3
my_second_line
Edit: Updated for Python 3
