Python: Converting Binary Literal text file to Normal Text - python

I have a text file in this format:
b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'
And I want to read those lines and convert them to
Chapter 1 - BlaBla
Boy's Dead.
and replace them on the same file.
I tried encoding and decoding already with print(line.encode("UTF-8", "replace")) and that didn't work

strings = [
    b'Chapter 1 \xe2\x80\x93 BlaBla',
    b'Boy\xe2\x80\x99s Dead.',
]
for string in strings:
    print(string.decode('utf-8', 'ignore'))
--output:--
Chapter 1 – BlaBla
Boy’s Dead.
and replace them on the same file.
In general, a file cannot be edited in place when the length of its lines changes. You have to write the output to a new file, delete the old file, and rename the new file to the old file's name. However, Python's fileinput module can perform that process for you:
import fileinput as fi
import sys
with open('data.txt', 'wb') as f:
    f.write(b'Chapter 1 \xe2\x80\x93 BlaBla\n')
    f.write(b'Boy\xe2\x80\x99s Dead.\n')
with open('data.txt', 'rb') as f:
    for line in f:
        print(line)
with fi.input(
        files = 'data.txt',
        inplace = True,
        backup = '.bak',
        mode = 'rb') as f:
    for line in f:
        string = line.decode('utf-8', 'ignore')
        print(string, end="")
~/python_programs$ python3.4 prog.py
b'Chapter 1 \xe2\x80\x93 BlaBla\n'
b'Boy\xe2\x80\x99s Dead.\n'
~/python_programs$ cat data.txt
Chapter 1 – BlaBla
Boy’s Dead.
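The write-new-file-then-rename procedure described above can also be done by hand. This is a minimal sketch, assuming the file is UTF-8 text; the helper name rewrite_in_place, the demo file name, and the transform callback are illustrative and not part of the original answer:

```python
import os
import tempfile

def rewrite_in_place(path, transform):
    """Write transformed lines to a temp file, then swap it over the original."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the original so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'w', encoding='utf-8') as tmp, \
                open(path, encoding='utf-8') as src:
            for line in src:
                tmp.write(transform(line))
        os.replace(tmp_path, path)  # the new file takes the old file's name
    except BaseException:
        os.remove(tmp_path)  # clean up the temp file if anything failed
        raise

# Hypothetical usage: replace the en dash with a plain hyphen.
demo_path = os.path.join(tempfile.gettempdir(), 'rewrite_demo.txt')
with open(demo_path, 'w', encoding='utf-8') as f:
    f.write('Chapter 1 \u2013 BlaBla\n')
rewrite_in_place(demo_path, lambda line: line.replace('\u2013', '-'))
```

Unlike fileinput with backup='.bak', this sketch keeps no backup copy of the original.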
Edit:
import fileinput as fi
import re
pattern = r"""
    \\ #Match a literal slash...
    x #Followed by an x...
    [a-f0-9]{2} #Followed by any hex character, 2 times
"""
repl = ''
with open('data.txt', 'w') as f:
    print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)
    print(r"b'Boy\xe2\x80\x99s Dead.'", file=f)
with open('data.txt') as f:
    for line in f:
        print(line.rstrip()) #Output goes to terminal window
with fi.input(
        files = 'data.txt',
        inplace = True,
        backup = '.bak') as f:
    for line in f:
        line = line.rstrip()[2:-1]
        new_line = re.sub(pattern, "", line, flags=re.X)
        print(new_line) #Writes to file, not your terminal window
~/python_programs$ python3.4 prog.py
b'Chapter 1 \xe2\x80\x93 BlaBla'
b'Boy\xe2\x80\x99s Dead.'
~/python_programs$ cat data.txt
Chapter 1 BlaBla
Boys Dead.
Your file does not contain binary data, so you can read it (or write it) in text mode. It's just a matter of escaping things correctly.
Here is the first part:
print(r"b'Chapter 1 \xe2\x80\x93 BlaBla'", file=f)
Python converts certain backslash escape sequences inside a string to something else. One of the backslash escape sequences that python converts is of the format:
\xNN #=> e.g. \xe2
The backslash escape sequence is four characters long, but python converts the backslash escape sequence into a single character.
However, I need each of the four characters to be written to the sample file I created. To keep python from converting the backslash escape sequence into one character, you can escape the beginning '\' with another '\':
\\xNN
But being lazy, I didn't want to go through your strings and escape each backslash escape sequence by hand, so I used:
r"...."
An r string escapes all the backslashes for you. As a result, python writes all four characters of the \xNN sequence to the file.
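A quick way to check the claim about r strings is to compare the three spellings directly. This is only an illustrative sketch; the variable names are made up for the example:

```python
escaped = "\\xe2"    # backslash escaped by hand
raw = r"\xe2"        # raw string: the backslash is kept as typed
converted = "\xe2"   # escape sequence converted into one character

print(escaped == raw)    # True
print(len(raw))          # 4
print(len(converted))    # 1
```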
The next problem is replacing a backslash in a string using a regex. I think that was your problem to begin with. When a file contains a \, the string Python reads holds a single backslash character, which is written as \\ in a string literal. As a result, if the file contains the four characters:
\xe2
python reads that into a string as:
"\\xe2"
which when printed looks like:
\xe2
The bottom line is: if you can see a '\' in a string that you print out, then the backslash is being escaped in the string. To see what's really inside a string, you should always use repr().
string = "\\xe2"
print(string)
print(repr(string))
--output:--
\xe2
'\\xe2'
Note that if the output has quotes around it, then you are seeing everything in the string. If the output doesn't have quotes around it, then you can't be sure exactly what's in the string.
To construct a regex pattern that matches a literal backslash in a string, the short answer is: you need to use double the number of backslashes that you would think. With the string:
"\\xe2"
you would think that the pattern would be:
pattern = "\\x"
but based on the doubling rule, you actually need:
pattern = "\\\\x"
And remember r strings? If you use an r string for the pattern, then you can write what seems reasonable, and then the r string will escape all the slashes, doubling them:
pattern = r"\\x" #=> equivalent to "\\\\x"
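If the goal is to keep the special characters rather than strip them, one alternative to the regex (a different technique from the answer above) is to let ast.literal_eval parse each b'...' line into a real bytes object and then decode it. A minimal sketch, assuming every line in the file is a valid bytes literal encoded as UTF-8:

```python
import ast

# The text of each line, exactly as it appears in the file.
lines = [
    r"b'Chapter 1 \xe2\x80\x93 BlaBla'",
    r"b'Boy\xe2\x80\x99s Dead.'",
]

decoded = []
for line in lines:
    raw_bytes = ast.literal_eval(line)          # text "b'...'" -> bytes object
    decoded.append(raw_bytes.decode('utf-8'))   # bytes -> str

for text in decoded:
    print(text)
```

ast.literal_eval only accepts Python literals, so it will raise an error on a line that is not one instead of executing it.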

Related

Write to file does not go to the new line when '\n' in string

I have a .txt file which contains only one line of text. For example:
command1;\ncommand2, output;\ncommand3\ncommand4, output;\n (but much longer). Since it is hard to read, I want to change this file to some more readable version. I want to remove all ';' and replace '\n' with a new line.
I have a few working solutions for this problem:
For example I could remove all '\n' and use print function. Or, replace \\n with \n:
def clean_file(file):
    # read file
    with open(file) as f:
        content = f.readline()
    # get rid of ';' and '\n'
    content = content.split(';')
    for ind, val in enumerate(content):
        content[ind] = val.replace('\\n', '\n') # it can be also replace(r'\n', '\n')
    # write to file
    with open(file, 'w') as f:
        for line in content:
            f.write(line)
OUT:
command1
command2, output
command3
command4, output
And in this scenario, it works properly!
But I have no idea why it is not working when I remove replace part:
def clean_file(file):
    # read file
    with open(file) as f:
        content = f.readline()
    # get rid of ';'
    content = content.split(';')
    # write to file
    with open(file, 'w') as f:
        for line in content:
            f.write(line)
OUT:
command1\ncommand2, output\ncommand3\ncommand4, output\n
This will print everything in one line.
Can someone explain to me why I have to replace '\n' with the same value?
The file was created, and I am opening it on windows, but the script I am running on Linux.
Most editors in the Windows world (starting with Notepad) require \r\n to correctly display an end of line and ignore \n alone. On the other hand, on Linux a single \n is enough for an end of line. If you run a Python script on Windows, it automatically replaces any '\n' with \r\n at write time and symmetrically replaces \r\n read from a file with a single \n, provided the file is opened in text mode. On Linux, no such translation happens when writing.
Long story short, text files have different end of lines on Linux and Windows, and text files having \r\n are known as dos text files on Linux.
You have probably been caught by that, and the only way to be sure is to open the file in binary mode and display the byte values (in hex to be more readable for people used to ASCII code)
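One way to follow that advice is sketched below. The helper name hex_dump and the demo file are made up for illustration; the sample bytes hold a literal backslash followed by n, and then a real Windows line ending, so both cases show up in the dump:

```python
import os
import tempfile

def hex_dump(path, limit=64):
    """Return the first `limit` bytes of a file as space-separated hex values."""
    with open(path, 'rb') as f:   # binary mode: no newline translation at all
        data = f.read(limit)
    return ' '.join(f'{byte:02x}' for byte in data)

# Demo file: 'a', a literal backslash, 'n', then CR LF.
sample = os.path.join(tempfile.gettempdir(), 'endings_demo.txt')
with open(sample, 'wb') as f:
    f.write(b'a\\n\r\n')

print(hex_dump(sample))  # 61 5c 6e 0d 0a
```

In the dump, 5c 6e is the two-character text \n, 0d is a carriage return, and 0a is an actual line feed.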
You are not replacing the same value; you are removing the \ before \n. In string literals, a backslash usually introduces a special character (such as newline \n or tab \t), but sometimes you want an actual backslash. To write one in Python, use \\.
So in your first example, Python sees \n and treats it as a newline. In your second example, Python sees \\n: the first two characters stand for a single literal backslash, and the n is then treated and printed like a normal n.

Escape commas when writing string to CSV

I need to prepend a comma-containing string to a CSV file using Python. Some say enclosing the string in double quotes escapes the commas within. This does not work. How do I write this string without the commas being recognized as separators?
string = "WORD;WORD 45,90;WORD 45,90;END;"
with open('doc.csv') as f:
    prepended = string + '\n' + f.read()
with open('doc.csv', 'w') as f:
    f.write(prepended)
So as you point out, you can typically quote the string as below. Is the system that reads these files not recognizing that syntax? If you use python's csv module it will handle the proper escaping:
import csv
with open('output.csv', 'w', newline='') as f:
    # quoting is an option of csv.writer, not of writerows
    writer = csv.writer(f, quoting=csv.QUOTE_ALL)
    writer.writerows(myIterable)
The quoted strings would look like:
"string1","string 2, with, commas"
Note if you have a quote character within your string it will be written as "" (two quote chars in a row):
"string1","string 2, with, commas, and "" a quote"

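The quoting behaviour described in the csv answer can be checked without touching the disk. A small sketch, using an in-memory io.StringIO buffer in place of a real file; the sample row is taken from the quoted output shown in that answer:

```python
import csv
import io

rows = [["string1", 'string 2, with, commas, and " a quote']]

buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL)  # quote every field
writer.writerows(rows)
print(buffer.getvalue())

# Reading the text back recovers the original fields, commas and quote included.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
print(parsed == rows)  # True
```
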
Python - converting single backslash into comma

I have a gz file with the first 5 columns delimited by a backslash and the next 5 delimited by a comma. I'm reading in the file as follows:
import gzip
with gzip.open(myfile, 'rt') as fin:
    for line in fin:
        print(line)
The data looks like this:
a\b\c\d\e,f,g,h,i,j
How can I convert the backslashes into commas, so that it looks like this?
a,b,c,d,e,f,g,h,i,j
I've tried:
>>> g = 'a\b\c\d\e,f,g,h,i,j'
>>> g2 = g.replace('\\', ',')
>>> g2
'a\x08,c,d,e,f,g,h,i,j'
Reading the string in its raw format solves the problem:
>>> g = r'a\b\c\d\e,f,g,h,j,k'
>>> g.replace('\\', ',')
'a,b,c,d,e,f,g,h,j,k'
But how would I read lines from a gzip'd file as raw strings?
Just read it like you're already reading it. Reading from files doesn't apply string literal escape processing. String literal escape processing only applies to string literals.
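To illustrate that point, here is a small sketch that writes a gzip file containing real backslashes and reads it back. The file name is made up for the demo; the doubled backslashes appear only in the string literal used to create the file, not in the data that is read:

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'backslash_demo.gz')

# Create a sample file whose single line is: a\b\c\d\e,f,g,h,i,j
with gzip.open(path, 'wt', encoding='utf-8') as fout:
    fout.write('a\\b\\c\\d\\e,f,g,h,i,j\n')

# Lines read from the file keep their backslashes; no escape processing happens.
with gzip.open(path, 'rt', encoding='utf-8') as fin:
    converted = [line.rstrip('\n').replace('\\', ',') for line in fin]

print(converted)  # ['a,b,c,d,e,f,g,h,i,j']
```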
The methods .read() and .readlines() exist, but I don't really know whether they work with gz files; try them and use r'content'. The problem is that \b, \t, \n, \u2003, and others are special characters in a string literal, so using .replace('\\', ',') doesn't work on a literal written without the r prefix.

Python handling newline and tab characters when writing to file

I am writing some text (which includes \n and \t characters) taken from one source file onto a (text) file ; for example:
source file (test.cpp):
/*
* test.cpp
*
* 2013.02.30
*
*/
is taken from the source file and stored in a string variable like so
test_str = "/*\n test.cpp\n *\n *\n *\n\t2013.02.30\n *\n */\n"
which when I write onto a file using
with open('test.cpp', 'a') as out:
    print(test_str, file=out)
is being written with the newline and tab characters converted to new lines and tab spaces (exactly like test.cpp had them) whereas I want them to remain \n and \t exactly like the test_str variable holds them in the first place.
Is there a way to achieve that in Python when writing to a file these 'special characters' without them being translated?
You can use str.encode:
with open('test.cpp', 'a') as out:
    print(test_str.encode('unicode_escape').decode('utf-8'), file=out)
This'll escape all the Python recognised special escape characters.
Given your example:
>>> test_str = "/*\n test.cpp\n *\n *\n *\n\t2013.02.30\n *\n */\n"
>>> test_str.encode('unicode_escape')
b'/*\\n test.cpp\\n *\\n *\\n *\\n\\t2013.02.30\\n *\\n */\\n'
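If the escaped text later needs to be turned back into the original string, the same codec can be applied in reverse. A short sketch, using a shortened version of the example string; the variable names escaped and restored are illustrative:

```python
test_str = "/*\n test.cpp\n *\n\t2013.02.30\n */\n"

# str -> escaped str: newlines and tabs become the two-character sequences \n and \t
escaped = test_str.encode('unicode_escape').decode('ascii')

# escaped str -> original str
restored = escaped.encode('ascii').decode('unicode_escape')

print(escaped)
print(restored == test_str)  # True
```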
Use replace(). And since you need to use it multiple times, you might want to look at this.
test_str = "/*\n test.cpp\n *\n *\n *\n\t2013.02.30\n *\n */\n"
with open("somefile", "w") as f:
    test_str = test_str.replace('\n','\\n')
    test_str = test_str.replace('\t','\\t')
    f.write(test_str)
I want them to remain \n and \t exactly like the test_str variable holds them in the first place.
test_str does NOT contain the backslash \ + t (two characters). It contains a single character ord('\t') == 9 (the same character as in the test.cpp). Backslash is special in Python string literals e.g., u'\U0001f600' is NOT ten characters—it is a single character 😀 Don't confuse a string object in memory during runtime and its text representation as a string literal in Python source code.
JSON could be a better alternative than unicode-escape encoding to store text (more portable) i.e., use:
import json
with open('test.json', 'w') as file:
    json.dump({'test.cpp': test_str}, file)
instead of test_str.encode('unicode_escape').decode('ascii').
To read json back:
with open('test.json') as file:
    test_str = json.load(file)['test.cpp']

Write a list to file containing text and hex values. How?

I need to write a list of values to a text file. Because of Windows, when I need to write a line feed character, Windows writes \r\n and other systems write \n.
It occurred to me that maybe I should write to the file in binary.
How do I create a list like the following example and write it to a file in binary?
output = ['my first line', hex_character_for_line_feed_here, 'my_second_line']
How come the following does not work?
output = ['my first line', '\x0a', 'my second line']
Don't. Open the file in text mode and just let Python handle the newlines for you.
When you use the open() function you can set how Python should handle newlines with the newline keyword parameter:
When writing output to the stream, if newline is None, any '\n' characters written are translated to the system default line separator, os.linesep. If newline is '' or '\n', no translation takes place. If newline is any of the other legal values, any '\n' characters written are translated to the given string.
So the default method is to write the correct line separator for your platform:
with open(outputfilename, 'w') as outputfile:
    outputfile.write('\n'.join(output))
and does the right thing; on Windows \r\n characters are saved instead of \n.
If you specifically want to write \n only and not have Python translate these for you, use newline='':
with open(outputfilename, 'w', newline='') as outputfile:
    outputfile.write('\n'.join(output))
Note that '\x0a' is exactly the same character as \n; \r is \x0d:
>>> '\x0a'
'\n'
>>> '\x0d'
'\r'
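The effect of the newline parameter can be observed on any platform by reading the written file back in binary mode. A minimal sketch; the temporary file name is made up for the demo:

```python
import os
import tempfile

output = ['my first line', 'my second line']
path = os.path.join(tempfile.gettempdir(), 'newline_demo.txt')

# newline='': '\n' is written unchanged, whatever the platform.
with open(path, 'w', newline='') as f:
    f.write('\n'.join(output))
with open(path, 'rb') as f:
    untranslated = f.read()

# newline='\r\n': every '\n' written is translated to '\r\n'.
with open(path, 'w', newline='\r\n') as f:
    f.write('\n'.join(output))
with open(path, 'rb') as f:
    translated = f.read()

print(untranslated)  # b'my first line\nmy second line'
print(translated)    # b'my first line\r\nmy second line'
```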
Create a text file, "myTextFile" in the same directory as your Python script. Then write something like:
# 'wb' opens the file in "Write Binary" mode
output = ['my first line', '369as3', 'my_second_line']
with open("myTextFile.txt", 'wb') as myTextFile:
    for member in output:
        # Binary mode needs bytes, so encode each line before writing it
        myTextFile.write((member + "\n").encode("utf-8")) # Or whatever encoding you like =)
This outputs a binary text file that looks like:
my first line
369as3
my_second_line
Edit: Updated for Python 3
