I'm having a bad time with character encoding. It's hard to understand why this happens when I open my .txt file:
Questions:
What type of encoding is this? Why does this happen?
How can I rewrite my txt file to use normal accents, or even no accents and special chars at all?
Is there any special library to handle this? I could write a huge function that replace()s all these chars, but I don't know which chars will appear in my future txts.
My code:
folder = 'E:\\WinPython\\notebooks\\scripts\\script1\\'
txtFile = folder + 'PROF_SAI_318_210117_310117_orig.txt'
with open(txtFile, 'r') as f:
    with open('PROF_SAI_318_210117_310117_clean.txt', 'w') as g:
        for line in f:
            do_something()  # what should I write here to 'clean' my file?
            g.write(line)
print("Ok!")
Output excerpt:
SPLEONARDO SIM\xc3\x83O ESTARLING
GOFLORESTA S/A A\xc3\x87UCAR E ALCOOL
SPFOCO REPRESENTA\xc3\x87\xc3\x95ES E CONSULTORIA
It looks like you are using Notepad++ to display your file. The encoding displayed looks like cp1252:
>>> b'COMUNICA\xc7\xc3O M\xc1QUINAS'.decode('cp1252')
'COMUNICAÇÃO MÁQUINAS'
In Notepad++, on the menu select Encoding->Character sets->Western European->Windows-1252 and your file should display correctly.
Here's an example that converts to UTF-8 (your output excerpt):
>>> b'SPLEONARDO SIM\xc3O ESTARLING'.decode('cp1252')
'SPLEONARDO SIMÃO ESTARLING'
>>> b'SPLEONARDO SIM\xc3O ESTARLING'.decode('cp1252').encode('utf8')
b'SPLEONARDO SIM\xc3\x83O ESTARLING'
For your example code, you can do:
with open(txtFile, 'r', encoding='cp1252') as f:
    with open('PROF_SAI_318_210117_310117_clean.txt', 'w', encoding='utf8') as g:
        for line in f:
            g.write(line)
If your files aren't too large, you can just do:
with open(txtFile, 'r', encoding='cp1252') as f:
    with open('PROF_SAI_318_210117_310117_clean.txt', 'w', encoding='utf8') as g:
        g.write(f.read())
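To answer the "without accents" part of the question: a sketch using the standard unicodedata module, which decomposes accented letters (NFD) and drops the combining marks. This loses information, so only do it if you really need plain ASCII:

```python
import unicodedata

def strip_accents(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents('COMUNICAÇÃO MÁQUINAS'))  # COMUNICACAO MAQUINAS
```

This handles any Latin accented character without a hand-written replace() table, which answers the worry about unknown chars in future files.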
Hi, I am trying to use the csv library to convert my CSV file into a new one.
The code that I wrote is the following:
import csv
import re
file_read=r'C:\Users\Comarch\Desktop\Test.csv'
file_write=r'C:\Users\Comarch\Desktop\Test_new.csv'
def find_txt_in_parentheses(cell_txt):
    pattern = r'\(.+\)'
    return set(re.findall(pattern, cell_txt))

with open(file_write, 'w', encoding='utf-8-sig') as file_w:
    csv_writer = csv.writer(file_w, lineterminator="\n")
    with open(file_read, 'r', encoding='utf-8-sig') as file_r:
        csv_reader = csv.reader(file_r)
        for row in csv_reader:
            cell_txt = row[0]
            txt_in_parentheses = find_txt_in_parentheses(cell_txt)
            if len(txt_in_parentheses) == 1:
                txt_in_parentheses = txt_in_parentheses.pop()
                cell_txt_new = cell_txt.replace(' ' + txt_in_parentheses, '')
                cell_txt_new = txt_in_parentheses + '\n' + cell_txt_new
                row[0] = cell_txt_new
            csv_writer.writerow(row)
The only problem is that in the resulting file (Test_new.csv file), I have CRLF instead of LF.
(Sample image: read file on the left, write file on the right.)
And as a result, when I copy the csv column into a Google Docs spreadsheet, I get a blank line after each row with CRLF.
Is it possible to write my code with the csv library so that LF is kept inside a cell instead of CRLF?
From the documentation of csv.reader:

If csvfile is a file object, it should be opened with newline=''. [...]

And from the footnote attached to that sentence:

If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly, and on platforms that use \r\n line endings on write an extra \r will be added. It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.
This is precisely the issue you're seeing. So...
with open(file_read, 'r', encoding='utf-8-sig', newline='') as file_r, \
     open(file_write, 'w', encoding='utf-8-sig', newline='') as file_w:
    csv_reader = csv.reader(file_r, dialect='excel')
    csv_writer = csv.writer(file_w, dialect='excel')
    # ...
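A minimal, self-contained demonstration of why newline='' matters (the file name here is illustrative). With newline='', the csv module alone controls line endings: \r\n between rows, while an embedded \n inside a quoted cell is left untouched:

```python
import csv

# Write one row whose first cell contains an embedded line feed.
with open('demo.csv', 'w', newline='') as f:
    csv.writer(f).writerow(['line1\nline2', 'other'])

# Read the raw bytes back to inspect the exact line endings.
with open('demo.csv', 'rb') as f:
    raw = f.read()

print(raw)  # b'"line1\nline2",other\r\n'
```

Without newline='', on Windows the \n inside the cell (and the writer's own \r\n) would be translated again, producing the stray \r characters.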
You are on Windows, and you open the file with mode 'w', which gives you Windows-style line endings. In Python 2, mode 'wb' gives the preferred behaviour; in Python 3 the csv module requires text mode, so pass newline='' to open() instead.
I read the text file using the code below:
f = open("document.txt", "r+", encoding='utf-8-sig')
f.read()
But the type of f is _io.TextIOWrapper, and I need a string to move on.
Please help me to convert _io.TextIOWrapper to string.
You need to use the output of f.read().
string = f.read()
I think your confusion is the expectation that f itself is turned into a string just by calling its .read() method, but that's not the case: .read() returns the string, and f stays a file object.
For reference, _io.TextIOWrapper is the class of an open text file. See the documentation for io.TextIOWrapper.
By the way, best practice is to use a with-statement for opening files:
with open("document.txt", "r", encoding='utf-8-sig') as f:
string = f.read()
This is good:
with open(file, 'r', encoding='utf-8-sig') as f:
data = f.read()
This is not good:
with open(file, 'r', encoding='utf-8-sig') as file:
data = file.read()
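The reason the second version is bad: the as file clause rebinds the name file, so the path string is gone after that line runs. A small demonstration (the file name is just an example):

```python
file = 'example.txt'
with open(file, 'w', encoding='utf-8-sig') as f:
    f.write('data')

with open(file, 'r', encoding='utf-8-sig') as file:  # rebinds the name 'file'!
    data = file.read()

# 'file' is now a closed file object, not the path string,
# so you cannot reuse it to open the file again.
print(isinstance(file, str))  # False
```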
It's not a super elegant solution, but it works for me:

def extractPath(innie):
    iggy = str(innie)
    getridofme = "<_io.TextIOWrapper name='"
    getridofmetoo = "' mode='r' encoding='UTF-8'>"
    iggy = iggy.replace(getridofme, "")
    iggy = iggy.replace(getridofmetoo, "")
    print(iggy)
    return iggy

Note that this only works when the mode and encoding in the repr match these exact strings.
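A simpler alternative to parsing the repr: an open file object already carries the path it was opened with in its name attribute, so there is no need to strip the <_io.TextIOWrapper ...> wrapper text by hand:

```python
def extract_path(open_file):
    # Every file object opened from a path exposes that path as .name
    return open_file.name

with open('document.txt', 'w', encoding='utf-8') as f:
    f.write('hello')
    print(extract_path(f))  # document.txt
```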
The following code produces a file where the \n becomes an actual line break, but I need the file to contain the literal text test\nstring. I can't figure out a way to write the \ symbol either.
s = "test\nstring"
with open('test.txt', 'w') as f:
    f.write(s)
How can I make sure that the file contains a literal \n instead of a line break?
Use s = "test\\nstring".
I tried with the following code and worked.
s = "test\\nstring"
with open('test.txt', 'w') as f:
    f.write(s)
and the test.txt file contains
test\nstring
Besides escaping and raw strings, you can escape-encode it with the 'string_escape' codec (Python 2 only; in Python 3, use "test\nstring".encode('unicode_escape').decode() instead):

s = "test\nstring".encode('string_escape')
with open('test.txt', 'w') as f:
    f.write(s)
Raw strings may help:

s = r"test\nstring"
with open('test.txt', 'w') as f:
    f.write(s)
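A quick way to see the difference between the two literals: \n in a normal string is one character (a line feed), while in a raw string it is two (a backslash plus an n):

```python
s1 = "test\nstring"    # 11 characters, contains a real line feed
s2 = r"test\nstring"   # 12 characters, contains backslash + 'n'
print(len(s1), len(s2))  # 11 12
print(s2 == "test\\nstring")  # True: raw and escaped forms are the same string
```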
I am trying to write a python script to convert rows in a file to json output, where each line contains a json blob.
My code so far is:
with open("/Users/me/tmp/events.txt") as f:
    content = f.readlines()

# strip to remove newlines
lines = [x.strip() for x in content]

i = 1
for line in lines:
    filename = "input" + str(i) + ".json"
    i += 1
    f = open(filename, "w")
    f.write(line)
    f.close()
However, I am running into an issue where if I have an entry in the file that is quoted, for example:
client:"mac"
This will be output as:
"client:""mac"""
Using a second strip on writing to file will give:
client:""mac
But I want to see:
client:"mac"
Is there any way to force Python to read text in the format ' "something" ' without appending extra quotes around it?
Instead of creating an auxiliary list to strip the newline from content, open the input and output files at the same time, write to the output as you iterate through the input's lines, and strip whatever you deem necessary. Try something like this:
with open('events.txt', 'r') as infile, open('input1.json', 'w') as outfile:
    for line in infile:
        # drop the trailing newline, strip surrounding quotes, re-add the newline
        line = line.rstrip('\n').strip('"') + '\n'
        outfile.write(line)
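The doubled quotes in the output suggest the input was written by a CSV writer, which escapes a quote inside a field by doubling it. If so, reading with the csv module undoes that escaping for you, so no manual stripping is needed. A sketch with an inline sample instead of a file:

```python
import csv
import io

# A CSV-quoted field: outer quotes wrap the field, inner quotes are doubled.
raw = '"client:""mac"""\n'
row = next(csv.reader(io.StringIO(raw)))
print(row[0])  # client:"mac"
```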
I have this code
import collections
import csv
import sys
import codecs
from xml.dom.minidom import parse
import xml.dom.minidom

String = collections.namedtuple("String", ["tag", "text"])

def read_translations(filename):
    # Reads a csv file with rows made up of 2 columns:
    # the string tag, and the translated tag
    with codecs.open(filename, "r", encoding='utf-8') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=",")
        result = [String(tag=row[0], text=row[1]) for row in csv_reader]
        return result
The CSV file I'm reading contains Brazilian portuguese characters. When I try to run this, I get an error:
'utf8' codec can't decode byte 0x88 in position 21: invalid start byte
I'm using Python 2.7. As you can see, I'm opening the file with codecs and an explicit encoding, but it doesn't work.
Any ideas?
The idea of this line:
with codecs.open(filename, "r", encoding='utf-8') as csvfile:
is to say "This file was saved as utf-8. Please make appropriate conversions when reading from it."
That works fine if the file was actually saved as utf-8. If some other encoding was used, then it is bad.
What then?
Determine which encoding was used. Assuming the information cannot be obtained from the software which created the file - guess.
Open the file normally and print each line:
with open(filename, 'rt') as f:
    for line in f:
        print repr(line)
Then look for a character which is not ASCII, e.g. ñ - this letter will be printed as some code, e.g.:
'espa\xc3\xb1ol'
Above, ñ is represented as \xc3\xb1, because that is the utf-8 sequence for it.
Now, you can check what various encodings would give and see which is right:
>>> ntilde = u'\N{LATIN SMALL LETTER N WITH TILDE}'
>>>
>>> print repr(ntilde.encode('utf-8'))
'\xc3\xb1'
>>> print repr(ntilde.encode('windows-1252'))
'\xf1'
>>> print repr(ntilde.encode('iso-8859-1'))
'\xf1'
>>> print repr(ntilde.encode('macroman'))
'\x96'
Or print all of them:
import encodings.aliases

for c in encodings.aliases.aliases:
    try:
        encoded = ntilde.encode(c)
        print c, repr(encoded)
    except:
        pass
Then, when you have guessed which encoding it is, use that, e.g.:
with codecs.open(filename, "r", encoding='iso-8859-1') as csvfile:
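The guessing loop above can also be turned around: instead of encoding a known character with every codec, try decoding the problem bytes with a few likely candidates and keep the ones that succeed. A small helper (shown in Python 3 syntax); the byte 0x88 from the error message rules out utf-8 immediately:

```python
def candidate_encodings(data, encodings=('utf-8', 'cp1252', 'iso-8859-1', 'mac_roman')):
    """Return the candidate encodings that decode `data` without error."""
    ok = []
    for enc in encodings:
        try:
            data.decode(enc)
            ok.append(enc)
        except UnicodeDecodeError:
            pass
    return ok

# 0x88 is not a valid UTF-8 start byte, so utf-8 drops out of the list.
possible = candidate_encodings(b'espa\x88ol')
print(possible)
```

Note that single-byte encodings like iso-8859-1 decode almost any byte string, so a successful decode only tells you an encoding is possible; you still have to eyeball the decoded text to pick the right one.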