Python codecs encoding not working

I have this code
import collections
import csv
import sys
import codecs
from xml.dom.minidom import parse
import xml.dom.minidom
String = collections.namedtuple("String", ["tag", "text"])
def read_translations(filename):
    # Reads a CSV file with rows made up of 2 columns: the string tag and the translated tag
    with codecs.open(filename, "r", encoding='utf-8') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=",")
        result = [String(tag=row[0], text=row[1]) for row in csv_reader]
    return result
The CSV file I'm reading contains Brazilian Portuguese characters. When I try to run this, I get an error:
'utf8' codec can't decode byte 0x88 in position 21: invalid start byte
I'm using Python 2.7. As you can see, I'm opening the file with codecs and an explicit encoding, but it doesn't work.
Any ideas?

The idea of this line:
with codecs.open(filename, "r", encoding='utf-8') as csvfile:
is to say "This file was saved as utf-8. Please make appropriate conversions when reading from it."
That works fine if the file was actually saved as utf-8. If some other encoding was used, the decode fails.
What then?
Determine which encoding was used. If the information cannot be obtained from the software which created the file, guess.
Open the file normally and print each line:
with open(filename, 'rt') as f:
    for line in f:
        print repr(line)
Then look for a character which is not ASCII, e.g. ñ - this letter will be printed as some code, e.g.:
'espa\xc3\xb1ol'
Above, ñ is represented as \xc3\xb1, because that is the utf-8 sequence for it.
Now, you can check what various encodings would give and see which is right:
>>> ntilde = u'\N{LATIN SMALL LETTER N WITH TILDE}'
>>>
>>> print repr(ntilde.encode('utf-8'))
'\xc3\xb1'
>>> print repr(ntilde.encode('windows-1252'))
'\xf1'
>>> print repr(ntilde.encode('iso-8859-1'))
'\xf1'
>>> print repr(ntilde.encode('macroman'))
'\x96'
Or print all of them:
import encodings.aliases

for c in encodings.aliases.aliases:
    try:
        encoded = ntilde.encode(c)
        print c, repr(encoded)
    except Exception:
        pass  # skip codecs that don't exist or can't represent the character
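Equivalently, you can go in the other direction: decode the raw bytes you captured with each candidate encoding and eyeball the result. A quick sanity check (this assumes a terminal that can display the characters):
>>> print '\xc3\xb1'.decode('utf-8')
ñ
>>> print '\xc3\xb1'.decode('iso-8859-1')
Ã±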
Then, when you have guessed which encoding it is, use that, e.g.:
with codecs.open(filename, "r", encoding='iso-8859-1') as csvfile:
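As a further shortcut, a third-party library such as chardet can make the guess for you. A sketch, assuming chardet is installed; its answer is a statistical guess, not a guarantee:
import chardet

with open(filename, 'rb') as f:  # read raw bytes, no decoding
    raw = f.read()
guess = chardet.detect(raw)      # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
print guess['encoding']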


How can I replace UTF chars using Python?

I'm having a bad time with character encoding. It's kind of hard to understand why this happens when I open my .txt file:
Questions:
What's this type of encoding? Why does this happen?
How can I rewrite my txt file to use normal accents, or even drop the accents and special chars entirely?
Is there any special library to handle this? I could create a huge function that replace()s all these chars, but I don't know when or which chars will appear in my future txts.
My code:
folder = 'E:\\WinPython\\notebooks\\scripts\\script1\\'
txtFile = folder + 'PROF_SAI_318_210117_310117_orig.txt'

with open(txtFile, 'r') as f:
    with open('PROF_SAI_318_210117_310117_clean.txt', 'w') as g:
        for line in f:
            do_something()  # what should I write here to 'clean' my file?
            g.write(line)
print("Ok!")
Output excerpt:
SPLEONARDO SIM\xc3\x83O ESTARLING
GOFLORESTA S/A A\xc3\x87UCAR E ALCOOL
SPFOCO REPRESENTA\xc3\x87\xc3\x95ES E CONSULTORIA
It looks like you are using Notepad++ to display your file. The encoding displayed looks like cp1252:
>>> b'COMUNICA\xc7\xc3O M\xc1QUINAS'.decode('cp1252')
'COMUNICAÇÃO MÁQUINAS'
In Notepad++, on the menu select Encoding->Character sets->Western European->Windows-1252 and your file should display correctly.
Here's an example that converts to UTF-8 (your output excerpt):
>>> b'SPLEONARDO SIM\xc3O ESTARLING'.decode('cp1252')
'SPLEONARDO SIMÃO ESTARLING'
>>> b'SPLEONARDO SIM\xc3O ESTARLING'.decode('cp1252').encode('utf8')
b'SPLEONARDO SIM\xc3\x83O ESTARLING'
For your example code, you can do:
with open(txtFile, 'r', encoding='cp1252') as f:
    with open('PROF_SAI_318_210117_310117_clean.txt', 'w', encoding='utf8') as g:
        for line in f:
            g.write(line)
If your files aren't too large, you can just do:
with open(txtFile, 'r', encoding='cp1252') as f:
    with open('PROF_SAI_318_210117_310117_clean.txt', 'w', encoding='utf8') as g:
        g.write(f.read())

Unicode Error when extracting XML file Python

import os, csv, io
from xml.etree import ElementTree
file_name = "example.xml"
full_file = os.path.abspath(os.path.join("xml", file_name))
dom = ElementTree.parse(full_file)
Fruit = dom.findall("Fruit")
with io.open('test.csv', 'w', encoding='utf8') as fp:
    a = csv.writer(fp, delimiter=',')
    for f in Fruit:
        Explanation = f.findtext("Explanation")
        Types = f.findall("Type")
        for t in Types:
            Type = t.text
            a.writerow([Type, Explanation])
I am extracting data from an XML file and putting it into a CSV file. I am getting the error message below, probably because the extracted data contains a Fahrenheit sign. How can I get rid of these Unicode errors without manually fixing the XML file?
For the last line of my code I get this error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 1267: ordinal not in range(128)
<Fruits>
    <Fruit>
        <Family>Citrus</Family>
        <Explanation>They cannot grow at a temperature below 32 °F</Explanation>
        <Type>Orange</Type>
        <Type>Lemon</Type>
        <Type>Lime</Type>
        <Type>Grapefruit</Type>
    </Fruit>
</Fruits>
You didn't say where the error occurs; probably it's in the last line. You have to encode the strings yourself:
with open('test.csv', 'w') as fp:
    a = csv.writer(fp, delimiter=',')
    for f in Fruit:
        explanation = f.findtext("Explanation")
        types = f.findall("Type")
        for t in types:
            a.writerow([t.text.encode('utf8'), explanation.encode('utf8')])
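For comparison, on Python 3 the csv module works with text natively, so the encoding can be left to the file object. A minimal sketch of the same loop (no manual encode() calls needed; newline='' is what the csv module expects):
import csv

with open('test.csv', 'w', encoding='utf8', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    for f in Fruit:
        explanation = f.findtext("Explanation")
        for t in f.findall("Type"):
            a.writerow([t.text, explanation])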

Python issues on character encoding

I'm working on a program that needs to take two files, merge them, and write the union to a new file. The problem is that the output file contains chars like \xf0, or, if I change some of the encodings, something like \u0028. The input files are encoded in utf8. How can I print chars like "è", "ò" and "-" to the output file?
I have written this code:
import codecs
import pandas as pd
import numpy as np

goldstandard = "..\\files\\file1.csv"
tweets = "..\\files\\file2.csv"

with codecs.open(tweets, "r", encoding="utf8") as t:
    tFile = pd.read_csv(t, delimiter="\t",
                        names=['ID', 'Tweet'],
                        quoting=3)
IDs = tFile['ID']
tweets = tFile['Tweet']

dict = {}
for i in range(len(IDs)):
    dict[np.int64(IDs[i])] = [str(tweets[i])]

with codecs.open(goldstandard, "r", encoding="utf8") as gs:
    for line in gs:
        columns = line.split("\t")
        index = np.int64(columns[0])
        rowValue = dict[index]
        rowValue.append([columns[1], columns[2], columns[3], columns[5]])
        dict[index] = rowValue

import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)

f = codecs.open("out.csv", "w", "utf8")
f.write(ndic)
f.close()
and this is an example of the output:
desired: Beyoncè
obtained: Beyonc\xe9
You are producing Python string literals here:
import pprint
pprint.pprint(dict)
ndic = pprint.pformat(dict, indent=4)
Pretty-printing is useful for producing debugging output; objects are passed through repr() to make non-printable and non-ASCII characters easily distinguishable and reproducible:
>>> import pprint
>>> value = u'Beyonc\xe9'
>>> value
u'Beyonc\xe9'
>>> print value
Beyoncé
>>> pprint.pprint(value)
u'Beyonc\xe9'
The é character is in the Latin-1 range, outside of the ASCII range, so it is represented with syntax that produces the same value again when used in Python code.
Don't use pprint if you want to write out actual string values to the output file. You'll have to do your own formatting in that case.
Moreover, the pandas dataframe will hold bytestrings, not unicode objects, so you still have undecoded UTF-8 data at that point.
Personally, I'd not even bother using pandas here; you appear to want to write CSV data, so I've simplified your code to use the csv module instead. I'm also not actually bothering to decode the UTF-8 here; that is safe in this case because both input and output are entirely UTF-8:
import csv

tweets = {}  # maps tweet ID -> tweet text
with open("..\\files\\file2.csv", "rb") as t:  # the tweets file from the question
    reader = csv.reader(t, delimiter='\t')
    for id_, tweet in reader:
        tweets[id_] = tweet

with open(goldstandard, "rb") as gs, open("out.csv", 'wb') as outf:
    reader = csv.reader(gs, delimiter='\t')
    writer = csv.writer(outf, delimiter='\t')  # csv.writer, not csv.reader
    for columns in reader:
        index = columns[0]
        writer.writerow([tweets[index]] + columns[1:4] + [columns[5]])
Note that you really want to avoid using dict as a variable name; it masks the built-in type. I used tweets instead.

How to remove non-UTF-8 characters and save as a CSV file in Python

I have some Amazon review data, and I have successfully converted it from text format to CSV format. Now the problem is that when I try to read it into a dataframe using pandas, I get this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte
I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove the non-UTF-8 bytes and save to another CSV file?
Thank you!
EDIT1:
Here is the code I used to convert the text to CSV:
import csv
import string
INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"
header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]

f = open(INPUT_FILE_NAME, encoding="utf-8")
outfile = open(OUTPUT_FILE_NAME, "w")
outfile.write(",".join(header) + "\n")

currentLine = []
for line in f:
    line = line.strip()
    # need to remove the , so that the review text won't be split across many columns
    line = line.replace(',', '')
    if line == "":
        outfile.write(",".join(currentLine))
        outfile.write("\n")
        currentLine = []
        continue
    parts = line.split(":", 1)
    currentLine.append(parts[1])
if currentLine != []:
    outfile.write(",".join(currentLine))

f.close()
outfile.close()
EDIT2:
Thanks to all of you for trying to help me out.
I have solved it by modifying how the output file is opened in my code:
outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
If the input file is not utf-8 encoded, it is probably not a good idea to try to read it as utf-8...
You basically have 2 ways to deal with decode errors:
use a charset that will accept any byte, such as iso-8859-15, also known as latin9
if the output should be utf-8 but contains errors, use errors='ignore' (silently removes non-UTF-8 bytes) or errors='replace' (replaces undecodable bytes with a replacement marker: U+FFFD when decoding, usually ? when encoding)
For example:
f = open(INPUT_FILE_NAME,encoding="latin9")
or
f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
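The difference between the two error handlers is easy to see on a small example. Here \xe9 is not valid UTF-8 on its own (Python 3 shown; note the replacement marker is U+FFFD, not ?):
>>> b'caf\xe9 latte'.decode('utf-8', errors='ignore')
'caf latte'
>>> b'caf\xe9 latte'.decode('utf-8', errors='replace')
'caf\ufffd latte'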
If you are using Python 3, it provides built-in support for Unicode content:
f = open('file.csv', encoding="utf-8")
If you still want to remove all non-ASCII data from it, you can read it as a normal text file and strip the non-ASCII content:
import re

def remove_unicode(string_data):
    """ (str|unicode) -> (str|unicode)
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, bytes):
        string_data = string_data.decode('ascii', 'ignore')
    else:
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)

with open('file.csv', 'r+', encoding="utf-8") as csv_file:
    content = remove_unicode(csv_file.read())
    csv_file.seek(0)
    csv_file.write(content)
    csv_file.truncate()  # the cleaned text may be shorter; drop the leftover tail
Now you can read it without any unicode data issues.

Python: Decode base64 multiple strings in a file

I'm new to Python and I have a file like this:
cw==ZA==YQ==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==dA==ZQ==cw==dA==
It's keyboard input, encoded with base64, and now I want to decode it.
I tried this, but the code stops at the first decoded character.
import base64
file = "my_file.txt"
fin = open(file, "rb")
binary_data = fin.read()
fin.close()
b64_data = base64.b64decode(binary_data)
b64_fname = "original_b64.txt"
fout = open(b64_fname, "w")
fout.write(b64_data)
fout.close()
Any help is welcome. Thanks!
I assume that you created your test input string yourself.
If I split your test input string into blocks of 4 characters and decode each one separately, I get the following:
>>> import base64
>>> s = 'cw==ZA==YQ==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==dA==ZQ==cw==dA=='
>>> ''.join(base64.b64decode(s[i:i+4]) for i in range(0, len(s), 4))
'sdadasdasdasdasdtest'
However, the correct base64 encoding of your test string sdadasdasdasdasdtest is:
>>> base64.b64encode('sdadasdasdasdasdtest')
'c2RhZGFzZGFzZGFzZGFzZHRlc3Q='
If you place this string in my_file.txt (and rewrite your code to be a bit more concise), then it all works:
import base64

with open("my_file.txt") as f, open("original_b64.txt", 'w') as g:
    encoded = f.read()
    decoded = base64.b64decode(encoded)
    g.write(decoded)
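If you really do need to consume the malformed file as-is, the blocks have to be decoded independently. A sketch in the same Python 2 style as above, assuming the file truly consists of independent 4-character base64 blocks:
import base64

with open("my_file.txt") as f, open("original_b64.txt", 'w') as g:
    encoded = f.read().strip()
    # decode each 4-character base64 block on its own and join the results
    decoded = ''.join(base64.b64decode(encoded[i:i+4])
                      for i in range(0, len(encoded), 4))
    g.write(decoded)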
