Unicode Error when extracting XML file in Python

import os, csv, io
from xml.etree import ElementTree

file_name = "example.xml"
full_file = os.path.abspath(os.path.join("xml", file_name))

dom = ElementTree.parse(full_file)
Fruit = dom.findall("Fruit")

with io.open('test.csv', 'w', encoding='utf8') as fp:
    a = csv.writer(fp, delimiter=',')
    for f in Fruit:
        Explanation = f.findtext("Explanation")
        Types = f.findall("Type")
        for t in Types:
            Type = t.text
            a.writerow([Type, Explanation])
I am extracting data from an XML file and writing it to a CSV file. I am getting the error message below, probably because the extracted data contains a Fahrenheit sign. How can I get rid of these Unicode errors without manually fixing the XML file?
For the last line of my code I get this error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb0' in position 1267: ordinal not in range(128)
<Fruits>
    <Fruit>
        <Family>Citrus</Family>
        <Explanation>They cannot grow at a temperature below 32 °F</Explanation>
        <Type>Orange</Type>
        <Type>Lemon</Type>
        <Type>Lime</Type>
        <Type>Grapefruit</Type>
    </Fruit>
</Fruits>

You didn't say where the error occurs; probably in the last line. Python 2's csv module writes bytestrings, so you have to encode the strings yourself:
with open('test.csv', 'w') as fp:
    a = csv.writer(fp, delimiter=',')
    for f in Fruit:
        explanation = f.findtext("Explanation")
        types = f.findall("Type")
        for t in types:
            # encode each unicode value to UTF-8 bytes before handing it to the writer
            a.writerow([t.text.encode('utf8'), explanation.encode('utf8')])
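On Python 3 the csv module works with text natively, so no manual encoding is needed. A minimal sketch of the same loop, assuming the example.xml layout shown above:

import csv
import os
from xml.etree import ElementTree

full_file = os.path.abspath(os.path.join("xml", "example.xml"))
dom = ElementTree.parse(full_file)

# newline='' is what the csv module expects for its file handle
with open('test.csv', 'w', encoding='utf-8', newline='') as fp:
    writer = csv.writer(fp, delimiter=',')
    for fruit in dom.findall("Fruit"):
        explanation = fruit.findtext("Explanation")
        for t in fruit.findall("Type"):
            writer.writerow([t.text, explanation])  # str values need no .encode()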

Related

Python codecs encoding not working

I have this code
import collections
import csv
import sys
import codecs
from xml.dom.minidom import parse
import xml.dom.minidom

String = collections.namedtuple("String", ["tag", "text"])

def read_translations(filename):  # Reads a csv file with rows made up of 2 columns: the string tag, and the translated tag
    with codecs.open(filename, "r", encoding='utf-8') as csvfile:
        csv_reader = csv.reader(csvfile, delimiter=",")
        result = [String(tag=row[0], text=row[1]) for row in csv_reader]
    return result
The CSV file I'm reading contains Brazilian Portuguese characters. When I try to run this, I get an error:
'utf8' codec can't decode byte 0x88 in position 21: invalid start byte
I'm using Python 2.7. As you can see, I'm specifying the encoding with codecs, but it doesn't work.
Any ideas?
The idea of this line:
with codecs.open(filename, "r", encoding='utf-8') as csvfile:
is to say "This file was saved as utf-8. Please make appropriate conversions when reading from it."
That works fine if the file really was saved as utf-8; if some other encoding was used, the decoding fails.
What then?
Determine which encoding was used. Assuming the information cannot be obtained from the software which created the file - guess.
Open the file normally and print each line:
with open(filename, 'rt') as f:
    for line in f:
        print repr(line)
Then look for a character which is not ASCII, e.g. ñ - this letter will be printed as some code, e.g.:
'espa\xc3\xb1ol'
Above, ñ is represented as \xc3\xb1, because that is the utf-8 sequence for it.
Now, you can check what various encodings would give and see which is right:
>>> ntilde = u'\N{LATIN SMALL LETTER N WITH TILDE}'
>>>
>>> print repr(ntilde.encode('utf-8'))
'\xc3\xb1'
>>> print repr(ntilde.encode('windows-1252'))
'\xf1'
>>> print repr(ntilde.encode('iso-8859-1'))
'\xf1'
>>> print repr(ntilde.encode('macroman'))
'\x96'
Or print all of them:
import encodings.aliases

for c in encodings.aliases.aliases:
    try:
        encoded = ntilde.encode(c)
        print c, repr(encoded)
    except:
        pass
Then, when you have guessed which encoding it is, use that, e.g.:
with codecs.open(filename, "r", encoding='iso-8859-1') as csvfile:
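If guessing by hand gets tedious, the third-party chardet library can make the first guess for you (an assumption here: chardet is installed; its answer is a statistical guess, not a guarantee):

import chardet

with open(filename, 'rb') as f:
    raw = f.read()

result = chardet.detect(raw)  # e.g. {'confidence': 0.73, 'encoding': 'ISO-8859-1'}
print result['encoding'], result['confidence']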

Pandas cannot load data, csv encoding mystery

I am trying to load a dataset into pandas and cannot seem to get past step 1. I am new, so please forgive me if this is obvious; I have searched previous topics and not found an answer. The data is mostly in Chinese characters, which may be the issue.
The .csv is very large, and can be found here: http://weiboscope.jmsc.hku.hk/datazip/
I am trying week 1.
In my code below, I show the three decoding attempts I made, including an attempt to detect which encoding was used:
import pandas
import chardet
import os
#this is what I tried to start
data = pandas.read_csv('week1.csv', encoding="utf-8")
#spits out error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9a in position 69: invalid start byte
#Code to check encoding -- this spits out ascii
bytes = min(32, os.path.getsize('week1.csv'))
raw = open('week1.csv', 'rb').read(bytes)
chardet.detect(raw)
#so i tried this! it also fails, which isn't that surprising since i don't know how you'd do chinese chars in ascii anyway
data = pandas.read_csv('week1.csv', encoding="ascii")
#spits out error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xe6 in position 0: ordinal not in range(128)
#for god knows what reason this allows me to load data into pandas, but definitely not correct encoding because when I print out first 5 lines its gibberish instead of Chinese chars
data = pandas.read_csv('week1.csv', encoding="latin1")
Any help would be greatly appreciated!
EDIT: The answer provided by @Kristof does in fact work, as does the program a colleague of mine put together yesterday:
import csv
import pandas as pd
def clean_weiboscope(file, nrows=0):
    res = []
    with open(file, 'r', encoding='utf-8', errors='ignore') as f:
        reader = csv.reader(f)
        for i, row in enumerate(f):
            row = row.replace('\n', '')
            if nrows > 0 and i > nrows:
                break
            if i == 0:
                headers = row.split(',')
            else:
                res.append(tuple(row.split(',')))
    df = pd.DataFrame(res)
    return df

my_df = clean_weiboscope('week1.csv', nrows=0)
I also wanted to add for future searchers that this is the Weiboscope open data for 2012.
It seems that there's something very wrong with the input file. There are encoding errors throughout.
One thing you could do is read the CSV file as binary, decode the binary string and replace the erroneous characters.
Example (source for the chunk-reading code):
in_filename = 'week1.csv'
out_filename = 'repaired.csv'

from functools import partial

chunksize = 100*1024*1024  # read 100MB at a time

# Decode with UTF-8 and replace errors with "?"
with open(in_filename, 'rb') as in_file:
    with open(out_filename, 'w', encoding='utf-8') as out_file:  # write the repaired text as UTF-8
        for byte_fragment in iter(partial(in_file.read, chunksize), b''):
            out_file.write(byte_fragment.decode(encoding='utf_8', errors='replace'))

# Now read the repaired file into a dataframe
import pandas as pd

df = pd.read_csv(out_filename)
df.shape
>> (4790108, 11)
df.head()
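A variant that skips the intermediate file: pandas read_csv accepts an already-open file object, so the same errors='replace' handling can be applied inline (a sketch under the same assumption that the file is mostly UTF-8; recent pandas versions also expose an encoding_errors parameter):

import pandas as pd

# the text layer replaces undecodable bytes before pandas ever sees them
with open('week1.csv', encoding='utf-8', errors='replace') as f:
    df = pd.read_csv(f)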

how to remove non utf 8 code and save as a csv file python

I have some Amazon review data that I have successfully converted from text format to CSV format. Now the problem is that when I try to read it into a dataframe using pandas, I get this error message:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 13: invalid start byte
I understand there must be some non-UTF-8 bytes in the raw review data. How can I remove the non-UTF-8 content and save to another CSV file?
Thank you!
EDIT1:
Here is the code I used to convert the text to CSV:
import csv
import string

INPUT_FILE_NAME = "small-movies.txt"
OUTPUT_FILE_NAME = "small-movies1.csv"

header = [
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text"]

f = open(INPUT_FILE_NAME, encoding="utf-8")
outfile = open(OUTPUT_FILE_NAME, "w")
outfile.write(",".join(header) + "\n")

currentLine = []
for line in f:
    line = line.strip()
    # need to remove the , so that the review text won't be split across many columns
    line = line.replace(',', '')
    if line == "":
        outfile.write(",".join(currentLine))
        outfile.write("\n")
        currentLine = []
        continue
    parts = line.split(":", 1)
    currentLine.append(parts[1])
if currentLine != []:
    outfile.write(",".join(currentLine))

f.close()
outfile.close()
EDIT2:
Thanks to all of you trying to helping me out.
So I solved it by modifying the output format in my code:
outfile = open(OUTPUT_FILE_NAME,"w",encoding="utf-8")
If the input file is not UTF-8 encoded, it is probably not a good idea to try to read it as UTF-8...
You have basically 2 ways to deal with decode errors:
- use a charset that will accept any byte, such as iso-8859-15, also known as latin9
- if the output should be utf-8 but contains errors, use errors='ignore' (silently removes non-UTF-8 characters) or errors='replace' (replaces non-UTF-8 characters with a replacement marker, usually ?)
For example:
f = open(INPUT_FILE_NAME,encoding="latin9")
or
f = open(INPUT_FILE_NAME,encoding="utf-8", errors='replace')
If you are using Python 3, it provides built-in support for Unicode content:
f = open('file.csv', encoding="utf-8")
If you still want to remove all non-ASCII data from it, you can read it as a normal text file and strip the non-ASCII content:
import re

def remove_unicode(string_data):
    """ (str|unicode) -> (str|unicode)
    recovers ascii content from string_data
    """
    if string_data is None:
        return string_data
    if isinstance(string_data, bytes):
        # decode bytes to text, dropping anything that is not ASCII
        string_data = string_data.decode('ascii', 'ignore')
    else:
        # round-trip through ASCII to drop non-ASCII characters
        string_data = string_data.encode('ascii', 'ignore').decode('ascii')
    remove_ctrl_chars_regex = re.compile(r'[^\x20-\x7e]')
    return remove_ctrl_chars_regex.sub('', string_data)
with open('file.csv', 'r+', encoding="utf-8") as csv_file:
    content = remove_unicode(csv_file.read())
    csv_file.seek(0)
    csv_file.write(content)
    csv_file.truncate()  # drop leftover bytes if the cleaned content is shorter
Now you can read it without any unicode data issues.

Struggling with unicode in Python

I'm trying to automate the extraction of data from a large number of files, and it works for the most part. It just falls over when it encounters non-ASCII characters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 5: ordinal not in range(128)
How do I set my 'brand' to UTF-8? My code is being repurposed from something else (which was using lxml), and that didn't have any issues. I've seen lots of discussions about encode / decode, but I don't understand how I'm supposed to implement it. The below is cut down to just the relevant code - I've removed the rest.
i = 0
filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]
for i in range(len(filenames)):
    pathname = filenames[i]
    fin = open(pathname, 'r')
    with codecs.open(('Assets' + '.log'), mode='w', encoding='utf-8') as f:
        f.write(u'File Path|Brand\n')
        lines = fin.read()
        brand_start = lines.find("Brand Title")
        brand_end = lines.find("/>", brand_start)
        brand = lines[brand_start+47:brand_end-2]
        f.write(u'{}|{}\n'.format(pathname[4:35], brand))
    fin.close()
I'm sure there is a better way to write the whole thing, but at the moment my focus is just on trying to understand how to get the lines / read functions to work with UTF-8.
You are mixing bytestrings with Unicode values; your fin file object produces bytestrings, and you are mixing it with Unicode here:
f.write(u'{}|{}\n'.format(pathname[4:35],brand))
brand is a bytestring, interpolated into a Unicode format string. Either decode brand there, or better yet, use io.open() (rather than codecs.open(), which is not as robust as the newer io module) to manage both your files:
with io.open('Assets.log', 'w', encoding='utf-8') as f,\
        io.open(pathname, encoding='utf-8') as fin:
    f.write(u'File Path|Brand\n')
    lines = fin.read()
    brand_start = lines.find(u"Brand Title")
    brand_end = lines.find(u"/>", brand_start)
    brand = lines[brand_start + 47:brand_end - 2]
    f.write(u'{}|{}\n'.format(pathname[4:35], brand))
You also appear to be parsing out an XML file by hand; perhaps you want to use the ElementTree API instead to parse out those values. In that case, you'd open the file without io.open(), so producing byte strings, so that the XML parser can correctly decode the information to Unicode values for you.
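For the ElementTree route, a minimal sketch; the BrandTitle tag and Asset_Name attribute are hypothetical stand-ins, since the question never shows the real XML layout:

import io
from xml.etree import ElementTree

pathname = 'Distributor/example.xml'  # hypothetical path; take it from the os.walk() list
tree = ElementTree.parse(pathname)  # the parser decodes the bytes itself

with io.open('Assets.log', 'w', encoding='utf-8') as f:
    f.write(u'File Path|Brand\n')
    # hypothetical structure: adjust the tag and attribute names to the real files
    for node in tree.getroot().iter('BrandTitle'):
        f.write(u'{}|{}\n'.format(pathname[4:35], node.get('Asset_Name')))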
This is my final code, using the guidance from above. It's not pretty, but it solves the problem. I'll look at getting it all working using lxml at a later date (as this is something I've encountered before when working with different, larger xml files):
import lxml
import io
import os
from lxml import etree
from glob import glob

nsmap = {'xmlns': 'thisnamespace'}
i = 0
filenames = [y for x in os.walk("Distributor") for y in glob(os.path.join(x[0], '*.xml'))]

with io.open(('Assets.log'), 'w', encoding='utf-8') as f:
    f.write(u'File Path|Series|Brand\n')
    for i in range(len(filenames)):
        pathname = filenames[i]
        parser = lxml.etree.XMLParser()
        tree = lxml.etree.parse(pathname, parser)
        root = tree.getroot()
        with io.open(pathname, encoding='utf-8') as fin:
            for info in root.xpath('//somepath'):
                series_x = info.find('./somemorepath')
                series = series_x.get('Asset_Name') if series_x is not None else 'Missing'
                lines = fin.read()
                brand_start = lines.find(u"sometext")
                brand_end = lines.find(u"/>", brand_start)
                brand = lines[brand_start:brand_end-2]
                brand = brand[(brand.rfind("/"))+1:]
                f.write(u'{}|{}|{}\n'.format(pathname[5:42], series, brand))
Someone will now come along and do it all in one line!

Python: Decode base64 multiple strings in a file

I'm new to python and I have a file like this:
cw==ZA==YQ==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==dA==ZQ==cw==dA==
It's keyboard input, encoded with base64, and now I want to decode it.
I tried this, but the code stops at the first decoded character.
import base64

file = "my_file.txt"
fin = open(file, "rb")
binary_data = fin.read()
fin.close()

b64_data = base64.b64decode(binary_data)

b64_fname = "original_b64.txt"
fout = open(b64_fname, "w")
fout.write(b64_data)
fout.close()
Any help is welcome. Thanks!
I assume that you created your test input string yourself.
If I split your test input string in blocks of 4 characters and decode each one apart, I get the following:
>>> import base64
>>> s = 'cw==ZA==YQ==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==YQ==cw==ZA==dA==ZQ==cw==dA=='
>>> ''.join(base64.b64decode(s[i:i+4]) for i in range(0, len(s), 4))
'sdadasdasdasdasdtest'
However, the correct base64 encoding of your test string sdadasdasdasdasdtest is:
>>> base64.b64encode('sdadasdasdasdasdtest')
'c2RhZGFzZGFzZGFzZGFzZHRlc3Q='
If you place this string in my_file.txt (and rewriting your code to be a bit more concise) then it all works.
import base64
with open("my_file.txt") as f, open("original_b64.txt", 'w') as g:
encoded = f.read()
decoded = base64.b64decode(encoded)
g.write(decoded)
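Under Python 3 the same idea needs one extra step, because b64decode returns bytes rather than str (a sketch, assuming the file contains the corrected base64 string):

import base64

with open("my_file.txt") as f, open("original_b64.txt", "w") as g:
    encoded = f.read().strip()
    decoded = base64.b64decode(encoded)  # bytes in Python 3
    g.write(decoded.decode("ascii"))     # the decoded payload here is plain ASCII text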
