Read Unicode from CSV [duplicate]

I have a problem reading unicode characters from a csv. The csv file originally had elements with unicode tags:
"[u'Aeron\xe1utica']"
"[u'Ni\u0161']"
"[u'K\xfcnste']"
...
from which I had to remove the u'' tags to give a csv with
Aeron\xe1utica
Ni\u0161
K\xfcnste
....
Now I want to read the csv and output it into a file with the actual characters, i.e.
Aeronáutica
Niš
Künste
....
I tried using the UnicodeWriter from the csv docs, but it gives the same output as the second list.
Here's what I did to read and write:
import csv
import codecs

p = []
c = open('foo.csv', 'r')
reader = csv.reader(c)
for row in reader:
    p = p + row
# The elements in p were ['Aeron\\xe1utica', 'Ni\\u0161', 'K\\xfcnste', ...]

c = open('bar.csv', 'w')
c.write(codecs.BOM_UTF8)
writer = UnicodeWriter(c)  # UnicodeWriter recipe from the csv docs
for row in p:
    writer.writerow([row])
I also tried codecs.open('', '', 'UTF-8') for both reading and writing, but it didn't help.

It appears you have written Python lists directly to your CSV file, producing the [...] literal syntax instead of normal columns. You then removed most of the information that could have been used to turn the data back into Python lists with unicode strings again.
What you have left are Python unicode literals, but without the quotes. Use the unicode_escape codec to decode the values to Unicode again:
with open('foo.csv', 'r') as b0rken:
    for line in b0rken:
        value = line.rstrip('\r\n').decode('unicode_escape')
        print value
or add back the u'..' quoting, using a triple-quoted string in an attempt to avoid needing to escape embedded quotes:
from ast import literal_eval

with open('foo.csv', 'r') as b0rken:
    for line in b0rken:
        value = literal_eval("u'''{}'''".format(line.rstrip('\r\n')))
        print value
If you still have the original file (with the [u'...'] formatted lines), use the ast.literal_eval() function to turn those back into Python lists. No point in using the CSV module here:
from ast import literal_eval
with open('foo.csv', 'r') as b0rken:
    for line in b0rken:
        lis = literal_eval(line)
        value = lis[0]
        print value
Demo with unicode_escape:
>>> for line in b0rken:
...     print line.rstrip('\r\n').decode('unicode_escape')
...
Aeronáutica
Niš
Künste
École de l'Air
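To get from there to the file with real characters that the question asks for, a minimal end-to-end sketch (assuming Python 2; foo.csv and bar.csv are the hypothetical names used above):
import io

with open('foo.csv', 'r') as src, io.open('bar.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        # decode('unicode_escape') turns the literal escapes back into
        # real characters; io.open then writes them out as UTF-8
        value = line.rstrip('\r\n').decode('unicode_escape')
        dst.write(value + u'\n')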


Get rid of unicode decimal character

I have a huge file which looks like this:
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h&#7893mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;h&#432&#417u cao cổ152;298;0
6854;huy&#7873n đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
As you can see, the file contains some decimal character references. I would like to replace all of them with their corresponding characters before using the file. Even opening it with UTF-8 encoding does not suppress the errors.
Do you know a way to do it? I want to create a dictionary and retrieve the numbers at index 2.
For 6883;jumarre;83;295;0; => I have 83.
For 6887;khướu;62;325;0 => I have &#7899, which is wrong; I should have 62.
with codecs.open('JeuxdeMotsPolarise_test.txt', 'r', 'utf-8', errors='ignore') as text_file:
    text_file = (text_file.read())
    # print(text_file)
dico_lexique = ({i.split(";")[1]: i.split(";")[2:] for i in text_file.split("\n") if i})
This is the result given by trying @serge's proposition, but it leaves blank spaces between lines.
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hi âu;81;294;0
6819;hi cu;64;338;0
6820;hi yn;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hu cao c;152;298;0
6854;huyn ;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kn kn;73;303;0
6886;khoang;64;323;0
6887;khu;62;325;0
Edit: I redownloaded the original file and the error of the missing ";" was corrected. For example:
=> 6850;hổ mang;54;298;0 (that is how it appears in the now updated file)
Thank you everybody.
@PanagiotisKanavos has correctly guessed that html.unescape is able to replace the XML character references with their Unicode characters. The hard part is that some references are correctly ended with their terminating semicolon (;) while others are not. And in the latter case, if an entity is followed by a semicolon separator, the separator will be eaten by the conversion, shifting the following fields.
So the only reliable way is to:
process the file line by line as a CSV file with ; as the delimiter
if necessary, concatenate the middle fields, from the second to the fourth counting from the end
unescape that middle field
If you want to convert the file, you could do:
import csv
import html

with open('file.csv') as fd, open('fixed.csv', 'w', newline='') as fdout:
    rd = csv.reader(fd, delimiter=';')
    wr = csv.writer(fdout, delimiter=';')
    for row in rd:
        if len(row) > 5:
            # an unterminated reference ate a separator: glue the name back
            row[1] = ';'.join(row[1:len(row) - 3])
            del row[2:len(row) - 3]
        row[1] = html.unescape(row[1])
        wr.writerow(row)
If you only want to build a mapping from field 0 to field 2:
import csv

values = {}
with open('file.csv') as fd:
    rd = csv.reader(fd, delimiter=';')
    for row in rd:
        values[row[0]] = row[-3]
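A quick sanity check of that mapping against the sample rows (a sketch; the IDs are taken from the question, and the file is assumed to have been parsed as above):
print(values['6883'])  # -> '83'
print(values['6887'])  # -> '62', even though the name field was mangled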
This text isn't UTF-8 or Unicode in general. It's HTML-encoded text, most likely Vietnamese. Those escape sequences correspond to Vietnamese characters; for example, &#432 is ư - in fact, I just typed the escape sequence in the SO edit box and the correct character appeared. &#7899; is ớ.
Copying the entire text outside a code block produces
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h&#7893mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;h&#432&#417u cao cổ152;298;0
6854;huy&#7873n đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
Googling for Họ Khướu returns this Wikipedia page about Họ Khướu.
I think it's safe to assume this is HTML-encoded Vietnamese text. To convert it to Unicode you can use html.unescape:
import html
line='6887;khướu;62;325;0'
properLine=html.unescape(line)
UPDATE
The text posted above is just the original text with an extra newline per line. It's SO's markdown renderer that converts the escape sequences to the corresponding glyphs.
The funny thing is that this line:
6853;h&#432&#417u cao cổ152;298;0
can't be rendered, because the HTML entities aren't properly terminated. html.unescape, on the other hand, will convert the characters. Clearly, html.unescape is far more forgiving than SO's markdown renderer.
Either of these lines:
html.unescape('6853;hươu cao cổ152;298;0')
html.unescape('6853;h&#432&#417u cao cổ152;298;0')
returns:
6853;h\u01b0\u01a1u cao c\u1ed5152;298;0
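That forgiving behaviour is easy to verify directly (a quick illustration, assuming Python 3):
import html

# Terminated and unterminated numeric references both convert
print(html.unescape('&#432;'))  # ư
print(html.unescape('&#432'))   # ư as well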
Repair the file first before you load it into a CSV parser.
Assuming Maarten in the comments is right, change the encoding:
iconv -f cp1252 -t utf-8 < JeuxdeMotsPolarise_test.txt > JeuxdeMotsPolarise_test.utf8.txt
Then replace the escapes with proper characters.
perl -C -i -lpe'
    s/&#([0-9]+);?/chr $1/eg;  # replace entities
    s/;?(\d+;\d+;\d+)$/;$1/;   # put back the semicolon
                               # if it was consumed accidentally
' JeuxdeMotsPolarise_test.utf8.txt
Contents of JeuxdeMotsPolarise_test.utf8.txt after running the substitutions:
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ;152;298;0
6854;huyền đề;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
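For readers who prefer to stay in Python, a rough equivalent of those two substitutions might look like this (a sketch, not part of the original answer; it assumes the file was already re-encoded to UTF-8 as above):
import re

def repair_line(line):
    # Replace decimal character references, terminated or not
    line = re.sub(r'&#([0-9]+);?', lambda m: chr(int(m.group(1))), line)
    # Put the field separator back if the reference swallowed it
    line = re.sub(r';?(\d+;\d+;\d+)$', r';\1', line)
    return line

with open('JeuxdeMotsPolarise_test.utf8.txt', encoding='utf-8') as src:
    for line in src:
        print(repair_line(line.rstrip('\n')))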

How to skip \n symbols when I export to csv?

I have a list that has this form
[['url', 'date', 'extractRaw', 'extractClean'], ['https://www.congress.gov/crec/2017/01/09/CREC-2017-01-09-senate.pdf', '20170109', 'UR\n\nIB\nU\n\nU\n\nE PL\n\nNU\n\nCo', '20170109', 'URIBUUE PLNUCo'], ['https://www.congress.gov/crec/2017/01/09/CREC-2017-01-09-senate.pdf', '20170109', 'UR\n\nIB\nU\n\nU\n\nE PL\n\nNU\n\nCo', '20170109', 'UURIBUUE PLNUCo']]
I'm exporting it to a CSV with this code
import csv

def exportCSV(flatList, filename):
    with open(filename + ".csv", "wb") as f:
        writer = csv.writer(f)
        writer.writerows(flatList)

exportCSV(textExport, 'textExport')
This version honors the newlines, and I end up with a CSV that starts a new line for every one of the \n symbols.
My desire is to get each entry in the list on its own separate line. It would look something like this
url date extractRaw extractClean
https://www.congress.gov/crec/2017/01/09/CREC-2017-01-09-senate.pdf 20170109 UR\n\nIB\nU\n\nU\n\nE PL\n\nNU\n\nCo URIBUUE PLNUCo
https://www.congress.gov/crec/2017/01/09/CREC-2017-01-09-senate.pdf 20170109 UR\n\nIB\nU\n\nU\n\nE PL\n\nNU\n\nCo URIBUUE PLNUCo
https://www.congress.gov/crec/2017/01/09/CREC-2017-01-09-senate.pdf 20170109 UR\n\nIB\nU\n\nU\n\nE PL\n\nNU\n\nCo URIBUUE PLNUCo
Does writer.writerows() support that? Can I get it to ignore the new line symbols?
It's not a duplicate. The \n is part of the block of text and the file is opened as 'wb'.
As you noticed, some of your strings contain \n chars.
"\n" (ASCII 0x0A (10), or LF (line feed)) is a special char that gets interpreted when such a string is written. In order to solve your problem, you could either:
Make your strings raw (I don't know how feasible that is); for more details, check the Python 2 documentation on string literals.
Manually escape any "\" (backslash) character preceding "n", so that "\n" becomes a string consisting of the 2 chars "\" and "n". Translated into code, you should replace your last line with (assuming that the textExport content is the one pasted at the beginning):
escapedTextExport = [[item.replace("\n", "\\n") for item in row] for row in textExport]
exportCSV(escapedTextExport, "textExport")
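When reading the file back, the escaping can be undone with the reverse replacement (a sketch under the same convention; the filename is the hypothetical one from above):
import csv

# Undo the "\n" -> "\\n" escaping applied before writing
with open("textExport.csv", "rb") as f:
    restored = [[cell.replace("\\n", "\n") for cell in row]
                for row in csv.reader(f)]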

regular expressions in python using quotes

I am attempting to create a regular expression pattern for strings similar to the ones below, which are stored in a file. The aim is to get any column for any row; the rows need not be on a single line. For example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns; column 4 for entity 2 would be "column\"this is, a test\"4b", and column 2 for entity 3 would be "column2c". Each column begins and ends with a quote; however, you must be careful because some columns contain escaped quotes. Thanks in advance!
You could do it like this, i.e.:
Read the whole file.
Split the input on each newline character which is not preceded by a comma.
Iterate over the split elements and split again on each comma (plus an optional following newline character) which is preceded and followed by a double quote.
Code:
import re

with open(file) as f:
    fil = f.read()
m = re.split(r'(?<!,)\n', fil.strip())
for i in m:
    print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check:
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
f is the input file and f.py is the file containing the Python script.
Your problem is terribly similar to one I have to deal with thrice every month :) Except I'm not using Python to solve it, but I can 'translate' what I usually do:
import re

text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''

# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')

# Read the file line by line
for line in text.split("\n"):
    # If there's no stored partial line, this is a new line
    if buffer == "":
        # Check if we get 4 columns and print; if not, put the line
        # into buffer so we store a partial line for later
        matches = check.findall(line)
        if len(matches) == columns:
            print matches
        else:
            # use line.strip() if you need to trim whitespace
            buffer = line
    else:
        # Update the variable (containing a partial line) with the
        # next line and recheck whether we get 4 columns
        # use line.strip() if you need to trim whitespace
        buffer = buffer + line
        matches = check.findall(buffer)
        # If we indeed get 4, our line is complete: print it and
        # empty the buffer now that we have a whole line
        if len(matches) == columns:
            print matches
            buffer = ""
        # Optional; always good to have a safety backdoor though.
        # If there is a problem with the csv itself, like a weird
        # unescaped quote, send it somewhere else
        elif len(matches) > columns:
            print "Error: cannot parse line:\n" + buffer
            buffer = ""
ideone demo
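For the record, once a logical row sits on a single line, Python's own csv module can also split it, provided it is told that the quotes are backslash-escaped rather than doubled. A minimal sketch (Python 3; single-row input assumed):
import csv
import io

row = r'"column1a","column2a","column3a,","column\"this is, a test\"4a"'
reader = csv.reader(io.StringIO(row), quotechar='"',
                    escapechar='\\', doublequote=False)
print(next(reader))
# ['column1a', 'column2a', 'column3a,', 'column"this is, a test"4a']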

Zeroes appearing when reading file (where there aren't any)

When reading a file (UTF-8 Unicode text, CSV) with Python on Linux, either with:
csv.reader()
file()
values of some columns get a zero as their first character (there are no zeroes in the input), others get a few zeroes, which are not seen when viewing the file with Geany or any other editor. For example:
Input
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
Output
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
See 378983561 > 0378983561
Reading with:
f = file('/home/foo/data.csv', 'r')
data = f.read()
split_data = data.splitlines()
lines = list(line.split(';') for line in split_data)
print lines[51220][8]
>>> '0378983561' # should have been '478983561' (reads like this in Geany etc.)
Same result with csv.reader().
Help me solve the mystery: what could be the cause of this? Could it be related to encoding/decoding?
The data you're getting is a string.
print lines[51220][8]
>>> '0478983561'
If you want to use this as an integer, you should parse it.
print int(lines[51220][8])
>>> 478983561
If you want this as a string, you should convert it back to a string.
print repr(int(lines[51220][8]))
>>> '478983561'
csv.reader treats all columns as strings. Conversion to the appropriate type is up to you as in:
print int(lines[51220][8])
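As a sketch of that conversion done while parsing (the semicolon delimiter and the path are assumed from the question):
import csv

# Parse the semicolon-delimited file; csv.reader yields strings,
# so convert the ninth column explicitly
with open('/home/foo/data.csv', 'r') as f:
    lines = [row for row in csv.reader(f, delimiter=';')]
print int(lines[51220][8])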

Downsides to reading strings from Excel in python using encode('utf-8')

I am reading a large amount of data from an Excel spreadsheet, which I read (and reformat and rewrite) using the following general structure:
from xlrd import open_workbook

book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = """formatting of the data im writing. important stuff is to the right -> """ + str(sheettwo.cell(z,y).value) + """ more formatting! """ + str(sheettwo.cell(z,x).value.encode('utf-8')) + """ and done"""
    out.write(toprint)
    out.write("\n")
where x and y are arbitrary cells in this case, with x being less arbitrary and containing UTF-8 characters.
So far I have only been using .encode('utf-8') in cells where I know there will be errors otherwise, or where I foresee an error without it.
My question is basically this: is there a disadvantage to using .encode('utf-8') on all of the cells, even if it is unnecessary? Efficiency is not an issue. The main issue is that it works even if there is a UTF-8 character in a place there shouldn't be. If no errors would occur when I just lump .encode('utf-8') onto every cell read, I will probably end up doing that.
The xlrd documentation states it clearly: "From Excel 97 onwards, text in Excel spreadsheets has been stored as Unicode." Since you are likely reading in files newer than 97, they contain Unicode codepoints anyway. It is therefore necessary to keep the content of these cells as Unicode within Python and not convert them to ASCII (which you do with the str() function). Use the code below:
book = open_workbook('file.xls')
sheettwo = book.sheet_by_index(1)
# Make sure you're writing Unicode encoded in UTF-8
out = open('output.file', 'w')
for i in range(sheettwo.nrows):
    z = i + 1
    toprint = u"formatting of the data im writing. important stuff is to the right -> " + unicode(sheettwo.cell(z,y).value) + u" more formatting! " + unicode(sheettwo.cell(z,x).value) + u" and done\n"
    out.write(toprint.encode('UTF-8'))
This answer is really a few mild comments on the accepted answer, but they need better formatting than the SO comment facility provides.
(1) Avoiding the SO horizontal scrollbar enhances the chance that people will read your code. Try wrapping your lines, for example:
toprint = u"".join([
u"formatting of the data im writing. "
u"important stuff is to the right -> ",
unicode(sheettwo.cell(z,y).value),
u" more formatting! ",
unicode(sheettwo.cell(z,x).value),
u" and done\n"
])
out.write(toprint.encode('UTF-8'))
(2) Presumably you are using unicode() to convert floats and ints to unicode; it does nothing for values that are already unicode. Be aware that unicode(), like str(), gives you only 12 digits of precision for floats:
>>> unicode(123456.78901234567)
u'123456.789012'
If that is a bother, you might like to try something like this:
>>> def full_precision(x):
...     return unicode(repr(x) if isinstance(x, float) else x)
...
>>> full_precision(u'\u0400')
u'\u0400'
>>> full_precision(1234)
u'1234'
>>> full_precision(123456.78901234567)
u'123456.78901234567'
(3) xlrd builds Cell objects on the fly when demanded.
sheettwo.cell(z,y).value   # slower
sheettwo.cell_value(z,y)   # faster
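Putting (2) and (3) together, the loop body might end up like this (a sketch combining the suggestions above; x and y are still the hypothetical column indices from the question):
# full_precision() is the helper from point (2); cell_value() avoids
# building Cell objects per point (3)
toprint = u"".join([
    u"formatting of the data im writing. "
    u"important stuff is to the right -> ",
    full_precision(sheettwo.cell_value(z, y)),
    u" more formatting! ",
    full_precision(sheettwo.cell_value(z, x)),
    u" and done\n"
])
out.write(toprint.encode('UTF-8'))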
