When reading a file (UTF-8 Unicode text, csv) with Python on Linux, either with:
csv.reader()
file()
values of some columns get a zero as their first characeter (there are no zeroues in input), other get a few zeroes, which are not seen when viewing file with Geany or any other editor. For example:
Input
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
Output
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
See 378983561 > 0378983561
Reading with:
f = file('/home/foo/data.csv', 'r')
data = f.read()
split_data = data.splitlines()
lines = list(line.split(';') for line in split_data)
print data[51220][8]
>>> '0378983561' #should have been '478983561' (reads like this in Geany etc.)
Same result with csv.reader().
Help me solve the mystery, what could be the cause of this? Could it be related to encoding/decoding?
The data you're getting is a string.
print data[51220][8]
>>> '0478983561'
If you want to use this as an integer, you should parse it.
print int(data[51220][8])
>>> 478983561
If you want this as a string, you should convert it back to a string.
print repr(int(data[51220][8]))
>>> '478983561'
csv.reader treats all columns as strings. Conversion to the appropriate type is up to you as in:
print int(data[51220][8])
Related
I have a huge file which looks like this :
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ152;298;0
6854;huyền đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
As you can see the file contains some unicode decimal, I would like to replace all of them with their latin character before using the file. Even opening it with the utf-8 encoding, the errors are not suppress.
do you know a way to do it. I want to create a dictionary and retrieve the Numbers at index 2.
for : 6883;jumarre;83;295;0; => i have 83
for : 6887;khướu;62;325;0 => i have ớ => which is false , i should have 62
with codecs.open('JeuxdeMotsPolarise_test.txt', 'r', 'utf-8', errors = 'ignore') as text_file:
text_file =(text_file.read())
#print(text_file)
dico_lexique = ({i.split(";")[1]:i.split(";")[2:]for i in text_file.split("\n") if i})
This is the result given with trying #serge proposition, but it leaves blank spaces between lines.
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hi âu;81;294;0
6819;hi cu;64;338;0
6820;hi yn;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hu cao c;152;298;0
6854;huyn ;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kn kn;73;303;0
6886;khoang;64;323;0
6887;khu;62;325;0
Edit : I redownload the original file and the error of missing ";" was corrected.
for example:
=> 6850;hổ mang;54;298;0 (that is how is appeared in the now update file)
Thank you everybody
#PanagiotisKanavos has correctly guessed that html.unescape was able to replace the xml char reference with their unicode character. The hard part is that some refs are correctly ended with their terminating semicolon (;) while others are not. And in that latter case, if one entity if followed with a semicolon separator, the separator will be eaten by the convertion, shifting the following fields.
So the only reliable way is to:
process the file line by line as as CSV file with ; delimiter
eventually contatenate the middle field from the second to the fourth starting form the end
unescape that middle field
If you want to convert the file, you could do:
with open('file.csv') as fd, open('fixed.csv', 'w', newline='') as fdout:
rd = csv.reader(fd, delimiter=';')
wr = csv.writer(fdout, delimiter=';')
for row in rd:
if len(row)> 5:
row[1] = ';'.join(row[1:len(row)-3])
del row[2:len(row)-3]
row[1] = html.unescape(row[1])
wr.writerow(row)
If you only want to build a mapping from field 0 to field 2:
values = {}
with open('file.csv') as fd:
rd = csv.reader(fd, delimiter=';')
for row in rd:
values[field[0]] = field[-3]
This text isn't UTF8 or Unicode in general. It's HTML-encoded text, most likely Vietnamese. Those escape sequences correspond to Vietnamese characters, for example ư is ư - in fact, I just typed the edit sequence in the SO edit box and the correct character appeared. ớ is ớ.
Copying the entire text outside a code block produces
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ152;298;0
6854;huyền đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
Googling for Họ Khướu returns this Wikipedia page about Họ Khướu.
I think it's safe to assume this is HTML-encoded Vietnamese text. To convert it to Unicode you can use html.unescape :
import html
line='6887;khướu;62;325;0'
properLine=html.unescape(line)
UPDATE
The text posted above is just the original text with an extra newline per page. It's SO's markdown renderer that converts the escape sequences to the corresponding glyphs.
The funny thing is that this line :
6853;hươu cao cổ152;298;0
Can't be rendered because the HTML entities aren't properly terminated. html.unescape on the other hand will convert the characters. Clearly, html.unescape is far more forgiving than SO's markdown renderer.
Either of these lines :
html.unescape('6853;hươu cao cổ152;298;0')
html.unescape('6853;hươu cao cổ152;298;0')
Returns :
6853;h\u01b0\u01a1u cao c\u1ed5152;298;0
Repair the file first before you load it into a CSV parser.
Assuming Maarten in the comments is right, change the encoding:
iconv -f cp1252 -t utf-8 < JeuxdeMotsPolarise_test.txt > JeuxdeMotsPolarise_test.utf8.txt
Then replace the escapes with proper characters.
perl -C -i -lpe'
s/&#([0-9]+);?/chr $1/eg; # replace entities
s/;?(\d+;\d+;\d+)$/;$1/; # put back semicolon
# if it was consumed accidentally
' JeuxdeMotsPolarise_test.utf8.txt
Contents of JeuxdeMotsPolarise_test.utf8.txt after running the substitutions:
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ;152;298;0
6854;huyền đề;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
I open my file like so :
f = open("filename.ext", "rb") # ensure binary reading with b
My first line of data looks like this (when using f.readline()):
'\x04\x00\x00\x00\x12\x00\x00\x00\x04\x00\x00\x00\xb4\x00\x00\x00\x01\x00\x00\x00\x08\x00\x00\x00\x00\x00\x00\x00\x18\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x05\x00\x00\x00\x06\x00\x00\x00:\x00\x00\x00;\x00\x00\x00<\x00\x00\x007\x00\x00\x008\x00\x00\x009\x00\x00\x00\x07\x00\x00\x00\x08\x00\x00\x00\t\x00\x00\x00\n'
Thing is, I want to read this data byte by byte (f.read(4)). While debugging, I realized that when it gets to the end of the first line, it still takes in the newline character \n and it is used as the first byte of the following int I read. I don't want to simply use .splitlines()because some data could have an n inside and I don't want to corrupt it. I'm using Python 2.7.10, by the way. I also read that opening a binary file with the b parameter "takes care" of the new line/end of line characters; why is not the case with me?
This is what happens in the console as the file's position is right before the newline character:
>>> d = f.read(4)
>>> d
'\n\x00\x00\x00'
>>> s = struct.unpack("i", d)
>>> s
(10,)
(Followed from discussion with OP in chat)
Seems like the file is in binary format and the newlines are just mis-interpreted values. This can happen when writing 10 to the file for example.
This doesn't mean that newline was intended, and it is probably not. You can just ignore it being printed as \n and just use it as data.
You should just be able to replace the bytes that indicate it is a newline.
>>> d = f.read(4).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
>>> diff = 4 - len(d)
>>> while diff > 0: # You can probably make this more sophisticated
... d += f.read(diff).replace(b'\x0d\x0a', b'') #\r\n should be bytes b'\x0d\x0a'
... diff = 4 - len(d)
>>>
>>> s = struct.unpack("i", d)
This should give you an idea of how it will work. This approach could mess with your data's byte alignment.
If you really are seeing "\n" in your print of d then try .replace(b"\n", b"")
I am attempting to create a regular expression pattern for strings similar to the below which are stored in a file. The aim is to get any column for any row, the rows need not be on a single line. So for example, consider the following file:
"column1a","column2a","column
3a,", #entity 1
"column\"this is, a test\"4a"
"column1b","colu
mn2b,","column3b", #entity 2
"column\"this is, a test\"4b"
"column1c,","column2c","column3c", #entity 3
"column\"this is, a test\"4c"
Each entity consists of four columns, column 4 for entity 2 would be "column\"this is, a test\"4b", column 2 for entity 3 would be "column2c". Each column begins with a quote and closes with a quote, however you must be careful because some columns have escaped quotes. Thanks in advance!
You could do like this, ie
Read the whole file.
Split the input according to the newline character which was not preceded by a comma.
Iterate over the spitted elements and again do splitting on the comma (and also the following optional newline character) which was preceded and followed by double quotes.
Code:
import re
with open(file) as f:
fil = f.read()
m = re.split(r'(?<!,)\n', fil.strip())
for i in m:
print(re.split('(?<="),\n?(?=")', i))
Output:
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
Here is the check..
$ cat f
"column1a","column2a","column3a,",
"column\"this is, a test\"4a"
"column1b","column2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"
$ python3 f.py
['"column1a"', '"column2a"', '"column3a,"', '"column\\"this is, a test\\"4a"']
['"column1b"', '"column2b,"', '"column3b"', '"column\\"this is, a test\\"4b"']
['"column1c,"', '"column2c"', '"column3c"', '"column\\"this is, a test\\"4c"']
f is the input file name and f.py is the file-name which contains the python script.
Your problem is terribly familiar to what I have to deal thrice every month :) Except I'm not using python to solve it, but I can 'translate' what I usually do:
text = r'''"column1a","column2a","column
3a,",
"column\"this is, a test\"4a"
"column1a2","column2a2","column3a2","column4a2"
"column1b","colu
mn2b,","column3b",
"column\"this is, a test\"4b"
"column1c,","column2c","column3c",
"column\"this is, a test\"4c"'''
import re
# Number of columns one line is supposed to have
columns = 4
# Temporary variable to hold partial lines
buffer = ""
# Our regex to check for each column
check = re.compile(r'"(?:[^"\\]*|\\.)*"')
# Read the file line by line
for line in text.split("\n"):
# If there's no stored partial line, this is a new line
if buffer == "":
# Check if we get 4 columns and print, if not, put the line
# into buffer so we store a partial line for later
if len(check.findall(line)) == columns:
print matches
else:
# use line.strip() if you need to trim whitespaces
buffer = line
else:
# Update the variable (containing a partial line) with the
# next line and recheck if we get 4 columns
# use line.strip() if you need to trim whitespaces
buffer = buffer + line
# If we indeed get 4, our line is complete and print
# We must not forget to empty buffer now that we got a whole line
if len(check.findall(buffer)) == columns:
print matches
buffer = ""
# Optional; always good to have a safety backdoor though
# If there is a problem with the csv itself like a weird unescaped
# quote, you send it somewhere else
elif len(check.findall(buffer)) > columns:
print "Error: cannot parse line:\n" + buffer
buffer = ""
ideone demo
I am trying to write a pit array to a file in python as in this example: python bitarray to and from file
however, I get garbage in my actual test file:
test_1 = ^#^#
test_2 = ^#^#
code:
from bitarray import bitarray
def test_function(myBitArray):
test_bitarray=bitarray(10)
test_bitarray.setall(0)
with open('test_file.inp','w') as output_file:
output_file.write('test_1 = ')
myBitArray.tofile(output_file)
output_file.write('\ntest_2 = ')
test_bitarray.tofile(output_file)
Any help with what's going wrong would be appreciated.
That's not garbage. The tofile function writes binary data to a binary file. A 10-bit-long bitarray with all 0's will be output as two bytes of 0. (The docs explain that when the length is not a multiple of 8, it's padded with 0 bits.) When you read that as text, two 0 bytes will look like ^#^#, because ^# is the way (many) programs represent a 0 byte as text.
If you want a human-readable text-friendly representation, use the to01 method, which returns a human-readable strings. For example:
with open('test_file.inp','w') as output_file:
output_file.write('test_1 = ')
output_file.write(myBitArray.to01())
output_file.write('\ntest_2 = ')
output_file(test_bitarray.to01())
Or maybe you want this instead:
output_file(str(test_bitarray))
… which will give you something like:
bitarray('0000000000')
This question already has answers here:
General Unicode/UTF-8 support for csv files in Python 2.6
(10 answers)
Closed 9 years ago.
I have a problem reading unicode characters from a csv. The csv file originally had elements with unicode tags:
"[u'Aeron\xe1utica']"
"[u'Ni\u0161']"
"[u'K\xfcnste']"
...
from which I had to remove the u'' tags to give a csv with
Aeron\xe1utica
Ni\u0161
K\xfcnste
....
Now I want to read the csv and output it into a file with the characters i.e.
Aeronáutica
Niš
Künste
....
I tried using the UnicodeWriter in the csv docs, but it gives the same output as the second list
Here's what I did to read and write:
c = open('foo.csv','r')
r = csv.reader(c)
for row in reader:
p = p + row
#The elements in p were ['Aeron\\xe1utica', 'Ni\\u0161', 'K\\xfcnste'...]
c = open('bar.csv','w')
c.write(codecs.BOM_UTF8)
writer = UnicodeWriter(c)
for row in p:
writer.writerow([row])
I also tried codecs.open('','','UTF-8') for both reading and writing, but it didn't help
It appears you have written Python lists directly to your CSV file, resulting in the [...] literal syntax instead of normal columns. You then removed most of the information that could have been used to turn the information back to Python lists with unicode strings again.
What you have left are Python unicode literals, but without the quotes. Use the unicode_escape to decode the values to Unicode again:
with open('foo.csv','r') as b0rken
for line in b0rken:
value = line.rstrip('\r\n').decode('unicode_escape')
print value
or add back the u'..' quoting, using a triple-quoted string in an attempt to avoid needing to escape embedded quotes:
with open('foo.csv','r') as b0rken
for line in b0rken:
value = literal_eval("u'''{}'''".format(line.rstrip('\r\n')))
print value
If you still have the original file (with the [u'...'] formatted lines), use the ast.literal_eval() function to turn those back into Python lists. No point in using the CSV module here:
from ast import literal_eval
with open('foo.csv','r') as b0rken
for line in b0rken:
lis = literal_eval(line)
value = lis[0]
print value
Demo with unicode_escape:
>>> for line in b0rken:
... print line.rstrip('\r\n').decode('unicode_escape')
...
Aeronáutica
Niš
Künste
École de l'Air