How to open and edit an incorrect json file? - python

Maybe this question is a bit dumb, but I actually don't know how to fix it. I have a JSON file that is in the wrong format: it has a b' before the first { and it also uses single quotes instead of double quotes, which is not an accepted format for JSON.
I know that I have to replace the single quotes with double quotes. I would use something like this:
json = file.decode('utf8').replace("'", '"')
But the problem is how can I replace the quotes in the file if I can't open it?
import json
f = open("data/qa_Automotive.json",'r')
file = json.load(f)
Opening the file gives me an error because it has single quotes and not double quotes:
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
How am I supposed to change the quotes in the json file if I can't open the file because it has the wrong format? (This is the json file btw: https://jmcauley.ucsd.edu/data/amazon/qa/qa_Appliances.json.gz)

The file isn't JSON (the filename is misleading); instead, it's composed of valid Python literals, one per line. There's no reason to try to transform it into JSON; just tell Python to parse it as-is.
#!/usr/bin/env python3
import ast, json
results = [ ast.literal_eval(line) for line in open('qa_Appliances.json') ]
print(json.dumps(results))
...properly gives you a list named results with all your lines in it, and (for demonstration purposes) dumps it to stdout as JSON.

There are multiple issues with that file:
it is ndjson (newline delimited; each line is a single object)
it mixes quote styles
The first issue is easily solved: you can just load each line individually. The second issue is trickier.
I tried (as you did) to simply replace ' with ". But that is not enough: there are apostrophes inside the text (like "you're"), and if you replace all the ' characters, those get transformed into " as well and break your strings.
You'll be left with something like "message": "you"re", which is invalid.
Since there are a lot of issues with that file, I suggest using something a little dirty: the Python eval function.
eval simply evaluates a string as if it were Python code.
>>> eval("4 + 2")
6
The JSON format and Python's native dict type are really close (minus some differences: Python uses True/False where JSON uses true/false, Python uses None where JSON uses null, and maybe others I forgot), and the structure with the curly and square brackets is the same. Here, eval helps you because Python supports both single and double quotes.
Using a line from your file:
>>> import json
>>>
>>> data = "{'questionType': 'yes/no', 'asin': 'B0002Z1GG0', 'answerTime': 'Dec 10, 2013', 'unixTime': 1386662400, 'question': 'would this work on my hotpoint model ctx14ayxkrwh serial hr749157 refrigerator do the drawers slide into slots ?', 'answerType': '?', 'answer': 'the drawers do fit into the slots.'}"
>>> parsed = eval(data)
>>> print(parsed)
{'questionType': 'yes/no', 'asin': 'B0002Z1GG0', 'answerTime': 'Dec 10, 2013', 'unixTime': 1386662400, 'question': 'would this work on my hotpoint model ctx14ayxkrwh serial hr749157 refrigerator do the drawers slide into slots ?', 'answerType': '?', 'answer': 'the drawers do fit into the slots.'}
>>> type(parsed)
<class 'dict'>
>>> print(json.dumps(parsed, indent=2))
{
  "questionType": "yes/no",
  "asin": "B0002Z1GG0",
  "answerTime": "Dec 10, 2013",
  "unixTime": 1386662400,
  "question": "would this work on my hotpoint model ctx14ayxkrwh serial hr749157 refrigerator do the drawers slide into slots ?",
  "answerType": "?",
  "answer": "the drawers do fit into the slots."
}
I can do that on the whole file:
>>> data = open("<path to file>").readlines()
>>> parsed = [ eval(line) for line in data ]
>>>
>>> len(parsed)
9011
>>> parsed[0]
{'questionType': 'yes/no', 'asin': 'B00004U9JP', 'answerTime': 'Jun 27, 2014', 'unixTime': 1403852400, 'question': 'I have a 9 year old Badger 1 that needs replacing, will this Badger 1 install just like the original one?', 'answerType': '?', 'answer': 'I replaced my old one with this without a hitch.'}
>>> parsed[0]['questionType']
'yes/no'
BEWARE: you should never use eval on unsanitized data, as it can be used to breach your system. But if you use it in a controlled environment, you do you.
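If you want eval's convenience without its risk, ast.literal_eval (the approach used in the first answer) is a drop-in replacement here: it parses only Python literals (dicts, lists, strings, numbers, booleans, None) and refuses anything executable. A minimal sketch, assuming the same file:
import ast

with open("<path to file>") as fd:
    # a malicious or malformed line raises ValueError instead of executing
    parsed = [ast.literal_eval(line) for line in fd]

print(len(parsed))
print(parsed[0]['questionType'])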

Related

How can I use .replace() on a .txt file with accented characters?

So I have code that takes a .txt file and reads it into a variable as a string.
Then I try to use .replace() on it to change the character "ó" to "o", but it is not working! The console prints the same thing.
Code:
def normalize(filename):
    # Ignores errors because I get the .txt from my WhatsApp conversations and emojis raise an error.
    # File says: "Es una rubrica de evaluación." (among many emojis)
    txt_raw = open(filename, "r", errors="ignore")
    txt_read = txt_raw.read()
    # Here, only the "o" is replaced. In the real code, I use a for loop to iterate through all chrs.
    rem_accent_txt = txt_read.replace("ó", "o")
    print(rem_accent_txt)
    return
Expected output:
"Es una rubrica de evaluacion."
Current Output:
"Es una rubrica de evaluación."
It does not print an error or anything; it just prints it as it is.
I believe the problem lies in the fact that the string comes from a file, because when I just create a string and use the code, it does work; it does not work when I get the string from a file.
EDIT: SOLUTION!
Thanks to @juanpa.arrivillaga and @das-g I came up with this solution:
from unidecode import unidecode
def get_txt(filename):
    txt_raw = open(filename, "r", encoding="utf8")
    txt_read = txt_raw.read()
    txt_decode = unidecode(txt_read)
    print(txt_decode)
    return txt_decode
Almost certainly, what is occurring is that you have unnormalized Unicode strings. Essentially, there are two ways to create "ó" in Unicode:
>>> combining = 'ó'
>>> composed = 'ó'
>>> len(combining), len(composed)
(2, 1)
>>> list(combining)
['o', '́']
>>> list(composed)
['ó']
>>> import unicodedata
>>> list(map(unicodedata.name, combining))
['LATIN SMALL LETTER O', 'COMBINING ACUTE ACCENT']
>>> list(map(unicodedata.name, composed))
['LATIN SMALL LETTER O WITH ACUTE']
Just normalize your strings:
>>> composed == combining
False
>>> composed == unicodedata.normalize("NFC", combining)
True
Although, taking a step back, do you really want to remove accents? Or do you just want to normalize to composed, like the above?
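If you really do want accents removed, a minimal stdlib-only sketch (my addition; the solution in the question's edit uses the third-party unidecode package instead) is to decompose to NFD and drop the combining marks:
import unicodedata

def strip_accents(s):
    # NFD splits "ó" into "o" + a combining acute accent;
    # unicodedata.combining() is nonzero exactly for such marks.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Es una rubrica de evaluación."))  # Es una rubrica de evaluacion.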
As an aside, you shouldn't ignore errors when reading your text file; you should use the correct encoding. I suspect you are writing the text file with an incorrect encoding, because emojis can be handled just fine; they aren't anything special in Unicode.
>>> emoji = "😀"
>>> print(emoji)
😀
>>>
>>> unicodedata.name(emoji)
'GRINNING FACE'

How to extract multiple time from same string in Python?

I'm trying to extract times from single strings, where a string contains text other than just the time. An example is s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'.
I've tried using the datefinder module like this:
from datetime import datetime as dt
import datefinder as dfn
for m in dfn.find_dates(s):
    print(dt.strftime(m, "%H:%M:%S"))
Which gives me this :
17:58:00
In this case the time "06:00" is missed. Now if I try without datefinder, with only the datetime module, like this:
dt.strftime(s, "%H:%M")
It tells me that the input must already be a datetime object, not a string, with the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: descriptor 'strftime' requires a 'datetime.date' object but received a 'str'
So I tried to use the dateutil module to parse the string s into a datetime object with this:
from dateutil.parser import parse
parse(s)
but now it says that my string is not in a proper format (which in most cases will not be any fixed format), showing me this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1358, in parse
return DEFAULTPARSER.parse(timestr, **kwargs)
File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 649, in parse
raise ValueError("Unknown string format:", timestr)
ValueError: ('Unknown string format:', '12/Jul/2019 12/Aug/2019 MEISHAN BRIDGE 06:00 17:58')
I have thought of getting the time with regex like
import re
p = r"\d{2}\:\d{2}"
times = [i.group() for i in re.finditer(p, s)]
# Gives me ['06:00', '17:58']
But doing it this way means I have to check again whether the regex-matched chunks are actually times, because even "99:99" would be matched by the regex and wrongly reported as a time. Is there any workaround without regex to get all the times from a single string?
Please note that the string might or might not contain a date, but it will always contain a time. Even if it contains a date, the date format could be anything, and the string might or might not contain other irrelevant text.
I don't see many options here, so I would go with a heuristic. I would run the following against the whole dataset and extend the config/regexes until it covers all/most of the cases:
import re
from datetime import datetime as dt

s = 'Dates : 12/Jul/2019 12/08/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58:59'

SUPPORTED_DATE_FMTS = {
    re.compile(r"(\d{2}/\w{3}/\d{4})"): "%d/%b/%Y",
    re.compile(r"(\d{2}/\d{2}/\d{4})"): "%d/%m/%Y",
    re.compile(r"(\d{2}/\w{3}\w+/\d{4})"): "%d/%B/%Y",
    # Capture more here
}

SUPPORTED_TIME_FMTS = {
    re.compile(r"((?:[0-1][0-9]|2[0-4]):[0-5][0-9])[^:]"): "%H:%M",
    re.compile(r"((?:[0-1][0-9]|2[0-4]):[0-5][0-9]:[0-5][0-9])"): "%H:%M:%S",
    # Capture more here
}

def extract_supported_dt(config, s):
    """
    Loop through the given config (keys are regexes, values are date/time formats)
    and attempt to gather all valid data.
    """
    valid_data = []
    for regex, fmt in config.items():
        # Extract what looks like a date/time
        valid_ish_data = regex.findall(s)
        if not valid_ish_data:
            continue
        print("Checking " + str(valid_ish_data))
        # Validate it
        for d in valid_ish_data:
            try:
                valid_data.append(dt.strptime(d, fmt))
            except ValueError:
                pass
    return valid_data

# Handle dates
dates = extract_supported_dt(SUPPORTED_DATE_FMTS, s)
# Handle times
times = extract_supported_dt(SUPPORTED_TIME_FMTS, s)

print("Found dates: ")
for date in dates:
    print("\t" + str(date.date()))
print("Found times: ")
for t in times:
    print("\t" + str(t.time()))
Example output:
Checking ['12/Jul/2019']
Checking ['12/08/2019']
Checking ['06:00']
Checking ['17:58:59']
Found dates:
    2019-07-12
    2019-08-12
Found times:
    06:00:00
    17:58:59
This is a trial-and-error approach, but I do not think there is an alternative in your case. Thus my goal here is to make it as easy as possible to extend support with more date/time formats, as opposed to trying to find a solution that covers 100% of the data on day one. This way, the more data you run against it, the more complete your config will be.
One thing to note is that you will have to detect strings that appear to have no dates and log them somewhere. Later you will need to manually review them and see if something that was missed could be captured.
Now, assuming that your data is being generated by another system, sooner or later you will be able to match 100% of it. If the data input comes from humans, then you will probably never manage to get 100%! (People tend to make spelling mistakes and sometimes input random stuff... date=today :) )
If you need only the time, this regex should work fine:
r"[0-2][0-9]\:[0-5][0-9]"
If there could be spaces in the time, like 23 : 59, use this:
r"[0-2][0-9]\s*\:\s*[0-5][0-9]"
Use a regex, but something like this:
(?=[0-1])[0-1][0-9]\:[0-5][0-9]|(?=2)[2][0-3]\:[0-5][0-9]
This matches:
00:00, 00:59, 01:00, 01:59, 02:00, 02:59,
09:00, 10:00, 11:59, 20:00, 21:59, 23:59
but does not match:
99:99, 23:99, 01:99
You could use dictionaries:
my_dict = {}
for i in s.split(', '):
    m = i.strip().split(' : ', 1)
    my_dict[m[0]] = m[1].split()
my_dict
Out:
{'Dates': ['12/Jul/2019', '12/Aug/2019'],
'Loc': ['MEISHAN', 'BRIDGE'],
'Time': ['06:00', '17:58']}

How to reconstruct and change structure of a dataset using python?

I have a dataset and I need to reconstruct some of its data into a new style.
My dataset is something like below (stored in a file named train1.txt):
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
I need to convert it to the style below (and store it in a new file named train.txt):
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
And other numbers ….
My python version is 2.7.13
My operating system is Ubuntu 14.04 LTS
I will appreciate any help.
Thank you so much.
I would suggest using regex (regular expressions). This might be a little overkill, but in the long run, knowing regex is super powerful.
import re

def return_no_commas(string):
    regex = r'\d+'
    matches = re.findall(regex, string)
    for match in matches:
        print(match)
numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""
return_no_commas(numbers)
Let me explain what everything does.
import re
just imports regular expressions. The regular expression I wrote is
regex = r'\d+'
the "r" at the beginning says it's a raw string, and the pattern looks for digits (the "\d" part) repeated one or more times (the "+" part; with "*" you would also get empty matches between the numbers). Then we print out all the matches.
I saved your numbers in a string called numbers, but you could just as easily read the contents in from a file and work with those (see the sketch after the example output below).
You'll get something like:
2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
2411228
2416802
2322710
2387437
2397274
2344681
2396522
2386676
2413824
2328225
2413833
2335374
2328594
497966
2384001
2372746
2386538
2348518
2380037
2374364
2352054
2377990
2367915
2412520
2348070
2356469
2353541
2413446
2391930
2366968
2364762
2347618
2396550
2370538
2393212
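For completeness, a minimal sketch (my addition, using the file names from the question) that reads train1.txt and writes the numbers to train.txt with the same regex idea:
import re

with open('train1.txt') as src, open('train.txt', 'w') as dst:
    # one number per line in the output file
    for number in re.findall(r'\d+', src.read()):
        dst.write(number + '\n')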
It sounds to me like your original data is separated by commas, but you want the data separated by newline characters (\n) instead. This is very easy to do.
def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read from
    wfilename -- name of file to write to
    """
    assert rfilename != wfilename
    # open two files, one in read-mode, the other in write-mode
    rfile = open(rfilename, "r")
    wfile = open(wfilename, "w")
    # read the file into a string
    rstryng = rfile.read()
    lyst = rstryng.split(",")
    # EXAMPLE:
    # rstryng == "1,2,3,4"
    # lyst == ["1", "2", "3", "4"]
    # remove leading and trailing whitespace
    lyst = [s.strip() for s in lyst]
    wstryng = "\n".join(lyst)
    wfile.writelines(wstryng)
    rfile.close()
    wfile.close()
    return

convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`
# open and check the contents of `train.txt`
Since others have added answers, I will include one using numpy.
If you are ok using numpy, it is as simple as:
import numpy as np

data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')
If you want a list instead of a numpy array:
data.tolist()
[2342728,
2414939,
2397722,
2386848,
2398737,
2367906,
2384003,
2399896,
....
]
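To also write the result back out as train.txt (one number per line, as the question asks), numpy's savetxt does it in one line:
np.savetxt('train.txt', data, fmt='%d')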

Get rid of unicode decimal characters

I have a huge file which looks like this :
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h&#7893mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;h&#432&#417u cao cổ152;298;0
6854;huy&#7873n đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
As you can see, the file contains some decimal character references; I would like to replace all of them with their proper characters before using the file. Even opening it with the utf-8 encoding, the errors are not suppressed.
Do you know a way to do it? I want to create a dictionary and retrieve the numbers at index 2.
for: 6883;jumarre;83;295;0 => I get 83
for: 6887;khướu;62;325;0 => I get &#7899, which is wrong; I should get 62
import codecs

with codecs.open('JeuxdeMotsPolarise_test.txt', 'r', 'utf-8', errors='ignore') as text_file:
    text_file = text_file.read()
    #print(text_file)

dico_lexique = {i.split(";")[1]: i.split(";")[2:] for i in text_file.split("\n") if i}
This is the result of trying @serge's proposition, but it leaves blank spaces between lines.
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hi âu;81;294;0
6819;hi cu;64;338;0
6820;hi yn;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hu cao c;152;298;0
6854;huyn ;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kn kn;73;303;0
6886;khoang;64;323;0
6887;khu;62;325;0
Edit: I redownloaded the original file and the error of the missing ";" was corrected.
for example:
=> 6850;hổ mang;54;298;0 (that is how is appeared in the now update file)
Thank you everybody
@PanagiotisKanavos has correctly guessed that html.unescape is able to replace the XML character references with their Unicode characters. The hard part is that some references are correctly ended with their terminating semicolon (;) while others are not. In the latter case, if an entity is followed by a semicolon field separator, the separator will be eaten by the conversion, shifting the following fields.
So the only reliable way is to:
process the file line by line as a CSV file with ; as the delimiter
if needed, concatenate the middle fields, from the second up to the fourth from the end
unescape that middle field
If you want to convert the file, you could do:
import csv
import html

with open('file.csv') as fd, open('fixed.csv', 'w', newline='') as fdout:
    rd = csv.reader(fd, delimiter=';')
    wr = csv.writer(fdout, delimiter=';')
    for row in rd:
        if len(row) > 5:
            # too many fields: an unterminated entity ate a separator,
            # so glue the middle fields back together
            row[1] = ';'.join(row[1:len(row)-3])
            del row[2:len(row)-3]
        row[1] = html.unescape(row[1])
        wr.writerow(row)
If you only want to build a mapping from field 0 to field 2:
values = {}
with open('file.csv') as fd:
    rd = csv.reader(fd, delimiter=';')
    for row in rd:
        values[row[0]] = row[-3]
This text isn't UTF8 or Unicode in general. It's HTML-encoded text, most likely Vietnamese. Those escape sequences correspond to Vietnamese characters; for example, &#432 is ư. In fact, I just typed the escape sequence in the SO edit box and the correct character appeared: &#7899; is ớ.
Copying the entire text outside a code block produces
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h&#7893mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;h&#432&#417u cao cổ152;298;0
6854;huy&#7873n đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
Googling for Họ Khướu returns this Wikipedia page about Họ Khướu.
I think it's safe to assume this is HTML-encoded Vietnamese text. To convert it to Unicode you can use html.unescape :
import html
line='6887;khướu;62;325;0'
properLine=html.unescape(line)
UPDATE
The text posted above is just the original text with an extra newline per line. It's SO's markdown renderer that converts the escape sequences to the corresponding glyphs.
The funny thing is that this line :
6853;h&#432&#417u cao cổ152;298;0
can't be rendered because the HTML entities aren't properly terminated. html.unescape, on the other hand, will convert the characters. Clearly, html.unescape is far more forgiving than SO's markdown renderer.
Either of these lines :
html.unescape('6853;hươu cao cổ152;298;0')
html.unescape('6853;h&#432&#417u cao cổ152;298;0')
Returns :
6853;h\u01b0\u01a1u cao c\u1ed5152;298;0
Repair the file first before you load it into a CSV parser.
Assuming Maarten in the comments is right, change the encoding:
iconv -f cp1252 -t utf-8 < JeuxdeMotsPolarise_test.txt > JeuxdeMotsPolarise_test.utf8.txt
Then replace the escapes with proper characters.
perl -C -i -lpe'
s/&#([0-9]+);?/chr $1/eg; # replace entities
s/;?(\d+;\d+;\d+)$/;$1/; # put back semicolon
# if it was consumed accidentally
' JeuxdeMotsPolarise_test.utf8.txt
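If you'd rather stay in Python, here is a sketch of the same two substitutions (my translation of the Perl one-liner, with the same assumptions about the line layout):
import re

def fix_line(line):
    # replace decimal character references, terminated or not, by their character
    line = re.sub(r'&#([0-9]+);?', lambda m: chr(int(m.group(1))), line)
    # put back the field separator if an unterminated entity consumed it
    line = re.sub(r';?(\d+;\d+;\d+)$', r';\1', line)
    return line

print(fix_line('6853;h&#432&#417u cao cổ152;298;0'))
# 6853;hươu cao cổ;152;298;0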
Contents of JeuxdeMotsPolarise_test.utf8.txt after running the substitutions:
6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ;152;298;0
6854;huyền đề;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0

Zeroes appearing when reading file (where there aren't any)

When reading a file (UTF-8 Unicode text, csv) with Python on Linux, either with:
csv.reader()
file()
values of some columns get a zero as their first character (there are no zeroes in the input), others get a few zeroes, which are not visible when viewing the file with Geany or any other editor. For example:
Input
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
Output
10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561;tom#hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL
See 378983561 > 0378983561
Reading with:
f = file('/home/foo/data.csv', 'r')
data = f.read()
split_data = data.splitlines()
lines = list(line.split(';') for line in split_data)
print lines[51220][8]
>>> '0478983561' # should have been '478983561' (reads like this in Geany etc.)
Same result with csv.reader().
Help me solve the mystery: what could be the cause of this? Could it be related to encoding/decoding?
The data you're getting is a string.
print lines[51220][8]
>>> '0478983561'
If you want to use this as an integer, you should parse it.
print int(lines[51220][8])
>>> 478983561
If you want this as a string, you should convert it back to a string.
print repr(int(lines[51220][8]))
>>> '478983561'
csv.reader treats all columns as strings. Conversion to the appropriate type is up to you, as in:
print int(lines[51220][8])
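Since csv.reader is only mentioned and never shown, here is a minimal sketch of that route (Python 2, matching the question; the row and column indices are the question's own):
import csv

with open('/home/foo/data.csv', 'rb') as f:
    rows = list(csv.reader(f, delimiter=';'))

print rows[51220][8]       # a string, exactly as stored in the file
print int(rows[51220][8])  # parsed as an integer; any leading zero is dropped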
