How to open and edit an incorrect json file? - python
Maybe this question is a bit dumb, but I actually don't know how to fix it. I have a JSON file that is in the wrong format: it has a b' before the first { and it also uses single quotes instead of double quotes, which is not an accepted format for JSON.
I know that I have to replace the single quotes with double quotes. I would use something like this:
json = file.decode('utf8').replace("'", '"')
But the problem is how can I replace the quotes in the file if I can't open it?
import json
f = open("data/qa_Automotive.json",'r')
file = json.load(f)
Opening the file gives me an error because it has single quotes and not double quotes:
JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 2 (char 1)
How am I supposed to change the quotes in the json file if I can't open the file because it has the wrong format? (This is the json file btw: https://jmcauley.ucsd.edu/data/amazon/qa/qa_Appliances.json.gz)
The file isn't JSON (the filename is incorrect); instead, it's composed of valid Python literals. There's no reason to try to transform it to JSON. Don't do that; instead, just tell Python to parse it as-is.
#!/usr/bin/env python3
import ast, json
results = [ ast.literal_eval(line) for line in open('qa_Appliances.json') ]
print(json.dumps(results))
...properly gives you a list named results with all your lines in it, and (for demonstration purposes) dumps it to stdout as JSON.
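If you want the repaired data as an actual JSON file rather than on stdout, one more line does it; a minimal sketch (the output filename is just an example):

import ast, json

results = [ast.literal_eval(line) for line in open('qa_Appliances.json')]

# Write the parsed objects back out as real JSON, one list of objects.
with open('qa_Appliances.fixed.json', 'w') as f:
    json.dump(results, f)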
There are multiple issues with that file:
it is ndjson (newline-delimited: each line is a single object)
it mixes quote styles
The first issue is easily solved: you can just load each line individually. The second issue is trickier.
I tried (as you did) to simply replace ' with ". But that is not enough: there are single quotes inside the text itself (like "you're"), and if you replace all of the ' characters, those get transformed into " as well and break the string.
You'll be left with something like "message": "you"re", which is invalid.
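A quick demonstration of that failure mode in the REPL:

>>> s = "{'answer': 'you're great'}"
>>> s.replace("'", '"')
'{"answer": "you"re great"}'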
Since there are a lot of issues with that file, I suggest using something a little dirty: the Python eval function.
eval simply evaluates a string as if it were Python code:
>>> eval("4 + 2")
6
The JSON format and Python's native dict type are really close (minus some differences: Python uses True/False where JSON uses true/false, Python uses None where JSON uses null, and maybe others that I forgot), and the curly-bracket and square-bracket structures are the same. Here, eval helps you because Python supports both single and double quotes.
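For example, json.dumps shows both spellings side by side:

>>> import json
>>> json.dumps({'in_stock': True, 'rating': None})
'{"in_stock": true, "rating": null}'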
Using a line from your file:
>>> import json
>>>
>>> data = "{'questionType': 'yes/no', 'asin': 'B0002Z1GG0', 'answerTime': 'Dec 10, 2013', 'unixTime': 1386662400, 'question': 'would this work on my hotpoint model ctx14ayxkrwh serial hr749157 refrigerator do the drawers slide into slots ?', 'answerType': '?', 'answer': 'the drawers do fit into the slots.'}"
>>> parsed = eval(data)
>>> print(parsed)
{'questionType': 'yes/no', 'asin': 'B0002Z1GG0', 'answerTime': 'Dec 10, 2013', 'unixTime': 1386662400, 'question': 'would this work on my hotpoint model ctx14ayxkrwh serial hr749157 refrigerator do the drawers slide into slots ?', 'answerType': '?', 'answer': 'the drawers do fit into the slots.'}
>>> type(parsed)
<class 'dict'>
>>> print(json.dumps(parsed, indent=2))
{
"questionType": "yes/no",
"asin": "B0002Z1GG0",
"answerTime": "Dec 10, 2013",
"unixTime": 1386662400,
"question": "would this work on my hotpoint model ctx14ayxkrwh serial hr749157 refrigerator do the drawers slide into slots ?",
"answerType": "?",
"answer": "the drawers do fit into the slots."
}
I can do that on the whole file:
>>> data = open("<path to file>").readlines()
>>> parsed = [ eval(line) for line in data ]
>>>
>>> len(parsed)
9011
>>> parsed[0]
{'questionType': 'yes/no', 'asin': 'B00004U9JP', 'answerTime': 'Jun 27, 2014', 'unixTime': 1403852400, 'question': 'I have a 9 year old Badger 1 that needs replacing, will this Badger 1 install just like the original one?', 'answerType': '?', 'answer': 'I replaced my old one with this without a hitch.'}
>>> parsed[0]['questionType']
'yes/no'
BEWARE: you should never use eval on unsanitized data, as it can be used to breach your system. But if you use it in a controlled environment, you do you.
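If you want the same convenience without that risk, ast.literal_eval (used in the other answer above) is effectively a safe eval for this file; a sketch, keeping the placeholder path from above:

import ast

parsed = [ast.literal_eval(line) for line in open("<path to file>")]

# Unlike eval, literal_eval only accepts Python literals, so a malicious
# line like "__import__('os').system('...')" raises ValueError instead of running.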
Related
How can I use .replace() on a .txt file with accented characters?
So I have a code that takes a .txt file and adds it to a variable as a string. Then, I try to use .replace() on it to change the character "ó" to "o", but it is not working! The console prints the same thing. Code:

def normalize(filename):
    # Ignores errors because I get the .txt from my WhatsApp conversations and emojis raise an error.
    # File says: "Es una rubrica de evaluación." (among many emojis)
    txt_raw = open(filename, "r", errors="ignore")
    txt_read = txt_raw.read()
    # Here, only the "o" is replaced. In the real code, I use a for loop to iterate through all chrs.
    rem_accent_txt = txt_read.replace("ó", "o")
    print(rem_accent_txt)
    return

Expected output: "Es una rubrica de evaluacion."
Current output: "Es una rubrica de evaluación."
It does not print an error or anything, it just prints it as it is. I believe the problem lies in the fact that the string comes from a file, because when I just create a string and use the code, it does work, but it does not work when I get the string from a file.
EDIT: SOLUTION! Thanks to @juanpa.arrivillaga and @das-g I came up with this solution:

from unidecode import unidecode

def get_txt(filename):
    txt_raw = open(filename, "r", encoding="utf8")
    txt_read = txt_raw.read()
    txt_decode = unidecode(txt_read)
    print(txt_decode)
    return txt_decode
Almost certainly, what is occurring is that you have unnormalized Unicode strings. Essentially, there are two ways to create "ó" in Unicode:

>>> combining = 'ó'
>>> composed = 'ó'
>>> len(combining), len(composed)
(2, 1)
>>> list(combining)
['o', '́']
>>> list(composed)
['ó']
>>> import unicodedata
>>> list(map(unicodedata.name, combining))
['LATIN SMALL LETTER O', 'COMBINING ACUTE ACCENT']
>>> list(map(unicodedata.name, composed))
['LATIN SMALL LETTER O WITH ACUTE']

Just normalize your strings:

>>> composed == combining
False
>>> composed == unicodedata.normalize("NFC", combining)
True

Although, taking a step back, do you really want to remove accents? Or do you just want to normalize to composed form, like the above?
As an aside, you shouldn't ignore the errors when reading your text file. You should use the correct encoding. I suspect what is happening is that you are writing your text file with an incorrect encoding, because you should be able to handle emojis just fine; they aren't anything special in Unicode.

>>> emoji = "😀"
>>> print(emoji)
😀
>>> unicodedata.name(emoji)
'GRINNING FACE'
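If the goal really is to remove the accents (as in the question's expected output), a minimal sketch of the standard-library route: decompose with NFD, then drop the combining marks. (unidecode, as in the question's edit, is the heavier-duty alternative.)

import unicodedata

def strip_accents(text):
    # NFD splits "ó" into "o" plus a combining acute accent;
    # keep only the non-combining characters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Es una rúbrica de evaluación."))
# Es una rubrica de evaluacion.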
How to extract multiple times from the same string in Python?
I'm trying to extract times from single strings, where one string contains text other than just the time. An example is:

s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'

I've tried using the datefinder module like this:

from datetime import datetime as dt
import datefinder as dfn

for m in dfn.find_dates(s):
    print(dt.strftime(m, "%H:%M:%S"))

which gives me this:

17:58:00

In this case the time "06:00" is missed out. Now if I try without datefinder, with only the datetime module, like this:

dt.strftime(s, "%H:%M")

it notifies me that the input must already be a datetime object, not a string, with the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor 'strftime' requires a 'datetime.date' object but received a 'str'

So I tried to use the dateutil module to parse the string s into a datetime object with this:

from dateutil.parser import parse

parse(s)

but now it says that my string is not in a proper format (which in most cases will not be any fixed format), showing me this error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 1358, in parse
    return DEFAULTPARSER.parse(timestr, **kwargs)
  File "/home/michael/anaconda3/envs/sec_img/lib/python3.7/site-packages/dateutil/parser/_parser.py", line 649, in parse
    raise ValueError("Unknown string format:", timestr)
ValueError: ('Unknown string format:', '12/Jul/2019 12/Aug/2019 MEISHAN BRIDGE 06:00 17:58')

I have thought of getting the time with a regex like:

import re

p = r"\d{2}\:\d{2}"
times = [i.group() for i in re.finditer(p, s)]
# Gives me ['06:00', '17:58']

But doing it this way, I would need to check again whether the regex-matched chunks are actually times, because even "99:99" would be matched and wrongly reported as a time. Is there any workaround without regex to get all the times from a single string? Please note that the string might or might not contain a date, but it will always contain a time. Even if it contains a date, the date format might be anything on earth, and the string might or might not contain other irrelevant text.
I don't see many options here, so I would go with a heuristic. I would run the following against the whole dataset and extend the config/regexes until it covers all/most of the cases:

import re
from datetime import datetime as dt

s = 'Dates : 12/Jul/2019 12/08/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58:59'

SUPPORTED_DATE_FMTS = {
    re.compile(r"(\d{2}/\w{3}/\d{4})"): "%d/%b/%Y",
    re.compile(r"(\d{2}/\d{2}/\d{4})"): "%d/%m/%Y",
    re.compile(r"(\d{2}/\w{3}\w+/\d{4})"): "%d/%B/%Y",
    # Capture more here
}

SUPPORTED_TIME_FMTS = {
    re.compile(r"((?:[0-1][0-9]|2[0-4]):[0-5][0-9])[^:]"): "%H:%M",
    re.compile(r"((?:[0-1][0-9]|2[0-4]):[0-5][0-9]:[0-5][0-9])"): "%H:%M:%S",
    # Capture more here
}

def extract_supported_dt(config, s):
    """
    Loop through the given config (keys are regexes, values are date/time
    formats) and attempt to gather all valid data.
    """
    valid_data = []
    for regex, fmt in config.items():
        # Extract what looks like a date/time
        valid_ish_data = regex.findall(s)
        if not valid_ish_data:
            continue
        print("Checking " + str(valid_ish_data))
        # Validate it
        for d in valid_ish_data:
            try:
                valid_data.append(dt.strptime(d, fmt))
            except ValueError:
                pass
    return valid_data

# Handle dates
dates = extract_supported_dt(SUPPORTED_DATE_FMTS, s)
# Handle times
times = extract_supported_dt(SUPPORTED_TIME_FMTS, s)

print("Found dates: ")
for date in dates:
    print("\t" + str(date.date()))
print("Found times: ")
for t in times:
    print("\t" + str(t.time()))

Example output:

Checking ['12/Jul/2019']
Checking ['12/08/2019']
Checking ['06:00']
Checking ['17:58:59']
Found dates:
    2019-07-12
    2019-08-12
Found times:
    06:00:00
    17:58:59

This is a trial-and-error approach, but I do not think there is an alternative in your case. Thus my goal here is to make it as easy as possible to extend support with more date/time formats, as opposed to trying to find a solution that covers 100% of the data on day 1. This way, the more data you run against it, the more complete your config will be. One thing to note is that you will have to detect strings that appear to have no dates and log them somewhere. Later you will need to manually revise them and see if something that was missed could be captured. Now, assuming that your data are being generated by another system, sooner or later you will be able to match 100% of it. If the data input is from humans, then you will probably never manage to get 100%! (People tend to make spelling mistakes and sometimes import random stuff... date=today :) )
If you need only the time, this regex should work fine:

r"[0-2][0-9]\:[0-5][0-9]"

If there could be spaces in the time, like 23 : 59, use this:

r"[0-2][0-9]\s*\:\s*[0-5][0-9]"
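For example, a quick sketch of the second pattern applied to the s string from the question:

import re

s = 'Dates : 12/Jul/2019 12/Aug/2019, Loc : MEISHAN BRIDGE, Time : 06:00 17:58'
print(re.findall(r"[0-2][0-9]\s*\:\s*[0-5][0-9]", s))
# ['06:00', '17:58']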
Use a regex, but something like this:

(?=[0-1])[0-1][0-9]\:[0-5][0-9]|(?=2)[2][0-3]\:[0-5][0-9]

This matches: 00:00, 00:59, 01:00, 01:59, 02:00, 02:59, 09:00, 10:00, 11:59, 20:00, 21:59, 23:59.
It does not match: 99:99, 23:99, 01:99.
Try it on Repl.it to check whether it works for you.
You could use dictionaries:

my_dict = {}
for i in s.split(', '):
    m = i.strip().split(' : ', 1)
    my_dict[m[0]] = m[1].split()

my_dict
Out:
{'Dates': ['12/Jul/2019', '12/Aug/2019'],
 'Loc': ['MEISHAN', 'BRIDGE'],
 'Time': ['06:00', '17:58']}
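A possible follow-up (a sketch, assuming the 'Time' key is present as in the example): validate the extracted chunks with strptime, so that something like 99:99 would raise ValueError instead of being accepted:

from datetime import datetime as dt

times = [dt.strptime(t, "%H:%M").time() for t in my_dict['Time']]
print(times)
# [datetime.time(6, 0), datetime.time(17, 58)]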
How to reconstruct and change structure of a dataset using python?
I have a dataset and I need to reconstruct some data from this dataset into a new style.
My dataset is something like below (stored in a file named train1.txt):

2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468

I need to convert it to the style below (and store it in a new file named train.txt):

2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293

And the other numbers ….
My Python version is 2.7.13. My operating system is Ubuntu 14.04 LTS.
I will appreciate any help. Thank you so much.
I would suggest using regex (regular expressions). This might be a little overkill, but in the long run, knowing regex is super powerful.

import re

def return_no_commas(string):
    regex = r'\d+'
    matches = re.findall(regex, string)
    for match in matches:
        print(match)

numbers = """
2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, 2359702, 2414293, 2411228, 2416802, 2322710, 2387437, 2397274, 2344681, 2396522, 2386676, 2413824, 2328225, 2413833, 2335374, 2328594, 497966, 2384001, 2372746, 2386538, 2348518, 2380037, 2374364, 2352054, 2377990, 2367915, 2412520, 2348070, 2356469, 2353541, 2413446, 2391930, 2366968, 2364762, 2347618, 2396550, 2370538, 2393212, 2364244, 2387901, 4752, 2343855, 2331890, 2341328, 2413686, 2359209, 2342027, 2414843, 2378401, 2367772, 2357576, 2416791, 2398673, 2415237, 2383922, 2371110, 2365017, 2406357, 2383444, 2385709, 2392694, 2378109, 2394742, 2318516, 2354062, 2380081, 2395546, 2328407, 2396727, 2316901, 2400923, 2360206, 971, 2350695, 2341332, 2357275, 2369945, 2325241, 2408952, 2322395, 2415137, 2372785, 2382132, 2323580, 2368945, 2413009, 2348581, 2365287, 2408766, 2382349, 2355549, 2406839, 2374616, 2344619, 2362449, 2380907, 2327352, 2347183, 2384375, 2368019, 2365927, 2370027, 2343649, 2415694, 2335035, 2389182, 2354073, 2363977, 2346358, 2373500, 2411328, 2348913, 2372324, 2368727, 2323717, 2409571, 2403981, 2353188, 2343362, 285721, 2376836, 2368107, 2404464, 2417233, 2382750, 2366329, 675, 2360991, 2341475, 2346242, 2391969, 2345287, 2321367, 2416019, 2343732, 2384793, 2347111, 2332212, 138, 2342178, 2405886, 2372686, 2365963, 2342468
"""

return_no_commas(numbers)

Let me explain what everything does. import re just imports regular expressions. The regular expression I wrote is regex = r'\d+': the "r" at the beginning marks it as a raw string, "\d" matches any digit, and "+" lets it repeat one or more times, so each whole run of digits is one match. Then we print out all the matches. I saved your numbers in a string called numbers, but you could just as easily read in a file and work with those contents. You'll get something like:

2342728
2414939
2397722
2386848
2398737
2367906
2384003
2399896
2359702
2414293
2411228
2416802
2322710
2387437
2397274
2344681
2396522
2386676
2413824
2328225
2413833
2335374
2328594
497966
2384001
2372746
2386538
2348518
2380037
2374364
2352054
2377990
2367915
2412520
2348070
2356469
2353541
2413446
2391930
2366968
2364762
2347618
2396550
2370538
2393212
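To finish the question's task and store the result in train.txt, the same pattern can write the matches straight to the new file; a minimal sketch using the filenames from the question:

import re

with open('train1.txt') as fin, open('train.txt', 'w') as fout:
    # One number per line in the output file.
    for number in re.findall(r'\d+', fin.read()):
        fout.write(number + '\n')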
It sounds to me like your original data is separated by commas, but you want the data separated by newline characters (\n) instead. This is very easy to do.

def convert_comma_to_newline(rfilename, wfilename):
    """
    rfilename -- name of file to read from
    wfilename -- name of file to write to
    """
    assert rfilename != wfilename
    # open two files, one in read-mode, the other in write-mode
    rfile = open(rfilename, "r")
    wfile = open(wfilename, "w")
    # read the file into a string
    rstryng = rfile.read()
    lyst = rstryng.split(",")
    # EXAMPLE:
    #   rstryng == "1,2,3,4"
    #   lyst == ["1", "2", "3", "4"]
    # remove leading and trailing whitespace
    lyst = [s.strip() for s in lyst]
    wstryng = "\n".join(lyst)
    wfile.writelines(wstryng)
    rfile.close()
    wfile.close()
    return

convert_comma_to_newline("train1.txt", "train.txt")
# open and check the contents of `train.txt`
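A more compact variant of the same idea, as a sketch: with-blocks (context managers) close both files automatically, even if an error occurs midway.

def convert_comma_to_newline(rfilename, wfilename):
    # Same behaviour as above, but the files are closed automatically.
    assert rfilename != wfilename
    with open(rfilename) as rfile, open(wfilename, "w") as wfile:
        values = [s.strip() for s in rfile.read().split(",")]
        wfile.write("\n".join(values))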
Since others have added answers, I will include one using numpy. If you are OK with using numpy, it is as simple as:

import numpy as np

data = np.genfromtxt('train1.txt', dtype=int, delimiter=',')

If you want a list instead of a numpy array:

data.tolist()
[2342728, 2414939, 2397722, 2386848, 2398737, 2367906, 2384003, 2399896, .... ]
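And if you also want to write the result back out in the one-number-per-line train.txt format the question asks for, numpy can do that too; a one-line sketch:

# A 1-D integer array is written one value per line.
np.savetxt('train.txt', data, fmt='%d')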
Get rid of unicode decimal characters
I have a huge file which looks like this:

6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ152;298;0
6854;huyền đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0

As you can see, the file contains some Unicode decimal escapes, and I would like to replace all of them with their corresponding characters before using the file. Even opening it with the utf-8 encoding does not suppress the errors. Do you know a way to do it?
I want to create a dictionary and retrieve the numbers at index 2:

for 6883;jumarre;83;295;0 => I get 83
for 6887;khướu;62;325;0 => I get ớ => which is false; I should get 62

import codecs

with codecs.open('JeuxdeMotsPolarise_test.txt', 'r', 'utf-8', errors='ignore') as text_file:
    text_file = (text_file.read())
    #print(text_file)

dico_lexique = ({i.split(";")[1]: i.split(";")[2:] for i in text_file.split("\n") if i})

This is the result given when trying @serge's proposition, but it leaves blank spaces between lines:

6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hi âu;81;294;0
6819;hi cu;64;338;0
6820;hi yn;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;h mang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hu cao c;152;298;0
6854;huyn ;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kn kn;73;303;0
6886;khoang;64;323;0
6887;khu;62;325;0

Edit: I re-downloaded the original file, and the error of the missing ";" was corrected. For example:
=> 6850;hổ mang;54;298;0 (that is how it appears in the now-updated file)
Thank you everybody.
@PanagiotisKanavos has correctly guessed that html.unescape is able to replace the XML character references with their Unicode characters. The hard part is that some references are correctly ended with their terminating semicolon (;) while others are not. In the latter case, if an entity is followed by a semicolon separator, the separator will be eaten by the conversion, shifting the following fields.
So the only reliable way is to:

process the file line by line as a CSV file with the ; delimiter
if necessary, concatenate the middle fields, from the second one up to the fourth from the end
unescape that middle field

If you want to convert the file, you could do:

import csv
import html

with open('file.csv') as fd, open('fixed.csv', 'w', newline='') as fdout:
    rd = csv.reader(fd, delimiter=';')
    wr = csv.writer(fdout, delimiter=';')
    for row in rd:
        if len(row) > 5:
            # an entity ate one or more separators: glue the name back together
            row[1] = ';'.join(row[1:len(row)-3])
            del row[2:len(row)-3]
        row[1] = html.unescape(row[1])
        wr.writerow(row)

If you only want to build a mapping from field 0 to field 2:

import csv

values = {}
with open('file.csv') as fd:
    rd = csv.reader(fd, delimiter=';')
    for row in rd:
        values[row[0]] = row[-3]
This text isn't UTF-8 or Unicode in general. It's HTML-encoded text, most likely Vietnamese. Those escape sequences correspond to Vietnamese characters; for example, &#432; is ư — in fact, I just typed the escape sequence in the SO edit box and the correct character appeared. &#7899; is ớ.
Copying the entire text outside a code block produces:

6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ152;298;0
6854;huyền đề62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0

Googling for Họ Khướu returns this Wikipedia page about Họ Khướu. I think it's safe to assume this is HTML-encoded Vietnamese text. To convert it to Unicode you can use html.unescape:

import html

line = '6887;kh&#432;&#7899;u;62;325;0'
properLine = html.unescape(line)

UPDATE
The text posted above is just the original text with an extra newline per line. It's SO's markdown renderer that converts the escape sequences to the corresponding glyphs. The funny thing is that this line:

6853;h&#432&#417u cao c&#7893;152;298;0

can't be rendered, because the HTML entities aren't properly terminated. html.unescape, on the other hand, will convert the characters. Clearly, html.unescape is far more forgiving than SO's markdown renderer. Either of these lines:

html.unescape('6853;h&#432;&#417;u cao c&#7893;152;298;0')
html.unescape('6853;h&#432&#417u cao c&#7893;152;298;0')

returns:

6853;h\u01b0\u01a1u cao c\u1ed5152;298;0
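From there, building the dictionary the question asks for is short. A minimal sketch, assuming the re-downloaded file where each line has all five ;-separated fields (see the answer above for repairing lines where an entity has eaten a separator):

import html

dico_lexique = {}
with open('JeuxdeMotsPolarise_test.txt', encoding='utf-8') as f:
    for line in f:
        fields = html.unescape(line.strip()).split(';')
        if len(fields) == 5:
            # e.g. dico_lexique['jumarre'] == ['83', '295', '0']
            dico_lexique[fields[1]] = fields[2:]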
Repair the file first before you load it into a CSV parser. Assuming Maarten in the comments is right, change the encoding:

iconv -f cp1252 -t utf-8 < JeuxdeMotsPolarise_test.txt > JeuxdeMotsPolarise_test.utf8.txt

Then replace the escapes with proper characters:

perl -C -i -lpe'
    s/&#([0-9]+);?/chr $1/eg;  # replace entities
    s/;?(\d+;\d+;\d+)$/;$1/;   # put back semicolon
                               # if it was consumed accidentally
' JeuxdeMotsPolarise_test.utf8.txt

Contents of JeuxdeMotsPolarise_test.utf8.txt after running the substitutions:

6814;gymnocéphale;185;151;49
6815;gymnodonte;83;330;0
6816;gymnosome;287;105;42
6817;hà mã;69;305;0
6818;hải âu;81;294;0
6819;hải cẩu;64;338;0
6820;hải yến;62;269;0
6848;histiophore;57;262;0
6849;hiverneur;56;248;0
6850;hổmang;54;298;0
6851;holobranche;97;329;0
6852;hoplopode;65;296;0
6853;hươu cao cổ;152;298;0
6854;huyền đề;62;324;0
6855;hyalosome;73;371;0
6883;jumarre;83;295;0
6884;kéc;86;326;0
6885;kền kền;73;303;0
6886;khoang;64;323;0
6887;khướu;62;325;0
Zeroes appearing when reading file (where there aren't any)
When reading a file (UTF-8 Unicode text, CSV) with Python on Linux, either with:

csv.reader()
file()

values of some columns get a zero as their first character (there are no zeroes in the input); others get a few zeroes, which are not seen when viewing the file with Geany or any other editor. For example:

Input:

10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;378983561;tom@hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL

Output:

10016;9167DE1;Tom;Sawyer ;Street 22;2610;Wil;;0378983561;tom@hotmail.com;1979-08-10 00:00:00.000;0;1;Wil;081208608;NULL;2;IZMH726;2010-08-30 15:02:55.777;2013-06-24 08:17:22.763;0;1;1;1;NULL

See 378983561 > 0378983561.

Reading with:

f = file('/home/foo/data.csv', 'r')
data = f.read()
split_data = data.splitlines()
lines = list(line.split(';') for line in split_data)

print data[51220][8]
>>> '0378983561'  # should have been '478983561' (reads like this in Geany etc.)

Same result with csv.reader(). Help me solve the mystery: what could be the cause of this? Could it be related to encoding/decoding?
The data you're getting is a string.

print data[51220][8]
>>> '0478983561'

If you want to use this as an integer, you should parse it.

print int(data[51220][8])
>>> 478983561

If you want this as a string, you should convert it back to a string.

print repr(int(data[51220][8]))
>>> '478983561'
csv.reader treats all columns as strings. Conversion to the appropriate type is up to you, as in:

print int(data[51220][8])
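A sketch of that conversion applied while reading (Python 2, to match the question; column index 8 comes from the example row, where it holds the phone-like number):

import csv

with open('/home/foo/data.csv') as f:
    for row in csv.reader(f, delimiter=';'):
        # row[8] is a string like '378983561'; convert it explicitly.
        if row[8]:
            value = int(row[8])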