I am reading a CSV file with 8 columns into a Pandas DataFrame. The final column contains an error message, and some of these messages contain commas. This causes the read to fail with the error ParserError: Error tokenizing data. C error: Expected 8 fields in line 21922, saw 9
Is there a way to ignore all commas after the 8th field, rather than having to go through the file and remove excess commas?
Code to read file:
import pandas as pd
df = pd.read_csv('C:\\somepath\\output.csv')
Line that works:
061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,some message
Line that fails:
061AE,Active,001,2017_02_24 15_18_01,00006,1,00013,longer message, with commas
You can use the usecols parameter of the read_csv function to limit which columns you read in. For example:
import pandas as pd
pd.read_csv(path, usecols=range(8))
if you only want to read the first 8 columns.
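If the tokenizer still complains on your pandas version, a variant of the same idea (a sketch, assuming the file has no header row and that no row exceeds a hypothetical maximum of 12 fields) is to pad the column names out to the widest possible row and then keep only the first 8:
import pandas as pd

max_fields = 12  # hypothetical upper bound on row width; widen if your rows can be longer
df = pd.read_csv('C:\\somepath\\output.csv',
                 header=None,                    # the sample lines have no header row
                 names=list(range(max_fields)),  # pad names so ragged rows tokenize
                 usecols=range(8))               # then keep only the first 8 columns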
You can use re.sub to replace the first few commas with, say, '|', save the intermediate result in a StringIO, then process that.
import pandas as pd
from io import StringIO
import re
for_pd = StringIO()
with open('MikeS159.csv') as mike:
    for line in mike:
        # replace only the first 7 commas, leaving commas in the final field intact
        new_line = re.sub(r',', '|', line.rstrip(), count=7)
        print(new_line, file=for_pd)
for_pd.seek(0)
df = pd.read_csv(for_pd, sep='|', header=None)
print(df)
I put the two lines from your question into a file to get this output.
0 1 2 3 4 5 6 \
0 061AE Active 1 2017_02_24 15_18_01 6 1 13
1 061AE Active 1 2017_02_24 15_18_01 6 1 13
7
0 some message
1 longer message, with commas
You can try this workaround posted on the Pandas issues page:
import csv
import pandas as pd
df = pd.read_csv('filename.csv', parse_dates=True, dtype=object,
                 delimiter="\t", quoting=csv.QUOTE_NONE, encoding='utf-8')
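Since the file contains no tab characters, the tab delimiter together with quoting=csv.QUOTE_NONE loads each whole line into a single column. A hedged follow-up sketch to then recover the 8 fields, splitting on the first 7 commas only so that commas in the message survive:
df = df.iloc[:, 0].str.split(',', n=7, expand=True)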
You can also preprocess the data, changing each line's first 7 commas (the 0th through 6th, inclusive) to semicolons and leaving the ones after that as commas*, using something like:
to_write = []
counter = 0
with open("sampleCSV.csv", "r") as f:
    for line in f:
        line = list(line)
        while counter < 7:
            line[line.index(",")] = ";"  # replace the first remaining comma
            counter += 1
        counter = 0
        to_write.append("".join(line))
You can now read this to_write list into a Pandas object like
data = pd.DataFrame(to_write)
data = pd.DataFrame(data[0].str.split(";").values.tolist())
or write it back into a csv and read using pandas with a semicolon delimiter such as read_csv(csv_path, sep=';').
I drafted this quickly without rigorous testing, but it should give you some ideas to try. Please comment if it does or doesn't help, and I'll edit it.
*Another option is to delete all commas after the 7th and keep using the comma separator. Either way, the point is to differentiate the first 7 delimiters from the later punctuation; see the sketch below.
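A minimal sketch of that footnote's variant, reusing the hypothetical sampleCSV.csv from above: keep the first 7 commas as delimiters and strip any commas out of the final field:
fixed_lines = []
with open("sampleCSV.csv") as f:
    for line in f:
        parts = line.rstrip("\n").split(",", 7)  # split on the first 7 commas only
        parts[-1] = parts[-1].replace(",", "")   # delete commas inside the last field
        fixed_lines.append(",".join(parts))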
To add to @Tblaz's answer: if you use Google Colab you can use this solution. In my case the extra comma was in column 24, so I only had to read 23 columns:
import pandas as pd
from google.colab import files
import io
uploaded = files.upload()
x_train = pd.read_csv(io.StringIO(uploaded['x_train.csv'].decode('utf-8')),
                      skiprows=1, usecols=range(23), header=None)
I've read all related topics - like this, this and this - but couldn't get a solution to work.
I have an input csv file like this:
ItemId,Content
i0000008,{"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010,{"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I've tried several different approaches but couldn't get it to work. I want to read this csv file into a Dataframe like this:
ItemId Content
-------- -------------------------------------------------------------------------------
i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
With the following code (Python 3.9):
df = pd.read_csv('test.csv', sep=',', skipinitialspace = True, quotechar = '"')
As far as I understand, the commas inside the dictionary column, even those inside quotation marks, are being treated as regular separators, so it raises the following error:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 4 fields in line 3, saw 6
Is it possible to produce the desired result? Thanks.
The problem is that the commas in the Content column are interpreted as separators. You can solve this by using pd.read_fwf and manually setting the character ranges on which to split:
df = pd.read_fwf('test.csv', colspecs=[(0, 8),(9,100)], header=0, names=['ItemId', 'Content'])
Result:
     ItemId                                                                           Content
0  i0000008   {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1  i0000010  {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I don't think you'll be able to read it normally with pandas, because the delimiter appears multiple times within a single value. However, by reading it with plain Python and doing some processing, you can convert it to a pandas DataFrame:
def splitValues(x):
    index = x.find(',')
    return x[:index], x[index+1:].strip()

import pandas as pd

data = open('file.csv')
columns = next(data)
columns = columns.strip().split(',')
df = pd.DataFrame(columns=columns, data=(splitValues(row) for row in data))
OUTPUT:
ItemId Content
0 i0000008 {"Title":"Edison Kinetoscopic Record of a Sneeze","Year":"1894","Rated":"N/A"}
1 i0000010 {"Title":"Employees, Leaving the Lumiére, Factory","Year":"1895","Rated":"N/A"}
I have a text file in which many lines contain several '*' characters, and I want to count the '*' on each line. I tried read(), readline(), readlines() and splitlines() (on f.read() / p.splitlines()), but none of them are working: most of the time the count is zero or seven (the total number of '*').
Please tell me where I am making a mistake.
LLL*LLL*LL
AA*AAAA
NN**NNN
My current code:
import re
import pandas as pd
from io import StringIO
with open('Test.txt', 'r') as f:
    p = f.read()
print(p)
df12 = []
for l in p.splitlines():
    x = p.count('*')
    df12.append(x)
print(pd.DataFrame(df12))
pandas is probably overkill here, but if you want, you can read each line in by specifying '\n' as the separator and then Series.str.count the character (it needs escaping as r'\*', since '*' is a regex special character). squeeze=True forces the result to be a Series, since we know there is only a single field.
s = pd.read_csv('Test.txt', header=None, sep='\n', squeeze=True)
s.str.count(r'\*')
0 2
1 1
2 2
Name: 0, dtype: int64
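For comparison, a plain-Python version of the same per-line count; note that it counts within each line rather than within the whole file p, which is the bug in the question's loop:
with open('Test.txt') as f:
    counts = [line.count('*') for line in f]
print(counts)  # [2, 1, 2] for the sample above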
I have a number of .csv files that I download into a directory.
Each .csv is supposed to have 3 columns of information. The head of one of these files looks like:
17/07/2014,637580,10.755
18/07/2014,61996,10.8497
21/07/2014,126758,10.8208
22/07/2014,520926,10.8201
23/07/2014,370843,9.2883
The code that I am using to read the .csv into a dataframe (df) is:
df = pd.read_csv(adj_directory+'\\'+filename, error_bad_lines=False,names=['DATE', 'PX', 'RAW'])
Where I name the three columns (DATE, PX and RAW).
This works fine when the file is formatted correctly. However, I have noticed that sometimes the .csv has a slightly different format and can look like this, for example:
09/07/2014,26268315,,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,,
15/07/2014,205019,10.8607
where a column value is missing and an extra comma appears in its place. This means the file fails to load into the dataframe (the df dataframe ends up empty).
Is there a way to read the data into a dataframe that handles the extra comma (rather than rejecting the offending row), so the df would look like:
09/07/2014,26268315,NaN
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,NaN
15/07/2014,205019,10.8607
It is probably best to fix the file upstream so that missing values aren't written as ,,. But if necessary you can correct the file in Python by replacing ,, with just , line by line. Taking your bad file as test.csv:
import re
import csv

patt = re.compile(r",,")

with open('corrected.csv', 'w') as f2:
    with open('test.csv') as f:
        for line in csv.reader(map(lambda s: patt.sub(',', s), f)):
            f2.write(','.join(str(x) for x in line))
            f2.write('\n')
Output: corrected.csv
09/07/2014,26268315,
10/07/2014,6601181,16.3857
11/07/2014,916651,12.5879
14/07/2014,213357,
15/07/2014,205019,10.8607
Then you should be able to read in this file without issue:
import pandas as pd
df = pd.read_csv('corrected.csv', names=['DATE', 'PX', 'RAW'])
DATE PX RAW
0 09/07/2014 26268315 NaN
1 10/07/2014 6601181 16.3857
2 11/07/2014 916651 12.5879
3 14/07/2014 213357 NaN
4 15/07/2014 205019 10.8607
Had this problem yesterday.
Have you tried:
pd.read_csv(adj_directory+'\\'+filename,
            error_bad_lines=False, names=['DATE', 'PX', 'RAW'],
            keep_default_na=False,
            na_values=[''])
I need to access the values of a column that occurs after the address column, but the presence of commas in the address field causes the file to parse with extra columns.
Example csv:
id,name,place,address,age,type,dob,date
1,Murtaza,someplace,Street,MA,22,B,somedate,somedate,
2,Murtaza,someplace,somestreet,45,C,somedate,somedate,
3,Murtaza,someplace,somestreet,MA,44,V,somedate,somedate
Excel output:
id name place address age type dob date newcolumn9
1 Murtaza someplace somestreet MA 22 B somedate somedate
2 Murtaza someplace somestreet 45 C somedate somedate
3 Murtaza someplace somestreet MA 44 V somedate somedate
This is what I tried:
# I was able to see that all columns before the column with extra commas displayed fine using this code.
import pandas as pd
import csv
with open('Myfile', 'r') as f, open('Newfile', 'w', newline='') as g:
    writer = csv.writer(g, delimiter=',')
    for line in f:
        row = line.rstrip('\n').split(',', 2)
        writer.writerow(row)
I am trying to do this in Python pandas. If I can parse the csv in reverse, I'll be able to get the proper values regardless of the error. From the above example, I want to extract the age column.
Pandas aside, you can do this with re.split():
import re

your_csv_file = open('your_csv_file.csv', 'r').read()
i_column = 2  # index of the desired column, counted from the back
lines = re.split('\n', your_csv_file)[:-1]  # drop the last (empty) line
your_column = []
for line in lines:
    your_column.append(re.split(',', line)[-i_column])  # the minus indexes from the end
print(your_column)
Executed on a .csv file like the one below,
4rth,askj,fpou,ABC,aekert
kjgf,poiuf,pejhh,,oeiu,DEF,akdhg
iuzrit,fslgk,gth,,rhf,,rhe,GHI,ozug
pwiuto,,,,eflgjkhrlguiazg,JKL,rgj
this returns
['ABC', 'DEF', 'GHI', 'JKL']
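A pandas flavour of the same right-to-left idea, as a sketch reusing the file name from above: read whole lines, then use Series.str.rsplit so commas in the earlier fields are ignored:
import pandas as pd

with open('your_csv_file.csv') as f:
    lines = [ln.rstrip('\n') for ln in f if ln.strip()]

# split on the last 2 commas only; the 2nd piece from the back is the wanted column
your_column = pd.Series(lines).str.rsplit(',', n=2).str[-2]
print(your_column.tolist())  # ['ABC', 'DEF', 'GHI', 'JKL']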
I think the best way to do this might be to write a separate script that removes the faulty commas. But if you just want to ignore the faulty lines, you can read each line into a StringIO and skip any line with the incorrect number of commas. So if you're expecting 4 columns:
from io import StringIO  # cStringIO on Python 2
import pandas

s = StringIO()
correct_columns = 4
with open('MyData.csv') as file:
    for line in file:
        if len(line.split(',')) == correct_columns:  # note: line.split(','), not ','.split(line)
            s.write(line)
s.seek(0)
pandas.read_csv(s)
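If simply dropping the malformed rows is acceptable, recent pandas versions (1.3+) can also do the filtering themselves, without the manual StringIO pass:
import pandas as pd

# rows with the wrong number of fields are skipped instead of raising
df = pd.read_csv('MyData.csv', on_bad_lines='skip')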
I have a csv with the details below:
Name,Desc,Year,Location
Jhon,12" Main Third ,2012,GR
Lew,"291" Line (12,596,3)",2012,GR
,All, 1992,FR
...
It is a very long file; I just showed the problematic lines. I am confused about how to read it into a Pandas DataFrame. I tried the quotechar, quoting and sep attributes of pandas read_csv, still with no success. I have no control over how the csv is designed.
You can do something like this; try whether it works for you:
import pandas as pd
import re
l1 = []
with open('/home/yusuf/Desktop/c1') as f:
    headers = f.readline().strip('\n').split(',')
    for a in f.readlines():
        if a:
            q = re.findall(r"^(\w*),(.*),\s?(\d+),(\w+)", a)
            if q:
                l1.append(q)
l2 = [list(b[0]) for b in l1]
df = pd.DataFrame(data=l2, columns=headers)
df
Output: a DataFrame with the Name, Desc, Year and Location columns. Regex demo: https://regex101.com/r/AU2WcO/1
You can't have the separator character inside an unquoted (or inconsistently quoted) field.
For example, in
Lew,"291" Line (12,596,3)",2012,GR
Pandas will assume you have 6 fields because you have 5 commas: the quoting is malformed, so even the commas that look protected are treated as separators. You would need to do some pre-processing of the text file to get rid of this issue, or ask for a different separator character (# or | seem to work well in my experience).
Pandas has no problems reading the other lines:
import pandas as pd
print(pd.read_csv('untitled.txt'))
Name Desc Year Location
0 Jhon 12" Main Third 2012 GR
1 NaN All 1992 FR
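For this particular file, a hedged pre-processing sketch, assuming only the Desc field can contain stray commas and quotes: peel one field off the left and two off the right, and treat whatever remains as Desc:
import pandas as pd

rows = []
with open('untitled.txt') as f:
    header = f.readline().rstrip('\n').split(',')
    for line in f:
        name, rest = line.rstrip('\n').split(',', 1)  # first field from the left
        desc, year, loc = rest.rsplit(',', 2)         # last two fields from the right
        rows.append([name, desc, year.strip(), loc])

df = pd.DataFrame(rows, columns=header)
print(df)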