Pandas CParserError: Error tokenizing data - python

I have a large csv file with 25 columns, that I want to read as a pandas dataframe. I am using pandas.read_csv().
The problem is that some rows have extra columns, something like this:
col1 col2 stringColumn ... col25
1 12 1 str1 3
...
33657 2 3 str4 6 4 3 #<- that line has a problem
33658 1 32 blbla #<-some columns have missing data too
When I try to read it, I get the error
CParserError: Error tokenizing data. C error: Expected 25 fields in line 33657, saw 28
The problem does not happen if the extra values appear in the first rows. For example, if I add values to the third row of the same file, it works fine:
#that example works:
col1 col2 stringColumn ... col25
1 12 1 str1 3
2 12 1 str1 3
3 12 1 str1 3 f 4
...
33657 2 3 str4 6 4 3 #<- that line has a problem
33658 1 32 blbla #<-some columns have missing data too
My guess is that pandas checks the first (n) rows to determine the number of columns, and if you have extra columns after that it has a problem parsing it.
Skipping the offending lines as suggested here is not an option; those lines contain valuable information.
Does anybody know a way around this?

In my initial post I mentioned not wanting to use error_bad_lines=False in pandas.read_csv. I have since decided that doing so is actually the more proper and elegant solution. I found this post quite useful:
Can I redirect the stdout in python into some sort of string buffer?
I added a little twist to the code shown in the answer.
import sys
import re
from cStringIO import StringIO  # Python 2; on Python 3 use io.StringIO
import pandas as pd

fake_csv = '''1,2,3\na,b,c\na,b,c\na,b,c,d,e\na,b,c\na,b,c,d,e\na,b,c\n'''  # bad data
fname = "fake.csv"

# Temporarily capture stderr so the "Skipping line ..." warnings can be parsed
old_stderr = sys.stderr
sys.stderr = mystderr = StringIO()

df1 = pd.read_csv(StringIO(fake_csv), error_bad_lines=False)

sys.stderr = old_stderr
log = mystderr.getvalue()

# Each warning contains three numbers: the line number, the expected field
# count and the number of fields actually seen
isnum = re.compile(r"\d+")
lines_skipped_log = [
    isnum.findall(i) + [fname]
    for i in log.split("\n") if isnum.search(i)
]
columns = ["line_num", "flds_expct", "num_fields", "file"]
lines_skipped_log.insert(0, columns)
From there you can do anything you want with lines_skipped_log such as output to csv, create a dataframe etc.
Perhaps you have a directory full of files. You can create a list of pandas data frames out of each log and concatenate. From there you will have a log of what rows were skipped and for which files at your fingertips (literally!).
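As a rough sketch (the glob pattern and the parse_one_file helper, which would wrap the stderr-capturing code above, are assumptions on my part), the per-file logs could be combined like this:

import glob
import pandas as pd

def skipped_lines_frame(lines_skipped_log):
    # the first entry holds the column names, the rest are the parsed log rows
    header = lines_skipped_log[0]
    rows = lines_skipped_log[1:]
    return pd.DataFrame(rows, columns=header)

all_logs = []
for fname in glob.glob("*.csv"):           # hypothetical directory of files
    log_rows = parse_one_file(fname)       # assumed wrapper around the snippet above
    all_logs.append(skipped_lines_frame(log_rows))

skipped = pd.concat(all_logs, ignore_index=True)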

A possible workaround is to specify the column names. Please refer to my answer to a similar issue: https://stackoverflow.com/a/43145539/6466550

Since I did not find an answer that completely solves the problem, here is my workaround: I found out that explicitly passing the column names with the option names=('col1', 'col2', 'stringColumn' ... 'column25', '', '', '') allows me to read the file. It forces me to read and parse every column, which is not ideal since I only need about half of them, but at least I can read the file now.
Combining the arguments names and usecols does not work; if somebody has another solution I would be happy to hear it.
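For reference, a rough sketch of the names= workaround described above (the file name and the spare column names are placeholders; the spare names just need to cover the widest row in the file):

import pandas as pd

# 25 real column names plus a few spare names to absorb the extra fields
col_names = ['col%d' % i for i in range(1, 26)] + ['extra1', 'extra2', 'extra3']

df = pd.read_csv('data.csv',      # placeholder file name
                 names=col_names,
                 header=0)        # replace the existing header row with the padded names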

Related

Cell-wise calculations in a Pandas Dataframe

I have what I'm sure is a fundamental lack of understanding about how dataframes work in Python. I am sure this is an easy question, but I have looked everywhere and can't find a good explanation. I am trying to understand why sometimes dataframe calculations seem to run on a row-by-row (or cell by cell) basis, and sometimes seem to run for an entire column... For example:
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df
Which gives:
Name Depth
0 49-037-23094 20
1 49-029-21476 21
2 49-029-20812 7
3 49-041-21318 18
Now I know I can do:
df['DepthDouble']=df['Depth']*2
And get:
Name Depth DepthDouble
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
Which is what I would expect. But this doesn't always work, and I'm trying to understand why. For example, I am trying to run this code to modify the name:
df['newName']=''.join(re.findall('\d',str(df['Name'])))
which gives:
Name Depth DepthDouble \
0 49-037-23094 20 40
1 49-029-21476 21 42
2 49-029-20812 7 14
3 49-041-21318 18 36
newName
0 04903723094149029214762490292081234904121318
1 04903723094149029214762490292081234904121318
2 04903723094149029214762490292081234904121318
3 04903723094149029214762490292081234904121318
So it is taking all the values in my name column, removing the dashes, and concatenating them. Of course, I'd just like it to be a new name column exactly the same as the original "Name" column, but without the dashes.
So, can anyone help me understand what I am doing wrong here? I don't understand why sometimes DataFrame calculations for one column are done row by row (e.g., the DepthDouble column) and sometimes Python seems to take all values in the entire column and run the calculation on them at once (e.g., the newName column).
Surely the way to get around this isn't by making a loop for every index in the df to force it to run individually for each row for a given column?
If the output you're looking for is:
Name Depth newName
0 49-037-23094 20 4903723094
1 49-029-21476 21 4902921476
2 49-029-20812 7 4902920812
3 49-041-21318 18 4904121318
The way to get this is:
df['newName']=df['Name'].map(lambda name: ''.join(re.findall('\d', name)))
map is like apply but specifically for Series objects. Since you're applying to only the Name column you are operating on a Series.
If the lambda part is confusing, an equivalent way to write it is:
def find_digits(name):
    return ''.join(re.findall('\d', name))

df['newName'] = df['Name'].map(find_digits)
The equivalent operation in traditional for loops is:
newNameSeries = pd.Series(name='newName')
for name in df['Name']:
    newNameSeries = newNameSeries.append(pd.Series(''.join(re.findall('\d', name))), ignore_index=True)
pd.concat([df, newNameSeries], axis=1).rename(columns={0: 'newName'})
While there might be a slightly cleaner way to do the loop, you can see how much simpler the first approach is compared to trying to use for-loops. It's also faster. As you already have indicated you know, avoid for loops when using pandas.
The issue is that with str(df['Name']) you are converting the entire Name-column of your DataFrame to one single string. What you want to do instead is to use one of pandas' own methods for strings, which will be applied to every single element of the column.
For example, you could use pandas' replace method for strings:
import pandas as pd
data = {'Name':['49-037-23094', '49-029-21476', '49-029-20812', '49-041-21318'], 'Depth':[20, 21, 7, 18]}
df = pd.DataFrame(data)
df['newName'] = df['Name'].str.replace('-', '')

read_csv() not giving expected output after I update any cell in the csv

1) Creation of DF
import pandas as pd
li=[["10","Data","String","01249","0199"],["10","Data","String","",""]]
df=pd.DataFrame(li)
df.to_csv("Dummy.csv")
2) Dummy.csv looks like:
0 1 2 3 4
0 10 Data String 1249 199
1 10 Data String
3) Tested this piece of code for clarification:
D = pd.read_csv("Dummy.csv", dtype=str)
print(D['3'])
Gave the expected output:
0 01249
1 NaN
4) I need to manually fill the empty cells in the 3rd and 4th columns with the values from the cells above, here "01249" and "0199", by opening the csv in Excel. I successfully changed and saved Dummy.csv in Excel.
5) My validation.py file then opens Dummy.csv again to validate it, reading the csv for further processing with the same code as in step 3:
D = pd.read_csv("Dummy.csv", dtype=str)
print(D['3'])
This is not my expected output:
0 1249
1 1249
My expected output is:
0 01249
1 01249
This is a serious issue; I cannot lose the leading zeros. As you can see, if I do not edit that column in the csv and read it with read_csv, I get the expected output with string dtype, but if I update the cell with "01249" in the csv and read it again with read_csv, I no longer get the expected output even with the same string dtype. I think Excel is changing the whole column's datatype to General; I don't know what General means.
In the end, I need to get my expected output even after updating the cells in the csv. How can I do this?
Sorry for the long post; I needed to clearly explain all the steps I have tried.
It is an Excel issue as opposed to a Python one.
Change the type of the whole column (by clicking on the column letter) to text instead of general.
General means that you are allowing Excel to guess the datatype, and since these are numbers it omits the leading zero.
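If you prefer to keep Excel out of the workflow entirely, the fill step itself can also be done in pandas while everything stays as strings; this is only a sketch based on the example above, not part of the accepted Excel fix:

import pandas as pd

# Read everything as strings so the leading zeros survive
D = pd.read_csv("Dummy.csv", dtype=str)

# Fill the empty cells in columns '3' and '4' from the row above,
# replacing the manual edit in Excel
D[['3', '4']] = D[['3', '4']].ffill()

D.to_csv("Dummy.csv", index=False)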

How can I read a range('A5:B10') and place these values into a dataframe using openpyxl

Being able to define the ranges in a manner similar to excel, i.e. 'A5:B10' is important to what I need so reading the entire sheet to a dataframe isn't very useful.
So what I need to do is read the values from multiple ranges in the Excel sheet to multiple different dataframes.
valuerange1 = ['a5:b10']
valuerange2 = ['z10:z20']
df = pd.DataFrame(values from valuerange)
df = pd.DataFrame(values from valuerange1)
or
df = pd.DataFrame(values from ['A5:B10'])
I have searched but either I have done a very poor job of searching or everyone else has gotten around this problem but I really can't.
Thanks.
Using openpyxl
Since you have indicated that you are looking for a user-friendly way to specify the range (like the Excel syntax), and as Charlie Clark already suggested, you can use openpyxl.
The following utility function takes a workbook and a column/row range and returns a pandas DataFrame:
import re
import pandas as pd
from openpyxl import load_workbook
from openpyxl.utils import get_column_interval

def load_workbook_range(range_string, ws):
    col_start, col_end = re.findall("[A-Z]+", range_string)
    data_rows = []
    for row in ws[range_string]:
        data_rows.append([cell.value for cell in row])
    return pd.DataFrame(data_rows, columns=get_column_interval(col_start, col_end))
Usage:
wb = load_workbook(filename='excel-sheet.xlsx', read_only=True)
ws = wb.active
load_workbook_range('B1:C2', ws)
Output:
B C
0 5 6
1 8 9
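Using the helper above, the two ranges mentioned in the question would each give their own DataFrame (assuming the sheet actually has data in those cells):

df_ab = load_workbook_range('A5:B10', ws)
df_z = load_workbook_range('Z10:Z20', ws)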
Pandas only Solution
Given the following data in an excel sheet:
A B C
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
You can load it with the following command:
pd.read_excel('excel-sheet.xlsx')
If you were to limit the data being read, the pandas.read_excel method offers a number of options. Use the parse_cols, skiprows and skip_footer to select the specific subset that you want to load:
pd.read_excel(
    'excel-sheet.xlsx',    # name of the excel file
    names=['B', 'C'],      # new column headers
    skiprows=range(0, 1),  # list of rows you want to omit at the beginning
    skip_footer=1,         # number of rows you want to skip at the end
    parse_cols='B:C'       # columns to parse (note the excel-like syntax)
)
Output:
B C
0 5 6
1 8 9
Some notes:
The API of the read_excel method is not meant to support more complex selections. In case you require a complex filter it is much easier (and cleaner) to load the whole data into a DataFrame and use the excellent slicing and indexing mechanisms provided by pandas.
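For instance, a minimal sketch of that approach (the file name and the iloc bounds are only for illustration):

import pandas as pd

# Load the entire sheet once, without treating any row as a header,
# then slice the block you need with iloc
full = pd.read_excel('excel-sheet.xlsx', header=None)

# e.g. the cells A5:B10 correspond to rows 4-9 and columns 0-1 (zero-based)
block = full.iloc[4:10, 0:2]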
The easiest way is to use pandas for getting the range of values from excel.
import pandas as pd
#if you want to choose single range, you can use the below method
src=pd.read_excel(r'August.xlsx',usecols='A:C',sheet_name='S')
#if you have multiple ranges, which means a dataframe with columns A:C as well as some other range
src=pd.read_excel(r'August.xlsx',usecols='A:C,G:I',sheet_name='S')
If you want to use a particular range, for example "B3:E5", you can use the following structure.
src=pd.read_excel(r'August.xlsx',usecols='B:E',sheet_name='S',header=2)[0:2]

select certain value then output

I have a file containing mixed information while I only need certain columns of them.
Below is my example file.
A B C D
1 2 3 abcdef
5 6 7 abcdef
1 2 3 abcdef
And I want to extract the information I need from the file. For example, my output file should look like this:
A C D # I only need A, C, and D column.
1 3 ab # For D column, I only need ab.
5 7 ab
1 3 ab
It is not a comma- or tab-separated file; there is a space between each column.
You can still read a space-separated file with the csv module by using the delimiter kwarg:
>>> import csv
>>> with open('/tmp/data.txt') as f:
...     reader = csv.DictReader(f, delimiter=' ')
...     for row in reader:
...         print row['A'], row['C'], row['D'][:2]
...
1 3 ab
5 7 ab
1 3 ab
If you want a generic way of managing data structures, the easiest thing you can do is use Python libraries to ease the job.
You can use the pandas library (Python Data Analysis Library) to rapidly parse the file into a DataFrame, which provides methods to do what you want.
You also need the NumPy library, because the as_matrix method (below) returns a NumPy array.
You can treat your data file as a csv (comma-separated values) file with spaces as separators.
With pandas you can easily parse the file with read_csv:
import pandas as pd
import numpy as np
dataFrame = pd.read_csv("file.txt", sep = ' ')
For selecting columns you use the as_matrix method:
selection = dataFrame.as_matrix(('A', 'C', 'D'))
Then you probably want to cast it back to a DataFrame to continue using its methods:
newDataFrame = pd.DataFrame(selection)
Dropping "cdef" of the "abcdef" values in the column D looks like a thing that can be solved by a simple for, and using [String][5] methods provided by Python. Its a very particular instruction and i don't know any implemented method of any library that accomplishes this.
I hope I helped you.
PS: I tried to post a lot of links but the system didn't let me. I recommend you search Google for NumPy and pandas if you don't have them.
You should check the pandas DataFrame docs for the available methods. In case you didn't understand what I did, look for the pandas.read_csv and pandas.DataFrame.as_matrix docs on Google.
And if you don't know how to work with strings, look in the Python docs for str.
Edit: Anyway, if you don't want to use libraries, you can parse the txt file into a list of lists imitating a matrix, or use the csv approach that wim mentions in his answer. Then create a function that drops columns by checking the first element of every column (the column identifier) and, with a few for loops, exports the result to another matrix.
Then create another function that deletes the desired values of a column, with some more for loops.
The point is that using functions to accomplish what you want makes the solution generic for any table managed as a matrix.
If you have more than one column like D and want to do the same thing to them, you can do the following, if you're OK with selecting columns by index instead of by letter:
# your data like this
A B C D E
1 2 3 abcdef abbbb
5 6 7 abcdef abbbb
1 2 3 abcdef abbbb
You import csv, then:
>>> with open('yourdata.txt') as f:
...     reader = csv.reader(f, delimiter=' ')
...     for row in reader:
...         print(row[0], row[1], *[c[:2] for c in row[3:]])
...
A B D E
1 2 ab ab
5 6 ab ab
1 2 ab ab
The * operator before the [c[:2] for c in row[3:]] is for list argument unpacking. * basically converts [1,2,3] into 1,2,3, so print(*[1,2,3]) is identical to print(1,2,3). It works on tuples as well.
However, this is Python 3. If you are using Python 2, print will give you a syntax error, but you can make a wrapper function that takes in the unpacked list arguments and replace print with this function:
def myprint(*args):
    print ' '.join([str(i) for i in args])

Pandas advanced read_excel or ExcelFile.parse

I'm trying to do some conditional parsing of excel files into Pandas dataframes. I have a group of excel files and each has some number of lines at the top of the file that are not part of the data -- some identification data based on what report parameters were used to create the report.
I want to use the ExcelFile.parse() method with skiprows=some_number but I don't know what some_number will be for each file.
I do know that the HeaderRow will start with one member of a list of possibilities. How can I tell Pandas to create the dataframe starting on the row that includes any some_string in my list of possibilities?
Or, is there a way to import the entire sheet and then remove the rows preceding the row that includes any some_string in my list of possibilities?
Most of the time I would just post-process this in pandas, i.e. diagnose, remove the rows, and correct the dtypes, in pandas. This has the benefit of being easier but is arguably less elegant (I suspect it'll also be faster doing it this way!):
In [11]: df = pd.DataFrame([['blah', 1, 2], ['some_string', 3, 4], ['foo', 5, 6]])
In [12]: df
Out[12]:
0 1 2
0 blah 1 2
1 some_string 3 4
2 foo 5 6
In [13]: df[0].isin(['some_string']).argmax() # assuming it's found
Out[13]: 1
I may actually write this in plain Python, as there's probably little/no benefit in vectorizing (and I find this more readable):
def to_skip(df, preceding):
    for i, s in enumerate(df[0]):
        if s in preceding:
            return i
    raise ValueError("No preceding string found in first column")
In [21]: preceding = ['some_string']
In [22]: to_skip(df, preceding)
Out[22]: 1
In [23]: df.iloc[1:]  # or whatever you need to do
Out[23]:
0 1 2
1 some_string 3 4
2 foo 5 6
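Putting the pieces above together, a rough end-to-end sketch (the file name and the list of possible header strings are assumptions, and the dtype fix-up is just one option):

import pandas as pd

possibilities = ['some_string']                  # possible first cells of the header row
raw = pd.read_excel('report.xlsx', header=None)  # hypothetical file name

header_row = raw[0].isin(possibilities).argmax() # as in [13], assuming it's found

df = raw.iloc[header_row + 1:].reset_index(drop=True)
df.columns = raw.iloc[header_row]                # promote the found row to the header
df = df.apply(pd.to_numeric, errors='ignore')    # correct dtypes where possible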
The other possibility is messing about with ExcelFile and finding the row number (again with a for-loop as above, but in openpyxl or similar). However, I don't think there would be a way to read the excel file (xml) just once if you do this.
This is somewhat unfortunate when compared to how you could do this on a csv, where you can read the first few lines (until you see the row/entry you want), and then pass this opened file to read_csv. (If you can export your Excel spreadsheet to csv then parse in pandas, that would be faster/cleaner...)
Note: read_excel isn't really that fast anyways (esp. compared to read_csv)... so IMO you want to get to pandas asap.
