Dealing with parse errors when reading in a CSV via dask.dataframe - Python

I am working with a massive CSV file (>3 million rows, 76 columns) and have decided to use Dask to read the data before converting it to a pandas DataFrame.
However, I am running into what looks like column bleeding in the last column. See the code and error below.
import dask.dataframe as dd
import pandas as pd

dataframe = dd.read_csv("SAS url",
                        delimiter=",",
                        encoding="UTF-8",
                        blocksize=25e6,
                        engine='python')
Then, to see if all the columns are present, I use
dataframe.columns
When using
dataframe.compute()
I see the following error:
[screenshot of the pandas ParserError traceback omitted]
When using the read_csv parameter error_bad_lines=False, it shows that many of the rows have 77 or 78 fields instead of the expected 76.
Note: Omitting these faulty rows is unfortunately not an option.
Solution I am seeking
Is there a way to keep all the fields and append these extra fields to new columns when necessary?

Yes, there is. You can use the names= parameter to add extra column names before you read the full CSV. I have not tried this with Dask, but Dask's read_csv calls pandas' read_csv under the hood, so this should apply to dd.read_csv as well.
To demonstrate using a simulated CSV file:
import io
import pandas as pd

sim_csv = io.StringIO(
    '''A,B,C
11,21,31
12,22,32
13,23,33,43,53
14,24,34
15,25,35'''
)
By default, read_csv fails:
df = pd.read_csv(sim_csv)
ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 5
Capture the column names:
sim_csv.seek(0) # Not needed for a real CSV file
df = pd.read_csv(sim_csv, nrows=1)
save_cols = df.columns.to_list()
Add a couple of column names to the end of the names list and read your CSV:
sim_csv.seek(0) # Not needed for a real CSV file
df = pd.read_csv(sim_csv, skiprows=1, names=save_cols+['D','E'])
df
    A   B   C     D     E
0  11  21  31   NaN   NaN
1  12  22  32   NaN   NaN
2  13  23  33  43.0  53.0
3  14  24  34   NaN   NaN
4  15  25  35   NaN   NaN
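Applied back to the original Dask read this is untested (as noted above), but a minimal sketch might look like the following, where "data.csv" stands in for the real path and 'extra1'/'extra2' are placeholder names for the padding columns:
import dask.dataframe as dd
import pandas as pd

# Grab the real header with pandas, then pad it with extra names.
# "data.csv", 'extra1' and 'extra2' are placeholders, not known values.
save_cols = pd.read_csv("data.csv", nrows=1).columns.to_list()

ddf = dd.read_csv("data.csv",
                  skiprows=1,                              # skip the original header row
                  names=save_cols + ['extra1', 'extra2'],  # pad to the widest row seen
                  blocksize=25e6)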

Related

Python: Excel to data frame: removing top rows and columns that don't contain the 'right' data

I have a rather basic Excel-to-pandas issue that I am unable to get around. Any help in this regard will be highly appreciated.
Source file
I have some data in an Excel sheet like the one below (apologies for pasting a picture and not a table; the screenshot is omitted):
Columns A, B, C are not required. I need the highlighted data to be read/moved into a pandas dataframe.
df = pd.read_excel('Miscel.xlsx', sheet_name='Sheet2', skiprows=8, usecols=[3,4,5,6])
df
        Date Customers Location  Sales
0 2021-10-05         A      NSW     12
1 2021-10-03         B      NSW     10
2 2021-10-01         C      NSW     33
If your data is small, you can also read everything in and then drop the all-NaN columns:
df = pd.read_excel('Miscel.xlsx', sheet_name='Sheet2', skiprows=8).dropna(how='all', axis=1)
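As a minimal sketch of what that dropna call does, with a made-up frame standing in for the Excel sheet:
import numpy as np
import pandas as pd

# Column 'A' is entirely NaN, standing in for an unused Excel column
df = pd.DataFrame({
    'A': [np.nan, np.nan, np.nan],
    'Date': ['2021-10-05', '2021-10-03', '2021-10-01'],
    'Sales': [12, 10, 33],
})

# how='all' drops only columns where *every* value is NaN
print(df.dropna(how='all', axis=1))  # 'A' is gone; 'Date' and 'Sales' survive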

Troubles converting column names according to dictionary using map function

I am having trouble converting the column names in my pandas dataframe according to the dictionary I have created:
housing = pd.read_csv('City_Zhvi_AllHomes.csv')
cols = housing.iloc[:, 51:251]
housing = housing.drop(list(housing)[6:251], axis=1)
cols = cols.groupby(np.arange(len(cols.columns)) // 3, axis=1).mean()
a = pd.read_excel('gdplev.xls', header=None, skiprows=220, index_col=0, names=['GDP'], parse_cols=[4,6])
col_names = list(a.index)
col_names = col_names + ['2016q3']
vals = list(cols.columns.values)
cols_dict = dict(zip(col_names, vals))
cols = cols.rename(columns=cols_dict)
I also tried using the map function:
cols.columns.map([cols_dict])
The desired outcome is to convert all the column names (0-66) to the keys listed in my dictionary (2000q1-2016q3).
However, the two approaches I have tried yield the same result: the columns keep their original names.
UPDATE
As requested, here are the first few rows of my dataframe:
0 1 2 3 4 5 6 7 8 9 ... 57 58 59 60 61 62 63 64 65 66
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 5.154667e+05 5.228000e+05 5.280667e+05 5.322667e+05 5.408000e+05 5.572000e+05 5.728333e+05 5.828667e+05 5.916333e+05 587200.0
1 2.070667e+05 2.144667e+05 2.209667e+05 2.261667e+05 2.330000e+05 2.391000e+05 2.450667e+05 2.530333e+05 2.619667e+05 2.727000e+05 ... 4.980333e+05 5.090667e+05 5.188667e+05 5.288000e+05 5.381667e+05 5.472667e+05 5.577333e+05 5.660333e+05 5.774667e+05 584050.0
2 1.384000e+05 1.436333e+05 1.478667e+05 1.521333e+05 1.569333e+05 1.618000e+05 1.664000e+05 1.704333e+05 1.755000e+05 1.775667e+05 ... 1.926333e+05 1.957667e+05 2.012667e+05 2.010667e+05 2.060333e+05 2.083000e+05 2.079000e+05 2.060667e+05 2.082000e+05 212000.0
3 5.300000e+04 5.363333e+04 5.413333e+04 5.470000e+04 5.533333e+04 5.553333e+04 5.626667e+04 5.753333e+04 5.913333e+04 6.073333e+04 ... 1.137333e+05 1.153000e+05 1.156667e+05 1.162000e+05 1.179667e+05 1.212333e+05 1.222000e+05 1.234333e+05 1.269333e+05 128700.0
And a sample of my dictionary:
{0: '2000q1',
 1: '2000q2',
 2: '2000q3',
 3: '2000q4',
 4: '2001q1',
 5: '2001q2',
 ...}
You can rename columns in this way:
#Rename Columns
df.rename(columns={'old name1':'new name1','old name2':'new name2'}, inplace=True)
So you'd just use:
housing.rename(columns=cols_dict, inplace=True)
But change your dictionary so that the keys are the old column names and the values are the new names:
cols_dict = dict(zip(vals, col_names))
Looking at your code, though, it looks like you're applying that to a groupby result. So you'd need to get that cols object back to a plain DataFrame before the rename; Renaming Column Names in Pandas Groupby function explains how to handle the groupby case.
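For illustration, a minimal self-contained sketch of the fix (the frame and labels here are made up):
import pandas as pd

# Hypothetical frame with integer column labels, standing in for `cols`
df = pd.DataFrame([[1, 2], [3, 4]], columns=[0, 1])

vals = list(df.columns.values)      # old labels: [0, 1]
col_names = ['2000q1', '2000q2']    # new labels

# rename() matches dict *keys* against the existing labels, so the
# old names must be the keys and the new names the values
df = df.rename(columns=dict(zip(vals, col_names)))
print(df.columns.tolist())          # ['2000q1', '2000q2']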

Why is pandas data frame interpreting all data as NaN?

I am importing data from a csv file for use in a pandas data frame. My data file has 102 rows and 5 columns, and all of them are clearly labelled as 'Number' in Excel. My code is as follows:
import pandas as pd
data = pd.read_csv('uni.csv', header=None, names=['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])
print(data.head())
The output looks like this:
TopThird Oxbridge Russell Other Low
0 14\t1\t12\t35\t1 NaN NaN NaN NaN
1 14\t1\t12\t32\t0 NaN NaN NaN NaN
2 16\t0\t13\t33\t0 NaN NaN NaN NaN
3 10\t0\t9\t44\t1 NaN NaN NaN NaN
4 18\t1\t13\t28\t1 NaN NaN NaN NaN
And this continues to the bottom of the data frame. I have attempted to change the cell type in Excel to 'General' or use decimal points on the 'Number' type, but this has not changed anything.
Why is this happening? How can it be prevented?
It seems like your file is tab-separated. You'll need to explicitly let read_csv know that it is dealing with whitespace characters as delimiters.
In most cases, passing sep='\t' should work.
df = pd.read_csv('uni.csv',
                 sep='\t',
                 header=None,
                 names=['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])
In some cases, however, columns are not perfectly tab separated. Assuming you have a TSV of numbers, it should be alright to use delim_whitespace=True -
df = pd.read_csv('uni.csv',
                 delim_whitespace=True,
                 header=None,
                 names=['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])
This is equivalent to sep='\s+' and is a little more generalised, so use it with caution (see the sketch below). On the upside, if your columns have stray whitespace, this should take care of that automatically.
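To illustrate the caveat with a made-up one-line sample: a space inside a field survives with sep='\t', but becomes an extra column break with delim_whitespace=True.
import io
import pandas as pd

sample = "14\t1\t12 35\t1\n"  # hypothetical row whose third field contains a space

print(pd.read_csv(io.StringIO(sample), sep='\t', header=None).shape)
# (1, 4) -- the space stays inside the field

print(pd.read_csv(io.StringIO(sample), delim_whitespace=True, header=None).shape)
# (1, 5) -- the space is treated as another delimiter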
As mentioned by @Vaishali, there's an alternative function, pd.read_table, that is useful for TSV files and will work with the same arguments that you passed to read_csv -
df = pd.read_table('uni.csv', header=None, names=[...])
Looks like tab-delimited data. Try sep='\t':
data = pd.read_csv('uni.csv', sep='\t', header=None, names=['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])

how to load only valid rows (without na) from excel to pandas dataframe [duplicate]

I have tried to delete blank rows from my csv file, however this is not working; it only writes out the first line.
Please take a look and tell me how I can get all the rows with text and skip the rows that are blank.
Here is the code:
It just reads out the first line of the csv file.
Thank you in advance!
First read your csv file with pandas:
df = pd.read_csv('input.csv')
Then remove the blank rows:
df = df.dropna()
For more details on dropna, check the documentation.
The problem is that iterating over a DataFrame:
for line in df:
    print(line)
returns the column names, not the rows.
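A quick sketch of that behaviour with a made-up frame:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

for line in df:
    print(line)              # prints 'A', then 'B' -- just the column labels

for _, row in df.iterrows():
    print(row.tolist())      # prints [1, 3], then [2, 4] -- the actual rows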
If I have a csv file like the one below, with a blank row:
B;D;K;N;M;R
0;2017-04-27 01:35:30;C;3.5;A;01:15:00;23.0
1;2017-04-27 01:37:30;B;3.5;B;01:13:00;24.0
2;2017-04-27 01:39:00;K;3.5;C;00:02:00;99.0

4;2017-04-27 01:39:00;K;3.5;C;00:02:00;99.0
df = pd.read_csv('input.csv', delimiter=';') will give the dataframe, ignoring the blank line:
B D K N M R
0 2017-04-27 01:35:30 C 3.5 A 01:15:00 23.0
1 2017-04-27 01:37:30 B 3.5 B 01:13:00 24.0
2 2017-04-27 01:39:00 K 3.5 C 00:02:00 99.0
4 2017-04-27 01:39:00 K 3.5 C 00:02:00 99.0
Your code works when you use open. Pandas read_csv converts the csv file into a dataframe; you might be confusing the two.
df = open('input.csv')
new_contents = []
for line in df:
    if not line.strip():   # blank line: nothing but whitespace
        continue
    else:
        new_contents.append(line)
Recent pandas (v1.3.0 at the time of writing) has an argument that tells read_csv to skip blank rows. It's enabled by default, but if you want to pass True anyway (e.g. for self-documenting code), just set the flag explicitly. This is from the doc:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
skip_blank_lines: bool, default True
      If True, skip over blank lines rather than interpreting as NaN values.
So, in your code it is:
df = pd.read_csv(path, sep=';', skip_blank_lines=True)

Column Manipulations with date-Time Pandas

I am trying to do some manipulations involving rows and columns at the same time, including date and time series, in Pandas. Traditionally, with no series involved, Python dictionaries are great, but with Pandas this is new to me.
Input files: N of them.
File1.csv, File2.csv, File3.csv, ..., Filen.csv

File1.csv         File2.csv         File3.csv
Ids,Date-time-1   Ids,Date-time-2   Ids,Date-time-1
56,4568           645,5545          25,54165
45,464            458,546

I am trying to merge the Date-time columns of all the files into one big data file with respect to Ids:
Ids,Date-time-ref,Date-time-1,date-time-2
56,100,4468,NAN
45,150,314,NAN
645,50,NAN,5495
458,200,NAN,346
25,250,53915,NAN
Check for the date-time column; if it does not exist yet, create it. Then fill in the values with respect to Ids by subtracting the date-time-ref value of that Id from the current date-time value.
Fill empty places with NAN, and if a later file has a value for them, replace the NAN with the new value.
If it were a straight column subtraction it would be easy enough, but doing it in sync with the date-time series, and with respect to Ids, seems a bit confusing.
Appreciate some suggestions to begin with. Thanks in advance.
Here is one way to do it.
import pandas as pd
import numpy as np
from io import StringIO
# your csv file contents
csv_file1 = 'Ids,Date-time-1\n56,4568\n45,464\n'
csv_file2 = 'Ids,Date-time-2\n645,5545\n458,546\n'
# add a duplicated Ids record for testing purpose
csv_file3 = 'Ids,Date-time-1\n25,54165\n645, 4354\n'
csv_file_all = [csv_file1, csv_file2, csv_file3]
# read csv into df using list comprehension
# I use buffer here, replace stringIO with your file path
df_all = [pd.read_csv(StringIO(csv_file)) for csv_file in csv_file_all]
# processing
# =====================================================
# concat along axis=0, outer join on axis=1
merged = pd.concat(df_all, axis=0, ignore_index=True, join='outer').set_index('Ids')
Out[206]:
     Date-time-1  Date-time-2
Ids
56          4568          NaN
45           464          NaN
645          NaN         5545
458          NaN          546
25         54165          NaN
645         4354          NaN
# custom function to handle/merge duplicates on Ids (axis=0)
def apply_func(group):
    # forward-fill within the duplicate group, then keep the last row
    return group.fillna(method='ffill').iloc[-1]
# remove Ids duplicates
merged_unique = merged.groupby(level='Ids').apply(apply_func)
Out[207]:
     Date-time-1  Date-time-2
Ids
25         54165          NaN
45           464          NaN
56          4568          NaN
458          NaN          546
645         4354         5545
# do the subtraction
master_csv_file = 'Ids,Date-time-ref\n56,100\n45,150\n645,50\n458,200\n25,250\n'
df_master = pd.read_csv(StringIO(master_csv_file), index_col=['Ids']).sort_index()
# select matching records and horizontal concat
df_matched = pd.concat([df_master, merged_unique.reindex(df_master.index)], axis=1)
# use broadcasting
df_matched.iloc[:, 1:] = df_matched.iloc[:, 1:].sub(df_matched.iloc[:, 0], axis=0)
Out[208]:
     Date-time-ref  Date-time-1  Date-time-2
Ids
25             250        53915          NaN
45             150          314          NaN
56             100         4468          NaN
458            200          NaN          346
645             50         4304         5495
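One caveat if you run this on current pandas (2.x): fillna(method='ffill') is deprecated there, so the dedup helper would become:
# equivalent helper on pandas 2.x, where fillna(method=...) is deprecated
def apply_func(group):
    return group.ffill().iloc[-1]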
