Why is pandas data frame interpreting all data as NaN? - python

I am importing data from a csv file for use in a pandas data frame. My data file has 102 rows and 5 columns, and all of them are clearly labelled as 'Number' in Excel. My code is as follows:
import pandas as pd
data = pd.read_csv('uni.csv', header=None, names = ['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])
print data.head()
The output looks like this:
TopThird Oxbridge Russell Other Low
0 14\t1\t12\t35\t1 NaN NaN NaN NaN
1 14\t1\t12\t32\t0 NaN NaN NaN NaN
2 16\t0\t13\t33\t0 NaN NaN NaN NaN
3 10\t0\t9\t44\t1 NaN NaN NaN NaN
4 18\t1\t13\t28\t1 NaN NaN NaN NaN
And this continues to the bottom of the data frame. I have attempted to change the cell type in Excel to 'General' or use decimal points on the 'Number' type, but this has not changed anything.
Why is this happening? How can it be prevented?

It looks like your file actually contains tab-separated values, so you need to tell read_csv explicitly that the delimiter is a tab rather than a comma.
In most cases, passing sep='\t' should work.
df = pd.read_csv('uni.csv',
                 sep='\t',
                 header=None,
                 names=['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])
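With the tab delimiter in place, the sample rows from the question should parse into five separate numeric columns, so df.head() should look roughly like this:
   TopThird  Oxbridge  Russell  Other  Low
0        14         1       12     35    1
1        14         1       12     32    0
2        16         0       13     33    0
3        10         0        9     44    1
4        18         1       13     28    1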
In some cases, however, columns are not perfectly tab separated. Assuming the file contains only numbers, it should also be fine to use delim_whitespace=True:
df = pd.read_csv('uni.csv',
                 delim_whitespace=True,
                 header=None,
                 names=['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])
This is equivalent to sep='\s+' and is a little more general, so use it with caution; on the upside, if your columns contain stray whitespace, it is handled automatically.
As mentioned by @Vaishali, there is an alternative function, pd.read_table, whose default separator is a tab, and it works with the same arguments you passed to read_csv:
df = pd.read_table('uni.csv', header=None, names=[...])

Looks like tab-delimited data. Try sep='\t':
data = pd.read_csv('uni.csv', sep='\t', header=None, names = ['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])

Related

Dealing with Parse Errors when reading in csv via dask.dataframe

I am working with a massive csv file (>3million rows, 76 columns) and have decided to use dask to read the data before converting to a pandas dataframe.
However, I am running into an issue of what looks like column bleeding in the last column. See the code and error below.
import dask.dataframe as dd
import pandas as pd
dataframe = dd.read_csv("SAS url",
                        delimiter=",",
                        encoding="UTF-8",
                        blocksize=25e6,
                        engine='python')
Then to see if all the columns are present I use
dataframe.columns
When using
dataframe.compute()
I see a ParserError (the original post includes a screenshot of the traceback).
When using the read_csv parameter error_bad_lines=False, it shows that many of the rows have 77 or 78 fields instead of the expected 76.
Note: Omitting these faulty rows is unfortunately not an option.
Solution I am seeking
Is there a way to keep all the fields and append these extra fields to new columns when necessary?
Yes, there is. You can use the names= parameter to add extra columns before you read the full CSV. I have not tried this with Dask, but Dask's read_csv calls pandas read_csv under the hood, so it should apply to dd.read_csv as well.
To demonstrate using a simulated CSV file:
import io
import pandas as pd

sim_csv = io.StringIO(
    '''A,B,C
11,21,31
12,22,32
13,23,33,43,53
14,24,34
15,25,35'''
)
By default, read_csv fails:
df = pd.read_csv(sim_csv)
ParserError: Error tokenizing data. C error: Expected 3 fields in line 4, saw 5
Capture the column names:
sim_csv.seek(0) # Not needed for a real CSV file
df = pd.read_csv(sim_csv, nrows=1)
save_cols = df.columns.to_list()
Add a couple of extra column names to the end of the saved list and read your CSV:
sim_csv.seek(0) # Not needed for a real CSV file
df = pd.read_csv(sim_csv, skiprows=1, names=save_cols+['D','E'])
df
A B C D E
0 11 21 31 NaN NaN
1 12 22 32 NaN NaN
2 13 23 33 43.0 53.0
3 14 24 34 NaN NaN
4 15 25 35 NaN NaN
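Applied to the Dask case from the question, the same idea would look roughly like the sketch below. This is untested with Dask; "SAS url" is the placeholder from the question, and the 'extra1'/'extra2' names are assumptions, so add as many extras as the widest bad row needs.
import dask.dataframe as dd
import pandas as pd

# Read just the header row with pandas to capture the declared column names
save_cols = pd.read_csv("SAS url", nrows=1).columns.to_list()

# Re-read the full file, skipping the header row and supplying the padded name
# list, so rows with 77 or 78 fields land in the extra columns instead of failing
dataframe = dd.read_csv("SAS url",
                        delimiter=",",
                        encoding="UTF-8",
                        blocksize=25e6,
                        skiprows=1,
                        names=save_cols + ['extra1', 'extra2'])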

PANDAS dataframe concat and pivot data

I'm learning Python pandas and playing with some example data. I have a CSV file of a dataset with net worth by percentile of US population by quarter of year.
I've successfully subsetted the data by percentile to create three scatter plots of net worth by year, one plot for each of three population sections. However, I'm trying to combine those three subsets into one data frame so I can draw the lines on a single plot figure.
Data here:
https://www.federalreserve.gov/releases/z1/dataviz/download/dfa-income-levels.csv
Code thus far:
import pandas as pd
import matplotlib.pyplot as plt
# importing numpy as np
import numpy as np
df = pd.read_csv("dfa-income-levels.csv")
df99th = df.loc[df['Category']=="pct99to100"]
df99th.plot(x='Date',y='Net worth', title='Net worth by percentile')
dfmid = df.loc[df['Category']=="pct40to60"]
dfmid.plot(x='Date',y='Net worth')
dflow = df.loc[df['Category']=="pct00to20"]
dflow.plot(x='Date',y='Net worth')
data = dflow['Net worth'], dfmid['Net worth'], df99th['Net worth']
headers = ['low', 'mid', '99th']
newdf = pd.concat(data, axis=1, keys=headers)
And that yields the dataframe shown below, which is not what I want for plotting the data.
low mid 99th
0 NaN NaN 3514469.0
3 NaN 2503918.0 NaN
5 585550.0 NaN NaN
6 NaN NaN 3602196.0
9 NaN 2518238.0 NaN
... ... ... ...
747 NaN 8610343.0 NaN
749 3486198.0 NaN NaN
750 NaN NaN 32011671.0
753 NaN 8952933.0 NaN
755 3540306.0 NaN NaN
Any recommendations for other ways to approach this?
# Filter your dataframe to only the categories you're interested in
filtered_df = df[df['Category'].isin(['pct99to100', 'pct00to20', 'pct40to60'])]
filtered_df = filtered_df[['Date', 'Category', 'Net worth']]

fig, ax = plt.subplots()  # ax is an Axes object, allowing multiple plots on one axis
filtered_df.groupby('Category').plot(ax=ax)
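If the lines should be drawn against the Date column rather than the row index, a hedged variant (not tested against this dataset) is to pass x and y explicitly to each group's plot call:
filtered_df.groupby('Category').plot(x='Date', y='Net worth', ax=ax)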
I don't see the categories mentioned in your code in the csv file you shared. To concatenate dataframes along columns, you can use pd.concat with axis=1; it aligns rows by index. So first set the Date column as the index, then concat, and then bring Date back as a regular column (a minimal sketch follows these steps).
Set the Date column as the index of each dataframe: df1 = df1.set_index('Date') and df2 = df2.set_index('Date')
Concat the dataframes with df_merge = pd.concat([df1, df2], axis=1), or merge them with df_merge = pd.merge(df1, df2, on='Date')
Bring Date back into a column with df_merge = df_merge.reset_index()
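Put together with the subsets from the question, a minimal sketch (assuming the 'Date', 'Category' and 'Net worth' columns used above):
# One Series per population slice, aligned on Date
low = dflow.set_index('Date')['Net worth'].rename('low')
mid = dfmid.set_index('Date')['Net worth'].rename('mid')
top = df99th.set_index('Date')['Net worth'].rename('99th')

# Columns line up by Date, so no staggered NaN pattern
newdf = pd.concat([low, mid, top], axis=1).reset_index()

# All three lines on a single figure
newdf.plot(x='Date', y=['low', 'mid', '99th'], title='Net worth by percentile')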

How to read Excel in pandas keeping a mixed type column without NaN?

Here is the dataframe (stockdf), as it is in Excel:
timestamp dividend_amount split_coefficient
10-07-2020 0 NA
11-07-2020 NA 1
12-07-2020 0 1
When I try to read this into pandas using: pd.read_excel('file.xlsx', index_col=0)
I get
timestamp dividend_amount split_coefficient
10-07-2020 0 NaN
11-07-2020 NaN 1
12-07-2020 0 1
I understand the issue here, so I tried:
pd.read_excel('file.xlsx', index_col=0, converters={'dividend_amount': str})
A bit of reading told me that converters only converts the column after the data has been loaded.
So I tried:
pd.read_excel('file.xlsx', index_col=0, dtype={'dividend_amount': str})
Still the same result.
If you don't want the default NA values to be converted on read, you can turn that off, e.g.:
df = pd.read_excel('your_file', keep_default_na=False)
If instead you want to drop the rows that contain NaN, use:
df = pd.read_excel('file.xlsx', index_col=0).dropna()

Pandas drop row when parse_dates fails

I came across a problem I thought the smart people at Pandas would've already solved, but I can't seem to find anything, so here I am.
The problem I'm having originates from some bad data, that I expected pandas would be able to filter on reading.
The data looks like this:
Station;Datum;Zeit;Lufttemperatur;Relative Feuchte;Wettersymbol;Windgeschwindigkeit;Windrichtung
9;12.11.2016;08:04;-1.81;86;;;
9;12.11.2016;08:19;-1.66;85.5;;;
9;²;08:34;-1.71;85.6;;;
9;12.11.2016;08:49;-1.91;87.7;;;
9;12.11.2016;09:04;-1.66;86.6;;;
(This uses the ISO-8859-1 character set; it looks different in UTF-8, etc.) I want to read the second column as dates, so naturally I used
data = pandas.read_csv(file, sep=";", encoding="ISO-8859-1", parse_dates=["Datum"],
                       date_parser=lambda x: pandas.to_datetime(x, format="%d.%m.%Y"))
which gave
ValueError: time data '²' does not match format '%d.%m.%Y' (match)
Although pandas.read_csv has an error_bad_lines parameter that looks like it would help my case, all it actually does is filter out lines that do not have the correct number of columns. I can filter out this particular line in many different ways, but as far as I know they all require loading all the data first, filtering out the rows, and only then converting the column to datetime objects; I'd rather do it while reading in the file. It seems possible, since when I leave out the date_parser the file is parsed successfully and the strange character is simply left as it is (although that might cause issues with datetime operations later on).
Is there a way for pandas to filter out rows it can't use the date_parser on while reading the file instead of during post-processing?
You want to use the errors parameter of pandas.to_datetime:
date_parser=lambda x: pd.to_datetime(x, errors="coerce")
file = "file.csv"
data = pd.read_csv(
file, sep=";", encoding="ISO-8859-1", parse_dates=["Datum"],
date_parser=lambda x: pd.to_datetime(x, errors="coerce")
)
data
Station Datum Zeit Lufttemperatur Relative Feuchte Wettersymbol Windgeschwindigkeit Windrichtung
0 9 2016-12-11 08:04 -1.81 86.0 NaN NaN NaN
1 9 2016-12-11 08:19 -1.66 85.5 NaN NaN NaN
2 9 NaT 08:34 -1.71 85.6 NaN NaN NaN
3 9 2016-12-11 08:49 -1.91 87.7 NaN NaN NaN
4 9 2016-12-11 09:04 -1.66 86.6 NaN NaN NaN
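Since the goal is to drop those rows rather than keep them as NaT, you can follow up by removing them after the read (a small addition on top of the answer above):
# Drop the rows where the date could not be parsed (coerced to NaT)
data = data.dropna(subset=["Datum"])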

pandas read_excel: nan values forcing others in the same column to be converted to float

Let's say I have the following Excel file to be read:
What I want is a simple solution (preferably a one-liner) that can read the Excel file so that the dates are converted to str (or at least int), and the blank values become nan or nat or whatever can be detected by pd.isnull.
If I use df = pd.read_excel(file_path), what I get is
df
Out[8]:
001002.XY 600123.AB 123456.YZ 555555.GO
ipo_date 20100203.0 20150605 NaN 20090501.0
delist_date NaN 20170801 NaN NaN
So pandas recognised blank cells as NaN, which is fine, but the pet peeve is that all the other values are forced to float64, even though they are intended to be just str or int. (Edit: it seems that if a column, e.g. column [1], has no NaNs, its values won't be forced to float. In my case, however, most columns have delist_date blank, since most stocks have an IPO date but are not delisted yet.)
From what I know, I tried the dtype=str keyword argument, and it gives me
df
Out[10]:
001002.XY 600123.AB 123456.YZ 555555.GO
ipo_date 20100203 20150605 nan 20090501
delist_date nan 20170801 nan nan
Looks good? True, the dates are now str, but the ridiculous thing is that the NaNs have now become literal strings! E.g.
df.iloc[1, 0]
Out[12]:
'nan'
which would force me to add something clunky like df.replace later on.
I didn't try converters because that would require specifying the dtype column by column, and the actual Excel file I'm working with is a very wide spreadsheet (roughly 3k columns). I don't want to transpose the spreadsheet in Excel itself either.
Could anybody help? Thanks in advance.
Use dtype=object as the parameter.
Great explanation here: pandas distinction between str and object types
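As a quick sketch of what that looks like (assuming the same layout as in the question; the file name is a placeholder):
import pandas as pd

# dtype=object keeps each cell as a Python object (int/str as stored in Excel),
# while genuinely blank cells still come through as NaN, detectable by pd.isnull
df = pd.read_excel('file.xlsx', index_col=0, dtype=object)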
