Pandas drop row when parse_dates fails - python

I came across a problem I thought the smart people at Pandas would've already solved, but I can't seem to find anything, so here I am.
The problem I'm having originates from some bad data that I expected pandas would be able to filter out while reading.
The data looks like this:
Station;Datum;Zeit;Lufttemperatur;Relative Feuchte;Wettersymbol;Windgeschwindigkeit;Windrichtung
9;12.11.2016;08:04;-1.81;86;;;
9;12.11.2016;08:19;-1.66;85.5;;;
9;²;08:34;-1.71;85.6;;;
9;12.11.2016;08:49;-1.91;87.7;;;
9;12.11.2016;09:04;-1.66;86.6;;;
(This is using the ISO-8859-1 character set; it looks different in UTF-8 etc.) I want to read the second column as dates, so naturally I used
data = pandas.read_csv(file, sep=";", encoding="ISO-8859-1", parse_dates=["Datum"],
                       date_parser=lambda x: pandas.to_datetime(x, format="%d.%m.%Y"))
which gave
ValueError: time data '²' does not match format '%d.%m.%Y' (match)
Although pandas.read_csv has an input parameter error_bad_lines which looks like it would help my case, it appears all it does is filter out lines that do not have the correct number of columns. I can filter out this particular line in many different ways, but as far as I know all of them require first loading all the data, filtering out the rows and then converting the column to datetime objects; I'd rather do it while reading in the file. It seems to be possible, since when I leave out the date_parser the file gets parsed successfully and the strange character is just left as it is (although that might cause issues when doing datetime operations later on).
Is there a way for pandas to filter out rows it can't use the date_parser on while reading the file instead of during post-processing?

You want to use the errors parameter in pandas.to_datetime
date_parser=lambda x: pd.to_datetime(x, errors="coerce")
file = "file.csv"
data = pd.read_csv(
file, sep=";", encoding="ISO-8859-1", parse_dates=["Datum"],
date_parser=lambda x: pd.to_datetime(x, errors="coerce")
)
data
Station Datum Zeit Lufttemperatur Relative Feuchte Wettersymbol Windgeschwindigkeit Windrichtung
0 9 2016-12-11 08:04 -1.81 86.0 NaN NaN NaN
1 9 2016-12-11 08:19 -1.66 85.5 NaN NaN NaN
2 9 NaT 08:34 -1.71 85.6 NaN NaN NaN
3 9 2016-12-11 08:49 -1.91 87.7 NaN NaN NaN
4 9 2016-12-11 09:04 -1.66 86.6 NaN NaN NaN
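Since the goal was to drop the unusable rows rather than keep them as NaT, you can follow up with dropna on the parsed column. Note also that without an explicit format, to_datetime guesses month-first here (2016-12-11 instead of 2016-11-12), so a sketch that keeps the original day-first format and then drops the bad rows (assuming the same file.csv; date_parser is deprecated in newer pandas but matches the answer above) could look like:
import pandas as pd

data = pd.read_csv(
    "file.csv", sep=";", encoding="ISO-8859-1", parse_dates=["Datum"],
    # keep the day-first format and coerce unparseable values to NaT
    date_parser=lambda x: pd.to_datetime(x, format="%d.%m.%Y", errors="coerce"),
)
# drop the rows where the date could not be parsed
data = data.dropna(subset=["Datum"])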

Related

DataFrame from dicts with automatic date parsing

I am creating a Pandas DataFrame from sequence of dicts.
The dicts are large and somewhat heterogeneous.
Some of the fields are dates.
I would like to automatically detect and parse the date fields.
This can be achieved by
df0 = pd.DataFrame.from_dict(dicts)
df0.to_csv('tmp.csv', index=False)
df = pd.read_csv('tmp.csv', parse_dates=True)
I would like to find a more direct way to do this.
Use pd.to_datetime with errors='ignore'
Only use on columns of dtype == object using select_dtypes. This prevents converting numeric columns into nonsensical dates.
'ignore' abandons the conversion attempt if any errors are encountered.
combine_first is used instead of update because update keeps the initial dtypes. Since they were object, this would mess it all up.
df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore').combine_first(df)
date0 date1 feuxdate notadate
0 2019-01-01 NaT NaN NaN
1 NaT NaT 0.0 NaN
2 NaT NaT NaN hi
3 NaT 2019-02-01 NaN NaN
Could've also gotten tricky with it using assign to deal with dtypes
df.assign(**df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore'))
Setup
dicts = [
{'date0': '2019-01-01'},
{'feuxdate': 0},
{'notadate': 'hi'},
{'date1': '20190201'}
]
df = pd.DataFrame.from_dict(dicts)
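Putting the setup and the conversion together as one self-contained sketch (it should reproduce the frame shown above):
import pandas as pd

dicts = [
    {'date0': '2019-01-01'},
    {'feuxdate': 0},
    {'notadate': 'hi'},
    {'date1': '20190201'}
]
df = pd.DataFrame.from_dict(dicts)

# Try parsing only the object columns; anything that fails to parse is left
# as-is, then the converted columns are folded back into the original frame.
out = df.select_dtypes(include=object).apply(pd.to_datetime, errors='ignore').combine_first(df)
print(out)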

Reorganize Dataframe to Multi Index

I have the following dataframe after I appended the data from different sources of files:
Owed Due Date
Input NaN 51.83 08012019
Net NaN 35.91 08012019
Output NaN -49.02 08012019
Total -1.26 38.72 08012019
Input NaN 58.43 09012019
Net NaN 9.15 09012019
Output NaN -57.08 09012019
Total -3.48 10.50 09012019
Input NaN 66.50 10012019
Net NaN 9.64 10012019
Output NaN -64.70 10012019
Total -5.16 11.44 10012019
I have been trying to figure out how to reorganize this dataframe to become multi index like this:
I have tried to use melt and pivot, but with limited success in reshaping anything. I would appreciate some guidance!
P.S.: When using print(df), the date shows a two-digit day (e.g. 08). However, if I write this to a CSV file, a single-digit day becomes 8 instead of 08. Hope someone can guide me on this too, thanks.
Here you go:
df.set_index('Date', append=True).unstack(0).dropna(axis=1)
set_index() moves Date to become an additional index column. Then unstack(0) moves the original index to become column names. Finally, drop the NaN columns and you have your desired result.
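For reference, a minimal sketch that reconstructs two months of the sample frame and applies the one-liner (labels taken from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Owed": [np.nan, np.nan, np.nan, -1.26, np.nan, np.nan, np.nan, -3.48],
        "Due": [51.83, 35.91, -49.02, 38.72, 58.43, 9.15, -57.08, 10.50],
        "Date": ["08012019"] * 4 + ["09012019"] * 4,
    },
    index=["Input", "Net", "Output", "Total"] * 2,
)

# Date becomes part of the index, the original labels become a column level,
# and the columns containing NaN (Owed for Input/Net/Output) are dropped.
result = df.set_index("Date", append=True).unstack(0).dropna(axis=1)
print(result)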

Why is pandas data frame interpreting all data as NaN?

I am importing data from a csv file for use in a pandas data frame. My data file has 102 rows and 5 columns, and all of them are clearly labelled as 'Number' in Excel. My code is as follows:
import pandas as pd
data = pd.read_csv('uni.csv', header=None, names = ['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])
print data.head()
The output looks like this:
TopThird Oxbridge Russell Other Low
0 14\t1\t12\t35\t1 NaN NaN NaN NaN
1 14\t1\t12\t32\t0 NaN NaN NaN NaN
2 16\t0\t13\t33\t0 NaN NaN NaN NaN
3 10\t0\t9\t44\t1 NaN NaN NaN NaN
4 18\t1\t13\t28\t1 NaN NaN NaN NaN
And this continues to the bottom of the data frame. I have attempted to change the cell type in Excel to 'General' or use decimal points on the 'Number' type, but this has not changed anything.
Why is this happening? How can it be prevented?
It seems like your file is a file of tab separated values. You'll need to explicitly let read_csv know that it is dealing with whitespace characters as delimiters.
In most cases, passing sep='\t' should work.
df = pd.read_csv('uni.csv',
                 sep='\t',
                 header=None,
                 names=['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])
In some cases, however, columns are not perfectly tab separated. Assuming you have a TSV of numbers, it should be alright to use delim_whitespace=True -
df = pd.read_csv('uni.csv',
                 delim_whitespace=True,
                 header=None,
                 names=['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])
This is equivalent to sep='\s+' and is a little more generalised, so use it with caution. On the upside, if your columns have stray whitespace, this should take care of it automatically.
As mentioned by @Vaishali, there's an alternative function, pd.read_table, that is useful for TSV files and will work with the same arguments that you passed to read_csv -
df = pd.read_table('uni.csv', header=None, names=[...])
Looks like tab delimited data. Try sep='\t'
data = pd.read_csv('uni.csv', sep='\t', header=None, names = ['TopThird', 'Oxbridge', 'Russell', 'Other', 'Low'])
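If you are not sure what the delimiter actually is, one quick check (a sketch) is to look at a raw line before parsing; tab-separated data will show explicit \t characters:
# print the first raw line of the file to inspect the delimiter
with open('uni.csv') as f:
    print(repr(f.readline()))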

pandas' fillna not working on resampled pivot table

I am working on jupyter lab with pandas, version 0.20.1. I have a pivot table with a DatetimeIndex such as
In [1]:
pivot = df.pivot_table(index='Date', columns=['State'], values='B',
                       fill_value=0, aggfunc='count')
pivot
Out [1]:
State SAFE UNSAFE
Date
2017-11-18 1 0
2017-11-22 57 42
2017-11-23 155 223
The table counts all occurrences of events on a specific date, which can be either SAFE or UNSAFE. I need to resample the resulting table and sum the results.
Resampling the table with a daily frequency introduces NaNs on the days without data. Surprisingly, I cannot impute those NaNs with pandas' fillna().
In [2]:
pivot = pivot.resample('D').sum().fillna(0.)
pivot
Out [2]:
State SAFE UNSAFE
Date
2017-11-18 1.0 0.0
2017-11-19 NaN NaN
2017-11-20 NaN NaN
2017-11-21 NaN NaN
2017-11-22 57.0 42.0
2017-11-23 155.0 223.0
Can anyone explain why this happens and how I can get rid of those NaNs? I could do something along the lines of
for col in ['SAFE', 'UNSAFE']:
    pivot.loc[pivot[col].isnull(), col] = 0
However, that looks rather ugly, plus I'd like to understand why the first approach is not working.
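One alternative worth trying (a sketch, not from the original thread): since the pivot already has one row per day that had data, you can expand it to a complete daily index with asfreq and fill the newly inserted days with 0, instead of resampling and summing:
# insert the missing calendar days and fill them with 0
# (asfreq's fill_value only affects the newly created rows)
pivot = pivot.asfreq('D', fill_value=0)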

Adjusting Monthly Time Series Data in Pandas

I have a pandas DataFrame like this.
As you can see, the data corresponds to end-of-month data. The problem is that the end-of-month date is not the same for all the columns. (The underlying reason is that the last trading day of the month does not always coincide with the end of the month.)
Currently, the end of January 2016 has two rows, "2016-01-29" and "2016-01-31". It should be just one row. For example, the end of January 2016 should just be 451.1473 1951.218 1401.093 for Index A, Index B and Index C.
Another point is that even though each row almost always corresponds to end-of-month data, the data might not be nice enough and can conceivably include middle-of-the-month data for a random column. In that case, I don't want to make any adjustment, so that any prior data collection error would be caught.
What is the most efficient way to achieve this goal?
EDIT:
Index A Index B Index C
DATE
2015-03-31 2067.89 1535.07 229.1
2015-04-30 2085.51 1543 229.4
2015-05-29 2107.39 NaN NaN
2015-05-31 NaN 1550.39 229.1
2015-06-30 2063.11 1534.96 229
2015-07-31 2103.84 NaN 228.8
2015-08-31 1972.18 1464.32 NaN
2015-09-30 1920.03 1416.84 227.5
2015-10-30 2079.36 NaN NaN
2015-10-31 NaN 1448.39 227.7
2015-11-30 2080.41 1421.6 227.6
2015-12-31 2043.94 1408.33 227.5
2016-01-29 1940.24 NaN NaN
2016-01-31 NaN 1354.66 227.5
2016-02-29 1932.23 1355.42 227.3
So, in this case, I need to combine the rows at the end of 2015-05, 2015-10 and 2016-01. However, the rows for 2015-07 and 2015-08 simply do not have data. So, in this case, I would like to leave 2015-07 and 2015-08 as NaN, while I'd like to merge the end-of-month rows for 2015-05, 2015-10 and 2016-01. Hopefully, this provides more insight into what I am trying to do.
You can use:
df = df.groupby(pd.TimeGrouper('M')).fillna(method='ffill')
df = df.resample(rule='M', how='last')
to create a new DatetimeIndex ending on the last day of each month and sample the last available data point for each month. fillna() ensures that, for columns with missing data on the last available date, you use the prior available value.
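Note that pd.TimeGrouper and the how= argument to resample have since been removed from pandas; a roughly equivalent sketch with the current API (reconstructing just the two May 2015 rows from the question) would be:
import pandas as pd

df = pd.DataFrame(
    {"Index A": [2107.39, None], "Index B": [None, 1550.39], "Index C": [None, 229.1]},
    index=pd.to_datetime(["2015-05-29", "2015-05-31"]),
)

# forward-fill within each calendar month, then keep the last row of each month
# (in pandas >= 2.2 the month-end alias is 'ME' instead of 'M')
monthly = df.groupby(pd.Grouper(freq="M")).ffill().resample("M").last()
print(monthly)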
