pandas possible bug with groupby and resample - python

I am a newbie in pandas and am seeking advice on whether this is a possible bug.
I have a dataframe with a non-unique datetime index. col1 is a group variable, col2 holds values.
I want to resample the hourly values to yearly values, grouping by the group variable. I do this with this command:
df_resample = df.groupby('col1').resample('Y').mean()
This works fine and creates a MultiIndex of col1 and the datetime index, where col1 is now NOT a column in the dataframe.
However, if I change mean() to max(), this is not the case. Then col1 is part of the MultiIndex, but the column is still present in the dataframe.
Isn't this a bug?
Sorry, but I don't know how to present dummy data as a dataframe in this post.
Edit:
code example:
from datetime import datetime, timedelta
import pandas as pd
data = {'category':['A', 'B', 'C'],
'value_hour':[1,2,3]}
days = pd.date_range(datetime.now(), datetime.now() + timedelta(2), freq='D')
df = pd.DataFrame(data, index=days)
df_mean = df.groupby('category').resample('Y').mean()
df_max = df.groupby('category').resample('Y').max()
print(df_mean, df_max)
                     value_hour
category
A        2021-12-31         1.0
B        2021-12-31         2.0
C        2021-12-31         3.0
                    category  value_hour
category
A        2021-12-31        A           1
B        2021-12-31        B           2
C        2021-12-31        C           3
Trying to drop the category column from df_max gives a KeyError:
df_max.drop('category')
File "C:\Users\mav\Anaconda3\envs\EWDpy\lib\site-packages\pandas\core\indexes\base.py", line 3363, in get_loc
raise KeyError(key) from err
KeyError: 'category'

Concerning the KeyError: the problem is that you are trying to drop the "category" row instead of the column.
When using drop to drop a column, you should add axis=1, as in the following code:
df_max.drop('category', axis=1)
axis=1 indicates you are looking at the columns
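As a side note on the original mean() vs max() discrepancy: one way to sidestep it entirely (a minimal sketch based on the example above, not an explanation of the underlying behaviour) is to select the value column before aggregating, so the grouping column never shows up in the result:
from datetime import datetime, timedelta
import pandas as pd
data = {'category': ['A', 'B', 'C'], 'value_hour': [1, 2, 3]}
days = pd.date_range(datetime.now(), datetime.now() + timedelta(2), freq='D')
df = pd.DataFrame(data, index=days)
# Selecting the value column first keeps 'category' out of the output,
# so mean() and max() return results of the same shape.
df_mean = df.groupby('category')['value_hour'].resample('Y').mean()
df_max = df.groupby('category')['value_hour'].resample('Y').max()
# Alternatively, drop the duplicated column from the max() result:
df_max_alt = df.groupby('category').resample('Y').max().drop('category', axis=1)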

Related

Assign counts from .count() to a dataframe + column names - pandas python

Hoping someone can help me here - I believe I am close to the solution.
I have a dataframe on which I am using .count() in order to return a series of all column names of my dataframe and each of their respective non-NaN value counts.
Example dataframe:
feature_1  feature_2
1          1
2          NaN
3          2
4          NaN
5          3
The result of .count() here is a series that looks like:
feature_1 5
feature_2 3
I am now trying to get this data into a dataframe, with the column names "Feature" and "Count". To have the expected output look like this:
Feature    Count
feature_1  5
feature_2  3
I am using .to_frame() to push the series to a dataframe in order to add column names. Full code:
df = data.count()
df = df.to_frame()
df.columns = ['Feature', 'Count']
However, I am receiving this error message - "ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements" - as though it is not recognising the actual column names (Feature) as a column with values.
How can I get it to recognise both the Feature and Count columns so I can assign names to them?
Use Series.reset_index instead of Series.to_frame for a 2-column DataFrame - the first column comes from the index, the second from the values of the Series:
df = data.count().reset_index()
df.columns = ['Feature', 'Count']
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
Another solution uses the name parameter with Series.rename_axis, or DataFrame.set_axis:
df = data.count().rename_axis('Feature').reset_index(name='Count')
#alternative
df = data.count().reset_index().set_axis(['Feature', 'Count'], axis=1)
print (df)
Feature Count
0 feature_1 5
1 feature_2 3
This happens because your new dataframe has only one column (the column names are taken as the series index, which is then turned into the dataframe index by to_frame()). In order to assign a 2-element list to df.columns you have to reset the index first:
df = data.count()
df = df.to_frame().reset_index()
df.columns = ['Feature', 'Count']
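For completeness, a self-contained sketch (using the example data from the question) of the reset_index approach end to end:
import numpy as np
import pandas as pd
# Example data from the question: feature_2 has two missing values.
data = pd.DataFrame({'feature_1': [1, 2, 3, 4, 5],
                     'feature_2': [1, np.nan, 2, np.nan, 3]})
counts = data.count().rename_axis('Feature').reset_index(name='Count')
print(counts)
#      Feature  Count
# 0  feature_1      5
# 1  feature_2      3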

Add missing dates to datetime column in Pandas using last value

I've already checked out Add missing dates to pandas dataframe, but I don't want to fill in the new dates with a generic value.
My dataframe looks more or less like this:
date (dd/mm/yyyy)  value
01/01/2000         a
02/01/2000         b
03/01/2000         c
06/01/2000         d
So in this example, days 04/01/2000 and 05/01/2000 are missing. What I want to do is to insert them before the 6th, with a value of c, the last value before the missing days. So the "correct" df should look like:
date (dd/mm/yyyy)  value
01/01/2000         a
02/01/2000         b
03/01/2000         c
04/01/2000         c
05/01/2000         c
06/01/2000         d
There are multiple instances of missing dates, and it's a large df (~9000 rows).
Thanks for your time! :)
try this:
# If your date format is dayfirst, then use the following code
df['date (dd/mm/yyyy)'] = pd.to_datetime(df['date (dd/mm/yyyy)'], dayfirst=True)
out = df.set_index('date (dd/mm/yyyy)').asfreq('D', method='ffill').reset_index()
print(out)
Assuming that your dates occur at a regular frequency, you can generate a pd.DatetimeIndex with date_range, filter out the dates that are already in your date column, create a dataframe with NaN in the value column to concatenate, and fill the NaNs using the back- or forward-fill method.
import numpy as np
import pandas as pd
# assuming your dataframe is df:
all_dates = pd.date_range(start=df.date.min(), end=df.date.max(), freq='D')
known_dates = set(df.date.to_list())  # set is blazing fast on `in` compared with a list.
unknown_dates = all_dates[~all_dates.isin(known_dates)]
df2 = pd.DataFrame({'date': unknown_dates})
df2['value'] = np.nan
df = pd.concat([df, df2])
df = df.sort_values('date').fillna(method='ffill')
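Put together as a runnable sketch on the question's data (the date column is shortened to 'date' here for brevity), the asfreq approach from the first answer looks like this:
import pandas as pd
df = pd.DataFrame({'date': ['01/01/2000', '02/01/2000', '03/01/2000', '06/01/2000'],
                   'value': ['a', 'b', 'c', 'd']})
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# Reindex to a daily frequency and carry the last known value forward.
out = df.set_index('date').asfreq('D', method='ffill').reset_index()
print(out)
#         date value
# 0 2000-01-01     a
# 1 2000-01-02     b
# 2 2000-01-03     c
# 3 2000-01-04     c
# 4 2000-01-05     c
# 5 2000-01-06     d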

Python - fill NaN by range of date

I have a dataframe where:
one of the columns is a Date column.
another column is the X column, which has missing values.
I want to fill column X for a specific range of dates.
So far I got to this code:
df[df['Date'] < datetime.date(2017,1,1)]['X'].fillna(1,inplace=True)
But it does not work: I am not getting an error, but the data isn't filled.
Another point: it looks messy; maybe there is a better way.
Thanks for the help.
First, you need to create your data frame:
import pandas as pd
df = pd.DataFrame({'Date': ['2016-01-01', '2018-01-01']})
df['Date'] = pd.to_datetime(df['Date'])
Next, you can conditionally set the column value:
df.loc[df['Date'] < '2017-01-01','X'] = 1
The result would be like this:
Date X
0 2016-01-01 1.0
1 2018-01-01 NaN
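Note that the .loc assignment above overwrites every X value in the date range, whether it was missing or not. If the intent is closer to the original fillna, i.e. fill only the missing values, one option (a sketch with made-up sample data) is to combine the date condition with isna():
import numpy as np
import pandas as pd
df = pd.DataFrame({'Date': pd.to_datetime(['2016-01-01', '2016-06-01', '2018-01-01']),
                   'X': [5.0, np.nan, np.nan]})
# Fill only the missing X values inside the date range; the existing 5.0 is left untouched.
mask = (df['Date'] < '2017-01-01') & (df['X'].isna())
df.loc[mask, 'X'] = 1
print(df)
#         Date    X
# 0 2016-01-01  5.0
# 1 2016-06-01  1.0
# 2 2018-01-01  NaN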

fillna() for Multi-Index Pandas DataFrame

I have a multi-index Pandas dataframe and I want to use ffill() to fill any NaNs in certain columns. The following code shows the structure of the sample dataframe and the result of ffill().
import numpy as np
import pandas as pd

room = ['A', 'B']
val = range(3)
df = pd.DataFrame(columns=pd.MultiIndex.from_product([room, val]), data=np.random.randn(3, 6))
df.loc[1, ('B', 0)] = np.nan
# print(df.loc[1, ('B', 0)])
display(df)
df = df.ffill(axis=1)
display(df)
What I was hoping to get is that the NaN at [1,('B',0)] is replaced with -0.392674 and not with -1.349675.
Generally, I want to be able to ffill() from the corresponding columns from level 1 ([0,1,2]).
How do I achieve this?
I think you are looking for groupby fillna
df=df.groupby(level=1,axis=1).fillna(method='ffill')
df
Out[496]:
A B
0 1 2 0 1 2
0 -0.177358 -1.531091 -0.945004 1.665143 0.602459 -0.008192
1 -0.006995 0.472267 -0.859471 -0.006995 -0.601538 -0.410391
2 0.101494 1.031941 0.499288 0.804391 -0.224750 -0.778403
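Note that grouping on axis=1 has been deprecated in recent pandas versions. If that applies to your version, an equivalent sketch of the same idea is to transpose, group on the second index level, forward-fill within each group, and transpose back:
# Same logic as above, expressed without axis=1 in groupby.
df = df.T.groupby(level=1).ffill().T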

How to combine groupby and sort in pandas

I am trying to get one result per 'name' with all of the latest data, unless the column is blank. In R I would have used group_by, sorted by timestamp and selected the latest value for each column. I tried to do that here and got very confused. Can someone explain how to do this in Python? In the example below my goal is:
  col2                 date name
1    4  2018-03-27 15:55:29  bil    # latest timestamp with the latest non-blank col2 value
Here's my code so far:
d = {'name':['bil','bil','bil'],'date': ['2018-02-27 14:55:29', '2018-03-27 15:55:29', '2018-02-28 19:55:29'], 'col2': [3,'', 4]}
df2 = pd.DataFrame(data=d)
print(df2)
grouped = df2.groupby(['name']).sum().reset_index()
print(grouped)
sortedvals=grouped.sort_values(['date'], ascending=False)
print(sortedvals)
Here's one way:
df3 = df2[df2['col2'] != ''].sort_values('date', ascending=False).drop_duplicates('name')
# col2 date name
# 2 4 2018-02-28 19:55:29 bil
However, the dataframe you provided and output you desire seem to be inconsistent.
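If the actual goal is the latest non-blank value per column for each name (as the comment in the question suggests), a sketch on the question's data is to treat blanks as missing, sort by time, and take the last non-null entry per column with groupby().last():
import numpy as np
import pandas as pd
d = {'name': ['bil', 'bil', 'bil'],
     'date': ['2018-02-27 14:55:29', '2018-03-27 15:55:29', '2018-02-28 19:55:29'],
     'col2': [3, '', 4]}
df2 = pd.DataFrame(data=d)
# Blanks become NaN, rows are ordered by time, and last() picks the
# last non-null value per column within each name.
latest = df2.replace('', np.nan).sort_values('date').groupby('name').last()
print(latest)
#                     date col2
# name
# bil  2018-03-27 15:55:29    4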
