Trying to apply the method from here to a multi-index dataframe doesn't seem to work.
Take a data frame:
import pandas as pd
import numpy as np
dates = pd.date_range('20070101', periods=3200)
df = pd.DataFrame(data=np.random.randint(0, 100, (3200, 1)), columns=list('A'))
df.loc[5:13, 'A'] = np.nan  # add missing data points
df['date'] = dates
df = df[['date','A']]
Define a season function based on the 'date' column
def get_season(row):
    if row['date'].month >= 3 and row['date'].month <= 5:
        return '2'
    elif row['date'].month >= 6 and row['date'].month <= 8:
        return '3'
    elif row['date'].month >= 9 and row['date'].month <= 11:
        return '4'
    else:
        return '1'
Apply the function
df['Season'] = df.apply(get_season, axis=1)
Create a 'Year' column for indexing
df['Year'] = df['date'].dt.year
Multi-index by Year and Season
df = df.set_index(['Year', 'Season'], inplace=False)
Count datapoints in each season
count = df.groupby(level=[0, 1]).count()
Drop the seasons with fewer than 75 days in them
count = count.drop(count[count.A < 75].index)
Create a variable for seasons with at least 75 days
complete = count[count['A'] >= 75].index
Using the isin function turns up False for everything, while I want it to select all the seasons that have at least 75 days of valid data in 'A':
df = df.isin(complete)
df
Every value comes up False, and I can't see why.
I hope this is concise enough; I need this to work on a multi-index using seasons, so I included that part!
EDIT
Another method from here, based on multi-index reindexing, also doesn't work (it produces a blank dataframe):
df3 = df.reset_index().groupby('Year').apply(lambda x: x.set_index('Season').reindex(count,method='pad'))
EDIT 2
Also tried this
seasons = count[count['A'] >= 75].index
df = df[df['A'].isin(seasons)]
Again, blank output.
I think you can use Index.isin:
complete = count[count['A'] >= 75].index
idx = df.index.isin(complete)
print(idx)
[ True True True ..., False False False]
print(df[idx])
date A
Year Season
2007 1 2007-01-01 24.0
1 2007-01-02 92.0
1 2007-01-03 54.0
1 2007-01-04 91.0
1 2007-01-05 91.0
1 2007-01-06 NaN
1 2007-01-07 NaN
1 2007-01-08 NaN
1 2007-01-09 NaN
1 2007-01-10 NaN
1 2007-01-11 NaN
1 2007-01-12 NaN
1 2007-01-13 NaN
1 2007-01-14 NaN
1 2007-01-15 18.0
1 2007-01-16 82.0
1 2007-01-17 55.0
1 2007-01-18 64.0
1 2007-01-19 89.0
1 2007-01-20 37.0
1 2007-01-21 45.0
1 2007-01-22 4.0
1 2007-01-23 34.0
1 2007-01-24 35.0
1 2007-01-25 90.0
1 2007-01-26 17.0
1 2007-01-27 29.0
1 2007-01-28 58.0
1 2007-01-29 7.0
1 2007-01-30 57.0
... ... ...
2015 3 2015-08-02 42.0
3 2015-08-03 0.0
3 2015-08-04 31.0
3 2015-08-05 39.0
3 2015-08-06 25.0
3 2015-08-07 1.0
3 2015-08-08 7.0
3 2015-08-09 97.0
3 2015-08-10 38.0
3 2015-08-11 59.0
3 2015-08-12 28.0
3 2015-08-13 84.0
3 2015-08-14 43.0
3 2015-08-15 63.0
3 2015-08-16 68.0
3 2015-08-17 0.0
3 2015-08-18 19.0
3 2015-08-19 61.0
3 2015-08-20 11.0
3 2015-08-21 84.0
3 2015-08-22 75.0
3 2015-08-23 37.0
3 2015-08-24 40.0
3 2015-08-25 66.0
3 2015-08-26 50.0
3 2015-08-27 74.0
3 2015-08-28 37.0
3 2015-08-29 19.0
3 2015-08-30 25.0
3 2015-08-31 15.0
[3106 rows x 2 columns]
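As a usage note, the whole filter can be collapsed into one step; a minimal sketch, assuming the count frame built in the question (df_complete is just an illustrative name):
# keep only (Year, Season) groups with at least 75 valid 'A' readings
df_complete = df[df.index.isin(count[count['A'] >= 75].index)]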
I want to create a new column in my dataframe with the value of another row.
DataFrame
TimeStamp Event Value
0 1603822620000 1 102.0
1 1603822680000 1 108.0
2 1603822740000 1 107.0
3 1603822800000 2 1
4 1603823040000 1 106.0
5 1603823100000 2 0
6 1603823160000 2 1
7 1603823220000 1 105.0
I would like to add a new column with the previous value where event = 1.
TimeStamp Event Value PrevValue
0 1603822620000 1 102.0 NaN
1 1603822680000 1 108.0 102.0
2 1603822740000 1 107.0 108.0
3 1603822800000 2 1 107.0
4 1603823040000 1 106.0 107.0
5 1603823100000 2 0 106.0
6 1603823160000 2 1 106.0
7 1603823220000 1 105.0 106.0
So I can't simply use shift(1), and groupby('Event').shift(1) doesn't work either.
Current solution
df["PrevValue"] =df.timestamp.apply(lambda ts: (df[(df.Event == 1) & (df.timestamp < ts)].iloc[-1].value))
But I guess, that's not the best solution.
Is there something like shiftUntilCondition(condition)?
Thanks a lot!
Try with
df['new'] = df['Value'].where(df['Event']==1).ffill().shift()
Out[83]:
0 NaN
1 102.0
2 108.0
3 107.0
4 107.0
5 106.0
6 106.0
7 106.0
Name: Value, dtype: float64
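For reference, a minimal self-contained sketch reproducing this on the question's data (column names taken from the question):
import pandas as pd

df = pd.DataFrame({
    'TimeStamp': [1603822620000, 1603822680000, 1603822740000, 1603822800000,
                  1603823040000, 1603823100000, 1603823160000, 1603823220000],
    'Event': [1, 1, 1, 2, 1, 2, 2, 1],
    'Value': [102.0, 108.0, 107.0, 1.0, 106.0, 0.0, 1.0, 105.0],
})

# where() keeps Value only on Event == 1 rows (others become NaN),
# ffill() carries the last such value forward,
# shift() moves it down one row so each row sees the previous value.
df['PrevValue'] = df['Value'].where(df['Event'] == 1).ffill().shift()
print(df)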
I've seen solutions in different languages (e.g. SQL, Fortran, or C++) that mainly use for loops.
I am hoping that someone can help me solve this task using pandas instead.
Suppose I have a data frame that looks like this:
date        pcp    sum_count   sumcum
7/13/2013   0.1    3.0         48.7
7/14/2013   48.5
7/15/2013   0.1
7/16/2013
8/1/2013    1.5    1.0         1.5
8/2/2013
8/3/2013
8/4/2013    0.1    2.0         3.6
8/5/2013    3.5
9/22/2013   0.3    3.0         26.3
9/23/2013   14.0
9/24/2013   12.0
9/25/2013
9/26/2013
10/1/2014   0.1    11.0
10/2/2014   96.0               135.5
10/3/2014   2.5
10/4/2014   37.0
10/5/2014   9.5
10/6/2014   26.5
10/7/2014   0.5
10/8/2014   25.5
10/9/2014   2.0
10/10/2014  5.5
10/11/2014  5.5
And I was hoping I could do the following:
STEP 1: create the sum_count column by counting consecutive non-missing values in the 'pcp' column.
STEP 2: create the sumcum column with the sum of each consecutive run of 'pcp'.
STEP 3: create a pivot table that will look like this:
year max_sum_count
2013 48.7
2014 135.5
BUT!! the max_sum_count is based on the condition when sum_count = 3
I'd appreciate any help! thank you!
UPDATED QUESTION:
I previously emphasized that sum_count should only return the maximum of 3 consecutive pcps, but I mistakenly gave the wrong data frame and had to edit it. Sorry.
The sumcum of 135.5 came from 96.0 + 2.5 + 37.0; it is the maximum sum of 3 consecutive pcps within the sum_count 11 run.
Thank you
Use:
#filtering + rolling by days
N = 3
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
#mask of missing values
m = df['pcp'].isna()
#label groups of consecutive non-NaNs
df['g'] = m.cumsum()[~m]
#extract years
df['year'] = df.index.year
#drop the NaN rows
df = df[~m].copy()
#keep only groups with at least N rows
df['sum_count1'] = df.groupby(['g','year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()
#rolling N-day sum within each group
df['sumcum1'] = (df.groupby(['g','year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))
#yearly maximum of the rolling sums, with missing years added
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print (df1)
year max_sum_count
0 2013 48.7
1 2014 135.5
First, convert date to a real datetime dtype and create a boolean mask which keeps rows where pcp is not null. Then you can create groups and compute your variables:
Input data:
>>> df
date pcp
0 7/13/2013 0.1
1 7/14/2013 48.5
2 7/15/2013 0.1
3 7/16/2013 NaN
4 8/1/2013 1.5
5 8/2/2013 NaN
6 8/3/2013 NaN
7 8/4/2013 0.1
8 8/5/2013 3.5
9 9/22/2013 0.3
10 9/23/2013 14.0
11 9/24/2013 12.0
12 9/25/2013 NaN
13 9/26/2013 NaN
14 10/1/2014 0.1
15 10/2/2014 96.0
16 10/3/2014 2.5
17 10/4/2014 37.0
18 10/5/2014 9.5
19 10/6/2014 26.5
20 10/7/2014 0.5
21 10/8/2014 25.5
22 10/9/2014 2.0
23 10/10/2014 5.5
24 10/11/2014 5.5
Code:
df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()

df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))

pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()
Output results:
>>> df
date pcp sum_count sumcum
0 2013-07-13 0.1 3.0 48.7
1 2013-07-14 48.5 NaN NaN
2 2013-07-15 0.1 NaN NaN
3 2013-07-16 NaN NaN NaN
4 2013-08-01 1.5 1.0 1.5
5 2013-08-02 NaN NaN NaN
6 2013-08-03 NaN NaN NaN
7 2013-08-04 0.1 2.0 3.6
8 2013-08-05 3.5 NaN NaN
9 2013-09-22 0.3 3.0 26.3
10 2013-09-23 14.0 NaN NaN
11 2013-09-24 12.0 NaN NaN
12 2013-09-25 NaN NaN NaN
13 2013-09-26 NaN NaN NaN
14 2014-10-01 0.1 11.0 210.6
15 2014-10-02 96.0 NaN NaN
16 2014-10-03 2.5 NaN NaN
17 2014-10-04 37.0 NaN NaN
18 2014-10-05 9.5 NaN NaN
19 2014-10-06 26.5 NaN NaN
20 2014-10-07 0.5 NaN NaN
21 2014-10-08 25.5 NaN NaN
22 2014-10-09 2.0 NaN NaN
23 2014-10-10 5.5 NaN NaN
24 2014-10-11 5.5 NaN NaN
>>> pivot
date max_sum_count
0 2013 48.7
1 2014 210.6
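A note: the 2014 figure here (210.6) differs from the first answer (135.5) because this approach sums each whole run of consecutive days, while the updated question asks for the best 3-consecutive-day sum within a run (96.0 + 2.5 + 37.0 = 135.5). A sketch of that refinement, reusing mask and grp from the code above (untested beyond the sample data):
# best 3-consecutive-day sum inside each run, then the yearly maximum
tmp = df.loc[mask, ['date', 'pcp']].set_index('date')
best3 = (tmp.groupby(grp.values)['pcp']
            .rolling('3D').sum()
            .reset_index(level=0, drop=True))
pivot3 = best3.groupby(best3.index.year).max()  # should give 2013: 48.7, 2014: 135.5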
I have a dataframe containing temperature readings from different areas on different dates.
I want to add the missing dates for each area with zero temperature.
For example:
df = pd.DataFrame({"area_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "reading_date": ["13/1/2017", "15/1/2017", "16/1/2017",
                                    "22/3/2017", "26/3/2017", "28/3/2017",
                                    "15/5/2017", "16/5/2017", "18/5/2017"],
                   "temp": [12, 15, 22, 6, 14, 8, 30, 25, 33]})
What is the most efficient way to fill the date gaps per area (with zeros), as shown in the output below?
Many Thanks.
Use:
first convert column reading_date to datetime with to_datetime
set_index to create a DatetimeIndex, then groupby with resample
add asfreq to get a Series back at daily frequency
replace the NaNs with fillna
last, reset_index to move the MultiIndex levels back to columns
df['reading_date'] = pd.to_datetime(df['reading_date'])
df = (df.set_index('reading_date')
        .groupby('area_id')
        .resample('d')['temp']
        .asfreq()
        .fillna(0)
        .reset_index())
print (df)
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
Using reindex. Define a custom function to handle the reindexing operation, and call it inside groupby.apply.
def reindex(x):
    # Thanks to @jezrael for the improvement.
    return x.reindex(pd.date_range(x.index.min(), x.index.max()), fill_value=0)
Next, convert reading_date to datetime using pd.to_datetime:
df.reading_date = pd.to_datetime(df.reading_date)
Now, perform a groupby.
df = (df.set_index('reading_date')
        .groupby('area_id')
        .temp
        .apply(reindex)
        .reset_index())
df.columns = ['area_id', 'reading_date', 'temp']
df
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
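A small follow-up that applies to both answers: introducing NaN before the fill upcasts temp to float, which is why the output shows 12.0 instead of 12. If integer temperatures are wanted, cast back after filling:
# NaN forces float; cast back once the gaps are filled with 0
df['temp'] = df['temp'].astype(int)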
I create the following dataframe:
Date ProductID SubProductId Value
0 2015-01-02 1 1 11
1 2015-01-02 1 2 12
2 2015-01-02 1 3 NaN
3 2015-01-02 1 4 NaN
4 2015-01-02 2 1 14
5 2015-01-02 2 2 15
6 2015-01-02 2 3 16
7 2015-01-03 1 1 17
8 2015-01-03 1 2 18
9 2015-01-03 1 3 NaN
10 2015-01-03 1 4 21
11 2015-01-03 2 1 20
12 2015-01-03 2 2 21
And then I group the subproducts by products:
df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
and I would like to get the following:
Value
ProductID 1 2
SubProductId 1 2 3 4 1 2 3
Date
2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
But when I print it, every column that starts with NaN gets pushed to the end:
Value
ProductID 1 2 1
SubProductId 1 2 1 2 3 4 3
Date
2015-01-02 11.0 12.0 14.0 15.0 16.0 NaN NaN
2015-01-03 17.0 18.0 20.0 21.0 NaN 21.0 NaN
How can I have every subcolumn grouped under its corresponding column, even the subcolumns that contain NaN?
NB: Versions used:
Python version: 3.6.0
Pandas version: 0.19.2
If you want ordered column names, you can use sortlevel with axis=1 to sort the column index:
df1 = df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
# sort in descending order
df1.sortlevel(axis=1, ascending=False)
# Value
#ProductID 2 1
#SubProductId 3 2 1 4 3 2 1
#Date
#2015-01-02 16.0 15.0 14.0 NaN NaN 12.0 11.0
#2015-01-03 NaN 21.0 20.0 21.0 NaN 18.0 17.0
# sort in ascending order
df1.sortlevel(axis=1, ascending=True)
# Value
#ProductID 1 2
#SubProductId 1 2 3 4 1 2 3
#Date
#2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
#2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
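A side note for newer pandas versions: sortlevel was deprecated in 0.20 and later removed, so on a current install the equivalent call (a sketch, not verified against 0.19.2 output formatting) is sort_index on the column axis:
# modern replacement for sortlevel(axis=1)
df1 = df1.sort_index(axis=1, ascending=True)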
Here is my DataFrame:
0 1 2
0 0 0.0 20.0 NaN
1 1.0 21.0 NaN
2 2.0 22.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2011.0
1 0 3.0 23.0 NaN
1 4.0 24.0 NaN
2 5.0 25.0 NaN
3 6.0 26.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2012.0
I want to convert the 'ID' and 'Year' rows to the dataframe index, with 'ID' being level=0 and 'Year' being level=1. I tried using stack() but still cannot figure it out.
Edited: my desired output should look like this:
0 1
11111 2011 0 0.0 20.0
1 1.0 21.0
2 2.0 22.0
2012 0 3.0 23.0
1 4.0 24.0
2 5.0 25.0
3 6.0 26.0
This should work:
# select the 'ID' and 'Year' rows from column '2'
df1 = df.loc[pd.IndexSlice[:, ['ID', 'Year']], '2']
# reshape to one row per block, giving 'ID' and 'Year' columns
dfs = df1.unstack()
dfi = df1.index
# drop the 'ID'/'Year' rows and column '2', keeping only the data rows
dfn = df.drop(dfi).drop('2', axis=1).unstack()
# index by (ID, Year) and bring the row numbers back as an index level
dfn.set_index([dfs.ID, dfs.Year]).stack()