I would like to subtract [a groupby mean computed on a subset] from the [original] DataFrame:
I have a pandas DataFrame data_org whose index is a DatetimeIndex (monthly, say 100 years = 100 yr * 12 months) and which has 10 columns of station IDs (i.e., a 1200-row * 10-column pd.DataFrame).
1)
I would like to first take a subset of the above data, e.g. the first 50 years (i.e., 50 yr * 12 months),
data_sub = data_org[data_org.index.year <= top_50_year]
and calculate the mean of each calendar month for each station (column), e.g.,
mean_sub = data_sub.groupby(data_sub.index.month).mean()
or
mean_sub = data_sub.groupby(data_sub.index.month).transform('mean')
which seem to do the job.
2)
Now I want to subtract above from the [original] NOT from the [subset], e.g.,
data_org - mean_sub
which I do not know how to do. In summary, I would like to calculate the monthly means from a subset of the original data (e.g., using only 50 years) and subtract those monthly means from the original data, month by month.
This is easy when the mean is calculated from the full [original] data (i.e., .transform('mean') or .apply(lambda x: x - x.mean()) do the job), but what should I do when the mean is calculated from a [subset] of the data?
Could you share your insight for this problem? Thank you in advance!
@mozway The input (and also the output) has the following shape:
[Image: input shape with random values]
Only the values differ in the output: they are anomalies from the [subset]'s monthly mean. Thank you.
One idea is to replace the non-matching rows with NaN using DataFrame.where. GroupBy.transform then returns a result with the same index as the original DataFrame, so the subtraction is possible:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10,3)),
index=pd.date_range('2000-01-01',periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data1 = data_org.where(data_org.index.to_series().dt.year <= top_50_year)
print (data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 NaN NaN NaN
2001-04-30 NaN NaN NaN
2001-07-31 NaN NaN NaN
2001-10-31 NaN NaN NaN
2002-01-31 NaN NaN NaN
2002-04-30 NaN NaN NaN
mean_data1 = data1.groupby(data1.index.month).transform('mean')
print (mean_data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 2.0 2.0 6.0
2001-04-30 1.0 3.0 9.0
2001-07-31 6.0 1.0 0.0
2001-10-31 1.0 9.0 0.0
2002-01-31 2.0 2.0 6.0
2002-04-30 1.0 3.0 9.0
df = data_org - mean_data1
print (df)
0 1 2
2000-01-31 0.0 0.0 0.0
2000-04-30 0.0 0.0 0.0
2000-07-31 0.0 0.0 0.0
2000-10-31 0.0 0.0 0.0
2001-01-31 -2.0 7.0 -3.0
2001-04-30 3.0 -3.0 -9.0
2001-07-31 -2.0 0.0 7.0
2001-10-31 2.0 -7.0 4.0
2002-01-31 5.0 0.0 -2.0
2002-04-30 7.0 -3.0 -2.0
Another idea with filtering:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10,3)),
index=pd.date_range('2000-01-01',periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]
print (data_sub)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
mean_sub = data_sub.groupby(data_sub.index.month).mean()
print (mean_sub)
0 1 2
1 2 2 6
4 1 3 9
7 6 1 0
10 1 9 0
Create new column m for months:
data_org['m'] = data_org.index.month
print (data_org)
0 1 2 m
2000-01-31 2 2 6 1
2000-04-30 1 3 9 4
2000-07-31 6 1 0 7
2000-10-31 1 9 0 10
2001-01-31 0 9 3 1
2001-04-30 4 0 0 4
2001-07-31 4 1 7 7
2001-10-31 3 2 4 10
2002-01-31 7 2 4 1
2002-04-30 8 0 7 4
Then mean_sub is merged onto this column with DataFrame.join:
mean_data1 = data_org[['m']].join(mean_sub, on='m')
print (mean_data1)
m 0 1 2
2000-01-31 1 2 2 6
2000-04-30 4 1 3 9
2000-07-31 7 6 1 0
2000-10-31 10 1 9 0
2001-01-31 1 2 2 6
2001-04-30 4 1 3 9
2001-07-31 7 6 1 0
2001-10-31 10 1 9 0
2002-01-31 1 2 2 6
2002-04-30 4 1 3 9
df = data_org - mean_data1
print (df)
0 1 2 m
2000-01-31 0 0 0 0
2000-04-30 0 0 0 0
2000-07-31 0 0 0 0
2000-10-31 0 0 0 0
2001-01-31 -2 7 -3 0
2001-04-30 3 -3 -9 0
2001-07-31 -2 0 7 0
2001-10-31 2 -7 4 0
2002-01-31 5 0 -2 0
2002-04-30 7 -3 -2 0
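A more compact variant of the second idea, offered as a sketch rather than as the answerer's method: compute the monthly means on the subset, then use reindex to repeat each month's mean for every row of the original frame, which avoids the helper column m. The column names A, B, C, the random data, and the cutoff year below are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Illustrative data: 3 years of monthly values for 3 hypothetical stations.
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-31", periods=36, freq=pd.offsets.MonthEnd())
data_org = pd.DataFrame(rng.normal(size=(36, 3)), index=idx, columns=["A", "B", "C"])

top_year = 2001  # assumed cutoff: the subset is every row up to this year

# Monthly climatology from the subset only (one row per calendar month).
data_sub = data_org[data_org.index.year <= top_year]
mean_sub = data_sub.groupby(data_sub.index.month).mean()

# Repeat the matching month's mean for every row of the original frame,
# then give the result the original index so the subtraction aligns.
aligned = mean_sub.reindex(data_org.index.month)
aligned.index = data_org.index

anomalies = data_org - aligned
print(anomalies.head())
```

Because the means come only from the subset, the anomalies average to zero per month inside the subset years but generally not outside them.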
I'm working with a large dataset and have the following issue:
Let's say I'm measuring the input of a substance ("sub-input") into a medium ("id"). For each sub-input I have calculated the year in which it is going to reach the other side of the medium ("y-arrival"). Sometimes several sub-inputs arrive in the same year, and sometimes no substance arrives in a year.
Example:
import pandas as pd
import numpy as np
ids = [1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3]
year= [2000,2001,2002,2003,2004,2005,1990,1991,1992,1993,1994,1995,2000,2001,2002,2003,2004,2005]
in1 = [20,40,10,30,50,80,
60,10,10,40,np.nan,np.nan,
np.nan,120,30,70,60,90]
arr = [2002,2004,2004,2004,2005,np.nan,
1991,1992,np.nan,1995,1995,np.nan,
2001,2002,2004,2004,2005,np.nan]
dictex3 ={"id":ids,"year":year,"sub-input":in1, "y-arrival":arr}
dfex3 = pd.DataFrame(dictex3)
I have then calculated the sum of "sub-input" for each "y-arrival" using the following code:
dfex3["input_sum_tf"] = dfex3.groupby(["id","y-arrival"])["sub-input"].transform("sum")
print(dfex3)
id year sub-input y-arrival input_sum_tf
0 1 2000 20.0 2002.0 20.0
1 1 2001 40.0 2004.0 80.0
2 1 2002 10.0 2004.0 80.0
3 1 2003 30.0 2004.0 80.0
4 1 2004 50.0 2005.0 50.0
5 1 2005 80.0 NaN NaN
6 2 1990 60.0 1991.0 60.0
7 2 1991 10.0 1992.0 10.0
8 2 1992 10.0 NaN NaN
9 2 1993 40.0 1995.0 40.0
10 2 1994 NaN 1995.0 40.0
11 2 1995 NaN NaN NaN
12 3 2000 NaN 2001.0 0.0
13 3 2001 120.0 2002.0 120.0
14 3 2002 30.0 2004.0 100.0
15 3 2003 70.0 2004.0 100.0
16 3 2004 60.0 2005.0 60.0
17 3 2005 90.0 NaN NaN
Now, for each "id" the sum of the inputs that reach the destination at a "y-arrival" has been calculated.
The goal is to reorder these values so that for each id and each year, the sum of the sub-inputs that will arrive in that year can be shown. Example:
id = 1, year = 2000 --> no y-arrival = 2000 --> = NaN
id = 1, year = 2001 --> no y-arrival = 2001 --> = NaN
id = 1, year = 2002 --> y-arrival = 2002 has an input_sum_tf = 20 --> = 20
id = 1, year = 2003 --> no y-arrival = 2003 --> = NaN
id = 1, year = 2004 --> y-arrival = 2004 has an input_sum_tf = 80 --> = 80
The "input_sum_tf" is the sum of the substances that arrive in a given year. The value "80" for year 2004 is the sum of the sub-input from the years 2001, 2002, 2003 because all of these arrive in year 2004 (y-arrival = 2004).
The result ("input_sum") should look like this:
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
My approach:
I tried solving this by using the pandas merge function on two columns, but the result isn't quite right. So far my code only works for the first 5 rows.
dfex3['input_sum'] = dfex3.merge(dfex3, left_on=['id','y-arrival'],
right_on=['id','year'],
how='right')['input_sum_tf_x']
dfex3["input_sum"]
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 80.0
6 80.0
7 50.0
8 NaN
9 60.0
10 10.0
11 NaN
12 NaN
13 40.0
14 40.0
15 NaN
16 0.0
17 120.0
Any help would be much appreciated!
The issue is that your code merges 'year' against 'y-arrival', so it makes multiple matches where you only want one. E.g. row 4, where year=2004, matches 3 times because y-arrival=2004 occurs in rows 1-3, hence the duplicates of 80 in output rows 4-6.
Use groupby to get the last row for each id/y-arrival combination (it also looks like you don't want matches where 'input_sum_tf' is zero):
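To see the duplicate matches directly, one option is to count how many rows share each (id, y-arrival) pair. This is only a diagnostic sketch that rebuilds the question's data; the variable name matches is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question.
dfex3 = pd.DataFrame({
    "id": [1] * 6 + [2] * 6 + [3] * 6,
    "year": [2000, 2001, 2002, 2003, 2004, 2005,
             1990, 1991, 1992, 1993, 1994, 1995,
             2000, 2001, 2002, 2003, 2004, 2005],
    "sub-input": [20, 40, 10, 30, 50, 80,
                  60, 10, 10, 40, np.nan, np.nan,
                  np.nan, 120, 30, 70, 60, 90],
    "y-arrival": [2002, 2004, 2004, 2004, 2005, np.nan,
                  1991, 1992, np.nan, 1995, 1995, np.nan,
                  2001, 2002, 2004, 2004, 2005, np.nan],
})

# Number of rows per (id, y-arrival) pair; rows with a NaN arrival are skipped.
matches = dfex3.groupby(["id", "y-arrival"]).size()

# Any pair with a count above 1 produces that many rows in the merge.
print(matches[matches > 1])
```

Every pair listed here is a merge key that appears more than once on one side, which is why the merged result has more rows than the original frame.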
df_last = dfex3.groupby(['id', 'y-arrival']).last().reset_index()
df_last = df_last[df_last['input_sum_tf'] != 0]
Then merge:
dfex3.merge(df_last,
left_on=['id', 'year'],
right_on=['id', 'y-arrival'],
how='left')['input_sum_tf_y']
0 NaN
1 NaN
2 20.0
3 NaN
4 80.0
5 50.0
6 NaN
7 60.0
8 10.0
9 NaN
10 NaN
11 40.0
12 NaN
13 NaN
14 120.0
15 NaN
16 100.0
17 60.0
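An alternative sketch, not the answerer's method: aggregate the sub-inputs per (id, y-arrival) first, so the lookup table has exactly one row per key, and then left-merge it onto (id, year). Using sum(min_count=1) makes a group whose inputs are all NaN come out as NaN rather than 0, which stands in for the explicit != 0 filter. The names arrivals and result are assumptions for illustration.

```python
import numpy as np
import pandas as pd

dfex3 = pd.DataFrame({
    "id": [1] * 6 + [2] * 6 + [3] * 6,
    "year": [2000, 2001, 2002, 2003, 2004, 2005,
             1990, 1991, 1992, 1993, 1994, 1995,
             2000, 2001, 2002, 2003, 2004, 2005],
    "sub-input": [20, 40, 10, 30, 50, 80,
                  60, 10, 10, 40, np.nan, np.nan,
                  np.nan, 120, 30, 70, 60, 90],
    "y-arrival": [2002, 2004, 2004, 2004, 2005, np.nan,
                  1991, 1992, np.nan, 1995, 1995, np.nan,
                  2001, 2002, 2004, 2004, 2005, np.nan],
})

# One row per (id, arrival year) holding the summed sub-inputs.
arrivals = (
    dfex3.dropna(subset=["y-arrival"])
    .groupby(["id", "y-arrival"])["sub-input"]
    .sum(min_count=1)              # all-NaN groups stay NaN instead of 0
    .rename("input_sum")
    .reset_index()
    .rename(columns={"y-arrival": "year"})
)
arrivals["year"] = arrivals["year"].astype(int)  # match the dtype of dfex3["year"]

# Keys in `arrivals` are unique, so the left merge keeps one row per original row.
result = dfex3.merge(arrivals, on=["id", "year"], how="left")
print(result["input_sum"])
```

Aggregating before merging keeps the row count unchanged, so the new column can be assigned back to dfex3 without realigning.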
I have a problem with pandas interpolate(). I only want to interpolate gaps of no more than 2 successive np.nan values.
But with limit=2 the function still fills the first 2 values of a gap that contains more than 2 np.nan values.
s = pd.Series(data = [np.nan,10,np.nan,np.nan,np.nan,5,np.nan,6,np.nan,np.nan,30])
a = s.interpolate(limit=2,limit_area='inside')
print(a)
the output I get is:
0 NaN
1 10.00
2 8.75
3 7.50
4 NaN
5 5.00
6 5.50
7 6.00
8 14.00
9 22.00
10 30.00
dtype: float64
I do not want the result in line 2 and 3.
What I want is:
0 NaN
1 10.00
2 NaN
3 NaN
4 NaN
5 5.00
6 5.50
7 6.00
8 14.00
9 22.00
10 30.00
dtype: float64
Can anybody please help?
Groupby.transform with Series.where
s_notna = s.notna()
m = (s.groupby(s_notna.cumsum()).transform('size').le(3) | s_notna)
s = s.interpolate(limit_area='inside').where(m)
print(s)
Output
0 NaN
1 10.0
2 NaN
3 NaN
4 NaN
5 5.0
6 5.5
7 6.0
8 14.0
9 22.0
10 30.0
dtype: float64
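The same idea can be wrapped in a small helper that takes the maximum gap length as a parameter. This is a sketch under the assumption that a gap means a run of consecutive NaN values; the function name interpolate_short_gaps and the parameter max_gap are hypothetical, not part of pandas.

```python
import numpy as np
import pandas as pd


def interpolate_short_gaps(s: pd.Series, max_gap: int) -> pd.Series:
    """Linearly interpolate interior NaN runs of length <= max_gap only."""
    is_na = s.isna()
    # A new run starts wherever the NaN flag changes from the previous row.
    run_id = (is_na != is_na.shift()).cumsum()
    # Length of the run each element belongs to.
    run_len = is_na.groupby(run_id).transform("size")
    # Keep original values, plus interpolated values from short NaN runs.
    keep = ~is_na | (run_len <= max_gap)
    return s.interpolate(limit_area="inside").where(keep)


s = pd.Series([np.nan, 10, np.nan, np.nan, np.nan, 5, np.nan, 6, np.nan, np.nan, 30])
filled = interpolate_short_gaps(s, max_gap=2)
print(filled)
```

Interpolating everything first and masking afterwards avoids the partial fill that limit=2 produces on long gaps.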
I have a dataframe like this-
element id year month days tmax tmin
0 MX17004 2010 1 d1 NaN NaN
1 MX17004 2010 1 d10 NaN NaN
2 MX17004 2010 1 d11 NaN NaN
3 MX17004 2010 1 d12 NaN NaN
4 MX17004 2010 1 d13 NaN NaN
where I want to strip the "d" prefix from the days column, so that it looks like this:
days
1
10
11
12
13
I have tried a couple of ways, but without success. Can someone please help or give me a clue?
By using str slice
df.days=df.days.str[1:]
df
Out[759]:
element id year month days tmax tmin
0 0 MX17004 2010 1 1 NaN NaN
1 1 MX17004 2010 1 10 NaN NaN
2 2 MX17004 2010 1 11 NaN NaN
3 3 MX17004 2010 1 12 NaN NaN
4 4 MX17004 2010 1 13 NaN NaN
Use extract with regex:
df['days'] = df.days.str.extract(r'd(\d+)', expand=False)
print(df)
Output:
element id year month days tmax tmin
0 0 MX17004 2010 1 1 NaN NaN
1 1 MX17004 2010 1 10 NaN NaN
2 2 MX17004 2010 1 11 NaN NaN
3 3 MX17004 2010 1 12 NaN NaN
4 4 MX17004 2010 1 13 NaN NaN
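Both approaches above leave the days column as strings. If numeric day values are wanted (an assumption, since the question does not say), the extracted digits can be cast to integers, as in this minimal sketch on a cut-down version of the frame:

```python
import pandas as pd

# Cut-down version of the question's frame; only the relevant columns are kept.
df = pd.DataFrame({
    "id": ["MX17004"] * 5,
    "days": ["d1", "d10", "d11", "d12", "d13"],
})

# Pull out the digits after the leading "d", then convert them to integers.
df["days"] = df["days"].str.extract(r"d(\d+)", expand=False).astype(int)

print(df.dtypes)
print(df)
```

With an integer column, sorting by days orders the rows numerically (1, 2, ..., 10) instead of lexically ("1", "10", "11", ..., "2").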
I create the following dataframe:
Date ProductID SubProductId Value
0 2015-01-02 1 1 11
1 2015-01-02 1 2 12
2 2015-01-02 1 3 NaN
3 2015-01-02 1 4 NaN
4 2015-01-02 2 1 14
5 2015-01-02 2 2 15
6 2015-01-02 2 3 16
7 2015-01-03 1 1 17
8 2015-01-03 1 2 18
9 2015-01-03 1 3 NaN
10 2015-01-03 1 4 21
11 2015-01-03 2 1 20
12 2015-01-03 2 2 21
And then I group the subproducts by products:
df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
and I would like to get the following:
Value
ProductID 1 2
SubProductId 1 2 3 4 1 2 3
Date
2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
But when I print the result, every column that starts with a NaN is moved to the end:
Value
ProductID 1 2 1
SubProductId 1 2 1 2 3 4 3
Date
2015-01-02 11.0 12.0 14.0 15.0 16.0 NaN NaN
2015-01-03 17.0 18.0 20.0 21.0 NaN 21.0 NaN
How can I keep every sub-column grouped under its corresponding column, including the sub-columns that contain NaN?
NB: Versions used:
Python version: 3.6.0
Pandas version: 0.19.2
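If you want ordered column names, you can use sortlevel with axis=1 to sort the column index (sortlevel was deprecated in pandas 0.20 in favor of sort_index, so on newer versions use df1.sort_index(axis=1) instead):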
df1 = df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
# sort in descending order
df1.sortlevel(axis=1, ascending=False)
# Value
#ProductID 2 1
#SubProductId 3 2 1 4 3 2 1
#Date
#2015-01-02 16.0 15.0 14.0 NaN NaN 12.0 11.0
#2015-01-03 NaN 21.0 20.0 21.0 NaN 18.0 17.0
# sort in ascending order
df1.sortlevel(axis=1, ascending=True)
# Value
#ProductID 1 2
#SubProductId 1 2 3 4 1 2 3
#Date
#2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
#2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
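On pandas versions where sortlevel is no longer available, sort_index(axis=1) does the same job. The sketch below rebuilds the question's data, with Date kept as plain strings for brevity (an assumption), and sorts the unstacked columns; the names wide and wide_sorted are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": ["2015-01-02"] * 7 + ["2015-01-03"] * 6,
    "ProductID": [1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2],
    "SubProductId": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2],
    "Value": [11, 12, np.nan, np.nan, 14, 15, 16, 17, 18, np.nan, 21, 20, 21],
})

# Move ProductID and SubProductId into the columns, keeping Date as the row index.
wide = df.set_index(["Date", "ProductID", "SubProductId"]).unstack(["ProductID", "SubProductId"])

# Sort the column MultiIndex so sub-products are grouped under their product.
wide_sorted = wide.sort_index(axis=1)
print(wide_sorted)
```

Passing ascending=False to sort_index reverses the column order in the same way as the descending sortlevel call above.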