I want to merge static data with time-varying data.
The first dataframe:
import pandas as pd
a_columns = pd.MultiIndex.from_product([["A","B","C"],["1","2"]])
a_index = pd.date_range("20100101","20110101",freq="BM")
a = pd.DataFrame(columns=a_columns, index=a_index)  # A
The second dataframe:
b_columns = ["3","4","5"]
b_index = ["A","B","C"]
b = pd.DataFrame(columns=b_columns,index=b_index)
How do I join these two? My desired dataframe has the same form as a, but with the additional columns.
Thanks!
I think you need to reshape b by stack and then create a one-row DataFrame by to_frame. For concat you need a DatetimeIndex, so the new index is taken from the first value of a's index.
Last, concat + sort_index:
# added some data - fill a with the value 2
a_columns = pd.MultiIndex.from_product([["A","B","C"],["1","2"]])
a_index = pd.date_range("20100101","20110101",freq="BM")
a = pd.DataFrame(2, columns=a_columns, index=a_index)  # A
# added some data - fill b with the value 1
b_columns = ["3","4","5"]
b_index = ["A","B","C"]
b = pd.DataFrame(1, columns=b_columns, index=b_index)
c = b.stack().to_frame(a.index[0]).T
print (c)
A B C
3 4 5 3 4 5 3 4 5
2010-01-29 1 1 1 1 1 1 1 1 1
d = pd.concat([a,c], axis=1).sort_index(axis=1)
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-02-26 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-03-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-04-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-05-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-06-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-07-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-08-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-09-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-10-29 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-11-30 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
2010-12-31 2 2 NaN NaN NaN 2 2 NaN NaN NaN 2 2 NaN NaN NaN
Last, if you need to replace the NaNs only in the added columns, forward-fill them from the first row:
d[c.columns] = d[c.columns].ffill()
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-02-26 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-03-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-04-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-05-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-06-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-07-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-08-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-09-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-10-29 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-11-30 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
2010-12-31 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0 2 2 1.0 1.0 1.0
A similar solution with reindex:
c = b.stack().to_frame(a.index[0]).T.reindex(a.index, method='ffill')
print (c)
A B C
3 4 5 3 4 5 3 4 5
2010-01-29 1 1 1 1 1 1 1 1 1
2010-02-26 1 1 1 1 1 1 1 1 1
2010-03-31 1 1 1 1 1 1 1 1 1
2010-04-30 1 1 1 1 1 1 1 1 1
2010-05-31 1 1 1 1 1 1 1 1 1
2010-06-30 1 1 1 1 1 1 1 1 1
2010-07-30 1 1 1 1 1 1 1 1 1
2010-08-31 1 1 1 1 1 1 1 1 1
2010-09-30 1 1 1 1 1 1 1 1 1
2010-10-29 1 1 1 1 1 1 1 1 1
2010-11-30 1 1 1 1 1 1 1 1 1
2010-12-31 1 1 1 1 1 1 1 1 1
d = pd.concat([a,c], axis=1).sort_index(axis=1)
print (d)
A B C
1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
2010-01-29 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-02-26 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-03-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-04-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-05-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-06-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-07-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-08-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-09-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-10-29 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-11-30 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
2010-12-31 2 2 1 1 1 2 2 1 1 1 2 2 1 1 1
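An alternative sketch, my own suggestion rather than part of the answer above: because b is static, you can assign each of its stacked values as a constant column of a and let pandas broadcast the scalar down the whole DatetimeIndex, which avoids the NaN-filling step entirely:
d2 = a.copy()
for col, val in b.stack().items():
    d2[col] = val  # tuple key like ("A","3") creates a new MultiIndex column; the scalar fills every date
d2 = d2.sort_index(axis=1)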
Related
I want to fill all rows between two values by group. For each group, var1 has two values equal to 1, and I want to fill the missing rows between the two 1s. var1 is what I have, var2 is what I want, and var3 is what my code produces, which is not what I want (it differs from var2):
var1 group var2 var3
NaN 1 NaN NaN
NaN 1 NaN NaN
1 1 1 1
NaN 1 1 1
NaN 1 1 1
1 1 1 1
NaN 1 NaN 1
NaN 1 NaN 1
1 2 1 1
NaN 2 1 1
1 2 1 1
NaN 2 NaN 1
My code:
df['var3'] = df.groupby('group')['var1'].ffill()
Assuming the values are only 1 or NaN, you can groupby.ffill and groupby.bfill and only keep the values that are identical:
g = df.groupby('group')['var1']
s1 = g.ffill()
s2 = g.bfill()
df['var2'] = s1.where(s1.eq(s2))
Output:
var1 group var2
0 NaN 1 NaN
1 NaN 1 NaN
2 1.0 1 1.0
3 NaN 1 1.0
4 NaN 1 1.0
5 1.0 1 1.0
6 NaN 1 NaN
7 NaN 1 NaN
8 1.0 2 1.0
9 NaN 2 1.0
10 1.0 2 1.0
11 NaN 2 NaN
Intermediates:
var1 group var2 ffill bfill
0 NaN 1 NaN NaN 1.0
1 NaN 1 NaN NaN 1.0
2 1.0 1 1.0 1.0 1.0
3 NaN 1 1.0 1.0 1.0
4 NaN 1 1.0 1.0 1.0
5 1.0 1 1.0 1.0 1.0
6 NaN 1 NaN 1.0 NaN
7 NaN 1 NaN 1.0 NaN
8 1.0 2 1.0 1.0 1.0
9 NaN 2 1.0 1.0 1.0
10 1.0 2 1.0 1.0 1.0
11 NaN 2 NaN 1.0 NaN
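An alternative sketch, my own idea rather than from the answer above: flag positions at or between the first and last 1 of each group by running a cumulative maximum forward and backward, then keep only positions where both flags hold:
import numpy as np
m1 = df['var1'].eq(1)
fwd = m1.groupby(df['group']).cummax()                    # True from the first 1 onward
bwd = m1[::-1].groupby(df['group'][::-1]).cummax()[::-1]  # True up to the last 1
df['var2_alt'] = np.where(fwd & bwd, 1, np.nan)           # 1 only between the two 1s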
For example, I have a dataframe:
          0         1         2         3         4         5         6
0  0.493212  0.586246       NaN  0.589289       NaN  0.629087  0.593872
1  0.568513  0.367722       NaN       NaN       NaN       NaN  0.423369
2  0.700540  0.735529       NaN       NaN  0.494135       NaN       NaN
3       NaN       NaN       NaN  0.338822  0.466331  0.765367  0.830820
4  0.512891       NaN  0.623782  0.642438       NaN  0.541117  0.929810
If I compare it like:
df >= 0.5
The result is:
   0  1  2  3  4  5  6
0  0  1  0  1  0  1  1
1  1  0  0  0  0  0  0
2  1  1  0  0  0  0  0
3  0  0  0  0  0  1  1
4  1  0  1  1  0  1  1
How can I keep the NaN cells? I mean I need 0.5 > np.nan to evaluate to np.nan, not to False.
IIUC, you can use a mask (shown here with lt(0.5); swap in ge(0.5) for your exact comparison):
df.lt(0.5).astype(int).mask(df.isna())
output:
0 1 2 3 4 5 6
0 1.0 0.0 NaN 0.0 NaN 0.0 0.0
1 0.0 1.0 NaN NaN NaN NaN 1.0
2 0.0 0.0 NaN NaN 1.0 NaN NaN
3 NaN NaN NaN 1.0 1.0 0.0 0.0
4 0.0 NaN 0.0 0.0 NaN 0.0 0.0
If you want to keep the integer type:
out = df.lt(0.5).astype(pd.Int64Dtype()).mask(df.isna())
output:
0 1 2 3 4 5 6
0 1 0 <NA> 0 <NA> 0 0
1 0 1 <NA> <NA> <NA> <NA> 1
2 0 0 <NA> <NA> 1 <NA> <NA>
3 <NA> <NA> <NA> 1 1 0 0
4 0 <NA> 0 0 <NA> 0 0
Use DataFrame.mask and convert the values to integers:
df = (df >= 0.5).astype(int).mask(df.isna())
print (df)
0 1 2 3 4 5 6
0 0.0 1.0 NaN 1.0 NaN 1.0 1.0
1 1.0 0.0 NaN NaN NaN NaN 0.0
2 1.0 1.0 NaN NaN 0.0 NaN NaN
3 NaN NaN NaN 0.0 0.0 1.0 1.0
4 1.0 NaN 1.0 1.0 NaN 1.0 1.0
Details:
print ((df >= 0.5).astype(int))
0 1 2 3 4 5 6
0 0 1 0 1 0 1 1
1 1 0 0 0 0 0 0
2 1 1 0 0 0 0 0
3 0 0 0 0 0 1 1
4 1 0 1 1 0 1 1
Another idea with numpy.select:
df[:] = np.select([df.isna(), df >= 0.5], [None, 1], default=0)
print (df)
0 1 2 3 4 5 6
0 0.0 1.0 NaN 1.0 NaN 1.0 1.0
1 1.0 0.0 NaN NaN NaN NaN 0.0
2 1.0 1.0 NaN NaN 0.0 NaN NaN
3 NaN NaN NaN 0.0 0.0 1.0 1.0
4 1.0 NaN 1.0 1.0 NaN 1.0 1.0
Btw, if you need True/False with NaN, it is possible to use the nullable boolean data type:
df = (df >= 0.5).astype(int).mask(df.isna()).astype('boolean')
print (df)
0 1 2 3 4 5 6
0 False True <NA> True <NA> True True
1 True False <NA> <NA> <NA> <NA> False
2 True True <NA> <NA> False <NA> <NA>
3 <NA> <NA> <NA> False False True True
4 True <NA> True True <NA> True True
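One more sketch, my own addition assuming pandas >= 1.2: starting again from the original numeric df, cast to the nullable Float64 dtype first and the comparison itself propagates <NA>, so no masking step is needed at all:
out = df.astype('Float64').ge(0.5)  # nullable boolean dtype, <NA> where df was NaN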
I would like to subtract [a groupby mean of a subset] from the [original] dataframe:
I have a pandas DataFrame data_org whose index is a DatetimeIndex (monthly, say 100 years = 100 yr * 12 mo) and which has 10 columns of station IDs (i.e., a 1200-row * 10-column pd.DataFrame).
1)
I would like to first take a subset of the above data, e.g. the first 50 years (i.e., 50 yr * 12 mo),
data_sub = data_org[data_org.index.year <= top_50_year]
and calculate the monthly mean for each month for each station (column), e.g.,
mean_sub = data_sub.groupby(data_sub.index.month).mean()
or
mean_sub = data_sub.groupby(data_sub.index.month).transform('mean')
which seems to do the job.
2)
Now I want to subtract the above from the [original], NOT from the [subset], e.g.,
data_org - mean_sub
which I do not know how to do. So in summary, I would like to calculate a monthly mean from a subset of the original data (e.g., only using 50 years), and subtract that monthly mean from the original data, month by month.
It was easy to subtract when using the full [original] data to calculate the mean (i.e., .transform('mean') or .apply(lambda x: x - x.mean()) does the job), but what should I do if the mean is calculated from the [subset] data?
Could you share your insight for this problem? Thank you in advance!
@mozway
The input (and also the output) shape looks like the following:
[image: input shape with random values]
Only the values of the output are anomalies from the [subset]'s monthly mean. Thank you.
One idea is to replace the non-matching values with NaN using DataFrame.where; after GroupBy.transform you then get the same index as the original DataFrame, so you can subtract directly:
import numpy as np
import pandas as pd
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10,3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data1 = data_org.where(data_org.index.to_series().dt.year <= top_50_year)
print (data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 NaN NaN NaN
2001-04-30 NaN NaN NaN
2001-07-31 NaN NaN NaN
2001-10-31 NaN NaN NaN
2002-01-31 NaN NaN NaN
2002-04-30 NaN NaN NaN
mean_data1 = data1.groupby(data1.index.month).transform('mean')
print (mean_data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 2.0 2.0 6.0
2001-04-30 1.0 3.0 9.0
2001-07-31 6.0 1.0 0.0
2001-10-31 1.0 9.0 0.0
2002-01-31 2.0 2.0 6.0
2002-04-30 1.0 3.0 9.0
df = data_org - mean_data1
print (df)
0 1 2
2000-01-31 0.0 0.0 0.0
2000-04-30 0.0 0.0 0.0
2000-07-31 0.0 0.0 0.0
2000-10-31 0.0 0.0 0.0
2001-01-31 -2.0 7.0 -3.0
2001-04-30 3.0 -3.0 -9.0
2001-07-31 -2.0 0.0 7.0
2001-10-31 2.0 -7.0 4.0
2002-01-31 5.0 0.0 -2.0
2002-04-30 7.0 -3.0 -2.0
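The same idea can be wrapped in a small helper so the cutoff year is an explicit parameter; a minimal sketch (the function name is mine):
def subtract_subset_mean(df, cutoff_year):
    # NaN out rows after the cutoff, then subtract the per-month mean of what remains
    sub = df.where(df.index.to_series().dt.year <= cutoff_year)
    return df - sub.groupby(sub.index.month).transform('mean')
anom = subtract_subset_mean(data_org, top_50_year)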
Another idea with filtering:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10,3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]
print (data_sub)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
mean_sub = data_sub.groupby(data_sub.index.month).mean()
print (mean_sub)
0 1 2
1 2 2 6
4 1 3 9
7 6 1 0
10 1 9 0
Create a new column m for the months:
data_org['m'] = data_org.index.month
print (data_org)
0 1 2 m
2000-01-31 2 2 6 1
2000-04-30 1 3 9 4
2000-07-31 6 1 0 7
2000-10-31 1 9 0 10
2001-01-31 0 9 3 1
2001-04-30 4 0 0 4
2001-07-31 4 1 7 7
2001-10-31 3 2 4 10
2002-01-31 7 2 4 1
2002-04-30 8 0 7 4
and merge mean_sub onto this column with DataFrame.join:
mean_data1 = data_org[['m']].join(mean_sub, on='m')
print (mean_data1)
m 0 1 2
2000-01-31 1 2 2 6
2000-04-30 4 1 3 9
2000-07-31 7 6 1 0
2000-10-31 10 1 9 0
2001-01-31 1 2 2 6
2001-04-30 4 1 3 9
2001-07-31 7 6 1 0
2001-10-31 10 1 9 0
2002-01-31 1 2 2 6
2002-04-30 4 1 3 9
df = data_org - mean_data1
print (df)
0 1 2 m
2000-01-31 0 0 0 0
2000-04-30 0 0 0 0
2000-07-31 0 0 0 0
2000-10-31 0 0 0 0
2001-01-31 -2 7 -3 0
2001-04-30 3 -3 -9 0
2001-07-31 -2 0 7 0
2001-10-31 2 -7 4 0
2002-01-31 5 0 -2 0
2002-04-30 7 -3 -2 0
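A more compact variant of the join idea, sketched under the same setup (and dropping the helper column m first): reindex mean_sub by each row's month, realign it to the original DatetimeIndex with set_axis, and subtract:
vals = data_org.drop(columns='m')
aligned = mean_sub.reindex(vals.index.month).set_axis(vals.index)  # one mean row per date
out = vals - aligned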
I'm having a bit of trouble with this. My dataframe looks like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 nan nan
1 nan nan
2 nan 0
2 50 0
2 20 1
2 nan nan
2 nan nan
So what I need to do is: after dummy takes the value 1, fill the amount variable with zeroes for each id, like this:
id amount dummy
1 130 0
1 120 0
1 110 1
1 0 nan
1 0 nan
2 nan 0
2 50 0
2 20 1
2 0 nan
2 0 nan
I'm guessing I'll need some combination of groupby('id'), fillna(method='ffill'), and maybe a .loc or a shift(), but everything I tried either had some problem or was very slow. Any suggestions?
The way I would do it:
s = df.groupby('id')['dummy'].ffill().eq(1)
df.loc[s & df['dummy'].isna(), 'amount'] = 0
You can do this more simply:
data.loc[data['dummy'].isna(), 'amount'] = 0
This selects all the rows where dummy is NaN and fills the amount column with 0. (Note that the chained form data[data['dummy'].isna()]['amount'] = 0 would assign to a copy and leave the original frame unchanged.)
IIUC, ffill() and mask the still-NaN values:
s = df.groupby('id')['amount'].ffill().notnull()
df.loc[df['amount'].isna() & s, 'amount'] = 0
Output:
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
Could you please try the following:
df.loc[df['dummy'].isnull(), 'amount'] = 0
df
Output will be as follows.
id amount dummy
0 1 130.0 0.0
1 1 120.0 0.0
2 1 110.0 1.0
3 1 0.0 NaN
4 1 0.0 NaN
5 2 NaN 0.0
6 2 50.0 0.0
7 2 20.0 1.0
8 2 0.0 NaN
9 2 0.0 NaN
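For completeness, a sketch of my own that condenses the first answer's logic into a single mask call: the rows to zero out are exactly those where dummy is NaN but the forward-filled dummy equals 1:
after_one = df.groupby('id')['dummy'].ffill().eq(1) & df['dummy'].isna()
df['amount'] = df['amount'].mask(after_one, 0)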
If I have a pandas data frame like this:
A
1 1
2 1
3 NaN
4 1
5 NaN
6 1
7 1
8 1
9 1
10 NaN
11 1
12 1
13 1
How do I remove values that are clustered in runs shorter than some length (in this case four)? For example, such that I get an array like this:
A
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 1
7 1
8 1
9 1
10 NaN
11 NaN
12 NaN
13 NaN
Using groupby and np.where:
# count the non-null values in each run (runs are separated by NaNs)
s = df.groupby(df.A.isnull().cumsum()).transform(lambda s: pd.notnull(s).sum())
# keep values only where their run has at least 4 non-null entries
df['B'] = np.where(s.A >= 4, df.A, np.nan)
Outputs
A B
1 1.0 NaN
2 1.0 NaN
3 NaN NaN
4 1.0 NaN
5 NaN NaN
6 1.0 1.0
7 1.0 1.0
8 1.0 1.0
9 1.0 1.0
10 NaN NaN
11 1.0 NaN
12 1.0 NaN
13 1.0 NaN
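An equivalent sketch, my own variant, that replaces the lambda and np.where with the built-in count aggregation and Series.where:
run_len = df.groupby(df['A'].isna().cumsum())['A'].transform('count')  # non-null size of each run
df['B'] = df['A'].where(run_len >= 4)  # NaN out values in runs shorter than 4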