Compare two dataframes based on column data in Python pandas - python

I have two dataframes, df1 and df2, and I would like to subtract df2 from df1, matching rows on a specific column, 'Code'.
import pandas as pd
import numpy as np
rng = pd.date_range('2021-01-01', periods=10, freq='D')
df1 = pd.DataFrame(index=rng, data={'Val1': range(10), 'Val2': np.array(range(10))*5, 'Code': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]})
df2 = pd.DataFrame(data={'Code': [1, 2, 3, 4], 'Val1': [10, 5, 15, 20], 'Val2': [4, 8, 10, 7]})
df1:
Val1 Val2 Code
2021-01-01 0 0 1
2021-01-02 1 5 1
2021-01-03 2 10 1
2021-01-04 3 15 2
2021-01-05 4 20 2
2021-01-06 5 25 2
2021-01-07 6 30 3
2021-01-08 7 35 3
2021-01-09 8 40 3
2021-01-10 9 45 3
df2:
Code Val1 Val2
0 1 10 4
1 2 5 8
2 3 15 10
3 4 20 7
I am using the following code:
df = (df1.set_index(['Code']) - df2.set_index(['Code']))
and the result is
Code
1 -10.0 -4.0
1 -9.0 1.0
1 -8.0 6.0
2 -2.0 7.0
2 -1.0 12.0
2 0.0 17.0
3 -9.0 20.0
3 -8.0 25.0
3 -7.0 30.0
3 -6.0 35.0
4 NaN NaN
However, I only want the results for the rows that are in df1, not for keys that are missing from it (in this example, Code 4).
How do I do that, and then set the index back to the original one from df1?
Something like this, but it doesn't work:
df = (df1.set_index(['Code']) - df2.set_index(['Code'])).set_index(df1['Code'])
I would also like to keep the column headers.
Desired output:
Val1 Val2 Code
Date
2021-01-01 -10.0 -4.0 1
2021-01-02 -9.0 1.0 1
2021-01-03 -8.0 6.0 1
2021-01-04 -2.0 7.0 2
2021-01-05 -1.0 12.0 2
2021-01-06 0.0 17.0 2
2021-01-07 -9.0 20.0 3
2021-01-08 -8.0 25.0 3
2021-01-09 -7.0 30.0 3
2021-01-10 -6.0 35.0 3

If you only want the results for the rows that are in df1, and not the missing keys (4 in this example), use the dropna() method:
df = (df1.set_index(['Code']) - df2.set_index(['Code'])).dropna()
Then add the dates back as a column (this relies on the rows still being in the same order as df1):
df.insert(0,'Date',df1.index)
And finally:
df.reset_index(inplace=True)
df.set_index('Date',inplace=True)
Now if you print df you will get your desired output:
Code Val1 Val2
Date
2021-01-01 1 -10.0 -4.0
2021-01-02 1 -9.0 1.0
2021-01-03 1 -8.0 6.0
2021-01-04 2 -2.0 7.0
2021-01-05 2 -1.0 12.0
2021-01-06 2 0.0 17.0
2021-01-07 3 -9.0 20.0
2021-01-08 3 -8.0 25.0
2021-01-09 3 -7.0 30.0
2021-01-10 3 -6.0 35.0
Note: if this is not your desired output, let me know.
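For reference, the steps above can be collected into one runnable sketch. It assumes, as the answer does, that the subtraction keeps the rows in the same order as df1, so the dates can be attached by position. The length check is an added safeguard and not part of the original answer.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2021-01-01', periods=10, freq='D')
df1 = pd.DataFrame(index=rng, data={'Val1': range(10),
                                    'Val2': np.array(range(10)) * 5,
                                    'Code': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]})
df2 = pd.DataFrame(data={'Code': [1, 2, 3, 4],
                         'Val1': [10, 5, 15, 20],
                         'Val2': [4, 8, 10, 7]})

# Subtract on the shared 'Code' index; Code 4 has no partner in df1,
# so its row is all NaN and dropna() removes it.
diff = (df1.set_index('Code') - df2.set_index('Code')).dropna()

# Assumption: rows are still in df1's order, so dates can be attached by position.
assert len(diff) == len(df1), "row count changed; positional dates would be wrong"

diff.insert(0, 'Date', df1.index)
diff = diff.reset_index().set_index('Date')
print(diff)
```

Note that dropna() would also remove rows whose NaN comes from the data itself, so this only suits inputs without missing values.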

You can use reindex to align df2 to df1["Code"]. Then take the underlying NumPy ndarray and subtract it in place from the corresponding columns of df1. This leaves both the index and the "Code" column untouched and performs the subtraction as expected.
subtract_values = df2.set_index("Code").reindex(df1["Code"]).to_numpy()
df1[["Val1", "Val2"]] -= subtract_values
print(df1)
Val1 Val2 Code
2021-01-01 -10 -4 1
2021-01-02 -9 1 1
2021-01-03 -8 6 1
2021-01-04 -2 7 2
2021-01-05 -1 12 2
2021-01-06 0 17 2
2021-01-07 -9 20 3
2021-01-08 -8 25 3
2021-01-09 -7 30 3
2021-01-10 -6 35 3
If you don't want to change df1, you can copy the data to a new DataFrame via new_df = df1.copy() and proceed with new_df instead of df1.
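A minimal sketch of that non-mutating variant, using the question's data. The names value_cols, lookup and result are illustrative and not from the original answer.

```python
import numpy as np
import pandas as pd

rng = pd.date_range('2021-01-01', periods=10, freq='D')
df1 = pd.DataFrame(index=rng, data={'Val1': range(10),
                                    'Val2': np.array(range(10)) * 5,
                                    'Code': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]})
df2 = pd.DataFrame(data={'Code': [1, 2, 3, 4],
                         'Val1': [10, 5, 15, 20],
                         'Val2': [4, 8, 10, 7]})

value_cols = ['Val1', 'Val2']

# One row of df2 per row of df1, picked by df1's 'Code' (Code 4 is never requested).
lookup = df2.set_index('Code')[value_cols].reindex(df1['Code'])

# Work on a copy so df1 stays unchanged; subtract positionally via NumPy arrays.
result = df1.copy()
result[value_cols] = df1[value_cols].to_numpy() - lookup.to_numpy()
print(result)
```

If df1 contained a Code that is absent from df2, the corresponding lookup row would be NaN and so would the result for that row.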

Related

pandas Dataframe: Subtract a groupby mean of subset data from the full original data

I would like to subtract [a groupby mean of subset] from the [original] dataframe:
I have a pandas DataFrame (data_org below) with a monthly datetime index (say 100 years, i.e. 100 yr * 12 months) and 10 columns of station IDs (i.e., a 1200-row * 10-column pd.DataFrame).
1)
I would like to first take a subset of the above data, e.g. the first 50 years (i.e., 50 yr * 12 months),
data_sub = data_org[data_org.index.year <= top_50_year]
and calculate the monthly mean for each month and each station (column), e.g.,
mean_sub = data_sub.groupby(data_sub.index.month).mean()
or
mean_sub = data_sub.groupby(data_sub.index.month).transform('mean')
which seem to do the job.
2)
Now I want to subtract the above from the [original], NOT from the [subset], e.g.,
data_org - mean_sub
which I do not know how to do. In summary, I would like to calculate the monthly mean from a subset of the original data (e.g., using only 50 years) and subtract that monthly mean from the original data, month by month.
It was easy to subtract if I were using the full [original] data to calculate the mean (i.e., .transform('mean') or .apply(lambda x: x - x.mean()) do the job), but what should I do if the mean is calculated from a [subset] data?
Could you share your insight for this problem? Thank you in advance!
@mozway
The input (and also the output) shape looks like the following:
[Image: input shape with random values]
The output has the same shape; only its values are anomalies from the [subset]'s monthly mean. Thank you.
One idea is to replace the non-matching values with NaN using DataFrame.where. GroupBy.transform then returns a result with the same index as the original DataFrame, so the two can be subtracted directly:
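import numpy as np
import pandas as pd
np.random.seed(123)
# on pandas >= 2.2, use freq='3ME' ('M' is deprecated there)
data_org = pd.DataFrame(np.random.randint(10, size=(10,3)),
                        index=pd.date_range('2000-01-01',periods=10, freq='3M'))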
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data1 = data_org.where(data_org.index.to_series().dt.year <= top_50_year)
print (data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 NaN NaN NaN
2001-04-30 NaN NaN NaN
2001-07-31 NaN NaN NaN
2001-10-31 NaN NaN NaN
2002-01-31 NaN NaN NaN
2002-04-30 NaN NaN NaN
mean_data1 = data1.groupby(data1.index.month).transform('mean')
print (mean_data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 2.0 2.0 6.0
2001-04-30 1.0 3.0 9.0
2001-07-31 6.0 1.0 0.0
2001-10-31 1.0 9.0 0.0
2002-01-31 2.0 2.0 6.0
2002-04-30 1.0 3.0 9.0
df = data_org - mean_data1
print (df)
0 1 2
2000-01-31 0.0 0.0 0.0
2000-04-30 0.0 0.0 0.0
2000-07-31 0.0 0.0 0.0
2000-10-31 0.0 0.0 0.0
2001-01-31 -2.0 7.0 -3.0
2001-04-30 3.0 -3.0 -9.0
2001-07-31 -2.0 0.0 7.0
2001-10-31 2.0 -7.0 4.0
2002-01-31 5.0 0.0 -2.0
2002-04-30 7.0 -3.0 -2.0
Another idea with filtering:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10,3)),
                        index=pd.date_range('2000-01-01',periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]
print (data_sub)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
mean_sub = data_sub.groupby(data_sub.index.month).mean()
print (mean_sub)
0 1 2
1 2 2 6
4 1 3 9
7 6 1 0
10 1 9 0
Create new column m for months:
data_org['m'] = data_org.index.month
print (data_org)
0 1 2 m
2000-01-31 2 2 6 1
2000-04-30 1 3 9 4
2000-07-31 6 1 0 7
2000-10-31 1 9 0 10
2001-01-31 0 9 3 1
2001-04-30 4 0 0 4
2001-07-31 4 1 7 7
2001-10-31 3 2 4 10
2002-01-31 7 2 4 1
2002-04-30 8 0 7 4
Then mean_sub is joined onto this column with DataFrame.join:
mean_data1 = data_org[['m']].join(mean_sub, on='m')
print (mean_data1)
m 0 1 2
2000-01-31 1 2 2 6
2000-04-30 4 1 3 9
2000-07-31 7 6 1 0
2000-10-31 10 1 9 0
2001-01-31 1 2 2 6
2001-04-30 4 1 3 9
2001-07-31 7 6 1 0
2001-10-31 10 1 9 0
2002-01-31 1 2 2 6
2002-04-30 4 1 3 9
df = data_org - mean_data1
print (df)
0 1 2 m
2000-01-31 0 0 0 0
2000-04-30 0 0 0 0
2000-07-31 0 0 0 0
2000-10-31 0 0 0 0
2001-01-31 -2 7 -3 0
2001-04-30 3 -3 -9 0
2001-07-31 -2 0 7 0
2001-10-31 2 -7 4 0
2002-01-31 5 0 -2 0
2002-04-30 7 -3 -2 0
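A more compact variant of the filtering idea can be sketched by aligning the monthly means to the full index with reindex instead of a helper column. The station names st1 and st2, the values and the name last_base_year are made up for illustration.

```python
import numpy as np
import pandas as pd

# Three years of monthly data for two hypothetical stations.
idx = pd.date_range('2000-01-01', periods=36, freq='MS')
data_org = pd.DataFrame(np.arange(72, dtype=float).reshape(36, 2),
                        index=idx, columns=['st1', 'st2'])

last_base_year = 2001  # the subset covers 2000 and 2001 only

# Monthly climatology from the subset: one row per calendar month (1..12).
data_sub = data_org[data_org.index.year <= last_base_year]
mean_sub = data_sub.groupby(data_sub.index.month).mean()

# Repeat the right monthly row for every date of the full data, then subtract.
aligned = mean_sub.reindex(data_org.index.month).to_numpy()
anomalies = data_org - aligned
print(anomalies.head())
```

Converting to a NumPy array before subtracting avoids index alignment between the month numbers and the dates, which would otherwise produce NaN everywhere.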

Creating a difference of the data frame with 1 col in python

I am new to Python and trying to replicate things in Python that I do in Excel.
I want to take the difference between each of the columns (A, B, C) and the (fixed) mean column:
Unnamed:0 A B C mean Mean diffA
0 2020-08-28 1 6 11 6.0 -5.0
1 2020-08-29 2 7 12 7.0 -5.0
2 2020-08-30 3 8 13 8.0 -5.0
3 2020-08-31 4 9 14 9.0 -5.0
4 2020-09-01 5 10 15 10.0 -5.0
One way is to type in each column name manually and compute the difference, but is there a less manual way?
new_df['Mean diffA']=new_df['A']-new_df['mean']
You can subtract the mean from a range of columns:
diffs = new_df.loc[:, 'A':'C'].subtract(new_df['mean'], axis=0)
Then combine the differences and the original DataFrame:
new_df.join(diffs, rsuffix='_mean')
You have a few options to do this. I tried the following and it worked.
import pandas as pd
dt = {'START DATE':['2020-08-28','2020-08-29','2020-08-30',
                    '2020-08-31','2020-09-01'],
      'A':[1,2,3,4,5],
      'B':[6,7,8,9,10],
      'C':[11,12,13,14,15]}
df = pd.DataFrame(dt)
df['Mean'] = df.loc[:,'A':'C'].mean(axis=1)
df[['dA','dB','dC']] = df.loc[:, 'A':'C'].subtract(df['Mean'], axis=0)
print(df)
Or you can do something like this as well:
df[['dA','dB','dC']] = df.loc[:,'A':'C'] - df[['Mean','Mean','Mean']].values
print(df)
Both of these will provide the same output:
START DATE A B C Mean dA dB dC
0 2020-08-28 1 6 11 6.0 -5.0 0.0 5.0
1 2020-08-29 2 7 12 7.0 -5.0 0.0 5.0
2 2020-08-30 3 8 13 8.0 -5.0 0.0 5.0
3 2020-08-31 4 9 14 9.0 -5.0 0.0 5.0
4 2020-09-01 5 10 15 10.0 -5.0 0.0 5.0
The second option is not recommended. pandas provides the subtract method for exactly this, so prefer the first option.
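To get column names close to the question's 'Mean diffA' without typing each one, a small sketch can combine subtract (alias sub) with add_prefix. The prefix text and the name value_cols are illustrative choices.

```python
import pandas as pd

new_df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [6, 7, 8, 9, 10],
    'C': [11, 12, 13, 14, 15],
})
value_cols = ['A', 'B', 'C']
new_df['mean'] = new_df[value_cols].mean(axis=1)

# Row-wise subtraction of the mean, then rename A -> 'Mean diffA', etc.
diffs = new_df[value_cols].sub(new_df['mean'], axis=0).add_prefix('Mean diff')
result = new_df.join(diffs)
print(result)
```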

Is there a way to apply a function to a MultiIndex dataframe slice with the same outer index without iterating each slice?

Basically, what I'm trying to accomplish is to fill the missing dates (creating new DataFrame rows) with respect to each product, then create a new column based on a cumulative sum of column 'A' (example shown below)
The data is a MultiIndex with (product, date) as indexes.
Basically I would like to apply this answer to a MultiIndex DataFrame using only the rightmost index and calculating a subsequent np.cumsum for each product (and all dates).
A
product date
0 2017-01-02 1
2017-01-03 2
2017-01-04 2
2017-01-05 1
2017-01-06 4
2017-01-07 1
2017-01-10 7
1 2018-06-29 1
2018-06-30 4
2018-07-01 1
2018-07-02 1
2018-07-04 2
What I want to accomplish (efficiently) is:
A CumSum
product date
0 2017-01-02 1 1
2017-01-03 2 3
2017-01-04 2 5
2017-01-05 1 6
2017-01-06 4 10
2017-01-07 1 11
2017-01-08 0 11
2017-01-09 0 11
2017-01-10 7 18
1 2018-06-29 1 1
2018-06-30 4 5
2018-07-01 1 6
2018-07-02 1 7
2018-07-03 0 7
2018-07-04 2 9
You have 2 ways:
One way:
Use groupby with apply, together with resample and cumsum. Finally, pd.concat the result with df.A and fillna with 0:
s = (df.reset_index(0).groupby('product').apply(lambda x: x.resample(rule='D')
.asfreq(0).A.cumsum()))
pd.concat([df.A, s.rename('cumsum')], axis=1).fillna(0)
Out[337]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
Another way:
You need 2 groupbys: the first one for resample, the second one for cumsum. Finally, use pd.concat and fillna with 0:
s1 = df.reset_index(0).groupby('product').resample(rule='D').asfreq(0).A
pd.concat([df.A, s1.groupby(level=0).cumsum().rename('cumsum')], axis=1).fillna(0)
Out[351]:
A cumsum
product date
0 2017-01-02 1.0 1
2017-01-03 2.0 3
2017-01-04 2.0 5
2017-01-05 1.0 6
2017-01-06 4.0 10
2017-01-07 1.0 11
2017-01-08 0.0 11
2017-01-09 0.0 11
2017-01-10 7.0 18
1 2018-06-29 1.0 1
2018-06-30 4.0 5
2018-07-01 1.0 6
2018-07-02 1.0 7
2018-07-03 0.0 7
2018-07-04 2.0 9
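A reduced sketch of the same idea on a small made-up frame. It swaps asfreq(0) for sum() when resampling, which fills missing days with 0 as well but is only equivalent when each product has at most one row per day.

```python
import pandas as pd

raw = pd.DataFrame({
    'product': [0, 0, 0, 1, 1],
    'date': pd.to_datetime(['2017-01-02', '2017-01-03', '2017-01-05',
                            '2018-06-29', '2018-07-01']),
    'A': [1, 2, 4, 1, 3],
})
df = raw.set_index(['product', 'date'])

# Resample each product to daily frequency; days without data sum to 0.
daily = (df.reset_index('product')
           .groupby('product')['A']
           .resample('D')
           .sum())

# Cumulative sum restarts for every product.
filled = daily.to_frame('A')
filled['CumSum'] = filled.groupby(level='product')['A'].cumsum()
print(filled)
```

Because the missing days are created before the cumulative sum, no concat or fillna step is needed afterwards.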

Computing the difference between first and last values in a rolling window

I am using the Pandas rolling window tool on a one-column dataframe whose index is in datetime form.
I would like to compute, for each window, the difference between the first value and the last value of said window. How do I refer to the relative index when giving a lambda function? (in the brackets below)
df2 = df.rolling('3s').apply(...)
IIUC:
In [93]: df = pd.DataFrame(np.random.randint(10,size=(9, 3)))
In [94]: df
Out[94]:
0 1 2
0 7 4 5
1 9 9 3
2 1 7 6
3 0 9 2
4 2 3 7
5 6 7 1
6 1 0 1
7 8 4 7
8 0 0 9
In [95]: df.rolling(window=3).apply(lambda x: x[0]-x[-1], raw=True)
Out[95]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 6.0 -3.0 -1.0
3 9.0 0.0 1.0
4 -1.0 4.0 -1.0
5 -6.0 2.0 1.0
6 1.0 3.0 6.0
7 -2.0 3.0 -6.0
8 1.0 0.0 -8.0
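The same lambda also works with a time-based window such as the '3s' from the question. The sketch below uses made-up timestamps and values, and passes raw=True so that each window arrives as a NumPy array and positional indexing with [0] and [-1] is safe.

```python
import pandas as pd

# Irregularly spaced timestamps (seconds 0, 1, 2, 4, 5, 9) with made-up values.
idx = pd.to_datetime(['2021-01-01 00:00:00', '2021-01-01 00:00:01',
                      '2021-01-01 00:00:02', '2021-01-01 00:00:04',
                      '2021-01-01 00:00:05', '2021-01-01 00:00:09'])
s = pd.Series([10.0, 12.0, 15.0, 11.0, 20.0, 7.0], index=idx, name='value')

# Each window covers the 3 seconds up to and including the current row.
first_minus_last = s.rolling('3s').apply(lambda w: w[0] - w[-1], raw=True)
print(first_minus_last)
```

With a time-based window the number of rows per window varies, and a window holding a single row yields 0.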

Removing incomplete seasons from multi-index dataframe (pandas)

I am trying to apply the method from here to a multi-index dataframe, but it doesn't seem to work.
Take a data-frame:
import pandas as pd
import numpy as np
dates = pd.date_range('20070101',periods=3200)
df = pd.DataFrame(data=np.random.randint(0,100,(3200,1)).astype(float), columns =list('A'))
df.loc[5:13, 'A'] = np.nan #add missing data points (rows 5 to 13 inclusive)
df['date'] = dates
df = df[['date','A']]
Define a season function based on the date column
def get_season(row):
    if row['date'].month >= 3 and row['date'].month <= 5:
        return '2'
    elif row['date'].month >= 6 and row['date'].month <= 8:
        return '3'
    elif row['date'].month >= 9 and row['date'].month <= 11:
        return '4'
    else:
        return '1'
Apply the function
df['Season'] = df.apply(get_season, axis=1)
Create a 'Year' column for indexing
df['Year'] = df['date'].dt.year
Multi-index by Year and Season
df = df.set_index(['Year', 'Season'], inplace=False)
Count datapoints in each season
count = df.groupby(level=[0, 1]).count()
Drop the seasons with less than 75 days in them
count = count.drop(count[count.A < 75].index)
Create a variable for seasons with at least 75 days
complete = count[count['A'] >= 75].index
Using the isin function returns False for everything, whereas I want it to select all the seasons that have at least 75 days of valid data in 'A'
df = df.isin(complete)
df
Every value comes up false, and I can't see why.
I hope this is concise enough, I need this to work on a multi-index using seasons so I included it!
EDIT
Another method, based on multi-index reindexing (from here), is not working either; it also produces a blank dataframe:
df3 = df.reset_index().groupby('Year').apply(lambda x: x.set_index('Season').reindex(count,method='pad'))
EDIT 2
Also tried this
seasons = count[count['A'] >= 75].index
df = df[df['A'].isin(seasons)]
Again, blank output
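DataFrame.isin tests the values in the cells, not the index labels, so it never matches the (Year, Season) tuples. I think you can use Index.isin instead: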
complete = count[count['A'] >= 75].index
idx = df.index.isin(complete)
print(idx)
[ True True True ..., False False False]
print(df[idx])
date A
Year Season
2007 1 2007-01-01 24.0
1 2007-01-02 92.0
1 2007-01-03 54.0
1 2007-01-04 91.0
1 2007-01-05 91.0
1 2007-01-06 NaN
1 2007-01-07 NaN
1 2007-01-08 NaN
1 2007-01-09 NaN
1 2007-01-10 NaN
1 2007-01-11 NaN
1 2007-01-12 NaN
1 2007-01-13 NaN
1 2007-01-14 NaN
1 2007-01-15 18.0
1 2007-01-16 82.0
1 2007-01-17 55.0
1 2007-01-18 64.0
1 2007-01-19 89.0
1 2007-01-20 37.0
1 2007-01-21 45.0
1 2007-01-22 4.0
1 2007-01-23 34.0
1 2007-01-24 35.0
1 2007-01-25 90.0
1 2007-01-26 17.0
1 2007-01-27 29.0
1 2007-01-28 58.0
1 2007-01-29 7.0
1 2007-01-30 57.0
... ... ...
2015 3 2015-08-02 42.0
3 2015-08-03 0.0
3 2015-08-04 31.0
3 2015-08-05 39.0
3 2015-08-06 25.0
3 2015-08-07 1.0
3 2015-08-08 7.0
3 2015-08-09 97.0
3 2015-08-10 38.0
3 2015-08-11 59.0
3 2015-08-12 28.0
3 2015-08-13 84.0
3 2015-08-14 43.0
3 2015-08-15 63.0
3 2015-08-16 68.0
3 2015-08-17 0.0
3 2015-08-18 19.0
3 2015-08-19 61.0
3 2015-08-20 11.0
3 2015-08-21 84.0
3 2015-08-22 75.0
3 2015-08-23 37.0
3 2015-08-24 40.0
3 2015-08-25 66.0
3 2015-08-26 50.0
3 2015-08-27 74.0
3 2015-08-28 37.0
3 2015-08-29 19.0
3 2015-08-30 25.0
3 2015-08-31 15.0
[3106 rows x 2 columns]
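An alternative that skips the separate count table can be sketched with groupby(...).transform('count'), which gives every row the number of valid 'A' values in its own (Year, Season) group. The data below is made up and shorter than in the question, and the season is computed with a vectorized expression that yields integers 1 to 4 rather than the strings returned by get_season.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2007-01-01', periods=400, freq='D')
df = pd.DataFrame({'date': dates, 'A': np.arange(400, dtype=float)})
df.loc[5:40, 'A'] = np.nan  # 36 missing days in winter 2007

# Dec/Jan/Feb -> 1, Mar-May -> 2, Jun-Aug -> 3, Sep-Nov -> 4
df['Season'] = df['date'].dt.month % 12 // 3 + 1
df['Year'] = df['date'].dt.year

# Per-row count of non-NaN 'A' values within the row's (Year, Season) group.
valid_counts = df.groupby(['Year', 'Season'])['A'].transform('count')

# Keep only rows that belong to seasons with at least 75 valid days.
complete = df[valid_counts >= 75].set_index(['Year', 'Season'])
print(complete.index.unique())
```

Here winter 2007 (54 valid days) and the partial winter 2008 (35 days) are dropped, while the three full seasons of 2007 remain.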
