I have many dataframes of equal length and with equal Datetime indexes:
Date OPP
0 2008-01-04 0.0
1 2008-02-04 0.0
2 2008-03-04 0.0
3 2008-04-04 0.0
4 2008-05-04 0.0
5 2008-06-04 0.0
6 2008-07-04 393.75
7 2008-08-04 -168.75
8 2008-09-04 -656.25
9 2008-10-04 -1631.25
Date OPP
0 2008-01-04 750.0
1 2008-02-04 0.0
2 2008-03-04 150.0
3 2008-04-04 600.0
4 2008-05-04 0.0
5 2008-06-04 0.0
6 2008-07-04 0.0
7 2008-08-04 -250.0
8 2008-09-04 1000.0
9 2008-10-04 0.0
I need to create a unique dataframe that sums all the OPP columns from many dataframes. This can easily be done like this:
df3 = pd.DataFrame()
df3["OPP"] = df1["OPP"] + df2["OPP"]
df3["Date"] = df1["Date"]
This works as long as all the dataframes have the same length and the same Date index.
How can I make it work even if these conditions aren't met? What if I had another dataframe like this:
Date OPP
0 2008-07-04 393.75
1 2008-08-04 -168.75
2 2008-09-04 -656.25
3 2008-10-04 -1631.25
4 2008-11-04 -675.00
5 2008-12-04 0.00
I could do this manually: find the earliest and the latest date across all the dfs, fill every df with all the dates and zeroes so that they'd all have equal length... and then proceed with a simple sum, roughly like the sketch below.
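A rough sketch of that manual approach, assuming the frames are called df1, df2 and df3 and each has a Date and an OPP column:
import pandas as pd

# Align every frame to the union of all dates, pad the gaps with 0, then sum.
series = [df.set_index("Date")["OPP"] for df in (df1, df2, df3)]
all_dates = series[0].index.union(series[1].index).union(series[2].index)
aligned = [s.reindex(all_dates, fill_value=0) for s in series]
total = sum(aligned).rename("OPP").reset_index()  # back to a Date / OPP frame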
But, is there a way to do this automatically in Pandas?
Following this answer's method, we can use functools.reduce for this.
What's left is to sum over axis=1:
import pandas as pd
from functools import reduce

dfs = [df1, df2, df3]
df_final = reduce(lambda left, right: pd.merge(left, right, on='Date', how='left'), dfs)
Which gives us:
Date OPP_x OPP_y OPP
0 2008-01-04 0.00 750.0 NaN
1 2008-02-04 0.00 0.0 NaN
2 2008-03-04 0.00 150.0 NaN
3 2008-04-04 0.00 600.0 NaN
4 2008-05-04 0.00 0.0 NaN
5 2008-06-04 0.00 0.0 NaN
6 2008-07-04 393.75 0.0 393.75
7 2008-08-04 -168.75 -250.0 -168.75
8 2008-09-04 -656.25 1000.0 -656.25
9 2008-10-04 -1631.25 0.0 -1631.25
Then we sum:
df_final.iloc[:, 1:].sum(axis=1)
0 750.0
1 0.0
2 150.0
3 600.0
4 0.0
5 0.0
6 787.5
7 -587.5
8 -312.5
9 -3262.5
dtype: float64
Or as a new column:
df_final['sum'] = df_final.iloc[:, 1:].sum(axis=1)
Date OPP_x OPP_y OPP sum
0 2008-01-04 0.00 750.0 NaN 750.0
1 2008-02-04 0.00 0.0 NaN 0.0
2 2008-03-04 0.00 150.0 NaN 150.0
3 2008-04-04 0.00 600.0 NaN 600.0
4 2008-05-04 0.00 0.0 NaN 0.0
5 2008-06-04 0.00 0.0 NaN 0.0
6 2008-07-04 393.75 0.0 393.75 787.5
7 2008-08-04 -168.75 -250.0 -168.75 -587.5
8 2008-09-04 -656.25 1000.0 -656.25 -312.5
9 2008-10-04 -1631.25 0.0 -1631.25 -3262.5
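One caveat: with how='left' only the dates from the first dataframe in the list survive, so 2008-11-04 and 2008-12-04 from the third frame are dropped. If every date should be kept, an outer merge is a straightforward variation (a sketch of the same idea, not verified beyond the frames shown here):
df_final = reduce(lambda left, right: pd.merge(left, right, on='Date', how='outer'), dfs)
df_final['sum'] = df_final.iloc[:, 1:].sum(axis=1)  # sum skips the NaNs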
Use a list comprehension to create Series with a DatetimeIndex, then join them together with concat and sum:
dfs = [df1, df2]
compr = [x.set_index('Date')['OPP'] for x in dfs]
df1 = pd.concat(compr, axis=1).sum(axis=1).reset_index(name='OPP')
print (df1)
Date OPP
0 2008-01-04 750.00
1 2008-02-04 0.00
2 2008-03-04 150.00
3 2008-04-04 600.00
4 2008-05-04 0.00
5 2008-06-04 0.00
6 2008-07-04 393.75
7 2008-08-04 -418.75
8 2008-09-04 343.75
9 2008-10-04 -1631.25
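Because concat aligns on the index, the same pattern also copes with frames covering different date ranges; for example, with the original df1 and df2 plus the third frame from the question (call it df3), missing dates simply become NaN and sum(axis=1) skips them. A sketch:
compr = [x.set_index('Date')['OPP'] for x in (df1, df2, df3)]
out = pd.concat(compr, axis=1).sum(axis=1).reset_index(name='OPP')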
You can simply concat them and sum within a groupby on Date:
(pd.concat((df1,df2,df3))
.groupby('Date', as_index=False)
.sum()
)
Output for your three sample dataframes:
Date OPP
0 2008-01-04 750.0
1 2008-02-04 0.0
2 2008-03-04 150.0
3 2008-04-04 600.0
4 2008-05-04 0.0
5 2008-06-04 0.0
6 2008-07-04 787.5
7 2008-08-04 -587.5
8 2008-09-04 -312.5
9 2008-10-04 -3262.5
10 2008-11-04 -675.0
11 2008-12-04 0.0
What I have:
date percentage
0 2022-04-08 20.0
1 2022-04-09 0.0
2 2022-04-10 0.0
3 2022-04-11 0.0
4 2022-04-12 10.0
5 2022-04-13 0.0
6 2022-04-14 0.0
date percentage
0 2022-04-08 0.0
1 2022-04-09 0.0
2 2022-04-10 0.0
3 2022-04-11 0.0
4 2022-04-12 0.0
5 2022-04-13 0.0
6 2022-04-14 0.0
date percentage
0 2022-04-08 100.0
1 2022-04-09 0.0
2 2022-04-10 0.0
3 2022-04-11 0.0
4 2022-04-12 0.0
5 2022-04-13 0.0
6 2022-04-14 0.0
date percentage
0 2022-04-08 0.0
1 2022-04-09 0.0
2 2022-04-10 0.0
3 2022-04-11 0.0
4 2022-04-12 18.0
5 2022-04-13 0.0
6 2022-04-14 0.0
date percentage
0 2022-04-08 70.0
1 2022-04-09 0.0
2 2022-04-10 0.0
3 2022-04-11 0.0
4 2022-04-12 77.0
5 2022-04-13 0.0
6 2022-04-14 0.0
What I expect:
date percentage
0 2022-04-08 20.0
1 2022-04-12 10.0
date percentage
0 2022-04-08 100.0
date percentage
0 2022-04-12 18.0
date percentage
0 2022-04-08 70.0
1 2022-04-12 77.0
I want to select only the rows that have values on those days, i.e. remove the rows that have a value of 0. I use a for loop to go through all the elements, after which I append them to a list.
Try this:
df[df['percentage'] > 0]
It seems you need to filter the rows inside a list comprehension:
L = [df[df['percentage'].ne(0)] for df in dfs]
To get values that are nonzero, you can simply do df = df[df["percentage"] != 0]. If your date column is a datetime data type, you can filter by days with df = df[df["date"].dt.day.isin([8, 12])]. If not, and you do not want to convert it, you will need to work with the date strings directly, which is a bit more cumbersome.
split_date = df["date"].str.split("-", expand=True)
df = df[split_date[2].isin(["08", "12"])]
Here the 2 in the last line selects the day part, i.e. the last of the columns returned by the split.
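A minimal, self-contained sketch of the datetime route (the frame below is made up to mirror the shape of the data in the question):
import pandas as pd

# Hypothetical frame shaped like the ones in the question.
df = pd.DataFrame({
    "date": ["2022-04-08", "2022-04-09", "2022-04-12"],
    "percentage": [20.0, 0.0, 10.0],
})
df["date"] = pd.to_datetime(df["date"])   # make it a real datetime column
df = df[df["percentage"] != 0]            # drop the zero rows
df = df[df["date"].dt.day.isin([8, 12])]  # keep only the 8th and the 12th
print(df)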
I would like to subtract [a groupby mean of subset] from the [original] dataframe:
I have a pandas DataFrame data_org whose index is a datetime (monthly, say 100 years = 100 yr * 12 mn) and whose 10 columns are station IDs (i.e., a 1200 row * 10 col DataFrame).
1)
I would like to first take a subset of the above data, e.g. the top 50 years (i.e., 50 yr * 12 mn),
data_sub = data_org[data_org.index.year <= top_50_year]
and calculate the monthly mean for each month for each station (column), e.g.,
mean_sub = data_sub.groupby(data_sub.index.month).mean()
or
mean_sub = data_sub.groupby(data_sub.index.month).transform('mean')
which seem to do the job.
2)
Now I want to subtract the above from the [original], NOT from the [subset], e.g.,
data_org - mean_sub
which I do not know how to do. So in summary, I would like to calculate a monthly mean from a subset of the original data (e.g., using only 50 years), and subtract that monthly mean from the original data month by month.
It was easy to subtract when using the full [original] data to calculate the mean (i.e., .transform('mean') or .apply(lambda x: x - x.mean()) do the job), but what should I do if the mean is calculated from a [subset] of the data?
Could you share your insight for this problem? Thank you in advance!
@mozway
The input (and also the output) shape looks like the following:
[image: input shape with random values]
Only the values of the output are anomalies from the [subset]'s monthly mean. Thank you.
One idea is to replace non-matched values with NaN using DataFrame.where, so that after GroupBy.transform you get the same index as the original DataFrame and the subtraction is possible:
import numpy as np
import pandas as pd

np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10, 3)),
                        index=pd.date_range('2000-01-01', periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data1 = data_org.where(data_org.index.to_series().dt.year <= top_50_year)
print (data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 NaN NaN NaN
2001-04-30 NaN NaN NaN
2001-07-31 NaN NaN NaN
2001-10-31 NaN NaN NaN
2002-01-31 NaN NaN NaN
2002-04-30 NaN NaN NaN
mean_data1 = data1.groupby(data1.index.month).transform('mean')
print (mean_data1)
0 1 2
2000-01-31 2.0 2.0 6.0
2000-04-30 1.0 3.0 9.0
2000-07-31 6.0 1.0 0.0
2000-10-31 1.0 9.0 0.0
2001-01-31 2.0 2.0 6.0
2001-04-30 1.0 3.0 9.0
2001-07-31 6.0 1.0 0.0
2001-10-31 1.0 9.0 0.0
2002-01-31 2.0 2.0 6.0
2002-04-30 1.0 3.0 9.0
df = data_org - mean_data1
print (df)
0 1 2
2000-01-31 0.0 0.0 0.0
2000-04-30 0.0 0.0 0.0
2000-07-31 0.0 0.0 0.0
2000-10-31 0.0 0.0 0.0
2001-01-31 -2.0 7.0 -3.0
2001-04-30 3.0 -3.0 -9.0
2001-07-31 -2.0 0.0 7.0
2001-10-31 2.0 -7.0 4.0
2002-01-31 5.0 0.0 -2.0
2002-04-30 7.0 -3.0 -2.0
Another idea with filtering:
np.random.seed(123)
data_org = pd.DataFrame(np.random.randint(10, size=(10,3)),
index=pd.date_range('2000-01-01',periods=10, freq='3M'))
print (data_org)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
2001-01-31 0 9 3
2001-04-30 4 0 0
2001-07-31 4 1 7
2001-10-31 3 2 4
2002-01-31 7 2 4
2002-04-30 8 0 7
top_50_year = 2000
data_sub = data_org[data_org.index.year <= top_50_year]
print (data_sub)
0 1 2
2000-01-31 2 2 6
2000-04-30 1 3 9
2000-07-31 6 1 0
2000-10-31 1 9 0
mean_sub = data_sub.groupby(data_sub.index.month).mean()
print (mean_sub)
0 1 2
1 2 2 6
4 1 3 9
7 6 1 0
10 1 9 0
Create a new column m for the months:
data_org['m'] = data_org.index.month
print (data_org)
0 1 2 m
2000-01-31 2 2 6 1
2000-04-30 1 3 9 4
2000-07-31 6 1 0 7
2000-10-31 1 9 0 10
2001-01-31 0 9 3 1
2001-04-30 4 0 0 4
2001-07-31 4 1 7 7
2001-10-31 3 2 4 10
2002-01-31 7 2 4 1
2002-04-30 8 0 7 4
Then mean_sub is merged on this column with DataFrame.join:
mean_data1 = data_org[['m']].join(mean_sub, on='m')
print (mean_data1)
m 0 1 2
2000-01-31 1 2 2 6
2000-04-30 4 1 3 9
2000-07-31 7 6 1 0
2000-10-31 10 1 9 0
2001-01-31 1 2 2 6
2001-04-30 4 1 3 9
2001-07-31 7 6 1 0
2001-10-31 10 1 9 0
2002-01-31 1 2 2 6
2002-04-30 4 1 3 9
df = data_org - mean_data1
print (df)
0 1 2 m
2000-01-31 0 0 0 0
2000-04-30 0 0 0 0
2000-07-31 0 0 0 0
2000-10-31 0 0 0 0
2001-01-31 -2 7 -3 0
2001-04-30 3 -3 -9 0
2001-07-31 -2 0 7 0
2001-10-31 2 -7 4 0
2002-01-31 5 0 -2 0
2002-04-30 7 -3 -2 0
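If the helper column m is not wanted in the final result, it can be dropped afterwards:
df = df.drop(columns='m')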
Objective
I have this df and compute some ratios from it below. I want to calculate these ratios for each id and datadate, and I believe the groupby function is the way to go; however, I am not exactly sure. Any help would be super!
df
id datadate dltt ceq ... pstk icapt dlc sale
1 001004 1975-02-28 3.0 193.0 ... 1.012793 1 0.20 7.367237
2 001004 1975-05-31 4.0 197.0 ... 1.249831 1 0.21 8.982741
3 001004 1975-08-31 5.0 174.0 ... 1.142086 2 0.24 8.115609
4 001004 1975-11-30 8.0 974.0 ... 1.400673 3 0.26 9.944990
5 001005 1975-02-28 3.0 191.0 ... 1.012793 4 0.25 7.367237
6 001005 1975-05-31 3.0 971.0 ... 1.249831 5 0.26 8.982741
7 001005 1975-08-31 2.0 975.0 ... 1.142086 6 0.27 8.115609
8 001005 1975-11-30 1.0 197.0 ... 1.400673 3 0.27 9.944990
9 001006 1975-02-28 3.0 974.0 ... 1.012793 2 0.28 7.367237
10 001006 1975-05-31 4.0 74.0 ... 1.249831 1 0.21 8.982741
11 001006 1975-08-31 5.0 75.0 ... 1.142086 3 0.23 8.115609
12 001006 1975-11-30 5.0 197.0 ... 1.400673 4 0.24 9.944990
Example of ratios
df['capital_ratioa'] = df['dltt']/(df['dltt']+df['ceq']+df['pstk'])
df['equity_invcapa'] = df['ceq']/df['icapt']
df['debt_invcapa'] = df['dltt']/df['icapt']
df['sale_invcapa']=df['sale']/df['icapt']
df['totdebt_invcapa']=(df['dltt']+df['dlc'])/df['icapt']
Is this what you're looking for?
df = df.groupby(by=['id'], as_index=False).sum()
df['capital_ratioa'] = df['dltt']/(df['dltt']+df['ceq']+df['pstk'])
df['equity_invcapa'] = df['ceq']/df['icapt']
df['debt_invcapa'] = df['dltt']/df['icapt']
df['sale_invcapa']=df['sale']/df['icapt']
df['totdebt_invcapa']=(df['dltt']+df['dlc'])/df['icapt']
print(df)
Output:
id dltt ceq pstk icapt dlc sale capital_ratioa equity_invcapa debt_invcapa sale_invcapa totdebt_invcapa
0 1004 20.0 1538.0 4.805383 7 0.91 34.410577 0.012797 219.714286 2.857143 4.915797 2.987143
1 1005 9.0 2334.0 4.805383 18 1.05 34.410577 0.003833 129.666667 0.500000 1.911699 0.558333
2 1006 17.0 1320.0 4.805383 10 0.96 34.410577 0.012669 132.000000 1.700000 3.441058 1.796000
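If the ratios are really needed per id and datadate rather than aggregated per id, note that each (id, datadate) pair is a single row in the sample data, so the row-wise formulas from the question already give that result; grouping on both keys would look roughly like this (a sketch, shown with only the first two ratios):
df = df.groupby(by=['id', 'datadate'], as_index=False).sum()
df['capital_ratioa'] = df['dltt'] / (df['dltt'] + df['ceq'] + df['pstk'])
df['equity_invcapa'] = df['ceq'] / df['icapt']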
I have a problem with pandas interpolate(). I only want to interpolate when there are no more than 2 successive NaNs.
But the interpolate function still fills some values even inside runs of more than 2 NaNs!?
import numpy as np
import pandas as pd

s = pd.Series(data=[np.nan, 10, np.nan, np.nan, np.nan, 5, np.nan, 6, np.nan, np.nan, 30])
a = s.interpolate(limit=2, limit_area='inside')
print(a)
the output I get is:
0 NaN
1 10.00
2 8.75
3 7.50
4 NaN
5 5.00
6 5.50
7 6.00
8 14.00
9 22.00
10 30.00
dtype: float64
I do not want the results in rows 2 and 3.
What I want is:
0 NaN
1 10.00
2 NaN
3 NaN
4 NaN
5 5.00
6 5.50
7 6.00
8 14.00
9 22.00
10 30.00
dtype: float64
Can anybody please help?
Use GroupBy.transform with Series.where. s_notna.cumsum() groups each run of NaNs together with the valid value that precedes it, so a group of size at most 3 corresponds to a run of at most 2 NaNs; the mask m keeps those positions (plus all valid values) and hides the rest:
s_notna = s.notna()
m = (s.groupby(s_notna.cumsum()).transform('size').le(3) | s_notna)
s = s.interpolate(limit_area='inside').where(m)
print(s)
Output
0 NaN
1 10.0
2 NaN
3 NaN
4 NaN
5 5.0
6 5.5
7 6.0
8 14.0
9 22.0
10 30.0
dtype: float64
I have a dataframe that contains temperature readings from different areas on different dates. I want to add the missing dates for each location with zero temperature.
For example:
import pandas as pd

df = pd.DataFrame({"area_id": [1, 1, 1, 2, 2, 2, 3, 3, 3],
                   "reading_date": ["13/1/2017", "15/1/2017", "16/1/2017",
                                    "22/3/2017", "26/3/2017", "28/3/2017",
                                    "15/5/2017", "16/5/2017", "18/5/2017"],
                   "temp": [12, 15, 22, 6, 14, 8, 30, 25, 33]})
What is the most efficient way to fill the date gaps per area (with zeros), as shown below?
Many thanks.
Use:
first convert the reading_date column to datetime with to_datetime
set_index to get a DatetimeIndex, then groupby with resample
select the temp Series and add asfreq
replace the NaNs with fillna
last add reset_index to get the columns back from the MultiIndex
df['reading_date'] = pd.to_datetime(df['reading_date'])
df = (df.set_index('reading_date')
.groupby('area_id')
.resample('d')['temp']
.asfreq()
.fillna(0)
.reset_index())
print (df)
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0
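A small side note: asfreq introduces NaNs before fillna, so temp comes back as float; if integer temperatures are preferred, a cast at the end should do (assuming all values are whole numbers):
df['temp'] = df['temp'].astype(int)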
Using reindex. Define a custom function to handle the reindexing operation, and call it inside groupby.apply.
def reindex(x):
# Thanks to #jezrael for the improvement.
return x.reindex(pd.date_range(x.index.min(), x.index.max()), fill_value=0)
Next, convert reading_date to datetime using pd.to_datetime:
df.reading_date = pd.to_datetime(df.reading_date)
Now, perform a groupby.
df = (
df.set_index('reading_date')
.groupby('area_id')
.temp
.apply(reindex)
.reset_index()
)
df.columns = ['area_id', 'reading_date', 'temp']
df
area_id reading_date temp
0 1 2017-01-13 12.0
1 1 2017-01-14 0.0
2 1 2017-01-15 15.0
3 1 2017-01-16 22.0
4 2 2017-03-22 6.0
5 2 2017-03-23 0.0
6 2 2017-03-24 0.0
7 2 2017-03-25 0.0
8 2 2017-03-26 14.0
9 2 2017-03-27 0.0
10 2 2017-03-28 8.0
11 3 2017-05-15 30.0
12 3 2017-05-16 25.0
13 3 2017-05-17 0.0
14 3 2017-05-18 33.0