I'm looking to create a rolling grouped cumulative sum across two dataframes. I can get the result via iteration, but wanted to see if there was a more intelligent way.
I need the 5 row block of A to roll through the rows of B and accumulate. Think of it as rolling balance with a block of contributions and rolling returns.
So, here's the calculation for C:
    A               B
1   100.00      1   0.01                                         101.00
2   110.00      2   0.02                                215.22   102.00
3   120.00      3   0.03                       345.28   218.36   103.00
4   130.00      4   0.04              494.29   351.89   221.52   104.00
5   140.00      5   0.05     666.00   505.99   358.60   224.70   105.00
                6   0.06     684.75   517.91   365.38   227.90   106.00
                7   0.07     703.97   530.06   372.25   231.12
                8   0.08     723.66   542.43   379.21
                9   0.09     743.85   555.04
               10   0.10     764.54
C Row 5
Beginning Balance   Contribution   Return   Ending Balance
             0.00         100.00     0.01           101.00
           101.00         110.00     0.02           215.22
           215.22         120.00     0.03           345.28
           345.28         130.00     0.04           494.29
           494.29         140.00     0.05           666.00
C Row 6
Beginning Balance   Contribution   Return   Ending Balance
             0.00         100.00     0.02           102.00
           102.00         110.00     0.03           218.36
           218.36         120.00     0.04           351.89
           351.89         130.00     0.05           505.99
           505.99         140.00     0.06           684.75
Here's what the source data looks like:
    A              B
1   100.00     1   0.01
2   110.00     2   0.02
3   120.00     3   0.03
4   130.00     4   0.04
5   140.00     5   0.05
               6   0.06
               7   0.07
               8   0.08
               9   0.09
              10   0.10
Here is the desired result:
C
1 NaN
2 NaN
3 NaN
4 NaN
5 666.00
6 684.75
7 703.97
8 723.66
9 743.85
10 764.54
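A vectorized alternative (a sketch with NumPy; it assumes A is exactly the 5-row block and B has at least 5 rows): the ending balance of a window is the sum over k of A[k] times the product of (1 + return) for the remaining returns in the window, so taking suffix products of each rolling window of (1 + B) and dotting with A gives C directly.

```python
import numpy as np

A = np.array([100.0, 110.0, 120.0, 130.0, 140.0])
r = np.array([0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.10])

# every length-5 window of growth factors, shape (6, 5)
W = 1 + np.lib.stride_tricks.sliding_window_view(r, 5)

# suffix product within each window: growth[w, k] = prod(W[w, k:])
growth = np.cumprod(W[:, ::-1], axis=1)[:, ::-1]

# C[w] = sum_k A[k] * growth[w, k]  -> ending balance for each window
C = growth @ A
print(C.round(2))  # first two values: 666.00, 684.75
```

The six values correspond to rows 5-10 of the desired result; prefix four NaNs to match the index.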
I have a weather dataset with many years of data, as below:
Date          rainfallInMon
2009-01-01    0.0
2009-01-02    0.03
2009-01-03    0.05
2009-01-04    0.05
2009-01-05    0.06
...
2009-01-29    0.2
2009-01-30    0.21
2009-01-31    0.21
2009-02-01    0.0
2009-02-02    0.0
...
I am trying to get the daily rainfall by subtracting the previous day's cumulative value within each month. For example:
Date          rainfallDaily
2009-01-01    0.0
2009-01-02    0.03
2009-01-03    0.02
...
2009-01-29    0.01
2009-01-30    0.0
...
Thanks for your efforts in advance.
Because there are many years of data, use Series.dt.to_period to get month periods, which distinguishes the same month across different years:
df['rainfallDaily'] = (df.groupby(df['Date'].dt.to_period('m'))['rainfallInMon']
.diff()
.fillna(0))
Or use Grouper:
df['rainfallDaily'] = (df.groupby(pd.Grouper(freq='M',key='Date'))['rainfallInMon']
.diff()
.fillna(0))
print (df)
Date rainfallInMon rainfallDaily
0 2009-01-01 0.00 0.00
1 2009-01-02 0.03 0.03
2 2009-01-03 0.05 0.02
3 2009-01-04 0.05 0.00
4 2009-01-05 0.06 0.01
5 2009-01-29 0.20 0.14
6 2009-01-30 0.21 0.01
7 2009-01-31 0.21 0.00
8 2009-02-01 0.00 0.00
9 2009-02-02 0.00 0.00
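For reference, a self-contained sketch of the to_period approach, with toy data reconstructed from the sample above:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2009-01-01', '2009-01-02', '2009-01-03',
                            '2009-01-31', '2009-02-01', '2009-02-02']),
    'rainfallInMon': [0.0, 0.03, 0.05, 0.21, 0.0, 0.0],
})

# diff within each (year, month) group; the first day of each month
# has no previous day, so its NaN becomes 0
df['rainfallDaily'] = (df.groupby(df['Date'].dt.to_period('m'))['rainfallInMon']
                         .diff()
                         .fillna(0))
print(df)
```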
Try:
# Convert to datetime if it's not already the case
df['Date'] = pd.to_datetime(df['Date'])
df['rainfallDaily'] = df.resample('M', on='Date')['rainfallInMon'].diff().fillna(0)
print(df)
# Output
Date rainfallInMon rainfallDaily
0 2009-01-01 0.00 0.00
1 2009-01-02 0.03 0.03
2 2009-01-03 0.05 0.02
3 2009-01-04 0.05 0.00
4 2009-01-05 0.06 0.01
5 2009-01-29 0.20 0.14
6 2009-01-30 0.21 0.01
7 2009-01-31 0.21 0.00
8 2009-02-01 0.00 0.00
9 2009-02-02 0.00 0.00
I have an input file foo.txt with the following lines
A 1 2 3
B 4 5 6
C 7 8 9
I have written the following lines
import numpy as np
import pandas as pd

file = "foo.txt"
source_array = pd.read_csv(file, sep=" ", header=None)
name_array = source_array.iloc[:, 0].to_numpy()
number_array = source_array.iloc[:, 1:4].to_numpy()

r1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
r2 = np.array([[0.5, -0.30902, -0.80902], [0.30902, -0.80902, 0.5], [-0.80902, -0.5, -0.30902]])
r3 = np.array([[0.5, 0.30902, -0.80902], [-0.30902, -0.80902, -0.5], [-0.80902, 0.5, -0.30902]])
mult_array = np.array([r1, r2, r3])

out_array = np.empty((0, 3))
for i in range(number_array.shape[0]):
    lad = number_array[i, 0:3]
    lad = lad.reshape(1, 3)
    print(lad)
    for j in range(mult_array.shape[0]):
        operated_array = np.dot(lad, mult_array[j])
        out_array = np.append(out_array, operated_array, axis=0)
        # print(operated_array)
np.savetxt('foo2.txt', out_array, fmt='%.2f')
After performing the dot multiplication I get the following output:
1.00 2.00 3.00
-1.31 -3.43 -0.74
-2.55 0.19 -2.74
4.00 5.00 6.00
-1.31 -8.28 -2.59
-4.40 0.19 -7.59
7.00 8.00 9.00
-1.31 -13.14 -4.44
-6.25 0.19 -12.44
But the expected output in foo2.txt is
A 1.00 2.00 3.00
A -1.31 -3.43 -0.74
A -2.55 0.19 -2.74
B 4.00 5.00 6.00
B -1.31 -8.28 -2.59
B -4.40 0.19 -7.59
C 7.00 8.00 9.00
C -1.31 -13.14 -4.44
C -6.25 0.19 -12.44
How can I duplicate the row name as many times as I perform the dot multiplication?
For clarity, the input print(df) output is:
df
0 1 2 3
0 A 1 2 3
1 B 4 5 6
2 C 7 8 9
We do not need a for loop here; np.dot plus explode does the job:
df['new'] = np.dot(df, mult_array).tolist()
s = df['new'].explode()
output = pd.DataFrame(s.tolist(), index=s.index).round(2)
Out[30]:
0 1 2
A 1.00 2.00 3.00
A -1.31 -3.43 -0.74
A -2.55 0.19 -2.74
B 4.00 5.00 6.00
B -1.31 -8.28 -2.59
B -4.40 0.19 -7.59
C 7.00 8.00 9.00
C -1.31 -13.14 -4.44
C -6.25 0.19 -12.44
Data input
df
0 1 2
A 1 2 3
B 4 5 6
C 7 8 9
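As another loop-free sketch, np.einsum computes every row-by-matrix product at once, and np.repeat duplicates the names so they can be written next to the values (the input is reconstructed inline here rather than read from foo.txt):

```python
import numpy as np

names = np.array(['A', 'B', 'C'])
numbers = np.array([[1.0, 2, 3], [4, 5, 6], [7, 8, 9]])

r1 = np.eye(3)
r2 = np.array([[0.5, -0.30902, -0.80902],
               [0.30902, -0.80902, 0.5],
               [-0.80902, -0.5, -0.30902]])
r3 = np.array([[0.5, 0.30902, -0.80902],
               [-0.30902, -0.80902, -0.5],
               [-0.80902, 0.5, -0.30902]])
mult_array = np.stack([r1, r2, r3])

# out[i, m, :] = numbers[i] @ mult_array[m]; flatten to one row per product
out = np.einsum('ik,mkj->imj', numbers, mult_array).reshape(-1, 3)

# repeat each name once per rotation matrix, then save names and values together
labels = np.repeat(names, len(mult_array))
rows = np.column_stack([labels, np.char.mod('%.2f', out)])
np.savetxt('foo2.txt', rows, fmt='%s')
```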
I have two dataframes with the same columns and date indices:
df1:
Date T.TO AS.TO NTR.TO ... R.TO
2016-03-03 0.1 0.02 0.04 0.02
2016-03-04 0.09 0.01 0.02 0.02
2016-03-05 0.1 0.02 0.04 0.02
...
2019-03-03 0.09 0.01 0.02 0.02
df2:
Date T.TO AS.TO NTR.TO ... R.TO
2016-03-03 0.01 0.32 0.04 0.02
2016-03-04 0.81 0.21 0.02 0.02
2016-03-05 0.01 0.12 0.04 0.02
...
2019-03-03 0.89 0.11 0.12 0.72
I want to plot all the matching points of the two dataframes on a chart; e.g. the first point would correspond to 2016-03-03, T.TO (0.1, 0.01), another point would correspond to 2016-03-03, AS.TO (0.02, 0.32), and so on, giving me a large number of points. I will then use these to find a line of best fit.
I know how to find the best fit line but I am having difficulty plotting these points directly. I tried using nested for loops and dictionaries but I was wondering if there is a more straightforward approach to this?
To plot these points, you can stack:
plt.scatter(df1.set_index('Date').stack(), df2.set_index('Date').stack())
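A minimal end-to-end sketch on toy data, including the best-fit step the question mentions (np.polyfit is an assumption about how you intend to fit the line):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Date': ['2016-03-03', '2016-03-04', '2016-03-05'],
                    'T.TO': [0.10, 0.09, 0.10], 'AS.TO': [0.02, 0.01, 0.02]})
df2 = pd.DataFrame({'Date': ['2016-03-03', '2016-03-04', '2016-03-05'],
                    'T.TO': [0.01, 0.81, 0.01], 'AS.TO': [0.32, 0.21, 0.12]})

# stack turns each (date, ticker) cell into one observation,
# aligned by position across the two frames
x = df1.set_index('Date').stack()
y = df2.set_index('Date').stack()

# line of best fit through all the stacked points
slope, intercept = np.polyfit(x, y, 1)
```

With real data you would pass `x` and `y` straight to `plt.scatter` as in the answer above, then overlay `slope * x + intercept`.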
If you want to drop out all the data that is not common between the two dataframes then this should work.
In [71]: df = pd.read_clipboard()
In [72]: df
Out[72]:
Date T.TO AS.TO NTR.TO ... R.TO
0 2016-03-03 0.10 0.02 0.04 0.02 NaN
1 2016-03-04 0.09 0.01 0.02 0.02 NaN
2 2016-03-05 0.10 0.02 0.04 0.02 NaN
3 ... NaN NaN NaN NaN NaN
4 2019-03-03 0.09 0.01 0.02 0.02 NaN
In [73]: df2 = pd.read_clipboard()
In [74]: df2
Out[74]:
Date T.TO AS.TO NTR.TO ... R.TO
0 2016-03-03 0.01 0.32 0.04 0.02 NaN
1 2016-03-04 0.81 0.21 0.02 0.02 NaN
2 2016-03-05 0.01 0.12 0.04 0.02 NaN
3 ... NaN NaN NaN NaN NaN
4 2019-03-03 0.89 0.11 0.12 0.72 NaN
Then df3 will contain only the values that match between the two datasets:
In [75]: df3 = df[df==df2]
In [76]: df3
Out[76]:
Date T.TO AS.TO NTR.TO ... R.TO
0 2016-03-03 NaN NaN 0.04 0.02 NaN
1 2016-03-04 NaN NaN 0.02 0.02 NaN
2 2016-03-05 NaN NaN 0.04 0.02 NaN
3 ... NaN NaN NaN NaN NaN
4 2019-03-03 NaN NaN NaN NaN NaN
From there plotting is a simple matter.
I have a pandas dataframe and want to output a text file separated by different spacing for input to another model. How can I do that?
The sample OUTPUT text file is as follows (each column in the text file corresponds to a column in df):
SO HOUREMIS 92 5 1 1 MC12 386.91 389.8 11.45
SO HOUREMIS 92 5 1 1 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 ED2 322.00 397.4 13.00
SO HOUREMIS 92 5 1 1 HL2 25.55 464.3 7.46
SO HOUREMIS 92 5 1 1 WC1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 1 WC2 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC12 405.35 389.3 11.54
SO HOUREMIS 92 5 1 2 MC3 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 MC4 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED1 0.00 0.1 0.10
SO HOUREMIS 92 5 1 2 ED2 319.90 396.3 13.00
After referring to this post, I found the solution:
fmt = '%0s %+1s %+1s %+2s %+2s %+2s %+6s %+15s'
np.savetxt('test.txt', data.values[0:10], fmt=fmt)
I can format each column and specify the spacing and alignment.
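An alternative sketch using Python format specs, which give explicit control over width, alignment and decimals per column (the widths below are hypothetical; adjust them to the layout your model expects):

```python
import pandas as pd

# two sample rows shaped like the desired output above
df = pd.DataFrame(
    [['SO', 'HOUREMIS', 92, 5, 1, 1, 'MC12', 386.91, 389.8, 11.45],
     ['SO', 'HOUREMIS', 92, 5, 1, 1, 'MC3', 0.00, 0.1, 0.10]],
)

# one format spec per column: strings left-padded to a width,
# integers as-is, floats with fixed width and decimals
fmt = '{:2s} {:8s} {:2d} {:d} {:d} {:d} {:4s} {:7.2f} {:6.1f} {:5.2f}'
with open('test.txt', 'w') as f:
    for row in df.itertuples(index=False):
        f.write(fmt.format(*row) + '\n')
```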
I have a dataframe with some numeric values stored in column "value", accompanied by their respective categorical thresholds (warning levels in this case), stored in other columns (in my case "low", "middle", "high"):
value low middle high
0 179.69 17.42 88.87 239.85
1 2.58 17.81 93.37 236.58
2 1.21 0.05 0.01 0.91
3 1.66 0.20 0.32 4.57
4 3.54 0.04 0.04 0.71
5 5.97 0.16 0.17 2.55
6 5.39 0.86 1.62 9.01
7 1.20 0.03 0.01 0.31
8 3.19 0.08 0.01 0.45
9 0.02 0.03 0.01 0.10
10 3.98 0.18 0.05 0.83
11 134.51 78.63 136.86 478.27
12 254.53 83.73 146.33 486.65
13 15.36 86.07 13.74 185.16
14 85.10 86.12 13.74 185.16
15 15.12 1.37 6.09 30.12
I would like to know in which category each value falls (e.g. the first value would be middle, the second would be below_low since it's smaller than all of its thresholds, the third would be high, ... you get the idea). So here is the expected output:
value low middle high category
0 179.69 17.42 88.87 239.85 middle
1 2.58 17.81 93.37 236.58 below_low
2 1.21 0.05 0.01 0.91 high
3 1.66 0.20 0.32 4.57 middle
4 3.54 0.04 0.04 0.71 high
5 5.97 0.16 0.17 2.55 high
6 5.39 0.86 1.62 9.01 middle
7 1.20 0.03 0.01 0.31 high
8 3.19 0.08 0.01 0.45 high
9 0.02 0.03 0.01 0.10 middle
10 3.98 0.18 0.05 0.83 high
11 134.51 78.63 136.86 478.27 low
12 254.53 83.73 146.33 486.65 middle
13 15.36 86.07 13.74 185.16 middle
14 85.10 86.12 13.74 185.16 middle
15 15.12 1.37 6.09 30.12 middle
So far I use this ugly procedure of "manually" checking line by line, stopping at the first category (checked from high to low) whose threshold the value exceeds:
df["category"] = "below_low"
for i in df.index:
    for cat in ["high", "middle", "low"]:
        if df.loc[i, "value"] > df.loc[i, cat]:
            df.loc[i, "category"] = cat
            break
I am aware of the pd.cut() method, but I only know how to use it with a predefined generic list of thresholds. Can somebody tell me what I am missing?
You can use:
- remove the value column
- compare with lt (less than)
- change the order of the columns
- cumulative-sum the columns - the first True gets 1
- compare with 1 by eq
mask = (df.drop('value', axis=1)
          .lt(df['value'], axis=0)
          .reindex(columns=['high', 'middle', 'low'])
          .cumsum(axis=1)
          .eq(1))
If all values in the high, middle and low columns are False, a correction is necessary. I create a new column by inverting the mask and calling all:
mask['below_low'] = (~mask).all(axis=1)
print (mask)
high middle low below_low
0 False True False False
1 False False False True
2 True False False False
3 False True False False
4 True False False False
5 True False False False
6 False True False False
7 True False False False
8 True False False False
9 False True True False
10 True False False False
11 False False True False
12 False True False False
13 False True True False
14 False True True False
15 False True False False
Last call DataFrame.idxmax:
df['category'] = mask.idxmax(axis=1)
print (df)
value low middle high category
0 179.69 17.42 88.87 239.85 middle
1 2.58 17.81 93.37 236.58 below_low
2 1.21 0.05 0.01 0.91 high
3 1.66 0.20 0.32 4.57 middle
4 3.54 0.04 0.04 0.71 high
5 5.97 0.16 0.17 2.55 high
6 5.39 0.86 1.62 9.01 middle
7 1.20 0.03 0.01 0.31 high
8 3.19 0.08 0.01 0.45 high
9 0.02 0.03 0.01 0.10 middle
10 3.98 0.18 0.05 0.83 high
11 134.51 78.63 136.86 478.27 low
12 254.53 83.73 146.33 486.65 middle
13 15.36 86.07 13.74 185.16 middle
14 85.10 86.12 13.74 185.16 middle
15 15.12 1.37 6.09 30.12 middle
Solution with multiple numpy.where, as pointed out by Paul H:
df['category'] = np.where(df['high'] < df['value'], 'high',
                 np.where(df['middle'] < df['value'], 'middle',
                 np.where(df['low'] < df['value'], 'low', 'below_low')))
print (df)
value low middle high category
0 179.69 17.42 88.87 239.85 middle
1 2.58 17.81 93.37 236.58 below_low
2 1.21 0.05 0.01 0.91 high
3 1.66 0.20 0.32 4.57 middle
4 3.54 0.04 0.04 0.71 high
5 5.97 0.16 0.17 2.55 high
6 5.39 0.86 1.62 9.01 middle
7 1.20 0.03 0.01 0.31 high
8 3.19 0.08 0.01 0.45 high
9 0.02 0.03 0.01 0.10 middle
10 3.98 0.18 0.05 0.83 high
11 134.51 78.63 136.86 478.27 low
12 254.53 83.73 146.33 486.65 middle
13 15.36 86.07 13.74 185.16 middle
14 85.10 86.12 13.74 185.16 middle
15 15.12 1.37 6.09 30.12 middle
In almost every case, you should prefer jezrael's classic vectorized approaches. However, if you're curious about the apply way of doing things, you could:
In [702]: df.apply(lambda x: 'high' if x.value > x['high']
                   else 'middle' if x.value > x['middle']
                   else 'low' if x.value > x['low']
                   else 'below_low', axis=1)
Out[702]:
0 middle
1 below_low
2 high
3 middle
4 high
5 high
6 middle
7 high
8 high
9 middle
10 high
11 low
12 middle
13 middle
14 middle
15 middle
dtype: object
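The nested np.where cascade can also be written with np.select, which keeps the same first-match-wins logic but reads flatter (a sketch on the first three sample rows):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [179.69, 2.58, 1.21],
                   'low': [17.42, 17.81, 0.05],
                   'middle': [88.87, 93.37, 0.01],
                   'high': [239.85, 236.58, 0.91]})

# checked in order: the first True condition per row wins
conditions = [df['value'] > df['high'],
              df['value'] > df['middle'],
              df['value'] > df['low']]
df['category'] = np.select(conditions, ['high', 'middle', 'low'],
                           default='below_low')
```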