Pandas: Cumulative sum with moving window (following and preceding rows) - python

I have the following dataset:
date sales
201201 5
201202 5
201203 5
201204 5
201205 5
201206 5
201207 5
201208 5
201209 5
201210 5
201211 5
201212 5
201301 100
201302 100
And I want to compute the cumulative sum of sales, from the beginning up to the current date + 12 months (i.e. each row plus the 11 following rows).
So here:
date sales expected
201201 5 60
201202 5 160
201203 5 260
201204 5 260
201205 5 260
201206 5 260
201207 5 260
201208 5 260
201209 5 260
201210 5 260
201211 5 260
201212 5 260
201301 100 260
201302 100 260
According to this question How to compute cumulative sum of previous N rows in pandas? I tried:
df['sales'].rolling(window=12).sum()
However I am looking for something more like this :
df['sales'].rolling(window=['unlimited preceding, 11 following']).sum()

Use cumsum, shift the result back by 11 rows with shift(-11), then ffill to fill the trailing NaNs with the last value:
df['expected'] = df['sales'].cumsum().shift(-11).ffill()
And now:
print(df)
gives:
date sales expected
0 201201 5 60.0
1 201202 5 160.0
2 201203 5 260.0
3 201204 5 260.0
4 201205 5 260.0
5 201206 5 260.0
6 201207 5 260.0
7 201208 5 260.0
8 201209 5 260.0
9 201210 5 260.0
10 201211 5 260.0
11 201212 5 260.0
12 201301 100 260.0
13 201302 100 260.0
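For reference, a self-contained sketch of this approach, rebuilding the question's data from the tables above:

```python
import pandas as pd

# Reproduce the question's data to sanity-check the cumsum/shift approach.
df = pd.DataFrame({
    "date": [201201 + i for i in range(12)] + [201301, 201302],
    "sales": [5] * 12 + [100, 100],
})

# Running total up to "current row + 11 following rows":
# shift(-11) lines each row up with the cumulative sum 11 rows ahead,
# and ffill() propagates the final total into the last 11 rows,
# which have no row that far ahead.
df["expected"] = df["sales"].cumsum().shift(-11).ffill()
print(df)
```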

Related

Moving average for the last row of a data frame

I have a data frame with two prices and moving average(window=3) for each price:
price1  price2  MA3-price1  MA3-price2
18      10      NaN         NaN
12      9       NaN         NaN
20      15      16.66       11.33
12      7       14.66       10.33
4       9       12.00       10.33
6       4       NaN         NaN
I don't have the MA for the last row. How can I calculate the MA for the last row and get:
price1  price2  MA3-price1  MA3-price2
18      10      NaN         NaN
12      9       NaN         NaN
20      15      16.66       11.33
12      7       14.66       10.33
4       9       12.00       10.33
6       4       7.33        6.66
To compute "MA3-price1" and "MA3-price2" columns from "price1" and "price2", try:
df[["MA3-price1", "MA3-price2"]] = df[["price1", "price2"]].rolling(3).mean()
print(df)
Prints:
price1 price2 MA3-price1 MA3-price2
0 18 10 NaN NaN
1 12 9 NaN NaN
2 20 15 16.666667 11.333333
3 12 7 14.666667 10.333333
4 4 9 12.000000 10.333333
5 6 4 7.333333 6.666667
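A runnable sketch of the answer, rebuilding the question's prices, to show that the last row now gets its moving average:

```python
import pandas as pd

# Rebuild the question's price data.
df = pd.DataFrame({
    "price1": [18, 12, 20, 12, 4, 6],
    "price2": [10, 9, 15, 7, 9, 4],
})

# rolling(3).mean() leaves the first two rows NaN (fewer than 3 prices
# available), but the last row has three preceding prices, so it is filled.
df[["MA3-price1", "MA3-price2"]] = df[["price1", "price2"]].rolling(3).mean()
print(df.round(2))
```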

Create new column with largest number indexes based on values of another column

I have a DataFrame with two columns: 'goods name' and their 'overall sales'. I need to add another column containing each product's sales rank within its own product type, numbered 1, 2, 3..., where 1 marks the largest number, 2 the second largest, and so on.
Hope you can help me.
My dataframe:
lst = [['Keyboard1', 1860], ['Keyboard2', 1650], ['Keyboard3', 900], ['Keyboard4', 1230], ['Keyboard5', 1150], ['Keyboard6', 1345],
['Mouse1', 3100], ['Mouse2', 2900], ['Mouse3', 3050], ['Mouse4', 2750], ['Mouse5', 4100], ['Mouse6', 3910]]
df = pd.DataFrame(lst, columns = ['Goods', 'Sales'])
Goods Sales
0 Keyboard1 1860
1 Keyboard2 1650
2 Keyboard3 900
3 Keyboard4 1230
4 Keyboard5 1150
5 Keyboard6 1345
6 Mouse1 3100
7 Mouse2 2900
8 Mouse3 3050
9 Mouse4 2750
10 Mouse5 4100
11 Mouse6 3910
I'm trying to use this code:
import pandas as pd
import numpy as np
df = df.sort_values('Sales', ascending = False)
df['Largest'] = np.arange(len(df))+1
But this ranks all the goods together; I need the ranking computed separately for each type of good. My result:
Goods Sales Largest
10 Mouse5 4100 1
11 Mouse6 3910 2
6 Mouse1 3100 3
8 Mouse3 3050 4
7 Mouse2 2900 5
9 Mouse4 2750 6
0 Keyboard1 1860 7
1 Keyboard2 1650 8
5 Keyboard6 1345 9
3 Keyboard4 1230 10
4 Keyboard5 1150 11
2 Keyboard3 900 12
Here is the output I need:
Goods Sales Largest
10 Mouse5 4100 1
11 Mouse6 3910 2
6 Mouse1 3100 3
8 Mouse3 3050 4
7 Mouse2 2900 5
9 Mouse4 2750 6
0 Keyboard1 1860 1
1 Keyboard2 1650 2
5 Keyboard6 1345 3
3 Keyboard4 1230 4
4 Keyboard5 1150 5
2 Keyboard3 900 6
Just do:
# remove the trailing digits to get the product type
df['goods_group'] = df['Goods'].str.replace(r'\d+$', '', regex=True)
# sort by the new column and sales
df = df.sort_values(['goods_group', 'Sales'], ascending=False)
# number rows within each group
df['largest'] = df.groupby('goods_group').cumcount() + 1
# drop the helper column
res = df.drop('goods_group', axis=1)
print(res)
Output
Goods Sales largest
10 Mouse5 4100 1
11 Mouse6 3910 2
6 Mouse1 3100 3
8 Mouse3 3050 4
7 Mouse2 2900 5
9 Mouse4 2750 6
0 Keyboard1 1860 1
1 Keyboard2 1650 2
5 Keyboard6 1345 3
3 Keyboard4 1230 4
4 Keyboard5 1150 5
2 Keyboard3 900 6
Try adding these lines to the end of the code:
df['new'] = df['Goods'].str[:-1]
df['Largest'] = df.groupby('new').cumcount() + 1
df = df.drop('new', axis=1)
print(df)
Output:
Goods Sales Largest
10 Mouse5 4100 1
11 Mouse6 3910 2
6 Mouse1 3100 3
8 Mouse3 3050 4
7 Mouse2 2900 5
9 Mouse4 2750 6
0 Keyboard1 1860 1
1 Keyboard2 1650 2
5 Keyboard6 1345 3
3 Keyboard4 1230 4
4 Keyboard5 1150 5
2 Keyboard3 900 6
You could groupby Goods without the digits:
>>> df = df.sort_values('Sales', ascending=False)
>>> df
Goods Sales
10 Mouse5 4100
11 Mouse6 3910
6 Mouse1 3100
8 Mouse3 3050
7 Mouse2 2900
9 Mouse4 2750
0 Keyboard1 1860
1 Keyboard2 1650
5 Keyboard6 1345
3 Keyboard4 1230
4 Keyboard5 1150
2 Keyboard3 900
>>> df['Largest'] = df.groupby(df['Goods'].replace(r'\d+', '', regex=True)).cumcount() + 1
>>> df
Goods Sales Largest
10 Mouse5 4100 1
11 Mouse6 3910 2
6 Mouse1 3100 3
8 Mouse3 3050 4
7 Mouse2 2900 5
9 Mouse4 2750 6
0 Keyboard1 1860 1
1 Keyboard2 1650 2
5 Keyboard6 1345 3
3 Keyboard4 1230 4
4 Keyboard5 1150 5
2 Keyboard3 900 6
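If the original row order should be kept, a groupby + rank variation of the answers above avoids the sort entirely (a sketch; the r'(\D+)' pattern assumes the product type is the non-digit prefix of the name):

```python
import pandas as pd

lst = [['Keyboard1', 1860], ['Keyboard2', 1650], ['Keyboard3', 900],
       ['Keyboard4', 1230], ['Keyboard5', 1150], ['Keyboard6', 1345],
       ['Mouse1', 3100], ['Mouse2', 2900], ['Mouse3', 3050],
       ['Mouse4', 2750], ['Mouse5', 4100], ['Mouse6', 3910]]
df = pd.DataFrame(lst, columns=['Goods', 'Sales'])

# Group key = the non-digit prefix of the product name ("Keyboard", "Mouse").
group = df['Goods'].str.extract(r'(\D+)', expand=False)

# rank(ascending=False) gives 1 to the largest Sales within each group,
# without reordering the rows.
df['Largest'] = df.groupby(group)['Sales'].rank(ascending=False).astype(int)
print(df)
```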

How to fill in missing values in a pandas dataframe

I have a dataframe that contains missing values.
index month value
0 201501 100
1 201507 172
2 201602 181
3 201605 98
I want to fill in the missing months of the above data frame using the list below.
list = [201501, 201502, 201503 ... 201612]
The result I want to get...
index month value
0 201501 100
1 201502 100
2 201503 100
3 201504 100
4 201505 100
5 201506 100
6 201507 172
7 201508 172
...
...
22 201611 98
23 201612 98
Setup
my_list = list(range(201501,201509))
df = df.drop('index', axis=1)  # remove the 'index' column left over from pd.read_clipboard
print(df)
month value
0 201501 100
1 201507 172
2 201602 181
3 201605 98
pd.DataFrame.reindex
import numpy as np

df = (df.set_index('month')
        .reindex(index=np.sort(np.unique(df['month'].tolist() + my_list)))
        .ffill()
        .reset_index())
print(df)
month value
0 201501 100.0
1 201502 100.0
2 201503 100.0
3 201504 100.0
4 201505 100.0
5 201506 100.0
6 201507 172.0
7 201508 172.0
8 201602 181.0
9 201605 98.0
10 201612 98.0
Using pandas.DataFrame.merge:
l = list(range(201501,201509))
new_df = df.merge(pd.Series(l,name='month'),how='outer').sort_values('month').ffill()
new_df['index'] = range(new_df.shape[0])
Output:
index month value
0 0 201501 100.0
4 1 201502 100.0
5 2 201503 100.0
6 3 201504 100.0
7 4 201505 100.0
8 5 201506 100.0
1 6 201507 172.0
9 7 201508 172.0
2 8 201602 181.0
3 9 201605 98.0
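One caveat with building the month list from a plain integer range: range(201501, 201613) would include impossible codes like 201513. A sketch of an alternative that generates the month sequence with pd.period_range instead:

```python
import pandas as pd

df = pd.DataFrame({"month": [201501, 201507, 201602, 201605],
                   "value": [100, 172, 181, 98]})

# period_range generates only valid months, so the YYYYMM sequence
# skips the impossible codes (201513..201600) that a plain integer
# range would include.
full = pd.period_range("2015-01", "2016-12", freq="M").strftime("%Y%m").astype(int)

out = (df.set_index("month")
         .reindex(full)
         .ffill()
         .rename_axis("month")
         .reset_index())
print(out)
```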

Problem with pandas.DataFrame.shift function

I have the following dataframe in python:
import pandas as pd

months = [1,2,3,4,5,6,7,8,9,10,11,12]
data1 = [100,200,300,400,500,600,700,800,900,1000,1100,1200]
df = pd.DataFrame({
    'month': months,
    'd1': data1,
    'd2': 0,
})
and I want to calculate the column d2, in the following way:
month d1 d2
0 1 100 101.0
1 2 200 303.0
2 3 300 606.0
3 4 400 1010.0
4 5 500 1515.0
5 6 600 2121.0
6 7 700 2828.0
7 8 800 3636.0
8 9 900 4545.0
9 10 1000 5555.0
10 11 1100 6666.0
11 12 1200 7878.0
I am doing it in the following way:
df['d2'] = (df['d2'].shift(1) + df['d1']) + df['month']
but the result is not what was expected:
month d1 d2
0 1 100 NaN
1 2 200 202.0
2 3 300 303.0
3 4 400 404.0
4 5 500 505.0
5 6 600 606.0
6 7 700 707.0
7 8 800 808.0
8 9 900 909.0
9 10 1000 1010.0
10 11 1100 1111.0
11 12 1200 1212.0
I hope my request is clear; thanks in advance to anyone who can help.
IIUC, you're looking for cumsum:
df['d2'] = (df.d1+df.month).cumsum()
>>> df
month d1 d2
0 1 100 101
1 2 200 303
2 3 300 606
3 4 400 1010
4 5 500 1515
5 6 600 2121
6 7 700 2828
7 8 800 3636
8 9 900 4545
9 10 1000 5555
10 11 1100 6666
11 12 1200 7878
What you need is a cumulative sum :)
df['d2'] = df.d1.cumsum()
print(df)
month d1 d2
0 1 100 100
1 2 200 300
2 3 300 600
3 4 400 1000
4 5 500 1500
5 6 600 2100
6 7 700 2800
7 8 800 3600
8 9 900 4500
9 10 1000 5500
10 11 1100 6600
11 12 1200 7800
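The underlying issue: df['d2'].shift(1) is evaluated against the d2 column as it is before the assignment (all zeros), so the recurrence never accumulates. A minimal sketch of the vectorized fix:

```python
import pandas as pd

df = pd.DataFrame({"month": range(1, 13),
                   "d1": range(100, 1300, 100)})

# shift() sees the column as it was before the assignment, so a
# recurrence like d2 = d2.shift(1) + ... cannot build a running total.
# cumsum() performs the accumulation in one vectorized pass:
df["d2"] = (df["d1"] + df["month"]).cumsum()
print(df)
```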

Python: How to split hourly values into 15 minute buckets?

happy new year to all!
I guess this question might be an easy one, but I can't figure it out.
How can I turn hourly data into 15-minute buckets quickly in python (see table below)? Basically the left column should be converted into the right one: just duplicate each hourly value four times and dump it into a new column.
Thanks for the support!
Cheers!
Hourly 15mins
1 28.90 1 28.90
2 28.88 1 28.90
3 28.68 1 28.90
4 28.67 1 28.90
5 28.52 2 28.88
6 28.79 2 28.88
7 31.33 2 28.88
8 32.60 2 28.88
9 42.00 3 28.68
10 44.00 3 28.68
11 44.00 3 28.68
12 44.00 3 28.68
13 39.94 4 28.67
14 39.90 4 28.67
15 38.09 4 28.67
16 39.94 4 28.67
17 44.94 5 28.52
18 66.01 5 28.52
19 49.45 5 28.52
20 48.37 5 28.52
21 38.02 6 28.79
22 34.55 6 28.79
23 33.33 6 28.79
24 32.05 6 28.79
... (the 15mins column continues in the same pattern: each hourly value repeated four times, 96 rows in total)
You could also do this through constructing a new DataFrame and using numpy methods.
import numpy as np
import pandas as pd

pd.DataFrame(np.column_stack((np.arange(df.shape[0]).repeat(4, axis=0),
                              np.array(df).repeat(4, axis=0))),
             columns=['hours', '15_minutes'])
which returns
hours 15_minutes
0 0 28.90
1 0 28.90
2 0 28.90
3 0 28.90
4 1 28.88
5 1 28.88
...
91 22 33.33
92 23 32.05
93 23 32.05
94 23 32.05
95 23 32.05
column_stack joins the arrays as columns (along axis=1). np.arange(df.shape[0]).repeat(4, axis=0) builds the hour IDs by repeating 0 through 23 four times each, and the 15-minute values are constructed in a similar manner. pd.DataFrame wraps the result, and the column names are added as well.
Create a datetime-like index for your DataFrame; then you can use resample:
series.resample('15min').ffill()
Note that resample alone returns a Resampler object; chain .ffill() (or another aggregation) to materialize the 15-minute buckets. '15T' is the older spelling of the 15-minute alias.
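A sketch of that resample approach on the first few hourly values from the question (the start date is made up for illustration):

```python
import pandas as pd

# Attach a DatetimeIndex (the date itself is arbitrary) to the
# hourly prices, then forward-fill onto a 15-minute grid.
prices = [28.90, 28.88, 28.68, 28.67]  # first four hourly values
s = pd.Series(prices,
              index=pd.date_range("2023-01-01", periods=len(prices), freq="h"))

out = s.resample("15min").ffill()
print(out)
```

Note the final hour contributes only its own timestamp here; extend the index by one more hour if all four of its 15-minute buckets are needed.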
