I have a dataframe (totaldf) such that:
... Hom ... March Plans March Ships April Plans April Ships ...
0 CAD ... 12 5 4 13
1 USA ... 7 6 2 11
2 CAD ... 4 9 6 14
3 CAD ... 13 3 9 7
... ... ... ... ... ... ...
for all months of the year. I would like it to be:
... Hom ... Month Plans Ships ...
0 CAD ... March 12 5
1 USA ... March 7 6
2 CAD ... March 4 9
3 CAD ... March 13 3
4 CAD ... April 4 13
5 USA ... April 2 11
6 CAD ... April 6 14
7 CAD ... April 9 7
... ... ... ... ... ...
Is there an easy way to do this without splitting string entries?
I have played around with totaldf.unstack() but since there are multiple columns I'm unsure as to how to properly reindex the dataframe.
If you convert the columns to a MultiIndex you can use stack:
In [11]: df1 = df.set_index("Hom")
In [12]: df1.columns = pd.MultiIndex.from_tuples(df1.columns.map(lambda x: tuple(x.split())))
In [13]: df1
Out[13]:
March April
Plans Ships Plans Ships
Hom
CAD 12 5 4 13
USA 7 6 2 11
CAD 4 9 6 14
CAD 13 3 9 7
In [14]: df1.stack(level=0)
Out[14]:
Plans Ships
Hom
CAD April 4 13
March 12 5
USA April 2 11
March 7 6
CAD April 6 14
March 4 9
April 9 7
March 13 3
In [21]: res = df1.stack(level=0)
In [22]: res.index.names = ["Hom", "Month"]
In [23]: res.reset_index()
Out[23]:
Hom Month Plans Ships
0 CAD April 4 13
1 CAD March 12 5
2 USA April 2 11
3 USA March 7 6
4 CAD April 6 14
5 CAD March 4 9
6 CAD April 9 7
7 CAD March 13 3
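The transcript above can be collected into a self-contained script; the sample frame below is reconstructed from the question (only the two months shown there):

```python
import pandas as pd

# Sample data reconstructed from the question (March and April only).
df = pd.DataFrame({
    "Hom": ["CAD", "USA", "CAD", "CAD"],
    "March Plans": [12, 7, 4, 13],
    "March Ships": [5, 6, 9, 3],
    "April Plans": [4, 2, 6, 9],
    "April Ships": [13, 11, 14, 7],
})

df1 = df.set_index("Hom")
# Split "March Plans" into the tuple ("March", "Plans"), etc.
df1.columns = pd.MultiIndex.from_tuples(
    [tuple(c.split()) for c in df1.columns]
)
# Stack the month level of the columns into the row index.
res = df1.stack(level=0)
res.index.names = ["Hom", "Month"]
out = res.reset_index()
print(out)
```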
You can use pd.wide_to_long, with a little extra work to get the right stubnames, since as the docs note:
The stub name(s). The wide format variables are assumed to start with the stub names.
So it is necessary to slightly modify the column names so that the stubnames come at the beginning of each column name:
m = df.columns.str.contains('Plans|Ships')
cols = df.columns[m].str.split(' ')
df.columns.values[m] = [w+month for month, w in cols]
print(df)
Hom PlansMarch ShipsMarch PlansApril ShipsApril
0 CAD 12 5 4 13
1 USA 7 6 2 11
2 CAD 4 9 6 14
3 CAD 13 3 9 7
Now you can use pd.wide_to_long using ['Ships', 'Plans'] as stubnames in order to obtain the output you want:
(pd.wide_to_long(df.reset_index(), stubnames=['Ships', 'Plans'],
                 i='index', j='Month', suffix='\w+')
   .reset_index(drop=True, level=0)
   .reset_index())
  Month  Hom  Ships  Plans
0 March CAD 5 12
1 March USA 6 7
2 March CAD 9 4
3 March CAD 3 13
4 April CAD 13 4
5 April USA 11 2
6 April CAD 14 6
7 April CAD 7 9
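Putting the renaming step and the wide_to_long call together, a self-contained sketch (the sample frame below is reconstructed from the question, with the stubnames already moved to the front of each column name) looks like:

```python
import pandas as pd

# Sample frame with stub-first column names, e.g. "PlansMarch".
df = pd.DataFrame({
    "Hom": ["CAD", "USA", "CAD", "CAD"],
    "PlansMarch": [12, 7, 4, 13],
    "ShipsMarch": [5, 6, 9, 3],
    "PlansApril": [4, 2, 6, 9],
    "ShipsApril": [13, 11, 14, 7],
})

long_df = (pd.wide_to_long(df.reset_index(), stubnames=["Plans", "Ships"],
                           i="index", j="Month", suffix=r"\w+")
             .reset_index(level=1)        # pull "Month" out of the index
             .reset_index(drop=True))
print(long_df)
```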
Related
I want to do a rolling sum based on different levels of the index but am struggling to make it work. Instead of explaining the problem, I am giving the demo input and desired output below, along with the kind of insights I am looking for.
I have multiple brands, each with sales of various item categories, grouped by year, month, and day as below. What I want is a dynamic rolling sum at each day level, rolled over a window on Year.
For example:
Demo question 1) Up to a certain day (not including that day), what were the last 2 years' sales of that particular category for that particular brand?
I need to be able to answer this for every single day, i.e. every single row should have a number, as shown in Table 2.0.
I want to code this so that if the question changes from 2 years to 3 years I only need to change a number. I also need to do the same thing at the month level.
Demo question 2) Up to a certain day (not including that day), what were the last 3 months' sales of that particular category for that particular year and brand?
Below is demo input
The tables are grouped by brand, category, year, month, and day, with the sum of sales, derived from a master table which has all the info with sales at the hour level for each day.
Table 1.0

Brand  Category        Year  Month  Day  Sales
ABC    Big Appliances  2021      9    3      0
ABC    Clothing        2021      9    2      0
ABC    Electronics     2020     10   18      2
ABC    Utensils        2020     10   18      0
ABC    Utensils        2021      9    2      4
ABC    Utensils        2021      9    3      0
XYZ    Big Appliances  2012      4   29      7
XYZ    Big Appliances  2013      4    7      6
XYZ    Clothing        2012      4   29      3
XYZ    Electronics     2013      4    9      1
XYZ    Electronics     2013      4   27      2
XYZ    Electronics     2013      5    4      5
XYZ    Electronics     2015      4   27      7
XYZ    Electronics     2015      5    2      2
XYZ    Fans            2013      4   14      4
XYZ    Fans            2013      5    4      0
XYZ    Fans            2015      4   18      1
XYZ    Fans            2015      5   17     11
XYZ    Fans            2016      4   12     18
XYZ    Furniture       2012      5    4      1
XYZ    Furniture       2012      5    8      6
XYZ    Furniture       2012      5   20      4
XYZ    Furniture       2013      4    5      1
XYZ    Furniture       2013      4    7      8
XYZ    Furniture       2013      4    9      2
XYZ    Furniture       2015      4   18     12
XYZ    Furniture       2015      4   27     15
XYZ    Furniture       2015      5    2      4
XYZ    Furniture       2015      5   17      3
XYZ    Musical-inst    2012      5   18     10
XYZ    Musical-inst    2013      4    5      6
XYZ    Musical-inst    2015      4   16     10
XYZ    Musical-inst    2015      4   18      0
XYZ    Musical-inst    2016      4   12      1
XYZ    Musical-inst    2016      4   16     13
XYZ    Utencils        2012      5    8      2
XYZ    Utencils        2016      4   16      3
XYZ    Utencils        2016      4   18      2
XYZ    Utencils        2017      4   12     13
Below is the desired output for demo question 1, based on the demo table (last 2 years' cumsum, not including that day).
Table 2.0

Brand  Category        Year  Month  Day  Sales  Conditional Cumsum (till last 2 years)
ABC    Big Appliances  2021      9    3      0      0
ABC    Clothing        2021      9    2      0      0
ABC    Electronics     2020     10   18      2      0
ABC    Utensils        2020     10   18      0      0
ABC    Utensils        2021      9    2      4      0
ABC    Utensils        2021      9    3      0      4
XYZ    Big Appliances  2012      4   29      7      0
XYZ    Big Appliances  2013      4    7      6      7
XYZ    Clothing        2012      4   29      3      0
XYZ    Electronics     2013      4    9      1      0
XYZ    Electronics     2013      4   27      2      1
XYZ    Electronics     2013      5    4      5      3
XYZ    Electronics     2015      4   27      7      8
XYZ    Electronics     2015      5    2      2     15
XYZ    Fans            2013      4   14      4      0
XYZ    Fans            2013      5    4      0      4
XYZ    Fans            2015      4   18      1      4
XYZ    Fans            2015      5   17     11      5
XYZ    Fans            2016      4   12     18     12
XYZ    Furniture       2012      5    4      1      0
XYZ    Furniture       2012      5    8      6      1
XYZ    Furniture       2012      5   20      4      7
XYZ    Furniture       2013      4    5      1     11
XYZ    Furniture       2013      4    7      8     12
XYZ    Furniture       2013      4    9      2     20
XYZ    Furniture       2015      4   18     12     11
XYZ    Furniture       2015      4   27     15     23
XYZ    Furniture       2015      5    2      4     38
XYZ    Furniture       2015      5   17      3     42
XYZ    Musical-inst    2012      5   18     10      0
XYZ    Musical-inst    2013      4    5      6     10
XYZ    Musical-inst    2015      4   16     10      6
XYZ    Musical-inst    2015      4   18      0     16
XYZ    Musical-inst    2016      4   12      1     10
XYZ    Musical-inst    2016      4   16     13     11
XYZ    Utencils        2012      5    8      2      0
XYZ    Utencils        2016      4   16      3      0
XYZ    Utencils        2016      4   18      2      3
XYZ    Utencils        2017      4   12     13      5
End thoughts:
The idea is to basically do a rolling window over year column maintaining the 2 years span criteria and keep on summing the sales figures.
P.S. I really need a fast solution due to the huge data size; I wrote a row-wise .apply function, which was not feasible. A better solution using some kind of grouped rolling sum or supporting columns would be really helpful.
Here I'm giving a sample solution for the above problem.
I have considered just one product so that the solution stays simple.
Code:
from datetime import date, timedelta

Input = {"Utencils": [[2012, 5, 8, 2], [2016, 4, 16, 3], [2017, 4, 12, 13]]}
Input1 = Input["Utencils"]
Limit = timedelta(365 * 2)   # 2-year window; change to 365 * 3 for 3 years
cumsum = 0
lis = []    # indices of the rows currently inside the window
Tot = []
for i in range(len(Input1)):
    # Drop rows that have fallen out of the window before the current day.
    while lis:
        idx = lis[0]
        Y, M, D = Input1[i][:3]
        reqDate = date(Y, M, D) - Limit
        Y, M, D = Input1[idx][:3]
        if date(Y, M, D) <= reqDate:
            lis.pop(0)
            cumsum -= Input1[idx][3]
        else:
            break
    Tot.append(cumsum)
    lis.append(i)
    cumsum += Input1[i][3]
print(Tot)
Here Tot would output the required cumsum column for the given data.
Output:
[0, 0, 3]
Here you can specify the time span using the number of days in the Limit variable.
Hope this solves the problem you are looking for.
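For larger data, a vectorized sketch using pandas' time-based rolling windows may help. This is an alternative to the loop above, not the asker's method: it builds a real date column (renaming to lowercase part names, which pd.to_datetime expects) and uses a 730-day window with closed="left" so the current day itself is excluded.

```python
import pandas as pd

# Hypothetical sample: one brand/category, as in the loop solution above.
df = pd.DataFrame({
    "Brand": ["XYZ"] * 3,
    "Category": ["Utencils"] * 3,
    "Year": [2012, 2016, 2017],
    "Month": [5, 4, 4],
    "Day": [8, 16, 12],
    "Sales": [2, 3, 13],
})

# Assemble a proper date column; to_datetime needs lowercase column names.
df["Date"] = pd.to_datetime(
    df.rename(columns={"Year": "year", "Month": "month", "Day": "day"})
      [["year", "month", "day"]]
)
df = df.sort_values(["Brand", "Category", "Date"])

# Rolling ~2-year sum per group, excluding the current row via closed="left";
# change "730D" to "1095D" for a 3-year window.
rolled = (df.set_index("Date")
            .groupby(["Brand", "Category"])["Sales"]
            .rolling("730D", closed="left")
            .sum())
# rolled is ordered by (Brand, Category, Date), the same order as df here.
df["Cumsum_2y"] = rolled.to_numpy()
df["Cumsum_2y"] = df["Cumsum_2y"].fillna(0)
print(df)
```

Note that a "730D" window is an approximation of 2 calendar years (it ignores leap days), which matches the timedelta(365 * 2) limit used in the loop version.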
I have 2 integer variables in a pandas dataframe: months and years. I want to combine them into one variable like 2021-1. Each index matches one-to-one (no problem). The new variable must be a time series. How can I do that?
For example my dataframe seems like this:
import pandas as pd
a = [2015,2015,2015,2015,2015,2016,2016,2016,2016]
b = [1,2,3,4,5,1,2,3,4]
c = pd.DataFrame(a , columns=["Year"])
d = pd.DataFrame(b , columns = ["Month"])
e = pd.concat([c,d] , axis = 1)
e.head()
There are multiple ways to do this, two examples:
import datetime as dt
r = pd.date_range("1-jan-2018", freq="M", periods=24)
df = pd.DataFrame({"year":r.year, "month":r.month})
df.assign(ymstr=df.astype({"year":"string","month":"string"}).apply("-".join, axis=1),
ymdt=df.apply(lambda r: dt.datetime(r[0],r[1], 1).strftime("%Y-%-m"), axis=1))
    year  month    ymstr     ymdt
0   2018      1   2018-1   2018-1
1   2018      2   2018-2   2018-2
2   2018      3   2018-3   2018-3
3   2018      4   2018-4   2018-4
4   2018      5   2018-5   2018-5
5   2018      6   2018-6   2018-6
6   2018      7   2018-7   2018-7
7   2018      8   2018-8   2018-8
8   2018      9   2018-9   2018-9
9   2018     10  2018-10  2018-10
10  2018     11  2018-11  2018-11
11  2018     12  2018-12  2018-12
12  2019      1   2019-1   2019-1
13  2019      2   2019-2   2019-2
14  2019      3   2019-3   2019-3
15  2019      4   2019-4   2019-4
16  2019      5   2019-5   2019-5
17  2019      6   2019-6   2019-6
18  2019      7   2019-7   2019-7
19  2019      8   2019-8   2019-8
20  2019      9   2019-9   2019-9
21  2019     10  2019-10  2019-10
22  2019     11  2019-11  2019-11
23  2019     12  2019-12  2019-12
I found it.
from datetime import datetime
e['NewTime'] = e.apply(lambda row: datetime.strptime(f"{int(row.Year)}-{int(row.Month)}", '%Y-%m'), axis=1)
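As a vectorized alternative to the row-wise apply, pandas can assemble datetimes directly from year/month/day columns; the renaming below is only needed because pd.to_datetime expects lowercase part names, and the day is fixed to 1:

```python
import pandas as pd

a = [2015, 2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016]
b = [1, 2, 3, 4, 5, 1, 2, 3, 4]
e = pd.DataFrame({"Year": a, "Month": b})

# Assemble a datetime from the parts (day fixed to 1)...
e["NewTime"] = pd.to_datetime(
    e.rename(columns={"Year": "year", "Month": "month"})
     .assign(day=1)[["year", "month", "day"]]
)
# ...or a monthly period, which prints as e.g. 2015-01.
e["Period"] = e["NewTime"].dt.to_period("M")
print(e.head())
```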
I am trying to create a new variable which holds the SALES_AMOUNT difference between year-months in the following dataframe. I think my code should start with this groupby, but I don't know how to add the condition [df.Control - df.Control.shift(1) == 12] after the groupby so as to perform a correct difference between years:
df['LY'] = df.groupby(['month']).SALES_AMOUNT.shift(1)
Dataframe:
SALES_AMOUNT Store Control year month
0 16793.14 A 3 2013 3
1 42901.61 A 5 2013 5
2 63059.72 A 6 2013 6
3 168471.43 A 10 2013 10
4 58570.72 A 11 2013 11
5 67526.71 A 12 2013 12
6 50649.07 A 14 2014 2
7 48819.97 A 18 2014 6
8 97100.77 A 19 2014 7
9 67778.40 A 21 2014 9
10 90327.52 A 22 2014 10
11 75703.12 A 23 2014 11
12 26098.50 A 24 2014 12
13 81429.36 A 25 2015 1
14 19539.85 A 26 2015 2
15 71727.66 A 27 2015 3
16 20117.79 A 28 2015 4
17 44252.19 A 29 2015 6
18 68578.82 A 30 2015 7
19 91483.39 A 31 2015 8
20 39220.87 A 32 2015 10
21 12224.11 A 33 2015 11
result should look like this:
SALES_AMOUNT Store Control year month year_diff
0 16793.14 A 3 2013 3 NaN
1 42901.61 A 5 2013 5 NaN
2 63059.72 A 6 2013 6 NaN
3 168471.43 A 10 2013 10 NaN
4 58570.72 A 11 2013 11 NaN
5 67526.71 A 12 2013 12 NaN
6 50649.07 A 14 2014 2 NaN
7 48819.97 A 18 2014 6 -14239.75
8 97100.77 A 19 2014 7 NaN
9 67778.40 A 21 2014 9 NaN
10 90327.52 A 22 2014 10 -78143.91
11 75703.12 A 23 2014 11 17132.40
12 26098.50 A 24 2014 12 -41428.21
13 81429.36 A 25 2015 1 NaN
14 19539.85 A 26 2015 2 -31109.22
15 71727.66 A 27 2015 3 NaN
16 20117.79 A 28 2015 4 NaN
17 44252.19 A 29 2015 6 -4567.78
18 68578.82 A 30 2015 7 -28521.95
19 91483.39 A 31 2015 8 NaN
20 39220.87 A 32 2015 10 -51106.65
21 12224.11 A 33 2015 11 -63479.01
I think what you're looking for is the below:
df = df.sort_values(by=['month', 'year'])
df['SALES_AMOUNT_shifted'] = df.groupby(['month'])['SALES_AMOUNT'].shift(1).tolist()
df['LY'] = df['SALES_AMOUNT'] - df['SALES_AMOUNT_shifted']
Once you sort by month and year, the month groups will be organized in a consistent way and then the shift makes sense.
-- UPDATE --
After applying the solution above, you can set LY to None in all instances where the year gap is not exactly 1:
df['year_diff'] = df['year'] - df.groupby(['month'])['year'].shift()
df['year_diff'] = df['year_diff'].fillna(0)
df.loc[df['year_diff'] != 1, 'LY'] = None
Using this I'm getting the desired output that you added.
Does this work? I would also greatly appreciate a pandas-centric solution, as I spent some time on this and could not come up with one.
import numpy as np
import pandas as pd

df = pd.read_clipboard().set_index('Control')
df['yoy_diff'] = np.nan
for i in df.index:
    for j in df.index:
        if j - i == 12:
            df.loc[j, 'yoy_diff'] = df.loc[j, 'SALES_AMOUNT'] - df.loc[i, 'SALES_AMOUNT']
df
Output:
SALES_AMOUNT Store year month yoy_diff
Control
3 16793.14 A 2013 3 NaN
5 42901.61 A 2013 5 NaN
6 63059.72 A 2013 6 NaN
10 168471.43 A 2013 10 NaN
11 58570.72 A 2013 11 NaN
12 67526.71 A 2013 12 NaN
14 50649.07 A 2014 2 NaN
18 48819.97 A 2014 6 -14239.75
19 97100.77 A 2014 7 NaN
21 67778.40 A 2014 9 NaN
22 90327.52 A 2014 10 -78143.91
23 75703.12 A 2014 11 17132.40
24 26098.50 A 2014 12 -41428.21
25 81429.36 A 2015 1 NaN
26 19539.85 A 2015 2 -31109.22
27 71727.66 A 2015 3 NaN
28 20117.79 A 2015 4 NaN
29 44252.19 A 2015 6 NaN
30 68578.82 A 2015 7 19758.85
31 91483.39 A 2015 8 -5617.38
32 39220.87 A 2015 10 NaN
33 12224.11 A 2015 11 -55554.29
I would like to know how I can add a year-over-year growth rate to the following data in Pandas.
Date Total Managed Expenditure
0 2001 503.2
1 2002 529.9
2 2003 559.8
3 2004 593.2
4 2005 629.5
5 2006 652.1
6 2007 664.3
7 2008 688.2
8 2009 732.0
9 2010 759.2
10 2011 769.2
11 2012 759.8
12 2013 760.6
13 2014 753.3
14 2015 757.6
15 2016 753.9
Use Series.pct_change():
df['Total Managed Expenditure'].pct_change()
Out:
0 NaN
1 0.053060
2 0.056426
3 0.059664
4 0.061194
5 0.035902
6 0.018709
7 0.035978
8 0.063644
9 0.037158
10 0.013172
11 -0.012220
12 0.001053
13 -0.009598
14 0.005708
15 -0.004884
Name: Total Managed Expenditure, dtype: float64
To assign it back:
df['Growth Rate'] = df['Total Managed Expenditure'].pct_change()
I have a dataframe (2000 rows, 5 columns):
year month day GroupBy_Day
0 2013 11 6 3
1 2013 11 7 10
2 2013 11 8 4
3 2013 11 9 4
4 2013 11 10 4
...
24 2013 12 1 5
25 2013 12 2 4
26 2013 12 3 5
27 2013 12 4 2
28 2013 12 5 7
29 2013 12 6 1
I already grouped my elements and got the count for each day (column GroupBy_Day). I need to get the mean count by day number (e.g., for all days 6, we have a mean of (3+1)/2 = 2 occurrences), and subtract this value from GroupBy_Day in a new column.
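A minimal sketch of one way to do this, on a small sample reconstructed from the question: groupby("day") with transform("mean") broadcasts the per-day-number mean back to every row, so the deviation is a simple subtraction. The column names day_mean and diff_from_mean are made up for illustration.

```python
import pandas as pd

# Sample rows reconstructed from the question: day 6 appears twice
# (counts 3 and 1), so its mean is 2.
df = pd.DataFrame({
    "year": [2013, 2013, 2013, 2013],
    "month": [11, 11, 12, 12],
    "day": [6, 7, 5, 6],
    "GroupBy_Day": [3, 10, 7, 1],
})

# Mean count across all rows sharing the same day number, per row...
df["day_mean"] = df.groupby("day")["GroupBy_Day"].transform("mean")
# ...and the deviation of each row's count from that mean.
df["diff_from_mean"] = df["GroupBy_Day"] - df["day_mean"]
print(df)
```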