Unstack "multi-indexed" column in pandas - python

In pandas I have a DataFrame as follows (the first line holds the column names; the second is currently just an ordinary row):
2012 2013 2012 2013
women women men men
0 14 43 24 45
1 34 54 35 65
and I would like to reshape it into:
women men
2012 0 14 24
2012 1 34 35
2013 0 43 45
2013 1 54 65
Using df.stack and df.unstack did not get me anywhere.
Is there an elegant solution?

In [5]: df
Out[5]:
2012 2013
women men women men
0 0 1 2 3
1 4 5 6 7
The idea is to first stack the first level of the columns into the index, and then swap the two index levels (pandas.DataFrame.swaplevel):
In [6]: df.stack(level=0).swaplevel(0,1,axis=0)
Out[6]:
men women
2012 0 1 0
2013 0 3 2
2012 1 5 4
2013 1 7 6

df.stack is most likely what you want. As shown below, you need to specify that you want the first level.
In [79]: df = pd.DataFrame(0., index=[0,1], columns=pd.MultiIndex.from_product([[2012,2013], ['women','men']]))
In [83]: df.stack(level=0)
Out[83]:
men women
0 2012 0 0
2013 0 0
1 2012 0 0
2013 0 0
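To reach the exact layout asked for in the question (year as the outer index level, rows sorted by year, women before men), the stack and swaplevel steps above can be followed by a sort. This is only a minimal sketch built from the question's numbers; the explicit column selection at the end is an addition made here so that the result does not depend on how a given pandas version orders the stacked columns.

```python
import pandas as pd

# Rebuild the frame from the question: (year, group) pairs as column labels.
columns = pd.MultiIndex.from_tuples(
    [(2012, "women"), (2013, "women"), (2012, "men"), (2013, "men")]
)
df = pd.DataFrame([[14, 43, 24, 45], [34, 54, 35, 65]], columns=columns)

out = (
    df.stack(level=0)       # move the year level from the columns into the index
      .swaplevel(0, 1)      # make the year the outer index level
      .sort_index()         # keep all rows of one year together
      [["women", "men"]]    # fix the column order explicitly
)
print(out)
```

Depending on the pandas version, df.stack may emit a FutureWarning about its new implementation; the reshaped values are the same either way.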

Related

Sum rows of grouped data frame based on a specific column

I have a data frame in which I would like to create a new column, df["NEW_Salary"], from the sum of different rows within one group.
If the df is grouped by the columns Year, Month and Day, I want, for each group, to add the Salary of the rows where Combination is True to the rows where Combination is False.
import pandas as pd
data = {"Year":[2002,2002,2002,2002,2002,2010,2010,2010,2010,2010],
"Name":['Jason','Tom','KimJason','KimTom','Kim','Jason','Tom','KimJason','KimTom','Kim'],
"Combination":[False,False,True,True,False,False,False,True,True,False],
"Salary":[10,20,25,25,30,20,30,35,35,40]
}
df=pd.DataFrame(data)
Year Month Day Name Combination Salary
0 2002 1 15 Jason False 10
1 2002 1 15 Tom False 20
2 2002 1 15 KimJason True 25
3 2002 1 15 KimTom True 25
4 2002 1 15 Kim False 30
5 2010 3 20 Jason False 20
6 2010 3 20 Tom False 30
7 2010 3 20 KimJason True 35
8 2010 3 20 KimTom True 35
9 2010 3 20 Kim False 40
10 2002 4 5 Mary False 10
11 2002 4 5 MaryTom True 20
12 2002 4 5 Tom False 30
df["NEW_Salary"] would be created as follows:
For the row where Name is KimJason, Salary would be added to the Salary of the rows where Name is Kim and Jason.
For the row where Name is KimTom, Salary would likewise be added to the Salary of the rows where Name is Kim and Tom.
The rows of KimTom and KimJason would have the same value in the new column NEW_Salary as in Salary.
The expected output:
Year Month Day Name Combination Salary NEW_Salary
0 2002 1 15 Jason False 10 35
1 2002 1 15 Tom False 20 45
2 2002 1 15 KimJason True 25 25
3 2002 1 15 KimTom True 25 25
4 2002 1 15 Kim False 30 80
5 2010 3 20 Jason False 20 55
6 2010 3 20 Tom False 30 65
7 2010 3 20 KimJason True 35 35
8 2010 3 20 KimTom True 35 35
9 2010 3 20 Kim False 40 110
10 2002 4 5 Mary False 10 30
11 2002 4 5 MaryTom True 20 20
12 2002 4 5 Tom False 30 50
Is there an easy way to achieve this output, no matter how many groups I have?
Here's how you can do it; as far as I can tell, it should also work for any number of groups.
First extract all rows with combination names into a dictionary:
lookup = dict(tuple(df.loc[df['Combination']==True].groupby('Name')[['Year', 'Salary']]))
for key,value in lookup.items():
    print(f"{key=}:\n{value}")
which looks like this (the value for each key is a DataFrame):
key='KimJason':
Year Salary
2 2002 25
7 2010 35
key='KimTom':
Year Salary
3 2002 25
8 2010 35
Then filter for the rows where Combination is False and, row by row, take the row's Salary and add every Salary found in the lookup dictionary for a combination name that contains that Name, in the same Year. Finally, update the df with the new Salary values.
s = (df.loc[df['Combination']==False]
       .apply(lambda row:
              row['Salary'] + sum(lookup[x].loc[lookup[x]['Year']==row['Year'], 'Salary'].squeeze()
                                  for x in lookup
                                  if row['Name'] in x),
              axis=1)
    )
# write the new values back into the Salary column
df.loc[s.index, 'Salary'] = s
print(df)
Output df:
Year Name Combination Salary
0 2002 Jason False 35
1 2002 Tom False 45
2 2002 KimJason True 25
3 2002 KimTom True 25
4 2002 Kim False 80
5 2010 Jason False 55
6 2010 Tom False 65
7 2010 KimJason True 35
8 2010 KimTom True 35
9 2010 Kim False 110
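The answer above overwrites Salary, whereas the question asks for a separate NEW_Salary column. The following is a minimal sketch of that variant, assuming the ten-row data dictionary from the question (so groups are identified by Year only) and assuming that a combination row belongs to a person whenever that person's Name occurs inside the combination's Name; the helper name new_salary is made up for this illustration. Plain substring matching would also pair up names such as "Tom" and "Tomas", so real data may need a stricter rule.

```python
import pandas as pd

data = {
    "Year": [2002, 2002, 2002, 2002, 2002, 2010, 2010, 2010, 2010, 2010],
    "Name": ["Jason", "Tom", "KimJason", "KimTom", "Kim",
             "Jason", "Tom", "KimJason", "KimTom", "Kim"],
    "Combination": [False, False, True, True, False,
                    False, False, True, True, False],
    "Salary": [10, 20, 25, 25, 30, 20, 30, 35, 35, 40],
}
df = pd.DataFrame(data)

# All combination rows, looked up once.
combos = df[df["Combination"]]

def new_salary(row):
    """Hypothetical helper: own salary plus matching combination salaries."""
    if row["Combination"]:
        return row["Salary"]  # combination rows keep their salary
    same_year = combos[combos["Year"] == row["Year"]]
    # Assumption: a combination matches if its name contains this name.
    matches = same_year["Name"].str.contains(row["Name"], regex=False)
    return row["Salary"] + same_year.loc[matches, "Salary"].sum()

df["NEW_Salary"] = df.apply(new_salary, axis=1)
print(df)
```

For the sample data this gives 35, 45, 25, 25, 80 for 2002 and 55, 65, 35, 35, 110 for 2010, in line with the expected output in the question.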

How to do rolling sum with conditional window criteria on different index levels in Python

I want to do a rolling sum based on different levels of the index but am struggling to make it work. Instead of describing the problem in the abstract, I am giving the demo input and desired output below, along with the kind of insight I am looking for.
I have multiple brands, each with sales of various item categories, grouped by year, month and day as shown below. What I want is a dynamic rolling sum at day level, rolled over a window of years.
For example, someone might ask:
Demo question 1) Up to a certain day (not including that day), what were the last 2 years' sales of that particular category for that particular brand?
I need to be able to answer this for every single day, i.e. every single row should have a number as shown in Table 2.0.
I want to write the code in such a way that if the question changes from 2 years to 3 years, I just need to change a number. I also need to do the same thing at month level.
Demo question 2) Up to a certain day (not including that day), what were the last 3 months' sales of that particular category for that particular year for that particular brand?
Below is the demo input.
The tables are grouped by brand, category, year, month and day, and hold the sum of sales from a master table that had all the info and the sales at hour level for each day.
Table 1.0
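Brand  Category        Year  Month  Day  Sales
ABC    Big Appliances  2021      9    3      0
ABC    Clothing        2021      9    2      0
ABC    Electronics     2020     10   18      2
ABC    Utensils        2020     10   18      0
ABC    Utensils        2021      9    2      4
ABC    Utensils        2021      9    3      0
XYZ    Big Appliances  2012      4   29      7
XYZ    Big Appliances  2013      4    7      6
XYZ    Clothing        2012      4   29      3
XYZ    Electronics     2013      4    9      1
XYZ    Electronics     2013      4   27      2
XYZ    Electronics     2013      5    4      5
XYZ    Electronics     2015      4   27      7
XYZ    Electronics     2015      5    2      2
XYZ    Fans            2013      4   14      4
XYZ    Fans            2013      5    4      0
XYZ    Fans            2015      4   18      1
XYZ    Fans            2015      5   17     11
XYZ    Fans            2016      4   12     18
XYZ    Furniture       2012      5    4      1
XYZ    Furniture       2012      5    8      6
XYZ    Furniture       2012      5   20      4
XYZ    Furniture       2013      4    5      1
XYZ    Furniture       2013      4    7      8
XYZ    Furniture       2013      4    9      2
XYZ    Furniture       2015      4   18     12
XYZ    Furniture       2015      4   27     15
XYZ    Furniture       2015      5    2      4
XYZ    Furniture       2015      5   17      3
XYZ    Musical-inst    2012      5   18     10
XYZ    Musical-inst    2013      4    5      6
XYZ    Musical-inst    2015      4   16     10
XYZ    Musical-inst    2015      4   18      0
XYZ    Musical-inst    2016      4   12      1
XYZ    Musical-inst    2016      4   16     13
XYZ    Utencils        2012      5    8      2
XYZ    Utencils        2016      4   16      3
XYZ    Utencils        2016      4   18      2
XYZ    Utencils        2017      4   12     13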
Below is the desired output for demo question 1, based on the demo table (cumulative sum over the last 2 years, not including that day).
Table 2.0
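Brand  Category        Year  Month  Day  Sales  Conditional Cumsum (till last 2 years)
ABC    Big Appliances  2021      9    3      0      0
ABC    Clothing        2021      9    2      0      0
ABC    Electronics     2020     10   18      2      0
ABC    Utensils        2020     10   18      0      0
ABC    Utensils        2021      9    2      4      0
ABC    Utensils        2021      9    3      0      4
XYZ    Big Appliances  2012      4   29      7      0
XYZ    Big Appliances  2013      4    7      6      7
XYZ    Clothing        2012      4   29      3      0
XYZ    Electronics     2013      4    9      1      0
XYZ    Electronics     2013      4   27      2      1
XYZ    Electronics     2013      5    4      5      3
XYZ    Electronics     2015      4   27      7      8
XYZ    Electronics     2015      5    2      2     15
XYZ    Fans            2013      4   14      4      0
XYZ    Fans            2013      5    4      0      4
XYZ    Fans            2015      4   18      1      4
XYZ    Fans            2015      5   17     11      5
XYZ    Fans            2016      4   12     18     12
XYZ    Furniture       2012      5    4      1      0
XYZ    Furniture       2012      5    8      6      1
XYZ    Furniture       2012      5   20      4      7
XYZ    Furniture       2013      4    5      1     11
XYZ    Furniture       2013      4    7      8     12
XYZ    Furniture       2013      4    9      2     20
XYZ    Furniture       2015      4   18     12     11
XYZ    Furniture       2015      4   27     15     23
XYZ    Furniture       2015      5    2      4     38
XYZ    Furniture       2015      5   17      3     42
XYZ    Musical-inst    2012      5   18     10      0
XYZ    Musical-inst    2013      4    5      6     10
XYZ    Musical-inst    2015      4   16     10      6
XYZ    Musical-inst    2015      4   18      0     16
XYZ    Musical-inst    2016      4   12      1     10
XYZ    Musical-inst    2016      4   16     13     11
XYZ    Utencils        2012      5    8      2      0
XYZ    Utencils        2016      4   16      3      0
XYZ    Utencils        2016      4   18      2      3
XYZ    Utencils        2017      4   12     13      5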
End thoughts:
The idea is basically to roll a window over the Year column, maintaining the 2-year span criterion, and keep summing the sales figures.
P.S. I really need a fast solution because the data is huge. I tried a row-wise .apply function, but it was not feasible at this size. A better solution using some kind of grouped rolling sum or supporting columns would be really helpful.
Here is a sample solution for the above problem.
I have considered just one product so that the solution stays simple.
Code:
from datetime import date,timedelta
Input={"Utencils": [[2012,5,8,2],[2016,4,16,3],[2017,4,12,13]]}
Input1=Input["Utencils"]
Limit=timedelta(365*2)
cumsum=0
lis=[]
Tot=[]
for i in range(len(Input1)):
    if(lis):
        while(lis):
            idx=lis[0]
            Y,M,D=Input1[i][:3]
            reqDate=date(Y,M,D)-Limit
            Y,M,D=Input1[idx][:3]
            if(date(Y,M,D)<=reqDate):
                lis.pop(0)
                cumsum-=Input1[idx][3]
            else:
                break
    Tot.append(cumsum)
    lis.append(i)
    cumsum+=Input1[i][3]
print(Tot)
Here Tot would output the required cumsum column for the given data.
Output:
[0, 0, 3]
You can change the time span by setting the number of days in the Limit variable.
Hope this solves the problem you are looking for.
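Since the question asks for something faster than a row-wise loop, the same day-based window can also be sketched with pandas' own time-based rolling, applied per brand and category. This is a sketch under assumptions, not a drop-in answer: it uses a fixed 730-day window like Limit above (the table in the question may follow a calendar-year rule instead, so other rows can differ), it assumes the rows of each group are already sorted by date, and the column names Date and Cumsum are introduced here for illustration.

```python
import pandas as pd

# A few rows taken from Table 1.0 of the question.
df = pd.DataFrame(
    {
        "Brand": ["ABC"] * 3 + ["XYZ"] * 4,
        "Category": ["Utensils"] * 3 + ["Utencils"] * 4,
        "Year": [2020, 2021, 2021, 2012, 2016, 2016, 2017],
        "Month": [10, 9, 9, 5, 4, 4, 4],
        "Day": [18, 2, 3, 8, 16, 18, 12],
        "Sales": [0, 4, 0, 2, 3, 2, 13],
    }
)
# Build one datetime column from the year/month/day parts.
df["Date"] = pd.to_datetime(df[["Year", "Month", "Day"]].rename(columns=str.lower))

# Per (Brand, Category): sum of Sales over the preceding 730 days,
# excluding the current day itself (closed="left").
rolled = (
    df.set_index("Date")
      .groupby(["Brand", "Category"])["Sales"]
      .rolling("730D", closed="left")
      .sum()
      .fillna(0)          # an empty window yields NaN
      .astype(int)
      .rename("Cumsum")
      .reset_index()
)
# Attach the result to the original rows (keys are unique per row here).
out = df.merge(rolled, on=["Brand", "Category", "Date"], how="left")
print(out[["Brand", "Category", "Date", "Sales", "Cumsum"]])
```

Changing "730D" to another offset, such as "1095D" or "90D", changes the span without touching the rest of the code.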

How to unpack lists of tuples of varying length in a pandas DataFrame?

ID LIST_OF_TUPLE (2col)
1 [('2012','12'), ('2012','33'), ('2014', '82')]
2 NA
3 [('2012','12')]
4 [('2012','12'), ('2012','33'), ('2014', '82'), ('2022', '67')]
Result:
ID TUP_1 TUP_2(3col)
1 2012 12
1 2012 33
1 2014 82
3 2012 12
4 2012 12
4 2012 33
4 2014 82
4 2022 67
Thanks in advance.
Explode the column, build a DataFrame from the exploded tuples, and then join it back to the ID column:
s = df['LIST_OF_TUPLE'].explode()
out = (df[['ID']].join(pd.DataFrame(s.tolist(),index=s.index)
.add_prefix("TUP_")).reset_index(drop=True)) #you can chain a dropna if reqd
print(out)
ID TUP_0 TUP_1
0 1 2012 12
1 1 2012 33
2 1 2014 82
3 2 NaN None
4 3 2012 12
5 4 2012 12
6 4 2012 33
7 4 2014 82
8 4 2022 67
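The output above still contains a row for ID 2, while the desired result in the question does not. A minimal sketch of one way to drop it, assuming the missing entry is stored as None or NaN: remove the missing rows before exploding, and carry ID along as the index. The column names TUP_1 and TUP_2 follow the question's desired output.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4],
        "LIST_OF_TUPLE": [
            [("2012", "12"), ("2012", "33"), ("2014", "82")],
            None,  # assumption: the NA entry is a missing value
            [("2012", "12")],
            [("2012", "12"), ("2012", "33"), ("2014", "82"), ("2022", "67")],
        ],
    }
)

# Drop missing lists first, then turn every tuple into its own row.
s = df.set_index("ID")["LIST_OF_TUPLE"].dropna().explode()

# Split each tuple into two columns and bring ID back as a column.
out = pd.DataFrame(s.tolist(), index=s.index, columns=["TUP_1", "TUP_2"]).reset_index()
print(out)
```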

How can I get this series to a pandas dataframe?

I have some data and after using a groupby function I now have a series that looks like this:
year
1997 15
1998 22
1999 24
2000 24
2001 28
2002 11
2003 15
2004 19
2005 10
2006 10
2007 21
2008 26
2009 23
2010 16
2011 33
2012 19
2013 26
2014 25
How can I create a pandas DataFrame from this, with year as one column and the other column named sightings?
I am a pandas novice so don't really know what I am doing. I have tried the reindex and unstack functions but haven't been able to get what I want...
You can use reset_index and rename columns:
print (df.reset_index())
index year
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
print (df.reset_index().rename(columns=({'index':'year','year':'sightings'})))
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
Another solution is to set the column names from a list of names:
df1 = df.reset_index()
df1.columns = ['year','sightings']
print (df1)
year sightings
0 1997 15
1 1998 22
2 1999 24
3 2000 24
4 2001 28
5 2002 11
6 2003 15
7 2004 19
8 2005 10
9 2006 10
10 2007 21
11 2008 26
12 2009 23
13 2010 16
14 2011 33
15 2012 19
16 2013 26
17 2014 25
EDIT:
Sometimes it helps to add the parameter as_index=False to groupby so that it returns a DataFrame:
import pandas as pd
df = pd.DataFrame({'A':[1,1,3],
'B':[4,5,6]})
print (df)
A B
0 1 4
1 1 5
2 3 6
print (df.groupby('A')['B'].sum())
A
1 9
3 6
Name: B, dtype: int64
print (df.groupby('A', as_index=False)['B'].sum())
A B
0 1 9
1 3 6
s.rename('sightings').reset_index()
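A small sketch of what this one-liner does, assuming s is the grouped Series with its index named year; only the first three values from the question are used. Series.reset_index also accepts a name argument, which gives the same result in one call.

```python
import pandas as pd

# Stand-in for the groupby result: values indexed by year.
s = pd.Series([15, 22, 24], index=pd.Index([1997, 1998, 1999], name="year"))

# Name the values, then move the index into a regular column.
out = s.rename("sightings").reset_index()

# Equivalent shortcut: name the value column while resetting the index.
out_alt = s.reset_index(name="sightings")
print(out)
```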
I've also used this method during the groupby stage to put the results straight into a dataframe:
df2 = df1.groupby(['Year']).count()
df3 = pd.DataFrame(df2).reset_index()
If your original dataframe, df1, had "Year" and "Sightings" as its two columns, then df3 should have each year listed under "Year" and the count (or sum, average, whatever) listed under "Sightings".
If not, you can change the column names by doing the following:
df3.columns = ['Year','Sightings']
or
df3 = df3.rename(columns={'oldname_A': 'Year', 'oldname_B': 'Sightings'})

pandas: conditionally return a column's value

I am trying to make a new column called 'wage_rate' that fills in the appropriate wage rate for the employee based on the year of the observation.
In other words, my list looks something like this:
eecode year w2011 w2012 w2013
1 2012 7 8 9
1 2013 7 8 9
2 2011 20 25 25
2 2012 20 25 25
2 2013 20 25 25
And I want to return, in a new column, 8 for the first row, 9 for the second, and then 20, 25, 25.
One way would be to use apply by constructing column name for each row based on year like 'w' + str(x.year).
In [41]: df.apply(lambda x: x['w' + str(x.year)], axis=1)
Out[41]:
0 8
1 9
2 20
3 25
4 25
dtype: int64
Details:
In [42]: df
Out[42]:
eecode year w2011 w2012 w2013
0 1 2012 7 8 9
1 1 2013 7 8 9
2 2 2011 20 25 25
3 2 2012 20 25 25
4 2 2013 20 25 25
In [43]: df['wage_rate'] = df.apply(lambda x: x['w' + str(x.year)], axis=1)
In [44]: df
Out[44]:
eecode year w2011 w2012 w2013 wage_rate
0 1 2012 7 8 9 8
1 1 2013 7 8 9 9
2 2 2011 20 25 25 20
3 2 2012 20 25 25 25
4 2 2013 20 25 25 25
values = [ row['w%s'% row['year']] for key, row in df.iterrows() ]
df['wage_rate'] = values # create the new columns
This solution is using an explicit loop, thus is likely slower than other pure-pandas solutions, but on the other hand it is simple and readable.
You can rename the wage columns so that they match the values in the year column, using re.sub:
In [70]:
import re
df.columns = [re.sub(r'^w(?=\d{4}$)', '', column) for column in df.columns]
In [80]:
df.columns
Out[80]:
Index([u'eecode', u'year', u'2011', u'2012', u'2013', u'wage_rate'], dtype='object')
Then get the value with the following:
df['wage_rate'] = df.apply(lambda x : x[str(x.year)] , axis = 1)
Out[79]:
eecode year 2011 2012 2013 wage_rate
1 2012 7 8 9 8
1 2013 7 8 9 9
2 2011 20 25 25 20
2 2012 20 25 25 25
2 2013 20 25 25 25
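All of the answers above evaluate Python code once per row. For larger frames, a positional lookup with NumPy is a possible alternative; this is a different technique from the ones in the answers, sketched here under the assumption that every year has a matching w<year> column (get_indexer returns -1 for a missing column, which would silently pick the last column).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "eecode": [1, 1, 2, 2, 2],
        "year": [2012, 2013, 2011, 2012, 2013],
        "w2011": [7, 7, 20, 20, 20],
        "w2012": [8, 8, 25, 25, 25],
        "w2013": [9, 9, 25, 25, 25],
    }
)

# Name of the wage column that applies to each row, e.g. "w2012".
wage_cols = "w" + df["year"].astype(str)

# Position of that column in df for each row.
col_pos = df.columns.get_indexer(wage_cols)

# Pick one value per row: row i, column col_pos[i].
df["wage_rate"] = df.to_numpy()[np.arange(len(df)), col_pos]
print(df)
```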
