How to add computed columns in a multi-level column dataframe - python

I have a multi-level column dataframe along the lines of the one below:
How can I add a 'Sales' = 'Qty' * 'Price' column for each 'Year'?
The input dataframe in dictionary format:
{('Qty', 2001): [50, 50], ('Qty', 2002): [100, 10], ('Qty', 2003): [200, 20], ('Qty', 2004): [300, 30], ('Qty', 2005): [400, 40], ('Price', 2001): [20, 11], ('Price', 2002): [21, 12], ('Price', 2003): [22, 13], ('Price', 2004): [23, 14], ('Price', 2005): [24, 15]}
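For reference, that dictionary loads straight into a multi-level frame, since pandas turns tuple keys into a column MultiIndex:
import pandas as pd

d = {('Qty', 2001): [50, 50], ('Qty', 2002): [100, 10], ('Qty', 2003): [200, 20],
     ('Qty', 2004): [300, 30], ('Qty', 2005): [400, 40], ('Price', 2001): [20, 11],
     ('Price', 2002): [21, 12], ('Price', 2003): [22, 13], ('Price', 2004): [23, 14],
     ('Price', 2005): [24, 15]}
df = pd.DataFrame(d)  # columns: level 0 = measure ('Qty'/'Price'), level 1 = year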
Currently, I am splitting the dataframe for each year separately and adding a computed column. If there is an easier method, that would be great.
Here is the expected output

You can create the required column names with a list comprehension, and then simply assign the multiplication (df.mul).
new_cols = [('Sales', col) for col in df['Qty'].columns]
# [('Sales', 2001), ('Sales', 2002), ('Sales', 2003), ('Sales', 2004), ('Sales', 2005)]
df[new_cols] = df['Qty'].mul(df['Price'])
df
    Qty                      Price                     Sales
   2001 2002 2003 2004 2005   2001 2002 2003 2004 2005  2001  2002  2003  2004  2005
0    50  100  200  300  400     20   21   22   23   24  1000  2100  4400  6900  9600
1    50   10   20   30   40     11   12   13   14   15   550   120   260   420   600
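The new Sales columns are appended at the end of the frame; if you want the top-level blocks ordered together, an optional column sort does it (this orders level 0 alphabetically):
df = df.sort_index(axis=1)  # Price, Qty, Sales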

Let us stack to flatten the MultiIndex columns, then multiply and reshape back using unstack:
df.stack().eval('Sales = Price * Qty').unstack()
Price Qty Sales
2001 2002 2003 2004 2005 2001 2002 2003 2004 2005 2001 2002 2003 2004 2005
0 20 21 22 23 24 50 100 200 300 400 1000 2100 4400 6900 9600
1 11 12 13 14 15 50 10 20 30 40 550 120 260 420 600

Related

sum rows from two different data frames based on the value of columns

I have two data frames
df1
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 100
1 11 2023 Lyon Chicago,Paris 200
2 11 2023 Berlin Paris 300
3 12 2022 Newyork Chicago 150
4 12 2022 Lyon Chicago,Paris 250
5 12 2022 Berlin Paris 400
df2
ID Year Primary_Location Sales
0 11 2023 Chicago 150
1 11 2023 Paris 200
2 12 2022 Chicago 300
3 12 2022 Paris 350
I would like, for each group having the same ID & Year:
to add the Sales from df2 to the Sales in df1 wherever a Primary_Location of df2 appears (is contained) in the Secondary_Location of df1.
For example: for ID=11 & Year=2023, the Sales for Lyon would have the df2 Sales for Chicago and for Paris added to it.
The new Sales for that row would be 200+150+200=550.
The expected output would be :
df_primary_output
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 400
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
Here are the dataframes to start with :
import pandas as pd
df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
'Year': [2023, 2023, 2023, 2022, 2022, 2022],
'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
'Sales': [100, 200, 300, 150, 250, 400]
})
df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
'Year': [2023, 2023, 2022, 2022],
'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
'Sales': [150, 200, 300, 350]
})
EDIT: I get pandas.errors.InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
It would be great if the solution could work for these inputs as well:
df1
Day ID Year Primary_Location Secondary_Location Sales
0 1 11 2023 NewYork Chicago 100
1 1 11 2023 Berlin Chicago 300
2 1 11 2022 Newyork Chicago 150
3 1 11 2022 Berlin Chicago 400
df2
Day ID Year Primary_Location Sales
0 1 11 2023 Chicago 150
1 1 11 2022 Chicago 300
The expected output would be :
df_primary_output
Day ID Year Primary_Location Secondary_Location Sales
0 1 11 2023 NewYork Chicago 250
1 1 11 2023 Berlin Chicago 450
2 1 11 2022 Newyork Chicago 450
3 1 11 2022 Berlin Chicago 700
This should work:
s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)['Sales_2'].sum()
.add(df1['Sales']))
or
df3 = (df1.assign(Secondary_Location = df1['Secondary_Location'].str.split(',')) #split Secondary_Location column into list and explode it so each row has one value
.explode('Secondary_Location'))
(df3[['ID','Year','Secondary_Location']].apply(tuple,axis=1) #create a series where ID, Year and Secondary_Location are combined into a tuple so we can map the series created below to bring in the values needed.
.map(df2.set_index(['ID','Year','Primary_Location'])['Sales']) #create a series with lookup values in index, and make a series by selecting Sales column
.groupby(level=0).sum() #when exploding the column above, the index was repeated, so groupby(level=0).sum() will combine back to original form.
.add(df1['Sales'])) #add in original sales column
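If you hit the InvalidIndexError from the EDIT (the extra Day column makes the join keys non-unique), a merge-based variant that keeps every key explicit should avoid it. A sketch, assuming the EDIT frames; Sales_2 is just a scratch name:
exploded = (df1.assign(Secondary_Location=df1['Secondary_Location'].str.split(','))
               .explode('Secondary_Location')
               .reset_index())  # keep df1's row id in an 'index' column
merged = exploded.merge(
    df2.rename(columns={'Primary_Location': 'Secondary_Location', 'Sales': 'Sales_2'}),
    on=['Day', 'ID', 'Year', 'Secondary_Location'], how='left')
df1.assign(Sales=df1['Sales'] + merged.groupby('index')['Sales_2'].sum())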
Original Answer:
s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)
.agg({**dict.fromkeys(df1,'first'),**{s:','.join,'Sales_2':'sum'}})
.assign(Sales = lambda x: x['Sales'] + x['Sales_2'])
.drop('Sales_2',axis=1))
Output:
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 450
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
Your question is not so easy...
Proposed script:
import pandas as pd
df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
'Year': [2023, 2023, 2023, 2022, 2022, 2022],
'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
'Sales': [100, 200, 300, 150, 250, 400]
})
df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
'Year': [2023, 2023, 2022, 2022],
'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
'Sales': [150, 200, 300, 350]
})
tot = []
def func(g, iterdf, len_df1, i=0):
    global tot
    kv = {g['Primary_Location'].iloc[i]: g['Sales'].iloc[i] for i in range(len(g))}
    while i < len_df1:
        row = next(iterdf)[1]
        # Select specific df1 rows to modify by ID and Year criteria
        if g['ID'][g.index[0]] == row['ID'] and g['Year'][g.index[0]] == row['Year']:
            tot.append(row['Sales'] + sum([kv[town] for town in row['Secondary_Location'].split(',') if town in kv]))
        i += 1
df2.groupby(['ID', 'Year'], sort=False).apply(lambda g: func(g, df1.iterrows(), len(df1)))
df1['Sales'] = tot
print(df1)
Result:
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Lyon Chicago,Paris 550
2 11 2023 Berlin Paris 500
3 12 2022 Newyork Chicago 450
4 12 2022 Lyon Chicago,Paris 900
5 12 2022 Berlin Paris 750
Are you sure of the result in line 3? My script found 450, not 400.
Explanation:
1 - groupby(...).apply(...) sends the two groups from df2 one by one to func():
ID Year Primary_Location Sales
0 11 2023 Chicago 150
1 11 2023 Paris 200
ID Year Primary_Location Sales
2 12 2022 Chicago 300
3 12 2022 Paris 350
2 - kv builds dictionaries from df2 like this
(each call corresponds to one group, i.e. one ID + Year pair):
call 1 - {'Chicago': 150, 'Paris': 200}
call 2 - {'Chicago': 300, 'Paris': 350}
3 - The while loop combined with next(iterdf) steps through the rows of df1 one by one:
while i < len_df1:
    row = next(iterdf)[1]
    ...
    i += 1
4 - The if condition inside the while loop filters the df1 rows whose ID and Year correspond to df2's,
and for each correspondence appends the combined df1 and df2 sales values to the global list tot.
5 - tot is a global list that accumulates the values and is then assigned to df1 to create the Sales column:
df1['Sales'] = tot
Result with the new sample dataframes:
ID Year Primary_Location Secondary_Location Sales
0 11 2023 NewYork Chicago 250
1 11 2023 Berlin Chicago 450
2 11 2022 Newyork Chicago 450
3 11 2022 Berlin Chicago 700

Python melt dataframe

I am trying to convert a dataframe.
Currently I have something similar to this
Material Revenue 2007 Revenue 2008 Revenue 2009 Profit 2007 Profit 2008 Profit 2009
Mat A 50 55 60 10 15 20
Mat B 45 50 55 5 10 35
Mat C 75 80 85 35 30 45
And this is the conversion I am trying to achieve:
Material Revenue Profit Period
Mat A 50 10 2007
Mat A 55 5 2008
Mat A 75 35 2009
Mat B 55 15 2007
Mat B 50 10 2008
Mat B 80 30 2009
Mat C 60 20 2007
Mat C 55 35 2008
Mat C 85 45 2009
From what I have gathered I most likely have to use melt but I am not able to get the code to work.
Edit:
This code does seem to work, but it is too complicated to use on the real dataframe.
df1 = df.melt(id_vars=['Material'],
value_vars=['Revenue 2007', 'Revenue 2008', 'Revenue 2009'],
var_name='Period', value_name='Revenue')
df1["Period"]=df1['Period'].str[-4:]
df2 = df.melt(id_vars=['Material'],
value_vars=['Profit 2007', 'Profit 2008', 'Profit 2009'],
var_name='Period', value_name='Profit')
df1["Profit"]=df2["Profit"]
Melt all the value columns at once, split the resulting variable column on whitespace, then group and unstack to get one column per measure:
df1 = df.melt(id_vars=['Material'],
value_vars=['Revenue 2007', 'Revenue 2008', 'Revenue 2009','Profit 2007','Profit 2008','Profit 2009'],
var_name='Period', value_name='Revenue')
df2 = pd.concat([df1, df1['Period'].str.split(' ', expand=True)], axis=1).drop('Period', axis=1)
df2.rename(columns={0:'flg', 1:'Period'},inplace=True)
df2.groupby(['Material','Period','flg'])['Revenue'].sum().unstack().reset_index()
flg Material Period Profit Revenue
0 Mat A 2007 10 50
1 Mat A 2008 15 55
2 Mat A 2009 20 60
3 Mat B 2007 5 45
4 Mat B 2008 10 50
5 Mat B 2009 35 55
6 Mat C 2007 35 75
7 Mat C 2008 30 80
8 Mat C 2009 45 85
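For this 'measure year' column pattern, pandas.wide_to_long can also do the whole reshape in one call. A sketch, assuming the question's frame is named df:
import pandas as pd

out = (pd.wide_to_long(df, stubnames=['Revenue', 'Profit'],
                       i='Material', j='Period', sep=' ')
         .reset_index())  # columns: Material, Period, Revenue, Profit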
Is this what you're looking for?
left = df[[col for col in df.columns if col.startswith('Profit')] + ['Material']]\
.melt(id_vars='Material', var_name='Period', value_name='Profit')
left['Period'] = left['Period'].str.split(' ').str[1]
right = df[[col for col in df.columns if col.startswith('Revenue')] + ['Material']]\
.melt(id_vars='Material', var_name='Period', value_name='Revenue')
right['Period'] = right['Period'].str.split(' ').str[1]
print(left.merge(right).sort_values(by=['Material', 'Period']).reset_index(drop=True))
Output
Material Period Profit Revenue
0 Mat A 2007 10 50
1 Mat A 2008 15 55
2 Mat A 2009 20 60
3 Mat B 2007 5 45
4 Mat B 2008 10 50
5 Mat B 2009 35 55
6 Mat C 2007 35 75
7 Mat C 2008 30 80
8 Mat C 2009 45 85
df = pd.melt(df, id_vars=['Material'])
df['Period'] = df.variable.str.split(" ").str[1]
df['type'] = df.variable.str.split(" ").str[0]
df = df.drop('variable', axis=1)
df = (
df
.groupby(['Material','Period','type'])
.sum()
.unstack('type')
.reset_index()
)
df.columns = ["Material", "Period", "Profit", "Revenue"]
df['Material'] = 'Mat ' + df['Material'].astype(str)
df = df[["Material","Revenue","Profit","Period"]]
df
Material Revenue Profit Period
0 Mat A 50 10 2007
1 Mat A 55 15 2008
2 Mat A 60 20 2009
3 Mat B 45 5 2007
4 Mat B 50 10 2008
5 Mat B 55 35 2009
6 Mat C 75 35 2007
7 Mat C 80 30 2008
8 Mat C 85 45 2009

join missing rows from another table columns in Python

I have two tables (as DataFrames) in Python. One is as follows:
Country Year totmigrants
Afghanistan 2000
Afghanistan 2001
Afghanistan 2002
Afghanistan 2003
Afghanistan 2004
Afghanistan 2005
Afghanistan 2006
Afghanistan 2007
Afghanistan 2008
Algeria 2000
Algeria 2001
Algeria 2002
...
Zimbabwe 2008
the other one is a separate DataFrame for each single year (9 separate DataFrames overall, 2000-2008):
Year=2000
---------------------------------------
Country totmigrants Gender Total
Afghanistan 73 M 70
Afghanistan F 3
Albania 11 M 5
Albania F 6
Algeria 52 M 44
...
Zimbabwe F 1
I want to join them together, with the first table as the left side of an outer join.
I had this in mind, but it only works for merging on columns:
new=pd.merge(table1,table2,how='left',on=['Country', 'Year'])
What I want to see is the total number of migrants plus the F and M counts from each yearly data frame appearing as new columns in the first table:
Country Year totmigrants F M
Afghanistan 2000 73 3 70
Afghanistan 2001 table3
Afghanistan 2002 table4
Afghanistan 2003 ...
Afghanistan 2004
Afghanistan 2005
Afghanistan 2006
Afghanistan 2007
Afghanistan 2008
Algeria 2000 52 8 44
Algeria 2001 table3 ...
Algeria 2002 table4 ...
...
Zimbabwe 2008 ... ...
Is there a specific method for this merging, or what function do I need to use?
Here's how to combine the data from the yearly dataframes. Let's assume that the yearly dataframes somehow have been stored in a dictionary:
df = {2000: ..., 2001: ..., ..., 2008: ...}
yearly = []
for N in df.keys():
    tmp = df[N].pivot(index='Country', columns='Gender', values='Total').fillna(0).astype(int)
    tmp['Year'] = N  # store the year
    tmp['totmigrants'] = tmp['M'] + tmp['F']
    yearly.append(tmp)
df = pd.concat(yearly)
print(df)
#Gender F M Year totmigrants
#Country
#Afghanistan 3 70 2000 73
#Albania 6 5 2000 11
#Algeria 0 44 2000 44
#Zimbabwe 1 0 2000 1
Now you can merge df with the first dataframe using ['Country','Year'] as the keys.
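That final merge could look like this (a sketch; table1 stands for the first table, whose empty totmigrants column is dropped so the computed one survives):
new = pd.merge(table1.drop(columns='totmigrants'), df.reset_index(),
               how='left', on=['Country', 'Year'])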
I am not sure you need the first table. I did the following, I hope it helps.
import numpy as np
import pandas as pd

data2000 = np.array([['','Country','totmigrants','Gender', 'Total'],
                     ['1','Afghanistan', 73, 'M', 70],
                     ['2','Afghanistan', None, 'F', 3],
                     ['3','Albania', 11, 'M', 5],
                     ['4','Albania', None ,'F', 6]])
data2001 = np.array([['','Country','totmigrants','Gender', 'Total'],
                     ['1','Afghanistan', 75, 'M', 60],
                     ['2','Afghanistan', None, 'F', 15],
                     ['3','Albania', 15, 'M', 11],
                     ['4','Albania', None ,'F', 4]])
# and so on
datas = {'2000': data2000, '2001': data2001}
reg_dfs = []
for year, data in datas.items():
    df = pd.DataFrame(data=data[1:, 1:], index=data[1:, 0], columns=data[0, 1:])
    # self-merge on Country pairs each M row with its F row
    new = pd.merge(df, df, how='inner', on=['Country'])
    reg_df = new.query('Gender_x == "M" & Gender_y == "F"')[['Country', 'Total_x', 'Total_y', 'totmigrants_x']]
    reg_df.columns = ['Country', 'M', 'F', 'Total']
    reg_df['Year'] = year
    reg_dfs.append(reg_df)
print(pd.concat(reg_dfs).sort_values(['Country']))
# Country M F Total Year
#1 Afghanistan 70 3 73 2000
#1 Afghanistan 60 15 75 2001
#5 Albania 5 6 11 2000
#5 Albania 11 4 15 2001

how to shift single value of a pandas dataframe column

Using pandas first_valid_index() to get the index of the first non-null value of a column, how can I shift a single value of the column rather than the whole column? i.e.
data = {'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016,2017, 2018, 2019],
'columnA': [10, 21, 20, 10, 39, 30, 31,45, 23, 56],
'columnB': [None, None, None, 10, 39, 30, 31,45, 23, 56],
'total': [100, 200, 300, 400, 500, 600, 700,800, 900, 1000]}
df = pd.DataFrame(data)
df = df.set_index('year')
print(df)
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 10 400
2014 39 39 500
2015 30 30 600
2016 31 31 700
2017 45 45 800
2018 23 23 900
2019 56 56 1000
for col in df.columns:
    if col not in ['total']:
        idx = df[col].first_valid_index()
        df.loc[idx, col] = df.loc[idx, col] + df.loc[idx, 'total'].shift(1)
print(df)
AttributeError: 'numpy.float64' object has no attribute 'shift'
desired result:
print(df)
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310 400
2014 39 39 500
2015 30 30 600
2016 31 31 700
2017 45 45 800
2018 23 23 900
2019 56 56 1000
is that what you want?
In [63]: idx = df.columnB.first_valid_index()
In [64]: df.loc[idx, 'columnB'] += df.total.shift().loc[idx]
In [65]: df
Out[65]:
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310.0 400
2014 39 39.0 500
2015 30 30.0 600
2016 31 31.0 700
2017 45 45.0 800
2018 23 23.0 900
2019 56 56.0 1000
UPDATE: starting from Pandas 0.20.1 the .ix indexer is deprecated, in favor of the more strict .iloc and .loc indexers.
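The same one-cell update generalizes to every non-total column, with a guard for columns whose first valid value is already in the first row (a sketch on the question's frame):
shifted_total = df['total'].shift()
for col in df.columns.drop('total'):
    idx = df[col].first_valid_index()
    if idx is not None and idx != df.index[0]:  # nothing to borrow above the first row
        df.loc[idx, col] += shifted_total.loc[idx]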
You can filter the column names to those with at least one NaN value, by taking the union of the fully non-null columns with the total column and skipping those:
for col in df.columns:
    if col not in pd.Index(['total']).union(df.columns[~df.isnull().any()]):
        idx = df[col].first_valid_index()
        df.loc[idx, col] += df.total.shift().loc[idx]
print(df)
columnA columnB total
year
2010 10 NaN 100
2011 21 NaN 200
2012 20 NaN 300
2013 10 310.0 400
2014 39 39.0 500
2015 30 30.0 600
2016 31 31.0 700
2017 45 45.0 800
2018 23 23.0 900
2019 56 56.0 1000

Enter Missing Year Amounts with Zeros After GroupBy in Pandas

I am grouping the following rows.
df = df.groupby(['id','year']).sum().sort_index(ascending=[True, False])
print df
amount
id year
1 2009 120
2008 240
2007 240
2006 240
2005 240
2 2014 100
2013 50
2012 50
2011 100
2010 50
2006 100
... ...
Is there a way to add the years that do not have any values, with the amount equal to zero, down to a specific year (in this case 2005), as I am showing below?
Expected Output:
amount
id year
1 2015 0
2014 0
2013 0
2012 0
2011 0
2010 0
2009 120
2008 240
2007 240
2006 240
2005 240
2 2015 0
2014 100
2013 50
2012 50
2011 100
2010 50
2009 0
2008 0
2007 0
2006 100
2005 0
... ...
Starting with your first DataFrame, this will add all years that occur with some id to all ids.
df = df.unstack().fillna(0).stack()
e.g.
In [16]: df
Out[16]:
amt
id year
1 2001 1
2002 2
2003 3
2 2002 4
2003 5
2004 6
In [17]: df = df.unstack().fillna(0).stack()
In [18]: df
Out[18]:
amt
id year
1 2001 1
2002 2
2003 3
2004 0
2 2001 0
2002 4
2003 5
2004 6
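Note that this only adds years that occur for at least one id. If you also need years missing from every id (the expected output runs over a fixed 2005-2015 window), one option is to reindex against a full MultiIndex built from the desired range (a sketch, assuming the id/year index from the groupby):
import pandas as pd

years = range(2005, 2016)
ids = df.index.get_level_values('id').unique()
full = pd.MultiIndex.from_product([ids, years], names=['id', 'year'])
df = df.reindex(full, fill_value=0)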
