Apply function across all columns using the column name - Python, Pandas

Basically:
Is there a way to apply a function that uses the column name of a dataframe in Pandas?
Like this:
df['label'] = df.apply(lambda x: '_'.join(labels_dict[column_name][x]), axis=1)
where column_name is the column that the apply is currently 'processing'.
Details:
I'd like to create a label for each row of a dataframe, based on a dictionary.
Let's take the dataframe df:
df = pd.DataFrame({'Application': ['Compressors', 'Fans', 'Fans', 'Material Handling'],
                   'HP': ['0.25', '0.25', '3.0', '15.0'],
                   'Sector': ['Commercial', 'Industrial', 'Commercial', 'Residential']},
                  index=[0, 1, 2, 3])
After I apply the label:
In [139]: df['label'] = df.apply(lambda x: '_'.join(x), axis=1)
In [140]: df
Out[140]:
          Application    HP       Sector                               label
0         Compressors  0.25   Commercial         Compressors_0.25_Commercial
1                Fans  0.25   Industrial                Fans_0.25_Industrial
2                Fans   3.0   Commercial                 Fans_3.0_Commercial
3   Material Handling  15.0  Residential  Material Handling_15.0_Residential
But the label is too long, especially when I consider the full dataframe, which contains a lot more columns. What I want is to use a dictionary to shorten the fields that come from the columns (I pasted the code for the dictionary at the end of the question).
I can do that for one field:
In [145]: df['application_label'] = df['Application'].apply(
              lambda x: labels_dict['Application'][x])
In [146]: df
Out[146]:
          Application    HP       Sector application_label
0         Compressors  0.25   Commercial               cmp
1                Fans  0.25   Industrial               fan
2                Fans   3.0   Commercial               fan
3   Material Handling  15.0  Residential               mat
But I want to do it for all the fields, like I did in snippet #2. So I'd like to do something like:
df['label'] = df.apply(lambda x: '_'.join(labels_dict[column_name][x]), axis=1)
where column_name is the column of df to which the function is currently being applied. Is there a way to access that information?
Thank you for your help!
I defined the dictionary as:
In [141]: labels_dict
Out[141]:
{'Application': {'Compressors': 'cmp',
                 'Fans': 'fan',
                 'Material Handling': 'mat',
                 'Other/General': 'oth',
                 'Pumps': 'pum'},
 'ECG': {'Polyphase': 'pol',
         'Single-Phase (High LRT)': 'sph',
         'Single-Phase (Low LRT)': 'spl',
         'Single-Phase (Med LRT)': 'spm'},
 'Efficiency Level': {'EL0': 'el0',
                      'EL1': 'el1',
                      'EL2': 'el2',
                      'EL3': 'el3',
                      'EL4': 'el4'},
 'HP': {0.25: 1.0,
        0.33: 2.0,
        0.5: 3.0,
        0.75: 4.0,
        1.0: 5.0,
        1.5: 6.0,
        2.0: 7.0,
        3.0: 8.0,
        10.0: 9.0,
        15.0: 10.0},
 'Sector': {'Commercial': 'com',
            'Industrial': 'ind',
            'Residential': 'res'}}

I worked out one way to do it, but it seems clunky. I'm hoping there's something more elegant out there.
df['label'] = pd.DataFrame([df[column_name].apply(lambda x: labels_dict[column_name][x])
                            for column_name in df.columns]).apply('_'.join)

I would say this is a bit more elegant:
df.apply(lambda x: '_'.join([str(labels_dict[col][v]) for col, v in zip(df.columns, x)]), axis=1)
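If you want apply itself to hand you the column name, another route is a column-wise apply (the default axis=0): each column then arrives as a Series whose .name attribute is the column name. A minimal sketch along those lines; it assumes every cell value appears as a key in the matching sub-dictionary (in particular, labels_dict['HP'] has float keys, so the HP column would need to hold numbers rather than the strings shown above):
mapped = df[['Application', 'HP', 'Sector']].apply(
    lambda col: col.map(labels_dict[col.name]).astype(str))
df['label'] = mapped.apply('_'.join, axis=1)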

Related

How to group two columns and multiply other two into a new column in Pandas?

I'm importing the following .xlsx file into a dataframe.
dfMenu = pd.read_excel("/Users/FoodTrucks.xlsx")
Price  Quantity  FoodTruck  FoodTruck_ID
 3.00        10  Burgers               1
 1.20        50  Tacos                 2
 0.60        30  Tacos                 2
 1.12        40  Drinks                4
 2.00        20  Burgers               1
My goal is to show the total revenue for each food truck with its ID and name in a new column, called "Revenue".
I am currently trying to use the code below, but I'm struggling with the multiplication of the columns "Price" and "Quantity" into a new one and grouping "FoodTruck" and "FoodTruck_ID" in an elegant way.
df = df.groupby((['FoodTruck', 'FoodTruck_ID'])(df['Revenue'] = df['Price'] * q9['Quantity']))
I am getting a syntax error "SyntaxError: cannot assign to subscript here. Maybe you meant '==' instead of '='?"
What would be the most elegant way to solve it?
It will be easier to first calculate Price*Quantity before doing the groupby:
import pandas as pd
df = pd.DataFrame({
    'Price': [3.0, 1.2, 0.6, 1.12, 2.0],
    'Quantity': [10, 50, 30, 40, 20],
    'FoodTruck': ['Burgers', 'Tacos', 'Tacos', 'Drinks', 'Burgers'],
    'FoodTruck_ID': [1, 2, 2, 4, 1]
})
df['Revenue'] = df['Price']*df['Quantity']
df.groupby(['FoodTruck','FoodTruck_ID'])['Revenue'].sum()
Output
FoodTruck  FoodTruck_ID
Burgers    1               70.0
Drinks     4               44.8
Tacos      2               78.0
Name: Revenue, dtype: float64
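If you'd rather have "Revenue" as a regular column next to the truck name and ID, as the question describes, a small variant with as_index=False (or an equivalent .reset_index()) does it:
df.groupby(['FoodTruck', 'FoodTruck_ID'], as_index=False)['Revenue'].sum()

  FoodTruck  FoodTruck_ID  Revenue
0   Burgers             1     70.0
1    Drinks             4     44.8
2     Tacos             2     78.0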

Sum values in one dataframe based on date range in a second dataframe

I have two dataframes (simplified examples below). One contains a series of dates and values (df1), the second contains a date range (df2). I would like to identify/select/mask the date range from df2 in df1, sum the associated df1 values and add them to a new column in df2.
I'm a novice and all the techniques I have tried have been unsuccessful: a combination of wrong methods, incompatible method combinations, syntax errors and so on. I have searched the Q&As here, but none quite address this issue.
import pandas as pd
#********** df1: dates and values ***********
rng = pd.date_range('2012-02-24', periods=12, freq='D')
df1 = pd.DataFrame({
    'STATCON': ['C00028', 'C00489', 'C00038', 'C00589', 'C10028', 'C00499',
                'C00238', 'C00729', 'C10044', 'C00299', 'C00288', 'C00771'],
    'Date': rng,
    'Val': [0.96, 0.57, 0.39, 0.17, 0.93, 0.86, 0.54, 0.58, 0.43, 0.19, 0.40, 0.32]
})
#********** df2: date range ***********
df2 = pd.DataFrame({
    'BCON': ['B002', 'B004', 'B005'],
    'Start': ['2012-02-25', '2012-02-28', '2012-03-01'],
    'End': ['2012-02-29', '2012-03-04', '2012-03-06']
})
df2[['Start','End']] = df2[['Start','End']].apply(pd.to_datetime)
#********** Desired Output: df2 -- date range with summed values ***********
df3 = pd.DataFrame({
    'BCON': ['B002', 'B004', 'B005'],
    'Start': ['2012-02-25', '2012-02-28', '2012-03-01'],
    'End': ['2012-02-29', '2012-03-04', '2012-03-06'],
    'Sum_Val': [2.92, 3.53, 2.46]
})
You can solve this with the DataFrame.apply function as follows:
def to_value(row):
    return df1[(row['Start'] <= df1['Date']) & (df1['Date'] <= row['End'])]['Val'].sum()
df3 = df2.copy()
df3['Sum_Val'] = df3.apply(to_value, axis=1)
The to_value function is called on every row of the df3 dataframe.
See here for a live implementation of the solution: https://1000words-hq.com/n/TcYN1Fz6Izp
One option is conditional_join from pyjanitor; it avoids materializing a full cross join, which can be memory-hungry, depending on the data size:
# pip install pyjanitor
import pandas as pd
import janitor  # registers conditional_join on DataFrame

df2 = df2.astype({'Start': 'datetime64[ns]', 'End': 'datetime64[ns]'})
(df1
.conditional_join(
df2,
('Date', 'Start', '>='),
('Date', 'End', '<='))
.loc[:, ['BCON', 'Start', 'End', 'Val']]
.groupby(['BCON', 'Start', 'End'], as_index = False)
.agg(sum_val = ('Val', 'sum'))
)
BCON Start End sum_val
0 B002 2012-02-25 2012-02-29 2.92
1 B004 2012-02-28 2012-03-04 3.53
2 B005 2012-03-01 2012-03-06 2.46
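For comparison, a plain-pandas sketch of the same inequality join that needs neither apply nor pyjanitor; it builds the full cross product first, so it only suits small frames (how='cross' requires pandas 1.2+):
m = df1.merge(df2, how='cross')
m = m[(m['Date'] >= m['Start']) & (m['Date'] <= m['End'])]
df3 = (m.groupby(['BCON', 'Start', 'End'], as_index=False)['Val']
        .sum()
        .rename(columns={'Val': 'Sum_Val'}))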

Create new dataset with columns of specific strings with daily average

I'm trying to figure out an efficient way to handle the following on a big dataset: the data contains multiple rows per day, with codes (strings) and ratings as columns. I want to create a new dataset with one column per string in the list strings = ['239', '345', '346'], holding the mean rating on each day for rows whose code contains that string, so that I get a time series of means for the specified numbers.
This would be a simple example dataset:
df1 = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-01', '2021-01-01', '2021-01-02',
             '2021-01-02', '2021-01-02', '2021-01-02', '2021-01-03'],
    'Code': ['P:346 K,329 28', 'N2:345 P239', 'P:346 K2', 'E32 345',
             'Q2_325', 'P;235 K345', '2W345', 'Pq-245 3460239'],
    'Ratings': [9.0, 8.0, 5.0, 3.0, 2, 3, 6, 5]})
I'm trying to achieve something like the table below, but so far I haven't been able to do it efficiently.
strings = ['239', '345', '346']
df2 = pd.DataFrame({
    'Date': ['2021-01-01', '2021-01-02', '2021-01-03'],
    '239': [8.5, 'NA', '5'],
    '345': [8, 4, 'NA'],
    '346': [7, 'NA', 5]})
Thank you very much for your help:)
IIUC you can extract the strings in the code column and then pivot:
print(df1.assign(Code=df1["Code"].str.extractall(f"({'|'.join(strings)})")
                       .groupby(level=0).agg(tuple))
         .explode("Code")
         .pivot_table(index="Date", columns="Code", values="Ratings", aggfunc="mean"))
Code         239  345  346
Date
2021-01-01   8.5  8.0  7.0
2021-01-02   NaN  4.0  NaN
2021-01-03   5.0  NaN  5.0
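The same idea split into steps, as a sketch, in case the one-liner is hard to follow. extractall returns one row per regex match (so 'Pq-245 3460239' contributes rows for both '346' and '239'), and the join duplicates each original row once per match:
pattern = f"({'|'.join(strings)})"                # -> "(239|345|346)"
matches = df1['Code'].str.extractall(pattern)[0]  # MultiIndex: (original row, match number)
exploded = df1[['Date', 'Ratings']].join(matches.droplevel('match').rename('Code'))
exploded.pivot_table(index='Date', columns='Code', values='Ratings', aggfunc='mean')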

function in groupby pandas

I would like to calculate a mean value of "bonus" grouped by the column "first_name", but the denominator is not the number of cases, because not all cases have a weight of 1; some may have a weight of 0.5.
For instance, in Jason's case the value I want is the sum of his bonuses divided by 2.5.
Since in real life I have to group by several columns, like area, etc, I would like to adapt a groupby to this situation.
Here is my attempt, but it gives me the ordinary mean:
raw_data = {'area': [1, 2, 3, 3, 4],
            'first_name': ['Jason', 'Jason', 'Jason', 'Jake', 'Jake'],
            'bonus': [10, 20, 10, 30, 20],
            'weight': [1, 1, 0.5, 0.5, 1]}
df = pd.DataFrame(raw_data, columns=['area', 'first_name', 'bonus', 'weight'])
df
Use:
(df.groupby('first_name')[['bonus', 'weight']].sum()
#.add_prefix('sum_') # you could also want it
.assign(result = lambda x: x['bonus'].div(x['weight'])))
or
(df[['first_name', 'bonus', 'weight']].groupby('first_name').sum()
#.add_prefix('sum_')
.assign(result = lambda x: x['bonus'].div(x['weight'])))
Output
            bonus  weight     result
first_name
Jake           50     1.5  33.333333
Jason          40     2.5  16.000000
One way is to use groupby().apply and np.average:
import numpy as np

df.groupby('first_name').apply(lambda x: np.average(x.bonus, weights=x.weight))
Output:
first_name
Jake 23.333333
Jason 14.000000
dtype: float64
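Note that the two answers compute different quantities. The first returns sum(bonus) / sum(weight), which is what the question asks for; np.average returns the weighted mean sum(bonus * weight) / sum(weight). Jason's numbers make the difference visible:
sum([10, 20, 10]) / 2.5           # 16.0 -> sum(bonus) / sum(weight)
(10*1 + 20*1 + 10*0.5) / 2.5      # 14.0 -> np.average(bonus, weights=weight)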

Complex pivot and resample

I'm not sure where to start with this so apologies for my lack of an attempt.
This is the initial shape of my data:
df = pd.DataFrame({
    'Year-Mth': ['1900-01', '1901-02', '1903-02', '1903-03',
                 '1903-04', '1911-08', '1911-09'],
    'Category': ['A', 'A', 'B', 'B', 'B', 'B', 'B'],
    'SubCategory': ['X', 'Y', 'Y', 'Y', 'Z', 'Q', 'Y'],
    'counter': [1, 1, 1, 1, 1, 1, 1]
})
df
This is the result I'd like to get to - the 'Year-Mth' in the desired output has been resampled into multi-year buckets:
If possible I'd like to do this via a process that makes 'Year-Mth' resamplable - so I can easily switch to different buckets.
Here's my attempt:
df['Year'] = pd.cut(df['Year-Mth'].str[:4].astype(int),
bins=np.arange(1900, 1920, 5), right=False)
df.pivot_table(index=['SubCategory', 'Year'], columns='Category',
values='counter', aggfunc='sum').dropna(how='all').fillna(0)
Out:
Category                    A    B
SubCategory Year
Q           [1910, 1915)  0.0  1.0
X           [1900, 1905)  1.0  0.0
Y           [1900, 1905)  1.0  2.0
            [1910, 1915)  0.0  1.0
Z           [1900, 1905)  0.0  1.0
The year column is not parameterized, as pandas (or numpy) does not offer a cut option with a step size, as far as I know. But I think it can be done with a little arithmetic on the minimum/maximum. Something like:
df['Year'] = pd.to_datetime(df['Year-Mth']).dt.year
df['Year'] = pd.cut(df['Year'], bins=np.arange(df['Year'].min(),
df['Year'].max() + 5, 5), right=False)
This wouldn't create nice bins like Excel does, though.
cols = [df.SubCategory, pd.to_datetime(df['Year-Mth']), df.Category]
df1 = df.set_index(cols).counter
df1.unstack('Year-Mth').T.resample('60M').sum().stack(0).swaplevel(0, 1).sort_index().fillna('')
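If the goal is a bucket size you can switch with a single parameter, a sketch using pd.Grouper on a real datetime column may be the simplest route (assuming a recent pandas; older releases spell the 5-year-start alias '5AS' instead of '5YS'):
df['Date'] = pd.to_datetime(df['Year-Mth'])
(df.groupby(['SubCategory', pd.Grouper(key='Date', freq='5YS'), 'Category'])['counter']
   .sum()
   .unstack('Category', fill_value=0))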
