Creating a new feature column on grouped data in a Pandas dataframe - python

I have a Pandas dataframe with the columns ['week', 'price_per_unit', 'total_units']. I wish to create a new column called 'weighted_price' as follows: first group by 'week' and then for each week calculate price_per_unit * total_units / sum(total_units) for that week. I have code that does this:
import pandas as pd
import numpy as np
def create_features_by_group(df):
    # first group the data by week
    grouped = df.groupby(['week'])
    df_temp = pd.DataFrame(columns=['weighted_price'])
    # run through the groups and create the weighted_price per group
    for name, group in grouped:
        res = (group['total_units'] * group['price_per_unit']) / np.sum(group['total_units'])
        for idx in res.index:
            df_temp.loc[idx] = [res[idx]]
    df = df.join(df_temp['weighted_price'])
    return df
The only problem is that this is very, very slow. Is there some faster way to do this?
I used the following code to test the function.
import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['week', 'price_per_unit', 'total_units'])
for i in range(10):
    df.loc[i] = [round(int(i % 3), 0), 10 * np.random.rand(), round(10 * np.random.rand(), 0)]

I think you need to do it this way:
df
price total_units week
0 5 100 1
1 7 200 1
2 9 150 2
3 11 250 2
4 13 125 2
def fun(table):
    table['measure'] = table['price'] * (table['total_units'] / table['total_units'].sum())
    return table
df.groupby('week').apply(fun)
price total_units week measure
0 5 100 1 1.666667
1 7 200 1 4.666667
2 9 150 2 2.571429
3 11 250 2 5.238095
4 13 125 2 3.095238
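If groupby().apply is still too slow on a large frame, a fully vectorized alternative is to broadcast each week's total back onto the rows with transform and compute the whole column in one expression. This is a sketch using the question's original column names, not part of the answer above:
import pandas as pd
import numpy as np
# toy data with the question's column names
df = pd.DataFrame({
    'week': [1, 1, 2, 2, 2],
    'price_per_unit': [5, 7, 9, 11, 13],
    'total_units': [100, 200, 150, 250, 125],
})
# transform('sum') returns a Series aligned with df, holding each row's weekly total
week_totals = df.groupby('week')['total_units'].transform('sum')
df['weighted_price'] = df['price_per_unit'] * df['total_units'] / week_totals
print(df)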

I grouped the dataset by 'Week' to calculate the weighted price for each week, then merged the original dataset with the grouped dataset to get the result:
# importing the libraries
import pandas as pd
import numpy as np
# creating the dataset
df = {
    'Week': [1, 1, 1, 1, 2, 2],
    'price_per_unit': [10, 11, 22, 12, 12, 45],
    'total_units': [10, 10, 10, 10, 10, 10]
}
df = pd.DataFrame(df)
df['price'] = df['price_per_unit'] * df['total_units']
# calculate the total sales and total number of units sold in each week
df_grouped_week = df.groupby(by = 'Week').agg({'price' : 'sum', 'total_units' : 'sum'}).reset_index()
# calculate the weighted price
df_grouped_week['wt_price'] = df_grouped_week['price'] / df_grouped_week['total_units']
# merging df and df_grouped_week
df_final = pd.merge(df, df_grouped_week[['Week', 'wt_price']], how = 'left', on = 'Week')
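With the sample data above, df_final should come out roughly as follows (wt_price is the per-week weighted average price: 550/40 for Week 1 and 570/20 for Week 2):
   Week  price_per_unit  total_units  price  wt_price
0     1              10           10    100     13.75
1     1              11           10    110     13.75
2     1              22           10    220     13.75
3     1              12           10    120     13.75
4     2              12           10    120     28.50
5     2              45           10    450     28.50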

Related

Given windows of start and stop times for each object, how can I count how many objects were on for each second?

I have the following pandas dataframe:
import pandas as pd
TurnedOn = pd.Series([1000.4,1200.5,1550.1,500.3])
TurnedOff = pd.Series([1400.2,1600.8,1570.3,74500.6])
df = pd.DataFrame(data=[TurnedOn,TurnedOff]).T
df.index = ['OBJ1','OBJ2','OBJ3','OBJ4']
I want to get a time based count in seconds throughout the day of how many lights were on at a 0.1 second sampling rate.
I've tried doing this by making a large dataframe from 0 to 864000 (seconds per day times 10), and setting each object true for each 0.1 second in the time window of between Turned on and Turned off and then counting them, but this is horribly inefficient for large dataframes.
Is there something in python that I can use to count how many lights are on each second?
For instance, the output would be:
500.3-1000.4: 1 light
1000.4-1200.5: 2 lights
1200.5 - 1400.2: 3 lights
1400.2-1550.1: 2 lights
1550.1-1570.3: 3 lights
1570.3-1600.8: 2 lights
1600.8-74500.6: 1 light
With the following toy dataframe:
import pandas as pd
TurnedOn = pd.Series([1000.4, 1200.5, 1550.1, 500.3])
TurnedOff = pd.Series([1400.2, 1600.8, 1570.3, 74500.6])
df = pd.DataFrame(data=[TurnedOn, TurnedOff]).T
df.columns = ["TurnedOn", "TurnedOff"]
print(df)
# Output
TurnedOn TurnedOff
0 1000.4 1400.2
1 1200.5 1600.8
2 1550.1 1570.3
3 500.3 74500.6
Here is one way to do it with Pandas unstack and cumsum:
# Prep data
df = (
    df.unstack()
    .reset_index()
    .drop(columns="level_1")
    .rename(columns={"level_0": "status", 0: "start"})
)
df = df.sort_values(by="start", ignore_index=True)
df["end"] = df["start"].shift(-1)
# Count how many lights are simultaneously on
df["num_lights_on"] = df.apply(lambda x: 1 if x["status"] == "TurnedOn" else -1, axis=1)
df["num_lights_on"] = df["num_lights_on"].cumsum()
# Cleanup
df = df.reindex(["start", "end", "num_lights_on"], axis=1).dropna()
Then:
print(df)
# Output
start end num_lights_on
0 500.3 1000.4 1
1 1000.4 1200.5 2
2 1200.5 1400.2 3
3 1400.2 1550.1 2
4 1550.1 1570.3 3
5 1570.3 1600.8 2
6 1600.8 74500.6 1
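As a side note (not part of the answer above), the row-wise apply can be replaced with a vectorized map, which is noticeably faster on large frames:
# map each event type to +1/-1, then take the running total of lights that are on
df["num_lights_on"] = df["status"].map({"TurnedOn": 1, "TurnedOff": -1}).cumsum()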

Split a dataframe based on values in a column

I want to split a dataframe into quartiles of a specific column.
I have data from 800 companies. One column contains a specific score which ranges from 0 to 100.
I want to split the dataframe in 4 groups (quartiles) with same size (Q1 to Q4, Q4 should contain the companies with the highest scores). So each group should contain 200 companies. How can I divide the companies into 4 equal sized groups according to their score of a specific column (here the last column "ESG Combined Score 2011")? I want to extract the groups to separate sheets in excel (Q1 in a sheet named Q1, Q2 in a sheet named Q2 and so on).
Here is an extract of the data:
df1
Company Common Name Company Market Capitalization ESG Combined Score 2011
0 SSR Mining Inc 3.129135e+09 32.817325
1 Fluor Corp 3.958424e+09 69.467729
2 CBRE Group Inc 2.229251e+10 59.632423
3 Assurant Inc 8.078239e+09 46.492803
4 CME Group Inc 6.269954e+10 42.469682
5 Peabody Energy Corp 3.842130e+09 73.374671
And as an additional question: How can I turn off the scientific notation of the column in the middle? I want it to display with separators.
Thanks for your help
Suppose your dataframe is already sorted by the relevant column.
import numpy as np
import pandas as pd
writer = pd.ExcelWriter('splited_df.xlsx', engine='xlsxwriter')
# decide how many parts to divide the dataframe into
n_groups = 4
# build the list of slicing boundaries
separator = list(map(int, np.linspace(0, len(df1), n_groups + 1)))
for idx in range(len(separator)):
    if idx >= len(separator) - 2:
        df1.iloc[separator[idx]:, :].to_excel(writer, sheet_name=f'Sheet{idx+1}')
        break
    df1.iloc[separator[idx]:separator[idx+1], :].to_excel(writer, sheet_name=f'Sheet{idx+1}')
writer.close()
To suppress the scientific notation, you can refer to this Stack Overflow post, or use a display option like the one sketched below.
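One common option (a general pandas display setting, not taken from the linked post) is to set a float format with thousands separators:
# display floats with thousands separators and two decimals instead of scientific notation
pd.options.display.float_format = '{:,.2f}'.format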
I hope this answers your question.
You will need to sort the dataframe and partition it based on indices corresponding to quantiles:
def partition_quantiles(df, by, quantiles):
    num_samples = len(df)
    df = df.sort_values(by)
    q_idxs = [0, *(int(num_samples * q) for q in quantiles), num_samples + 1]
    for q_start, q_end in zip(q_idxs[:-1], q_idxs[1:]):
        yield df.iloc[q_start:q_end]
It will work as follows:
from random import choices
from string import ascii_letters
import numpy as np
import pandas as pd
num_rows = 12
companies = ["".join(choices(ascii_letters, k=10)) for _ in range(num_rows)]
capitalizations = np.random.rand(num_rows) * 1e6
scores = np.random.rand(num_rows) * 1e2
df = pd.DataFrame(
    {
        "company": companies,
        "capital": capitalizations,
        "score": scores,
    }
)
for partition in partition_quantiles(df, "score", [0.25, 0.5, 0.75]):
    print("-" * 40)
    print(partition)
which prints:
----------------------------------------
company capital score
7 QVdnUUiaSV 599523.318607 0.506453
2 CahcnFEMlB 247175.132381 11.201345
10 OpvllkCfWp 203289.934774 36.328395
----------------------------------------
company capital score
6 YzqHvWewOC 774025.801826 49.618631
4 taDrQHvHoB 354491.773921 60.153841
11 JrZmmTvwyD 248947.408524 62.414680
----------------------------------------
company capital score
8 nvkomHSjtP 139345.993291 63.949291
9 soigFZMVjo 666688.879067 64.449568
5 LQSInRRnZd 691896.831968 85.375991
----------------------------------------
company capital score
0 wNMoypFeXN 12712.591339 85.638396
1 XNDqUMDrTb 858545.389446 92.531258
3 okUNZChvsJ 697386.417437 95.398392
You can use numpy's array_split for that:
import numpy as np
dfs = np.array_split(df.sort_values(by=['ESG Combined Score 2011']), 4)
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for index, df in enumerate(dfs):
    df.to_excel(writer, sheet_name=f'Sheet{index+1}')
writer.close()
You can use pandas.qcut
Overall solution (edited, to use partial of #RJ Adriaansen solution):
df['categories'] = pd.qcut(df['score'], 4, retbins=True, labels=['low', 'low-mid', 'mid-high', 'high'])[0]
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for i, category in enumerate(pd.Categorical(df['categories']).categories):
    df[df['categories'] == category].to_excel(writer, sheet_name=f'SHEET_NAME_{i+1}')
writer.close()
Input:
df = pd.DataFrame({'company': ['A', 'B', 'C', 'D'], 'score': [1, 2, 3, 4]})
company score
0 A 1
1 B 2
2 C 3
3 D 4
Script:
df['categories'] = pd.qcut(df['score'], 4, retbins=True, labels=['low', 'low-mid', 'mid-high', 'high'])[0]
Output:
company score categories
0 A 1 low
1 B 2 low-mid
2 C 3 mid-high
3 D 4 high
Then separate to different Excel sheets (edited, to use partial of #RJ Adriaansen solution):
writer = pd.ExcelWriter('dataframe.xlsx', engine='xlsxwriter')
for i, category in enumerate(pd.Categorical(df['categories']).categories):
    df[df['categories'] == category].to_excel(writer, sheet_name=f'SHEET_NAME_{i+1}')
writer.close()
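Since the question asks specifically for sheets named Q1 to Q4, a compact variant (a sketch combining the ideas above, assuming df1 holds the full data and 'ESG Combined Score 2011' is the score column) could be:
import pandas as pd
# label each company with its score quartile; Q1 = lowest scores, Q4 = highest
df1['quartile'] = pd.qcut(df1['ESG Combined Score 2011'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
# write each quartile to its own sheet; the with-block closes (and saves) the writer
with pd.ExcelWriter('quartiles.xlsx', engine='xlsxwriter') as writer:
    for name, group in df1.groupby('quartile'):
        group.to_excel(writer, sheet_name=str(name))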

How to create a dataframe that only selects rows that have a value more or less than avg +/- standard deviation in Pandas?

This is the code that I have so far, but I am not sure if I should just join the dataframes after computing the standard deviation and average. Where I am currently stuck is selecting the rows whose values are more or less than avg +/- 1 std. I do not know how I would iterate through each column to do this. I thought about a for loop, but wasn't exactly sure how to go about it.
import pandas_datareader.data as web
import datetime as date
fromDate ="2014-01-02"
toDate = "2016-01-02"
dfSixMo = web.DataReader('DGS6MO','fred',fromDate,toDate)
dfOneYear = web.DataReader('DGS1','fred',fromDate,toDate)
dfFiveYear = web.DataReader('DGS5','fred',fromDate,toDate)
dfTenYear = web.DataReader('DGS10','fred',fromDate,toDate)
dfJoin1 = dfSixMo.join(dfOneYear,how = 'inner')
dfJoin2 = dfFiveYear.join(dfTenYear,how='inner')
dfFinal = dfJoin1.join(dfJoin2,how='inner')
print(dfFinal)
mean = dfFinal.mean()
print('\nMean:')
print(mean)
StDev = dfFinal.std()
print('\n Standard Deviation:')
print(StDev)
IIUC this is what you want:
# setup
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0, 10, (3, 3)), columns=list('abc'))
#    a  b  c
# 0  3  2  8
# 1  0  6  7
# 2  8  3  9
mean = df.mean()
std = df.std()
df[((mean-std < df) & (df< mean+std)).all(1)]
# a b c
#0 3 2 8
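If the goal is instead the rows that fall outside the mean +/- std band (the question's wording could be read either way), the same mask can simply be inverted; a minimal sketch:
# keep rows where at least one column lies outside mean +/- std
outliers = df[~((mean - std < df) & (df < mean + std)).all(1)]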

Populate pandas dataframe using column and row indices as variables

Overview
How do you populate a pandas dataframe using math which uses column and row indices as variables.
Setup
import pandas as pd
import numpy as np
df = pd.DataFrame(index = range(5), columns = ['Combo_Class0', 'Combo_Class1', 'Combo_Class2', 'Combo_Class3', 'Combo_Class4'])
Objective
Each cell in df = row index * (column index + 2)
Attempt 1
You can use this solution to produce the following code:
row = 0
for i in range(5):
    row = row + 1
    df.loc[i] = [row * (1 + 2), row * (2 + 2), row * (3 + 2), row * (4 + 2), row * (5 + 2)]
Attempt 2
This solution seemed relevant as well, although I believe I've read you're not supposed to loop through dataframes. Besides, I'm not seeing how to loop through rows and columns:
for i, j in df.iterrows():
    df.loc[i] = i
You can leverage broadcasting for a more efficient approach:
ix = (df.index + 1).to_numpy()  # use .values for pandas < 0.24
df[:] = ix[:, None] * (ix + 2)
print(df)
Combo_Class0 Combo_Class1 Combo_Class2 Combo_Class3 Combo_Class4
0 3 4 5 6 7
1 6 8 10 12 14
2 9 12 15 18 21
3 12 16 20 24 28
4 15 20 25 30 35
Using np.multiply.outer:
df[:] = np.multiply.outer(np.arange(5) + 1, np.arange(5) + 3)
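A shape-agnostic variant of the same idea (a sketch, not from the original answers) derives the two ranges from the frame itself rather than hard-coding 5:
import numpy as np
rows = np.arange(len(df)) + 1       # 1, 2, ..., number of rows
cols = np.arange(df.shape[1]) + 3   # matches the answers' (index + 1) and (index + 3) factors
df[:] = np.multiply.outer(rows, cols)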

How do I convert a row from a pandas DataFrame from a Series back to a DataFrame?

I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange
N=3
x = DataFrame.from_dict({'farm': ['A', 'B', 'A', 'B'],
                         'fruit': ['apple', 'apple', 'pear', 'pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row] * N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work as intended. How do I convert it to a DataFrame? (e.g. the difference between x.ix[0:0] and x.ix[0])
Thanks!
Given what you commented, I would try
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate results dataframe. I have assumed that every farm-fruit combination is unique... there might be other ways if we knew more about your data.
Update
Running code example
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
low high farm fruit
0 0 1 a 0
1 2 3 a 1
2 4 5 a 2
3 6 7 a 3
results
farm fruit
a 0 [0.176124290969, 0.459726835079, 0.999564934689]
1 [2.42920143009, 2.37484506501, 2.41474002256]
2 [4.78918572452, 4.25916442343, 4.77440617104]
3 [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to
def giveMeSomeRows(group):
    return pd.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
0
farm fruit
a 0 0 0.281088
1 0.020348
2 0.986269
1 0 2.642676
1 2.194996
2 2.650600
2 0 4.545718
1 4.486054
2 4.027336
3 0 6.550892
1 6.363941
2 6.702316
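On the literal question of turning a single row (a Series) back into a one-row DataFrame, the usual idiom (not part of the answer above) is to_frame() followed by a transpose:
# row is a Series; to_frame() gives a one-column DataFrame and .T flips it into a single row
row_df = row.to_frame().T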
