I have a data frame like this:
data = {'id': ['id_01, id_02',
'id_03',
'id_04',
'id_05',
'id_06, id_07, id_08'],
'price': [100, 200, 300, 400, 500]}
df = pd.DataFrame(data)
df
output:
I did this to split each id and analyse each one as a single row:
new_df = df.assign(new_id=df.id.str.split(",")).explode('new_id')
new_df
output:
So far, so good =)
Now I'd like to reach the result below, where each price is divided by the number of id items in its row, like this:
What is the simplest way for a beginner to get this result?
Calculate this before you explode, while you can still access the number of id items per row via .str.len():
(df.assign(new_id=df.id.str.split(","))
.assign(new_price=lambda df: df.price / df.new_id.str.len())
.explode('new_id'))
id price new_id new_price
0 id_01, id_02 100 id_01 50.000000
0 id_01, id_02 100 id_02 50.000000
1 id_03 200 id_03 200.000000
2 id_04 300 id_04 300.000000
3 id_05 400 id_05 400.000000
4 id_06, id_07, id_08 500 id_06 166.666667
4 id_06, id_07, id_08 500 id_07 166.666667
4 id_06, id_07, id_08 500 id_08 166.666667
Another way, in a single statement: str.split into a new column, take the len of each resulting list and divide the price by it to get the average, all done dynamically. Faster than using a lambda. Note that the trailing .astype(int) truncates the averages to integers, which is why 500 / 3 shows as 166 below.
new_df = (df.assign(new_id=df.id.str.split(","),                           # new column with the split ids
                    price=df['price'].div(df.id.str.split(",").str.len())  # divide price by the number of ids
                                     .astype(int))                         # truncate the average to int
            .explode('new_id'))                                            # explode to expand the df
output:
id price new_id
0 id_01, id_02 50 id_01
0 id_01, id_02 50 id_02
1 id_03 200 id_03
2 id_04 300 id_04
3 id_05 400 id_05
4 id_06, id_07, id_08 166 id_06
4 id_06, id_07, id_08 166 id_07
4 id_06, id_07, id_08 166 id_08
Probably not the best solution, but you can use the fact that exploded rows have the same index:
new_df['new_price'] = new_df['price']/new_df.groupby(new_df.index).transform('count')['id']
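For completeness, a small sketch of how that fits together with the explode from the question; since the exploded rows keep their original index, grouping by the index counts how many ids each original row produced (values shown in the comment are approximate for the last three rows):
import pandas as pd

df = pd.DataFrame({'id': ['id_01, id_02', 'id_03', 'id_04', 'id_05', 'id_06, id_07, id_08'],
                   'price': [100, 200, 300, 400, 500]})

# explode first, then divide each price by how many rows share the same original index
new_df = df.assign(new_id=df.id.str.split(",")).explode('new_id')
new_df['new_price'] = new_df['price'] / new_df.groupby(new_df.index).transform('count')['id']
# new_price: 50.0, 50.0, 200.0, 300.0, 400.0, 166.67, 166.67, 166.67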
I have two dataframes. As a sample, please see below.
How can I fill the rows where df['GrossRate'] == 0 with the value from dfB that has the same ProductID?
Basically, my GrossRate in df should be:
150
40
238
32
dataA = {'date': ['20210101','20210102','20210103','20210104'],
'quanitity': [22000,25000,27000,35000],
'NetRate': ['nan','nan','nan','nan'],
'GrossRate': [150,0,238,0],
'ProductID': [9613,7974,1714,5302],
}
df = pd.DataFrame(dataA, columns = ['date', 'quanitity', 'NetRate', 'GrossRate','ProductID' ])
date quanitity NetRate GrossRate ProductID
0 20210101 22000 nan 150 9613
1 20210102 25000 nan 0 7974
2 20210103 27000 nan 238 1714
3 20210104 35000 nan 0 5302
dataB = {
'ProductID': ['9613.T','7974.T','1714.T','5302.T'],
'GrossRate': [10,40,28,32],
}
dfB = pd.DataFrame(dataB, columns = ['ProductID', 'GrossRate' ])
dfB.ProductID = dfB.ProductID.str.replace('.T','')
print (dfB)
ProductID GrossRate
0 9613 10
1 7974 40
2 1714 28
3 5302 32
Try this list comprehension:
df['GrossRate'] = [x if x != 0 else y for x, y in zip(df['GrossRate'], dfB['GrossRate'])]
If both DataFrames have the same number of rows and the same order in the ProductID column, matching by ProductID is not necessary, so you can use numpy.where:
import numpy as np

df['GrossRate'] = np.where(df['GrossRate'] == 0, dfB['GrossRate'], df['GrossRate'])
print (df)
date quanitity NetRate GrossRate ProductID
0 20210101 22000 nan 150 9613
1 20210102 25000 nan 40 7974
2 20210103 27000 nan 238 1714
3 20210104 35000 nan 32 5302
If you need matching by ProductID, use:
dfB.ProductID = dfB.ProductID.str.replace('.T','').astype(int)
df['GrossRate'] = (np.where(df['GrossRate'] == 0,
df['ProductID'].map(dfB.set_index('ProductID')['GrossRate']),
df['GrossRate']))
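To see why the map-based version is the safer one, here is a small hedged check (dfB_shuffled is just an illustrative name, and this assumes the .astype(int) conversion of dfB.ProductID above has already been applied and that df still has its original zeros): even if dfB is shuffled so its rows no longer line up with df, the fill still lands on the right products.
import numpy as np
import pandas as pd

# shuffle dfB so its rows are no longer aligned with df by position
dfB_shuffled = dfB.sample(frac=1, random_state=0)

df['GrossRate'] = np.where(df['GrossRate'] == 0,
                           df['ProductID'].map(dfB_shuffled.set_index('ProductID')['GrossRate']),
                           df['GrossRate'])
# GrossRate is 150, 40, 238, 32 -- matching is done by ProductID, not by row position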
I have two dataframes as follows
transactions
buy_date buy_price
0 2018-04-16 33.23
1 2018-05-09 33.51
2 2018-07-03 32.74
3 2018-08-02 33.68
4 2019-04-03 33.58
and
cii
from_fy to_fy score
0 2001-04-01 2002-03-31 100
1 2002-04-01 2003-03-31 105
2 2003-04-01 2004-03-31 109
3 2004-04-01 2005-03-31 113
4 2005-04-01 2006-03-31 117
In the transactions dataframe I need to create a new columns cii_score based on the following condition
if transactions['buy_date'] is between cii['from_fy'] and cii['to_fy'] take the cii['score'] value for transactions['cii_score']
I have tried a list comprehension, but it did not work.
I would appreciate your inputs on how to tackle this.
First, we set up your DataFrames. Note that I modified the dates in transactions in this short example to make it more interesting.
import numpy as np
import pandas as pd
from io import StringIO
trans_data = StringIO(
"""
,buy_date,buy_price
0,2001-04-16,33.23
1,2001-05-09,33.51
2,2002-07-03,32.74
3,2003-08-02,33.68
4,2003-04-03,33.58
"""
)
cii_data = StringIO(
"""
,from_fy,to_fy,score
0,2001-04-01,2002-03-31,100
1,2002-04-01,2003-03-31,105
2,2003-04-01,2004-03-31,109
3,2004-04-01,2005-03-31,113
4,2005-04-01,2006-03-31,117
"""
)
tr_df = pd.read_csv(trans_data, index_col = 0)
tr_df['buy_date'] = pd.to_datetime(tr_df['buy_date'])
cii_df = pd.read_csv(cii_data, index_col = 0)
cii_df['from_fy'] = pd.to_datetime(cii_df['from_fy'])
cii_df['to_fy'] = pd.to_datetime(cii_df['to_fy'])
The main step is the following calculation: for each row of tr_df, find the index of the row in cii_df that satisfies the condition. The list comprehension below computes this match; each element of the list is the matching row index of cii_df:
match = [[(f <= d) & (d <= e) for f, e in zip(cii_df['from_fy'], cii_df['to_fy'])].index(True)
         for d in tr_df['buy_date']]
match
produces
[0, 0, 1, 2, 2]
now we can merge on this
tr_df.merge(cii_df, left_on = np.array(match), right_index = True)
so that we get
key_0 buy_date buy_price from_fy to_fy score
0 0 2001-04-16 33.23 2001-04-01 2002-03-31 100
1 0 2001-05-09 33.51 2001-04-01 2002-03-31 100
2 1 2002-07-03 32.74 2002-04-01 2003-03-31 105
3 2 2003-08-02 33.68 2003-04-01 2004-03-31 109
4 2 2003-04-03 33.58 2003-04-01 2004-03-31 109
and the score column is what you asked for
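Note that .index(True) raises a ValueError if some buy_date falls outside every fiscal-year range, so this assumes every date has a match. If you only need the cii_score column on the transactions frame rather than the full merge, a short follow-up sketch using the same match list:
tr_df['cii_score'] = cii_df.loc[match, 'score'].to_numpy()
# cii_score: 100, 100, 105, 109, 109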
I have 2 dataframes. rdf is the reference dataframe: for each ID it defines the intervals (Top and Bottom) over which an average should be calculated, using all of the depths that fall inside each interval. ldf contains the actual values, so the calculation itself has to run on ldf. There are multiple intervals for each ID.
rdf is formatted as such:
ID Top Bottom
1 2010 3000
1 4300 4500
1 4550 5000
1 7100 7700
2 3200 4100
2 4120 4180
2 4300 5300
2 5500 5520
3 2300 2380
3 3200 4500
ldf is formatted as such:
ID Depth(ft) Value1 Value2 Value3
1 2000 45 .32 423
1 2000.5 43 .33 500
1 2001 40 .12 643
1 2001.5 28 .10 20
1 2002 40 .10 34
1 2002.5 23 .11 60
1 2003 34 .08 900
1 2003.5 54 .04 1002
2 2000 40 .28 560
2 2000 38 .25 654
...
3 2000 43 .30 343
I want to use rdf to define the top and bottom of each interval over which to calculate the average of Value1, Value2, and Value3. I would also like a count of the values used (not all of the depths within an interval necessarily exist, so it could be less than Bottom - Top). This will then modify rdf to make a new file:
new_rdf is formatted as such:
ID Top Bottom avgValue1 avgValue2 avgValue3 ThicknessCount(ft)
1 2010 3000 54 .14 456 74
1 4300 4500 23 .18 632 124
1 4550 5000 34 .24 780 111
1 7100 7700 54 .19 932 322
2 3200 4100 52 .32 134 532
2 4120 4180 16 .11 111 32
2 4300 5300 63 .29 872 873
2 5500 5520 33 .27 1111 9
3 2300 2380 63 .13 1442 32
3 3200 4500 37 .14 1839 87
I've been going back and forth on the best way to do this. I tried mimicking this time series example: Sum set of values from pandas dataframe within certain time frame
but it doesn't seem translatable:
import pandas as pd
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']
def get_depths(x):
    n = ldf[(ldf['DEPTH'] > x['top']) & (ldf['DEPTH'] < x['bottom'])]
    return n['ID'].values[0], n['DEPTH'].sum()
test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)
I get "TypeError: Invalid comparison between dtype=float64 and str"
And it works if I use the samples they made in the post, but it doesn't work with my data. I'm also hoping there's a simpler way to do this.
Edit # 2A:
Note:
The sample DataFrames below are not exactly the same as the ones posted in the question.
Posting new code here that uses Top and Bottom from rdf to check DEPTH in ldf and calculate .mean() for each group using a for-loop. A range_key that is unique to each row is created in rdf, assuming that the DataFrame rdf does not have any duplicates.
# Import libraries
import pandas as pd
import numpy as np
# Create DataFrame
rdf = pd.DataFrame({
'ID': [1,1,1,1,2,2,2,2,3,3],
'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
'ID': [1,1,1,1,1,1,1,1,2,2,3],
'DEPTH': [2000,2000.5,2001,2001.5,4002,4002.5,5003,5003.5,2000,2000,2000],
'Value1':[45,43,40,28,40,23,34,54,40,38,43],
'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})
# Create a key for merge later
ldf['range_key'] = np.nan
rdf['range_key'] = np.linspace(1,rdf.shape[0],rdf.shape[0]).astype(int).astype(str)
# Flag each row for a range
for i in range(ldf.shape[0]):
    for j in range(rdf.shape[0]):
        d = ldf['DEPTH'][i]
        if (d >= rdf['Top'][j]) & (d <= rdf['Bottom'][j]):
            rkey = rdf['range_key'][j]
            ldf['range_key'][i] = rkey
            break
ldf['range_key'] = ldf['range_key'].astype(int).astype(str) # Convert to string
# Calculate mean for groups
ldf_mean = ldf.groupby(['ID','range_key']).mean().reset_index()
ldf_mean = ldf_mean.drop(['DEPTH'], axis=1)
# Merge into 'rdf'
new_rdf = rdf.merge(ldf_mean, on=['ID','range_key'], how='left')
new_rdf = new_rdf.drop(['range_key'], axis=1)
new_rdf
Output:
ID Top Bottom Value1 Value2 Value3
0 1 2000 2500 39.0 0.2175 396.5
1 1 4300 4500 NaN NaN NaN
2 1 4500 5000 NaN NaN NaN
3 1 7100 7700 NaN NaN NaN
4 2 3200 4100 NaN NaN NaN
5 2 4120 4180 NaN NaN NaN
6 2 4300 5300 NaN NaN NaN
7 2 5500 5520 NaN NaN NaN
8 3 2300 2380 NaN NaN NaN
9 3 3200 4500 NaN NaN NaN
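The question also asks for a count of the depth rows that fall in each interval (ThicknessCount). A hedged extension of the groupby above, run before range_key is dropped, could add it like this (ldf_count is just an illustrative name):
# count how many DEPTH rows fall into each (ID, range_key) interval
ldf_count = (ldf.groupby(['ID', 'range_key'])['DEPTH']
                .count()
                .reset_index(name='ThicknessCount'))
new_rdf = (rdf.merge(ldf_mean, on=['ID', 'range_key'], how='left')
              .merge(ldf_count, on=['ID', 'range_key'], how='left')
              .drop(['range_key'], axis=1))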
Edit # 1:
Code below seems to work. Added an if-statement to the return of the code posted in the question above. Not sure if this is what you were looking to get. It calculates the .sum(). The first Top/Bottom range in rdf is changed to a lower range so it matches the data in ldf.
# Import libraries
import pandas as pd
# Create DataFrame
rdf = pd.DataFrame({
'ID': [1,1,1,1,2,2,2,2,3,3],
'Top': [2000,4300,4500,7100,3200,4120,4300,5500,2300,3200],
'Bottom':[2500,4500,5000,7700,4100,4180,5300,5520,2380,4500]
})
ldf = pd.DataFrame({
'ID': [1,1,1,1,1,1,1,1,2,2,3],
'DEPTH': [2000,2000.5,2001,2001.5,2002,2002.5,2003,2003.5,2000,2000,2000],
'Value1':[45,43,40,28,40,23,34,54,40,38,43],
'Value2':[.32,.33,.12,.10,.10,.11,.08,.04,.28,.25,.30],
'Value3':[423,500,643,20,34,60,900,1002,560,654,343]
})
##### Code from the question (copy-pasted here)
Top = rdf['Top']
Bottom = rdf['Bottom']
Depths = ldf['DEPTH']
def get_depths(x):
    n = ldf[(ldf['DEPTH'] > x['top']) & (ldf['DEPTH'] < x['bottom'])]
    if (n.shape[0] > 0):
        return n['ID'].values[0], n['DEPTH'].sum()
test = pd.DataFrame({'top':Top, 'bottom':Bottom})
test[['ID','Value1']] = test.apply(lambda x : get_depths(x),1).apply(pd.Series)
Output:
test
top bottom ID Value1
0 2000 2500 1.0 14014.0
1 4300 4500 NaN NaN
2 4500 5000 NaN NaN
3 7100 7700 NaN NaN
4 3200 4100 NaN NaN
5 4120 4180 NaN NaN
6 4300 5300 NaN NaN
7 5500 5520 NaN NaN
8 2300 2380 NaN NaN
9 3200 4500 NaN NaN
Sample data and imports
import pandas as pd
import numpy as np
import random
# dfr
rdata = {'ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3],
'Top': [2010, 4300, 4550, 7100, 3200, 4120, 4300, 5500, 2300, 3200],
'Bottom': [3000, 4500, 5000, 7700, 4100, 4180, 5300, 5520, 2380, 4500]}
dfr = pd.DataFrame(rdata)
# display(dfr.head())
ID Top Bottom
0 1 2010 3000
1 1 4300 4500
2 1 4550 5000
3 1 7100 7700
4 2 3200 4100
# df
np.random.seed(365)
random.seed(365)
rows = 10000
data = {'id': [random.choice([1, 2, 3]) for _ in range(rows)],
'depth': [np.random.randint(2000, 8000) for _ in range(rows)],
'v1': [np.random.randint(40, 50) for _ in range(rows)],
'v2': np.random.rand(rows),
'v3': [np.random.randint(20, 1000) for _ in range(rows)]}
df = pd.DataFrame(data)
df.sort_values(['id', 'depth'], inplace=True)
df.reset_index(drop=True, inplace=True)
# display(df.head())
id depth v1 v2 v3
0 1 2004 48 0.517014 292
1 1 2004 41 0.997347 859
2 1 2006 42 0.278217 851
3 1 2006 49 0.570363 32
4 1 2009 43 0.462985 409
Use each row of dfr to filter and extract stats from df
There are plenty of answers on SO dealing with "TypeError: Invalid comparison between dtype=float64 and str". The numeric columns need to be cleaned of any value that can't be converted to a numeric type.
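For instance, a minimal cleanup sketch (assuming the offending strings live in the asker's ldf['DEPTH'] column; adjust it to whichever column raises the error):
# coerce to numeric; anything that cannot be converted becomes NaN and is dropped
ldf['DEPTH'] = pd.to_numeric(ldf['DEPTH'], errors='coerce')
ldf = ldf.dropna(subset=['DEPTH'])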
This code deals with using one dataframe to filter and return metrics for another dataframe.
For each row in dfr:
Filter df
Aggregate the mean and count for v1, v2 and v3
.T to transpose the mean and count rows to columns
Convert to a numpy array
Slice the array for the 3 means and append the array to v_mean
Slice the array for the max count and append the value to counts
The counts could all be the same if there are no NaNs in the data
Convert the list of arrays, v_mean, to a dataframe and join it to dfr_new
Add counts as a column in dfr_new
v_mean = list()
counts = list()
for idx, (i, t, b) in dfr.iterrows():  # iterate through each row of dfr
    # apply the ID and depth filters, then aggregate mean and count for v1, v2 and v3
    data = (df[['v1', 'v2', 'v3']][(df.id == i) & (df.depth >= t) & (df.depth <= b)]
            .agg(['mean', 'count']).T.to_numpy())
    v_mean.append(data[:, 0])        # the 3 means
    counts.append(data[:, 1].max())  # the max of the 3 counts; the counts could differ if there are NaNs in the data
# copy dfr to dfr_new
dfr_new = dfr.copy()
# add stats values
dfr_new = dfr_new.join(pd.DataFrame(v_mean, columns=['v1_m', 'v2_m', 'v3_m']))
dfr_new['counts'] = counts
# display(dfr_new)
ID Top Bottom v1_m v2_m v3_m counts
0 1 2010 3000 44.577491 0.496768 502.068266 542.0
1 1 4300 4500 44.555556 0.518066 530.968254 126.0
2 1 4550 5000 44.446281 0.538855 482.818182 242.0
3 1 7100 7700 44.348083 0.489983 506.681416 339.0
4 2 3200 4100 44.804040 0.487011 528.707071 495.0
5 2 4120 4180 45.096774 0.526687 520.967742 31.0
6 2 4300 5300 44.476980 0.529476 523.095764 543.0
7 2 5500 5520 46.000000 0.608876 430.500000 12.0
8 3 2300 2380 44.512195 0.456632 443.195122 41.0
9 3 3200 4500 44.554755 0.516616 501.841499 694.0
I'm trying to write a function to aggregate and perform various stats calculations on a dataframe in Pandas and then merge it to the original dataframe; however, I'm running into issues. This is the equivalent code in SQL:
SELECT EID,
PCODE,
SUM(PVALUE) AS PVALUE,
SUM(SQRT(SC*EXP(SC-1))) AS SC,
SUM(SI) AS SI,
SUM(EE) AS EE
INTO foo_bar_grp
FROM foo_bar
GROUP BY EID, PCODE
And then join on the original table:
SELECT *
FROM foo_bar_grp INNER JOIN
foo_bar ON foo_bar.EID = foo_bar_grp.EID
AND foo_bar.PCODE = foo_bar_grp.PCODE
Here are the steps: Loading the data
IN:>>
import numpy as np
import pandas as pd
from pandas import DataFrame

pol_dict = {'PID':[1,1,2,2],
'EID':[123,123,123,123],
'PCODE':['GU','GR','GU','GR'],
'PVALUE':[100,50,150,300],
'SI':[400,40,140,140],
'SC':[230,23,213,213],
'EE':[10000,10000,2000,30000],
}
pol_df = DataFrame(pol_dict)
pol_df
OUT:>>
EID EE PCODE PID PVALUE SC SI
0 123 10000 GU 1 100 230 400
1 123 10000 GR 1 50 23 40
2 123 2000 GU 2 150 213 140
3 123 30000 GR 2 300 213 140
Step 2: Calculating and Grouping on the data:
My pandas code is as follows:
#create aggregation dataframe
poagg_df = pol_df
del poagg_df['PID']
po_grouped_df = poagg_df.groupby(['EID','PCODE'])
#generate acc level aggregate
acc_df = po_grouped_df.agg({
'PVALUE' : np.sum,
'SI' : lambda x: np.sqrt(np.sum(x * np.exp(x-1))),
'SC' : np.sum,
'EE' : np.sum
})
This works fine until I want to join on the original table:
IN:>>
po_account_df = pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner', suffixes=('_Acc','_Po'))
OUT:>>
KeyError: u'no item named EID'
For some reason, the grouped dataframe can't join back to the original table. I've looked at ways of trying to convert the groupby columns to actual columns but that doesn't seem to work.
Please note, the end goal is to be able to find the percentage for each column (PVALUE, SI, SC, EE) IE:
pol_acc_df['PVALUE_PCT'] = np.round(pol_acc_df.PVALUE_Po/pol_acc_df.PVALUE_Acc,4)
Thanks!
By default, groupby output has the grouping columns as indices, not columns, which is why the merge is failing.
There are a couple of different ways to handle it; probably the easiest is to use the as_index parameter when you define the groupby object.
po_grouped_df = poagg_df.groupby(['EID','PCODE'], as_index=False)
Then, your merge should work as expected.
In [356]: pd.merge(acc_df, pol_df, on=['EID','PCODE'], how='inner',suffixes=('_Acc','_Po'))
Out[356]:
EID PCODE SC_Acc EE_Acc SI_Acc PVALUE_Acc EE_Po PVALUE_Po \
0 123 GR 236 40000 1.805222e+31 350 10000 50
1 123 GR 236 40000 1.805222e+31 350 30000 300
2 123 GU 443 12000 8.765549e+87 250 10000 100
3 123 GU 443 12000 8.765549e+87 250 2000 150
SC_Po SI_Po
0 23 40
1 213 140
2 230 400
3 213 140
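From there, the percentage the question asks for can be computed directly on the merged frame; a short sketch (pol_acc_df is simply a name for the merge result above, and acc_df is assumed to have been built with as_index=False):
import numpy as np
import pandas as pd

pol_acc_df = pd.merge(acc_df, pol_df, on=['EID', 'PCODE'], how='inner', suffixes=('_Acc', '_Po'))
pol_acc_df['PVALUE_PCT'] = np.round(pol_acc_df.PVALUE_Po / pol_acc_df.PVALUE_Acc, 4)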
From the pandas docs:
Transformation: perform some group-specific computations and return a like-indexed object
Unfortunately, transform works series by series, so you wouldn't be able to perform multiple functions on multiple columns as you've done with agg, but transform does allow you to skip the merge:
po_grouped_df = pol_df.groupby(['EID','PCODE'])
pol_df['sum_pval'] = po_grouped_df['PVALUE'].transform(sum)
pol_df['func_si'] = po_grouped_df['SI'].transform(lambda x: np.sqrt(np.sum(x * np.exp(x-1))))
pol_df['sum_sc'] = po_grouped_df['SC'].transform(sum)
pol_df['sum_ee'] = po_grouped_df['EE'].transform(sum)
pol_df
Results in:
PID EID PCODE PVALUE SI SC EE sum_pval func_si sum_sc sum_ee
1 123 GU 100 400 230 10000 250 8.765549e+87 443 12000
1 123 GR 50 40 23 10000 350 1.805222e+31 236 40000
2 123 GU 150 140 213 2000 250 8.765549e+87 443 12000
2 123 GR 300 140 213 30000 350 1.805222e+31 236 40000
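With those transformed columns in place, the end-goal percentage from the question can be computed without any merge; for example:
# assumes numpy is imported as np
pol_df['PVALUE_PCT'] = np.round(pol_df.PVALUE / pol_df.sum_pval, 4)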
For more info, check out this SO answer.