How to apply unique and mean functions after grouping a dataframe? - python

I am working on GPS trajectories.
I am trying to find the mean velocity of vehicles that belong to three different classes, and I need the mean for each individual vehicle.
"Vehicle ID","Frame ID","Total Frames","Global Time","Local X","Local Y","Global X","Global Y","V_Len","V_Width","V_Class","V_Vel","V_Acc","Lane_ID","Pre_Veh","Fol_Veh","Spacing","Headway"
3033,9064,633,1118847885300,42.016,377.256,6451360.093,1873080.530,19.5,8.5,2,27.90,4.29,4,3022,0,93.16,3.34
3033,9065,633,1118847885400,42.060,380.052,6451362.114,1873078.608,19.5,8.5,2,28.43,6.63,4,3022,0,93.87,3.30
3033,9066,633,1118847885500,42.122,382.924,6451364.187,1873076.613,19.5,8.5,2,29.07,6.89,4,3022,0,94.49,3.25
3033,9067,633,1118847885600,42.200,385.882,6451366.307,1873074.553,19.5,8.5,2,29.62,4.41,4,3022,0,95.04,3.21
3033,9068,633,1118847885700,42.265,388.885,6451368.490,1873072.453,19.5,8.5,2,29.93,1.57,4,3022,0,95.57,3.19
import pandas as pd
import numpy as np

# sorting by timestamp
df = df.sort_values(by=["Global Time"])
# converting the GPS millisecond timestamp to US local (Los Angeles) time
df["US Time"] = pd.to_datetime(df["Global Time"], unit='ms').dt.tz_localize('UTC').dt.tz_convert('America/Los_Angeles')
# find mean of all vehicles in each class
grouped = df.groupby('V_Class')
print(grouped['V_Vel'].agg([np.mean, np.std]))
for index, row in df.iterrows():
    print(row["Vehicle ID"], row["V_Class"])
Actual output:
V_Class       mean        std
1        40.487673  14.647576
2        37.376317  14.940034
3        40.953483  11.214995

Expected output:
Vehicle ID  V_Class   mean   std
3033        2         32.4  12.4
125         1         41.3   9.2
...and so on for each vehicle.

If you want the mean per vehicle, just group by vehicle:
df.groupby(['Vehicle ID','V_Class'])['V_Vel'].agg([np.mean, np.std])
With your sample data it should give:
mean std
Vehicle ID V_Class
3033 2 28.99 0.834955
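For reference, here is a minimal end-to-end sketch of that approach (the file name trajectories.csv is an assumption; the column names come from the question):
import pandas as pd

df = pd.read_csv("trajectories.csv")
per_vehicle = (
    df.groupby(["Vehicle ID", "V_Class"])["V_Vel"]
      .agg(["mean", "std"])
      .reset_index()  # turn the group keys back into ordinary columns
)
print(per_vehicle)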


Pandas group sum divided by unique items in group

I have data in Excel of employees and the number of hours each worked in a week. I tagged each employee with the project he/she is working on. I can get the sum of hours worked on each project by doing a groupby as below:
util_breakup_sum = df[["Tag", "Bill. Hours"]].groupby("Tag").sum()
Bill. Hours
Tag
A61H 92.00
A63B 139.75
An 27.00
B32B 33.50
H 37.00
Manager 8.00
PP 23.00
RP0117 38.50
Se 37.50
However, when I try to calculate the average time spent on each project per person, it gives me (sum / total number of entries by employees), whereas the correct average should be (sum / unique employees in the group).
An example of the mean is given below:
util_breakup_mean = df[["Tag", "Bill. Hours"]].groupby("Tag").mean()
Bill. Hours
Tag
A61H 2.243902
A63B 1.486702
An 1.000000
B32B 0.712766
H 2.055556
Manager 0.296296
PP 1.095238
RP0117 1.425926
Se 3.750000
For example, group A61H has just two employees, so their average should be (92 / 2) = 46. However, the code is dividing by the total number of entries by these employees, hence giving an average of 2.24.
How do I get the average using the unique employee names in each group?
Try:
df.groupby("Tag")["Bill. Hours"].sum().div(df.groupby("Tag")["Employee"].nunique()
Where Employee is column identifying employees.
You can try nunique
util_breakup_mean = util_breakup_sum/df.groupby("Tag")['employee'].nunique()
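A self-contained sketch of that idea with made-up data (the Employee column name is an assumption, since the question does not show it):
import pandas as pd

df = pd.DataFrame({
    "Tag": ["A61H", "A61H", "A61H", "A63B"],
    "Employee": ["Ann", "Ann", "Bob", "Cid"],
    "Bill. Hours": [40.0, 30.0, 22.0, 139.75],
})

avg_per_person = (
    df.groupby("Tag")["Bill. Hours"].sum()
    / df.groupby("Tag")["Employee"].nunique()
)
print(avg_per_person)  # A61H -> 92.0 / 2 unique employees = 46.0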

Lifetimes package gives inconsistent results

I am using Lifetimes to compute the CLV of some of my customers.
I have transactional data and, by means of summary_data_from_transaction_data (the implementation can be found here), I would like to compute the recency, the frequency and the time interval T of each customer.
Unfortunately, it seems that the method does not compute the frequency correctly.
Here is the code for testing my dataset:
df_test = pd.read_csv('test_clv.csv', sep=',')
RFT_from_library = summary_data_from_transaction_data(df_test,
                                                      'Customer',
                                                      'Transaction date',
                                                      observation_period_end='2020-02-12',
                                                      freq='D')
According to the code, the result is:
frequency recency T
Customer
1158624 18.0 389.0 401.0
1171970 67.0 396.0 406.0
1188564 12.0 105.0 401.0
The problem is that customer 1171970 and customer 1188564 made 69 and 14 transactions respectively, so their frequencies should have been 68 and 13.
Printing the size of each customer group confirms this:
print(df_test.groupby('Customer').size())
Customer
1158624 19
1171970 69
1188564 14
I did try to use the underlying code of summary_data_from_transaction_data directly, like this:
RFT_native = df_test.groupby('Customer', sort=False)['Transaction date'].agg(["min", "max", "count"])
observation_period_end = (
    pd.to_datetime('2020-02-12', format=None).to_period('D').to_timestamp()
)
# subtract 1 from count, as we ignore their first order.
RFT_native["frequency"] = RFT_native["count"] - 1
RFT_native["T"] = (observation_period_end - RFT_native["min"]) / np.timedelta64(1, 'D') / 1
RFT_native["recency"] = (RFT_native["max"] - RFT_native["min"]) / np.timedelta64(1, 'D') / 1
As you can see, the result is indeed correct.
min max count frequency T recency
Customer
1171970 2019-01-02 15:45:39 2020-02-02 13:40:18 69 68 405.343299 395.912951
1188564 2019-01-07 18:10:55 2019-04-22 14:27:08 14 13 400.242419 104.844595
1158624 2019-01-07 10:52:33 2020-01-31 13:50:36 19 18 400.546840 389.123646
Of course my dataset is much bigger, and a slight difference in frequency and/or recency considerably alters the fit of the BGF model.
What am I missing? Is there something I should take into account when using this method?
I might be a bit late to answer your query, but here goes.
The documentation for the Lifetimes package defines frequency as:
frequency represents the number of repeat purchases the customer has made. This means that it’s one less than the total number of purchases. This is actually slightly wrong. It’s the count of time periods the customer had a purchase in. So if using days as units, then it’s the count of days the customer had a purchase on.
So it's basically the number of time periods in which the customer made a repeat purchase, not the number of individual repeat purchases. A quick scan of your sample dataset confirms that both 1188564 and 1171970 did indeed make 2 purchases on a single day (13 Jan 2019 and 15 Jun 2019, respectively). Each such pair is counted once when calculating frequency, which is why summary_data_from_transaction_data returns values that are 2 less than your raw transaction counts.
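A small sketch of that counting rule with made-up data (this is not the library's internal code, just the same idea expressed in pandas):
import pandas as pd

tx = pd.DataFrame({
    "Customer": [1188564, 1188564, 1188564],
    "Transaction date": pd.to_datetime(["2019-01-13", "2019-01-13", "2019-02-01"]),
})

# with freq='D', frequency = number of distinct purchase days minus the first one
purchase_days = tx.groupby("Customer")["Transaction date"].apply(lambda s: s.dt.normalize().nunique())
frequency = purchase_days - 1
print(frequency)  # 1, not 2: the two same-day purchases collapse into one period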
According to the documentation, you need to set:
include_first_transaction = True
include_first_transaction (bool, optional) – Default: False. By default the first transaction is not included while calculating frequency and monetary_value. Can be set to True to include it. Should be False if you are going to use this data with any fitters in the lifetimes package.
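For example (reusing df_test and the column names from the question; note the caveat above about not passing this variant to the fitters):
RFT_incl_first = summary_data_from_transaction_data(
    df_test,
    'Customer',
    'Transaction date',
    observation_period_end='2020-02-12',
    freq='D',
    include_first_transaction=True,  # also count the first purchase day
)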

Pandas: calculate mean for each row by cycle number

I have a CSV file (Mspec Data) which looks like this:
#Header
#
"Cycle";"Time";"ms";"mass amu";"SEM c/s"
0000000001;00:00:01;0000001452; 1,00; 620
0000000001;00:00:01;0000001452; 1,20; 4730
0000000001;00:00:01;0000001452; 1,40; 4610
... ;..:..:..;..........;.........;...........
I read it via:
df = pd.read_csv(Filename, header=30,delimiter=';',decimal= ',' )
the result looks like this:
Cycle Time ms mass amu SEM c/s
0 1 00:00:01 1452 1.0 620
1 1 00:00:01 1452 1.2 4730
2 1 00:00:01 1452 1.4 4610
... ... ... ... ... ...
3872 4 00:06:30 390971 1.0 32290
3873 4 00:06:30 390971 1.2 31510
This data contains several mass-spec scans with identical parameters; cycle number 1 means scan 1, and so forth. I would like to calculate the mean of the last column, SEM c/s, for each corresponding identical mass. In the end I would like to have a new dataframe containing only:
ms "mass amu" "SEM c/s(mean over all cycles)"
Obviously the mean of the mass itself does not need to be calculated. I would like to avoid reading each cycle into a new dataframe, as that would mean looking up the length of each mass spectrum; the mass range and resolution are different for different measurements (solutions).
I guess doing the calculation directly in NumPy would be best, but I am stuck.
Thank you in advance
You can use groupby(), something like this:
df.groupby(['ms', 'mass amu'])['SEM c/s'].mean()
You have different ms values across the cycles, and you want to calculate the mean of SEM over each group with the same ms. I will show you a step-by-step example.
You iterate over each group and then put the means in a dictionary to convert into a DataFrame:
ms_uni = df['ms'].unique()  # the unique ms values
new_df_dict = {"ma": [], "SEM": []}  # later you will rename them
for un in range(len(ms_uni)):
    cms = ms_uni[un]
    new_df_dict['ma'].append(cms)
    new_df_dict['SEM'].append(df[df['ms'] == cms]['SEM c/s'].mean())  # tip: a safer column name such as SEM_c_s would help
new_df = pd.DataFrame(new_df_dict)  # end of the dirty work
new_df = new_df.rename(index=str, columns={'ma': "mass amu", "SEM": "SEM c/s(mean over all cycles)"})
Hope it will be helpful.
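For comparison, a minimal groupby-based sketch that produces the frame described in the question (column names taken from the question; whether you group by 'mass amu' alone or together with another key depends on your data):
mean_per_mass = (
    df.groupby("mass amu", as_index=False)["SEM c/s"]
      .mean()
      .rename(columns={"SEM c/s": "SEM c/s(mean over all cycles)"})
)
print(mean_per_mass)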

Pandas averaging selected rows and columns

I am working with some EPL stats. I have a CSV with all the matches from one season in the following format.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal 0 2 5.00 4.00 1.73
11.05.2014 Chelsea Swansea 0 0 1.50 3.00 5.00
What I would like to do is for each match calculate average stats of teams from N previous matches. The result should look something like this.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal avgNorwichSC avgArsenalSC 5.00 4.00 1.73
11.05.2014 Chelsea Swansea avgChelseaSC avgSwanseaSC 1.50 3.00 5.00
So the date, teams and odds remain untouched, and the other stats are replaced with averages over the N previous matches. EDIT: The matches from the first N rounds should not be in the final table, because there is not enough data to calculate the averages.
The trickiest part for me is that the stats I am averaging have a different prefix (H_ or A_) depending on where the match was played.
All I have managed to do so far is create a dictionary where the key is the club name and the value is a DataFrame containing all matches played by that club.
D H A H_SC A_SC H_ODDS D_ODDS A_ODDS...
11.05.2014 Norwich Arsenal 0 2 5.00 4.00 1.73
04.05.2014 Arsenal West Brom 1 0 1.40 5.25 8.00
I have also previously coded this without pandas, but I was not satisfied with the code and I would like to learn pandas :).
You say you want to learn pandas, so I've given a few examples (tested with similar data) to get you going along the right track. It's a bit of an opinion, but I think finding the last N games is hard, so I'll initially assume / pretend you want averages over the whole table at first. If finding "last N" is really important, I can add to the answer. This should get you going with pandas and groupby - I've left prints in so you can understand what's going on.
import pandas
EPL_df = pandas.read_csv('D:\\EPLstats.csv')
#Find most recent date for each team
EPL_df['D'] = pandas.to_datetime(EPL_df['D'])
homeGroup = EPL_df.groupby('H')
awayGroup = EPL_df.groupby('A')
#Following will give you dataframes, team against last game, home and away
homeLastGame = homeGroup['D'].max()
awayLastGame = awayGroup['D'].max()
teamLastGame = pandas.concat([homeLastGame, awayLastGame]).reset_index().groupby('index')['D'].max()
print(teamLastGame)
homeAveScore = homeGroup['H_SC'].mean()
awayAveScore = awayGroup['A_SC'].mean()
teamAveScore = (homeGroup['H_SC'].sum() + awayGroup['A_SC'].sum()) / (homeGroup['H_SC'].count() + awayGroup['A_SC'].count())
print(teamAveScore)
You now have average scores for each team along with their most recent match dates. All you have to do now is select the relevant rows of the original dataframe using the most recent dates (i.e. everything apart from the score columns) and then select from the average-score dataframes using the team names from those rows.
e.g.
recentRows = EPL_df.loc[EPL_df['D'] > pandas.to_datetime("2015/01/10")]
print(recentRows)
def insertAverages(s):
    a = teamAveScore[s['H']]
    b = teamAveScore[s['A']]
    print(a, b)
    return pandas.Series(dict(H_AVSC=a, A_AVSC=b))
finalTable = pandas.concat([recentRows, recentRows.apply(insertAverages, axis=1)], axis=1)
print(finalTable)
finalTable has your original odds etc. for the most recent games, with two extra columns (H_AVSC and A_AVSC) for the average scores of the home and away teams involved in those matches.
Edit
A couple of gotchas:
Just noticed I didn't put a format string in to_datetime(). Your dates look like UK format with dots, so you should do:
EPL_df['D'] = pandas.to_datetime(EPL_df['D'], format='%d.%m.%Y')
You could use the minimum of the dates in teamLastGame instead of the hard coded 2015/01/10 in my example.
If you really need to replace column H_SC with H_AVSC in your finalTable, rather than adding the averages on:
newCols = recentRows.apply(insertAverages, axis=1)
recentRows['H_SC'] = newCols['H_AVSC']
recentRows['A_SC'] = newCols['A_AVSC']
print(recentRows)
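Since the question ultimately asks for averages over only the N previous matches, here is a hedged sketch of one way to do that (not part of the answer above; it assumes the column names from the sample and reshapes the table to one row per team per match):
import pandas as pd

N = 5  # number of previous matches to average over (an assumption)

home = EPL_df[['D', 'H', 'H_SC']].rename(columns={'H': 'team', 'H_SC': 'scored'})
away = EPL_df[['D', 'A', 'A_SC']].rename(columns={'A': 'team', 'A_SC': 'scored'})
long_df = pd.concat([home, away]).sort_values('D')

# shift(1) so each match only "sees" earlier games; rolling(N) then averages the last N of those,
# leaving NaN for a team's first N matches (which the EDIT says should be dropped anyway)
long_df['avg_prev_N'] = (
    long_df.groupby('team')['scored']
           .transform(lambda s: s.shift(1).rolling(N).mean())
)
print(long_df.tail())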

Python - time series alignment and "to date" functions

I have a dataset whose first three columns are Basket ID (a unique identifier), Sale amount (dollars) and the date of the transaction. I want to calculate the following columns for each row of the dataset, and I would like to do it in Python:
previous sale of the same basket (if any); sale count to date for the current basket; mean to date for the current basket (if available); max to date for the current basket (if available).
Basket  Sale  Date        PrevSale  SaleCount  MeanToDate  MaxToDate
88      $15   3/01/2012             1
88      $30   11/02/2012  $15       2          $23         $30
88      $16   16/08/2012  $30       3          $20         $30
123     $90   18/06/2012            1
477     $77   19/08/2012            1
477     $57   11/12/2012  $77       2          $67         $77
566     $90   6/07/2012             1
I'm pretty new to Python, and I really struggle to find a clean way to do this. I've sorted the data (as above) by BasketID and Date, so I can get the previous sale in bulk by shifting forward by one within each basket. No clue how to get MeanToDate and MaxToDate efficiently apart from looping... any ideas?
This should do the trick:
from pandas import concat
from pandas.stats.moments import expanding_mean, expanding_count
def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    # se is the (ordered) time series of sales restricted to a single basket
    # we can now create a dataframe by combining different metrics
    # pandas has a function for each of the ones you are interested in!
    return concat(
        {
            'MeanToDate': expanding_mean(se),   # cumulative mean
            'MaxToDate': se.cummax(),           # cumulative max
            'SaleCount': expanding_count(se),   # cumulative count
            'Sale': se,                         # simple copy
            'PrevSale': se.shift(1)             # previous sale
        },
        axis=1
    )

# we then apply this handler to all the groups and pandas combines them
# back into a single dataframe indexed by (Basket, Date)
# we simply need to reset the index to get the shape you mention in your question
new_df = df.groupby('Basket').apply(handler).reset_index()
You can read more about grouping/aggregating here.
import pandas as pd
pd.__version__ # u'0.24.2'
from pandas import concat
def handler(grouped):
    se = grouped.set_index('Date')['Sale'].sort_index()
    return concat(
        {
            'MeanToDate': se.expanding().mean(),   # cumulative mean
            'MaxToDate': se.expanding().max(),     # cumulative max
            'SaleCount': se.expanding().count(),   # cumulative count
            'Sale': se,                            # simple copy
            'PrevSale': se.shift(1)                # previous sale
        },
        axis=1
    )

###########################
from datetime import datetime

df = pd.DataFrame({'Basket': [88, 88, 88, 123, 477, 477, 566],
                   'Sale': [15, 30, 16, 90, 77, 57, 90],
                   'Date': [datetime.strptime(ds, '%d/%m/%Y')
                            for ds in ['3/01/2012', '11/02/2012', '16/08/2012', '18/06/2012',
                                       '19/08/2012', '11/12/2012', '6/07/2012']]})

#########
new_df = df.groupby('Basket').apply(handler).reset_index()
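An alternative sketch of the same idea using groupby().expanding() directly, without apply (it reuses the df built above and keeps the question's column names):
out = df.sort_values(['Basket', 'Date']).copy()
g = out.groupby('Basket')['Sale']

out['PrevSale'] = g.shift(1)
out['SaleCount'] = g.expanding().count().reset_index(level=0, drop=True)
out['MeanToDate'] = g.expanding().mean().reset_index(level=0, drop=True)
out['MaxToDate'] = g.expanding().max().reset_index(level=0, drop=True)
print(out)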
