Traversing groups of group by object pandas - python

I need help with some big pandas issue.
As a lot of people asked to have the real input and real desired output in order to answer the question, there it goes:
So I have the following dataframe
Date        user  cumulative_num_exercises  total_exercises  %_exercises  %_exercises_accum
2017-01-01  1     2                         7                28,57        28,57
2017-01-01  2     1                         7                14,28        42,85
2017-01-01  4     3                         7                42,85        85,7
2017-01-01  10    1                         7                14,28        100
2017-02-02  1     2                         14               14,28        14,28
2017-02-02  2     3                         14               21,42        35,7
2017-02-02  4     4                         14               28,57        64,27
2017-02-02  10    5                         14               35,71        100
2017-03-03  1     3                         17               17,64        17,64
2017-03-03  2     3                         17               17,64        35,28
2017-03-03  4     5                         17               29,41        64,69
2017-03-03  10    6                         17               35,29        100
- The column %_exercises is (cumulative_num_exercises / total_exercises) * 100.
- The column %_exercises_accum is the running sum of %_exercises within each month. (Note that it reaches 100 at the last row of each month.)
- With this data, I need to calculate the % of users that contributed to 50%, 80% and 90% of the total exercises during each month.
- To do so, I thought of creating a new column, called category, which will later be used to count how many users contributed to each of the 3 percentages (50%, 80% and 90%). The category column takes the following values:
0 if the user did a %_exercises_accum = 0.
1 if the user did a %_exercises_accum < 50 and > 0.
50 if the user did a %_exercises_accum = 50.
80 if the user did a %_exercises_accum = 80.
90 if the user did a %_exercises_accum = 90.
And so on, because there are many cases in order to determine who contributes to which percentage of the total number of exercises on each month.
I have already determined all the cases and all the values that must be taken.
Basically, I traverse the dataframe using a for loop, and with two main ifs:
if df.iloc[i]['Date'] == df.iloc[i - 1]['Date']:
    # Same month group as the previous row: calculations to determine the
    # percentage or percentages to which the user contributes (second to last
    # row of the month group), because the same user can contribute to all
    # the percentages, or to more than one.
    ...
else:
    # First row of a month group: calculations to determine to which
    # percentage of exercises the first member of the group contributes.
    ...
The calculations involve:
Looking at the value of the category column in the previous row using shift().
Doing while loops inside the for loop: when a user suddenly reaches a big percentage, we need to go back over the earlier users in the same month and change their category value to 50, because they contributed to the 50% even though they didn't reach it themselves. For instance, in this situation:
Date %_exercises_accum
2017-01-01 1,24
2017-01-01 3,53
2017-01-01 20,25
2017-01-01 55,5
The desired output for the given dataframe at the beginning of the question would include the same columns as before (date, user, cumulative_num_exercises, total_exercises, %_exercises and %_exercises_accum) plus the category column, which is the following:
category
50
50
508090
90
50
50
5080
8090
50
50
5080
8090
Note that rows with values such as 508090 or 8090 mean that the user contributes to more than one percentage:
508090: the user contributes to the 50%, 80% and 90% of total exercises in a month.
8090: the user contributes to both the 80% and the 90% of exercises in a month.
Does anyone know how I can simplify this for loop by traversing the groups of a groupby object?
Thank you very much!

Given no sense of what calculations you wish to accomplish, this is my best guess at what you're looking for. However, I'd re-iterate Datanovice's point that the best way to get answers is to provide a sample output.
You can slice to each unique date using the following code:
import pandas as pd

dates = ['2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
         '2017-02-02', '2017-02-02', '2017-02-02', '2017-02-02',
         '2017-03-03', '2017-03-03', '2017-03-03', '2017-03-03']
df = pd.DataFrame(
    {'date': pd.to_datetime(dates),
     'user': [1, 2, 4, 10, 1, 2, 4, 10, 1, 2, 4, 10],
     'cumulative_num_exercises': [2, 1, 3, 1, 2, 3, 4, 5, 3, 3, 5, 6],
     'total_exercises': [7, 7, 7, 7, 14, 14, 14, 14, 17, 17, 17, 17]}
)
df = df.set_index('date')
# Visit each unique date and slice out the rows that belong to it
for idx in df.index.unique():
    hold = df.loc[idx]
    ### YOUR CODE GOES HERE ###
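For what it's worth, the same traversal can also be written with groupby itself. This is only a rough sketch: the column names pct_exercises and pct_exercises_accum are stand-ins I made up for %_exercises and %_exercises_accum, and it stops before the category logic, since those rules are only known to you.

```python
import pandas as pd

dates = ['2017-01-01'] * 4 + ['2017-02-02'] * 4 + ['2017-03-03'] * 4
df = pd.DataFrame({
    'date': pd.to_datetime(dates),
    'user': [1, 2, 4, 10] * 3,
    'cumulative_num_exercises': [2, 1, 3, 1, 2, 3, 4, 5, 3, 3, 5, 6],
    'total_exercises': [7] * 4 + [14] * 4 + [17] * 4,
})

# Share of the month's exercises done by each user (hypothetical column name)
df['pct_exercises'] = df['cumulative_num_exercises'] / df['total_exercises'] * 100
# Running sum of that share, restarted for every date (hypothetical column name)
df['pct_exercises_accum'] = df.groupby('date')['pct_exercises'].cumsum()

# Iterating a groupby object yields (key, sub-DataFrame) pairs
groups = {}
for date, group in df.groupby('date'):
    groups[date] = group  # per-month logic would go here
```

Each group is an ordinary DataFrame holding the rows of one date, so the per-month calculations no longer need to compare a row with the previous one to detect where a month starts.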

Related

calculate total count and sum of value of pandas series 7 days before today

I would like to calculate how many transactions (as frequency) and how much money each customer makes in the last 7 days before his subscribe date. I set the subscribe date to today to make a what-if scenario.
The code below runs without raising an exception, but it does not give the result I expect:
import datetime
import numpy as np

d = {'count': 'frequency', 'sum': 'monetary'}
df['trx_date'] = datetime.date.today() - datetime.timedelta(days=7)
df['trx_date'] = df['trx_date'].values.astype('datetime64[D]')
diff_ = df['subscribe_date'].sub(df['trx_date']).dt.days
out = (df.assign(Before=np.select([diff_ > 0], ["Before"], "subscribe_date"))
       .groupby(['customer_code', 'Before'])['sales'].agg(['count', 'sum'])
       .rename(columns=d)).unstack().swaplevel(axis=1)
final_dict = {i: out.loc[:, i] for i in out.columns.levels[0]}
Basically, df['trx_date'] must be 7 days before today, and df['subscribe_date'] is set to today.
As each customer's transaction value and frequency of purchases are different, the code should return a different result for each row.
Below is the expected output
frequency monetary
cust_code
6 300.0 46822.07
7 500.0 48973.42
8 451.0 51488.03
9 345.0 40883.15
10 622.0 38458.11
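
One possible reading of the goal, sketched under assumptions: each row is one transaction with its own trx_date, subscribe_date is the same fixed day for everyone, and "the last 7 days before" means the half-open window from 7 days before the subscribe date up to, but not including, that date. The sample rows and the fixed date below are made up for illustration; only the column names come from the question.

```python
import pandas as pd

# Made-up transactions; column names follow the question
df = pd.DataFrame({
    'customer_code': [6, 6, 6, 7, 7],
    'trx_date': pd.to_datetime(['2020-01-01', '2020-01-06', '2020-01-09',
                                '2020-01-05', '2019-12-20']),
    'sales': [100.0, 50.0, 25.0, 10.0, 99.0],
})
df['subscribe_date'] = pd.Timestamp('2020-01-08')  # stand-in for "today"

# Keep transactions in [subscribe_date - 7 days, subscribe_date)
window_start = df['subscribe_date'] - pd.Timedelta(days=7)
in_window = (df['trx_date'] >= window_start) & (df['trx_date'] < df['subscribe_date'])

# Count and sum the remaining sales per customer
out = (df[in_window]
       .groupby('customer_code')['sales']
       .agg(frequency='count', monetary='sum'))
```

Filtering first and aggregating afterwards keeps the original trx_date values intact, whereas assigning a single date to the whole trx_date column makes every row look the same.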

Creating Different Dataframe for conditions on multiple column in Excel using Python

I want to make a separate dataframe for each Number (column B) where Main Date > Reported Date (see the image below). If this condition is true for a Number, I have to make another dataframe displaying the data for that Number.
Example: take Number (column B) 223311. If any Main Date > Reported Date for it, then display all the records of that Number.
Here is a simple solution with Pandas. You can separate out Dataframes very easily by the values of a particular column. From there, iterate over the new Dataframe after resetting its index (if you want to keep the index, use dataframe.shape instead). I appended them to a list for convenience, which could easily be extracted into labeled dataframes, or combined. Long variable names are to help comprehension.
import pandas as pd

# Note: without parse_dates, the date columns are read and compared as strings
df = pd.read_csv('forstack.csv')
list_of_dataframes = []  # A place to store each dataframe. You could also name them as you go
checked_Numbers = []  # Simply to avoid multiple copies of the same dataframe
for aNumber in df['Number']:  # For every number in the column "Number"
    if aNumber not in checked_Numbers:  # If this number has not been processed yet
        checked_Numbers.append(aNumber)  # Mark as checked
        df_forThisNumber = df[df.Number == aNumber].reset_index(drop=True)  # "Make a different Dataframe" per request, with new index
        for index in range(0, len(df_forThisNumber)):  # Check each row of this dataframe against the criteria
            if df_forThisNumber.at[index, 'Main Date'] > df_forThisNumber.at[index, 'Reported Date']:
                list_of_dataframes.append(df_forThisNumber)  # If it matches the criteria, append it
                break  # One matching row is enough; avoids appending the same dataframe twice
Outputs :
Main Date Number Reported Date Fee Amount Cost Name
0 1/1/2019 223311 1/1/2019 100 12 20 11
1 1/7/2019 223311 1/1/2019 100 12 20 11
Main Date Number Reported Date Fee Amount Cost Name
0 1/2/2019 111111 1/2/2019 100 12 20 11
1 1/6/2019 111111 1/2/2019 100 12 20 11
Main Date Number Reported Date Fee Amount Cost Name
0 1/3/2019 222222 1/3/2019 100 12 20 11
1 1/8/2019 222222 1/3/2019 100 12 20 11
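
As a side note, the same selection can probably be done without explicit row loops. The sketch below assumes the dates are in month/day/year format and uses a small made-up frame (the rows for 111111 are hypothetical and never satisfy the condition) so that it runs on its own.

```python
import pandas as pd

# Made-up sample; assumes month/day/year dates
df = pd.DataFrame({
    'Main Date': ['1/1/2019', '1/7/2019', '1/2/2019', '1/2/2019'],
    'Number': [223311, 223311, 111111, 111111],
    'Reported Date': ['1/1/2019', '1/1/2019', '1/2/2019', '1/2/2019'],
})

# Compare real dates rather than strings
main = pd.to_datetime(df['Main Date'], format='%m/%d/%Y')
reported = pd.to_datetime(df['Reported Date'], format='%m/%d/%Y')
late = main > reported

# Numbers with at least one row where Main Date > Reported Date
numbers_with_late = df.loc[late, 'Number'].unique()
matching = df[df['Number'].isin(numbers_with_late)]

# One dataframe per qualifying Number, keyed by that Number
frames = {number: group.reset_index(drop=True)
          for number, group in matching.groupby('Number')}
```

Here frames[223311] holds all records of that Number, including the row where the two dates are equal, which matches the "display all the records of that Number" requirement.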

Pandas: Group by, Cumsum + Shift with a "where clause"

I am attempting to learn some Pandas that I otherwise would be doing in SQL window functions.
Assume I have the following dataframe which shows different players previous matches played and how many kills they got in each match.
date player kills
2019-01-01 a 15
2019-01-02 b 20
2019-01-03 a 10
2019-03-04 a 20
With the code below I managed to create a groupby that shows only the previously summed kills (the sum of the player's kills excluding the kills he got in the game of the current row).
df['sum_kills'] = df.groupby('player')['kills'].transform(lambda x: x.cumsum().shift())
This creates the following values:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 25
However what I ideally want is the option to include a filter/where clause in the grouped values. So let's say I only wanted to get the summed values from the previous 30 days (1 month). Then my new dataframe should instead look like this:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 NaN
The last row would provide zero summed_kills because no games from player a had been played over the last month. Is this possible somehow?
I think you are in a bit of a pinch using groupby and transform. As explained here, transform operates on a single series, so you can't access data from other columns.
groupby and apply does not seem to be the right way either, because the custom function is expected to return an aggregated result for the group passed by groupby, but you want a different result for each row.
So the best solution I can propose is to use apply without groupby, and perform all the selection yourself inside the custom function:
import pandas as pd

def killcount(x, data, timewin):
    """Count the player's kills in a time window before the time of the current row.
    x: dataframe row
    data: full dataframe
    timewin: a pandas.Timedelta
    """
    return data.loc[(data['date'] < x['date'])  # select dates preceding current row
                    & (data['date'] >= x['date'] - timewin)  # select dates in the timewin
                    & (data['player'] == x['player'])]['kills'].sum()  # select rows with same player

df['sum_kills'] = df.apply(lambda r: killcount(r, df, pd.Timedelta(30, 'D')), axis=1)
This returns:
date player kills sum_kills
0 2019-01-01 a 15 0
1 2019-01-02 b 20 0
2 2019-01-03 a 10 15
3 2019-03-04 a 20 0
In case you haven't done so yet, remember to parse the 'date' column to datetime type using pandas.to_datetime, otherwise you cannot perform date comparisons.
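An alternative that may scale better than a row-wise apply is a time-based rolling window per player. This is a sketch, not a drop-in replacement: it relies on rolling('30D', closed='left'), which covers the 30 days before each row and excludes the row itself, so rows with no earlier match come out as NaN rather than 0.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03', '2019-03-04']),
    'player': ['a', 'b', 'a', 'a'],
    'kills': [15, 20, 10, 20],
})

parts = []
# Time-based rolling needs dates in order, so sort before grouping
for player, grp in df.sort_values('date').groupby('player'):
    # Window is [date - 30 days, date): previous matches only
    rolled = grp.set_index('date')['kills'].rolling('30D', closed='left').sum()
    # Re-attach the original row labels so the result aligns with df
    parts.append(pd.Series(rolled.to_numpy(), index=grp.index))

df['sum_kills'] = pd.concat(parts)
```

If zeros are preferred over NaN, as in the output above, a fillna(0) on the new column would do it.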

Tricky groupby/moving average by date calculation

I am having trouble illustrating my problem in the form the data is actually in without complicating things, so bear with me. The following screenshot is for explaining the problem only (the data is not in this form):
I would like to identify the past 14 days with a number > 0 across all bins (i.e. the total row has a value greater than 0). This would include all days except for days 5 and 12 (highlighted in red). I would then like to sum across bins horizontally for those 14 days (i.e. sum all days except for 5 and 12, by bin), with the goal of ultimately calculating a 14 day average by Bin number.
Note the example above would be for one “Lane”, where my data has > 10,000. The example also only illustrates today being day 16. But I would like to apply this logic to every day in the data set. I.e. on day 20 (along with any other date), it would look at the last 14 days with a value across all bins, then use that data range to aggregate across Bin. This is a screenshot sample of how the data looks:
A simple example using the data as it is structured, with only 3 Bins, 1 Lane, and a 3 data point/date look back:
Lane Date Bin KG
AMS-ORD 2018-08-26 3 10
AMS-ORD 2018-08-29 1 25
AMS-ORD 2018-08-30 2 30
AMS-ORD 2018-09-03 2 20
AMS-ORD 2018-09-04 1 40
Note KG here is a sum. Again this is for one day (aka today), but I would like every date in my data set to follow the same logic. The output would look like the following:
Lane Date Bin KG Average
AMS-ORD 2018-09-04 1 40 13.33
AMS-ORD 2018-09-04 2 50 16.67
AMS-ORD 2018-09-04 3 0 -
I have messed around with .rolling(14).mean(), .tail(), and some others. The problem I have is specifying the correct date range for the correct Bin aggregation.

Add value from series index to row of equal value in Pandas DataFrame

I'm facing bit of an issue adding a new column to my Pandas DataFrame: I have a DataFrame in which each row represents a record of location data and a timestamp. Those records belong to trips, so each row also contains a trip id. Imagine the DataFrame looks kind of like this:
TripID Lat Lon time
0 42 53.55 9.99 74
1 42 53.58 9.99 78
3 42 53.60 9.98 79
6 12 52.01 10.04 64
7 12 52.34 10.05 69
Now I would like to delete the records of all trips that have less than a minimum amount of records to them. I figured I could simply get the number of records of each trip like so:
lengths = df['TripID'].value_counts()
Then my idea was to add an additional column to the DataFrame and fill it with the values from that Series corresponding to the trip id of each record. I would then be able to get rid of all rows in which the value of the lengthcolumn is too small.
However, I can't seem to find a way to get the length values into the correct rows. Would any one have an idea for that or even a better approach to the entire problem?
Thanks very much!
EDIT:
My desired output should look something like this:
TripID Lat Lon time length
0 42 53.55 9.99 74 3
1 42 53.58 9.99 78 3
3 42 53.60 9.98 79 3
6 12 52.01 10.04 64 2
7 12 52.34 10.05 69 2
If I understand correctly, to get the length of the trip, you'd want to get the difference between the maximum time and the minimum time for each trip. You can do that with a groupby statement.
# Group by trip, get the minimum and maximum times, then reset the index
df_new = df.groupby('TripID')['time'].agg(['min', 'max']).reset_index()
# Use bracket access here: df_new.max and df_new.min refer to DataFrame methods
df_new['length_of_trip'] = df_new['max'] - df_new['min']
df_new = df_new.loc[df_new.length_of_trip > 90]  # 90 is an arbitrary threshold
That'll get you one row per trip with a trip length above the amount you need, including the trip IDs.
You can use groupby and transform to directly add the lengths column to the DataFrame, like so:
df["lengths"] = df.groupby("TripID")["time"].transform("count")
I managed to find an answer to my question that is quite a bit nicer than my original approach as well:
df = df.groupby('TripID').filter(lambda x: len(x) > 2)
This can be found in the Pandas documentation. It gets rid of all groups that have 2 or fewer elements in them, i.e. trips with 2 records or fewer in my case.
I hope this will help someone else out as well.
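For completeness, the original idea of pushing the value_counts result back into the rows can be sketched with Series.map, which looks up each TripID in the counts Series. The min_records threshold below is a made-up value for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'TripID': [42, 42, 42, 12, 12],
     'Lat': [53.55, 53.58, 53.60, 52.01, 52.34],
     'Lon': [9.99, 9.99, 9.98, 10.04, 10.05],
     'time': [74, 78, 79, 64, 69]},
    index=[0, 1, 3, 6, 7],
)

# Number of records per trip, indexed by TripID
lengths = df['TripID'].value_counts()
# Look up each row's TripID in that Series
df['length'] = df['TripID'].map(lengths)

min_records = 3  # hypothetical minimum number of records per trip
long_trips = df[df['length'] >= min_records]
```

This keeps the length column around for later use, while the filter approach above is shorter if the column itself is not needed.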
