Let's say I have some data like this:
category = pd.Series(np.ones(4))
job1_days = pd.Series([1, 2, 1, 2])
job1_time = pd.Series([30, 35, 50, 10])
job2_days = pd.Series([1, 3, 1, 3])
job2_time = pd.Series([10, 40, 60, 10])
job3_days = pd.Series([1, 2, 1, 3])
job3_time = pd.Series([30, 15, 50, 15])
Each entry represents an individual (so 4 people total). xxx_days is the number of days an individual spent on a job, and xxx_time is the number of minutes spent on that job on a single day.
I want to assign a 2 to category for an individual, if across all jobs they spent at least 3 days of 20 minutes each. So for example, person 1 does not meet the criteria because they only spent 2 total days with at least 20 minutes (their job 2 day count does not count toward the total because time is < 20). Person 2 does meet the criteria as they spent 5 total days (jobs 1 and 2).
After replacement, category should look like this:
[1, 2, 2, 1]
My current attempt requires a for loop that manually indexes into each series and totals the days where time is at least 20 minutes. However, this approach doesn't scale well to my actual dataset, so I haven't included the code here; I'd like to approach it from a Pandas perspective instead.
What's the most efficient way to do this in Pandas? The thing that stumps me is checking conditions across multiple series and acting accordingly after summing the days.
Put the days and times into two data frames, keeping the column positions aligned, then do the calculation in a vectorized way:
import pandas as pd
time = pd.concat([job1_time, job2_time, job3_time], axis=1)
days = pd.concat([job1_days, job2_days, job3_days], axis=1)
((days * (time >= 20)).sum(axis=1) >= 3) + 1
#0 1
#1 2
#2 2
#3 1
#dtype: int64
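For completeness, here is a self-contained sketch that assembles the pieces above and writes the result back into category (using np.where for the 1/2 assignment instead of the `+ 1` trick):

```python
import pandas as pd
import numpy as np

category = pd.Series(np.ones(4))
job1_days = pd.Series([1, 2, 1, 2]); job1_time = pd.Series([30, 35, 50, 10])
job2_days = pd.Series([1, 3, 1, 3]); job2_time = pd.Series([10, 40, 60, 10])
job3_days = pd.Series([1, 2, 1, 3]); job3_time = pd.Series([30, 15, 50, 15])

time = pd.concat([job1_time, job2_time, job3_time], axis=1)
days = pd.concat([job1_days, job2_days, job3_days], axis=1)

# Count only the days whose per-day time is at least 20 minutes, then threshold.
qualifying_days = (days * (time >= 20)).sum(axis=1)
category = pd.Series(np.where(qualifying_days >= 3, 2, 1))
print(category.tolist())  # [1, 2, 2, 1]
```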
I have a problem. I have a dataframe that gives me the savings per data point; if I sum it up, I get the total savings across all X data points.
Last year I had Y data points. I would now like to predict/extrapolate how much savings I can expect.
Thus I calculate the following:
(total savings / number of data points) * number of old data points = extrapolated savings
In my estimation, this is wrong. So how could I calculate an extrapolation for the savings of the 10 data points?
import pandas as pd
d = {'id': [1, 2, 3, 4], 'saveing': [10, 20, 30, 5]}
df = pd.DataFrame(data=d)
count_datapoints = 10
(df['saveing'].sum() / df.shape[0]) * count_datapoints
[OUT] 162.5
Dataframe
id saveing
0 1 10
1 2 20
2 3 30
3 4 5
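As a quick sketch (not a statistical fix, just a restatement of the same arithmetic), the extrapolation can be written with mean(), which makes the per-point average explicit:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4], 'saveing': [10, 20, 30, 5]})
count_datapoints = 10

# Average saving per data point, scaled to the expected number of points.
extrapolated = df['saveing'].mean() * count_datapoints
print(extrapolated)  # 162.5
```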
Imagine I have the following data frame:
Product   Month 1   Month 2   Month 3   Month 4   Total
Stuff A         5         0         3         3      11
Stuff B        10        11         4         8      33
Stuff C         0         0        23        30      53
that can be constructed from:
df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
'Month 1': [5, 10, 0],
'Month 2': [0, 11, 0],
'Month 3': [3, 4, 23],
'Month 4': [3, 8, 30],
'Total': [11, 33, 53]})
This data frame shows the amount of units sold per product, per month.
Now, what I want to do is to create a new column called "Average" that calculates the average units sold per month. HOWEVER, notice in this example that Stuff C's values for months 1 and 2 are 0. This product was probably introduced in Month 3, so its average should be calculated based on months 3 and 4 only. Also notice that Stuff A's units sold in Month 2 were 0, but that does not mean the product was introduced in Month 3 since 5 units were sold in Month 1. That is, its average should be calculated based on all four months. Assume that the provided data frame may contain any number of months.
Based on these conditions, I have come up with the following solution in pseudo-code:
months = ["list of index names of months to calculate"]
x = len(months)
if df["Month 1"] != 0:
    df["Average"] = df["Total"] / x
elif df["Month 2"] != 0:
    df["Average"] = df["Total"] / (x - 1)
...
elif df["Month " + str(x)] != 0:
    df["Average"] = df["Total"] / 1
else:
    df["Average"] = 0
That way, the average would be calculated starting from the first month where units sold are different from 0. However, I haven't been able to translate this logical abstraction into actual working code. I couldn't manage to iterate over len(months) while maintaining the elif conditions. Or maybe there is a better, more practical approach.
I would appreciate any help, since I've been trying to crack this problem for a while with no success.
NumPy has a method, np.trim_zeros, that trims leading and/or trailing zeros. Using a list comprehension, you can iterate over the relevant DataFrame rows, trim the leading zeros, and take the mean of what remains in each row.
Note that since 'Month 1' to 'Month 4' are consecutive, you can slice the columns between them using .loc.
import numpy as np
df['Average Sales'] = [np.trim_zeros(row, trim='f').mean() for row in df.loc[:, 'Month 1':'Month 4'].to_numpy()]
Output:
Product Month 1 Month 2 Month 3 Month 4 Total Average Sales
0 Stuff A 5 0 3 3 11 2.75
1 Stuff B 10 11 4 8 33 8.25
2 Stuff C 0 0 23 30 53 26.50
Try:
df = df.set_index(['Product', 'Total'])
df['Average'] = df.where(df.ne(0).cummax(axis=1)).mean(axis=1)
df_out = df.reset_index()
print(df_out)
Output:
Product Total Month 1 Month 2 Month 3 Month 4 Average
0 Stuff A 11 5 0 3 3 2.75
1 Stuff B 33 10 11 4 8 8.25
2 Stuff C 53 0 0 23 30 26.50
Details:
Move Product and Total into the dataframe index, so we can do the calculation on the rest of the dataframe.
First, create a boolean mask by comparing the values to zero with ne. Then apply cummax along the rows: once a non-zero value is seen, the mask stays True until the end of the row; if a row starts with zeros, the mask stays False until the first non-zero value, then turns True and remains True.
Next, use pd.DataFrame.where to keep only the values where that boolean mask is True; the other values (the leading zeros) become NaN and are excluded from the mean calculation.
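To illustrate the mask, here is a minimal sketch on the question's data showing what ne(0).cummax(axis=1) produces:

```python
import pandas as pd

df = pd.DataFrame({'Product': ['Stuff A', 'Stuff B', 'Stuff C'],
                   'Month 1': [5, 10, 0],
                   'Month 2': [0, 11, 0],
                   'Month 3': [3, 4, 23],
                   'Month 4': [3, 8, 30],
                   'Total': [11, 33, 53]}).set_index(['Product', 'Total'])

# True from the first non-zero value onward in each row;
# leading zeros stay False, interior zeros (Stuff A, Month 2) stay True.
mask = df.ne(0).cummax(axis=1)
print(mask)
```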
If you don't mind it being a little memory-inefficient, you could put your dataframe into a NumPy array. You can select the non-zero entries of an array with np.nonzero, and then use the mean function to calculate the average. It could look something like this:
import numpy as np
arr = np.array(Stuff_A_DF)
mean = arr[np.nonzero(arr)].mean()
Note that this removes all zeros, not just the leading ones, so for a row like Stuff A (5, 0, 3, 3) it would average only the three non-zero values.
Alternatively, you could manually extract the row to a list, then loop through it to remove the zeroes.
I have to group a dataset with multiple participants. The participants work a specific time on a specific tablet. If rows are the same tablet, and the time difference between consecutive rows is no more than 10 minutes, the rows belong to one participant. I would like to create a new column ("Participant") that numbers the participants. I know some python but this goes over my head. Thanks a lot!
Dataframe:
ID, Time, Tablet
1, 9:12, a
2, 9:14, a
3, 9:17, a
4, 9:45, a
5, 9:49, a
6, 9:51, a
7, 9:13, b
8, 9:15, b
...
Goal:
ID, Time, Tablet, Participant
1, 9:12, a, 1
2, 9:14, a, 1
3, 9:17, a, 1
4, 9:45, a, 2
5, 9:49, a, 2
6, 9:51, a, 2
7, 9:13, b, 3
8, 9:15, b, 3
...
You can use groupby first, then a cumsum to get the participant column the way you want. Make sure the time column is in datetime format, and sort by tablet and time before you do this:
import numpy as np

df['Time'] = pd.to_datetime(df['Time'])
df['time_diff'] = df.groupby('Tablet')['Time'].diff().dt.total_seconds() / 60
df['Participant'] = np.where(df['time_diff'].isnull() | (df['time_diff'] > 10), 1, 0).cumsum()
I've done something similar before, using a combination of a groupby and the Pandas shift function.
df = df.sort_values(["Tablet", "Time"])
df["Time_Period"] = df["Time"] - df.groupby("Tablet")["Time"].shift(1)
df["Time_Period"] = df["Time_Period"].dt.total_seconds()
df["New_Participant"] = df["Time_Period"].isna() | (df["Time_Period"] > 10 * 60)  # 10 minutes
df["Participant_ID"] = df["New_Participant"].cumsum()
Basically, I flag every row that starts a new session (the first row on a tablet, or a gap of over 10 minutes since the previous row), then take a cumulative sum to give each participant a unique ID.
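A self-contained sketch of the diff-and-cumsum idea on the question's sample data (times are parsed as plain %H:%M timestamps for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': range(1, 9),
    'Time': ['9:12', '9:14', '9:17', '9:45', '9:49', '9:51', '9:13', '9:15'],
    'Tablet': ['a', 'a', 'a', 'a', 'a', 'a', 'b', 'b'],
})
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M')
df = df.sort_values(['Tablet', 'Time'])

# Minutes since the previous row on the same tablet (NaN for each tablet's first row).
gap = df.groupby('Tablet')['Time'].diff().dt.total_seconds() / 60
# A new participant starts at the first row per tablet or after a >10-minute gap.
df['Participant'] = (gap.isna() | (gap > 10)).cumsum()
print(df[['ID', 'Tablet', 'Participant']])
```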
So basically I just need advice on how to calculate a 24 month rolling mean over each row of a dataframe. Every row indicates a particular city, and the columns are the respective sales for that month. If anyone could help me figure this out, it would be much appreciated
Edit: Clearly I failed to explain myself properly. I know that pandas has a rolling method built in. The problem is that I don't want to take the moving average of a singular column, I want to take it of columns in a row.
Sample Dataset
State - M1 - M2 - M3 - M4 - ..... - M48
UT - 40 - 20 - 30 - 60 -..... 60
CA - 30 - 60 - 20 - 40 -..... 70
So I want to find the rolling average for each state's most recent 24 months (columns M24-M48).
What I've tried:
Data['24_Month_Moving_Average'] = Data.rolling(window=24, win_type='triang', min_periods=1, axis=1).mean()
error: Wrong number of items passed 139, placement implies 1
edit 2, Sample Dataset:
Data = pd.DataFrame({'M1': [1, 2], 'M2': [3, 5], 'M3': [5, 6]}, index=['UT', 'CA'])
# need code that will add column that is the rolling 24 month average for each state
You can use rolling() with mean() and specify the parameters you want (window, min_periods) as follows:
df.col1.rolling(n, win_type='triang', min_periods=1).mean()
I don't know what your expected output should be, but here is a sample that uses apply() to generate the rolling mean for each row (make the state column the index of your dataframe); hope it helps:
import pandas as pd
df = pd.DataFrame({'B': [6, 1, 2, 20, 4], 'C': [1, 1, 2, 30, 4], 'D': [10, 1, 2, 5, 4]})

def test_roll(data):
    return data.rolling(window=2, win_type='triang', min_periods=1).mean()

print(df.apply(test_roll, axis=1))
pandas.DataFrame.rolling
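Alternatively, as a sketch (using a plain rolling mean rather than a triangular window), you can roll across the columns by transposing, which avoids the per-row apply entirely:

```python
import pandas as pd

Data = pd.DataFrame({'M1': [1, 2], 'M2': [3, 5], 'M3': [5, 6]}, index=['UT', 'CA'])

# Rolling mean across columns: transpose, roll down the rows, transpose back.
rolled = Data.T.rolling(window=2, min_periods=1).mean().T
print(rolled)
```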
this is more of a guidance / point me in the right direction sort of question.
The Problemo!
I have a problem at work that I currently work out using a very very long excel formula.
I basically allocate a variable of hours (let's call this h) to 500 stores
I then declare the hour's allocation for a full-time colleague and part-time (ft and pt)
The formula I have at the moment works out, based on the number of hours, how many FT colleagues can work there, and once the FT allocation is exhausted (i.e. the remaining hours can no longer be divided into a whole FT shift) it moves on to the number of PT colleagues.
In math terms: I allocate 20 hours to store A.
Store A's FT colleagues work 12 hours and the PT colleagues work 6.
Based on this, store A can accommodate 1 FT colleague, 1 PT colleague, and have 2 hours as a remainder.
I would like to do this in python and thought it would be a good first real-ish project to work on.
Solution thus far
What I've tried is to start fleshing out a function that takes in the ft, pt and h as arguments and spits out the number of FT and PT the number of hours can accommodate. I would then love to append this into a pandas data frame. However, I haven't been able to work this out for a while now, and I have no idea what to search for on SO:
def allocate(full_time, part_time, hours):
    if hours < full_time or hours < part_time:
        return full_time
    elif hours >= full_time:
        return full_time
    elif hours >= full_time ...
What I've tried is to start fleshing out a function that takes in the ft, pt and h as arguments and spits out the number of FT and PT the number of hours can accommodate.
My understanding is that you have three input variables and three outputs. A given store with total_hours allocated has FT employees who can work ft_hours and PT employees who can each work pt_hours. You want to find the number of FT workers & PT workers to allocate, and the remainder assuming that no employees will work half-shifts.
def alloc_hours(
    ft_hours: int,
    pt_hours: int,
    total_hours: int,
) -> tuple:
    """Calculate the hour allocation for a given store.

    ft_hours: The number of hours a full-time emp. works.
    pt_hours: The number of hours a part-time emp. works.
    total_hours: The total hours allocated to the store.

    Returns: tuple
        1st element: num. of full-time workers.
        2nd element: num. of part-time workers.
        3rd element: remainder hours.
    """
    ft_workers, remainder = divmod(total_hours, ft_hours)
    pt_workers, remainder = divmod(remainder, pt_hours)
    return ft_workers, pt_workers, remainder
Examples:
>>> alloc_hours(12, 6, 20)
(1, 1, 2)
>>> alloc_hours(8, 6, 20)
(2, 0, 4)
>>> alloc_hours(8, 6, 24)
(3, 0, 0)
In Pandas:
import pandas as pd

data = {
    'ft_hours': [12, 8, 10, 8, 12, 10, 8, 8],
    'pt_hours': [6, 4, 6, 6, 6, 4, 4, 6],
    'total_hours': [20, 20, 24, 40, 30, 20, 10, 40]
}
data = pd.DataFrame(data)

# Pandas supports vectorization, so each of these results is a Series.
ft_workers, remainder = divmod(data['total_hours'], data['ft_hours'])
pt_workers, remainder = divmod(remainder, data['pt_hours'])
data = data.assign(
    ft_workers=ft_workers,
    pt_workers=pt_workers,
    remainder=remainder
)
Result:
>>> data
ft_hours pt_hours total_hours ft_workers pt_workers remainder
0 12 6 20 1 1 2
1 8 4 20 2 1 0
2 10 6 24 2 0 4
3 8 6 40 5 0 0
4 12 6 30 2 1 0
5 10 4 20 2 0 0
6 8 4 10 1 0 2
7 8 6 40 5 0 0
This answer is based on the assumption that you have an existing DataFrame that provides the three inputs. You could create a new column/field using the pandas apply function: apply takes your inputs, applies your function, then returns the results in the new field.
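As a sketch of that apply-based approach (reusing the divmod allocation function and a small sample frame with the three input columns), result_type='expand' unpacks the returned tuple into three new columns:

```python
import pandas as pd

def alloc_hours(ft_hours, pt_hours, total_hours):
    # Whole full-time shifts first, then part-time shifts from what is left.
    ft_workers, remainder = divmod(total_hours, ft_hours)
    pt_workers, remainder = divmod(remainder, pt_hours)
    return ft_workers, pt_workers, remainder

data = pd.DataFrame({
    'ft_hours': [12, 8],
    'pt_hours': [6, 4],
    'total_hours': [20, 20],
})

# Apply the function row by row and expand the tuple into three new columns.
data[['ft_workers', 'pt_workers', 'remainder']] = data.apply(
    lambda row: alloc_hours(row['ft_hours'], row['pt_hours'], row['total_hours']),
    axis=1, result_type='expand',
)
print(data)
```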