I want to sum the value for each gender in every 5-minute time bucket.
Main Table:-
Time Gender value
10:01 Male 5
10:02 Female 1
10:03 Male 5
10:04 Male 5
10:05 Female 1
10:06 Female 1
10:07 Male 5
10:08 Male 5
10:09 Male 5
10:10 Male 5
Required Result:-
Time Gender value
10:00 Male 15
10:00 Female 2
10:05 Male 20
10:05 Female 1
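For reference, the sample frame can be rebuilt like this (a minimal sketch; the Time column is assumed to hold plain 'HH:MM' strings):
import pandas as pd

df = pd.DataFrame({
    'Time': ['10:01', '10:02', '10:03', '10:04', '10:05',
             '10:06', '10:07', '10:08', '10:09', '10:10'],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female',
               'Female', 'Male', 'Male', 'Male', 'Male'],
    'value': [5, 1, 5, 5, 1, 1, 5, 5, 5, 5],
})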
You could convert to Timedelta, floor the result, and use it for a groupby + agg:
t = pd.to_timedelta(df['Time'] + ':00')
(df
 .groupby([t.dt.floor('5min'), 'Gender'])
 .agg({'value': 'sum'})
 .reset_index()
)
output:
Time Gender value
0 0 days 10:00:00 Female 1
1 0 days 10:00:00 Male 15
2 0 days 10:05:00 Female 2
3 0 days 10:05:00 Male 15
4 0 days 10:10:00 Male 5
This is close, but it does not yet match the provided output.
To match your provided output, it needs a few more things:
- subtracting one minute before flooring, so that '10:05' lands in the '10:00' bucket (the bins are right-closed)
- converting the Timedelta back to an 'HH:MM' string
t = pd.to_timedelta(df['Time'] + ':00').sub(pd.to_timedelta('1min'))
(df
 .groupby([t.dt.floor('5min'), 'Gender'])
 .agg({'value': 'sum'})
 .reset_index()
 .assign(Time=lambda d: (pd.to_datetime(0) + d['Time']).dt.strftime('%H:%M'))
)
output:
Time Gender value
0 10:00 Female 2
1 10:00 Male 15
2 10:05 Female 1
3 10:05 Male 20
A variant, building the 'HH:MM' label directly from the Timedelta's string form:
t = pd.to_timedelta(df['Time'] + ':00').sub(pd.to_timedelta('1min'))
(df.assign(Time=t.dt.floor('5min').astype(str).str[-8:-3])
 .groupby(['Time', 'Gender'])
 ['value'].sum().reset_index()
)
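If the times are parsed to datetimes instead of Timedeltas, the same right-closed binning can be expressed with pd.Grouper (a sketch under that assumption: closed='right' puts 10:05 into the 10:00 bucket, and label='left' labels each bin by its start):
t = pd.to_datetime(df['Time'], format='%H:%M')
out = (df.set_index(t)
         .groupby([pd.Grouper(freq='5min', closed='right', label='left'), 'Gender'])
         ['value'].sum()
         .reset_index())
out['Time'] = out['Time'].dt.strftime('%H:%M')  # back to plain 'HH:MM' strings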
I have a pandas dataframe that currently has no specific index (so when printing, an automatic index is created which begins with 0). Now I would like to have a "timeslot" index that begins with 1 and an additional "time of the day" column in the dataframe. Here you can see a screenshot of how the output CSV should look. Can you tell me how to do this?
Try with pd.date_range:
df['time of day'] = pd.date_range('1970-1-1', periods=len(df), freq='H') \
.strftime('%H:%M')
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 50, (30, 2)), columns=['Column 1', 'Column 2'])
df.insert(0, 'time of day', pd.date_range('1970-1-1', periods=len(df), freq='H').strftime('%H:%M'))
df.index.name = 'timeslot'
df.index += 1
print(df)
# Output:
time of day Column 1 Column 2
timeslot
1 00:00 43 33
2 01:00 20 11
3 02:00 40 10
4 03:00 19 28
5 04:00 10 27
6 05:00 27 10
7 06:00 1 10
8 07:00 33 36
9 08:00 32 2
10 09:00 23 32
11 10:00 1 17
12 11:00 48 42
13 12:00 21 3
14 13:00 48 28
15 14:00 41 46
16 15:00 48 43
17 16:00 47 6
18 17:00 33 21
19 18:00 38 19
20 19:00 17 40
21 20:00 8 24
22 21:00 28 22
23 22:00 2 13
24 23:00 24 3
25 00:00 4 1
26 01:00 8 9
27 02:00 19 36
28 03:00 30 36
29 04:00 43 39
30 05:00 43 3
Assuming your dataframe is df:
df['time of day'] = df.index.astype(str).str.rjust(2, '0')+':00'
df.index += 1
If there are more than 24 rows, wrap around with a modulo:
df['time of day'] = (df.index%24).astype(str).str.rjust(2, '0')+':00'
df.index += 1
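Since the end goal is a CSV, note that the named index is written out as a regular first column by to_csv (the filename here is just an example):
df.to_csv('output.csv')  # columns: timeslot, time of day, Column 1, Column 2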
I have a dataframe with "close_date", "open_date", "amount", "sales_rep".
sales_rep  open_date(MM/DD/YYYY)  close_date  amount
Jim        1/01/2021              2/05/2021   3
Jim        1/15/2021              4/06/2021   26
Jim        2/01/2021              2/06/2021   7
Jim        2/15/2021              3/14/2021   12
Jim        3/01/2021              4/22/2021   13
Jim        3/15/2021              3/29/2021   5
Jim        4/01/2021              4/20/2021   17
Bob        1/01/2021              1/12/2021   23
Bob        1/15/2021              2/16/2021   12
Bob        2/01/2021              3/04/2021   4
Bob        2/15/2021              4/05/2021   23
Bob        3/01/2021              3/24/2021   12
Bob        3/15/2021              4/15/2021   7
Bob        4/01/2021              5/01/2021   20
I want to create a column that tells me the open amount. If we take the second row, the opp was closed on 4/06/2021; I want to know how many open opps there were at that date. So I check whether row 5's open date is before the close date of 4/06/2021 and whether row 5's close date is after 4/06/2021. In this case it is, so I add its amount to the sum. I also want the current row's value to be included in the sum. This should be done for each sales rep in the dataframe. I have filled in the table with the expected values below.
sales_rep  open_date(MM/DD/YYYY)  close_date  amount  open_amount_sum
Jim        1/01/2021              2/05/2021   3       36
Jim        1/15/2021              4/06/2021   26      56
Jim        2/01/2021              2/06/2021   7       33
Jim        2/15/2021              3/14/2021   12      51
Jim        3/01/2021              4/22/2021   13      13
Jim        3/15/2021              3/29/2021   5       44
Jim        4/01/2021              4/20/2021   17      30
Bob        1/01/2021              1/12/2021   23      23
Bob        1/15/2021              2/16/2021   12      39
Bob        2/01/2021              3/04/2021   4       39
Bob        2/15/2021              4/05/2021   23      50
Bob        3/01/2021              3/24/2021   12      42
Bob        3/15/2021              4/15/2021   7       27
Bob        4/01/2021              5/01/2021   20      20
(The 36 in the first row comes from adding 26 and 7, the only other amounts that fit the condition, plus the 3 from the row itself.)
Edit: @RJ's solution from the comments is better; here it is, formatted slightly differently:
df['open_amount_sum'] = df.apply(
    lambda x: df[
        # same rep, opened on or before this row's close date,
        # and not yet closed at that date
        # (assumes open_date/close_date were parsed with pd.to_datetime)
        df['sales_rep'].eq(x['sales_rep']) &
        df['open_date'].le(x['close_date']) &
        df['close_date'].ge(x['close_date'])
    ]['amount'].sum(),
    axis=1,
)
Here is a solution, but it is slow and kind of ugly; it can definitely be improved:
import pandas as pd
import io
df = pd.read_csv(io.StringIO(
"""
sales_rep,open_date,close_date,amount
Jim,1/01/2021,2/05/2021,3
Jim,1/15/2021,4/06/2021,26
Jim,2/01/2021,2/06/2021,7
Jim,2/15/2021,3/14/2021,12
Jim,3/01/2021,4/22/2021,13
Jim,3/15/2021,3/29/2021,5
Jim,4/01/2021,4/20/2021,17
Bob,1/01/2021,1/12/2021,23
Bob,1/15/2021,2/16/2021,12
Bob,2/01/2021,3/04/2021,4
Bob,2/15/2021,4/05/2021,23
Bob,3/01/2021,3/24/2021,12
Bob,3/15/2021,4/15/2021,7
Bob,4/01/2021,5/01/2021,20
"""
))
# parse the date columns so the le/ge comparisons are chronological, not lexicographic
df['open_date'] = pd.to_datetime(df['open_date'], format='%m/%d/%Y')
df['close_date'] = pd.to_datetime(df['close_date'], format='%m/%d/%Y')
sum_df = df.groupby('sales_rep').apply(
    lambda g:
        g['close_date'].apply(
            lambda close:
                g.loc[
                    g['open_date'].le(close) & g['close_date'].ge(close),
                    'amount'
                ].sum())
).reset_index(level=0)
df['close_sum'] = sum_df['close_date']
df
Merge the dataframe onto itself, then filter before grouping:
(df
 .merge(df, on='sales_rep')
 .query('open_date_y <= close_date_x <= close_date_y')
 .loc(axis=1)['sales_rep', 'open_date_x', 'close_date_x', 'amount_x', 'amount_y']
 # str.removesuffix requires Python 3.9+
 .rename(columns=lambda col: col.removesuffix('_x'))
 .rename(columns={'amount_y': 'open_sum_amount'})
 .groupby(['sales_rep', 'open_date', 'close_date', 'amount'],
          sort=False,
          as_index=False)
 .sum()
)
sales_rep open_date close_date amount open_sum_amount
0 Jim 2021-01-01 2021-02-05 3 36
1 Jim 2021-01-15 2021-04-06 26 56
2 Jim 2021-02-01 2021-02-06 7 33
3 Jim 2021-02-15 2021-03-14 12 51
4 Jim 2021-03-01 2021-04-22 13 13
5 Jim 2021-03-15 2021-03-29 5 44
6 Jim 2021-04-01 2021-04-20 17 30
7 Bob 2021-01-01 2021-01-12 23 23
8 Bob 2021-01-15 2021-02-16 12 39
9 Bob 2021-02-01 2021-03-04 4 39
10 Bob 2021-02-15 2021-04-05 23 50
11 Bob 2021-03-01 2021-03-24 12 42
12 Bob 2021-03-15 2021-04-15 7 27
13 Bob 2021-04-01 2021-05-01 20 20
I have a table like this:
Date Student Average(for that date)
17 Jan 2020 Alex 40
18 Jan 2020 Alex 50
19 Jan 2020 Alex 80
20 Jan 2020 Alex 70
17 Jan 2020 Jeff 10
18 Jan 2020 Jeff 50
19 Jan 2020 Jeff 80
20 Jan 2020 Jeff 60
I want to add a column for high and low. The logic for that column should be: it is High as long as the student's average score for today's date is greater than 90% of the previous day's score.
My comparison would look something like this:
avg(score)(for current date) < (avg(score)(for previous day) - (90% * avg(score)(for previous day) / 100))
I can't figure out how to incorporate the date part into my formula, so that it compares the current day's average to the previous day's average.
I am working with pandas, so I was wondering if there is a way to do this.
IIUC,
import numpy as np
import pandas as pd

# Date is assumed to be a datetime already, so the sort is chronological
df['Previous Day'] = df.sort_values('Date').groupby('Student')['Average'].shift() * .90
df['Indicator'] = np.where(df['Average'] > df['Previous Day'], 'High', 'Low')
df
Output:
Date Student Average Previous Day Indicator
0 2020-01-17 Alex 40 NaN Low
1 2020-01-18 Alex 50 36.0 High
2 2020-01-19 Alex 80 45.0 High
3 2020-01-20 Alex 70 72.0 Low
4 2020-01-17 Jeff 10 NaN Low
5 2020-01-18 Jeff 50 9.0 High
6 2020-01-19 Jeff 80 45.0 High
7 2020-01-20 Jeff 60 72.0 Low
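Note that sort_values('Date') only orders chronologically if Date is a real datetime. If it is stored as strings like '17 Jan 2020', parse it first (a small sketch, assuming that exact format):
df['Date'] = pd.to_datetime(df['Date'], format='%d %b %Y')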
I am new to Python 3.6 and I have been trying to solve an assignment with pandas, without any success.
My dataframe looks like this:
Index ID Time Account Key City County
0 10 2016-01-01 12:30 11 55 a NZ
1 2 2016-01-02 13:30 14 34 b AL
2 33 2016-01-03 11:20 4 55 a NZ
3 4 2016-01-01 14:30 11 40 b AL
4 18 2016-01-20 23:30 14 34 b AL
..
100 41 2016-03-20 13:50 11 55 a NZ
I want to identify that Accounts 11 and 14 are recurring and count them in different buckets in a new column (i.e. occurring with changes in Key and occurring without changes in Key), but I want 11 to be counted only once.
I want to calculate the time difference in hours between the first and second occurrence of Account 11 but to ignore all other occurrences of 11. The results should be placed in a new data frame with columns 'Account' and 'Time_diff'
Any ideas on how to proceed? I am using Spyder if that makes any difference =)
So for Q1 it would look like:
Index ID Time Account Key City County ChangeKey
0 10 2016-01-01 12:30 11 55 a NZ 0
1 2 2016-01-02 13:30 14 34 b AL 0
2 33 2016-01-03 11:20 4 55 a NZ 0
3 4 2016-01-01 14:30 11 40 b AL 1
4 18 2016-01-20 23:30 14 34 b AL 0
The key changes for account 11 but not account 14.
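One possible sketch for Q1 is to flag rows whose Key differs from the first Key seen for that Account (column names taken from the table above; this reproduces the ChangeKey column shown):
import pandas as pd

# 1 where the Key differs from the account's first observed Key, else 0
first_key = df.groupby('Account')['Key'].transform('first')
df['ChangeKey'] = (df['Key'] != first_key).astype(int)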
For Q2 the final result would look like:
Index Time Account Timediff
0 2016-01-01 12:30 11 0
1 2016-01-02 13:30 14 0
2 2016-01-03 11:20 4 NA
3 2016-01-01 14:30 11 2
4 2016-01-20 23:30 14 320
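And a possible sketch for Q2, producing a per-account frame with the gap in hours between each account's first and second occurrence (NaN for accounts seen only once; later occurrences are ignored):
import pandas as pd

s = df.assign(Time=pd.to_datetime(df['Time'])).sort_values('Time')
two = s.groupby('Account').head(2)  # keep each account's first two occurrences only
agg = two.groupby('Account')['Time'].agg(['first', 'last', 'size'])
hours = (agg['last'] - agg['first']).dt.total_seconds() / 3600
result = hours.where(agg['size'] > 1).rename('Time_diff').reset_index()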