Add new column with time only in Dataframe Pandas - python

So I have this DataFrame that has been scraped from a website. What I want to achieve is to add a new column containing only the time, as HH:mm, each time it scrapes new data.
But the code I use gives both the date and the time:
import pandas as pd

data = {'Name': ['Tom', 'nick', 'krish', 'jack', 'Rick', 'John'],
        'Age': [20, 21, 19, 18, 26, 23]}
df = pd.DataFrame(data)
df['time'] = pd.date_range('11/5/2020', periods=6, freq='2H')
df
Data I have:
Name Age time
0 Tom 20 2020-11-05 01:00:00
1 nick 21 2020-11-05 01:00:00
2 krish 19 2020-11-05 01:00:00
3 jack 18 2020-11-05 01:00:00
4 Rick 26 2020-11-05 01:00:00
5 John 23 2020-11-05 01:00:00
Result I want
Name Age time
0 Tom 20 01:00
1 nick 21 01:00
2 krish 19 01:00
3 jack 18 01:00
4 Rick 26 01:00
5 John 23 01:00
As the data is scraped every 15 minutes, I want to append the new data with the time, without having to add the header again, like below:
Name Age time
0 Tom 20 01:00
1 nick 21 01:00
2 krish 19 01:00
3 jack 18 01:00
4 Rick 26 01:00
5 John 23 01:00
0 Tom 20 01:15
1 nick 21 01:15
2 krish 19 01:15
3 jack 18 01:15
4 Rick 26 01:15
5 John 23 01:15

If I understand correctly, here is what you want:
lst_date_range = pd.date_range('11/5/2020', periods=6, freq='2H')
df['time'] = [date_time.strftime("%H:%M") for date_time in lst_date_range]
Using .strftime("%H:%M") lets you keep only the hour and minute.
From
Name Age time
0 Tom 20 2020-11-05 01:00:00
To
Name Age time
0 Tom 20 01:00
But I do not understand what you mean by appending without adding the header again; please explain more if you need help with that part.
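If "appending without the header" means writing each scrape to the same CSV file, one common pattern is to write the header only when the file does not exist yet. A minimal sketch (the file name data.csv and the function name are placeholders):
import os
import pandas as pd

def append_snapshot(df, path='data.csv'):
    # write the header only on the first run; later runs append rows only
    df.to_csv(path, mode='a', index=False, header=not os.path.exists(path))
Each 15-minute scrape can then call append_snapshot(df) after adding the time column.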

Related

Grouping and summing the value for every 5 min / resampling the data for 5 min with string values

I want to sum the value of each gender for every 5 min timestamp.
Main Table:-
Time Gender value
10:01 Male 5
10:02 Female 1
10:03 Male 5
10:04 Male 5
10:05 Female 1
10:06 Female 1
10:07 Male 5
10:08 Male 5
10:09 Male 5
10:10 Male 5
Required Result:-
Time Gender value
10:00 Male 15
10:00 Female 2
10:05 Male 20
10:05 Female 1
You could convert to Timedelta, floor the result, and use it for groupby + agg:
t = pd.to_timedelta(df['Time'] + ':00')

(df
 .groupby([t.dt.floor('5min'), 'Gender'])
 .agg({'value': 'sum'})
 .reset_index()
)
output:
Time Gender value
0 0 days 10:00:00 Female 1
1 0 days 10:00:00 Male 15
2 0 days 10:05:00 Female 2
3 0 days 10:05:00 Male 15
4 0 days 10:10:00 Male 5
Matching the provided output
To match your provided output, it needs two more things:
- subtracting one minute, so that an exact '00:05:00' floors to '00:00:00'
- converting back to a string
t = pd.to_timedelta(df['Time'] + ':00').sub(pd.to_timedelta('1min'))

(df
 .groupby([t.dt.floor('5min'), 'Gender'])
 .agg({'value': 'sum'})
 .reset_index()
 .assign(Time=lambda d: (pd.to_datetime(0) + d['Time']).dt.strftime('%H:%M'))
)
output:
Time Gender value
0 10:00 Female 2
1 10:00 Male 15
2 10:05 Female 1
3 10:05 Male 20
Variant:
t = pd.to_timedelta(df['Time'] + ':00').sub(pd.to_timedelta('1min'))

(df.assign(Time=t.dt.floor('5min').astype(str).str[-8:-3])
   .groupby(['Time', 'Gender'])
   ['value'].sum().reset_index()
)
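A further sketch of the same idea, binning with pd.Grouper on a parsed datetime instead of flooring a Timedelta (the one-minute shift is the same trick as above):
# parse the HH:MM strings (the date part defaults to 1900-01-01) and shift by one minute
dt = pd.to_datetime(df['Time'], format='%H:%M') - pd.Timedelta('1min')

(df.assign(Time=dt)
   .groupby([pd.Grouper(key='Time', freq='5min'), 'Gender'])
   ['value'].sum().reset_index()
   .assign(Time=lambda d: d['Time'].dt.strftime('%H:%M')))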

Change index in a pandas dataframe and add additional time column

I have a pandas dataframe that currently has no specific index (so when printing, an automatic index is created which begins with 0). Now I would like to have a "timeslot" index that begins with 1 and an additional "time of the day" column in the dataframe. Here you can see a screenshot of how the output CSV should look. Can you tell me how to do this?
Try with pd.date_range:
df['time of day'] = pd.date_range('1970-1-1', periods=len(df), freq='H') \
                      .strftime('%H:%M')
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(1, 50, (30, 2)), columns=['Column 1', 'Column 2'])
df.insert(0, 'time of day', pd.date_range('1970-1-1', periods=len(df), freq='H').strftime('%H:%M'))
df.index.name = 'timeslot'
df.index += 1
print(df)
# Output:
time of day Column 1 Column 2
timeslot
1 00:00 43 33
2 01:00 20 11
3 02:00 40 10
4 03:00 19 28
5 04:00 10 27
6 05:00 27 10
7 06:00 1 10
8 07:00 33 36
9 08:00 32 2
10 09:00 23 32
11 10:00 1 17
12 11:00 48 42
13 12:00 21 3
14 13:00 48 28
15 14:00 41 46
16 15:00 48 43
17 16:00 47 6
18 17:00 33 21
19 18:00 38 19
20 19:00 17 40
21 20:00 8 24
22 21:00 28 22
23 22:00 2 13
24 23:00 24 3
25 00:00 4 1
26 01:00 8 9
27 02:00 19 36
28 03:00 30 36
29 04:00 43 39
30 05:00 43 3
Assuming your dataframe is df:
df['time of day'] = df.index.astype(str).str.rjust(2, '0')+':00'
df.index += 1
If there are more than 24 rows:
df['time of day'] = (df.index%24).astype(str).str.rjust(2, '0')+':00'
df.index += 1
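If the goal is the CSV from the screenshot, writing the frame with its named index keeps timeslot as the first column; a small sketch (the file name is a placeholder, and it assumes the index was named 'timeslot' as in the first answer):
df.to_csv('output.csv')  # the index name becomes the first header cell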

How to sum the amount for all rows where the current row's close date falls between another row's open and close dates, for each sales rep

I have a dataframe with "close_date", "open_date", "amount", "sales_rep".
sales_rep  open_date(MM/DD/YYYY)  close_date  amount
Jim        1/01/2021              2/05/2021   3
Jim        1/15/2021              4/06/2021   26
Jim        2/01/2021              2/06/2021   7
Jim        2/15/2021              3/14/2021   12
Jim        3/01/2021              4/22/2021   13
Jim        3/15/2021              3/29/2021   5
Jim        4/01/2021              4/20/2021   17
Bob        1/01/2021              1/12/2021   23
Bob        1/15/2021              2/16/2021   12
Bob        2/01/2021              3/04/2021   4
Bob        2/15/2021              4/05/2021   23
Bob        3/01/2021              3/24/2021   12
Bob        3/15/2021              4/15/2021   7
Bob        4/01/2021              5/01/2021   20
I want to create a column that tells me the open amount. So if we take the second row, we can see that the opp was closed on 04/06/2021. I want to know how many open opps there were before that date. So I would check whether the open date of row 5 is before the close date of 4/06/2021 and whether the close date of row 5 is after 04/06/2021. In this case it is, so I would add that amount to the sum. I also want the current row's value to be included in the sum. This should be done for each sales rep in the dataframe. I have filled in the table with the expected values below.
sales_rep  open_date(MM/DD/YYYY)  close_date  amount  open_amount_sum
Jim        1/01/2021              2/05/2021   3       36
Jim        1/15/2021              4/06/2021   26      56
Jim        2/01/2021              2/06/2021   7       33
Jim        2/15/2021              3/14/2021   12      51
Jim        3/01/2021              4/22/2021   13      13
Jim        3/15/2021              3/29/2021   5       44
Jim        4/01/2021              4/20/2021   17      30
Bob        1/01/2021              1/12/2021   23      23
Bob        1/15/2021              2/16/2021   12      39
Bob        2/01/2021              3/04/2021   4       39
Bob        2/15/2021              4/05/2021   23      50
Bob        3/01/2021              3/24/2021   12      42
Bob        3/15/2021              4/15/2021   7       27
Bob        4/01/2021              5/01/2021   20      20
(For the first row: 36 = 3 + 26 + 7, because 26 and 7 are the only other values that fit the condition, and 3 is the value of the row itself.)
Edit: @RJ's solution from the comments is better. Here it is, formatted slightly differently:
# note: open_date and close_date must be datetimes (via pd.to_datetime) so the comparisons are chronological
df['open_amount_sum'] = df.apply(
    lambda x: df[
        df['sales_rep'].eq(x['sales_rep']) &
        df['open_date'].le(x['close_date']) &
        df['close_date'].ge(x['close_date'])
    ]['amount'].sum(),
    axis=1,
)
Here is a solution, but it is slow and kind of ugly. It can definitely be improved:
import pandas as pd
import io

df = pd.read_csv(io.StringIO(
"""
sales_rep,open_date,close_date,amount
Jim,1/01/2021,2/05/2021,3
Jim,1/15/2021,4/06/2021,26
Jim,2/01/2021,2/06/2021,7
Jim,2/15/2021,3/14/2021,12
Jim,3/01/2021,4/22/2021,13
Jim,3/15/2021,3/29/2021,5
Jim,4/01/2021,4/20/2021,17
Bob,1/01/2021,1/12/2021,23
Bob,1/15/2021,2/16/2021,12
Bob,2/01/2021,3/04/2021,4
Bob,2/15/2021,4/05/2021,23
Bob,3/01/2021,3/24/2021,12
Bob,3/15/2021,4/15/2021,7
Bob,4/01/2021,5/01/2021,20
"""
))
# parse the date strings so the comparisons below are chronological, not lexicographic
df['open_date'] = pd.to_datetime(df['open_date'])
df['close_date'] = pd.to_datetime(df['close_date'])

sum_df = df.groupby('sales_rep').apply(
    lambda g: g['close_date'].apply(
        lambda close: g.loc[
            g['open_date'].le(close) & g['close_date'].ge(close),
            'amount'
        ].sum())
).reset_index(level=0)
df['close_sum'] = sum_df['close_date']
df
Merge the dataframe onto itself, then filter before grouping:
(df
 .merge(df, on='sales_rep')
 .query('open_date_y <= close_date_x <= close_date_y')
 .loc(axis=1)['sales_rep', 'open_date_x', 'close_date_x', 'amount_x', 'amount_y']
 .rename(columns=lambda col: col.removesuffix('_x'))
 .rename(columns={'amount_y': 'open_sum_amount'})
 .groupby(['sales_rep', 'open_date', 'close_date', 'amount'],
          sort=False, as_index=False)
 .sum()
)
sales_rep open_date close_date amount open_sum_amount
0 Jim 2021-01-01 2021-02-05 3 36
1 Jim 2021-01-15 2021-04-06 26 56
2 Jim 2021-02-01 2021-02-06 7 33
3 Jim 2021-02-15 2021-03-14 12 51
4 Jim 2021-03-01 2021-04-22 13 13
5 Jim 2021-03-15 2021-03-29 5 44
6 Jim 2021-04-01 2021-04-20 17 30
7 Bob 2021-01-01 2021-01-12 23 23
8 Bob 2021-01-15 2021-02-16 12 39
9 Bob 2021-02-01 2021-03-04 4 39
10 Bob 2021-02-15 2021-04-05 23 50
11 Bob 2021-03-01 2021-03-24 12 42
12 Bob 2021-03-15 2021-04-15 7 27
13 Bob 2021-04-01 2021-05-01 20 20

Compare averages of values corresponding to 2 different dates?

I have a table like this:
Date Student Average(for that date)
17 Jan 2020 Alex 40
18 Jan 2020 Alex 50
19 Jan 2020 Alex 80
20 Jan 2020 Alex 70
17 Jan 2020 Jeff 10
18 Jan 2020 Jeff 50
19 Jan 2020 Jeff 80
20 Jan 2020 Jeff 60
I want to add a column for high and low. The logic for that column should be: the value is High as long as the student's average score for today's date is greater than 90% of the previous day's score.
My comparison would look something like this:
avg(score, current date) > avg(score, previous day) * 90 / 100
I can't figure out how to incorporate the date part into the formula, so that it compares the average of the current day to the average of the previous day.
I am working with Pandas, so I was wondering if there is a way to do this with it.
IIUC,
import numpy as np

df['Previous Day'] = df.sort_values('Date').groupby('Student')['Average'].shift() * 0.90
df['Indicator'] = np.where(df['Average'] > df['Previous Day'], 'High', 'Low')
df
Output:
Date Student Average Previous Day Indicator
0 2020-01-17 Alex 40 NaN Low
1 2020-01-18 Alex 50 36.0 High
2 2020-01-19 Alex 80 45.0 High
3 2020-01-20 Alex 70 72.0 Low
4 2020-01-17 Jeff 10 NaN Low
5 2020-01-18 Jeff 50 9.0 High
6 2020-01-19 Jeff 80 45.0 High
7 2020-01-20 Jeff 60 72.0 Low
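Note that on each student's first day 'Previous Day' is NaN, the comparison evaluates to False, and the row is labeled 'Low'. If the first day should be left unmarked instead, a small variant with np.select (a sketch, with 'N/A' as an arbitrary placeholder label) handles it:
df['Indicator'] = np.select(
    [df['Previous Day'].isna(), df['Average'] > df['Previous Day']],
    ['N/A', 'High'],
    default='Low',
)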

How do I find the first duplicate value in a data frame based on time stamp in Python 3.x?

I am new to Python 3.6 and I have been trying to solve an assignment without any success using Pandas.
My dataframe looks like this:
Index ID Time Account Key City County
0 10 2016-01-01 12:30 11 55 a NZ
1 2 2016-01-02 13:30 14 34 b AL
2 33 2016-01-03 11:20 4 55 a NZ
3 4 2016-01-01 14:30 11 40 b AL
4 18 2016-01-20 23:30 14 34 b AL
..
100 41 2016-03-20 13:50 11 55 a NZ
I want to identify that Accounts 11 and 14 are recurring and to count them in different buckets in a new column (i.e. occurring with a change in Key and occurring without a change in Key), but I want 11 to be counted only once.
I want to calculate the time difference in hours between the first and second occurrence of Account 11 but to ignore all other occurrences of 11. The results should be placed in a new data frame with columns 'Account' and 'Time_diff'
Any ideas on how to proceed? I am using Spyder if that makes any difference =)
So for Q1 it would look like:
Index ID Time Account Key City County ChangeKey
0 10 2016-01-01 12:30 11 55 a NZ 0
1 2 2016-01-02 13:30 14 34 b AL 0
2 33 2016-01-03 11:20 4 55 a NZ 0
3 4 2016-01-01 14:30 11 40 b AL 1
4 18 2016-01-20 23:30 14 34 b AL 0
The key changes for account 11 but not account 14.
For Q2 the final result would look like
Index Time Account Timediff
0 2016-01-01 12:30 11 0
1 2016-01-02 13:30 14 0
2 2016-01-03 11:20 4 NA
3 2016-01-01 14:30 11 2
4 2016-01-20 23:30 14 320
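A sketch of one way to approach Q2, assuming the Time column is parsed with pd.to_datetime (one possible approach, not necessarily the assignment's intended one):
import pandas as pd

df['Time'] = pd.to_datetime(df['Time'])
# keep only each Account's first two occurrences, in time order
first_two = df.sort_values('Time').groupby('Account').head(2)
# hours between the first and second occurrence; NA for accounts seen only once
time_diff = (first_two.groupby('Account')['Time']
             .apply(lambda s: (s.max() - s.min()).total_seconds() / 3600
                    if len(s) > 1 else pd.NA)
             .rename('Time_diff')
             .reset_index())
For Account 11 this gives 2.0 hours (12:30 to 14:30 on 2016-01-01), matching the expected Timediff of 2.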
