How to add a new column based on the above row's value - python

I have a dataframe as below. Initially it has three columns ('date', 'time', 'flag'). I want to add a column based on the flag and date: once flag=1 appears on a given day, the target is 1 for the rest of that day; otherwise the target is zero.
date time flag target
0 2017/4/10 10:00:00 0 0
1 2017/4/10 11:00:00 1 1
2 2017/4/10 12:00:00 0 1
3 2017/4/10 13:00:00 0 1
4 2017/4/10 14:00:00 0 1
5 2017/4/11 10:00:00 1 1
6 2017/4/11 11:00:00 0 1
7 2017/4/11 12:00:00 1 1
8 2017/4/11 13:00:00 1 1
9 2017/4/11 14:00:00 0 1
10 2017/4/12 10:00:00 0 0
11 2017/4/12 11:00:00 0 0
12 2017/4/12 12:00:00 0 0
13 2017/4/12 13:00:00 0 0
14 2017/4/12 14:00:00 0 0
15 2017/4/13 10:00:00 0 0
16 2017/4/13 11:00:00 1 1
17 2017/4/13 12:00:00 0 1
18 2017/4/13 13:00:00 1 1
19 2017/4/13 14:00:00 0 1
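For reproducibility, here is a minimal sketch that rebuilds the input frame (the 'target' column above is the desired output, so only 'date', 'time' and 'flag' are constructed; values transcribed from the table):
import pandas as pd

df = pd.DataFrame({
    'date': ['2017/4/10'] * 5 + ['2017/4/11'] * 5
          + ['2017/4/12'] * 5 + ['2017/4/13'] * 5,
    'time': ['10:00:00', '11:00:00', '12:00:00', '13:00:00', '14:00:00'] * 4,
    'flag': [0, 1, 0, 0, 0,
             1, 0, 1, 1, 0,
             0, 0, 0, 0, 0,
             0, 1, 0, 1, 0],
})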

Use DataFrameGroupBy.cumsum to get the cumulative sum of the flag values within each date, compare it with 0, and finally cast the boolean mask to integer:
df['new'] = (df.groupby('date')['flag'].cumsum() > 0).astype(int)
print (df)
date time flag target new
0 2017/4/10 10:00:00 0 0 0
1 2017/4/10 11:00:00 1 1 1
2 2017/4/10 12:00:00 0 1 1
3 2017/4/10 13:00:00 0 1 1
4 2017/4/10 14:00:00 0 1 1
5 2017/4/11 10:00:00 1 1 1
6 2017/4/11 11:00:00 0 1 1
7 2017/4/11 12:00:00 1 1 1
8 2017/4/11 13:00:00 1 1 1
9 2017/4/11 14:00:00 0 1 1
10 2017/4/12 10:00:00 0 0 0
11 2017/4/12 11:00:00 0 0 0
12 2017/4/12 12:00:00 0 0 0
13 2017/4/12 13:00:00 0 0 0
14 2017/4/12 14:00:00 0 0 0
15 2017/4/13 10:00:00 0 0 0
16 2017/4/13 11:00:00 1 1 1
17 2017/4/13 12:00:00 0 1 1
18 2017/4/13 13:00:00 1 1 1
19 2017/4/13 14:00:00 0 1 1
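Since flag is already 0/1, an equivalent shorter form is cummax, which skips the comparison and the cast (a sketch):
df['new'] = df.groupby('date')['flag'].cummax()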

Okay, I know that we've already found a solution here, but just to satisfy the nerd in me, here's an answer (not elegant, given how long it is) to avoid that nagging first-row flaw, where cumsum leaves the rows before a day's first flag=1 at zero:
pd.merge(df, df.groupby('date')['flag'].any().astype(int).to_frame().reset_index(), on='date')
The approach remains the same as jezrael's; the groupby is the key here. Instead of cumsum, which leads to the first-row flaw, any() fits really well into this solution. The only drawback is that it produces a Series, which we then need to coerce back into a DataFrame before joining the frames together on the date key.
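For what it's worth, a shorter route to the same whole-day flag is transform, which broadcasts the group-level any() back to the original index and avoids the Series coercion entirely (a sketch):
df['new'] = df.groupby('date')['flag'].transform('any').astype(int)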


Count time in 30-minute intervals in pandas

I got the following dataframe with two groups:
start_time        end_time          ID
10/10/2021 13:38  10/10/2021 14:30  A
31/10/2021 14:00  31/10/2021 15:00  A
21/10/2021 14:47  21/10/2021 15:30  B
23/10/2021 14:00  23/10/2021 15:30  B
I will ignore the date and only preserve the time for counting.
I would like to first create 30-minute intervals as rows for each group and then count, which should be similar to this:
start_interval  end_interval  count  ID
13:00           13:30         0      A
13:30           14:00         1      A
14:00           14:30         2      A
14:30           15:00         1      A
13:00           13:30         0      B
13:30           14:00         0      B
14:00           14:30         1      B
14:30           15:00         2      B
15:00           15:30         2      B
Use:
#floor all datetimes to 30 minute marks
f = lambda x: pd.to_datetime(x).dt.floor('30Min')
df[["start_time", "end_time"]] = df[["start_time", "end_time"]].apply(f)
#number of 30 minute slots between start and end
df['diff'] = df['end_time'].sub(df['start_time']).dt.total_seconds().div(1800).astype(int)
#keep only the time of day as a timedelta
df['start_time'] = df['start_time'].sub(df['start_time'].dt.floor('d'))
#repeat each row once per 30 minute slot
df = df.loc[df.index.repeat(df['diff'])]
df['start_time'] += pd.to_timedelta(df.groupby(level=0).cumcount().mul(30), unit='Min')
print (df)
start_time end_time ID diff
0 0 days 13:30:00 2021-10-10 14:30:00 A 2
0 0 days 14:00:00 2021-10-10 14:30:00 A 2
1 0 days 14:00:00 2021-10-31 15:00:00 A 2
1 0 days 14:30:00 2021-10-31 15:00:00 A 2
2 0 days 14:30:00 2021-10-21 15:30:00 B 2
2 0 days 15:00:00 2021-10-21 15:30:00 B 2
3 0 days 14:00:00 2021-10-23 15:30:00 B 3
3 0 days 14:30:00 2021-10-23 15:30:00 B 3
3 0 days 15:00:00 2021-10-23 15:30:00 B 3
#add a starting time for each ID - here 12:00
df1 = pd.DataFrame({'ID':df['ID'].unique(), 'start_time': pd.Timedelta(12, unit='H')})
print (df1)
ID start_time
0 A 0 days 12:00:00
1 B 0 days 12:00:00
df = pd.concat([df, df1])
#count per 30 minutes
df = df.set_index('start_time').groupby('ID').resample('30Min')['end_time'].count().reset_index(name='count')
#add end column
df['end_interval'] = df['start_time'] + pd.Timedelta(30, unit='Min')
df = df.rename(columns={'start_time':'start_interval'})[['start_interval','end_interval','count','ID']]
print (df)
start_interval end_interval count ID
0 0 days 12:00:00 0 days 12:30:00 0 A
1 0 days 12:30:00 0 days 13:00:00 0 A
2 0 days 13:00:00 0 days 13:30:00 0 A
3 0 days 13:30:00 0 days 14:00:00 1 A
4 0 days 14:00:00 0 days 14:30:00 2 A
5 0 days 14:30:00 0 days 15:00:00 1 A
6 0 days 12:00:00 0 days 12:30:00 0 B
7 0 days 12:30:00 0 days 13:00:00 0 B
8 0 days 13:00:00 0 days 13:30:00 0 B
9 0 days 13:30:00 0 days 14:00:00 0 B
10 0 days 14:00:00 0 days 14:30:00 1 B
11 0 days 14:30:00 0 days 15:00:00 2 B
12 0 days 15:00:00 0 days 15:30:00 2 B
EDIT:
def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return '{:02d}:{:02d}:{:02d}'.format(int(hours), int(minutes), int(seconds))
df[['start_interval','end_interval']] = df[['start_interval','end_interval']].applymap(f)
print (df)
start_interval end_interval count ID
0 12:00:00 12:30:00 0 A
1 12:30:00 13:00:00 0 A
2 13:00:00 13:30:00 0 A
3 13:30:00 14:00:00 1 A
4 14:00:00 14:30:00 2 A
5 14:30:00 15:00:00 1 A
6 12:00:00 12:30:00 0 B
7 12:30:00 13:00:00 0 B
8 13:00:00 13:30:00 0 B
9 13:30:00 14:00:00 0 B
10 14:00:00 14:30:00 1 B
11 14:30:00 15:00:00 2 B
12 15:00:00 15:30:00 2 B
The input dataframe has start and end times. The resulting dataframe is a series of timestamps with a 30-minute interval between them.
Here it is:
# Import libs
import pandas as pd
from datetime import timedelta
# Sample Dataframe
df = pd.DataFrame(
    [
        ["10/10/2021 13:40", "10/10/2021 14:30", "A"],
        ["31/10/2021 14:00", "31/10/2021 15:00", "A"],
        ["21/10/2021 14:40", "21/10/2021 15:30", "B"],
        ["23/10/2021 14:00", "23/10/2021 15:30", "B"],
    ],
    columns=["start_time", "end_time", "ID"],
)
# convert to timedelta
df[["start_time", "end_time"]] = df[["start_time", "end_time"]].apply(
lambda x: pd.to_datetime(x) - pd.to_datetime(x).dt.normalize()
)
# Extract seconds elapsed
df[["start_secs", "end_secs"]] = df[["start_time", "end_time"]].applymap(
lambda x: x.seconds
)
# OUTPUT
# start_time end_time ID start_secs end_secs
# 0 0 days 13:40:00 0 days 14:30:00 A 49200 52200
# 1 0 days 14:00:00 0 days 15:00:00 A 50400 54000
# 2 0 days 14:40:00 0 days 15:30:00 B 52800 55800
# 3 0 days 14:00:00 0 days 15:30:00 B 50400 55800
# Get rounded Min and Max time in secs of the dataframe
min_t = (df["start_secs"].min() // 3600) * 3600
max_t = (df["end_secs"].max() // 3600) * 3600 + 3600
# Create interval dataframe with 30min bins
interval_df = pd.DataFrame(
    map(lambda x: [x, x + 30 * 60], range(min_t, max_t, 30 * 60)),
    columns=["start_interval", "end_interval"],
)
# OUTPUT
# start_interval end_interval
# 0 46800 48600
# 1 48600 50400
# 2 50400 52200
# 3 52200 54000
# 4 54000 55800
# 5 55800 57600
# Check whether each bin overlaps an actual interval, then count overlapping intervals per ID
interval_df[["A", "B"]] = (
    df.groupby(["ID"])
    .apply(
        lambda x: x.apply(
            lambda y: ~(
                ((interval_df["end_interval"] - y["start_secs"]) <= 0)
                | ((interval_df["start_interval"] - y["end_secs"]) >= 0)
            ),
            axis=1,
        ).sum(axis=0)
    )
    .T
)
# OUTPUT
# start_interval end_interval A B
# 0 46800 48600 0 0
# 1 48600 50400 1 0
# 2 50400 52200 2 1
# 3 52200 54000 1 2
# 4 54000 55800 0 2
# 5 55800 57600 0 0
# Convert seconds to time
interval_df[["start_interval", "end_interval"]] = interval_df[
["start_interval", "end_interval"]
].applymap(lambda x: str(timedelta(seconds=x)))
# Stack counts of A and B into one single column
interval_df.melt(["start_interval", "end_interval"])
# OUTPUT
# start_interval end_interval variable value
# 0 13:00:00 13:30:00 A 0
# 1 13:30:00 14:00:00 A 1
# 2 14:00:00 14:30:00 A 2
# 3 14:30:00 15:00:00 A 1
# 4 15:00:00 15:30:00 A 0
# 5 15:30:00 16:00:00 A 0
# 6 13:00:00 13:30:00 B 0
# 7 13:30:00 14:00:00 B 0
# 8 14:00:00 14:30:00 B 1
# 9 14:30:00 15:00:00 B 2
# 10 15:00:00 15:30:00 B 2
# 11 15:30:00 16:00:00 B 0
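If you want the column names to match the desired output, melt accepts var_name and value_name (a sketch):
# Name the melted columns to match the desired output
interval_df.melt(["start_interval", "end_interval"], var_name="ID", value_name="count")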

How to get the time difference for specific rows in one column using Python

Here I have a dataset with time and three inputs, and I calculate the time difference using pandas.
The code is:
data['Time_different'] = pd.to_timedelta(data['time'].astype(str)).diff(-1).dt.total_seconds().div(60)
This reads the time difference between each pair of adjacent rows. But I want to write code that finds the time difference only for the specific rows that have X3 values.
I tried to write it using a for loop, but it's not working properly. Can we write the code without a for loop?
As you can see, I have three inputs: X1, X2, X3. With that code I get the time difference across all rows of X1, X2, X3.
What I want is the time difference only between the X3 inputs which have values.
time X3
6:00:00 0
7:00:00 2
8:00:00 0
9:00:00 50
10:00:00 0
11:00:00 0
12:00:00 0
13:45:00 0
15:00:00 0
16:00:00 0
17:00:00 0
18:00:00 0
19:00:00 20
Then I want to skip the times where X3 is 0 and read only the time difference between the non-zero X3 values:
time X3
7:00:00 2
9:00:00 50
So the time difference is 2 hrs.
Then second:
9:00:00 50
19:00:00 20
Then the time difference is 10 hrs.
Likewise I want to write the code for my whole column. Can anyone help me solve this?
When I tried, I got negative values for the time difference.
You can try to:
1. Find the rows where X3 is different from 0
2. Compute the difference in hours using shift
3. Update the dataframe using join
Full example:
data = """time X3
6:00:00 0
7:00:00 2
8:00:00 0
9:00:00 50
10:00:00 0
11:00:00 0
12:00:00 0
13:45:00 0
15:00:00 0
16:00:00 0
17:00:00 0
18:00:00 0
19:00:00 20"""
# Build dataframe from example
df = pd.read_csv(StringIO(data), sep=r'\s{1,}')
df['X1'] = np.random.randint(0,10,len(df)) # Add random values for "X1" column
df['X2'] = np.random.randint(0,10,len(df)) # Add random values for "X2" column
# Convert the time column to datetime object
df.time = pd.to_datetime(df.time, format="%H:%M:%S")
print(df)
# time X3 X1 X2
# 0 1900-01-01 06:00:00 0 5 4
# 1 1900-01-01 07:00:00 2 7 1
# 2 1900-01-01 08:00:00 0 2 8
# 3 1900-01-01 09:00:00 50 1 0
# 4 1900-01-01 10:00:00 0 3 9
# 5 1900-01-01 11:00:00 0 8 4
# 6 1900-01-01 12:00:00 0 0 2
# 7 1900-01-01 13:45:00 0 5 0
# 8 1900-01-01 15:00:00 0 5 7
# 9 1900-01-01 16:00:00 0 0 8
# 10 1900-01-01 17:00:00 0 6 7
# 11 1900-01-01 18:00:00 0 1 5
# 12 1900-01-01 19:00:00 20 4 7
# Compute difference
sub_df = df[df.X3 != 0]
out_values = (sub_df.time.dt.hour - sub_df.shift().time.dt.hour) \
    .to_frame() \
    .fillna(sub_df.time.dt.hour.iloc[0]) \
    .rename(columns={'time': 'out'})  # Rename column
print(out_values)
# out
# 1 7.0
# 3 2.0
# 12 10.0
df = df.join(out_values) # Add out values
print(df)
# time X3 X1 X2 out
# 0 1900-01-01 06:00:00 0 2 9 NaN
# 1 1900-01-01 07:00:00 2 7 4 7.0
# 2 1900-01-01 08:00:00 0 6 6 NaN
# 3 1900-01-01 09:00:00 50 9 1 2.0
# 4 1900-01-01 10:00:00 0 2 9 NaN
# 5 1900-01-01 11:00:00 0 5 3 NaN
# 6 1900-01-01 12:00:00 0 6 4 NaN
# 7 1900-01-01 13:45:00 0 9 3 NaN
# 8 1900-01-01 15:00:00 0 3 0 NaN
# 9 1900-01-01 16:00:00 0 1 8 NaN
# 10 1900-01-01 17:00:00 0 7 5 NaN
# 11 1900-01-01 18:00:00 0 6 7 NaN
# 12 1900-01-01 19:00:00 20 1 5 10.0
Here I use .fillna(sub_df.time.dt.hour.iloc[0]) to replace the first value (which is NaN after the shift, since there is nothing to subtract) with its own hour. You can define your own rule for the value in fillna().
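Note that subtracting dt.hour ignores the minutes (a row at 13:45:00 would lose its 45 minutes); a variant that diffs the full timestamps avoids this (a sketch, leaving the first row as NaN instead of filling it):
mask = df.X3 != 0
df.loc[mask, 'out'] = df.loc[mask, 'time'].diff().dt.total_seconds().div(3600)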

How to add missing rows in a dataframe by comparing values in Python

Hi, I am using Python pandas for dataframes. I have data something like the following:
Employee-ID Time-slot Calls-received Prod-sold
1 14:30:00 10 1
1 15:00:00 15 3
1 15:30:00 10 2
1 16:00:00 8 2
1 16:30:00 10 0
2 14:30:00 10 2
2 15:00:00 15 3
2 16:30:00 10 2
2 17:00:00 10 0
I have 10,000 employees and ideally there should be 16 time slots for each employee, but some slots are missing for some employees; for example, employee 2's 15:30:00 and 16:00:00 slots are missing. I wish to add new rows with the missing time slots and zero values for 'Calls-received' and 'Prod-sold', something like this:
2 14:30:00 10 2
2 15:00:00 15 3
2 15:30:00 0 0
2 16:00:00 0 0
2 16:30:00 10 2
2 17:00:00 10 0
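One way to approach this (a sketch, assuming the frame is named df with the columns shown above, and that every one of the 16 slots occurs at least once somewhere in the data; otherwise build the slot list by hand) is to reindex against the full Employee-ID x Time-slot grid and fill the gaps with zeros:
import pandas as pd

# All slots seen anywhere in the data (HH:MM:SS strings sort chronologically)
slots = sorted(df['Time-slot'].unique())
full_index = pd.MultiIndex.from_product(
    [df['Employee-ID'].unique(), slots],
    names=['Employee-ID', 'Time-slot'])
# Reindex against the full employee x slot grid, zero-filling the new rows
df = (df.set_index(['Employee-ID', 'Time-slot'])
        .reindex(full_index, fill_value=0)
        .reset_index())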

Aggregate 15-minute timestamps to the hour and find sum, avg and max for multiple columns in pandas

I have a dataframe with PERIOD_START_TIME at every 15 minutes, and I need to aggregate to 1 hour and calculate sum, avg and max for almost every column in the dataframe (it has about 20 columns):
PERIOD_START_TIME ID val1 val2
06.21.2017 22:15:00 12 3 0
06.21.2017 22:30:00 12 5 6
06.21.2017 22:45:00 12 0 3
06.21.2017 23:00:00 12 5 2
...
06.21.2017 22:15:00 15 9 2
06.21.2017 22:30:00 15 0 2
06.21.2017 22:45:00 15 1 5
06.21.2017 23:00:00 15 0 1
...
Desired output:
PERIOD_START_TIME ID val1(avg) val1(sum) val1(max) ...
06.21.2017 22:00:00 12 3.25 13 5
...
06.21.2017 23:00:00 15 2.25 10 9 ...
And for columns val2 too, and for every other column in dataframe.
I have no idea how to group by period start time for every hour rather than for the whole day, and no idea how to start.
I believe you need Series.dt.floor to round down to hours and then aggregate by agg:
df = df.groupby([df['PERIOD_START_TIME'].dt.floor('H'),'ID']).agg(['mean','sum', 'max'])
#for columns from MultiIndex
df.columns = df.columns.map('_'.join)
print (df)
val1_mean val1_sum val1_max val2_mean val2_sum \
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 2.666667 8 5 3 9
15 3.333333 10 9 3 9
2017-06-21 23:00:00 12 5.000000 5 5 2 2
15 0.000000 0 0 1 1
val2_max
PERIOD_START_TIME ID
2017-06-21 22:00:00 12 6
15 5
2017-06-21 23:00:00 12 2
15 1
df = df.reset_index()
print (df)
PERIOD_START_TIME ID val1_mean val1_sum val1_max val2_mean val2_sum \
0 2017-06-21 22:00 12 2.666667 8 5 3 9
1 2017-06-21 22:00 15 3.333333 10 9 3 9
2 2017-06-21 23:00 12 5.000000 5 5 2 2
3 2017-06-21 23:00 15 0.000000 0 0 1 1
val2_max
0 6
1 5
2 2
3 1
Very similarly you can convert PERIOD_START_TIME to a pandas Period.
df['PERIOD_START_TIME'] = df['PERIOD_START_TIME'].dt.to_period('H')
df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean']).reset_index()
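As with the floor approach above, the resulting MultiIndex columns can be flattened the same way, as long as the join happens before reset_index (a sketch):
df = df.groupby(['PERIOD_START_TIME', 'ID']).agg(['max', 'min', 'mean'])
df.columns = df.columns.map('_'.join)
df = df.reset_index()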

Python/Pandas filter out unique rows from DataFrames

I have two or three DataFrames that have duplicated rows.
In [31]: df1
Out[31]:
member time
0 0 2009-09-30 12:00:00
1 0 2009-09-30 18:00:00
2 0 2009-10-01 00:00:00
3 1 2009-09-30 12:00:00
4 1 2009-09-30 18:00:00
5 2 2009-09-30 12:00:00
6 3 2009-09-30 12:00:00
...
In [32]: df2
Out[32]:
member time
0 0 2009-09-30 12:00:00
1 0 2009-09-30 18:00:00
3 1 2009-09-30 12:00:00
4 2 2009-09-30 12:00:00
5 2 2009-09-30 18:00:00
6 2 2009-10-01 00:00:00
...
I'd like to filter out the rows that have unique values of 'member' and 'time' from df1 and df2, and get a DataFrame that has only the rows whose 'member' and 'time' values appear in both df1 and df2, that is:
In [33]: df_duplicated_1_and_2
Out[33]:
member time
0 0 2009-09-30 12:00:00
1 0 2009-09-30 18:00:00
3 1 2009-09-30 12:00:00
4 2 2009-09-30 12:00:00
...
Is there an efficient and elegant way to do this?
Update: If possible, I'd like to get not a new merged DataFrame but a filtered DataFrame, e.g.:
In [34]: df1
Out[34]:
member time value
0 0 2009-09-30 12:00:00 a
1 0 2009-09-30 18:00:00 b
2 0 2009-10-01 00:00:00 c
3 1 2009-09-30 12:00:00 d
4 1 2009-09-30 18:00:00 e
5 2 2009-09-30 12:00:00 f
6 3 2009-09-30 12:00:00 g
...
In [35]: df1_filtered_out
Out[35]:
member time value
0 0 2009-09-30 12:00:00 a
1 0 2009-09-30 18:00:00 b
3 1 2009-09-30 12:00:00 d
5 2 2009-09-30 12:00:00 f
...
and also get filtered df2.
Do an inner join on the member and time columns:
>>> df1.merge(df2, on=['member', 'time'], how='inner')
member time
0 0 2009-09-30 12:00:00
1 0 2009-09-30 18:00:00
2 1 2009-09-30 12:00:00
3 2 2009-09-30 12:00:00
This will produce a result that has only the rows that have the same member and time values in both DataFrames.
Update:
>>> df1.merge(df2[['member', 'time']])
member time value
0 0 2009-09-30 12:00:00 a
1 0 2009-09-30 18:00:00 b
2 1 2009-09-30 12:00:00 d
3 2 2009-09-30 12:00:00 f
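And to get the filtered df2 symmetrically, swap the two frames (a sketch):
>>> df2.merge(df1[['member', 'time']])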
