Python/Pandas: extract intervals from a large dataframe

I have two pandas DataFrames:
20 million rows of continuous time series data with a DatetimeIndex (df)
20 thousand rows with two timestamps each (df_seq)
I want to use the second DataFrame to extract all sequences from the first (all rows of the first that lie between the two timestamps of each row of the second). Each sequence then needs to be transposed into 990 columns, and all sequences combined into a new DataFrame.
So the new DataFrame has one row with 990 columns for each sequence (the case row gets added later).
Right now my code looks like this:
sequences = pd.DataFrame()

for row in df_seq.itertuples(index=True, name='Pandas'):
    sequences = sequences.append(df.loc[row.date:row.end_date].reset_index(drop=True)[:990].transpose())

sequences = sequences.reset_index(drop=True)
This code works, but it is terribly slow: 20-25 minutes execution time.
Is there a way to rewrite this in vectorised operations? Or any other way to improve the performance of this code?

Here's a way to do it. The large dataframe is 'df', and the intervals one is called 'intervals':
inx = pd.date_range(start="2020-01-01", freq="1s", periods=1000)
df = pd.DataFrame(range(len(inx)), index=inx)
df.index.name = "timestamp"
intervals = pd.DataFrame([("2020-01-01 00:00:12","2020-01-01 00:00:18"),
("2020-01-01 00:01:20","2020-01-01 00:02:03")],
columns=["start_time", "end_time"])
intervals.start_time = pd.to_datetime(intervals.start_time)
intervals.end_time = pd.to_datetime(intervals.end_time)
intervals
# For each timestamp in df, attach the closest interval start at or before it...
t = pd.merge_asof(df.reset_index(), intervals[["start_time"]], left_on="timestamp", right_on="start_time")
# ...and the closest interval end at or after it.
t = pd.merge_asof(t, intervals[["end_time"]], left_on="timestamp", right_on="end_time", direction="forward")
# Keep only the rows that actually fall inside an interval.
t = t[(t.timestamp >= t.start_time) & (t.timestamp <= t.end_time)]
The result is:
timestamp 0 start_time end_time
12 2020-01-01 00:00:12 12 2020-01-01 00:00:12 2020-01-01 00:00:18
13 2020-01-01 00:00:13 13 2020-01-01 00:00:12 2020-01-01 00:00:18
14 2020-01-01 00:00:14 14 2020-01-01 00:00:12 2020-01-01 00:00:18
15 2020-01-01 00:00:15 15 2020-01-01 00:00:12 2020-01-01 00:00:18
16 2020-01-01 00:00:16 16 2020-01-01 00:00:12 2020-01-01 00:00:18
.. ... ... ... ...
119 2020-01-01 00:01:59 119 2020-01-01 00:01:20 2020-01-01 00:02:03
120 2020-01-01 00:02:00 120 2020-01-01 00:01:20 2020-01-01 00:02:03
121 2020-01-01 00:02:01 121 2020-01-01 00:01:20 2020-01-01 00:02:03
122 2020-01-01 00:02:02 122 2020-01-01 00:01:20 2020-01-01 00:02:03
123 2020-01-01 00:02:03 123 2020-01-01 00:01:20 2020-01-01 00:02:03

After the steps from the answer above I added a groupby and an unstack, and the result is exactly the DataFrame I need.
Execution time is ~30 seconds!
The full code looks now like this:
sequences = pd.merge_asof(df, df_seq[["date"]], left_on="timestamp", right_on="date", )
sequences = pd.merge_asof(sequences, df_seq[["end_date"]], left_on="timestamp", right_on="end_date", direction="forward")
sequences = sequences[(sequences.timestamp >= sequences.date) & (sequences.timestamp <= sequences.end_date)]
sequences = sequences.groupby('date')['feature_1'].apply(lambda df_temp: df_temp.reset_index(drop=True)).unstack().loc[:,:990]
sequences = sequences.reset_index(drop=True)
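For reference, a possibly faster alternative to the groupby/apply/unstack line would be to number the rows within each sequence with cumcount and then pivot to wide form. This is only a sketch: it assumes seq_long is the long-format frame right after the merge/filter steps (before the unstack), and it reuses the column names date and feature_1 from the code above; the 990-column cutoff is kept:
seq_long["pos"] = seq_long.groupby("date").cumcount()        # row position within each sequence
wide = (seq_long[seq_long["pos"] < 990]                      # keep at most 990 rows per sequence
        .pivot(index="date", columns="pos", values="feature_1")
        .reset_index(drop=True))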

Related

How do I get difference between two dates but also have a way to sum up hours by day?

I'm a little new to Python and I'm not sure where to start.
I have a pandas DataFrame that contains shift information like the below:
EmployeeID ShiftType BeginDate EndDate
10 Holiday 2020-01-01 21:00:00 2020-01-02 07:00:00
10 Regular 2020-01-02 21:00:00 2020-01-03 07:00:00
10 Regular 2020-01-03 21:00:00 2020-01-04 07:00:00
10 Regular 2020-01-04 21:00:00 2020-01-05 07:00:00
20 Regular 2020-02-01 09:00:00 2020-02-01 17:00:00
20 Regular 2020-02-02 09:00:00 2020-02-02 17:00:00
20 Regular 2020-02-03 09:00:00 2020-02-03 17:00:00
20 Regular 2020-02-04 09:00:00 2020-02-04 17:00:00
I'd like to be able to break each shift down and summarize hours worked each day.
The desired output is below:
EmployeeID ShiftType Date HoursWorked
10 Holiday 2020-01-01 3
10 Regular 2020-01-02 10
10 Regular 2020-01-03 10
10 Regular 2020-01-04 10
10 Regular 2020-01-05 7
20 Regular 2020-02-01 10
20 Regular 2020-02-02 10
20 Regular 2020-02-03 10
20 Regular 2020-02-04 10
I know how to get the work hours like below. This does get me hours for each shift, but I would like to be able to break out each calendar day and hours for that day. Retaining 'ShiftType' is not that important here.
df['HoursWorked'] = (pd.to_datetime(df['EndDate']) - pd.to_datetime(df['BeginDate'])).dt.total_seconds() / 3600
Any suggestion would be appreciated.
You could calculate the working hours for each date (BeginDate and EndDate) separately, which gives you three pd.Series: two for the parts of the night shifts that fall on different dates and one for the day shifts that stay within a single date. Use the corresponding date as the index for those Series.
# make sure dtype is correct
df["BeginDate"] = pd.to_datetime(df["BeginDate"])
df["EndDate"] = pd.to_datetime(df["EndDate"])

# get a mask where there is a night shift; end date = start date + 1
m = df["BeginDate"].dt.date != df["EndDate"].dt.date

# extract the night shifts...
s0 = pd.Series((df["BeginDate"][m].dt.ceil('d') - df["BeginDate"][m]).values,
               index=df["BeginDate"][m].dt.floor('d'))
s1 = pd.Series((df["EndDate"][m] - df["EndDate"][m].dt.floor('d')).values,
               index=df["EndDate"][m].dt.floor('d'))
# ...and the day shifts
s2 = pd.Series((df["EndDate"][~m] - df["BeginDate"][~m]).values,
               index=df["BeginDate"][~m].dt.floor('d'))
Now concat and sum them, giving you a pd.Series with the working hours for each date:
working_hours = pd.concat([s0, s1, s2], axis=1).sum(axis=1)
# working_hours
# 2020-01-01 0 days 03:00:00
# 2020-01-02 0 days 10:00:00
# 2020-01-03 0 days 10:00:00
# ...
# Freq: D, dtype: timedelta64[ns]
To join with your original df, you can reindex that and add the working_hours:
new_index = pd.concat([df["BeginDate"].dt.floor('d'),
df["EndDate"].dt.floor('d')]).drop_duplicates()
df_out = df.set_index(df["BeginDate"].dt.floor('d')).reindex(new_index, method='nearest')
df_out.index = df_out.index.set_names('date')
df_out = df_out.drop(['BeginDate', 'EndDate'], axis=1).sort_index()
df_out['HoursWorked'] = working_hours.dt.total_seconds()/3600
# df_out
# EmployeeID ShiftType HoursWorked
# date
# 2020-01-01 10 Holiday 3.0
# 2020-01-02 10 Regular 10.0
# 2020-01-03 10 Regular 10.0
# 2020-01-04 10 Regular 10.0
# 2020-01-05 10 Regular 7.0
# 2020-02-01 20 Regular 8.0
# 2020-02-02 20 Regular 8.0
# 2020-02-03 20 Regular 8.0
# 2020-02-04 20 Regular 8.0

Groupby ID, sort by Time, and divide last by first

I have the following dataframe:
ID ..... Quantity Time
54 100 2020-01-01 00:00:04
55 100 2020-01-01 00:00:04
54 88 2020-01-01 00:00:05
54 66 2020-01-01 00:00:06
55 100 2020-01-01 00:00:07
55 88 2020-01-01 00:00:07
I would like to group the dataframe (sorted by time!) by ID and then take the quantity of the last row and divide it by the first row per ID.
The result should look like this:
ID ..... Quantity Time Result
54 100 2020-01-01 00:00:04
54 88 2020-01-01 00:00:05
54 66 2020-01-01 00:00:06 0.66
55 100 2020-01-01 00:00:04
55 100 2020-01-01 00:00:07
55 88 2020-01-01 00:00:07 0.88
So far I used the following code to get the first and the last row for every ID.
g = df.sort_values(by=['Time']).groupby('ID')
df_new = (pd.concat([g.head(1), g.tail(1)])
            .sort_values(by='ID')
            .reset_index(drop=True))
and then I used the following code to get the Result of the division:
df_new['Result'] = df_new['Quantity'].iloc[1::2].div(df_new['Quantity'].shift())
The problem is that the dataframe does not stay sorted by time. It is really important that I take the (timewise) last quantity per ID and divide it by the (timewise) first quantity per ID.
Thanks for any hints where I need to change the code!
The ID values do not come in pairs but in triples, so first convert the column to datetime if necessary with to_datetime, then sort by both columns with DataFrame.sort_values, and finally divide the last Quantity per ID by the first:
df['Time'] = pd.to_datetime(df['Time'])
df = df.sort_values(['ID','Time'])
first = df.groupby('ID')['Quantity'].transform('first')
df['Result'] = df.drop_duplicates('ID', keep='last')['Quantity'].div(first)
print (df)
ID Quantity Time Result
0 54 100 2020-01-01 00:00:04 NaN
2 54 88 2020-01-01 00:00:05 NaN
3 54 66 2020-01-01 00:00:06 0.66
1 55 100 2020-01-01 00:00:04 NaN
4 55 100 2020-01-01 00:00:07 NaN
5 55 88 2020-01-01 00:00:07 0.88
Convert Time to datetime:
df["Time"] = pd.to_datetime(df["Time"])
Sort on ID and Time:
df = df.sort_values(["ID", "Time"])
Group on ID:
grouping = df.groupby("ID").Quantity
Get results for the division of last by first:
result = grouping.last().div(grouping.first()).array
Now, you can assign the results back to the original dataframe:
df.loc[df.Quantity.eq(grouping.transform("last")), "Result"] = result
df
ID Quantity Time Result
0 54 100 2020-01-01 00:00:04 NaN
2 54 88 2020-01-01 00:00:05 NaN
3 54 66 2020-01-01 00:00:06 0.66
1 55 100 2020-01-01 00:00:04 NaN
4 55 100 2020-01-01 00:00:07 NaN
5 55 88 2020-01-01 00:00:07 0.88
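For completeness, a more compact variant is sketched below (it assumes df is already sorted by ID and Time, as in both answers above): compute last divided by first per ID with groupby.agg, then write the ratio only onto the last row of each ID.
ratio = df.groupby('ID')['Quantity'].agg(lambda s: s.iloc[-1] / s.iloc[0])  # last / first per ID
is_last = ~df.duplicated('ID', keep='last')                                 # True only on the last row of each ID
df.loc[is_last, 'Result'] = df.loc[is_last, 'ID'].map(ratio)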

replace values greater than 0 in a range of time in pandas dataframe

I have a large csv file in which I want to replace values with zero in a particular range of time. For example, between 20:00:00 and 05:00:00 I want to replace all values greater than zero with 0. How do I do it?
df = pd.read_csv('108e.csv')  # reading the data set (named columns assumed)
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
for i in df.set_index('timeStamp').between_time('20:00:00', '05:00:00')['luminosity']:
    if i > 0:
        df[['luminosity']] = df[['luminosity']].replace({i: 0})
You can use the function select from numpy.
import numpy as np
t = df['timeStamp'].dt.time  # the 20:00-05:00 window wraps past midnight, hence the "or"
mask = (t >= pd.to_datetime('20:00:00').time()) | (t <= pd.to_datetime('05:00:00').time())
df['luminosity'] = np.select([mask & (df['luminosity'] > 0)], [0], default=df['luminosity'])
See the NumPy documentation for np.select for more examples.
Assume that your DataFrame contains:
timeStamp luminosity
0 2020-01-02 18:00:00 10
1 2020-01-02 20:00:00 11
2 2020-01-02 22:00:00 12
3 2020-01-03 02:00:00 13
4 2020-01-03 05:00:00 14
5 2020-01-03 07:00:00 15
6 2020-01-03 18:00:00 16
7 2020-01-03 20:10:00 17
8 2020-01-03 22:10:00 18
9 2020-01-04 02:10:00 19
10 2020-01-04 05:00:00 20
11 2020-01-04 05:10:00 21
12 2020-01-04 07:00:00 22
To only retrieve rows in the time range of interest you could run:
df.set_index('timeStamp').between_time('20:00' , '05:00')
But if you attempted to modify these data, e.g.
df = df.set_index('timeStamp')
df.between_time('20:00' , '05:00')['luminosity'] = 0
you would get SettingWithCopyWarning. The reason is that this function
returns a view of the original data.
To circumvent this limitation, you can use indexer_between_time on the index of a DataFrame, which returns a NumPy array with the integer locations of the rows meeting your time-range criterion.
To update the underlying data (setting the index only to get the row positions), you can run:
df.iloc[df.set_index('timeStamp').index
          .indexer_between_time('20:00', '05:00'), 1] = 0
Note that to keep the code short, I passed the int location of the column
of interest.
Access by iloc should be quite fast.
When you print the df again, the result is:
timeStamp luminosity
0 2020-01-02 18:00:00 10
1 2020-01-02 20:00:00 0
2 2020-01-02 22:00:00 0
3 2020-01-03 02:00:00 0
4 2020-01-03 05:00:00 0
5 2020-01-03 07:00:00 15
6 2020-01-03 18:00:00 16
7 2020-01-03 20:10:00 0
8 2020-01-03 22:10:00 0
9 2020-01-04 02:10:00 0
10 2020-01-04 05:00:00 0
11 2020-01-04 05:10:00 21
12 2020-01-04 07:00:00 22

How can I get different statistics for a rolling datetime range up to a current value in a pandas dataframe?

I have a dataframe that has four different columns and looks like the table below:
index_example | column_a | column_b | column_c | datetime_column
1 A 1,000 1 2020-01-01 11:00:00
2 A 2,000 2 2019-11-01 10:00:00
3 A 5,000 3 2019-12-01 08:00:00
4 B 1,000 4 2020-01-01 05:00:00
5 B 6,000 5 2019-01-01 01:00:00
6 B 7,000 6 2019-04-01 11:00:00
7 A 8,000 7 2019-11-30 07:00:00
8 B 500 8 2020-01-01 05:00:00
9 B 1,000 9 2020-01-01 03:00:00
10 B 2,000 10 2020-01-01 02:00:00
11 A 1,000 11 2019-05-02 01:00:00
Purpose:
For each row, get different rolling statistics for column_b based on a window of time in datetime_column defined as the last N months. The window of time to look at, however, is restricted to rows with the same value in column_a.
Code example using a for loop which is not feasible given the size:
mean_dict = {}
for index, value in enumerate(df.datetime_column):
    test_date = value
    test_column_a = df.column_a[index]
    subset_df = df[(df.datetime_column < test_date) &
                   (df.datetime_column >= test_date - timedelta(days=180)) &
                   (df.column_a == test_column_a)]
    mean_dict[index] = subset_df.column_b.mean()
For example for row #1:
Target date = 2020-01-01 11:00:00
Target value in column_a = A
Date Range: from 2019-07-01 11:00:00 to 2020-01-01 11:00:00
Average would be the mean of rows 2,3,7
If I wanted average for row #2 then it would be:
Target date = 2019-11-01 10:00:00
Target value in column_a = A
Date Range: from 2019-05-01 10:00 to 2019-11-01 10:00:00
Average would be the mean of row 11
and so on...
I cannot use the grouper since in reality I do not have dates but datetimes.
Has anyone encountered this before?
Thanks!
EDIT
The dataframe is big (~2M rows), which means that looping is not an option. I already tried looping and creating a subset based on conditional values, but it takes too long.
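One vectorized direction worth trying (a sketch only, with assumptions: column_b is numeric, the window is the trailing 180 days excluding the current row, and the result column name is just illustrative) is a time-based rolling window per column_a group:
df = df.sort_values(['column_a', 'datetime_column']).reset_index(drop=True)
rolling_mean = (df.set_index('datetime_column')
                  .groupby('column_a')['column_b']
                  .rolling('180D', closed='left')   # trailing 180 days, current row excluded
                  .mean()
                  .reset_index(drop=True))          # same row order as the sorted frame
df['column_b_rolling_mean'] = rolling_mean
Other statistics can be obtained by swapping .mean() for .sum(), .std(), and so on.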

Pandas - Compute data from one column to another

Considering the following dataframe:
df = pd.read_json("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}""")
df.tail(6)
week extra_hours extra_hours_over
12 3 01:00:00 NaN
13 3 02:00:00 NaN
14 3 01:00:00 NaN
15 3 02:00:00 NaN
16 3 00:00:00 NaN
17 3 00:00:00 NaN
Now, in every week the maximum amount of extra_hours is 4h, meaning I have to subtract 30-minute blocks from the extra_hours column and move them into the extra_hours_over column, so that in every week the total sum of extra_hours is at most 4h.
So, given the example dataframe, a possible solution (for week 3) would be like this:
week extra_hours extra_hours_over
12 3 01:00:00 00:00:00
13 3 01:30:00 00:30:00
14 3 00:30:00 00:30:00
15 3 01:00:00 01:00:00
16 3 00:00:00 00:00:00
17 3 00:00:00 00:00:00
I would need to aggregate the total extra_hours per week, check on which days it passes 4h, and then randomly subtract half-hour chunks.
What would be the easiest/most direct way to achieve this?
Here goes one attempt at what you seem to be asking. The idea is simple, although the code is fairly verbose:
1) Create some helper variables (minutes, extra_minutes, total for the week).
2) Loop for as long as any week's total is above 240 minutes, working each iteration on the subset of rows in those weeks.
3) In the loop, use np.random.choice to select a row to remove 30 minutes from.
4) Apply the changes to minutes and extra_minutes.
The code:
df = pd.read_json("""{"week":{"0":1,"1":1,"2":1,"3":1,"4":1,"5":1,"6":2,"7":2,"8":2,"9":2,"10":2,"11":2,"12":3,"13":3,"14":3,"15":3,"16":3,"17":3},"extra_hours":{"0":"01:00:00","1":"00:00:00","2":"01:00:00","3":"01:00:00","4":"00:00:00","5":"01:00:00","6":"01:00:00","7":"01:00:00","8":"01:00:00","9":"01:00:00","10":"00:00:00","11":"01:00:00","12":"01:00:00","13":"02:00:00","14":"01:00:00","15":"02:00:00","16":"00:00:00","17":"00:00:00"},"extra_hours_over":{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null}}""")
import numpy as np

df['minutes'] = pd.DatetimeIndex(df['extra_hours']).hour * 60 + pd.DatetimeIndex(df['extra_hours']).minute
df['extra_minutes'] = 0
df['tot_time'] = df.groupby('week')['minutes'].transform('sum')

while not df[df['tot_time'] > 240].empty:
    # in every week still above 240 minutes, randomly pick one row that has at least 30 minutes left
    mask = df[(df['minutes'] >= 30) & (df['tot_time'] > 240)].groupby('week').apply(lambda x: np.random.choice(x.index)).values
    df.loc[mask, 'minutes'] -= 30
    df.loc[mask, 'extra_minutes'] += 30
    df['tot_time'] = df.groupby('week')['minutes'].transform('sum')
df['extra_hours_over'] = df['extra_minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df['extra_hours'] = df['minutes'].apply(lambda x: pd.Timedelta(minutes=x))
df.drop(['minutes','extra_minutes'], axis=1).tail(6)
Out[1]:
week extra_hours extra_hours_over tot_time
12 3 00:30:00 00:30:00 240
13 3 01:30:00 00:30:00 240
14 3 00:30:00 00:30:00 240
15 3 01:30:00 00:30:00 240
16 3 00:00:00 00:00:00 240
17 3 00:00:00 00:00:00 240
Note: because I am using np.random.choice, the same observation can be picked in more than one iteration, which will make that observation change by more than one 30-minute chunk.
