How to make this row-wise operation performant (python)?

My issue is very simple, but I just can't wrap my head around it:
I have two dataframes:
A time series dataframe with two columns: Timestamp and DataValue
A time interval dataframe with start and end timestamps and a label
What I want to do:
Add a third column to the time series that holds the label given by the time interval dataframe.
Every time point needs to have an assigned label, as designated by the time interval dataframe.
This code works:
TimeSeries_labelled = TimeSeries.copy(deep=True)
TimeSeries_labelled["State"] = 0
for index in Timeintervals_States.index:
    for entry in TimeSeries_labelled.index:
        if Timeintervals_States.loc[index, "start"] <= TimeSeries_labelled.loc[entry, "Timestamp"] <= Timeintervals_States.loc[index, "end"]:
            TimeSeries_labelled.loc[entry, "State"] = Timeintervals_States.loc[index, "state"]
But it is really slow. I tried to make it shorter and faster with Python's built-in filtering, but failed miserably.
Please help!

I don't really know about TimeSeries, but with a dataframe containing timestamps as datetime objects you could use something like the following:
import pandas as pd
#Create the third column in the target dataframe
df_timeseries['label'] = pd.Series('',index=df_timeseries.index)
#Loop over the dataframe containing start and end timestamps
for index, row in df_start_end.iterrows():
    #Create a boolean mask to filter data
    mask = (df_timeseries['timestamp'] > row['start']) & (df_timeseries['timestamp'] < row['end'])
    df_timeseries.loc[mask, 'label'] = row['label']
For each row of the dataframe containing start and end timestamps, this assigns that row's label to the rows of your time series dataframe that match the mask condition.
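If the intervals never overlap, a vectorized alternative worth trying (just a sketch, not tested on your data) is to build a pandas IntervalIndex from the start/end columns and map every timestamp to its containing interval in a single call, instead of looping over the intervals:
import pandas as pd

#Intervals closed on both sides, matching the <= ... <= condition of the original loop
intervals = pd.IntervalIndex.from_arrays(Timeintervals_States["start"],
                                         Timeintervals_States["end"],
                                         closed="both")
#For each timestamp, the position of the interval containing it (-1 means no match)
pos = intervals.get_indexer(TimeSeries_labelled["Timestamp"])
matched = pos != -1
TimeSeries_labelled.loc[matched, "State"] = Timeintervals_States["state"].to_numpy()[pos[matched]]
Note that get_indexer requires the intervals to be non-overlapping; if they can overlap, the masking loop above is the safer choice.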

Related

Python pandas.datetimeindex piecewise dataframe slicing

I have a dataframe with a pandas DatetimeIndex. I need to take many slices from it (for printing a piecewise graph with matplotlib). In other words, I need a new DF which would be a subset of the first one.
More precisely, I need to take all rows that are between 9 and 16 o'clock, but only if they are within a date range. Fortunately I have only one date range and one time range to apply.
What would be a clean way to do that? Thanks
The first step is to set the index of the dataframe to the column where you store time. Once the index is based on time, you can subset the dataframe easily.
df['time'] = pd.to_datetime(df['time'], format='%H:%M:%S.%f') # assuming you have a col called 'time'
df['time'] = df['time'].dt.strftime('%H:%M:%S.%f')
df.set_index('time', inplace=True)
new_df = df[startTime:endTime] # startTime and endTime are strings
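Another option (a sketch with made-up date bounds, assuming the original DatetimeIndex is kept and is sorted) is to work directly on the DatetimeIndex instead of converting to strings: slice the date range on the index, then keep only the rows between 9 and 16 o'clock with between_time:
#Slice the date range first (the dates here are placeholders for your actual range)
subset = df.loc['2014-01-06':'2014-01-10']
#Then keep only the rows between 09:00 and 16:00
subset = subset.between_time('09:00', '16:00')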

pandas get column values using UTC index

I have a pandas Dataframe with an index using UTC time and a column with data (in the example the column "value_1").
My question is: how could I create a new column in which each value is the value of the first column but 20 seconds later? Using the example below, the first value of this second column would be the value at "2011-01-01 00:00:20".
import pandas as pd
import numpy as np
data_1 = pd.DataFrame(index=pd.date_range('1/1/2011', periods = 1000, freq ='S'))
data_1['value_1'] = 100 + np.random.randint(0,1000,size=(1000, 1))
data_1['value_2'] = ??¿¿
I don't know if it would be possible if I change the index to a different format.
I have seen that pandas has some useful functionality for working with time series, but I have not found anything that solves this problem yet.
Thank you in advance.
You can either use shift with the number of rows corresponding to the seconds you want (here 20, since the data has one row per second):
data_1['value_2'] = data_1['value_1'].shift(-20)
or you can reindex with the index + 20s and take the values with to_numpy:
data_1['value_2'] = data_1['value_1'].reindex(data_1['value_1'].index+pd.Timedelta(seconds=20)).to_numpy()
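As a quick sanity check (illustrative only), the first entry of value_2 should equal value_1 twenty seconds in:
#First timestamp is 2011-01-01 00:00:00, so value_2 there should be value_1 at 00:00:20
assert data_1['value_2'].iloc[0] == data_1['value_1'].loc[pd.Timestamp('2011-01-01 00:00:20')]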

Pandas - New Row for Each Day in Date Range

I have a Pandas df with one column (Reservation_Dt_Start) representing the start of a date range and another (Reservation_Dt_End) representing the end of a date range.
Rather than each row having a date range, I'd like to expand each row to have as many records as there are dates in the date range, with each new row representing one of those dates.
See the two pics below for an example input and the desired output.
The code snippet below works!! However, for every 250 rows in the input table, it takes 1 second to run. Given my input table is 120,000,000 rows in size, this code will take about one week to run.
pd.concat([pd.DataFrame({'Book_Dt': row.Book_Dt,
                         'Day_Of_Reservation': pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End),
                         'Pickup': row.Pickup,
                         'Dropoff': row.Dropoff,
                         'Price': row.Price},
                        columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
           for i, row in df.iterrows()], ignore_index=True)
There has to be a faster way to do this. Any ideas? Thanks!
pd.concat in a loop with a large dataset gets pretty slow, as it makes a copy of the frame each time and returns a new dataframe, and you are attempting to do this 120m times. I would work with this data as a simple list of tuples instead, then convert to a dataframe at the end.
e.g.
Given a list list = []
For each row in the dataframe:
get the list of dates in the range (you can still use pd.date_range here) and store it in a variable dates, which is a list of dates
for each date in the date range, add a tuple to the list: list.append((row.Book_Dt, dates[i], row.Pickup, row.Dropoff, row.Price))
Finally you can convert the list of tuples to a dataframe:
df = pd.DataFrame(list, columns = ['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])
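Put together, the approach might look roughly like this (a sketch; itertuples is used instead of iterrows for speed, and the list is named rows so it doesn't shadow the built-in list):
import pandas as pd

rows = []
for row in df.itertuples():
    #Expand each reservation into one tuple per day in its date range
    for d in pd.date_range(row.Reservation_Dt_Start, row.Reservation_Dt_End):
        rows.append((row.Book_Dt, d, row.Pickup, row.Dropoff, row.Price))

expanded = pd.DataFrame(rows, columns=['Book_Dt', 'Day_Of_Reservation', 'Pickup', 'Dropoff', 'Price'])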

Filter by datetime and update dataframe based on other dataframe datetime

I just have started learning pandas, so I am only at the beginning of the road. :)
The situation :
I have two dataframes (df1 and df2).
df1 contains multiple sensor data of a machine. The sensors transmit data every minute. I set the index of df1 to datetime format (this is the date and time when the sensors sent the data).
df2 contains the data of one production unit, meaning the unit id number (named 'Sarzs' in the dataframe), the datetimes when the process started and ended, and the output quality of that particular production unit. The dataframe does not yet contain the production-unit data for each particular time (you can see that the column "Sarzs_no" is set to NaN at this stage). The start and stop dates and times of the production unit are stored in the "Start" and "Stop" columns and are in datetime format.
The problem:
I would like to iterate through the rows of df1 and the rows of df2, check whether each df1 row falls within (or between) the "Start" and "Stop" times in df2, and if so, update the df1['Sarzs_no'] value with the df2['Output'] value.
The progress so far:
So far I have written the code below:
for i in range(0, len(df2.index)):
    for j in range(0, len(df1.index)):
        print(df1.index)
and I have two questions basically:
How to actually write the filtering code and do the update?
Isn't there (there should be, I guess) a better way to do the filtering than iterating through all the rows in both dataframes, which seems very time-consuming and therefore inefficient to me?
Thank you in advance for your help.
With dataframes containing timestamps as datetime objects, you could use something like the following:
#Loop over the dataframe containing start and end timestamps
for index, row in df2.iterrows():
    #Create a boolean mask to filter data
    mask = (df1.index > row['Start']) & (df1.index < row['Stop'])
    df1.loc[mask, 'Sarzs_no'] = row['Output']
For each row of the dataframe containing start and end timestamps, this assigns that row's Output value to the df1 rows that match the mask condition.
The loc accessor selects the rows that match the condition, and the iterrows function creates an iterator that moves through your dataframe row by row.
EDIT
As you have a datetime index, you can just use :
df1[row['Start']:row['Stop']]
instead of .loc() to get the rows you need to update
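For example, the updated loop might look like this (a sketch, assuming df1's DatetimeIndex is sorted so that datetime slicing works; .loc is kept for the assignment itself to avoid chained indexing):
for index, row in df2.iterrows():
    #Slice df1 directly on its DatetimeIndex and write the Output value
    df1.loc[row['Start']:row['Stop'], 'Sarzs_no'] = row['Output']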

pandas GroupBy on the index and find the max

I have a large dataframe (around 35k entries). The index of this data frame is composed of dates (like 2014-02-12), and these dates are not unique. What I need to do is find the max value for each date and create a new data frame with it. I created a solution that works (it is down below), but it takes a lot of time to process. Does anyone know a faster way to do this? Thank you.
#Creates an empty dataframe
dataset0514maxrec = pd.DataFrame(columns=dataset0514max.columns.values)
dataset0514maxrec.index.name = 'Date'
#Gets the unique values, finds the groups, recovers the max value and appends it
for i in dataset0514max.index.unique():
    tempDF1 = dataset0514max.loc[dataset0514max.index.isin([i])]
    tempDF2 = tempDF1[tempDF1['Data_Value'] == tempDF1['Data_Value'].max()]
    dataset0514maxrec = dataset0514maxrec.append(tempDF2.head(1))
print(dataset0514maxrec)
groupby with levels
df.groupby(level=0).Data_Value.max().reset_index()
The next two options require the index to be a datetime index. If it
isn't, convert it:
df.index = pd.to_datetime(df.index)
resample
df.resample('D').max()
sort_values + duplicated
df = df.sort_values('Data_Value')
m = ~df.index.duplicated(keep='last')  #after the ascending sort, the last occurrence per date holds the max
df = df[m]
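A toy illustration of the groupby approach (made-up data, just to show the shape of the result):
import pandas as pd

df = pd.DataFrame({'Data_Value': [5, 9, 3, 7]},
                  index=pd.to_datetime(['2014-02-12', '2014-02-12',
                                        '2014-02-13', '2014-02-13']))
#One row per date with its max Data_Value
print(df.groupby(level=0).Data_Value.max().reset_index())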
