I have a dataframe with a timestamp column, and I have to create a new column based on the result of an algorithm. The algorithm has to be applied to the current row and to all previous and following rows whose timestamps fall inside a fixed interval. So if, for example, the time interval is 1 hour, I need to select all the rows that are at most 1h before or 1h after the "current" row, apply the algorithm, and save the result in the new column, and do this for every row. In pseudocode:
df['new_column'] = algorithm(df[df['timestamp'] inside time window])
What I don't know is how to get the portion of the dataframe that is inside the time window.
I would guess there is a more efficient way; however, I haven't found one. This works:
from dateutil.relativedelta import relativedelta

def algorithm(timestamp, df):
    # Keep only the rows within one hour (before or after) of this timestamp.
    df = df[df['timestamp'] >= timestamp + relativedelta(hours=-1)]
    df = df[df['timestamp'] <= timestamp + relativedelta(hours=1)]
    # rest of algorithm
    return return_value
If you call this function with the following code, I believe you will get the result you expect:
df['new_column'] = df['timestamp'].apply(algorithm, df=df)
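For reference, here is a self-contained sketch of the same pattern (the value column and the mean are stand-ins for your real data and algorithm), using pd.Timedelta, which is equivalent to relativedelta for a fixed one-hour offset:

import pandas as pd

df = pd.DataFrame({
    'timestamp': pd.date_range('2021-01-01', periods=6, freq='30min'),
    'value': [1, 2, 3, 4, 5, 6],
})

def algorithm(timestamp, df):
    # Keep only the rows within one hour (before or after) of this timestamp.
    window = df[(df['timestamp'] >= timestamp - pd.Timedelta(hours=1)) &
                (df['timestamp'] <= timestamp + pd.Timedelta(hours=1))]
    return window['value'].mean()  # stand-in for the real algorithm

df['new_column'] = df['timestamp'].apply(algorithm, df=df)

Note that this is O(n^2), since every row scans the whole frame; for large frames a sorted-search or rolling approach will be much faster.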
The data I have is order creation and completion:
OrderID    time_created    time_completed    price
a          1/12/21         2/12/21           10
b          1/12/21         6/12/21           11
c          3/12/21         8/12/21           9
d          4/12/21         5/12/21           8
e          9/12/21         10/12/21          7
I am trying to do a rolling apply based on the time_created column, but applied to the time_completed column. Specifically, at the point of each new order creation, I am trying to filter for the previous x days of completed orders, and obtain distribution parameters from it (e.g. count, mean price, median price etc).
The most straightforward way that I can think of is to create a function that filters for the data of the relevant completed orders, then extracts the distribution parameters. This function would then be iteratively applied to each row in the dataframe. For example:
def get_params(ser):
    lookback = pd.Timedelta('2d')
    past_orders = df[(df['time_completed'] < ser['time_created']) &
                     (df['time_completed'] > ser['time_created'] - lookback)]
    mean = past_orders['price'].mean()
    perc25 = past_orders['price'].quantile(0.25)  # renamed: '25perc' is not a valid identifier
    count = past_orders.shape[0]
    return pd.Series([mean, perc25, count])

df.apply(get_params, axis=1)
However, the problem with this implementation is that it is too slow. Each row's result is highly related to the previous row's result, but this implementation does not take advantage of that.
The time_created column is sorted and is always earlier/smaller than the time_completed column, which is why I believe some form of rolling can be used.
My problem with pandas' rolling implementation is that I have not found out how to get it to reference one column (time_created) while rolling on another (time_completed). Is there any way to do that? Thanks.
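This isn't the rolling API, but one way to exploit that sortedness is a binary search per row rather than a full scan. A sketch, assuming the dates above are day/month/year:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'OrderID': list('abcde'),
    'time_created': pd.to_datetime(['2021-12-01', '2021-12-01', '2021-12-03',
                                    '2021-12-04', '2021-12-09']),
    'time_completed': pd.to_datetime(['2021-12-02', '2021-12-06', '2021-12-08',
                                      '2021-12-05', '2021-12-10']),
    'price': [10.0, 11.0, 9.0, 8.0, 7.0],
})
lookback = np.timedelta64(2, 'D')

# Sort completions once so each lookback window can be found by binary search.
completed = df.sort_values('time_completed')
comp_times = completed['time_completed'].to_numpy()
comp_prices = completed['price'].to_numpy()

# Bounds of (time_created - lookback, time_created) in the sorted completion
# times, matching the strict inequalities in get_params above.
created = df['time_created'].to_numpy()
lo = np.searchsorted(comp_times, created - lookback, side='right')
hi = np.searchsorted(comp_times, created, side='left')

# Prefix sums give O(1) counts and means per row; quantiles would still need
# a slice per row, e.g. np.quantile(comp_prices[l:h], 0.25).
csum = np.concatenate(([0.0], np.cumsum(comp_prices)))
counts = hi - lo
means = np.where(counts > 0, (csum[hi] - csum[lo]) / np.maximum(counts, 1), np.nan)
result = pd.DataFrame({'count': counts, 'mean': means}, index=df.index)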
The problem is that I am trying to take a specific row I choose and calculate what percentage its value deviates from the intended output's mean (which is already calculated from another column).
I want to run each item individually like so:
Below, I made a dataframe column to store the result:
df['pct difference'] = ((df['tertiary_tag']['price'] - df['ab roller']['mean'])/df['ab roller']['mean']) * 100
For example, let's say the mean is 10 and I know that the item is 8 dollars: I want to figure out what percentage away from the mean that product is (here, 20% below) and return that number for each item of the dataset.
Keep in mind, the problem is not solved by a loop; I am sure pandas has something more practical for calculating the % difference than pct_change.
I also thought I could make a column to serve as an index, so that I can access any row by that index and do whatever operation I want, for example calculating the percentage difference between two rows. Maybe by indexing on the price column?
df = df.set_index(['price'])
df.index = pd.to_datetime(df.index)
def percent_diff(df, row1, row2):
    """
    Calculate the percentage difference between two specific rows in a dataframe.
    """
    return (df.loc[row1, 'value'] - df.loc[row2, 'value']) / df.loc[row2, 'value'] * 100
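If the goal is just each item's deviation from one known mean, a vectorized expression avoids any loop entirely. A minimal sketch (the column name and the mean value are assumptions):

import pandas as pd

df = pd.DataFrame({'price': [8, 10, 12]})
mean_price = 10  # the precomputed mean, e.g. from the other column

# (8 - 10) / 10 * 100 == -20.0, i.e. 20% below the mean
df['pct_difference'] = (df['price'] - mean_price) / mean_price * 100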
I have a large amount of time-series sensor data in a pandas dataframe. The resolution of the data is one observation every 15 minutes, for 1 month, for 876 sensors.
The data has some daily seasonality and some faulty measurements in single sensors on about 50% of the observations.
I want to remove the seasonality, so my first attempt was:
df.diff(periods=96)
This does not work, because then I have an outlier on 2 days (the day with the actual faulty measurement and the day after).
Therefore I wrote this snippet of code which does what it should and works fine:
for index in df.index:
    for column in df.columns:
        # Subtract the mean of all observations that share this row's
        # position within the day (index modulo 96).
        df[column][index] = df[column][index] - (
            df[column][df.index % 96 == index % 96]).mean()
The problem is that this is incredibly slow.
Is there a way to achieve the same thing with a pandas function significantly faster?
Iterating over a DataFrame/Series should be your last resort; it's very slow. In this case, you can use groupby + transform to compute the mean of each season for all the columns, and then subtract it from your DataFrame in a vectorized way.
Based on your code, it seems that this should work:
period = 96
season_mean = df.groupby(df.index % period).transform('mean')
df -= season_mean
Or, if you prefer to do it in one step:
period = 96
df = df.groupby(df.index % period).transform(lambda g: g - g.mean())
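A quick toy check of the approach, assuming an integer index as in your code:

import numpy as np
import pandas as pd

period = 96
df = pd.DataFrame(np.random.randn(period * 4, 3))  # 4 "days", 3 sensors

deseasoned = df - df.groupby(df.index % period).transform('mean')

# Each seasonal slot now averages to ~0 in every column.
print(deseasoned.groupby(deseasoned.index % period).mean().abs().max())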
I have a pandas dataframe that includes time intervals that overlap at some points (figure 1). I need a dataframe with a time series that runs from the first start_time to the last end_time (figure 2).
I have to sum up the VIS values over the overlapping time intervals.
I couldn't figure it out. How can I do it?
This problem is easily solved with the python package staircase, which is built on pandas and numpy for the purposes of working with (mathematical) step functions.
Assume your original dataframe is called df and the times you want in your resulting dataframe are an array (or datetime index, or series etc) called times.
import staircase as sc
stepfunction = sc.Stairs(df, start="start_time", end="end_time", value="VIS")
result = stepfunction(times, include_index=True)
That's it; result is a pandas Series indexed by times, and it has the values you want. You can convert it to a dataframe in the format you want using the reset_index method on the Series.
You can generate your times data like this:
import pandas as pd
times = pd.date_range(df["start_time"].min(), df["end_time"].max(), freq="30min")
Why it works
Each row in your dataframe can be thought of as a step function. For example, the first row corresponds to a step function which starts with a value of zero, increases to a value of 10 at 2002-02-03 04:15:00, then returns to zero at 2002-02-04 04:45:00. When you sum up the step functions for all rows, you get one step function whose value at any point is the sum of all active VIS values. This is what has been assigned to the stepfunction variable above. The stepfunction variable is callable and returns the values of the step function at the points specified, which is what happens in the last line of the example, where the result variable is assigned.
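To make that concrete without staircase, here is a hand-rolled sketch of the same idea (not staircase's internals; the second interval is invented for illustration): each interval contributes +VIS at its start and -VIS at its end, and a cumulative sum recovers the summed step function.

import pandas as pd

df = pd.DataFrame({
    'start_time': pd.to_datetime(['2002-02-03 04:15', '2002-02-03 16:00']),
    'end_time': pd.to_datetime(['2002-02-04 04:45', '2002-02-03 20:00']),
    'VIS': [10, 5],
})

deltas = pd.concat([
    pd.Series(df['VIS'].to_numpy(), index=df['start_time']),
    pd.Series(-df['VIS'].to_numpy(), index=df['end_time']),
]).sort_index()

step = deltas.cumsum()  # summed step function value after each change point
print(step)  # 10 (first start), 15 (overlap), 10 (second end), 0 (first end)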
Note: I am the creator of staircase. Please feel free to reach out with feedback or questions if you have any.
If you paste your data instead of the images, I'd be able to test this. But this is how you may want to think about it. Assume your dataframe is called df.
import numpy as np
import pandas as pd

df['start_time'] = pd.to_datetime(df['start_time'])  # in case it's not datetime already
df.set_index('start_time', inplace=True)
new_dates = pd.date_range(start=min(df.index), end=max(df.end_time), freq='15Min')
new_df = df.reindex(new_dates, fill_value=np.nan)
As long as there are no duplicates in start_time, this should work. If there are, they'd need to be handled in some other way; one option is sketched below.
Resample is another possibility, but without the data, it's tough to say what would work.
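If duplicates do turn up, one possibility (an assumption; a different aggregation may suit your data better) is to collapse rows that share a start_time before the reindex in the snippet above:

# Collapse duplicate start_times by summing their VIS values, then reindex.
df = df.groupby(level=0).agg({'VIS': 'sum'})
new_df = df.reindex(new_dates, fill_value=np.nan)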
I have the following daily dataframe:
daily_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='D')
random_values = np.random.randint(1, 3,size=(len(daily_index), 1))
daily_df = pd.DataFrame(random_values, index=daily_index, columns=['A']).replace(1, np.nan)
I want to map each value to a dataframe where each day is expanded to multiple 1 minute intervals. The final DF looks like so:
intraday_index = pd.date_range(start='1/1/2015', end='1/01/2018', freq='1min')
intraday_df_full = daily_df.reindex(intraday_index)
# Choose random indices.
drop_indices = np.random.choice(intraday_df_full.index, 5000, replace=False)
intraday_df = intraday_df_full.drop(drop_indices)
In the final dataframe, each day is broken into 1 min intervals, but some are missing (so the minute count on each day is not the same). Some days have a value in the beginning of the day, but nan for the rest.
My question is, only for the days which start with some value in the first minute, how do I front fill for the rest of the day?
I initially tried to simply do daily_df.reindex(intraday_index, method='ffill', limit=1440), but since some rows are missing, this cannot work. Maybe there is a way to limit by time?
Following @Datanovice's comments, this line achieves the desired result:
intraday_df.groupby(intraday_df.index.date).transform('ffill')
where my groupby defines the desired groups on which we want to apply the operation, and transform does this without modifying the DataFrame's index.
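A small sketch of that behaviour on toy data (shape assumed from the question):

import numpy as np
import pandas as pd

idx = pd.to_datetime(['2015-01-01 00:00', '2015-01-01 00:01', '2015-01-01 00:02',
                      '2015-01-02 00:00', '2015-01-02 00:01'])
intraday_df = pd.DataFrame({'A': [2.0, np.nan, np.nan, np.nan, np.nan]}, index=idx)

filled = intraday_df.groupby(intraday_df.index.date).transform('ffill')
# Day 1, which starts with a value, is filled through 00:02; day 2 stays NaN.
print(filled)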