I have a dataset of store locations with event dates (the date all stock was sold from that store) and quantities of the items sold, such as the following:
import numpy as np, pandas as pd
# Dates
start = pd.Timestamp("2014-02-26")
end = pd.Timestamp("2014-09-24")
# Generate some data
N = 1000
quantA = np.random.randint(10, 500, N)
quantB = np.random.randint(50, 250, N)
sell = np.random.randint(start.value, end.value, N)
sell = pd.to_datetime(np.array(sell, dtype="datetime64[ns]"))
df = pd.DataFrame({"sell_date": sell, "quantityA":quantA, "quantityB":quantB})
df.index = df.sell_date
I would like to create a new time-series DataFrame with weekly summaries (or daily ones, or summaries over a custom date_range object) of quantities A and B.
I can generate a week number and aggregate sales based on it, like so:
df['week'] = df.sell_date.dt.week
df.pivot_table(values = ['quantityA', 'quantityB'], index = 'week', aggfunc = [np.sum, len])
But I don't see how to do the following:
expand this out to a full time series (based on a date_range object, such as period_range = pd.date_range(start = start, end = end, freq='7D')),
include the original date (as a 'week starting' variable), instead of integer week number, or
change the date variable to be the index of this new dataframe.
I'm not sure if this is what you want but you can try
df.set_index('sell_date', inplace=True)
resampled = df.resample('7D', [sum, len])
The resulting index might not be exactly what you want as it starts with the earliest datetime correct to the nanosecond. You could replace with datetimes which have 00:00:00 in the time by doing
resampled.index = pd.to_datetime(resampled.index.date)
EDIT:
You can actually just do
resampled = df.resample('W', [sum, len])
And the resulting index is exactly what you want. Interestingly, passing 'D' also gives the index you would expect, but passing a multiple like '2D' results in the 'ugly' index, that is, one starting at the earliest timestamp correct to the nanosecond and increasing in multiples of exactly 2 days. I guess the lesson is to stick to single units like 'D', 'W', 'M' where possible.
EDIT:
The API for resampling changed at some point such that the above no longer works. Instead one can do:
resampled = df.resample('W').agg([sum, len])
.resample now returns a Resampler object which exposes methods, much like the groupby API.
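For the 'week starting' part of the original question, a minimal sketch (the 'W-WED' anchor is my assumption; pick whichever weekday your weeks should start on) is to label each bin by its left edge:
# Weekly totals and row counts, with the index holding the *start* of each week.
# label='left'/closed='left' make each bin [start, start + 7 days), labelled
# by its first day instead of its last.
resampled = (df[['quantityA', 'quantityB']]
             .resample('W-WED', label='left', closed='left')
             .agg(['sum', 'size']))
print(resampled.head())
The resulting DatetimeIndex can then serve directly as the index of the new frame, covering the asker's second and third bullets.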
Regularly I run into the problem that I have time series data that I want to interpolate and resample at given times. I have a solution, but it feels too labor intensive, i.e. I suspect there is a simpler way. Have a look at how I currently do it here: https://gist.github.com/cs224/012f393d5ced6931ae223e6ddc4fe6b2 (or the nicer version via nbviewer here: https://nbviewer.org/gist/cs224/012f393d5ced6931ae223e6ddc4fe6b2)
Perhaps a motivating example: I fill up my car about every two weeks, and I have the cost data of every refill. Now I would like to know the cumulative sum on a daily basis, where the day values fall at midnight and are interpolated.
Currently I create a new empty data frame that contains the time points at which I want to have my resampled values:
df_sampling = pd.DataFrame(index=pd.date_range(start, end, freq=freq))
And then either use pd.merge:
ldf = pd.merge(df_in, df_sampling, left_index=True, right_index=True, how='outer')
or pd.concat:
ldf = pd.concat([df_in, df_sampling], axis=1)
to create a combined time series that has the additional time points in the index. Based on that I can then interpolate (DataFrame.interpolate) and sub-select all the index values given by df_sampling. See the gist for details, or the condensed sketch below.
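For reference, a condensed reconstruction of that approach (my sketch under assumed names, not the gist verbatim):
import pandas as pd

def generate_interpolated_time_series(df_in, freq='D', method='merge'):
    # Sampling grid at the desired frequency, spanning the observed range
    df_sampling = pd.DataFrame(index=pd.date_range(df_in.index[0].floor(freq),
                                                   df_in.index[-1], freq=freq))
    if method == 'merge':
        ldf = pd.merge(df_in, df_sampling, left_index=True, right_index=True, how='outer')
    elif method == 'concat':
        ldf = pd.concat([df_in, df_sampling], axis=1)
    else:
        raise Exception('Method unknown: ' + method)
    # Interpolate over the combined index, then keep only the grid points
    ldf = ldf.interpolate(method='time')
    return ldf.loc[ldf.index.isin(df_sampling.index)]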
All this feels too cumbersome, and I suspect there is a better way to do it.
Instead of using either merge or concat inside your function generate_interpolated_time_series, I would rely on df.reindex. Something like this:
def f(df_in, freq='T', start=None):
    if start is None:
        # Truncate the first timestamp to the start of its minute
        start = df_in.index[0].floor('T')
        # refactored from: df_in.index[0].replace(second=0, microsecond=0, nanosecond=0)
    end = df_in.index[-1]
    idx = pd.date_range(start=start, end=end, freq=freq)
    # Union the sampling grid with the original index, interpolate across the
    # combined index, then back-fill anything before the first observation
    ldf = df_in.reindex(df_in.index.union(idx)).interpolate().bfill()
    # Drop the original off-grid timestamps, keeping only the sampling grid
    ldf = ldf[~ldf.index.isin(df_in.index.difference(idx))]
    return ldf
Test sample:
from pandas import Timestamp
d = {Timestamp('2022-10-07 11:06:09.957000'): 21.9,
     Timestamp('2022-11-19 04:53:18.532000'): 47.5,
     Timestamp('2022-11-19 16:30:04.564000'): 66.9,
     Timestamp('2022-11-21 04:17:57.832000'): 96.9,
     Timestamp('2022-12-05 22:26:48.354000'): 118.6}
df = pd.DataFrame.from_dict(d, orient='index', columns=['values'])
print(df)
values
2022-10-07 11:06:09.957 21.9
2022-11-19 04:53:18.532 47.5
2022-11-19 16:30:04.564 66.9
2022-11-21 04:17:57.832 96.9
2022-12-05 22:26:48.354 118.6
Check for equality:
merge = generate_interpolated_time_series(df, freq='D', method='merge')
concat = generate_interpolated_time_series(df, freq='D', method='concat')
reindex = f(df, freq='D')
print(all([merge.equals(concat), merge.equals(reindex)]))
# True
An added bonus is some performance gain. Comparing the three methods with %timeit across different frequencies (['D','H','T','S']) shows reindex to be the fastest in every case.
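The exact harness isn't shown here; a rough reconstruction of such a comparison (using the function names defined above) might look like:
import timeit

for freq in ['D', 'H', 'T']:  # add 'S' for the full comparison (slow: millions of grid points)
    for label, fn in [('merge', lambda: generate_interpolated_time_series(df, freq=freq, method='merge')),
                      ('concat', lambda: generate_interpolated_time_series(df, freq=freq, method='concat')),
                      ('reindex', lambda: f(df, freq=freq))]:
        print(freq, label, timeit.timeit(fn, number=10) / 10)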
Aside: in your function, raise Exception('Method unknown: ' + metnhod) contains a typo; should be method.
I would like to calculate 1-year, 2-year, and 3-year growth rates on a weekly/daily DataFrame:
start = '20100101'
end = '20201117'
df_ts = pd.DataFrame(index=pd.bdate_range(start=start, end=end, freq='D'))
df_ts['valeur1'] = range(1, df_ts.shape[0]+1)
df_ts['gr'] = 100*df_ts.pct_change(periods=1, freq='Y')
df_ts
I thought pct_change(periods=n, freq='Y') was the right way to do it, but I get an erroneous result even with this simple data.
I should emphasise that my data are weekly/daily and that I perform other operations as well, so I need to be able to put this inside apply(lambda x: x.pct_change(periods=n, freq='Y')).
Any suggestions for doing this simply?
You are assigning a whole DataFrame to a column because you called .pct_change on the DataFrame, not on the column. Try this:
df_ts['gr'] = df_ts['valeur1'].pct_change(periods=1, freq='Y')
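If the freq argument still misbehaves on your pandas version, a hedged alternative (a sketch, assuming a daily DatetimeIndex) is to shift the index labels by one calendar year and let pandas align the two series:
# Compare each date with the same date one year earlier; dates with no
# counterpart a year back come out as NaN.
prev_year = df_ts['valeur1'].shift(freq=pd.DateOffset(years=1))
# Leap-day roll-back can create duplicate labels (two sources map to 28 Feb);
# keep the first occurrence so alignment stays unambiguous.
prev_year = prev_year[~prev_year.index.duplicated()]
df_ts['gr'] = 100 * (df_ts['valeur1'] / prev_year - 1)
For 2- or 3-year rates, swap in pd.DateOffset(years=2) or years=3.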
I have a full-year hourly series, that we may call "calendar":
from pandas import date_range, Series
calendar = Series(
index=date_range("2006-01-01", "2007-01-01", freq="H", closed="left", tz="utc"),
data=range(365 * 24)
)
Now I have a new index, which is another hourly series, but starting and ending at arbitrary datetimes:
index = date_range("2019-01-01", "2020-10-02", freq="H", tz="utc")
I would like to create a new series result that has the same index as index and, for each month-day-hour combination, it takes the value from the corresponding month-day-hour in the calendar.
I could iterate to have a working solution like so, with a try-except just to ignore February 29th:
result = Series(index=index, dtype="float")
for timestamp in result.index:
    try:
        calendar_timestamp = timestamp.replace(year=2006)
    except ValueError:
        # 29 February has no counterpart in 2006; leave it as NaN
        continue
    result.loc[timestamp] = calendar.loc[calendar_timestamp]
This, however, is very inefficient, so does anybody know how to do it better? By better I mean, above all, faster (CPU-time-wise).
Constraints/notes:
No Numba, nor Cython, just CPython and Pandas/NumPy
It is fine to leave February 29th with NaN values, since that day is not represented in the calendar
We can always assume that the index is properly sorted and has no gaps (the same applies to the calendar)
Let's try extracting the month-day-hour combination as a string and mapping with it:
cal1 = pd.Series(calendar.values,
                 index=calendar.index.strftime('%m%d%H'))
result = index.to_series().dt.strftime('%m%d%H').map(cal1)
Output: result is an hourly Series aligned with index; the 29 February rows come out as NaN, since that day has no counterpart in the calendar.
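An alternative sketch that avoids string formatting altogether, building the lookup on a (month, day, hour) MultiIndex instead (same NaN behaviour for 29 February):
# Re-key the calendar by (month, day, hour), then look the new timestamps up
cal2 = pd.Series(calendar.values,
                 index=pd.MultiIndex.from_arrays(
                     [calendar.index.month, calendar.index.day, calendar.index.hour]))
key = pd.MultiIndex.from_arrays([index.month, index.day, index.hour])
result = pd.Series(cal2.reindex(key).values, index=index)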
I have been stuck on this way too long. All I am trying to do is create a new column called Duration Target Date which derives from Standard Duration Days + Date/Time Created. Below is my code so far:
From my POV, this code iterates from 0 to the length of the data frame. If a cell in the Standard Duration Days column holds "No Set Standard Duration", it goes to my else statement and overwrites the corresponding cell with a blank (the same as I initialized it). However, for anything other than "No Set Standard Duration", it should add the value of that Standard Duration Days cell to the Date/Time Created column, placing the result in the new Duration Target Date column at the corresponding index.
newDF["Duration Target Date"] = ""
for i in range(0, len(newDF)):
    if newDF.loc[i, "Standard Duration Days"] != "No Set Standard Duration":
        newDF.loc[i, "Duration Target Date"] = (timedelta(days=int(newDF.loc[i, "Standard Duration Days"]))
                                                + newDF.loc[i, "Date/Time Created"])
    else:
        newDF.loc[i, "Duration Target Date"] == ""
I noticed that this works partially, but then it eventually stops working. I also get an error when I run it: "KeyError: 326".
I would just add the columns and leave the NaT (Not a Time) values.
df = pd.DataFrame({
    "Standard Duration Days": [3, 5, "No Set Standard Duration"],
    "Date/Time Created": ['2019-01-01', '2019-02-01', '2019-03-01']
})
# 1. Convert string dates to pandas timestamps.
df['Date/Time Created'] = pd.to_datetime(df['Date/Time Created'])
# 2. Create time deltas, coercing errors.
delta = pd.to_timedelta(df['Standard Duration Days'], unit='D', errors='coerce')
# 3. Create new column by adding delta to 'Date/Time Created'.
df['Duration Target Date'] = (df['Date/Time Created'] + delta).dt.normalize()
>>> df
Standard Duration Days Date/Time Created Duration Target Date
0 3 2019-01-01 2019-01-04
1 5 2019-02-01 2019-02-06
2 No Set Standard Duration 2019-03-01 NaT
Adding text to a numeric column converts the entire column to object which takes more memory and is less efficient. Generally, one wants to leave empty values as np.nan or possibly a sentinel value in the case of integers. Only for display purposes do those get converted, e.g. df['Duration Target Date'].fillna('').
A couple of issues here. First, it looks like you're confusing loc with iloc, which is very easy to do. loc looks up by the actual index labels, which may or may not match the integer positions, but your i in range(0, len(newDF)) iterates by integer position. So you get KeyError: 326 because you reach the 326th row of your dataframe, but its index label is not actually 326. You can check this with print(newDF.iloc[320:330]).
Second and more important issue: you almost never want to iterate through rows in a pandas dataframe. Instead, use a vectorized function that applies to a full column at a time. For your case where you want conditional assignment, the relevant function is np.where:
boolean_filter = newDF.loc[:, "Standard Duration Days"] != "No Set Standard Duration"
# to_timedelta(..., errors='coerce') turns the "No Set Standard Duration" rows into NaT
value_where_true = (pd.to_timedelta(newDF.loc[:, "Standard Duration Days"], unit='D', errors='coerce')
                    + newDF.loc[:, "Date/Time Created"])
value_where_false = np.datetime64('NaT')  # keeps the column datetime64 instead of object
newDF["Duration Target Date"] = np.where(boolean_filter, value_where_true, value_where_false)
Here's a way using .apply row-wise:
newDF['Duration Target Date'] = newDF.apply(
    lambda x: pd.Timedelta(days=int(x["Standard Duration Days"])) + x["Date/Time Created"]
              if x["Standard Duration Days"] != "No Set Standard Duration" else None,
    axis=1)
Note: since you haven't provided any data, this is not tested.
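For what it's worth, a quick run on the small sample frame from the previous answer (an assumed stand-in for the real data) suggests it behaves as intended:
import pandas as pd

newDF = pd.DataFrame({
    "Standard Duration Days": [3, 5, "No Set Standard Duration"],
    "Date/Time Created": pd.to_datetime(['2019-01-01', '2019-02-01', '2019-03-01']),
})
newDF['Duration Target Date'] = newDF.apply(
    lambda x: pd.Timedelta(days=int(x["Standard Duration Days"])) + x["Date/Time Created"]
              if x["Standard Duration Days"] != "No Set Standard Duration" else None,
    axis=1)
print(newDF)  # rows 0 and 1 get shifted dates; the last row comes out as NaT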
I have a dataset structured like this:
"Date","Time","Open","High","Low","Close","Volume"
This time series represent the values of a generic stock market.
I want to calculate the difference in percentage between two rows of the column "Close" (in fact, I want to know how much the value of the stock increased or decreased; each row represents a day).
I've done this with a for loop (which is a terrible way to use pandas on a big-data problem), and it produces the right results, but in a separate DataFrame:
rows_number = df_stock.shape[0]
percentage_df = pd.DataFrame(columns=['Date', 'Percentage'])
# The first row has no yesterday to compare against, so its value is set to 1
percentage_df = percentage_df.append({'Date': df_stock.iloc[0]['Date'], 'Percentage': 1}, ignore_index=True)
# For each day, calculate the market trend in percentage
for index in range(1, rows_number):
    # n_yesterday : 100 = (n_today - n_yesterday) : x
    n_today = df_stock.iloc[index]['Close']
    n_yesterday = df_stock.iloc[index - 1]['Close']
    difference = n_today - n_yesterday
    percentage = (100 * difference) / n_yesterday
    percentage_df = percentage_df.append({'Date': df_stock.iloc[index]['Date'], 'Percentage': percentage}, ignore_index=True)
How could I refactor this to take advantage of the DataFrame API, removing the for loop and creating the new column in place?
df['Change'] = df['Close'].pct_change()
or, if you want to calculate the change in reverse order:
df['Change'] = df['Close'].pct_change(-1)
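A tiny worked example with made-up closes, to show what pct_change returns:
import pandas as pd

close = pd.Series([100.0, 102.0, 96.9], name='Close')
print(close.pct_change())
# 0     NaN
# 1    0.02    # (102.0 - 100.0) / 100.0
# 2   -0.05    # (96.9 - 102.0) / 102.0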
I would suggest first making the Date column a DatetimeIndex; for this you can use
df_stock = df_stock.set_index(['Date'])
df_stock.index = pd.to_datetime(df_stock.index, dayfirst=True)
Then you can access any row and column via datetime indexing and perform whatever operations you want; for example, to calculate the difference in percentage between two rows of the column "Close":
df_stock['percentage'] = ((df_stock.loc['15-07-2019', 'Close'] - df_stock.loc['14-07-2019', 'Close'])
                          / df_stock.loc['14-07-2019', 'Close']) * 100
You can also use a for loop to carry out the operation for each date or row:
for Dt in df_stock.index:
    ...  # per-date operations here
Using diff, which is equivalent to pct_change():
df['Close'].diff() / df['Close'].shift()
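A quick sanity check (a sketch with made-up values) that this matches pct_change up to floating-point noise:
import numpy as np
import pandas as pd

close = pd.Series([100.0, 102.0, 96.9])
manual = close.diff() / close.shift()
print(np.allclose(manual, close.pct_change(), equal_nan=True))  # True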