Shift time series with missing dates in Pandas

I have a times series with some missing entries, that looks like this:
date value
---------------
2000 5
2001 10
2003 8
2004 72
2005 12
2007 13
I would like to create a column for the "previous_value", but I only want it to show values for consecutive years. So I want it to look like this:
date value previous_value
-------------------------------
2000 5 nan
2001 10 5
2003 8 nan
2004 72 8
2005 12 72
2007 13 nan
However, just applying the pandas shift function directly to the 'value' column would give 'previous_value' = 10 for 'date' = 2003 and 'previous_value' = 12 for 'date' = 2007.
What's the most elegant way to deal with this in pandas? (I'm not sure if it's as easy as setting the 'freq' attribute).
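For reference, a minimal sketch reproducing the naive behavior described above (my illustration, not part of the question):
import pandas as pd

df = pd.DataFrame({'date': [2000, 2001, 2003, 2004, 2005, 2007],
                   'value': [5, 10, 8, 72, 12, 13]})
# A plain shift ignores the gaps in the years: 2003 picks up 2001's
# value (10) and 2007 picks up 2005's value (12).
df['previous_value'] = df['value'].shift()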

In [588]: df = pd.DataFrame({'date': [2000, 2001, 2003, 2004, 2005, 2007],
     ...:                    'value': [5, 10, 8, 72, 12, 13]})
In [589]: df['previous_value'] = df.value.shift()[df.date == df.date.shift() + 1]
In [590]: df
Out[590]:
date value previous_value
0 2000 5 NaN
1 2001 10 5
2 2003 8 NaN
3 2004 72 8
4 2005 12 72
5 2007 13 NaN
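A variant of the same idea (my addition, not from the answer above) uses Series.where with diff, which reads a little more explicitly:
# Keep the shifted value only where the year increased by exactly 1.
df['previous_value'] = df['value'].shift().where(df['date'].diff() == 1)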
Also see here for a time series approach using resample(): Using shift() with unevenly spaced data
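A sketch of what that resample() approach could look like here, assuming the integer years are first promoted to a DatetimeIndex:
# Promote years to a DatetimeIndex, resample to annual frequency so the
# missing years appear as NaN rows, shift, then map back to the original rows.
ts = df.set_index(pd.to_datetime(df['date'].astype(str), format='%Y'))['value']
prev = ts.resample('AS').asfreq().shift(1)
df['previous_value'] = prev.loc[ts.index].values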

Your example doesn't look like real time series data with timestamps. Let's take another example with the missing date 2020-01-03:
df = pd.DataFrame({"val": [10, 20, 30, 40, 50]},
                  index=pd.date_range("2020-01-01", "2020-01-05"))
df.drop(pd.Timestamp('2020-01-03'), inplace=True)
val
2020-01-01 10
2020-01-02 20
2020-01-04 40
2020-01-05 50
To shift by one day you can set the freq parameter to 'D':
df.shift(1, freq='D')
Output:
val
2020-01-02 10
2020-01-03 20
2020-01-05 40
2020-01-06 50
To combine the original data with the shifted data, you can merge the two tables:
df.merge(df.shift(1, freq='D'),
         left_index=True,
         right_index=True,
         how='left',
         suffixes=('', '_previous'))
Output:
val val_previous
2020-01-01 10 NaN
2020-01-02 20 10.0
2020-01-04 40 NaN
2020-01-05 50 40.0
You can find other offset aliases here.
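A couple of sketches of other aliases in the same spirit (my examples, not from the answer above):
df.shift(2, freq='D')   # shift by two calendar days: 2020-01-01 -> 2020-01-03
df.shift(1, freq='B')   # 'B' = business day, so weekends are skipped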

Related

I want to select duplicate rows between 2 dataframes

I want to filter rows in df1 by its date column (dtype datetime64[ns]) using the dates in df2 (same column name and dtype). I tried searching for a solution, but I keep getting errors such as:
Can only compare identically-labeled Series objects or 'Timestamp' object is not iterable.
sample df1
id  date        value
1   2018-10-09  120
2   2018-10-09  60
3   2018-10-10  59
4   2018-11-25  120
5   2018-08-25  120
sample df2
date
2018-10-09
2018-10-10
sample result that I want
id  date        value
1   2018-10-09  120
2   2018-10-09  60
3   2018-10-10  59
In fact, I want this program to run once every 7 days, counting back from the day it starts, so it should remove dates that are not within the past 7 days.
from datetime import date, timedelta

# create new dataframe -> df2
df2 = pd.DataFrame({'date': []})
# set the dates to the last 7 days
days_use = 7  # 7 -> 1
for x in range(days_use, 0, -1):
    use_day = date.today() - timedelta(days=x)
    df2.loc[x] = use_day
# change to datetime64[ns]
df2['date'] = pd.to_datetime(df2['date'])
Use isin:
>>> df1[df1["date"].isin(df2["date"])]
id date value
0 1 2018-10-09 120
1 2 2018-10-09 60
2 3 2018-10-10 59
If you want to create df2 with the dates for the past week, you can simply use pd.date_range:
df2 = pd.DataFrame({"date": pd.date_range(pd.Timestamp.today().date() - pd.DateOffset(7),
                                          periods=7)})
>>> df2
date
0 2022-05-03
1 2022-05-04
2 2022-05-05
3 2022-05-06
4 2022-05-07
5 2022-05-08
6 2022-05-09
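If the goal is simply "keep rows from the past 7 days", a direct comparison (my sketch, not from the answer above) avoids building df2 at all:
cutoff = pd.Timestamp.today().normalize() - pd.Timedelta(days=7)
recent = df1[df1["date"] >= cutoff]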

How to filter a pandas column if the number of unique values in another column equals a given value

I want to filter for AccountIDs that have transaction data for at least 3 months. This is just a small fraction of the entire dataset.
Here is what I did, but I am not sure it is right:
data = data.groupby('AccountID').apply(lambda x: x['TransactionDate'].nunique() >= 3)
I get a Series of boolean values as output, but I want a pandas DataFrame. Here is the sample data:
TransactionDate AccountID TransactionAmount
0 2020-12-01 8 400000.0
1 2020-12-01 22 25000.0
2 2020-12-02 22 551500.0
3 2020-01-01 17 116.0
4 2020-01-01 24 2000.0
5 2020-01-02 68 6000.0
6 2020-03-03 20 180000.0
7 2020-03-01 66 34000.0
8 2020-02-01 66 20000.0
9 2020-02-01 66 40600.0
The output I get:
AccountID
1 True
2 True
3 True
4 True
5 True
You are close; you need GroupBy.transform to broadcast the aggregated values back to a Series the same length as the original df, which makes filtering with boolean indexing possible:
data = data[data.groupby('AccountID')['TransactionDate'].transform('nunique') >= 3]
If the dates within a month can fall on different days, use Series.dt.to_period to create a helper column of monthly periods:
s = data.assign(new=data['TransactionDate'].dt.to_period('m')) \
        .groupby('AccountID')['new'].transform('nunique')
data = data[s >= 3]
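To see what the helper column holds, a quick illustration of to_period (my example):
s = pd.Series(pd.to_datetime(['2020-12-01', '2020-12-15', '2021-01-02']))
s.dt.to_period('M')
# 0    2020-12
# 1    2020-12
# 2    2021-01
# dtype: period[M]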

Finding historical seasonal average for given month in a monthly series in a dataframe time-series

I have a dataframe (snippet below) with an index in YYYYMM format and several columns of values, including one called "month" into which I've extracted the MM part of the index column.
index st us stu px month
0 202001 2616757.0 3287969.0 0.795858 2.036 01
1 201912 3188693.0 3137911.0 1.016183 2.283 12
2 201911 3610052.0 2752828.0 1.311398 2.625 11
3 201910 3762043.0 2327289.0 1.616492 2.339 10
4 201909 3414939.0 2216155.0 1.540930 2.508 09
What I want to do is make a new column called 'stavg' which takes the 5-year average of the 'st' column for the given month. For example, since the top row refers to 202001, the stavg for that row should be the average of the January values from 2019, 2018, 2017, 2016, and 2015. Going back in time by each additional year should pull the moving average back as well, such that stavg for the row for, say, 201205 should show the average of the May values from 2011, 2010, 2009, 2008, and 2007.
index st us stu px month stavg
0 202001 2616757.0 3287969.0 0.795858 2.036 01 xxx
1 201912 3188693.0 3137911.0 1.016183 2.283 12 xxx
2 201911 3610052.0 2752828.0 1.311398 2.625 11 xxx
3 201910 3762043.0 2327289.0 1.616492 2.339 10 xxx
4 201909 3414939.0 2216155.0 1.540930 2.508 09 xxx
I know how to generate new columns of data based on operations on other columns on the same row (such as dividing 'st' by 'us' to get 'stu' and extracting digits from index to get 'month') but this notion of creating a column of data based on previous values is really stumping me.
Any clues on how to approach this would be greatly appreciated!! I know that for the first five years of data, I won't be able to populate the 'stavg' column with anything, which is fine--I could use NaN there.
Try defining a function and using the apply method. Note that variables inside query strings are referenced with @:
df['year'] = (df['index'].astype(int) / 100).astype(int)

def get_stavg(df, year, month):
    # previous five years, same month
    df_year_month = df.query('@year - 5 <= year < @year and month == @month')
    return df_year_month.st.mean()

df['stavg'] = df.apply(lambda x: get_stavg(df, x['year'], x['month']), axis=1)
If you are looking for a pandas-only solution, you could do something like the following.
Dummy Data
Here we create a dummy dataset with 10 years of data and only two months (Jan and Feb).
import pandas as pd

df1 = pd.DataFrame({"date": pd.date_range("2010-01-01", periods=10, freq="AS-JAN")})
df2 = pd.DataFrame({"date": pd.date_range("2010-01-01", periods=10, freq="AS-FEB")})
df1["n"] = df1.index * 2
df2["n"] = df2.index * 3
df = pd.concat([df1, df2]).sort_values("date").reset_index(drop=True)
df.head(10)
date n
0 2010-01-01 0
1 2010-02-01 0
2 2011-01-01 2
3 2011-02-01 3
4 2012-01-01 4
5 2012-02-01 6
6 2013-01-01 6
7 2013-02-01 9
8 2014-01-01 8
9 2014-02-01 12
Groupby + rolling mean
df["n_mean"] = df.groupby(df["date"].dt.month)["n"]\
.rolling(5).mean()\
.reset_index(0,drop=True)
date n n_mean
0 2010-01-01 0 NaN
1 2010-02-01 0 NaN
2 2011-01-01 2 NaN
3 2011-02-01 3 NaN
4 2012-01-01 4 NaN
5 2012-02-01 6 NaN
6 2013-01-01 6 NaN
7 2013-02-01 9 NaN
8 2014-01-01 8 4.0
9 2014-02-01 12 6.0
10 2015-01-01 10 6.0
11 2015-02-01 15 9.0
12 2016-01-01 12 8.0
13 2016-02-01 18 12.0
14 2017-01-01 14 10.0
15 2017-02-01 21 15.0
16 2018-01-01 16 12.0
17 2018-02-01 24 18.0
18 2019-01-01 18 14.0
19 2019-02-01 27 21.0
By definition, the result is NaN for the first 4 years.
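Note that rolling(5) includes the current row; to average strictly the previous five years, you would also shift the result by one within each group. And if partial averages are acceptable for the early years, rolling accepts min_periods (a sketch, assuming that relaxation is acceptable for your case):
df["n_mean"] = df.groupby(df["date"].dt.month)["n"] \
                 .rolling(5, min_periods=1).mean() \
                 .reset_index(0, drop=True)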
Update
For your particular case
import pandas as pd
index = [f"{y}01" for y in range(2010, 2020)] +\
[f"{y}02" for y in range(2010, 2020)]
df = pd.DataFrame({"index":index})
df["st"] = df.index + 1
# dates/ index should be sorted
df = df.sort_values("index").reset_index(drop=True)
# extract month
df["month"] = df["index"].str[-2:]
df["st_mean"] = df.groupby("month")["st"]\
.rolling(5).mean()\
.reset_index(0,drop=True)

Number of active IDs in each period

I have a dataframe that looks like this
ID | START      | END
1  | 2016-12-31 | 2017-02-28
2  | 2017-01-30 | 2017-10-30
3  | 2016-12-21 | 2018-12-30
I want to know the number of active IDs in each possible day. So basically count the number of overlapping time periods.
What I did to calculate this was to create a new dataframe c_df with the columns date and count. The first column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then for every line in my original data frame I calculated a different range for the start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment by one the corresponding count cell in c_df.
All these loops, though, are not very efficient for big datasets, and they look ugly. Is there a more efficient way of doing this?
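For concreteness, a sketch of the loop-based approach described above (names are illustrative, assuming START/END are datetime64 columns):
all_dates = pd.date_range(start=df['START'].min(), end=df['END'].max())
c_df = pd.DataFrame({'date': all_dates, 'count': 0}).set_index('date')
for _, row in df.iterrows():
    id_dates = pd.date_range(start=row['START'], end=row['END'])
    c_df.loc[id_dates, 'count'] += 1   # one increment per active day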
If your dataframe is small enough so that performance is not a concern, create a date range for each row, then explode them and count how many times each date exists in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
  .explode() \
  .value_counts() \
  .sort_index()
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Work with any version of pandas:
dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]
mask = (start <= dates) & (dates <= end)
result = pd.DataFrame({
    'Date': dates,
    'Count': mask.sum(axis=0)
})
Create an IntervalIndex and use a generator expression or list comprehension with contains to check each date against each interval. (Note: I made a smaller sample to test this solution.)
Sample `df`
Out[56]:
ID START END
0 1 2016-12-31 2017-01-20
1 2 2017-01-20 2017-01-30
2 3 2016-12-28 2017-02-03
3 4 2017-01-20 2017-01-25
iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
df_final = pd.DataFrame({'dates': all_dates,
                         'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1
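For very large frames, another option (my addition, not from the answers above) is a sweep line: add +1 at each interval start, -1 on the day after each end, and take a cumulative sum. A sketch, again assuming START/END are datetime64 columns:
starts = df['START'].value_counts()
ends = (df['END'] + pd.Timedelta(days=1)).value_counts()
delta = starts.sub(ends, fill_value=0).sort_index()
# cumulative sum at each change point, then forward-fill to daily frequency
counts = delta.cumsum().reindex(
    pd.date_range(df['START'].min(), df['END'].max()), method='ffill')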

Perform multiple operations in a single groupby call with pandas?

I'd like to produce a summary dataframe after grouping by date. I want a column that shows the mean of a given column as-is, and the mean of that same column after filtering for instances that are greater than 0. I figured out how to do this (below), but it requires two separate groupby calls, renaming the columns, and then joining them back together. I feel like one should be able to do this all in one call. I tried to use eval for this but kept getting an error telling me to use apply, and that I couldn't use eval on a groupby object.
Code which gets me what I want but doesn't seem very efficient:
# Sample data
data = pd.DataFrame(
    {"year":  [2013, 2013, 2013, 2014, 2014, 2014],
     "month": [1, 2, 3, 1, 2, 3],
     "day":   [1, 1, 1, 1, 1, 1],
     "delay": [0, -4, 50, -60, 9, 10]})
subset = (data
          .groupby(['year', 'month', 'day'])['delay']
          .mean()
          .reset_index()
          .rename(columns={'delay': 'avg_delay'})
          )
subset_1 = (data[data.delay > 0]
            .groupby(['year', 'month', 'day'])['delay']
            .mean()
            .reset_index()
            .rename(columns={'delay': 'avg_delay_pos'})
            )
combined = pd.merge(subset, subset_1, how='left', on=['year', 'month', 'day'])
combined
combined
year month day avg_delay avg_delay_pos
0 2013 1 1 0 NaN
1 2013 2 1 -4 NaN
2 2013 3 1 50 50.0
3 2014 1 1 -60 NaN
4 2014 2 1 9 9.0
5 2014 3 1 10 10.0
IIUC, you could use the following code:
>>> data['avg_delay'] = data.pop('delay')
>>> data['avg_delay_pos'] = data.loc[data['avg_delay'].gt(0), 'avg_delay']
>>> data
day month year avg_delay avg_delay_pos
0 1 1 2013 0 NaN
1 1 2 2013 -4 NaN
2 1 3 2013 50 50.0
3 1 1 2014 -60 NaN
4 1 2 2014 9 9.0
5 1 3 2014 10 10.0
>>>
Explanation:
I first remove the delay column with pop and assign it under the new name avg_delay, which effectively renames delay to avg_delay.
Then I create a new column called avg_delay_pos: loc selects the values greater than zero, and since the index is preserved, the rows with positive values receive the corresponding avg_delay values while the remaining rows get no assignment and are left as NaN, as you expected.
This solution is specific to your problem, but you can do it with a single groupby call. To get "avg_delay_pos", you just have to remove negative (and zero) values.
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
(df.filter(like='delay')
   .groupby(pd.to_datetime(df[['year', 'month', 'day']]))
   .mean()
   .add_prefix('avg_'))
avg_delay avg_delay_pos
2013-01-01 0 NaN
2013-02-01 -4 NaN
2013-03-01 50 50.0
2014-01-01 -60 NaN
2014-02-01 9 9.0
2014-03-01 10 10.0
Breakdown
where is used to mask values that are not positive.
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
# df['delay'].where(df['delay'] > 0)
0 NaN
1 NaN
2 50.0
3 NaN
4 9.0
5 10.0
Name: delay, dtype: float64
Next, extract the delay columns we want to group on,
df.filter(like='delay')
delay delay_pos
0 0 NaN
1 -4 NaN
2 50 50.0
3 -60 NaN
4 9 9.0
5 10 10.0
Then perform a groupby on the date,
_.groupby(pd.to_datetime(df[['year', 'month', 'day']])).mean()
delay delay_pos
2013-01-01 0 NaN
2013-02-01 -4 NaN
2013-03-01 50 50.0
2014-01-01 -60 NaN
2014-02-01 9 9.0
2014-03-01 10 10.0
Here pd.to_datetime is used to convert the year/month/day columns into a single datetime column; it's more efficient to group on a single column than on multiple ones.
pd.to_datetime(df[['year', 'month', 'day']])
0 2013-01-01
1 2013-02-01
2 2013-03-01
3 2014-01-01
4 2014-02-01
5 2014-03-01
dtype: datetime64[ns]
The final .add_prefix('avg_') adds the prefix "avg_" to the result.
An alternative, if you want separate year/month/day columns, would be:
df['delay_pos'] = df['delay'].where(df['delay'] > 0)
df.groupby(['year', 'month', 'day']).mean().add_prefix('avg_').reset_index()
year month day avg_delay avg_delay_pos
0 2013 1 1 0 NaN
1 2013 2 1 -4 NaN
2 2013 3 1 50 50.0
3 2014 1 1 -60 NaN
4 2014 2 1 9 9.0
5 2014 3 1 10 10.0
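For completeness, the most direct single-call answer to the original question is named aggregation (pandas >= 0.25); a sketch under that assumption, starting from the original sample data:
result = (data.assign(delay_pos=data['delay'].where(data['delay'] > 0))
              .groupby(['year', 'month', 'day'], as_index=False)
              .agg(avg_delay=('delay', 'mean'),
                   avg_delay_pos=('delay_pos', 'mean')))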
