I have data that looks like this. Each row represents a value of that ID at some date.
ID Date Value
A 2012-01-05 50
A 2012-01-08 100
A 2012-01-10 200
B 2012-07-01 10
B 2012-07-03 20
I need to expand this so that I have rows for all days. The value of each day should be the value of the day before (i.e., think of the data above as updates of values, and the data below as a timeseries of values).
ID Date Value
A 2012-01-05 50
A 2012-01-06 50
A 2012-01-07 50
A 2012-01-08 100
A 2012-01-09 100
A 2012-01-10 200
B 2012-07-01 10
B 2012-07-02 10
B 2012-07-03 20
Currently, I have a solution that amounts to the following (a rough sketch in code follows the list):
Group by ID
For each group, figure out the min and max date
Create a pd.date_range
Iterate simultaneously through the rows and through the date range, filling the values in the date range and incrementing the index-pointer to the rows if necessary
Append all these date ranges to a final dataframe
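A minimal sketch of that approach, assuming the dataframe is called df with columns ID, Date (already parsed as datetime) and Value:
import pandas as pd

pieces = []
for key, grp in df.groupby('ID'):
    grp = grp.sort_values('Date')
    days = pd.date_range(grp['Date'].min(), grp['Date'].max(), freq='D')
    dates, values = grp['Date'].tolist(), grp['Value'].tolist()
    i, rows = 0, []
    for day in days:
        # advance the row pointer while the next update is not after this day
        while i + 1 < len(dates) and dates[i + 1] <= day:
            i += 1
        rows.append((key, day, values[i]))
    pieces.append(pd.DataFrame(rows, columns=['ID', 'Date', 'Value']))

expanded = pd.concat(pieces, ignore_index=True)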
It works, but seems like a pretty bad bruteforce solution. I wonder if there's a better approach supported by Pandas?
Use resample on a Date-indexed dataframe grouped by ID, then ffill the Value:
In [1725]: df.set_index('Date').groupby('ID').resample('1D')['Value'].ffill().reset_index()
Out[1725]:
ID Date Value
0 A 2012-01-05 50
1 A 2012-01-06 50
2 A 2012-01-07 50
3 A 2012-01-08 100
4 A 2012-01-09 100
5 A 2012-01-10 200
6 B 2012-07-01 10
7 B 2012-07-02 10
8 B 2012-07-03 20
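This assumes Date is already a datetime64 column; if it comes in as plain strings, convert it first:
df['Date'] = pd.to_datetime(df['Date'])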
Or you can try this one (note: this can also be used to expand a numeric column).
df.Date = pd.to_datetime(df.Date)
df = df.set_index(df.Date)
df.groupby('ID')\
  .apply(lambda x: x.reindex(pd.date_range(min(x.index), max(x.index), freq='D')))\
  .ffill().reset_index(drop=True)
Out[519]:
ID Date Value
0 A 2012-01-05 50.0
1 A 2012-01-05 50.0
2 A 2012-01-05 50.0
3 A 2012-01-08 100.0
4 A 2012-01-08 100.0
5 A 2012-01-10 200.0
6 B 2012-07-01 10.0
7 B 2012-07-01 10.0
8 B 2012-07-03 20.0
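Note that the Date column here still shows the forward-filled original dates, because the new daily dates live in the index that reset_index(drop=True) discards. A variant of the same reindex idea that keeps the daily dates in the Date column (a sketch, assuming a df with ID, Date (datetime) and Value columns):
out = (df.set_index('Date')
         .groupby('ID')['Value']
         .apply(lambda s: s.reindex(pd.date_range(s.index.min(), s.index.max(), freq='D')).ffill())
         .rename_axis(['ID', 'Date'])
         .reset_index())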
Related
I have a dataframe with a datetime index and a time series of integer values, one per day. From this, I want to identify occurrences where the time series is above a threshold for at least 2 consecutive days. For these events, I want to count how many occur over the entire span, and the start date of each one.
One of the issues is making sure I don't over-count occurrences when an event lasts more than 2 days: as long as the values stay over the threshold it should count as a single event, whether it lasts 2 days or 10 days.
I can do this using a function with lots of if statements but it's very kludgy. I want to learn a more pandas/pythonic way of doing it.
I started by looking at a masked version of the data containing only the values above the (arbitrary) threshold and using diff(), which seemed promising, but I'm still stuck. Any help is appreciated.
import numpy as np
import pandas as pd

arb_thr = 20  # arbitrary threshold

dates = pd.date_range('2012-01-01', periods=100, freq='D')
values = np.random.randint(100, size=len(dates))
df = pd.DataFrame({'timeseries': values}, index=dates)
df.loc[df['timeseries'] > arb_thr].index.to_series().diff().head(20)
You can make use of booleans to flag the rows which are below the threshold, then cumsum these flags to create artificial group ids (so consecutive rows above the threshold share a group), and finally groupby on them:
arb_thr = 20
df = df.reset_index()
grps = df["timeseries"].lt(arb_thr).cumsum()
result = df.groupby(grps).agg(
min_date=("index", "min"),
max_date=("index", "max"),
count=("timeseries", "count")
).rename_axis(None, axis=0)
min_date max_date count
0 2012-01-01 2012-01-09 9
1 2012-01-10 2012-01-11 2
2 2012-01-12 2012-01-12 1
3 2012-01-13 2012-01-22 10
4 2012-01-23 2012-01-24 2
5 2012-01-25 2012-02-04 11
6 2012-02-05 2012-02-07 3
7 2012-02-08 2012-02-08 1
8 2012-02-09 2012-02-10 2
9 2012-02-11 2012-02-12 2
10 2012-02-13 2012-02-15 3
11 2012-02-16 2012-02-20 5
12 2012-02-21 2012-02-21 1
13 2012-02-22 2012-02-23 2
14 2012-02-24 2012-02-25 2
15 2012-02-26 2012-03-04 8
16 2012-03-05 2012-03-07 3
17 2012-03-08 2012-03-20 13
18 2012-03-21 2012-03-22 2
19 2012-03-23 2012-03-23 1
20 2012-03-24 2012-03-24 1
21 2012-03-25 2012-03-28 4
22 2012-03-29 2012-03-29 1
23 2012-03-30 2012-04-01 3
24 2012-04-02 2012-04-08 7
25 2012-04-09 2012-04-09 1
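To get only the events the question asks about (runs of at least 2 consecutive days above the threshold, with their count and start dates), a sketch building on the same cumsum trick, using the reset-index df and arb_thr from above:
above = df["timeseries"] > arb_thr        # True on days above the threshold
run_id = (~above).cumsum()                # consecutive above-threshold days share one id

events = df[above].groupby(run_id[above]).agg(
    start_date=("index", "min"),
    length=("timeseries", "size")
)
events = events[events["length"] >= 2]

len(events)              # how many qualifying events there are
events["start_date"]     # the start date of each one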
I have two dataframes as follows:
agreement
agreement_id activation term_months total_fee
0 A 2020-12-01 24 4800
1 B 2021-01-02 6 300
2 C 2021-01-21 6 600
3 D 2021-03-04 6 300
payments
cust_id agreement_id date payment
0 1 A 2020-12-01 200
1 1 A 2021-02-02 200
2 1 A 2021-02-03 100
3 1 A 2021-05-01 200
4 1 B 2021-01-02 50
5 1 B 2021-01-09 20
6 1 B 2021-03-01 80
7 1 B 2021-04-23 90
8 2 C 2021-01-21 600
9 3 D 2021-03-04 150
10 3 D 2021-05-03 150
I want to add another row to the payments dataframe whenever the total payments for an agreement_id in the payments dataframe equal the total_fee for that agreement_id in the agreement dataframe. The new row contains a zero value under payment, and its date is calculated as min(date) (from payments) plus term_months (from agreement).
Here's the results I want for the payments dataframe:
payments
cust_id agreement_id date payment
0 1 A 2020-12-01 200
1 1 A 2021-02-02 200
2 1 A 2021-02-03 100
3 1 A 2021-05-01 200
4 1 B 2021-01-02 50
5 1 B 2021-01-09 20
6 1 B 2021-03-01 80
7 1 B 2021-04-23 90
8 2 C 2021-01-21 600
9 3 D 2021-03-04 150
10 3 D 2021-05-03 150
11 2 C 2021-07-21 0
12 3 D 2021-09-04 0
The additional rows are rows 11 and 12: for agreement_ids 'C' and 'D', the total payments are equal to the total_fee shown in the agreement dataframe.
import pandas as pd
import numpy as np
First, convert the 'date' column of the payments dataframe to datetime dtype with to_datetime():
payments['date']=pd.to_datetime(payments['date'])
Then aggregate the payments per agreement_id with groupby():
newdf=payments.groupby('agreement_id').agg({'payment':'sum','date':'min','cust_id':'first'}).reset_index()
Now use boolean masking to get the rows that meet your condition:
newdf=newdf[agreement['total_fee']==newdf['payment']].assign(payment=np.nan)
Note: assign() is used here to set the payment column of those rows to NaN; it will be filled with 0 later.
Now make use of pd.tseries.offsets.DateOffset() and apply():
newdf['date']=newdf['date']+agreement['term_months'].apply(lambda x:pd.tseries.offsets.DateOffset(months=x))
Note: the above code emits a warning (not an error) because the DateOffset addition is not vectorized; it is safe to ignore.
Finally make use of concat() method and fillna() method:
result=pd.concat((payments,newdf),ignore_index=True).fillna(0)
Now if you print result you will get your desired output
#output
cust_id agreement_id date payment
0 1 A 2020-12-01 200.0
1 1 A 2021-02-02 200.0
2 1 A 2021-02-03 100.0
3 1 A 2021-05-01 200.0
4 1 B 2021-01-02 50.0
5 1 B 2021-01-09 20.0
6 1 B 2021-03-01 80.0
7 1 B 2021-04-23 90.0
8 2 C 2021-01-21 600.0
9 3 D 2021-03-04 150.0
10 3 D 2021-05-03 150.0
11 2 C 2021-07-21 0.0
12 3 D 2021-09-04 0.0
Note: if you want the exact same output, use astype() to change the payment column dtype from float to int:
result['payment']=result['payment'].astype(int)
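For reference, the same steps can be written as one merge-based pipeline that joins on agreement_id instead of relying on index alignment (a sketch, assuming the agreement and payments dataframes from the question and pandas >= 0.25 for named aggregation):
payments['date'] = pd.to_datetime(payments['date'])

# total paid, first payment date and customer per agreement
totals = payments.groupby('agreement_id') \
                 .agg(payment=('payment', 'sum'),
                      date=('date', 'min'),
                      cust_id=('cust_id', 'first')) \
                 .reset_index()

# keep only fully paid agreements and shift the date by term_months
paid_off = totals.merge(agreement, on='agreement_id')
paid_off = paid_off[paid_off['payment'] == paid_off['total_fee']].copy()
paid_off['date'] = paid_off.apply(lambda r: r['date'] + pd.DateOffset(months=r['term_months']), axis=1)
paid_off['payment'] = 0

result = pd.concat([payments, paid_off[['cust_id', 'agreement_id', 'date', 'payment']]],
                   ignore_index=True)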
Hi I have a huge dataframe with the following structure:
ticker calendar-date last-update Assets Ebitda .....
0 a 2001-06-30 2001-09-14 110 1000 .....
1 a 2001-09-30 2002-01-22 0 -8 .....
2 a 2001-09-30 2002-02-01 0 800 .....
3 a 2001-12-30 2002-03-06 120 0 .....
4 b 2001-06-30 2001-09-18 110 0 .....
5 b 2001-06-30 2001-09-27 110 30 .....
6 b 2001-09-30 2002-01-08 140 35 .....
7 b 2001-12-30 2002-03-08 120 40 .....
..
What I want, for each ticker, is to create new columns with the % change in Assets and Ebitda from the previous calendar-date (t-1) and the one before that (t-2), for each row.
But here comes the problems:
1) As you can see, calendar-date is not always unique within a ticker, since there can be more than one last-update for the same calendar-date, but I always want the change since the last calendar-date, not since the last last-update.
2) There are rows with 0 values; in that case I want to use the last observed value to calculate the % change. If I only had one stock that would be easy, I would just ffill the values, but since I have many tickers I cannot do this safely: I could pad a value from ticker 'a' into ticker 'b', and that is not what I want.
I guess this could be solved by creating a function with if statements to handle the data exceptions, or maybe there is a good way to handle this inside pandas... maybe multi-indexing? The truth is that I have no idea how to approach this task. Can anybody help?
Thanks
Step 1
sort_values to ensure proper ordering for later manipulation
icols = ['ticker', 'calendar-date', 'last-update']
df.sort_values(icols, inplace=True)
Step 2
groupby 'ticker', replace zeros with NaN, and forward fill
import numpy as np

vcols = ['Assets', 'Ebitda']
temp = df.groupby('ticker')[vcols].apply(lambda x: x.replace(0, np.nan).ffill())
d1 = df.assign(**temp.to_dict('list'))
d1
ticker calendar-date last-update Assets Ebitda
0 a 2001-06-30 2001-09-14 110.0 1000.0
1 a 2001-09-30 2002-01-22 110.0 -8.0
2 a 2001-09-30 2002-02-01 110.0 800.0
3 a 2001-12-30 2002-03-06 120.0 800.0
4 b 2001-06-30 2001-09-18 110.0 NaN
5 b 2001-06-30 2001-09-27 110.0 30.0
6 b 2001-09-30 2002-01-08 140.0 35.0
7 b 2001-12-30 2002-03-08 120.0 40.0
NOTE: The first 'Ebitda' for 'b' is NaN because there was nothing to forward fill from.
Step 3
groupby ['ticker', 'calendar-date'] and take the last row of each group. Because we sorted above, the last row is the most recently updated one.
d2 = d1.groupby(icols[:2])[vcols].last()
Step 4
groupby again, this time just by 'ticker' which is in the index of d2, and take the pct_change
d3 = d2.groupby(level='ticker').pct_change()
Step 5
join back with df
df.join(d3, on=icols[:2], rsuffix='_pct')
ticker calendar-date last-update Assets Ebitda Assets_pct Ebitda_pct
0 a 2001-06-30 2001-09-14 110 1000 NaN NaN
1 a 2001-09-30 2002-01-22 0 -8 0.000000 -0.200000
2 a 2001-09-30 2002-02-01 0 800 0.000000 -0.200000
3 a 2001-12-30 2002-03-06 120 0 0.090909 0.000000
4 b 2001-06-30 2001-09-18 110 0 NaN NaN
5 b 2001-06-30 2001-09-27 110 30 NaN NaN
6 b 2001-09-30 2002-01-08 140 35 0.272727 0.166667
7 b 2001-12-30 2002-03-08 120 40 -0.142857 0.142857
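The question also asks for the change relative to two calendar-dates back (t-2); the same idea works with pct_change(periods=2) on d2 from Step 3. A sketch, joining both sets of columns with distinct suffixes:
d3_t2 = d2.groupby(level='ticker').pct_change(periods=2)

out = df.join(d3, on=icols[:2], rsuffix='_pct_t1') \
        .join(d3_t2, on=icols[:2], rsuffix='_pct_t2')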
I've got a dataframe that looks like this:
userid date count
a 2016-12-01 4
a 2016-12-03 5
a 2016-12-05 1
b 2016-11-17 14
b 2016-11-18 15
b 2016-11-23 4
The first column is a user id, the second column is a date (resulting from a groupby(pd.TimeGrouper('d'))), and the third column is a daily count. For each user, I would like to ensure that any days missing between that user's min and max date are filled in with a count of 0. So if I start with a data frame like the one above, I end up with a data frame like this:
userid date count
a 2016-12-01 4
a 2016-12-02 0
a 2016-12-03 5
a 2016-12-04 0
a 2016-12-05 1
b 2016-11-17 14
b 2016-11-18 15
b 2016-11-19 0
b 2016-11-20 0
b 2016-11-21 0
b 2016-11-22 0
b 2016-11-23 4
I know that there are various methods available with a pandas data frame to resample (with options to interpolate forwards, backwards, or by averaging), but how would I do this in the sense above, where I want a continuous time series for each userid, but where the dates of the time series differ per user?
Here's what I tried that hasn't worked:
grouped_users = user_daily_counts.groupby('user').set_index('timestamp').resample('d', fill_method = None)
However this throws an error AttributeError: Cannot access callable attribute 'set_index' of 'DataFrameGroupBy' objects, try using the 'apply' method. I'm not sure how I'd be able to use the apply method while bringing forward all columns as I'd like to do.
Thanks for any suggestions!
You can use groupby with resample, but you first need a DatetimeIndex, created by set_index (requires pandas 0.18.1 or higher).
Then fill the NaN values with 0 via asfreq and fillna.
Last, remove the userid column and reset_index:
df = df.set_index('date') \
       .groupby('userid') \
       .resample('D') \
       .asfreq() \
       .fillna(0) \
       .drop('userid', axis=1) \
       .reset_index()
print (df)
userid date count
0 a 2016-12-01 4.0
1 a 2016-12-02 0.0
2 a 2016-12-03 5.0
3 a 2016-12-04 0.0
4 a 2016-12-05 1.0
5 b 2016-11-17 14.0
6 b 2016-11-18 15.0
7 b 2016-11-19 0.0
8 b 2016-11-20 0.0
9 b 2016-11-21 0.0
10 b 2016-11-22 0.0
11 b 2016-11-23 4.0
If you want the count column to have integer dtype, add astype:
df = df.set_index('date') \
.groupby('userid') \
.resample('D') \
.asfreq() \
.fillna(0) \
.drop('userid', axis=1) \
.astype(int) \
.reset_index()
print (df)
userid date count
0 a 2016-12-01 4
1 a 2016-12-02 0
2 a 2016-12-03 5
3 a 2016-12-04 0
4 a 2016-12-05 1
5 b 2016-11-17 14
6 b 2016-11-18 15
7 b 2016-11-19 0
8 b 2016-11-20 0
9 b 2016-11-21 0
10 b 2016-11-22 0
11 b 2016-11-23 4
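Alternatively, on recent pandas versions you can select the count column and use sum() on the resampler, which fills the empty days with 0 directly and should keep the integer dtype (a sketch of the same idea):
df = df.set_index('date') \
       .groupby('userid')['count'] \
       .resample('D') \
       .sum() \
       .reset_index()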
I have a dataframe with two columns--date and id. I'd like to calculate for each date the number of id's on that date which reappear on a later date within 7 days. If I were doing this in postgres, it would look something like:
SELECT df1.date, COUNT(DISTINCT df1.id)
FROM df df1 INNER JOIN df df2
ON df1.id = df2.id AND
df2.date BETWEEN df1.date + 1 AND df1.date + 7
GROUP BY df1.date;
What is problematic for me is how to translate this statement into pandas in a way that is fast and idiomatic.
I've already tried for one-day retention by simply creating a lagged column and merging the original with the lagged dataframe. This certainly works. However, for seven-day retention I would need to create 7 dataframes and merge them together. That's not reasonable, as far as I'm concerned. (Especially because I'd also like to know 30-day numbers.)
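Roughly (just to sketch what I mean; variable names are illustrative, and df has datetime 'date' and 'id' columns), that one-day version looks like this:
lagged = df[['id', 'date']].copy()
lagged['date'] = lagged['date'] - pd.Timedelta(days=1)   # pull next-day rows back one day
merged = df.merge(lagged, on=['id', 'date'])             # rows whose id reappears the next day
one_day = merged.groupby('date')['id'].nunique()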
(I should also point out that my research led me to https://github.com/pydata/pandas/issues/2996, which indicates a merge behavior that does not work on my install (pandas 0.14.0) which fails with error message TypeError: Argument 'values' has incorrect type (expected numpy.ndarray, got Series). So there appears to be some sort of advanced merge/join behavior which I clearly don't know how to activate.)
If I understand you correctly, I think you can do it with a groupby/apply. It's a bit tricky. So I think you have data like the following:
>>> df
date id y
0 2012-01-01 1 0.1
1 2012-01-03 1 0.3
2 2012-01-09 1 0.4
3 2012-01-12 1 0.0
4 2012-01-14 1 0.2
5 2012-01-16 1 0.4
6 2012-01-01 2 0.2
7 2012-01-02 2 0.1
8 2012-01-03 2 0.4
9 2012-01-04 2 0.6
10 2012-01-09 2 0.7
11 2012-01-10 2 0.4
I'm going to create, within each 'id' group, a forward-looking rolling count of the number of times that id shows up in the next 7 days (including the current day):
def count_forward7(g):
    # Add a column to the dataframe so I can set date as the index
    g['foo'] = 1
    # New dataframe with daily frequency, so 7 rows = 7 days
    # If there are no gaps in the dates you don't need to do this
    x = g.set_index('date').resample('D').sum()
    # Andy Hayden's method for a forward-looking rolling window:
    # reverse the frame, take the rolling sum, then reverse the answer back
    fsum = x[::-1].rolling(window=7, min_periods=0).sum()[::-1]
    return pd.DataFrame(fsum[fsum.index.isin(g.date)].values, index=g.index)
>>> df['f7'] = df.groupby('id')[['date']].apply(count_forward7)
>>> df
date id y f7
0 2012-01-01 1 0.1 2
1 2012-01-03 1 0.3 2
2 2012-01-09 1 0.4 3
3 2012-01-12 1 0.0 3
4 2012-01-14 1 0.2 2
5 2012-01-16 1 0.4 1
6 2012-01-01 2 0.2 4
7 2012-01-02 2 0.1 3
8 2012-01-03 2 0.4 3
9 2012-01-04 2 0.6 3
10 2012-01-09 2 0.7 2
11 2012-01-10 2 0.4 1
Now if you want to "calculate for each date the number of id's on that date which reappear on a later date within 7 days", just count for each date where f7 > 1:
>>> df['bool_f77'] = df['f7'] > 1
>>> df.groupby('date')['bool_f77'].sum()
2012-01-01 2
2012-01-02 1
2012-01-03 2
2012-01-04 1
2012-01-09 2
2012-01-10 0
2012-01-12 1
2012-01-14 1
2012-01-16 0
Or something like the following:
>>> df.query('f7 > 1').groupby('date')['date'].count()
date
2012-01-01 2
2012-01-02 1
2012-01-03 2
2012-01-04 1
2012-01-09 2
2012-01-12 1
2012-01-14 1
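One difference between the two outputs: the query version drops dates on which no id reappears (2012-01-10 and 2012-01-16 above), while the boolean sum keeps them as 0. If you want those zeros in the second form as well, you can reindex (a sketch):
counts = df.query('f7 > 1').groupby('date')['date'].count()
counts = counts.reindex(df['date'].drop_duplicates().sort_values(), fill_value=0)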