I have data that looks like this
date ticker x y
0 2018-01-31 ABC 1 5
1 2019-01-31 ABC 2 6
2 2018-01-31 XYZ 3 7
3 2019-01-31 XYZ 4 8
So it is a panel of yearly observations. I want to upsample to a monthly frequency and forward fill the new observations. So ABC would look like
date ticker x y
0 2018-01-31 ABC 1 5
1 2018-02-28 ABC 1 5
...
22 2019-11-30 ABC 2 6
23 2019-12-31 ABC 2 6
Notice that I want to fill through the last year, not just up until the last date.
Right now I am doing something like
newidx = df.groupby('ticker')['date'].apply(
    lambda x: pd.Series(pd.date_range(x.min(), x.max() + YearEnd(1), freq='M'))
).reset_index()
newidx.drop('level_1', axis=1, inplace=True)
df = pd.merge(newidx, df, on=['date', 'ticker'], how='left')
This is obviously a terrible way to do this. It's really slow, but it works. What is the proper way to handle this?
Your approach might be slow because it needs a groupby and then a merge. Let's try another option with reindex, so you only need the groupby:
(df.set_index('date')
   .groupby('ticker')
   .apply(lambda x: x.reindex(pd.date_range(x.index.min(), x.index.max() + YearEnd(1), freq='M'),
                              method='ffill'))
   .reset_index('ticker', drop=True)
   .reset_index()
)
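For reference, a minimal setup that reproduces the example above (note that YearEnd, used in both snippets, has to be imported from pandas.tseries.offsets):

import pandas as pd
from pandas.tseries.offsets import YearEnd

# sample panel from the question, with 'date' already parsed as datetimes
df = pd.DataFrame({
    'date': pd.to_datetime(['2018-01-31', '2019-01-31', '2018-01-31', '2019-01-31']),
    'ticker': ['ABC', 'ABC', 'XYZ', 'XYZ'],
    'x': [1, 2, 3, 4],
    'y': [5, 6, 7, 8],
})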
I need to build a new column that compares each date with the previous one, where the previous date must follow a special rule: I need to find repeat purchases within the past 3 months. I have no idea how to do this. Here is an example and my expected output.
transaction.csv:
code,transaction_datetime
1,2021-12-01
1,2022-01-24
1,2022-05-29
2,2021-11-20
2,2022-04-12
2,2022-06-02
3,2021-04-23
3,2022-04-22
expected output:
code,transaction_datetime,repeat_purchase_P3M
1,2021-12-01,no
1,2022-01-24,2021-12-01
1,2022-05-29,no
2,2021-11-20,no
2,2022-04-12,no
2,2022-06-02,2022-04-12
3,2021-04-23,no
3,2022-04-22,no
df = pd.read_csv('transaction.csv')
df.transaction_datetime = pd.to_datetime(df.transaction_datetime)
grouped = df.groupby('code')['transaction_datetime']
df['repeat_purchase_P3M'] = grouped.shift().dt.date.where(grouped.diff().dt.days < 90, 'no')
df
code transaction_datetime repeat_purchase_P3M
0 1 2021-12-01 no
1 1 2022-01-24 2021-12-01
2 1 2022-05-29 no
3 2 2021-11-20 no
4 2 2022-04-12 no
5 2 2022-06-02 2022-04-12
6 3 2021-04-23 no
7 3 2022-04-22 no
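Note that grouped.diff().dt.days < 90 approximates "past 3 months" with a fixed 90-day window. If you need true calendar months instead, a sketch of the same idea using pd.DateOffset (reusing the grouped object from above):

prev = grouped.shift()
# a repeat purchase: the previous transaction for this code falls within 3 calendar months
within_3m = df.transaction_datetime <= prev + pd.DateOffset(months=3)
df['repeat_purchase_P3M'] = prev.dt.date.where(within_3m, 'no')

(For the first purchase per code, prev is NaT, the comparison is False, and those rows get 'no' as well.)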
My data looks like this:
print(df)
DateTime, Status
'2021-09-01', 0
'2021-09-05', 1
'2021-09-07', 0
And I need it to look like this:
print(df_desired)
DateTime, Status
'2021-09-01', 0
'2021-09-02', 0
'2021-09-03', 0
'2021-09-04', 0
'2021-09-05', 1
'2021-09-06', 1
'2021-09-07', 0
Right now I accomplish this using pandas like this:
new_index = pd.DataFrame(index = pd.date_range(df.index[0], df.index[-1], freq='D'))
df = new_index.join(df).ffill()
Missing values before the first record in any column are imputed with the inverse of that first record; because the data is binary and only records change-points, this is guaranteed to be correct.
To my understanding, my desired dataframe contains "continuous" data, but I'm not sure what to call the data structure in my source data.
The problem:
When I do this on a dataframe with one record per second and load a year's worth of data, my memory overflows (92 GB required, ~60 GB available). I'm not sure whether there is a standard procedure for this that I don't know the name of and can't find on Google, or whether I'm using the join method wrong, but this seems horribly inefficient: the resulting dataframe is only a few hundred megabytes after the operation. Any feedback on this would be great!
Use DataFrame.asfreq, which works with a DatetimeIndex:
df = df.set_index('DateTime').asfreq('d', method='ffill').reset_index()
print (df)
DateTime Status
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 1
5 2021-09-06 1
6 2021-09-07 0
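If memory is still a concern at one-second resolution, one thing that should help (assuming Status really only holds small values like 0/1) is downcasting it before the upsampling, so each filled value takes 1 byte instead of 8:

df['Status'] = df['Status'].astype('int8')  # shrink the column before it gets expanded
df = df.set_index('DateTime').asfreq('d', method='ffill').reset_index()

The same pattern applies with a per-second frequency alias in asfreq.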
You can use this pipeline:
(df.set_index('DateTime')
   .reindex(pd.date_range(df['DateTime'].min(), df['DateTime'].max()))
   .rename_axis('DateTime')
   .ffill(downcast='infer')
   .reset_index()
)
output:
DateTime Status
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 1
5 2021-09-06 1
6 2021-09-07 0
input:
DateTime Status
0 2021-09-01 0
1 2021-09-05 1
2 2021-09-07 0
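Both answers assume DateTime is already a datetime64 column; if it comes in as strings (as in the quoted sample data), parse it first:

df['DateTime'] = pd.to_datetime(df['DateTime'])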
I have the following data frame:
df = pd.DataFrame([[66991, '2020-06-01', 2],
                   [66991, '2020-06-02', 1],
                   [66991, '2020-07-03', 1],
                   [44551, '2020-10-01', 1],
                   [66991, '2020-12-05', 7],
                   [44551, '2020-12-05', 5],
                   [66991, '2020-12-01', 1],
                   [66991, '2021-01-08', 3]], columns=['ID', 'DATE', 'QTD'])
How can I add the months (in which QTD is zero) to each ID? (Ideally I would like the columns BALANCE and CC to keep the previous value for each ID on the added rows, but this is not strictly necessary as I am more interested in the QTD and VAL columns.)
I thought about resampling the data by month for each ID into a separate data frame and then merging that data frame with the one above. Is this a good implementation? Is there a better way to achieve this result?
Should end up similar to this:
df = pd.DataFrame([[66991, '2020-06-01', 2],
                   [66991, '2020-06-02', 1],
                   [66991, '2020-07-03', 1],
                   [66991, '2020-08-01', 0],
                   [66991, '2020-09-01', 0],
                   [66991, '2020-10-01', 0],
                   [44551, '2020-10-01', 1],
                   [44551, '2020-11-05', 0],
                   [66991, '2020-11-01', 0],
                   [66991, '2020-12-05', 7],
                   [44551, '2020-12-05', 5],
                   [66991, '2020-12-01', 1],
                   [66991, '2021-01-08', 3]], columns=['ID', 'DATE', 'QTD'])
You can generate a range of dates by ID using pd.date_range, then create a pd.MultiIndex so you can do a reindex:
s = pd.MultiIndex.from_tuples(
    [(i, x) for i, j in df.groupby("ID")
            for x in pd.date_range(min(j["DATE"]), max(j["DATE"]), freq="MS")],
    names=["ID", "DATE"])

df = df.set_index(["ID", "DATE"])

print (df.reindex(df.index.union(s), fill_value=0)
         .reset_index()
         .groupby(["ID", pd.Grouper(key="DATE", freq="M")], as_index=False)
         .apply(lambda i: i[i["QTD"].ne(0) | (len(i) == 1)])
         .droplevel(0))
ID DATE QTD
0 44551 2020-10-01 1
1 44551 2020-11-01 0
3 44551 2020-12-05 5
4 66991 2020-06-01 2
5 66991 2020-06-02 1
7 66991 2020-07-03 1
8 66991 2020-08-01 0
9 66991 2020-09-01 0
10 66991 2020-10-01 0
11 66991 2020-11-01 0
12 66991 2020-12-01 1
13 66991 2020-12-05 7
15 66991 2021-01-08 3
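One assumption in the snippet above: pd.date_range and pd.Grouper(freq="M") want DATE as actual datetimes, while the sample frame builds it from strings, so convert it before creating the MultiIndex:

# parse DATE before building s and reindexing, otherwise string and Timestamp
# labels will not line up in the union
df['DATE'] = pd.to_datetime(df['DATE'])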
I have a dataframe like this; how do I sort it?
df = pd.DataFrame({'Date':['Oct20','Nov19','Jan19','Sep20','Dec20']})
Date
0 Oct20
1 Nov19
2 Jan19
3 Sep20
4 Dec20
I'm familiar with sorting a list of date strings:
a.sort(key=lambda date: datetime.strptime(date, "%d-%b-%y"))
Any thoughts? Should I split it?
First convert the column to datetimes, then get the positions of the sorted values with Series.argsort, which is used to change the ordering via DataFrame.iloc:
df = df.iloc[pd.to_datetime(df['Date'], format='%b%y').argsort()]
print (df)
Date
2 Jan19
1 Nov19
3 Sep20
0 Oct20
4 Dec20
Details:
print (pd.to_datetime(df['Date'], format='%b%y'))
0 2020-10-01
1 2019-11-01
2 2019-01-01
3 2020-09-01
4 2020-12-01
Name: Date, dtype: datetime64[ns]
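On pandas 1.1+ you can also let sort_values do the reordering directly; a shorter sketch of the same conversion:

df = df.sort_values('Date', key=lambda s: pd.to_datetime(s, format='%b%y'))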
I have some consumer purchase data that looks like
CustomerID InvoiceDate
13654.0 2011-07-17 13:29:00
14841.0 2010-12-16 10:28:00
19543.0 2011-10-18 16:58:00
12877.0 2011-06-15 13:34:00
15073.0 2011-06-06 12:33:00
I'm interested in the rate at which customers purchase. I'd like to group by each customer and then determine how many purchases were made in each quarter (let's say each quarter is every 3 months starting in January).
I could just define when each quarter starts and ends and make another column. I'm wondering if I could instead use groupby to achieve the same thing.
Presently, this is how I do it:
r = data.groupby('CustomerID')
frames = []
for name, frame in r:
    f = frame.set_index('InvoiceDate').resample('QS').count()
    f['CustomerID'] = name
    frames.append(f)
g = pd.concat(frames)
UPDATE:
In [43]: df.groupby(['CustomerID', pd.Grouper(key='InvoiceDate', freq='QS')]) \
.size() \
.reset_index(name='Count')
Out[43]:
CustomerID InvoiceDate Count
0 12877.0 2011-04-01 1
1 13654.0 2011-07-01 1
2 14841.0 2010-10-01 1
3 15073.0 2011-04-01 1
4 19543.0 2011-10-01 1
Is that what you want?
In [39]: df.groupby(pd.Grouper(key='InvoiceDate', freq='QS')).count()
Out[39]:
CustomerID
InvoiceDate
2010-10-01 1
2011-01-01 0
2011-04-01 2
2011-07-01 1
2011-10-01 1
I think this is the best I will be able to do:
data.groupby('CustomerID').apply(lambda x: x.set_index('InvoiceDate').resample('QS').count())
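A slightly shorter variant of the same idea, as a sketch: group first, then resample on the DatetimeIndex, using size() if you only want the number of purchases per customer and quarter:

g = (data.set_index('InvoiceDate')
         .groupby('CustomerID')
         .resample('QS')
         .size())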