compare to previous date in pandas - python

I need to build a new column that compares each row to the previous date, where the previous date must follow a special rule: I want to find repeat purchases within the past 3 months. I have no idea how to do this. Here is an example with my expected output.
transaction.csv:
code,transaction_datetime
1,2021-12-01
1,2022-01-24
1,2022-05-29
2,2021-11-20
2,2022-04-12
2,2022-06-02
3,2021-04-23
3,2022-04-22
expected output:
code,transaction_datetime,repeat_purchase_P3M
1,2021-12-01,no
1,2022-01-24,2021-12-01
1,2022-05-29,no
2,2021-11-20,no
2,2022-04-12,no
2,2022-06-02,2022-04-12
3,2021-04-23,no
3,2022-04-22,no

import pandas as pd

df = pd.read_csv('transaction.csv')
df.transaction_datetime = pd.to_datetime(df.transaction_datetime)
grouped = df.groupby('code')['transaction_datetime']
# keep the previous date within each code when the gap is under 90 days, otherwise 'no'
df['repeat_purchase_P3M'] = grouped.shift().dt.date.where(grouped.diff().dt.days < 90, 'no')
df
code transaction_datetime repeat_purchase_P3M
0 1 2021-12-01 no
1 1 2022-01-24 2021-12-01
2 1 2022-05-29 no
3 2 2021-11-20 no
4 2 2022-04-12 no
5 2 2022-06-02 2022-04-12
6 3 2021-04-23 no
7 3 2022-04-22 no
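Note that diff().dt.days < 90 approximates "past 3 months" as a fixed 90 days. A minimal sketch of a calendar-aware variant, assuming an inclusive three-month window is what's intended (pd.DateOffset(months=3) respects actual month lengths); it reproduces the expected output for the sample above:

import pandas as pd

df = pd.read_csv('transaction.csv')
df.transaction_datetime = pd.to_datetime(df.transaction_datetime)
prev = df.groupby('code')['transaction_datetime'].shift()
# a repeat if the previous purchase falls within the last 3 calendar months
within_3m = prev >= df.transaction_datetime - pd.DateOffset(months=3)
df['repeat_purchase_P3M'] = prev.dt.date.where(within_3m, 'no')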

Given a dataframe with event details, return a count of events that occurred on any given date, based on the start and end dates of the events

What I have and am trying to do:
A dataframe, with headers: event_id, location_id, start_date, end_date.
An event can only have one location, start and end.
A location can have multiple events, starts and ends, and they can overlap.
The goal here is to be able to say, given any time T, for location X, how many events were there?
E.g.
Given three events, all for location 2:
Event.    Start.      End.
Event 1   2022-05-01  2022-05-07
Event 2   2022-05-04  2022-05-10
Event 3   2022-05-02  2022-05-05

Time T.     Count of Events
2022-05-01  1
2022-05-02  2
2022-05-03  2
2022-05-04  3
2022-05-05  3
2022-05-06  2
What I have tried so far, but got stuck on:
(I did look at THIS possible solution for a similar problem, and I went pretty far with it, but I got lost in the iterrows and how to have that apply here.)
Try to get an array or dataframe that has a 365-day date range for each location ID.
E.g.
[1,2022-01-01],[1,2022-01-02]........[98,2022-01-01][98,2022-01-02]
Then convert that array to a dataframe, and merge it with the original dataframe like:
index  location  time        event  location2  start       end
0      1         2022-01-01  1      10         2022-11-07  2022-11-12
1      1         2022-01-01  2      4          2022-02-16  2022-03-05
2      1         2022-01-01  3      99         2022-06-10  2022-06-15
3      1         2022-01-01  4      2          2021-12-31  2022-01-05
4      1         2022-01-01  5      5          2022-05-08  2022-05-22
Then perform some kind of reduction that returns the count:
location  Time        Count
1         2022-01-01  10
1         2022-01-02  3
1         2022-01-03  13
...       ...         ...
99        2022-01-01  4
99        2022-01-02  0
99        2022-01-03  7
99        2022-01-04  12
I've done something similar with tying events to other events where their dates overlapped, using .loc[...], but I don't think that would work here, and I'm kind of just stumped.
Where I got stuck was creating an array that combines the location ID and date range, because they're different lengths, and I couldn't figure out how to repeat the location ID for every date in the range.
Anyway, I am 99% positive that there is a much more efficient way of doing this, and really any help at all is greatly appreciated!
Thank you :)
Update per comment
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use a list comprehension to collect, for each date, the events
# whose start/end interval contains it
new_df = pd.DataFrame({'Date': date_range,
                       'Location': [df[df['Start.'].le(date) & df['End.'].ge(date)]['Event.'].tolist()
                                    for date in date_range]})
# the length of each list is the count
new_df['Count'] = new_df['Location'].str.len()
Date Location Count
0 2022-05-01 [Event 1] 1
1 2022-05-02 [Event 1, Event 3] 2
2 2022-05-03 [Event 1, Event 3] 2
3 2022-05-04 [Event 1, Event 2, Event 3] 3
4 2022-05-05 [Event 1, Event 2, Event 3] 3
5 2022-05-06 [Event 1, Event 2] 2
6 2022-05-07 [Event 1, Event 2] 2
7 2022-05-08 [Event 2] 1
8 2022-05-09 [Event 2] 1
9 2022-05-10 [Event 2] 1
IIUC you can try something like
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use a list comprehension to count the events whose interval contains each date
# .le is less than or equal to
# .ge is greater than or equal to
new_df = pd.DataFrame({'Date': date_range,
                       'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
                                 for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
6 2022-05-07 2
7 2022-05-08 1
8 2022-05-09 1
9 2022-05-10 1
Depending on how large your date range is we may need to take a different approach as things may get slow if you have a range of two years instead of 10 days in the example.
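For long ranges, one faster option (not from the thread, just a sketch assuming 'Start.' and 'End.' are already parsed as datetimes) is to sort the endpoints once and count active events with searchsorted instead of scanning every row per date:

import numpy as np
import pandas as pd

starts = np.sort(df['Start.'].values)
ends = np.sort(df['End.'].values)
date_range = pd.date_range(starts.min(), ends.max())
# active events at date d = (# starts <= d) - (# ends < d)
counts = (np.searchsorted(starts, date_range.values, side='right')
          - np.searchsorted(ends, date_range.values, side='left'))
new_df = pd.DataFrame({'Date': date_range, 'Count': counts})

This is O((n + m) log n) for n events and m dates, rather than O(n * m).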
You can also use a custom date range if you do not want to use the min and max values from the whole frame:
min_date = '2022-05-01'
max_date = '2022-05-06'
# create a date range
date_range = pd.date_range(min_date, max_date)
# use a list comprehension to count the events whose interval contains each date
new_df = pd.DataFrame({'Date': date_range,
                       'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
                                 for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
Note - I wanted to leave the original question up as is, and I was out of space, so I am answering my own question here, but @It_is_Chris is the real MVP.
Update! - with the enormous help from @It_is_Chris and some additional messing around, I was able to use the following code to generate the output I wanted:
# get the min and max dates
min_date, max_date = original_df[['start', 'end']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# create the location range
loc_range = original_df['location'].unique()
# create a new list that combines every date with every location
combined_list = []
for item in date_range:
    for location in loc_range:
        combined_list.append(
            {
                'Date': item,
                'location': location
            }
        )
# convert the list to a dataframe
combined_df = pd.DataFrame(combined_list)
# use merge to put the original data together with the new dataframe
merged_df = pd.merge(combined_df, original_df, how="left", on="location")
# use loc to directly connect each event to a specific location and time
merged_df = merged_df.loc[(pd.to_datetime(merged_df['Date']) >= pd.to_datetime(merged_df['start']))
                          & (pd.to_datetime(merged_df['Date']) <= pd.to_datetime(merged_df['end']))]
# use groupby to push out the table as sought: Date - location - Count
output_merged_df = merged_df.groupby(['Date', 'location']).size()
The result looked like this:
Note - the sorting was not as I have it here, I believe I would need to add some additional sorting to the dataframe before outputting as a CSV.
Date        location  count
2022-01-01  1         1
2022-01-01  2         4
2022-01-01  3         1
2022-01-01  4         10
2022-01-01  5         3
2022-01-01  6         1
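The nested loops that build combined_list can also be replaced with a single cross product; a sketch under the same assumptions (an original_df with location, start, and end columns; the sample data below is hypothetical):

import pandas as pd

original_df = pd.DataFrame({'location': [2, 2, 2],
                            'start': ['2022-05-01', '2022-05-04', '2022-05-02'],
                            'end': ['2022-05-07', '2022-05-10', '2022-05-05']})
original_df[['start', 'end']] = original_df[['start', 'end']].apply(pd.to_datetime)
min_date, max_date = original_df[['start', 'end']].stack().agg([min, max])
# build every (Date, location) pair in one shot instead of nested Python loops
grid = pd.MultiIndex.from_product(
    [pd.date_range(min_date, max_date), original_df['location'].unique()],
    names=['Date', 'location']).to_frame(index=False)
merged = grid.merge(original_df, on='location', how='left')
# keep rows where the event interval contains the date, then count
active = merged['Date'].between(merged['start'], merged['end'])
counts = merged[active].groupby(['Date', 'location']).size()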

How to filter only the first Friday in a column with one week of data (Pandas)

I have a column that contains Friday-to-Friday dates, e.g. Fri March 4 to Fri March 11. I only want to filter out the earliest Friday date. Any suggestions? I figured out a way to drop the min value, but I feel like there's a better method:
df['Submitted On'] = pd.to_datetime(df['Submitted On'])
early = df['Submitted On'].min()
df = df.loc[df['Submitted On'] != early]
Although I don't know the use case for your data, your method is a little brittle. If for some reason the range of dates in your column changes, then you're filtering out the earliest date regardless of whether it's a Friday or not.
You can use the .dt.dayofweek method for Series which will return integers 0 through 6 for the day of the week meaning Friday is 4, and filter based on the first occurrence of a Friday. For example:
df = pd.DataFrame({'Submitted On': pd.date_range('2022-03-04','2022-03-11'), 'value':range(8)})
df['Submitted On'] = pd.to_datetime(df['Submitted On'])
filtered_df = df.drop(labels=df[df['Submitted On'].dt.dayofweek == 4].index.values[0])
Result:
Submitted On value
1 2022-03-05 1
2 2022-03-06 2
3 2022-03-07 3
4 2022-03-08 4
5 2022-03-09 5
6 2022-03-10 6
7 2022-03-11 7
And note that if I change the date range slightly, it still drops the first Friday:
df = pd.DataFrame({'Submitted On': pd.date_range('2022-03-03','2022-03-12'), 'value':range(10)})
filtered_df = df.drop(labels=df[df['Submitted On'].dt.dayofweek == 4].index.values[0])
Result:
Submitted On value
0 2022-03-03 0
2 2022-03-05 2
3 2022-03-06 3
4 2022-03-07 4
5 2022-03-08 5
6 2022-03-09 6
7 2022-03-10 7
8 2022-03-11 8
9 2022-03-12 9
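If the rows are not guaranteed to be in chronological order, a variant (just a sketch, assuming at least one Friday is present) that drops the chronologically earliest Friday regardless of row position:

fridays = df[df['Submitted On'].dt.dayofweek == 4]
filtered_df = df.drop(index=fridays['Submitted On'].idxmin())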

How to convert one record per change to continuous data?

My data looks like this:
print(df)
DateTime, Status
'2021-09-01', 0
'2021-09-05', 1
'2021-09-07', 0
And I need it to look like this:
print(df_desired)
DateTime, Status
'2021-09-01', 0
'2021-09-02', 0
'2021-09-03', 0
'2021-09-04', 0
'2021-09-05', 1
'2021-09-06', 1
'2021-09-07', 0
Right now I accomplish this using pandas like this:
new_index = pd.DataFrame(index = pd.date_range(df.index[0], df.index[-1], freq='D'))
df = new_index.join(df).ffill()
Missing values before the first record in any column are imputed using the inverse of the first record in that column; because the data is binary and only shows change-points, this is guaranteed to be correct.
To my understanding my desired dataframe contains "continuous" data, but I'm not sure what to call the data structure in my source data.
The problem:
When I do this to a dataframe that has a frequency of one record per second and I want to load a year's worth of data, my memory overflows (92 GB required, ~60 GB available). I'm not sure if there is a standard procedure for this that I don't know the name of and cannot find using Google, or if I'm using the join method wrong, but this seems horribly inefficient; the resulting dataframe is only a few hundred megabytes after this operation. Any feedback on this would be great!
Use DataFrame.asfreq working with DatetimeIndex:
df = df.set_index('DateTime').asfreq('d', method='ffill').reset_index()
print (df)
DateTime Status
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 1
5 2021-09-06 1
6 2021-09-07 0
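For the one-record-per-second case mentioned in the question, the same asfreq pattern applies; a sketch, assuming Status fits in an 8-bit integer, that downcasts before upsampling so the Status column takes one byte per row instead of eight:

import pandas as pd

df = pd.DataFrame({'DateTime': pd.to_datetime(['2021-09-01', '2021-09-05', '2021-09-07']),
                   'Status': [0, 1, 0]})
df['Status'] = df['Status'].astype('int8')  # downcast before expanding
df = df.set_index('DateTime').asfreq('s', method='ffill').reset_index()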
You can use this pipeline:
(df.set_index('DateTime')
   .reindex(pd.date_range(df['DateTime'].min(), df['DateTime'].max()))
   .rename_axis('DateTime')
   .ffill(downcast='infer')
   .reset_index()
)
output:
DateTime Status
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 1
5 2021-09-06 1
6 2021-09-07 0
input:
DateTime Status
0 2021-09-01 0
1 2021-09-05 1
2 2021-09-07 0

Add Missing Dates to Time Series ID's in Pandas

I have the following data frame:
df = pd.DataFrame([[66991, '2020-06-01', 2],
                   [66991, '2020-06-02', 1],
                   [66991, '2020-07-03', 1],
                   [44551, '2020-10-01', 1],
                   [66991, '2020-12-05', 7],
                   [44551, '2020-12-05', 5],
                   [66991, '2020-12-01', 1],
                   [66991, '2021-01-08', 3]], columns=['ID', 'DATE', 'QTD'])
How can I add the months (in which QTD is zero) to each ID? (Ideally I would like the columns BALANCE and CC to keep the previous value for each ID on the added rows, but this is not strictly necessary, as I am more interested in the QTD and VAL columns.)
I thought about maybe resampling the data by month for each ID and then merging that data frame with the one above. Is this a good implementation? Is there a better way to achieve this result?
Should end up similar to this:
df = pd.DataFrame([[66991, '2020-06-01', 2],
                   [66991, '2020-06-02', 1],
                   [66991, '2020-07-03', 1],
                   [66991, '2020-08-01', 0],
                   [66991, '2020-09-01', 0],
                   [66991, '2020-10-01', 0],
                   [44551, '2020-10-01', 1],
                   [44551, '2020-11-05', 0],
                   [66991, '2020-11-01', 0],
                   [66991, '2020-12-05', 7],
                   [44551, '2020-12-05', 5],
                   [66991, '2020-12-01', 1],
                   [66991, '2021-01-08', 3]], columns=['ID', 'DATE', 'QTD'])
You can generate a range of dates by ID using pd.date_range, then create a pd.MultiIndex so you can do a reindex:
# ensure DATE is datetime so date_range and pd.Grouper work
df["DATE"] = pd.to_datetime(df["DATE"])
s = pd.MultiIndex.from_tuples([(i, x) for i, j in df.groupby("ID")
                               for x in pd.date_range(min(j["DATE"]), max(j["DATE"]), freq="MS")],
                              names=["ID", "DATE"])
df = df.set_index(["ID", "DATE"])
print (df.reindex(df.index.union(s), fill_value=0)
         .reset_index()
         .groupby(["ID", pd.Grouper(key="DATE", freq="M")], as_index=False)
         .apply(lambda i: i[i["QTD"].ne(0) | (len(i) == 1)])
         .droplevel(0))
ID DATE QTD
0 44551 2020-10-01 1
1 44551 2020-11-01 0
3 44551 2020-12-05 5
4 66991 2020-06-01 2
5 66991 2020-06-02 1
7 66991 2020-07-03 1
8 66991 2020-08-01 0
9 66991 2020-09-01 0
10 66991 2020-10-01 0
11 66991 2020-11-01 0
12 66991 2020-12-01 1
13 66991 2020-12-05 7
15 66991 2021-01-08 3
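An alternative that may be easier to follow; a sketch assuming the same ID/DATE/QTD columns: for each ID, append a zero-QTD row at the start of every month that has no records at all.

import pandas as pd

df["DATE"] = pd.to_datetime(df["DATE"])

def add_month_starts(g):
    # month starts spanning this ID's history
    months = pd.date_range(g["DATE"].min(), g["DATE"].max(), freq="MS")
    # months already represented by at least one record
    present = set(g["DATE"].dt.to_period("M"))
    missing = [m for m in months if m.to_period("M") not in present]
    filler = pd.DataFrame({"ID": g.name, "DATE": missing, "QTD": 0})
    return pd.concat([g, filler]).sort_values("DATE")

out = (df.groupby("ID", group_keys=False).apply(add_month_starts)
         .reset_index(drop=True))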

python - Fill in missing dates with respect to a specific attribute in pandas

My data looks like below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-25,44
2,2016-10-27,12
I want to fill in the missing dates within each id.
For example, the date range of id=1 is 2016-10-24 ~ 2016-10-28, and 2016-10-26 is missing. Moreover, the date range of id=2 is 2016-10-21 ~ 2016-10-27, and 2016-10-23, 2016-10-24 and 2016-10-26 are missing.
I want to fill in the missing dates and fill in the target value as 0.
Therefore, I want my data to be as below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-26,0
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-23,0
2,2016-10-24,0
2,2016-10-25,44
2,2016-10-26,0
2,2016-10-27,12
Can somebody help me?
Thanks in advance.
You can use groupby with resample - then the problem is fillna, so you need asfreq first:
# if necessary, convert to datetime
df.date = pd.to_datetime(df.date)
df = df.set_index('date')
df = df.groupby('id').resample('d')['target'].asfreq().fillna(0).astype(int).reset_index()
print (df)
id date target
0 1 2016-10-24 22
1 1 2016-10-25 31
2 1 2016-10-26 0
3 1 2016-10-27 44
4 1 2016-10-28 12
5 2 2016-10-21 22
6 2 2016-10-22 31
7 2 2016-10-23 0
8 2 2016-10-24 0
9 2 2016-10-25 44
10 2 2016-10-26 0
11 2 2016-10-27 12
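The fillna step can also be folded into asfreq, which accepts a fill_value directly; a sketch of that variant, starting again from the raw frame:

df.date = pd.to_datetime(df.date)
df = (df.set_index('date')
        .groupby('id')['target']
        .resample('d')
        .asfreq(fill_value=0)
        .reset_index())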
