I have a dataframe representing all changes that have been made to a record over time. Among other things, this dataframe contains a record id (not unique here, and not meant to be, since it tracks multiple changes to the same record in a different table), a startdate and an enddate. The enddate is only included if it is known/preset; often it is not. I would like to map the enddate of each change record to the startdate of the next record in the dataframe with the same id.
>>> thing = pd.DataFrame([
... {'id':1,'startdate':date(2021,1,1),'enddate':date(2022,1,1)},
... {'id':1,'startdate':date(2021,3,24),'enddate':None},
... {'id':1,'startdate':date(2021,5,26),'enddate':None},
... {'id':2,'startdate':date(2021,2,2),'enddate':None},
... {'id':2,'startdate':date(2021,11,26),'enddate':None}
... ])
>>> thing
id startdate enddate
0 1 2021-01-01 2022-01-01
1 1 2021-03-24 None
2 1 2021-05-26 None
3 2 2021-02-02 None
4 2 2021-11-26 None
The dataframe is already sorted by the creation timestamp of the record and the id. I tried this:
thing['enddate'] = thing.groupby('id')['startdate'].apply(lambda x: x.shift())
However, the above code only fills in around 10,000 of my 120,000 rows, the majority of which would have an enddate if I were to do this comparison by hand. Can anyone think of a better way to perform this kind of manipulation? For reference, given the dataframe above I'd like to create this one:
>>> thing
id startdate enddate
0 1 2021-01-01 2021-03-24
1 1 2021-03-24 2021-05-26
2 1 2021-05-26 None
3 2 2021-02-02 2021-11-26
4 2 2021-11-26 None
The idea is that once this transformation is done, I'll have a timeframe between which the configurations stored in the other columns (not important for this question) were in place.
Here is one way to do it: use transform with the groupby to assign the shifted values back to the rows comprising each group.
df['enddate']=df.groupby(['id'])['startdate'].transform(lambda x: x.shift(-1))
df
id startdate enddate
0 1 2021-01-01 2021-03-24
1 1 2021-03-24 2021-05-26
2 1 2021-05-26 NaT
3 2 2021-02-02 2021-11-26
4 2 2021-11-26 NaT
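As a side note, the same result can be obtained slightly more directly, since a grouped shift already aligns with the original index; a minimal sketch assuming the frame above:
# shift startdate back by one row within each id group; the last row of each group gets NaT
df['enddate'] = df.groupby('id')['startdate'].shift(-1)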
What I have and am trying to do:
A dataframe, with headers: event_id, location_id, start_date, end_date.
An event can only have one location, start and end.
A location can have multiple events, starts and ends, and they can overlap.
The goal here is to be able to say, given any time T, for location X, how many events were there?
E.g.
Given three events, all for location 2:
Event.     Start.       End.
Event 1    2022-05-01   2022-05-07
Event 2    2022-05-04   2022-05-10
Event 3    2022-05-02   2022-05-05
Time T.      Count of Events
2022-05-01   1
2022-05-02   2
2022-05-03   2
2022-05-04   3
2022-05-05   3
2022-05-06   2
**What I have tried so far, but got stuck on:**
(I did look at THIS possible solution for a similar problem, and I went pretty far with it, but I got lost in the iterrows and how to have that apply here.)
Try to get an array or dataframe that has a 365-day date range for each location ID.
E.g.
[1,2022-01-01],[1,2022-01-02]........[98,2022-01-01][98,2022-01-02]
Then convert that array to a dataframe, and merge it with the original dataframe like:
index  location  time        event  location2  start       end
0      1         2022-01-01  1      10         2022-11-07  2022-11-12
1      1         2022-01-01  2      4          2022-02-16  2022-03-05
2      1         2022-01-01  3      99         2022-06-10  2022-06-15
3      1         2022-01-01  4      2          2021-12-31  2022-01-05
4      1         2022-01-01  5      5          2022-05-08  2022-05-22
Then perform some kind of reduction that returns the count:
location  Time        Count
1         2022-01-01  10
1         2022-01-02  3
1         2022-01-03  13
..        ...         ...
99        2022-01-01  4
99        2022-01-02  0
99        2022-01-03  7
99        2022-01-04  12
I've done something similar with tying events to other events where their dates overlapped, using .loc[...], but I don't think that would work here, and I'm kind of just stumped.
Where I got stuck was creating an array that combines the location ID and the date range, because they're different lengths, and I couldn't figure out how to repeat the location ID for every date in the range.
Anyway, I am 99% positive that there is a much more efficient way of doing this, and really any help at all is greatly appreciated!
Thank you :)
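For what it's worth, the location/date pairing described above can be built without nested loops; a minimal sketch, assuming a frame named original_df with a 'location' column (as in the update further down) and a hypothetical 365-day range:
import pandas as pd

# every (location, date) combination as a two-column frame
locations = original_df['location'].unique()
dates = pd.date_range('2022-01-01', '2022-12-31')  # hypothetical year-long range
combined_df = (pd.MultiIndex
               .from_product([locations, dates], names=['location', 'Date'])
               .to_frame(index=False))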
Update per comment
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use a list comprehension to collect, for each date, the events whose start/end interval contains it
new_df = pd.DataFrame({'Date': date_range,
'Location': [df[df['Start.'].le(date) & df['End.'].ge(date)]['Event.'].tolist()
for date in date_range]})
# get the length of each list, which is the count
new_df['Count'] = new_df['Location'].str.len()
Date Location Count
0 2022-05-01 [Event 1] 1
1 2022-05-02 [Event 1, Event 3] 2
2 2022-05-03 [Event 1, Event 3] 2
3 2022-05-04 [Event 1, Event 2, Event 3] 3
4 2022-05-05 [Event 1, Event 2, Event 3] 3
5 2022-05-06 [Event 1, Event 2] 2
6 2022-05-07 [Event 1, Event 2] 2
7 2022-05-08 [Event 2] 1
8 2022-05-09 [Event 2] 1
9 2022-05-10 [Event 2] 1
IIUC you can try something like
# get the min and max dates
min_date, max_date = df[['Start.', 'End.']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
# df.le is less than or equal to
# df.ge is greater than or equal to
new_df = pd.DataFrame({'Date': date_range,
'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
6 2022-05-07 2
7 2022-05-08 1
8 2022-05-09 1
9 2022-05-10 1
Depending on how large your date range is, we may need to take a different approach, as things may get slow if you have a range of two years instead of the 10 days in the example.
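If the range does get large, one option is to avoid checking every event against every date and instead accumulate interval starts and ends; a rough sketch, assuming the 'Start.' and 'End.' columns are datetime64 and reusing min_date and max_date from above:
date_range = pd.date_range(min_date, max_date)
# +1 on each start date, -1 on the day after each end date, then a running total
starts = df['Start.'].value_counts().reindex(date_range, fill_value=0)
ends = (df['End.'] + pd.Timedelta(days=1)).value_counts().reindex(date_range, fill_value=0)
new_df = (starts - ends).cumsum().rename('Count').rename_axis('Date').reset_index()
This does one pass over the events and one over the dates, instead of one pass over the events per date.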
You can also use a custom date range if you do not want to use the min and max values from the whole frame
min_date = '2022-05-01'
max_date = '2022-05-06'
# create a date range
date_range = pd.date_range(min_date, max_date)
# use list comprehension to get the count of dates that are between start and end
new_df = pd.DataFrame({'Date': date_range,
'Count': [sum(df['Start.'].le(date) & df['End.'].ge(date))
for date in date_range]})
Date Count
0 2022-05-01 1
1 2022-05-02 2
2 2022-05-03 2
3 2022-05-04 3
4 2022-05-05 3
5 2022-05-06 2
Note - I wanted to leave the original question up as is, and I was out of space, so I am answering my own question here, but #It_is_Chris is the real MVP.
Update! - with the enormous help from #It_is_Chris and some additional messing around, I was able to use the following code to generate the output I wanted:
# get the min and max dates
min_date, max_date = original_df[['start', 'end']].stack().agg([min, max])
# create a date range
date_range = pd.date_range(min_date, max_date)
# create location range
loc_range = original_df['location'].unique()
# create a new list that combines every date with every location
combined_list = []
for item in date_range:
for location in loc_range:
combined_list.append(
{
'Date':item,
'location':location
}
)
# convert the list to a dataframe
combined_df = pd.DataFrame(combined_list)
# use merge to put original data together with the new dataframe
merged_df = pd.merge(combined_df,original_df, how="left", on="location")
# use loc to directly connect each event to a specific location and time
merged_df = merged_df.loc[
    (pd.to_datetime(merged_df['Date']) >= pd.to_datetime(merged_df['start']))
    & (pd.to_datetime(merged_df['Date']) <= pd.to_datetime(merged_df['end']))
]
# use groupby to push out a table as sought: Date - Location - Count
output_merged_df = merged_df.groupby(['Date','location']).size()
The result looked like this (note: the sorting was not as I have it here; I believe I would need to add some additional sorting to the dataframe before outputting it as a CSV, as in the sketch after the table):
Date        location  count
2022-01-01  1         1
2022-01-01  2         4
2022-01-01  3         1
2022-01-01  4         10
2022-01-01  5         3
2022-01-01  6         1
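For the sorting mentioned above, a minimal sketch (assuming output_merged_df is the groupby(...).size() result from the code above, and a hypothetical counts.csv as the output file):
# turn the grouped Series into columns, sort, and write out
output = output_merged_df.rename('Count').reset_index().sort_values(['Date', 'location'])
output.to_csv('counts.csv', index=False)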
There are segments of readings that have faulty data, and I want to remove entire days which contain at least one. I already created the column with True and False flagging whether that segment is wrong.
Example of the dataframe below, since it has more than 100k rows:
power_c power_g temperature to_delete
date_time
2019-01-01 00:00:00+00:00 2985 0 10.1 False
2019-01-01 00:05:00+00:00 2258 0 10.1 True
2019-01-01 01:00:00+00:00 2266 0 10.1 False
2019-01-02 00:15:00+00:00 3016 0 10.0 False
2019-01-03 01:20:00+00:00 2265 0 10.0 True
For example, the first and second rows belong to the same hour of the same day; one of the values is True, so I want to delete all rows of that day.
Data always comes in 5-minute steps, so I tried to delete 288 items after the True, but since the error is not at the start of the hour it does not work as intended.
I am very new to programming and have tried a lot of different answers everywhere; I would appreciate very much any help.
Group by the date, then filter out groups that have at least one to_delete.
(df
.groupby(df.index.date)
.apply(lambda sf: None if sf['to_delete'].any() else sf)
.reset_index(level=0, drop=True))
power_c power_g temperature to_delete
date_time
2019-01-02 00:15:00+00:00 3016 0 10.0 False
I'm assuming date_time is a datetime type. If not, convert it first:
df.index = pd.to_datetime(df.index)
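With 100k+ rows, building a boolean mask with transform may also be worth trying instead of apply; a minimal sketch under the same assumptions (a DatetimeIndex named date_time and a boolean to_delete column):
# mark every row of any day that contains at least one flagged reading, then drop those rows
bad_day = df.groupby(df.index.date)['to_delete'].transform('any')
df_clean = df[~bad_day]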
My data looks like this:
print(df)
DateTime, Status
'2021-09-01', 0
'2021-09-05', 1
'2021-09-07', 0
And I need it to look like this:
print(df_desired)
DateTime, Status
'2021-09-01', 0
'2021-09-02', 0
'2021-09-03', 0
'2021-09-04', 0
'2021-09-05', 1
'2021-09-06', 1
'2021-09-07', 0
Right now I accomplish this using pandas like this:
new_index = pd.DataFrame(index = pd.date_range(df.index[0], df.index[-1], freq='D'))
df = new_index.join(df).ffill()
Missing values before the first record in any column are imputed using the inverse of the first record in that column; because the data is binary and only records change-points, this is guaranteed to be correct.
To my understanding my desired dataframe contains "continuous" data, but I'm not sure what to call the data structure in my source data.
The problem:
When I do this to a dataframe that has a frequency of one record per second and I want to load a year's worth of data, my memory overflows (92GB required, ~60GB available). I'm not sure whether there is a standard procedure for this that I don't know the name of and cannot find using Google, or whether I'm using the join method wrong, but this seems horribly inefficient: the resulting dataframe is only a few hundred megabytes after the operation. Any feedback on this would be great!
Use DataFrame.asfreq working with DatetimeIndex:
df = df.set_index('DateTime').asfreq('d', method='ffill').reset_index()
print (df)
DateTime Status
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 1
5 2021-09-06 1
6 2021-09-07 0
You can use this pipeline:
(df.set_index('DateTime')
.reindex(pd.date_range(df['DateTime'].min(), df['DateTime'].max()))
.rename_axis('DateTime')
.ffill(downcast='infer')
.reset_index()
)
output:
DateTime Status
0 2021-09-01 0
1 2021-09-02 0
2 2021-09-03 0
3 2021-09-04 0
4 2021-09-05 1
5 2021-09-06 1
6 2021-09-07 0
input:
DateTime Status
0 2021-09-01 0
1 2021-09-05 1
2 2021-09-07 0
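On the memory concern from the question: both approaches above build the full expanded index in one step, so for second-level data it may also help to keep the filled column in a small dtype before expanding; a minimal sketch, assuming Status only ever holds 0 and 1:
# downcast before expanding so the year-long, per-second result stays compact
df['Status'] = df['Status'].astype('int8')
out = df.set_index('DateTime').asfreq('s', method='ffill').reset_index()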
I have a DataFrame (df1) with patients, where each patient (with unique id) has an admission timestamp:
admission_timestamp id
0 2020-03-31 12:00:00 1
1 2021-01-13 20:52:00 2
2 2020-04-02 07:36:00 3
3 2020-04-05 16:27:00 4
4 2020-03-21 18:51:00 5
I also have a DataFrame (df2) with for each patient (with unique id), data for a specific feature. For example:
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
2 1 temperature 2020-04-03 13:04:33 36.51
3 2 temperature 2020-04-02 07:44:12 36.45
4 2 temperature 2020-04-08 08:36:00 36.50
The timestamp columns are of type datetime64[ns] in both dataframes. The ids in both dataframes refer to the same patients.
In reality there is a lot more data with +- 1 value per minute. What I want is for each patient, only the data for the first X (say 24) hours after the admission timestamp from df1. So the above would result in:
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
3 2 temperature 2020-04-02 07:44:12 36.45
This would thus involve first looking up the admission timestamp and then, using that timestamp, dropping all rows for that patient where the effective_timestamp is not within X hours of the admission timestamp. Here, X should be variable (could be 7, 24, 72, etc.). I could not find a similar question on SO. I tried this using pandas' date_range, but I don't know how to perform that for each patient with a variable value for X. Any help is appreciated.
Edit: I could also merge the dataframes together so each row in df2 has the admission_timestamp, and then subtract the two columns to get the difference in time. And then drop all rows where difference > X. But this sounds very cumbersome.
Let's use pd.DateOffset
First get the value of admission_timestamp for a given patient id, and convert it to pandas datetime.
Let's say id = 1
>>admissionTime = pd.to_datetime(df1[df1['id'] == 1]['admission_timestamp'].values[0])
>>admissionTime
Timestamp('2020-03-31 12:00:00')
Now, you just need to use pd.DateOffset to add 24 hours to it.
>>admissionTime += pd.DateOffset(hours=24)
Now, just look for the rows where id=1 and effective_timestamp < admissionTime
>>df2[(df2['id'] == 1) & (df2['effective_timestamp']<admissionTime)]
id name effective_timestamp numerical_value
0 1 temperature 2020-03-31 13:00:00 36.47
1 1 temperature 2020-03-31 13:04:33 36.61
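The merge idea from the question's edit also works for every patient at once with a variable X; a minimal sketch, assuming the two frames and column names shown above (drop the lower bound if readings from before admission should be kept, as in the answer above):
X = 24  # hours; could be 7, 72, etc.
merged = df2.merge(df1, on='id', how='left')
within = (merged['effective_timestamp'] >= merged['admission_timestamp']) & \
         (merged['effective_timestamp'] <= merged['admission_timestamp'] + pd.Timedelta(hours=X))
result = merged.loc[within, df2.columns]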
This forloop will take 3 days to complete. How can I increase the speed?
for i in range(df.shape[0]):
df.loc[df['Creation date'] >= pd.to_datetime(str(df['Original conf GI dte'].iloc[i])),'delivered'] += df['Sale order item'].iloc[i]
I think the forloop is enough to understand?
If Creation date is greater than or equal to Original conf GI dte, then add the Sale order item value to the delivered column.
Each row's date is "Date Accepted" (Date Delivered is a future date). Input is Order Quantity, Date Accepted & Date Delivered; output is the Delivered column.
Order Quantity Date Accepted Date Delivered Delivered
20 01-05-2010 01-02-2011 0
10 01-11-2010 01-03-2011 0
300 01-12-2010 01-09-2011 0
5 01-03-2011 01-03-2012 30
20 01-04-2012 01-11-2013 335
10 01-07-2013 01-12-2014 335
Convert the values to numpy arrays with Series.to_numpy, compare them with broadcasting, match the order values with numpy.where and finally sum:
date1 = df['Date Accepted'].to_numpy()
date2 = df['Date Delivered'].to_numpy()
order = df['Order Quantity'].to_numpy()
#older pandas versions
#date1 = df['Date Accepted'].values
#date2 = df['Date Delivered'].values
#order = df['Order Quantity'].values
df['Delivered1'] = np.where(date1[:, None] >= date2, order, 0).sum(axis=1)
print (df)
Order Quantity Date Accepted Date Delivered Delivered Delivered1
0 20 2010-01-05 2011-01-02 0 0
1 10 2010-01-11 2011-01-03 0 0
2 300 2010-01-12 2011-01-09 0 0
3 5 2011-01-03 2012-01-03 30 30
4 20 2012-01-04 2013-01-11 335 335
5 10 2013-01-07 2014-01-12 335 335
If I understand correctly, you can use np.where() for speed. Currently you are looping over the dataframe rows, whereas numpy operations are designed to operate on the entire column:
cond = df['Creation date'].ge(pd.to_datetime(df['Original conf GI dte']))
df['delivered'] = np.where(cond, df['delivered'] + df['Sale order item'], df['delivered'])
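Note that the original loop adds each row's Sale order item to every row whose Creation date is on or after that row's Original conf GI dte, not just its own row; to reproduce that exactly, the broadcasting pattern from the answer above can be applied to these column names. A rough sketch, assuming both date columns are already datetime64 and the frame is small enough for an n-by-n boolean matrix:
import numpy as np

created = df['Creation date'].to_numpy()
conf = df['Original conf GI dte'].to_numpy()
items = df['Sale order item'].to_numpy()

# delivered[j] = sum of Sale order item over all rows i whose conf date is on or before row j's creation date
df['delivered'] = np.where(created[:, None] >= conf, items, 0).sum(axis=1)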