edit dataframe samples based on the differences between values in the column - python

I have the following pandas dataframe. Here is what I am trying to do:
1. Take the difference of values in the start_time column and find the indices where the difference is less than 0.05.
2. Remove those values from the start_time and end_time columns, accounting for the difference.
Let's take the example dataframe below. The start_time values at index 2 and 3 differ by less than 0.05 (36.956 - 36.908667 = 0.047333).
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 36.932333
3 37.001667 36.956000 37.039667
4 37.210333 37.197333 37.306333
This is what I am trying to achieve: merge rows 2 and 3, keeping row 2's start_time and taking row 3's end_time (i.e. drop row 3's start_time and row 2's end_time):
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 37.039667
4 37.210333 37.197333 37.306333

This cannot be achieved by a simple shift.
In addition, care should be taken when dealing with consecutive start_time differences < 0.05.
Import pandas.
import pandas as pd
Read data. Note that I add one additional row to the sample data above.
df = pd.DataFrame({
'peak_time': [30.691333, 36.918000, 37.001667, 37.1, 37.210333],
'start_time': [30.670667, 36.908667, 36.956000, 36.96, 37.197333],
'end_time': [30.710333, 36.932333, 37.039667, 37.1, 37.306333]
})
Calculate the forward and backward difference of start_time column.
df['start_time_diff1'] = abs(df['start_time'].diff(1))
df['start_time_diff-1'] = abs(df['start_time'].diff(-1))
We can see that ROW 2 has both differences less than 0.05, which indicates that this row has to be deleted first.
After deleting it, we record the end_time of the row that will be deleted in the next step.
df2 = df[~(df['start_time_diff1'].lt(0.05) & df['start_time_diff-1'].lt(0.05))].copy()
df2['end_time_shift'] = df2['end_time'].shift(-1)
Then, we can use the simple diff to filter out ROW 3.
df2 = df2[~df2['start_time_diff1'].lt(0.05)].copy()
Finally, paste the end_time to the correct place.
df2.loc[df2['start_time_diff-1'].lt(0.05), 'end_time'] = df2.loc[
df2['start_time_diff-1'].lt(0.05), 'end_time_shift']

You can use .shift() to compare each row to the prior row and take the difference, creating a boolean mask s of the rows where the difference is less than 0.05. Then, with ~, simply filter out those rows:
s = df['start_time'] - df.shift()['start_time'] < .05
df = df[~s]
df
Out[1]:
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 36.932333
4 37.210333 37.197333 37.306333

Another way is to use .diff()
df[~(df.start_time.diff()<0.05)]
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 36.932333
4 37.210333 37.197333 37.306333
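Note that these shorter approaches only drop the offending rows; they do not carry the dropped row's end_time back as in the desired output. A minimal sketch of one way to add that, assuming the original four-row df and the same .diff() mask (it handles the simple pairwise case only; for runs of several consecutive close rows, see the forward/backward-diff answer above):
m = df['start_time'].diff() < 0.05                  # rows too close to the previous one
# give the previous row the dropped row's end_time, then drop the flagged rows
df.loc[m.shift(-1, fill_value=False), 'end_time'] = df['end_time'].shift(-1)
df = df[~m]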

Related

Pandas Advanced: How to get results for customer who has bought at least twice within 5 days of period?

I have been attempting to solve a problem for hours and am stuck on it. Here is the problem outline:
import numpy as np
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
df
orderid customerid orderdate
0 10315 ISLAT 1996-09-26
1 10318 ISLAT 1996-10-01
2 10321 ISLAT 1996-10-03
3 10473 ISLAT 1997-03-13
4 10621 ISLAT 1997-08-05
5 10253 HANAR 1996-07-10
6 10541 HANAR 1997-05-19
7 10645 HANAR 1997-08-26
I would like to select all the customers who have ordered items more than once WITHIN 5 DAYS.
For example, here only one customer (ISLAT) ordered within a 5-day period, and has done it twice.
I would like to get the output in the following format:
Required Output
customerid initial_order_id initial_order_date nextorderid nextorderdate daysbetween
ISLAT 10315 1996-09-26 10318 1996-10-01 5
ISLAT 10318 1996-10-01 10321 1996-10-03 2
First, to be able to count the difference in days, convert orderdate
column to datetime:
df.orderdate = pd.to_datetime(df.orderdate)
Then define the following function:
def fn(grp):
return grp[(grp.orderdate.shift(-1) - grp.orderdate) / np.timedelta64(1, 'D') <= 5]
And finally apply it:
df.sort_values(['customerid', 'orderdate']).groupby('customerid').apply(fn)
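To also shape this into the requested output format, a hedged follow-up sketch (not part of the answer above; the sorted_df and out names are just illustrative): pair each order with the customer's next order via shift, then keep the pairs within 5 days.
sorted_df = df.sort_values(['customerid', 'orderdate'])
g = sorted_df.groupby('customerid')
out = sorted_df.assign(nextorderid=g['orderid'].shift(-1),
                       nextorderdate=g['orderdate'].shift(-1))
out['daysbetween'] = (out['nextorderdate'] - out['orderdate']).dt.days
out = out[out['daysbetween'] <= 5].rename(
    columns={'orderid': 'initial_order_id', 'orderdate': 'initial_order_date'})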
It is a bit tricky because there can be any number of purchase pairs within 5-day windows. It is a good use case for merge_asof, which allows approximate-but-not-exact matching of a dataframe with itself.
Input data
import pandas as pd
df = pd.DataFrame({'orderid': [10315, 10318, 10321, 10473, 10621, 10253, 10541, 10645],
'customerid': ['ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'ISLAT', 'HANAR', 'HANAR', 'HANAR'],
'orderdate': ['1996-09-26', '1996-10-01', '1996-10-03', '1997-03-13', '1997-08-05', '1996-07-10', '1997-05-19', '1997-08-26']})
Define a function that computes the pairs of purchases, given data for a customer.
def compute_purchase_pairs(df):
    # Approximate self join on the date, but not exact.
    df_combined = pd.merge_asof(df, df, left_index=True, right_index=True,
                                suffixes=('_first', '_second'), allow_exact_matches=False)
    # Compute difference
    df_combined['timedelta'] = df_combined['orderdate_first'] - df_combined['orderdate_second']
    return df_combined
Do the preprocessing and compute the pairs
# Convert to datetime
df['orderdate'] = pd.to_datetime(df['orderdate'])
# Sort from oldest to newest purchase (merge_asof needs sorted keys; groupby keeps this order)
df2 = df.sort_values(by='orderdate')
# Create an index for joining
df2 = df2.set_index('orderdate', drop=False)
# Compute purchase pairs for each customer
df_differences = df2.groupby('customerid').apply(compute_purchase_pairs)
# Show only the ones we care about
result = df_differences[df_differences['timedelta'].dt.days<=5]
result.reset_index(drop=True)
Result
orderid_first customerid_first orderdate_first orderid_second \
0 10318 ISLAT 1996-10-01 10315.0
1 10321 ISLAT 1996-10-03 10318.0
customerid_second orderdate_second timedelta
0 ISLAT 1996-09-26 5 days
1 ISLAT 1996-10-01 2 days
You can create the column 'daysbetween' with sort_values and diff (this assumes orderdate has already been converted to datetime). Then, to get the "next order" columns, join df with itself after a groupby per customerid with everything shifted. Finally, query the rows where the condition on 'daysbetween_next' is met:
df['daysbetween'] = df.sort_values(['customerid', 'orderdate'])['orderdate'].diff().dt.days
df_final = (df.join(df.groupby('customerid').shift(-1),
                    lsuffix='_initial', rsuffix='_next')
              .drop('daysbetween_initial', axis=1)
              .query('daysbetween_next <= 5 and daysbetween_next >= 0'))
It's quite simple. Let's write down the requirements one at a time and try to build on them.
First, I guess that the customer has a unique id since it's not specified. We'll use that id for identifying customers.
Second, I assume it does not matter if the customer bought 5 days before or after.
My solution is to use a simple filter. Note that this solution can also be implemented in a SQL database.
As a condition, we require the user to be the same. We can achieve this as follows:
new_df = df[df["ID"] == df["ID"].shift(1)]
We create a new DataFrame, namely new_df, with all rows such that the x-th row has the same user id as the (x-1)-th row (i.e. the previous row).
Now, let's search for purchases within 5 days by adding that condition to the previous piece of code:
new_df = df[(df["ID"] == df["ID"].shift(1)) & ((df["Date"] - df["Date"].shift(1)) <= pd.Timedelta(days=5))]
This should do the job. I cannot test it right now, so some fixes may be needed. I'll try to test it as soon as I can.
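For reference, a hedged adaptation of that filter to the actual column names in this question (customerid, orderdate), assuming orderdate is already converted to datetime and the frame is sorted by customer and date:
within_5_days = df[(df['customerid'] == df['customerid'].shift(1)) &
                   ((df['orderdate'] - df['orderdate'].shift(1)) <= pd.Timedelta(days=5))]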

Need to optimize/avoid pandas .resample in groupby calls (need to bring this down to <60s for 1.4k rows -- currently >160s)

I have time series entries that I need to resample. On the more extreme end of things I've imagined that someone might generate data for 15 months -- this tends to be about 1300 records (about 5 location entries to every 2 metric entries). But after resampling to 15 minute intervals, the full set is about 41000 rows.
My data has fewer than a couple of dozen columns right now, so 20 columns * 40k rows ≈ 800k values need to be calculated; it seems like I should be able to get the time down to below 10 seconds. I've done an initial profile and it looks like the bottleneck is mostly in one pair of pandas methods for resampling that I am calling -- and they are amazingly slow! It's to the point where I am wondering if there is something wrong... why would pandas be so slow to resample?
This produces a timeout in google cloud functions. That's what I need to avoid.
There's two sets of data: location and metric. Sample location data might look like this:
location bar girlfriends grocers home lunch park relatives work
date user
2018-01-01 00:00:01 0ce65715-4ec7-4ca2-aab0-323c57603277 0 0 0 1 0 0 0 0
sample metric data might look like this:
user date app app_id metric
0 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 2018-01-01 01:30:43 app_2 c2bfd6fb-44bb-499d-8e53-4d5af522ad17 0.02
1 6ca1a9ce-8501-49f5-b7d9-70ac66331fdc 2018-01-01 04:14:59 app_2 c2bfd6fb-44bb-499d-8e53-4d5af522ad17 0.10
I need to union those two subsets into a single ledger, with columns for each location name and each app. The values in apps are samples of constants, so I need to "connect the dots". The values in locations are location change events, so I need to keep repeating the same value until the next change event. In all, it looks like this:
app_1 app_2 user bar grocers home lunch park relatives work
date
2018-01-31 00:00:00 0.146250 0.256523 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 0 0 1 0 0 0 0
2018-01-31 00:15:00 0.146290 0.256562 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 0 0 0 0 0 0 1
This code does that, but needs to be optimized. What are the weakest links here? I've added basic sectional profiling:
import time
import numpy as np
import pandas as pd

start = time.time()
locDf = locationDf.copy()
locDf.set_index('date', inplace=True)

# convert location data to "15 minute interval" rows
locDfs = {}
for user, user_loc_dc in locDf.groupby('user'):
    locDfs[user] = user_loc_dc.resample('15T').agg('max').bfill()

aDf = appDf.copy()
aDf.set_index('date', inplace=True)
print("section1:", time.time() - start)

userLocAppDfs = {}
for user, a2_df in aDf.groupby('user'):
    start = time.time()
    # per user, convert app data to 15m interval
    userDf = a2_df.resample('15T').agg('max')
    print("section2.1:", time.time() - start)

    start = time.time()
    # assign metric for each app to an app column for each app, per user
    userDf.reset_index(inplace=True)
    userDf = pd.crosstab(index=userDf['date'], columns=userDf['app'], values=userDf['metric'],
                         aggfunc=np.mean).fillna(np.nan, downcast='infer')
    userDf['user'] = user
    userDf.reset_index(inplace=True)
    userDf.set_index('date', inplace=True)
    print("section2.2:", time.time() - start)

    start = time.time()
    # reapply 15m intervals now that we have new data per app
    userLocAppDfs[user] = userDf.resample('15T').agg('max')
    print("section2.3:", time.time() - start)

    start = time.time()
    # assign location data to location columns per location, creates a "1" at the 15m interval
    # of the location change event in the location column created
    loDf = locDfs[user]
    loDf.reset_index(inplace=True)
    loDf = pd.crosstab([loDf.date, loDf.user], loDf.location)
    loDf.reset_index(inplace=True)
    loDf.set_index('date', inplace=True)
    loDf.drop('user', axis=1, inplace=True)
    print("section2.4:", time.time() - start)

    start = time.time()
    # join the location crosstab columns with the app crosstab columns per user
    userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
    # convert from just "1" at each location change event followed by zeros,
    # to "1" continuing until next location change
    userLocAppDfs[user] = userLocAppDfs[user].resample('15T').agg('max')
    userLocAppDfs[user]['user'].fillna(user, inplace=True)
    print("section2.5:", time.time() - start)

    start = time.time()
    for loc in locationDf[locationDf['user'] == user].location.unique():
        # fill location NaNs
        userLocAppDfs[user][loc] = userLocAppDfs[user][loc].replace(np.nan, 0)
    print("section3:", time.time() - start)

    start = time.time()
    # fill app NaNs
    for app in a2_df['app'].unique():
        userLocAppDfs[user][app].interpolate(method='linear', limit_area='inside', inplace=True)
        userLocAppDfs[user][app].fillna(value=0, inplace=True)
    print("section4:", time.time() - start)
results:
section1: 41.67342448234558
section2.1: 11.441165685653687
section2.2: 0.020460128784179688
section2.3: 5.082422733306885
section2.4: 0.2675948143005371
section2.5: 40.296404123306274
section3: 0.0076410770416259766
section4: 0.0027387142181396484
section2.1: 11.567803621292114
section2.2: 0.02080368995666504
section2.3: 7.187351703643799
section2.4: 0.2625312805175781
section2.5: 40.669641733169556
section3: 0.0072269439697265625
section4: 0.00457453727722168
section2.1: 11.773712396621704
section2.2: 0.019629478454589844
section2.3: 6.996192693710327
section2.4: 0.2728455066680908
section2.5: 45.172399282455444
section3: 0.0071871280670166016
section4: 0.004514217376708984
Both "big" sections have calls to resample and agg('max').
notes:
I found this problem from 12 months ago: Pandas groupby + resample first is really slow - since version 0.22 -- seems like perhaps resample() in groupby is broken currently.
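One direction that might help (a sketch under assumptions, not a tested fix): instead of calling resample('15T').agg('max') once per user inside the loop, bucket all rows at once by flooring the timestamps to 15 minutes and doing a single groupby. This assumes appDf has a datetime64 'date' column plus 'user', 'app' and 'metric' columns, as in the sample metric data above.
aDf = appDf.copy()
aDf['bin'] = aDf['date'].dt.floor('15T')             # 15-minute bucket per sample
# one vectorized pass over all users instead of a resample('15T') per user
userApp = (aDf.groupby(['user', 'bin', 'app'])['metric']
               .max()
               .unstack('app'))                      # one column per app, (user, bin) index
The empty 15-minute slots still have to be reintroduced afterwards (e.g. by reindexing each user against a full pd.date_range) before the interpolation step, so this only replaces the per-user resample/agg calls, not the whole pipeline.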

check for date and time between two columns in pandas data frame

I have two data frames:
The first date frame is:
import pandas as pd
df1 = pd.DataFrame({'serialNo':['aaaa','bbbb','cccc','ffff','aaaa','bbbb','aaaa'],
'Name':['Sayonti','Ruchi','Tony','Gowtam','Toffee','Tom','Sayonti'],
'testName': [4402, 3747 ,5555,8754,1234,9876,3602],
'moduleName': ['singing', 'dance','booze', 'vocals','drama','paint','singing'],
'endResult': ['WARNING', 'FAILED', 'WARNING', 'FAILED','WARNING','FAILED','WARNING'],
'Date':['2018-10-5','2018-10-6','2018-10-7','2018-10-8','2018-10-9','2018-10-10','2018-10-8'],
'Time_df1':['23:26:39','22:50:31','22:15:28','21:40:19','21:04:15','20:29:11','19:54:03']})
The second data frame is:
df2 = pd.DataFrame({'serialNo':['aaaa','bbbb','aaaa','ffff','xyzy','aaaa'],
'Food':['Strawberry','Coke','Pepsi','Nuts','Apple','Candy'],
'Work': ['AP', 'TC','OD', 'PU','NO','PM'],
'Date':['2018-10-1','2018-10-6','2018-10-2','2018-10-3','2018-10-5','2018-10-10'],
'Time_df2':['09:00:00','10:00:00','11:00:00','12:00:00','13:00:00','14:00:00']
})
I am joining the two based on serial number:
df1['Date'] = pd.to_datetime(df1['Date'])
df2['Date'] = pd.to_datetime(df2['Date'])
result = pd.merge(df1,df2,on=['serialNo'],how='inner')
Now I want Date_y to lie within 3 days of Date_x, starting from Date_x, i.e. Date_y should equal Date_x + 1, 2, or 3 days. I can get that as below, but I also want to check a time range, which I do not know how to achieve:
result = result[result.Date_x.sub(result.Date_y).dt.days.between(0,3)]
I want to check the time such that Time_df2 is within 6 hours of the start time Time_df1. Please help?
You could have a column within your dataframe that combines the date and the time. Here's an example of combining a single row in the dataframe:
import datetime

# Combining Date_x and Time_df1
value_1_x = datetime.datetime.combine(result['Date_x'][0].date(),
                                      datetime.datetime.strptime(result['Time_df1'][0], '%H:%M:%S').time())
# Combining Date_y and Time_df2
value_2_y = datetime.datetime.combine(result['Date_y'][0].date(),
                                      datetime.datetime.strptime(result['Time_df2'][0], '%H:%M:%S').time())
Then given two datetime objects, you can simply subtract to find the difference you are looking for:
difference = value_1_x - value_2_y
print(difference)
Which gives the output:
4 days, 14:26:39
My understanding is that you are looking to see if something is within 3 days and 6 hours (or a total of 78 hours). You can convert this to hours easily, and then make the desired comparison:
hours_difference = abs(value_1_x - value_2_y).total_seconds() / 3600.0
print(hours_difference)
Which gives the output:
110.44416666666666
Hope that helps!
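To apply the same idea to every row at once rather than a single row, one possible vectorized sketch (assuming the merged result frame from the question; 78 hours is the 3 days + 6 hours discussed above):
import pandas as pd

ts_x = pd.to_datetime(result['Date_x'].dt.strftime('%Y-%m-%d') + ' ' + result['Time_df1'])
ts_y = pd.to_datetime(result['Date_y'].dt.strftime('%Y-%m-%d') + ' ' + result['Time_df2'])
hours_difference = (ts_x - ts_y).abs().dt.total_seconds() / 3600.0
result = result[hours_difference <= 78]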

Binning in python

Example
The input has one column:
Time
02.10
02.40
02.50
Output
Since the average time difference is 20 min ((30 min + 10 min) / 2), I need a data frame which buckets the data by that average: add the average time to the first record; if the resulting time is present in the data it belongs to bin 1, otherwise to bin 0, and then continue.
Desired Output
Time - Bin
02.10 - 1
02.30 - 0
02.50 - 1
03.10 - 0
Thanks in advance.
First, you should always share what you have tried.
Anyway, try this; it should work:
mean = df.Time.diff().mean()
start = df.loc[0, 'Time']
end = df.loc[df.shape[0] - 1, 'Time']
n = int((end - start) / mean) + 1
timeSeries = [start + i * mean for i in range(n)]
df['Bin'] = 0
df.loc[df['Time'].isin(timeSeries), 'Bin'] = 1
This will create 'Bin' as you expect, provided you create 'Time' properly as a datetime/timedelta.
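For completeness, a hedged preparation sketch (an assumption about the input, not part of the answer): if the 'Time' column holds "HH.MM" strings as in the example, they can be parsed into timedeltas first so the arithmetic above works:
import pandas as pd

df = pd.DataFrame({'Time': ['02.10', '02.40', '02.50']})
df['Time'] = pd.to_timedelta(df['Time'].str.replace('.', ':', regex=False) + ':00')
# df.Time.diff().mean() is now Timedelta('0 days 00:20:00'), and the snippet
# above can be applied as written.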

How to determine data capture with a pandas dataframe?

I am working with hourly monitoring data which consists of incomplete time series, i.e. several hours during a year (or during several years) will be absent from my dataframe.
I would like to determine the data capture, i.e. the percentage of values present in a month, a season, or a year.
This works with the following code (for demonstration written for monthly resampling) - however that piece of code appears somewhat inefficient, because I need to create a second hourly dataframe and I need to resample two dataframes.
Is there a more elegant solution to this?
import numpy as np
import pandas as pd
# create dummy series
t1 = pd.date_range(start="1997-01-01 05:00", end="1997-04-25 17:00", freq="H")
t2 = pd.date_range(start="1997-06-11 15:00", end="1997-06-15 12:00", freq="H")
t3 = pd.date_range(start="1997-06-18 00:00", end="1997-08-22 23:00", freq="H")
df1 = pd.DataFrame(np.random.randn(len(t1)), index=t1)
df2 = pd.DataFrame(np.random.randn(len(t2)), index=t2)
df3 = pd.DataFrame(np.random.randn(len(t3)), index=t3)
df = pd.concat((df1, df2, df3))
# create time index with complete hourly coverage over entire years
tstart = "%i-01-01 00:00"%(df.index.year[0])
tend = "%i-12-31 23:00"%(df.index.year[-1])
tref = pd.date_range(start=tstart, end=tend, freq="H")
dfref = pd.DataFrame(np.zeros(len(tref)), index=tref)
# count number of values in reference dataframe and actual dataframe
# Example: monthly resampling
cntref = dfref.resample("MS").count()
cnt = df.resample("MS").count().reindex(cntref.index).fillna(0)
for i in range(len(cnt.index)):
    print(cnt.index[i], cnt.values[i], cntref.values[i], cnt.values[i] / cntref.values[i])
pandas' Timedelta will do the trick:
# Time delta between rows of the df
df['index'] = df.index
pindex = df['index'].shift(1)
delta = df['index'] - pindex
# Any delta > 1H means a missing data period
missing_delta = delta[delta > pd.Timedelta('1H')]
# Sum of missing data periods divided by total period
ratio_missing = missing_delta.sum() / (df.index[-1] - df.index[0])
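A slightly shorter variant of the same idea (a sketch, not from the original answer): compute the gaps directly from the index instead of copying it into a column:
delta = df.index.to_series().diff()
missing_delta = delta[delta > pd.Timedelta('1H')]
ratio_missing = missing_delta.sum() / (df.index[-1] - df.index[0])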
You can use a time-based Grouper (pd.Grouper with a freq; pd.TimeGrouper in older pandas versions).
# Create an hourly index spanning the range of your data.
idx = pd.date_range(pd.Timestamp(df.index[0].strftime('%Y-%m-%d %H:00')),
                    pd.Timestamp(df.index[-1].strftime('%Y-%m-%d %H:00')),
                    freq='H')
# Use the Grouper to calculate the fraction of observations from `df` that fall in the
# hourly time index.
(df.groupby(pd.Grouper(freq='M')).size() /
 pd.Series(idx).reindex(idx).groupby(pd.Grouper(freq='M')).size())
1997-01-31 1.000000
1997-02-28 1.000000
1997-03-31 1.000000
1997-04-30 0.825000
1997-05-31 0.000000
1997-06-30 0.563889
1997-07-31 1.000000
1997-08-31 1.000000
Freq: M, dtype: float64
As there have been no further suggestions, it appears as if the originally posted solution is most efficient.
Not sure about performance, but as a (very long) one-liner you can do this once you have created 'df'. It at least has the benefit of not requiring a dummy dataframe. It should work for any period of input data and resampling.
month_counts = df.resample('H').mean().resample('M').count() / df.resample('H').ffill().fillna(1).resample('M').count()
