Binning in python

Binning in python - python

Example
Input have one column .
Time
02.10
02.40
02.50
Output
Since the
Ave time difference is 20 min ((30 min+10 min)/2),
I need a data frame which buckets the data by average .
It needs to add average time to first record , if the resultant time is there in data then it belongs to bin 1 , otherwise to bin 0.
and then continue.
Desired Output
Time - Bin
02.10 - 1
02.30 - 0
02.50 - 1
03.10 - 0
Thanks in advance.

First, you should always share what you have tried.
Anyways, try this. It should work
mean = df.Time.diff().mean()
start = df.loc[0, 'Time']
end = df.loc[df.shape[1] -1, 'Time']
len = int((end - start)/mean) + 1
timeSeries = [start + i*mean for i in range(len)]
df['Bin'] = 0
df.loc[df['Time'].isin(timeSeries), 'Bin'] = 1
This will create 'Bin' as expected by you, conditional, you create 'Time' properly as datetime.

Related

edit dataframe samples based on the differences between values in the column

I have the following pandas dataframe:
Here is what I am trying to do:
Take the difference of values in the start_time column and find the indices with values less than 0.05
Remove those values from the end_time and start_time columns accounting for the difference
Let's take the example of dataframe below. The start_time column index 2 and 3 value have a difference of less than 0.05 (36.956 - 36.908667).
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 36.932333
3 37.001667 36.956000 37.039667
4 37.210333 37.197333 37.306333
This is what I am trying to achieve. Remove the start_time from the 3rd column and end_time from the second column
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 37.039667
4 37.210333 37.197333 37.306333

This cannot be achieved by a simple shift.
In addition, care should be taken when dealing with continuous start_time difference < 0.05.
Import pandas.
import pandas as pd
Read data. Note that I add one additional row to the sample data above.
df = pd.DataFrame({
'peak_time': [30.691333, 36.918000, 37.001667, 37.1, 37.210333],
'start_time': [30.670667, 36.908667, 36.956000, 36.96, 37.197333],
'end_time': [30.710333, 36.932333, 37.039667, 37.1, 37.306333]
})
Calculate the forward and backward difference of start_time column.
df['start_time_diff1'] = abs(df['start_time'].diff(1))
df['start_time_diff-1'] = abs(df['start_time'].diff(-1))
We can notice that ROW 2 has both differences less than 0.05, indicating that this row has to be first deleted.
After deleting it, we need to record the end_time of the row about to be deleted in the next step.
df2 = df[~(df['start_time_diff1'].lt(0.05) & df['start_time_diff-1'].lt(0.05))].copy()
df2['end_time_shift'] = df2['end_time'].shift(-1)
Then, we can use the simple diff to filter out ROW 3.
df2 = df2[~df2['start_time_diff1'].lt(0.05)].copy()
Finally, paste the end_time to the correct place.
df2.loc[df2['start_time_diff-1'].lt(0.05), 'end_time'] = df2.loc[
df2['start_time_diff-1'].lt(0.05), 'end_time_shift']

You can use .shift() to compare to prior row and take the difference to return rows where the difference is less than .05 by creating a boolean mask s. Then, witrh ~ simply filter out those rows:
s = df['start_time'] - df.shift()['start_time'] < .05
df = df[~s]
df
Out[1]:
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 36.932333
4 37.210333 37.197333 37.306333

Another way is to use .diff()
df[~(df.start_time.diff()<0.05)]
peak_time start_time end_time
1 30.691333 30.670667 30.710333
2 36.918000 36.908667 36.932333
4 37.210333 37.197333 37.306333

Is there a better way to iterate through this calculation?

Running this code produces the error message :
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
I have 6 years' worth of competitor's results from a 1/2 marathon in one csv file.
The function year_runners aims to create a new column for each year with a difference in finishing time between each runner.
Is there a more efficient way of producing the same result?
Thanks in advance.
Pos Gun_Time Chip_Time Name Number Category
1 1900-01-01 01:19:15 1900-01-01 01:19:14 Steve Hodges 324 Senior Male
2 1900-01-01 01:19:35 1900-01-01 01:19:35 Theo Bately 92 Supervet Male
#calculating the time difference between each finisher in year and adding this result to into a new column called time_diff
def year_runners(year, x, y):
print('Event held in', year)
# x is the first number (position) for the runner of that year,
# y is the last number (position) for that year e.g. 2016 event spans from df[246:534]
time_diff = 0
#
for index, row in df.iterrows():
time_diff = df2015.loc[(x + 1),'Gun_Time'] - df2015.loc[(x),'Chip_Time']
# using Gun time as the start-time for all.
# using chip time as finishing time for each runner.
# work out time difference between the x-placed runner and the runner behind (x + 1)
df2015.loc[x,'time_diff'] = time_diff #set the time_diff column to the value of time_diff for
#each row of x in the dataframe
print("Runner",(x+1),"time, minus runner" , x,"=",time_diff)
x += 1
if x > y:
break
Hi everyone, this was solved using the shift technique.
youtube.com/watch?v=nZzBj6n_abQ
df2015['shifted_Chip_Time'] = df2015['Chip_Time'].shift(1)
df2015['time_diff'] = df2015['Gun_Time'] - df2015['shifted_Chip_Time']

Need to optimize/avoid pandas .resample in groupby calls (need to bring this down to <60s for 1.4k rows -- currently >160s)

I have time series entries that I need to resample. On the more extreme end of things I've imagined that someone might generate data for 15 months -- this tends to be about 1300 records (about 5 location entries to every 2 metric entries). But after resampling to 15 minute intervals, the full set is about 41000 rows.
My data is less than a couple of dozen columns right now, so 20 column * 40k ≈ 800k values that need to be calculated .. this seems like I should be able to get my time down to below 10 seconds really. I've done an initial profile and it looks like the bottleneck is mostly in one pair of pandas methods for resampling that I am calling -- and they are amazingly slow! Its to the point where I am wondering if there is something wrong ..why would pandas be so slow to resample?
This produces a timeout in google cloud functions. That's what I need to avoid.
There's two sets of data: location and metric. Sample location data might look like this:
location bar girlfriends grocers home lunch park relatives work
date user
2018-01-01 00:00:01 0ce65715-4ec7-4ca2-aab0-323c57603277 0 0 0 1 0 0 0 0
sample metric data might look like this:
user date app app_id metric
0 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 2018-01-01 01:30:43 app_2 c2bfd6fb-44bb-499d-8e53-4d5af522ad17 0.02
1 6ca1a9ce-8501-49f5-b7d9-70ac66331fdc 2018-01-01 04:14:59 app_2 c2bfd6fb-44bb-499d-8e53-4d5af522ad17 0.10
I need to union those two subsets to a single ledger, with columns for each location name and each app. The values in apps are samples of constants, so I need to "connect the dots". The values in locations are location change events, so I need to keep repeating the same value until the next change event. In all, it like this:
app_1 app_2 user bar grocers home lunch park relatives work
date
2018-01-31 00:00:00 0.146250 0.256523 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 0 0 1 0 0 0 0
2018-01-31 00:15:00 0.146290 0.256562 4fb488bc-aea0-4f1e-9bc8-d7a8382263ef 0 0 0 0 0 0 1
This code does that, but needs to be optimized. What are the weakest links here? I've added basic sectional profiling:
import time
start = time.time()
locDf = locationDf.copy()
locDf.set_index('date', inplace=True)
# convert location data to "15 minute interval" rows
locDfs = {}
for user, user_loc_dc in locDf.groupby('user'):
locDfs[user] = user_loc_dc.resample('15T').agg('max').bfill()
aDf = appDf.copy()
aDf.set_index('date', inplace=True)
print("section1:", time.time() - start)
userLocAppDfs = {}
for user, a2_df in aDf.groupby('user'):
start = time.time()
# per user, convert app data to 15m interval
userDf = a2_df.resample('15T').agg('max')
print("section2.1:", time.time() - start)
start = time.time()
# assign metric for each app to an app column for each app, per user
userDf.reset_index(inplace=True)
userDf = pd.crosstab(index=userDf['date'], columns=userDf['app'], values=userDf['metric'], aggfunc=np.mean).fillna(np.nan, downcast='infer')
userDf['user'] = user
userDf.reset_index(inplace=True)
userDf.set_index('date', inplace=True)
print("section2.2:", time.time() - start)
start = time.time()
# reapply 15m intervals now that we have new data per app
userLocAppDfs[user] = userDf.resample('15T').agg('max')
print("section2.3:", time.time() - start)
start = time.time()
# assign location data to location columns per location, creates a "1" at the 15m interval of the location change event in the location column created
loDf = locDfs[user]
loDf.reset_index(inplace=True)
loDf = pd.crosstab([loDf.date, loDf.user], loDf.location)
loDf.reset_index(inplace=True)
loDf.set_index('date', inplace=True)
loDf.drop('user', axis=1, inplace=True)
print("section2.4:", time.time() - start)
start = time.time()
# join the location crosstab columns with the app crosstab columns per user
userLocAppDfs[user] = userLocAppDfs[user].join(loDf, how='outer')
# convert from just "1" at each location change event followed by zeros, to "1" continuing until next location change
userLocAppDfs[user] = userLocAppDfs[user].resample('15T').agg('max')
userLocAppDfs[user]['user'].fillna(user, inplace=True)
print("section2.5:", time.time() - start)
start = time.time()
for loc in locationDf[locationDf['user'] == user].location.unique():
# fill location NaNs
userLocAppDfs[user][loc] = userLocAppDfs[user][loc].replace(np.nan, 0)
print("section3:", time.time() - start)
start = time.time()
# fill app NaNs
for app in a2_df['app'].unique():
userLocAppDfs[user][app].interpolate(method='linear', limit_area='inside', inplace=True)
userLocAppDfs[user][app].fillna(value=0, inplace=True)
print("section4:", time.time() - start)
results:
section1: 41.67342448234558
section2.1: 11.441165685653687
section2.2: 0.020460128784179688
section2.3: 5.082422733306885
section2.4: 0.2675948143005371
section2.5: 40.296404123306274
section3: 0.0076410770416259766
section4: 0.0027387142181396484
section2.1: 11.567803621292114
section2.2: 0.02080368995666504
section2.3: 7.187351703643799
section2.4: 0.2625312805175781
section2.5: 40.669641733169556
section3: 0.0072269439697265625
section4: 0.00457453727722168
section2.1: 11.773712396621704
section2.2: 0.019629478454589844
section2.3: 6.996192693710327
section2.4: 0.2728455066680908
section2.5: 45.172399282455444
section3: 0.0071871280670166016
section4: 0.004514217376708984
Both "big" sections have calls to resample and agg('max').
notes:
I found this problem from 12 months ago: Pandas groupby + resample first is really slow - since version 0.22 -- seems like perhaps resample() in groupby is broken currently.

Python Data manipulation: Duplicate and Average row and column values using dates

Hi I have a dataset in the following format:
Code for replicating the data:
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
I input the numbers as a string to show blank cells
Where the first three columns denotes date (Year, Month and Day) and the following columns represent individuals (My actual data file consists of about 300 such rows and about 1000 subjects. I presented a subset of the data here).
Where the column value refers to expenditure on FMCG products.
What I would like to do is the following:
Part 1 (Beginning and end points)
a) For each individual locate the first observation and duplicate the value of the first observation for atleast the previous six months. For example: Subject C's 1st observation is on the 10th of August 2008. In that case I would want all the rows from June 10, 2008 to be equal to 65 for Subject C (Roughly 2/12/2008
is the cutoff date. SO we leave the 3rd cell from the top for Subject_C's column blank).
b) Locate last observation and repeat the last observation for the following 3 months. For example for Subject_A, we repeat 35 twice (till 6th November 2008).
Please refer to the following diagram for the highlighted cell with the solutions.
Part II - (Rows in between)
Next I would like to do two things (I would need to do the following three steps separately, not all at one time):
For individuals like Subject_A, locate two observations that come one after the other (30 and 35).
i) Use the average of the two observations. In this case we would have 32.5 in the four rows without caring about time.
for eg:
ii) Find the total time between two observations and take the mean of the time. For the 1st half of the time period assign the first value and for the 2nd half assign the second value. For example - for subject 1, the total days between 01/22/208 and 08/10/2008 is 201 days. For the first 201/2 = 100.5 days assign the value of 30 to Subject_A and for the remaining value assign 35. In this case the columns for Subject_A and Subject_C will look like:
The final dataset will use (a), (b) & (i) or (a), (b) & (ii)
Final data I [using a,b and i]
Final data II [using a,b and ii]
I would appreciate any help with this. Thanks in advance. Please let me know if the steps are unclear.
Follow up question and Issues
Thanks #Juan for the initial answer. Here's my follow up question. Suppose that Subject_A has more than 2 observations (code for the example data below). Would we be able to extend this code to incorporate more than 2 observations?
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
Issues
For the current code, I found an issue for part II (ii). This is the output that I get:
This is actually on the right track. The two cells above 35 does not seem to get updated. Is there something wrong on my end? Also the same question as before, would we be able to extend it to the case of >2 observations?

Here a code solution for subject A. Should work with the other subjects:
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
d1 = pd.DataFrame(d1)
## Create a variable named date
d1['date']= pd.to_datetime(d1['Year']+'/'+d1['Month']+'/'+d1['Day'])
# convert to float, to calculate mean
d1['Subject_A'] = d1['Subject_A'].replace('',np.nan).astype(float)
# index of the not null rows
subja = d1['Subject_A'].notnull()
### max and min index row with notnull value
max_id_subja = d1.loc[subja,'date'].idxmax()
min_id_subja = d1.loc[subja,'date'].idxmin()
### max and min date for Sub A with notnull value
max_date_subja = d1.loc[subja,'date'].max()
min_date_subja = d1.loc[subja,'date'].min()
### value for max and min date
max_val_subja = d1.loc[max_id_subja,'Subject_A']
min_val_subja = d1.loc[min_id_subja,'Subject_A']
#### Cutoffs
min_cutoff = min_date_subja-pd.Timedelta(6, unit='M')
max_cutoff = max_date_subja+pd.Timedelta(3, unit='M')
## PART I.a
d1.loc[(d1['date']<min_date_subja) & (d1['date']>min_cutoff),'Subject_A'] = min_val_subja
## PART I.b
d1.loc[(d1['date']>max_date_subja) & (d1['date']<max_cutoff),'Subject_A'] = max_val_subja
## PART II
d1_2i = d1.copy()
d1_2ii = d1.copy()
lower_date = min_date_subja
lower_val = min_val_subja.copy()
next_dates_index = d1_2i.loc[(d1['date']>min_date_subja) & subja].index
for N in next_dates_index:
next_date = d1_2i.loc[N,'date']
next_val = d1_2i.loc[N,'Subject_A']
#PART II.i
d1_2i.loc[(d1['date']>lower_date) & (d1['date']<next_date),'Subject_A'] = np.mean([lower_val,next_val])
#PART II.ii
mean_time_a = pd.Timedelta((next_date-lower_date).days/2, unit='d')
d1_2ii.loc[(d1['date']>lower_date) & (d1['date']<=lower_date+mean_time_a),'Subject_A'] = lower_val
d1_2ii.loc[(d1['date']>lower_date+mean_time_a) & (d1['date']<=next_date),'Subject_A'] = next_val
lower_date = next_date
lower_val = next_val
print(d1_2i)
print(d1_2ii)

Calculate the duration of a state with a pandas Dataframe

I try to calculate how often a state is entered and how long it lasts. For example I have the three possible states 1,2 and 3, which state is active is logged in a pandas Dataframe:
test = pd.DataFrame([2,2,2,1,1,1,2,2,2,3,2,2,1,1], index=pd.date_range('00:00', freq='1h', periods=14))
For example the state 1 is entered two times (at index 3 and 12), the first time it lasts three hours, the second time two hours (so on average 2.5). State 2 is entered 3 times, on average for 2.66 hours.
I know that I can mask data I'm not interested in, for example to analyize for state 1:
state1 = test.mask(test!=1)
but from there on I can't find a way to go on.

I hope the comments give enough explanation - the key point is you can use a custom rolling window function and then cumsum to group the rows into "clumps" of the same state.
# set things up
freq = "1h"
df = pd.DataFrame(
[2,2,2,1,1,1,2,2,2,3,2,2,1,1],
index=pd.date_range('00:00', freq=freq, periods=14)
)
# add a column saying if a row belongs to the same state as the one before it
df["is_first"] = pd.rolling_apply(df, 2, lambda x: x[0] != x[1]).fillna(1)
# the cumulative sum - each "clump" gets its own integer id
df["value_group"] = df["is_first"].cumsum()
# get the rows corresponding to states beginning
start = df.groupby("value_group", as_index=False).nth(0)
# get the rows corresponding to states ending
end = df.groupby("value_group", as_index=False).nth(-1)
# put the timestamp indexes of the "first" and "last" state measurements into
# their own data frame
start_end = pd.DataFrame(
{
"start": start.index,
# add freq to get when the state ended
"end": end.index + pd.Timedelta(freq),
"value": start[0]
}
)
# convert timedeltas to seconds (float)
start_end["duration"] = (
(start_end["end"] - start_end["start"]).apply(float) / 1e9
)
# get average state length and counts
agg = start_end.groupby("value").agg(["mean", "count"])["duration"]
agg["mean"] = agg["mean"] / (60 * 60)
And the output:
mean count
value
1 2.500000 2
2 2.666667 3
3 1.000000 1

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Binning in python - python

Related

edit dataframe samples based on the differences between values in the column

Is there a better way to iterate through this calculation?

Need to optimize/avoid pandas .resample in groupby calls (need to bring this down to <60s for 1.4k rows -- currently >160s)

Python Data manipulation: Duplicate and Average row and column values using dates

Calculate the duration of a state with a pandas Dataframe

Categories

Resources