Each of the rows of my dataframe is an interval represented by date1 and date2 and a user id. For each user id, I need to group together the intervals which are separated by a gap below a certain threshold.
So far, for each user id, I sort the rows by begin and end date, compute the gaps, and group rows based on those values. I then append the modified rows to a new dataframe (this is the way I found to un-group the dataframe).
However, this is quite slow. Do you see ways to improve the way I do the grouping?
def gap(group):
    return group[['date1', 'date2']].min(axis = 1) - \
           group.shift()[['date1', 'date2']].max(axis = 1)

def cluster(df, threshold):
    df['clusters'] = 0
    grouped = df.groupby('user_id')
    newdf = pd.DataFrame()
    for name, group in grouped:
        group = group.sort_values(['date1', 'date2'], ascending = True)
        group['gap'] = gap(group)
        cuts = group['gap'] > timedelta(threshold)
        df2 = group.copy()
        for g, d, r in zip(group.loc[cuts, 'gap'], group.loc[cuts, 'date1'], group.loc[cuts, 'date2']):
            df2.loc[((df2['date1'] >= d) & (df2['date2'] >= r)), 'clusters'] += 1
        df2 = df2.drop('gap', axis = 1)
        newdf = pd.concat([newdf, df2])
    return newdf
Here is a minimal sample of the data it uses:
df = pd.DataFrame(dict([('user_id', np.array(['a', 'a', 'a', 'a', 'a', 'a', 'a'])),
                        ('date1', np.array([datetime.strptime(x, "%y%m%d") for x in ['160101', '160103', '160110', '160120', '160130', '160308', '160325']])),
                        ('date2', np.array([datetime.strptime(x, "%y%m%d") for x in ['160107', '160109', '160115', '160126', '160206', '160314', '160402']]))]))
A simple improvement would be to use cumsum on the boolean vector cuts:
def cluster2(df, threshold):
    df['clusters'] = 0
    grouped = df.groupby('user_id')
    df_list = []
    for name, group in grouped:
        group = group.sort_values(['date1', 'date2'], ascending = True)
        group['gap'] = gap(group)
        cuts = group['gap'] > timedelta(threshold)
        df2 = group.copy()
        df2['clusters'] = cuts.cumsum()
        df_list.append(df2)
    return pd.concat(df_list)
Edit: following OP's comment, I moved concatenation out of the loop to improve performance.
A further improvement could be to not sort the groups in the groupby operation (if there are many users):
grouped = df.groupby('user_id', sort=False)
Or you could even group manually, by sorting df by user_id and adding a user-change condition to cuts directly on the original dataframe:
df = df.sort_values(['user_id', 'date1', 'date2'], ascending = True)
df['gap'] = gap(df)
cuts = (df['user_id'] != df['user_id'].shift()) | (df['gap'] > timedelta(threshold))
df['clusters'] = cuts.cumsum()
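For comparison, here is a sketch of that fully vectorized version packaged as a function (it assumes gap is defined as in the question and threshold is a number of days; note that the cluster numbers now keep increasing across users instead of restarting at 0 for each user):
def cluster3(df, threshold):
    # sort once across all users, then compute gaps on the whole frame
    df = df.sort_values(['user_id', 'date1', 'date2'], ascending=True)
    df['gap'] = gap(df)
    # start a new cluster whenever the user changes or the gap exceeds the threshold
    cuts = (df['user_id'] != df['user_id'].shift()) | (df['gap'] > timedelta(threshold))
    df['clusters'] = cuts.cumsum()
    return df.drop('gap', axis=1)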
Related
I have a data frame of roughly 8 million rows consisting of daily sales for 615 products across 16 stores over five years.
I need to make new columns that consist of the sales shifted back from 1 to 7 days. I've decided to sort the data frame by date, product and location. Then I concatenate product and location as its own column.
Using that column I loop through each unique product/location concatenation and make the shifted sales columns. This code is below:
import pandas as pd

# sort values by date, product, location
df = df.sort_values(['date', 'product', 'location'])
df['sort_values'] = df['product']+"_"+df['location']
df1 = pd.DataFrame()
z = 0
for i in list(df['sort_values'].unique()):
    df_ = df[df['sort_values']==i]
    df_ = df_.sort_values('ORD_DATE')
    df_['eaches_1'] = df_['eaches'].shift(-1)
    df_['eaches_2'] = df_['eaches'].shift(-2)
    df_['eaches_3'] = df_['eaches'].shift(-3)
    df_['eaches_4'] = df_['eaches'].shift(-4)
    df_['eaches_5'] = df_['eaches'].shift(-5)
    df_['eaches_6'] = df_['eaches'].shift(-6)
    df_['eaches_7'] = df_['eaches'].shift(-7)
    df1 = pd.concat((df1, df_))
    z += 1
    if z % 100 == 0:
        print(z)
The above code gets me exactly what I want, but takes FOREVER to complete. Is there a faster way to accomplish what I want?
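One way to avoid the per-key loop entirely is a grouped shift. This is only a sketch, not tested on the real data, and it assumes the date column is ORD_DATE and the quantity column is eaches, as in the loop above:
df = df.sort_values(['product', 'location', 'ORD_DATE'])
g = df.groupby(['product', 'location'])['eaches']
for k in range(1, 8):
    # shift within each product/location group, so values never leak across keys
    df[f'eaches_{k}'] = g.shift(-k)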
I have a dataset in this format:
and it needs to be grouped by the DocumentId and PersonId columns and sorted by StartDate, which I do like this:
df = pd.read_csv(path).sort_values(by=["StartDate"]).groupby(["DocumentId", "PersonId"])
Now, if there is a row in a group with DocumentCode RT and a non-empty EndDate, all other rows in that group need to be filled with that end date. So the result dataset should be the following:
I could not figure out a way to do that. I think I can iterate over each groupby subset, but how will I find the end date value and replace it for each row in that subset?
Based on the suggestions to use bfill(), I tried the following:
df["EndDate"] = (
df.sort_values(by=["StartDate"])
.groupby(["DocumentId", "PersonId"])["EndDate"]
.bfill()
)
Above works fine but how can I add the condition for DocumentCode being RT?
You can calculate the value to use for filling NaN inside the apply function.
def fill_end_date(df):
    rt_doc = df[df["DocumentCode"] == "RT"]
    # if there is a row in this group with DocumentCode RT
    if not rt_doc.empty:
        end_date = rt_doc.iloc[0]["EndDate"]
        # and EndDate not empty
        if pd.notnull(end_date):
            # all other rows need to be filled by that end date
            df = df.fillna({"EndDate": end_date})
    return df

df = pd.read_csv(path).sort_values(by=["StartDate"])
df.groupby(["DocumentId", "PersonId"]).apply(fill_end_date).reset_index(drop=True)
You could find the empty cells and replace them with np.nan, then fillna with method='bfill':
df['EndDate'] = df['EndDate'].apply(lambda x: np.nan if x=='' else x)
df['EndDate'].fillna(method = 'bfill', inplace=True)
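To also respect the condition that the fill value must come from an RT row, one option (just a sketch, assuming the empty cells are already NaN and the frame is sorted by StartDate) is to take the fill values only from RT rows:
# keep only EndDate values that sit on an RT row as potential fill sources
rt_end = df['EndDate'].where(df['DocumentCode'] == 'RT')
# backward-fill those within each DocumentId/PersonId group, then use them to fill the gaps
df['EndDate'] = df['EndDate'].fillna(
    rt_end.groupby([df['DocumentId'], df['PersonId']]).bfill()
)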
Alternatively you could iterate through the df from last row to first row, and fill in the EndDate where necessary:
d = df.loc[df.shape[0]-1, 'EndDate']  # initial condition
for i in range(df.shape[0]-1, -1, -1):
    if df.loc[i, 'DocumentCode'] == 'RT':
        d = df.loc[i, 'EndDate']
    else:
        df.loc[i, 'EndDate'] = d
I would like to retain the row with the highest "score" value among rows whose "start" values are within 3 of each other. I have a dataframe like the one below:
data = {'id': ['id1', 'id2', 'id3', 'id4', 'id5', 'id6'],
        'start': [1, 12, 11, 2, 20, 3],
        'score': [3, 1, 8, 2, 5, 9]}
df = pd.DataFrame(data, columns=['id', 'start', 'score'])
df = df.sort_values(by='start')
Desired output:
data = {'id': ['id3', 'id5', 'id6'],
        'start': [11, 20, 3],
        'score': [8, 5, 9]}
output = pd.DataFrame(data, columns=['id', 'start', 'score'])
output = output.sort_values(by='start')
Because id1, id4, and id6 have start values within plus or minus 3 of each other, we retain the row with the highest score (id6). The same principle holds for id2 and id3, with id3 being retained. id5 is unique and should be retained.
Do you want this? -
bins = range(df['start'].min(), df['start'].max()+3, 3)
cut = pd.cut(df['start'], bins=bins, include_lowest=True)

def test(x):
    return x.sort_values('score').tail(1)

df = df.groupby(cut).apply(test).reset_index(drop=True)
From what I understood, we need to check whether the values in start are consecutive, and if they are, they belong to the same group.
And from each group, we want to keep the rows where the score is max.
This is how I would do it:
cnt = 0

def group(x, y):
    global cnt
    if (x - y) > 1:
        cnt += 1
    return cnt

df['start_2'] = df['start'].shift(1).fillna(1)
df['group'] = df[['start', 'start_2']].apply(lambda x: group(x.start, x.start_2), axis=1)
df = df[df.groupby(['group'])['score'].transform(max) == df['score']]
df.drop(columns=['start_2'], inplace=True)
df
So what's happening here:
I create a column using the start column and shift all values in the downward direction.
Next I look at the difference between the two. If the difference is 1, they belong to the same group; otherwise I create a new group by incrementing the counter. This gives me a new column with the groups.
Using this, group by and filter where the score is max.
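As a side note, the same grouping can be expressed without the global counter. A sketch, on the frame already sorted by start:
# a new group starts whenever the jump in start exceeds 1
# (the first diff is NaN, which compares as False)
grp = (df['start'].diff() > 1).cumsum()
df = df[df.groupby(grp)['score'].transform('max') == df['score']]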
Happy 2020! I would like to create a dataframe based on two others. I have the below two dataframes:
df1 = pd.DataFrame({'date': ['03.05.1982', '04.05.1982', '05.05.1982', '06.05.1982', '07.05.1982', '10.05.1982', '11.05.1982'],
                    'A': [63.63, 64.08, 64.19, 65.11, 65.36, 65.25, 65.36],
                    'B': [63.83, 64.10, 64.19, 65.08, 65.33, 65.28, 65.36],
                    'C': [63.99, 64.22, 64.30, 65.16, 65.41, 65.36, 65.44]})
df2 = pd.DataFrame({'Name': ['A', 'B', 'C'], 'Notice': ['05.05.1982', '07.05.1982', '12.05.1982']})
The idea is to create df3 such that it takes the values of A until A's notice date (found in df2) is reached, then switches to the values of B until B's notice date is reached, and so on. On a notice date itself, it should take the mean of the current column and the next one.
In the above example, df3 should be as follows (with formulas to illustrate):
df3 = pd.DataFrame({'date': ['03.05.1982', '04.05.1982', '05.05.1982', '06.05.1982', '07.05.1982', '10.05.1982', '11.05.1982'],
                    'Result': [63.63, 64.08, (64.19+64.19)/2, 65.08, (65.33+65.41)/2, 65.36, 65.44]})
My idea was to first create a temporary dataframe with the same dimensions as df1 and fill it with 1's when the index date is prior to the notice and 0's after. Doing a rolling mean with window 1 would give, for each column, a series of 1's until I reach 0.5 (signalling a switch).
Not sure if there is a better way to get df3?
I tried the following:
def fill_rule(df_p, df_t):
    return np.where(df_p.index > df_t[df_t.Name == df_p.name]['Notice'][0], 0, 1)

df1['date'] = pd.to_datetime(df1['date'])
df2['Notice'] = pd.to_datetime(df2['Notice'])
df1.set_index("date", inplace = True)
temp = df1.apply(lambda x: fill_rule(x, df2), axis = 0)
And I got the following error: KeyError: (0, 'occurred at index B')
# map each notice date to the column that expires on it, backfill so every date
# gets the label of the next upcoming notice, and default to 'C' after the last notice
df1['t'] = df1['date'].map(df2.set_index(["Notice"])['Name'])
df1['t'] = df1['t'].fillna(method='bfill').fillna("C")
df3 = pd.DataFrame()
# pick, row by row, the value from the column named in 't'
df3['Result'] = df1.apply(lambda row: row[row['t']], axis=1)
df3['date'] = df1['date']
You can use the between method to select the specific date ranges in both dataframes and then use iloc to substitute the specific values
#Initializing the output
df3 = df1.copy()
df3.drop(['B','C'], axis = 1, inplace = True)
df3.columns = ['date','Result']
df3['Result'] = 0.0
df3['count'] = 0

#Modifying df2 to add a dummy sample at the beginning
temp = df2.copy()
temp = temp.iloc[0]
temp = pd.DataFrame(temp).T
temp.Name = 'Z'
temp.Notice = pd.to_datetime("05-05-1980")
df2 = pd.concat([temp, df2])

for i in range(len(df2)-1):
    startDate = df2.iloc[i]['Notice']
    endDate = df2.iloc[i+1]['Notice']
    name = df2.iloc[i+1]['Name']
    indices = [df1.date.between(startDate, endDate, inclusive=True)][0]
    df3.loc[indices, 'Result'] += df1[indices][name]
    df3.loc[indices, 'count'] += 1

df3.Result = df3.apply(lambda x: x.Result/x['count'], axis = 1)
I have a dataframe and I use groupby to group it by Season. One of the columns of the original df is named Check and consists of True and False values. My aim is to count the True values for each group and put the result in a new dataframe.
import pandas as pd
df = ....
df['Check'] = df['Actual'] == df['Prediction']
grouped_per_year = df.groupby('Season')
df_2= pd.DataFrame()
df_2['Seasons'] = total_matches_per_year.keys()
df_2['Successes'] = ''
df_2['Total_Matches'] = list(grouped_per_year.size())
df_2['SR'] = df_2['Successes'] / df_2['Total_Matches']
df_2['Money_In'] = list(grouped_per_year['Money_In'].apply(sum))
df_2['Profit (%)'] = (df_profit['Money_In'] - df_profit['Total_Matches']) / df_profit['Total_Matches'] * 100.
I have tried:
successes_per_year = grouped_per_year['Pred_Check'].value_counts()
but I don't know how to get only the True count.
For counting True, you can also use sum (as True=1 and False=0 when doing a numerical operation):
grouped_per_year['Pred_Check'].sum()
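To plug this back into the summary frame, here is a sketch building it directly from the groupby (assuming the boolean column is named Check, as created above):
summary = df.groupby('Season').agg(
    Successes=('Check', 'sum'),       # count of True values per season
    Total_Matches=('Check', 'size'),  # number of rows per season
)
summary['SR'] = summary['Successes'] / summary['Total_Matches']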