Faster loop in Pandas looking for ID and older date - python

So, I have a DataFrame that represents purchases with 4 columns:
date (date of purchase in format %Y-%m-%d)
customer_ID (string column)
claim (1/0 column: 1 means the customer complained about the purchase, 0 means they didn't)
claim_value (for claim = 1, how much the claim cost the company; for claim = 0 it's NaN)
I need to build 3 new columns:
past_purchases (how many purchases the specific customer had before this purchase)
past_claims (how many claims the specific customer had before this purchase)
past_claims_value (how much did the customer's past claims cost)
This has been my approach until now:
past_purchases = []
past_claims = []
past_claims_value = []
for i in range(0, len(df)):
    date = df['date'][i]
    customer_ID = df['customer_ID'][i]
    df_temp = df[(df['date'] < date) & (df['customer_ID'] == customer_ID)]
    past_purchases.append(len(df_temp))
    past_claims.append(df_temp['claim'].sum())
    past_claims_value.append(df_temp['claim_value'].sum())
df['past_purchases'] = pd.DataFrame(past_purchases)
df['past_claims'] = pd.DataFrame(past_claims)
df['past_claims_value'] = pd.DataFrame(past_claims_value)
The code works fine, but it's too slow. Can anyone make it work faster? Thanks!
PS: It's important to check that the date is strictly older; if the customer had 2 purchases on the same date, they shouldn't count for each other.
PSS: I'm willing to use libraries for parallel processing like multiprocessing, concurrent.futures, joblib or dask, but I've never used them for something like this before.
Expected outcome:

Maybe you can try using a cumsum over customers, if the dates are sorted in ascending order:
df.sort_values('date', inplace=True)
# Shift within each customer so the current purchase is excluded from its own totals
new_temp_columns = ['claim_s', 'claim_value_s']
df[new_temp_columns] = df.groupby('customer_ID')[['claim', 'claim_value']].shift()
df['past_claims'] = df.groupby('customer_ID')['claim_s'].transform(pd.Series.cumsum)
df['past_claims_value'] = df.groupby('customer_ID')['claim_value_s'].transform(pd.Series.cumsum)
# Set the min value for purchases of the same customer on the same date
dfc = df.groupby(['customer_ID', 'date'])[['past_claims', 'past_claims_value']]
df[['past_claims', 'past_claims_value']] = dfc.transform('min')
# Remove temp columns
df = df.loc[:, ~df.columns.isin(new_temp_columns)]
Again, this will only work if the dates are sorted.
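The answer above covers past_claims and past_claims_value; for past_purchases, a minimal sketch in the same spirit (assuming the column names from the question and the same ascending date sort) could use a per-customer cumulative count:
df = df.sort_values('date')
# number of earlier rows for this customer (0-based running count)
df['past_purchases'] = df.groupby('customer_ID').cumcount()
# same-date purchases shouldn't count for each other, so take the group minimum
df['past_purchases'] = df.groupby(['customer_ID', 'date'])['past_purchases'].transform('min')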

Related

Keep rows based on more than 2 columns

I have a database similar to the one I created below (982977 rows × 10 columns), and I want to keep the rows where exams of the same patient (ID) other than "COVID" were performed within a specific period based on the date of the "COVID" exam.
I created 2 columns, one with dates 7 days before and one with 30 days after the original exam date.
Ex: If the patient had an iron exam between 7 days before and 30 days after the date of their COVID exam, then I would keep that patient; otherwise, I would remove them.
I did a for loop, but since the database is big it took almost 6 hours to complete, and when it finished I lost the connection to the server and couldn't continue to manipulate the data.
Is there a simpler and/or faster way to do this?
ID = ['1','1','1','2','2']
Exam = ['COVID', 'Iron', 'Ferritin', 'COVID', 'Iron']
Date = ['2021-02-22','2021-02-20','2021-06-22','2021-05-22','2021-05-29']
Date7 = ['2021-02-15','2021-02-13','2021-06-15','2021-05-15','2021-05-22']
Date30 = ['2021-03-24','2021-03-22','2021-07-24','2021-05-22','2021-06-29']
teste = list(zip(ID, Exam, Date, Date7, Date30))
teste2 = pd.DataFrame(teste, columns=['ID','Exam','Date', 'Date7', 'Date30'])
All the date columns are already in datetime format.
pacients = []
for pacient in teste2.ID.unique():
    a = teste2[teste2.ID==pacient]
    b = a[a.Exam!="COVID"]
    c = a[a.Exam=="COVID"]
    for exam_covid in b.Date:
        for covid_7 in c.Date7:
            for covid_30 in c.Date30:
                if covid_7 < exam_covid < covid_30:
                    pacients.append(pacient)
pacients = set(pacients)
pacients = list(pacients)
With the following sample dataframe named df
ID = ['1','1','1','2','2']
Exam = ['COVID', 'Iron', 'Ferritin', 'COVID', 'Iron']
Date = ['2021-02-22','2021-02-20','2021-06-22','2021-05-22','2021-06-29']
df = pd.DataFrame({'ID': ID, 'Exam': Exam, 'Date': pd.to_datetime(Date)})
you could try the following:
Step 1: Create a dataframe df_cov that covers all the time intervals around the COVID exams:
df_cov = df[['ID', 'Date']][df.Exam.eq('COVID')]
df_cov = df_cov.assign(
    Before=df_cov.Date - pd.Timedelta(days=7),
    After=df_cov.Date + pd.Timedelta(days=30)
).drop(columns='Date')
Step 2: Merge the non-COVID exams in df with df_cov on the column ID, then select the exams that fall within the intervals (here with query), and then extract the remaining unique IDs:
patients = (
    df[df.Exam.ne('COVID')].merge(df_cov, on='ID', how='left')
    .query('(Before < Date) & (Date < After)')
    .ID.unique()
)
Result for the sample (I've changed the last exam date such that it won't fall into the required time interval):
array(['1'], dtype=object)
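If the goal is to keep the matching rows themselves rather than just the IDs (as described in the question), the original frame can then be filtered with the IDs found above; a minimal sketch:
# keep every row belonging to a patient with at least one exam in the window
kept = df[df.ID.isin(patients)]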

How can you increase the speed of an algorithm that computes a usage streak?

I have the following problem: I have data (table called 'answers') of a quiz application including the answered questions per user with the respective answering date (one answer per line), e.g.:
UserID  Time                 Term      QuestionID  Answer
1       2019-12-28 18:25:15  Winter19  345         a
2       2019-12-29 20:15:13  Winter19  734         b
I would like to write an algorithm to determine whether a user has used the quiz application several days in a row (a so-called 'streak'). Therefore, I want to create a table ('appData') with the following information:
UserID  Term      HighestStreak
1       Winter19  7
2       Winter19  10
For this table I need to compute the variable 'HighestStreak'. I managed to do so with the following code:
for userid, term in zip(appData.userid, appData.term):
    final_streak = 1
    for i in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
        temp_streak = 1
        while i + pd.DateOffset(days=1) in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
            i += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Unfortunately, running this code takes about 45 minutes. The table 'answers' has about 4,000 lines. Is there any structural 'mistake' in my code that makes it so slow or do processes like this take that amount of time?
Any help would be highly appreciated!
EDIT:
I managed to increase the speed from 45 minutes to 2 minutes with the following change:
I first filtered the data to students who answered at least one question and set the streak to 0 for the rest (as the streak for 0 answers is always 0):
appData.loc[appData.totAnswers==0, 'HighestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
Furthermore I moved the filtered list out of the loop, so the algorithm does not need to filter twice, resulting in the following new code:
appData.loc[appData.totAnswers==0, 'HighestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
for userid, term in zip(appData.userid, appData.term):
    activeDays = answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique()
    final_streak = 1
    for day in activeDays:
        temp_streak = 1
        while day + pd.DateOffset(days=1) in activeDays:
            day += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Of course, 2 minutes is much better than 45 minutes. But are there any more tips?
My attempt, which borrows some key ideas from the connected-components problem, a fairly early problem when studying graphs.
First I create a random DataFrame with some user IDs and some dates.
import datetime
import random
import pandas
import numpy

#generate basic dataframe of users and answer dates
def sample_data_frame():
    users = ['A' + str(x) for x in range(10000)] #generate user ids
    date_range = pandas.Series(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()),
                               name='date')
    users = pandas.Series(users, name='user')
    df = pandas.merge(date_range, users, how='cross')
    removals = numpy.random.randint(0, len(df), int(len(df)/4)) #remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df

def sample_data_frame_v2(): #pandas version < 1.2
    users = ['A' + str(x) for x in range(10000)] #generate user ids
    date_range = pandas.DataFrame(pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()), columns=['date'])
    users = pandas.DataFrame(users, columns=['user'])
    date_range['key'] = 1
    users['key'] = 1
    df = users.merge(date_range, on='key')
    df = df.drop(labels='key', axis=1)
    removals = numpy.random.randint(0, len(df), int(len(df)/4)) #remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df
Put your DataFrame in sorted order by user and then by date, so that consecutive rows for a user are consecutive answer days.
Create two new columns containing the user ID and the date of the previous row (via shift).
If the previous row has the same user and its date is exactly one day before the current date, set the result column to False (numerically 0); otherwise a new streak starts, so set it to True (numerically 1).
Cumulatively sum the result column, which assigns a group number to each streak.
Finally, count how many entries exist per group and take the maximum for each user.
For 10k users over 364 days' worth of answers, my running time is about 1 second.
df = sample_data_frame()
df = df.sort_values(by=['user', 'date']).reset_index(drop = True)
df['shift_date'] = df['date'].shift()
df['shift_user'] = df['user'].shift()
df['result'] = ~((df['shift_date'] == df['date'] - datetime.timedelta(days=1)) & (df['shift_user'] == df['user']))
df['group'] = df['result'].cumsum()
summary = (df.groupby(by=['user', 'group']).count()['result'].groupby(level='user').max())
summary.sort_values(ascending = False) #print user with highest streak
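Mapped back onto the question's answers table (columns UserID, Term and Time as shown above), the same trick can produce HighestStreak per user and term; a minimal sketch, assuming answers is already loaded with Time as a datetime column:
import pandas as pd
# one row per active day per (UserID, Term)
days = (answers.assign(day=answers['Time'].dt.normalize())
               [['UserID', 'Term', 'day']]
               .drop_duplicates()
               .sort_values(['UserID', 'Term', 'day']))
# a new streak starts whenever the previous day within the same user/term
# is not exactly one day earlier
prev = days.groupby(['UserID', 'Term'])['day'].shift()
days['streak_id'] = ((days['day'] - prev) != pd.Timedelta(days=1)).cumsum()
# length of each streak, then the longest streak per user and term
streak_len = days.groupby(['UserID', 'Term', 'streak_id']).size()
appData = (streak_len.groupby(level=['UserID', 'Term']).max()
                     .rename('HighestStreak').reset_index())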

Efficient way to loop through GroupBy DataFrame

Since my last post lacked information:
Example of my df (the important columns):
deviceID: unique ID for the vehicle. Vehicles send data every X minutes.
mileage: the distance moved since the last message (in km)
positon_timestamp_measure: unixTimestamp of the time the dataset was created.
deviceID  mileage  position_timestamp_measure
54672     10       1600696079
43423     20       1600696079
42342     3        1600701501
54672     3        1600702102
43423     2        1600702701
My goal is to validate the mileage by comparing it to the max speed of the vehicle (which is 80 km/h), calculating the speed of the vehicle from the timestamps and the mileage. The result should then be written to the original dataset.
What I've done so far is the following:
maxSpeedKMH = 80  # max speed of the vehicle (80 km/h)
df_ori['dataIndex'] = df_ori.index
df = df_ori.groupby('device_id')
#create new col and set all values to false
df_ori['valid'] = 0
for group_name, group in df:
    #sort group by time
    group = group.sort_values(by='position_timestamp_measure')
    group = group.reset_index()
    #since I can't validate the first point in the group, I set it to valid
    df_ori.loc[df_ori.index == group.dataIndex.values[0], 'validPosition'] = 1
    #iterate through each data point in the group
    for i in range(1, len(group)):
        timeGoneSec = abs(group.position_timestamp_measure.values[i]-group.position_timestamp_measure.values[i-1])
        timeHours = (timeGoneSec/60)/60
        #calculate speed
        if((group.mileage.values[i]/timeHours)<maxSpeedKMH):
            df_ori.loc[df_ori.index == group.dataIndex.values[i], 'validPosition'] = 1
df_ori.validPosition.value_counts()
It definitely works the way I want it to; however, performance is poor. The df contains nearly 700k rows (already cleaned). I am still a beginner and can't figure out a better solution. I would really appreciate any help.
If I got it right, no for-loops are needed here. Here is what I've transformed your code into:
df_ori['dataIndex'] = df_ori.index
#create new col and set all values to false
df_ori['valid'] = 0
df_ori = df_ori.sort_values(['position_timestamp_measure'])
# Subtract the preceding value from the current value, per device
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
# The operation above produces NaN for the first value in each group,
# so fill 'valid' with 1 there, following the original code
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1
df_ori['timeHours'] = df_ori['timeGoneSec']/3600 # 60*60 = 3600
df_ori['flag'] = (df_ori['mileage'] / df_ori['timeHours']) <= maxSpeedKMH
df_ori.loc[df_ori['flag'], 'valid'] = 1
# Remove helper columns
df_ori = df_ori.drop(columns=['flag', 'timeHours', 'timeGoneSec'])
The basic idea is to use vectorized operations as much as possible and to avoid for loops, i.e. row-by-row iteration, which can be insanely slow.
Since I don't have the full context of your code, please double-check the logic and make sure it works as desired.
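As a quick check, here is a minimal, self-contained sketch of the same idea applied to the sample rows from the question (column names taken from the post; the 80 km/h limit comes from the question text):
import pandas as pd

maxSpeedKMH = 80
df_ori = pd.DataFrame({
    'device_id': [54672, 43423, 42342, 54672, 43423],
    'mileage': [10, 20, 3, 3, 2],
    'position_timestamp_measure': [1600696079, 1600696079, 1600701501, 1600702102, 1600702701],
})

df_ori = df_ori.sort_values('position_timestamp_measure')
# seconds since the previous message of the same device (NaN for the first one)
df_ori['timeGoneSec'] = df_ori.groupby('device_id')['position_timestamp_measure'].diff()
df_ori['valid'] = 0
df_ori.loc[df_ori['timeGoneSec'].isna(), 'valid'] = 1   # first point per device
speed = df_ori['mileage'] / (df_ori['timeGoneSec'] / 3600)
df_ori.loc[speed <= maxSpeedKMH, 'valid'] = 1
print(df_ori[['device_id', 'mileage', 'valid']])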

Pandas - Issue with plotting a graph after group by

I have a table "attendance" in sqlite3 which I am importing as a dataframe using pandas. The dataframe looks like this:
id name date time
0 12345 Pankaj 2020-09-12 1900-01-01 23:17:49
1 12345 Pankaj 2020-09-12 1900-01-01 23:20:28
2 12345 Pankaj 2020-09-13 1900-01-01 13:36:01
A person (an 'id') can appear multiple times, which corresponds to a person going in and out of the door multiple times a day; we record each of those transitions.
I wish to find the difference between the last time out and the first time in, to get the number of hours a person was present at work.
Since we need only the data for one person at a time, I first filter the data for one person, like this:
df = df.loc[df['id']== id]
This leaves me with all the entries for a particular person.
Now, for the difference between the last entry time and the first entry time, I calculate it like this:
df_gp = df.groupby('date')['time']
difference = df_gp.max() - df_gp.min()
Now, the "difference" comes out as a pandas series.
date
2020-09-12 00:02:39
2020-09-13 00:00:00
When I try to plot the graph using the pandas Series.plot() method with kind = 'line', like this:
difference.plot(kind = 'line')
I don't see a graph being made at all. I don't see any error as such; it simply does not show anything.
When I print,
print(difference.plot(kind = 'line'))
It prints this in the terminal:
AxesSubplot(0.125,0.2;0.775x0.68)
So I thought it must be something with time.sleep(), that the graph gets destroyed because the function exits too quickly, but that is not the case; I have tried so many things, and it simply doesn't show.
I need help with:
1. I don't know if this is the correct way to make a graph of the time difference for each day. Please suggest a more elegant way to do the same if you have one.
2. What is the reason it doesn't show at all?
Complete code
def main():
    emp_id = "12345"
    db = os.path.join(constants.BASE_DIR.format("db"), "db_all.db")
    with closing(sqlite3.connect(db)) as conn:
        df = pd.read_sql_query("select * from attendance where id = {} order by date ASC".format(emp_id), conn,
                               parse_dates={'date': '%Y-%m-%d', 'time': '%H:%M:%S'})
    print(df.head())
    #df = df.loc[df['id']== id]
    is_empty = df.empty
    if is_empty:
        messagebox.showerror("Error", "There are not enough records of employee")
        return
    # Add the latest row
    emp_name = df.loc[(df['id'] == emp_id).idxmax(), 'name']
    # dt_time = datetime.datetime.now().replace(microsecond=0)
    # _date, _time = dt_time.date(), dt_time.time()
    # print(type(_date))
    # print(type(_time))
    # df.loc[-1] = [emp_id, emp_name, _date, _time]
    # df.index += 1
    # df = df.sort_index()
    # print(df.dtypes)
    df_gp = df.groupby('date')['time']
    print("Here")
    difference = df_gp.max() - df_gp.min()
    print(difference)
    print(difference.plot(kind='line'))

if __name__ == '__main__':
    main()
-Thanks
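A note on point 2: pandas' .plot() only returns a Matplotlib Axes object (hence the AxesSubplot(...) output); in a plain script the figure window usually appears only after calling matplotlib.pyplot.show(). A minimal sketch of that idea, not the original code:
import matplotlib.pyplot as plt

ax = difference.plot(kind='line')  # creates the Axes but does not display it by itself
plt.show()                         # opens the figure window and blocks until it is closed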

Mapping a column from one dataframe to another in pandas based on condition

I have two dataframes, df_inv and df_sales.
I need to add a column to df_inv with the sales person's name, based on the doctor he is tagged to in df_sales. This would be a simple merge, I guess, if the sales-person-to-doctor relationship in df_sales were unique. But ownership of doctors changes between sales persons, and a row is added with each transfer with an updated date.
So if the invoice date is earlier than the updated date, then the previous tagging should be used; if there is no previous tagging, it should show NaN. In other words, for each invoice_date in df_inv, the most recent earlier updated_date in df_sales should be used for tagging.
The resulting table should be like this
Final Table
I am relatively new to programming, but I can usually find my way through problems. I cannot figure this out, though. Any help is appreciated.
import pandas as pd
import numpy as np

df_inv = pd.read_excel(r'C:\Users\joy\Desktop\sales indexing\consolidated report.xlsx')
df_sales1 = pd.read_excel(r'C:\Users\joy\Desktop\sales indexing\Sales Person tagging.xlsx')
df_sales2 = df_sales1.sort_values('Updated Date', ascending=False)
df_sales = df_sales2.reset_index(drop=True)

sales_tag = []
sales_dup = []
counter = 0
for inv_dt, doc in zip(df_inv['Invoice_date'], df_inv['Doctor_Name']):
    for sal, ref, update in zip(df_sales['Sales Person'], df_sales['RefDoctor'], df_sales['Updated Date']):
        if ref==doc:
            if update<=inv_dt and sal not in sales_dup:
                sales_tag.append(sal)
                sales_dup.append(ref)
                break
            else:
                pass
        else:
            pass
    sales_dup = []
    counter = counter+1
    if len(sales_tag)<counter:
        sales_tag.append('none')
    else:
        pass
df_inv['sales_person'] = sales_tag
This appears to work.
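For a vectorized alternative to the nested loop, pandas' merge_asof does exactly this kind of "most recent earlier date" lookup. A minimal sketch, assuming the column names used in the code above (Invoice_date and Doctor_Name in df_inv; RefDoctor, Sales Person and Updated Date in df_sales1):
import pandas as pd

# merge_asof requires both frames to be sorted by their date keys
df_inv_sorted = df_inv.sort_values('Invoice_date')
df_sales_sorted = df_sales1.sort_values('Updated Date')

tagged = pd.merge_asof(
    df_inv_sorted,
    df_sales_sorted[['RefDoctor', 'Sales Person', 'Updated Date']],
    left_on='Invoice_date',
    right_on='Updated Date',
    left_by='Doctor_Name',
    right_by='RefDoctor',
    direction='backward',   # most recent Updated Date <= Invoice_date
)
# Invoices with no earlier tagging get NaN in 'Sales Person', as described in the question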
