I am trying to solve a problem where I have two dataframes, df1 and df2, which have the same columns. I want to check whether df1['column1'] == df2['column1'] and df1['column2'] == df2['column2'] and df1['column3'] == df2['column3']; if this is true, I want to get the index of both dataframes where the condition matches. I tried the code below, but it takes a long time because the dataframes have around 250,000 rows. Can anyone suggest a more efficient way to do this?
Tried solution:

from datetime import datetime

MS_counter = 0
matched_ws_index = []
start = datetime.now()
for MS_id in Mastersheet_df["Index"]:
    WS_counter = 0
    for WS_id in Weekly_sheet_df["Index"]:
        # match trial ID, biomarker type and index
        if (Weekly_sheet_df.loc[WS_counter, "Trial ID"] == Mastersheet_df.loc[MS_counter, "Trial ID"]) \
                and (Mastersheet_df.loc[MS_counter, "Biomarker Type"] == Weekly_sheet_df.loc[WS_counter, "Biomarker Type"]) \
                and (WS_id == MS_id):
            print("Trial id, index and biomarker type are matched")
            print(WS_counter)
            print(MS_counter)
            matched_ws_index.append(WS_counter)
        WS_counter += 1
    MS_counter += 1
end = datetime.now()
print("The time of execution of above program is:", str(end - start)[5:])
Expected output:
If the above three conditions are true, it should give the dataframe index positions like this:

Matched df1 index is = 170
Matched df2 index is = 658
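A single merge on the three key columns would avoid the nested loops entirely and should handle 250,000 rows in seconds rather than hours. A minimal sketch, assuming the frame and column names from the attempt above (Mastersheet_df, Weekly_sheet_df, and the "Index", "Trial ID" and "Biomarker Type" columns):

# Keep each frame's row position as a regular column, then inner-join on
# all three key columns at once; only rows matching on all three survive.
matched = (
    Mastersheet_df.reset_index().rename(columns={"index": "ms_index"})
    .merge(
        Weekly_sheet_df.reset_index().rename(columns={"index": "ws_index"}),
        on=["Index", "Trial ID", "Biomarker Type"],
    )
)
print(matched[["ms_index", "ws_index"]])  # matched row positions in both frames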
I have the following problem: I have data (table called 'answers') of a quiz application including the answered questions per user with the respective answering date (one answer per line), e.g.:
UserID  Time                 Term      QuestionID  Answer
1       2019-12-28 18:25:15  Winter19  345         a
2       2019-12-29 20:15:13  Winter19  734         b
I would like to write an algorithm to determine whether a user has used the quiz application several days in a row (a so-called 'streak'). Therefore, I want to create a table ('appData') with the following information:
UserID  Term      HighestStreak
1       Winter19  7
2       Winter19  10
For this table I need to compute the variable 'HighestStreak'. I managed to do so with the following code:
for userid, term in zip(appData.userid, appData.term):
    final_streak = 1
    for i in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
        temp_streak = 1
        while i + pd.DateOffset(days=1) in answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique():
            i += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Unfortunately, running this code takes about 45 minutes. The table 'answers' has about 4,000 lines. Is there any structural 'mistake' in my code that makes it so slow or do processes like this take that amount of time?
Any help would be highly appreciated!
EDIT:
I managed to increase the speed from 45 minutes to 2 minutes with the following change:
First, I filtered the data to students who answered at least one question and set the streak to 0 for the rest (as the streak for 0 answers is 0 in every case):
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]
Furthermore, I moved the filtered list out of the loop, so the algorithm does not need to filter twice, resulting in the following new code:
appData.loc[appData.totAnswers==0, 'highestStreak'] = 0
appDataActive = appData[appData.totAnswers!=0]

for userid, term in zip(appData.userid, appData.term):
    activeDays = answers[(answers.userid==userid) & (answers.term==term)].time.dt.date.unique()
    final_streak = 1
    for day in activeDays:
        temp_streak = 1
        while day + pd.DateOffset(days=1) in activeDays:
            day += pd.DateOffset(days=1)
            temp_streak += 1
        if temp_streak > final_streak:
            final_streak = temp_streak
    appData.loc[(appData.userid==userid) & (appData.term==term), 'HighestStreak'] = final_streak
Of course, 2 minutes is much better than 45 minutes. But are there any more tips?
My attempt borrows some key ideas from the connected components problem, a fairly early problem you encounter when looking at graphs.
First I create a random DataFrame with some user IDs and some dates.
import datetime
import pandas
import numpy

# generate a basic dataframe of users and answer dates
def sample_data_frame():
    users = ['A' + str(x) for x in range(10000)]  # generate user IDs
    date_range = pandas.Series(
        pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()),
        name='date')
    users = pandas.Series(users, name='user')
    df = pandas.merge(date_range, users, how='cross')
    removals = numpy.random.randint(0, len(df), int(len(df) / 4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df

def sample_data_frame_v2():  # for pandas versions < 1.2, which lack how='cross'
    users = ['A' + str(x) for x in range(10000)]  # generate user IDs
    date_range = pandas.DataFrame(
        pandas.date_range(datetime.date.today() - datetime.timedelta(days=364), datetime.date.today()),
        columns=['date'])
    users = pandas.DataFrame(users, columns=['user'])
    date_range['key'] = 1
    users['key'] = 1
    df = users.merge(date_range, on='key')
    df = df.drop(labels='key', axis=1)  # assign the result; dropping without assigning leaves 'key' in place
    removals = numpy.random.randint(0, len(df), int(len(df) / 4))  # remove a random quarter of entries
    df.drop(removals, inplace=True)
    return df
Put your DataFrame in sorted order, by user and then by date, so that the next row is the same user's next answer day.
Create two new columns containing the user ID and the date shifted down by one row, i.e. the values from the previous row.
If the previous row has the same user and its date is exactly one day before the current row's date, set the result column to False, numerically 0; otherwise the row starts a new streak, so set it to True, numerically 1.
Cumulatively sum the result column, which groups your streaks.
Finally, count how many entries exist per group and find the max for each user.
For 10k users with 364 days' worth of answers, my running time is about 1 second.
df = sample_data_frame()
df = df.sort_values(by=['user', 'date']).reset_index(drop=True)
df['shift_date'] = df['date'].shift()
df['shift_user'] = df['user'].shift()
df['result'] = ~((df['shift_date'] == df['date'] - datetime.timedelta(days=1)) & (df['shift_user'] == df['user']))
df['group'] = df['result'].cumsum()
summary = df.groupby(by=['user', 'group']).count()['result'].max(level='user')  # on pandas >= 2.0 use .groupby(level='user').max()
summary.sort_values(ascending=False)  # user with the highest streak first
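The same trick should carry over directly to the original 'answers' table. A hedged sketch, assuming it has columns userid, term and a datetime column time as described in the question:

import pandas as pd

# One row per user/term/calendar day, in streak order
daily = (answers.assign(date=answers['time'].dt.normalize())
                .drop_duplicates(['userid', 'term', 'date'])
                .sort_values(['userid', 'term', 'date']))

# A row starts a new streak unless the previous row is the same
# user/term and exactly one day earlier
prev = daily.shift()
new_streak = ~((daily['date'] - prev['date'] == pd.Timedelta(days=1))
               & (daily['userid'] == prev['userid'])
               & (daily['term'] == prev['term']))
daily['streak_id'] = new_streak.cumsum()

# Longest streak per user and term
highest = (daily.groupby(['userid', 'term', 'streak_id']).size()
                .groupby(['userid', 'term']).max()
                .rename('HighestStreak')
                .reset_index())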
I'm looking to optimize the time taken by a function with a for loop. The code below is OK for smaller dataframes, but for larger dataframes it takes too long. The function effectively creates a new column based on calculations using other column values and parameters. The calculation also considers the previous row's value for one of the columns. I read that the most efficient approach is Pandas vectorization, but I'm struggling to understand how to implement this when my for loop needs the previous row's value of one column to populate a new column on the current row. I'm a complete novice, but I have looked around and can't find anything that suits this specific problem, though I'm searching from a position of relative ignorance, so I may have missed something.
The function is below, and I've created a test dataframe and random parameters too. It would be great if someone could point me in the right direction to get the processing time down. Thanks in advance.
def MODE_Gain(Data, rated, MODELim1, MODEin, Normalin, NormalLim600, NormalLim1):
    print('Calculating Gains')
    df = Data
    df.fillna(0, inplace=True)
    df['MODE'] = ""
    df['Nominal'] = ""
    df.iloc[0, df.columns.get_loc('MODE')] = 0
    for i in range(1, len(df.index)):
        print('Computing Status {i}/{r}'.format(i=i, r=len(df.index)))
        if (df['MODE'].loc[i-1] == 1) & (df['A'].loc[i] > Normalin):
            df['MODE'].loc[i] = 1
        elif ((df['MODE'].loc[i-1] == 0) & (df['A'].loc[i] > NormalLim600)) | ((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1)):
            df['MODE'].loc[i] = 1
        else:
            df['MODE'].loc[i] = 0
    df[''] = df['C'] / 6
    for i in range(len(df.index)):
        print('Computing MODE Gains {i}/{r}'.format(i=i, r=len(df.index)))
        if (df['A'].loc[i] > MODEin) & (df['A'].loc[i] < NormalLim600) & (df['B'].loc[i] < NormalLim1):
            df['Nominal'].loc[i] = rated / 6
        else:
            df['Nominal'].loc[i] = 0
    df["Upgrade"] = df[""] - df["Nominal"]
    return df
import numpy as np
import pandas as pd

A = np.random.randint(0, 28, size=8000)
B = np.random.randint(0, 45, size=8000)
C = np.random.randint(0, 2300, size=8000)
df = pd.DataFrame()
df['A'] = pd.Series(A)
df['B'] = pd.Series(B)
df['C'] = pd.Series(C)

MODELim600 = 32
MODELim30 = 28
MODELim1 = 39
MODEin = 23
Normalin = 20
NormalLim600 = 25
NormalLim1 = 32
rated = 2150

finaldf = MODE_Gain(df, rated, MODELim1, MODEin, Normalin, NormalLim600, NormalLim1)
Your second loop doesn't evaluate the prior row, so you should be able to use this instead:
df['Nominal'] = 0
df.loc[(df['A'] > MODEin) & (df['A'] < NormalLim600) & (df['B'] < NormalLim1), 'Nominal'] = rated/6
For your first loop, the elif statement evaluates ((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1)) and sets MODE to 1 regardless of the other condition, so you can pull that out and vectorize it. I didn't try it, but this should do it:

df.loc[(df['B'] > NormalLim1) & (df['B'] < MODELim1), 'MODE'] = 1

Then you may be able to collapse the other conditions into one statement using |.
Not sure how much all that will save you, but you should cut the time in half by getting rid of the 2nd loop.
For vectorizing it, I suggest you first shift your column into another one:

df['MODE_1'] = df['MODE'].shift(1)

and then use:

(df['MODE_1'].loc[i] == 1)

After that you should be able to vectorize.
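Putting both answers together, here is a sketch (untested against real data, using the variable names from the question): the Nominal column and the unconditional B rule vectorize directly, while the remaining recursive MODE logic runs over plain numpy arrays instead of repeated .loc lookups, which is typically orders of magnitude faster:

import numpy as np

# 'Nominal' never looks at the previous row, so it vectorizes directly
df['Nominal'] = np.where(
    (df['A'] > MODEin) & (df['A'] < NormalLim600) & (df['B'] < NormalLim1),
    rated / 6, 0)

# 'MODE' depends on its own previous value, so keep a loop,
# but iterate over numpy arrays rather than DataFrame .loc lookups
A = df['A'].to_numpy()
B = df['B'].to_numpy()
b_rule = (B > NormalLim1) & (B < MODELim1)  # sets MODE to 1 regardless of the previous row
mode = np.zeros(len(df), dtype=int)
for i in range(1, len(df)):
    if b_rule[i]:
        mode[i] = 1
    elif mode[i - 1] == 1:
        mode[i] = int(A[i] > Normalin)
    else:
        mode[i] = int(A[i] > NormalLim600)
df['MODE'] = mode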
Could you please take a look at my code and give me some advice on how I might improve it so it takes less time to process? The main purpose is to look at each row (ID) in the test table and find the same ID in the list table; if there is a match, look at the time difference between the two identical IDs and label the row according to whether the difference is less than 1 hour (3600 s) or not.
Thanks in advance
test.csv has two columns (ID, Time) and 100K rows
list.csv has two columns (ID, Time) and 40K rows
sample data:

ID                Time
83d-36615fa05fb0  2019-12-11 10:41:48
a = -1
for row_index, row in test.iterrows():
    a = a + 1
    for row_index2, row2 in list.iterrows():
        if row['ID'] == row2['ID']:
            time_difference = row['Time'] - row2['Time']
            time_difference = time_difference.total_seconds()
            if time_difference < 3601 and time_difference > 0:
                test.loc[a, 'result'] = "short waiting time"
The code could be converted to the following.
def find_match(id_val, time_val, df):
    """Uses your algorithm for generating a string based upon time difference."""
    # Find rows with a matching ID in DataFrame df
    # (this will be list_ref in our usage)
    matches = df[df['ID'] == id_val]
    if not matches.empty:
        # Use iterrows on the matched rows only
        # (this replaces your inner loop but applies just to rows with the same ID)
        for row_index2, row2 in matches.iterrows():
            time_difference = time_val - row2['Time']
            time_difference = time_difference.total_seconds()
            if 0 < time_difference < 3601:
                return "short waiting time"  # exit early: found a time in the window
    return ""  # didn't find a time in our window

# Test
# test     - your test DataFrame
# list_ref - your list DataFrame (list_ref because list is a bad
#            name for a Python variable)
# Create the result column using apply
# More info on apply: https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
test['result'] = test.apply(lambda row: find_match(row['ID'], row['Time'], list_ref), axis=1)
You should not loop through the rows of your data frames at all.
Instead, combine your test and list data frames using the merge or join method, with 'ID' as the 'on' join key.
You then have the data in one big table, and you can create a new column that subtracts Time_y (from the second table) from Time_x (from the first table); e.g. call the new column 'timediff'.
Finally, filter the resulting data frame on that timediff column for timediff < 3601.
Suppose the dataframes are named testdf and listdf:

joindf = testdf.merge(listdf, on='ID')
joindf['timediff'] = (joindf['Time_x'] - joindf['Time_y']).dt.total_seconds()
joindf.loc[(joindf['timediff'] > 0) & (joindf['timediff'] < 3601), 'result'] = 'short waiting time'
I have 2 DataFrames: one contains 110,000 rows and 10 columns, the other 47,000 rows and 8 columns. I use the 2nd DataFrame to validate the first: if there is a match, I take that row of the first DataFrame and the second DataFrame into a new DataFrame.
The way I check: the 2nd DataFrame has a keyword in its keyword column, and I check the string column of the 1st DataFrame to see whether it includes that keyword or not.
Right now I'm using 2 iterrows() loops to check it, but that would take many days to finish. I wonder if there is a more efficient way to do this.
My code is like this:
for index, ebayrow in ebaydata.iterrows():
    make_match = [e_scrubrow for idx, e_scrubrow in etail_scrub_data.iterrows() if e_scrubrow['keyword'] in ebayrow['title']]
    nummatch = len(make_match)
    if nummatch == 0:
        continue
    else:
        model_match = [e_scrubrow for e_scrubrow in make_match if e_scrubrow['keyword2'] in ebayrow['title']]
        nummatch = len(model_match)
        if nummatch == 0:
            continue
        else:
            if nummatch == 1:
                scrubrow = model_match[0]
                ebaychecked.append(scrubrow['keyword'])
                ebaychecked1.append(scrubrow['keyword2'])
                ebaychecked2.append(scrubrow['keyword3'])
                ebaychecked7.append(ebayrow['info'])
                print(len(ebaychecked))
            else:
                year_match = [e_scrubrow for e_scrubrow in model_match if e_scrubrow['keyword3'] in ebayrow['title']]
                nummatch = len(year_match)
                if nummatch == 0:
                    scrubrow = model_match[0]
                    ebaychecked.append(scrubrow['keyword'])
                    ebaychecked1.append(scrubrow['keyword2'])
                    ebaychecked2.append(scrubrow['keyword3'])
                    ebaychecked7.append(ebayrow['info'])
                    print(len(ebaychecked))
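One way to cut this down considerably (a sketch under the column names used above, ignoring the tie-breaking on keyword3) is to loop once over the smaller scrub table and match each keyword against all 110,000 titles in a single vectorized pass, instead of pairing every row with every other row:

import pandas as pd

rows = []
# One pass over the 47k scrub rows; each str.contains call checks
# every title at once rather than row-by-row
for scrub in etail_scrub_data.itertuples(index=False):
    mask = (ebaydata['title'].str.contains(scrub.keyword, regex=False, na=False)
            & ebaydata['title'].str.contains(scrub.keyword2, regex=False, na=False))
    for info in ebaydata.loc[mask, 'info']:
        rows.append((scrub.keyword, scrub.keyword2, scrub.keyword3, info))

matched = pd.DataFrame(rows, columns=['keyword', 'keyword2', 'keyword3', 'info'])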
I'm filtering a dataframe by hour and weekday:
if type == 'daily':
    hour = data.index.hour
    day = data.index.weekday
    selector = ((hour != 17)) | ((day != 5) & (day != 6))
    data = data[selector]
if type == 'weekly':
    day = data.index.weekday
    selector = ((day != 5) & (day != 6))
    data = data[selector]
Then I'm using a for loop in which I need to write some conditionals based on the weekday/hour, but row.index doesn't carry any of that information. What can I do in this case?
I need to do something like this (it won't work, since row.index doesn't have weekday or hour info):
for index, row in data.iterrows():
    if type == 'weekly' & row.index.weekday == 1 & row.index.hour == 0 & row.index.min == 0 | \
            type == 'daily' & row.index.hour == 18 & row.index.min == 0:
Thanks in advance
I know this is not the most elegant way, but you could create your variables as columns:

df['Hour'] = df.index.hour

If you need a min or a max based on those variables, you could create another column and use a rolling formula such as .rolling().min().
Once you have your columns, you can iterate as you please with the iteration you suggested.
There's more info about the index properties in the pandas documentation.
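A minimal sketch of that column approach, assuming data has a DatetimeIndex as the filtering above implies (note it uses minute: row.index.min in the question resolves to the built-in min method, not the minute of the timestamp):

data['weekday'] = data.index.weekday
data['hour'] = data.index.hour
data['minute'] = data.index.minute

for index, row in data.iterrows():
    # plain and/or instead of &/| since these are scalar comparisons
    if ((type == 'weekly' and row['weekday'] == 1 and row['hour'] == 0 and row['minute'] == 0)
            or (type == 'daily' and row['hour'] == 18 and row['minute'] == 0)):
        pass  # your conditional logic here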