I'm working on a project to monitor my 5k time for my running/jogging activities based on their GPS data. I'm currently exploring my data in a Jupyter notebook & now realize that I will need to exclude some activities.
Each activity is a row in a dataframe. While I do want to exclude some rows, I don't want to drop them from my dataframe as I will also use the df for other calculations.
I've added a column to the df along with a custom function for checking the invalidity reasons of a row. It's possible that a run could be excluded for multiple reasons.
In []:
# add invalidity reasons column & update logic
df['invalidity_reasons'] = ''
def maintain_invalidity_reasons(reason):
"""logic for maintaining ['invalidity reasons']"""
reasons = []
if invalidity_reasons == '':
return list(reason)
else:
reasons = invalidity_reasons
reasons.append(reason)
return reasons
I filter down to specific rows in my df and pass them to my function. The below example returns a set of five rows from the df. Below is an example of using the function in my Jupyter notebook.
In []:
columns = ['distance','duration','notes']
filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt,columns].apply(maintain_invalidity_reasons('short_run'),axis=1)
Out []:
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
<ipython-input-107-0bd06407ef08> in <module>
2
3 filt = (df['duration'] < pd.Timedelta('5 minutes'))
----> 4 df.loc[filt,columns].apply(maintain_invalidity_reasons(reason='short_run'),axis=1)
<ipython-input-106-60264b9c7b13> in maintain_invalidity_reasons(reason)
5 """logic for maintaining ['invalidity reasons']"""
6 reasons = []
----> 7 if invalidity_reasons == '':
8 return list(reason)
9 else:
NameError: name 'invalidity_reasons' is not defined
Here is an example of the output of my filter if I remove the .apply() call to my function
In []:
columns = ['distance','duration', 'notes','invalidity_reasons']
filt = (df['duration'] < pd.Timedelta('5 minutes'))
df.loc[filt,columns]
Out []:
It seems that my issue lies in not knowing how to specify that I want to reference the scalar value in the 'invalidity_reasons' index/column (not sure of the proper term) of the specific row.
I've tried adjusting the IF statement with the below variants. I've also tried to apply the function with/out the axis argument. I'm stuck, please help!
if 'invalidity_reasons' == '':
if s['invalidity_reasons'] == '':
This is pretty much a stab in the dark, but I hope it helps. In the following I'm using this simple frame as an example (to have something to work with):
df = pd.DataFrame({'Col': range(5)})
Now if you define
def maintain_invalidity_reasons(current_reasons, new_reason):
if current_reasons == '':
return [new_reason]
if type(current_reasons) == list:
return current_reasons + [new_reason]
return [current_reasons] + [new_reason]
add another column invalidity_reasons to df
df['invalidity_reasons'] = ''
populate one cell (for the sake of exemplifying)
df.loc[0, 'invalidity_reasons'] = 'a reason'
Col invalidity_reasons
0 0 a reason
1 1
2 2
3 3
4 4
build a filter
filt = (df.Col < 3)
and then do
df.loc[filt, 'invalidity_reasons'] = (df.loc[filt, 'invalidity_reasons']
.apply(maintain_invalidity_reasons,
args=('another reason',)))
you will get
Col invalidity_reasons
0 0 [a reason, another reason]
1 1 [another reason]
2 2 [another reason]
3 3
4 4
Does that somehow resemble what you are looking for?
Related
I'm looking to optimize the time taken for a function with a for loop. The code below is ok for smaller dataframes, but for larger dataframes, it takes too long. The function effectively creates a new column based on calculations using other column values and parameters. The calculation also considers the value of a previous row value for one of the columns. I read that the most efficient way is to use Pandas vectorization, but i'm struggling to understand how to implement this when my for loop is considering the previous row value of 1 column to populate a new column on the current row. I'm a complete novice, but have looked around and cant find anything that suits this specific problem, though I'm searching from a position of relative ignorance, so may have missed something.
The function is below and I've created a test dataframe and random parameters too. it would be great if someone could point me in the right direction to get the processing time down. Thanks in advance.
def MODE_Gain (Data, rated, MODELim1, MODEin, Normalin,NormalLim600,NormalLim1):
print('Calculating Gains')
df = Data
df.fillna(0, inplace=True)
df['MODE'] = ""
df['Nominal'] = ""
df.iloc[0, df.columns.get_loc('MODE')] = 0
for i in range(1, (len(df.index))):
print('Computing Status{i}/{r}'.format(i=i, r=len(df.index)))
if ((df['MODE'].loc[i-1] == 1) & (df['A'].loc[i] > Normalin)) :
df['MODE'].loc[i] = 1
elif (((df['MODE'].loc[i-1] == 0) & (df['A'].loc[i] > NormalLim600))|((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1 ))):
df['MODE'].loc[i] = 1
else:
df['MODE'].loc[i] = 0
df[''] = (df['C']/6)
for i in range(len(df.index)):
print('Computing MODE Gains {i}/{r}'.format(i=i, r=len(df.index)))
if ((df['A'].loc[i] > MODEin) & (df['A'].loc[i] < NormalLim600)&(df['B'].loc[i] < NormalLim1)) :
df['Nominal'].loc[i] = rated/6
else:
df['Nominal'].loc[i] = 0
df["Upgrade"] = df[""] - df["Nominal"]
return df
A = np.random.randint(0,28,size=(8000))
B = np.random.randint(0,45,size=(8000))
C = np.random.randint(0,2300,size=(8000))
df = pd.DataFrame()
df['A'] = pd.Series(A)
df['B'] = pd.Series(B)
df['C'] = pd.Series(C)
MODELim600 = 32
MODELim30 = 28
MODELim1 = 39
MODEin = 23
Normalin = 20
NormalLim600 = 25
NormalLim1 = 32
rated = 2150
finaldf = MODE_Gain(df, rated, MODELim1, MODEin, Normalin,NormalLim600,NormalLim1)
Your second loop doesn't evaluate the prior row, so you should be able to use this instead
df['Nominal'] = 0
df.loc[(df['A'] > MODEin) & (df['A'] < NormalLim600) & (df['B'] < NormalLim1), 'Nominal'] = rated/6
For your first loop, the elif statements looks to evaluate this
((df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1 )) and sets it to 1 regardless of the other condition, so you can remove that and vectorize that operation. didn't try, but this should do it
df.loc[(df['B'].loc[i] > NormalLim1) & (df['B'].loc[i] < MODELim1 ), 'MODE'] = 1
then you may be able to collapse the other conditions into one statement use |
Not sure how much all that will save you, but you should cut the time in half getting rid of the 2nd loop.
For vectorizing it I suggest you first shift your column in another one :
df['MODE_1'] = df['MODE'].shift(1)
and then use :
(df['MODE_1'].loc[i] == 1)
After that you should be able to vectorize
This is my first time posting a question, so take it easy on me if I don't know stack overflow norm of asking questions.
Attached is a snippet of what I am trying to accomplish on my side-project. I want to be able to compare a user input with a database .xlsx file that was imported by pandas.
I want to compare the user input with the database column ['Component'], if that component is there, it will grab its properties associated with that component.
comp_loc = r'C:\Users\ayubi\Documents\Python Files\Chemical_Database.xlsx'
data = pd.read_excel(comp_loc)
print(data)
LK = input('What is the Light Key?: ') #Answer should be Benzene in this case
if LK == data['Component'].any():
Tcrit = data['TC, (K)']
Pcrit = data['PC, (bar)']
A = data['A']
B = data['B']
C = data['C']
D = data['D']
else:
print('False')
Results
Component TC, (K) PC, (bar) A B C D
0 Benzene 562.2 48.9 -6.983 1.332 -2.629 -3.333
1 Toluene 591.8 41.0 -7.286 1.381 -2.834 -2.792
What is the Light Key?: Benzene
False
Please let me know if you have any questions.
I do appreciate your help!
You can do this by taking advantage of indices and using the df.loc accessor in pandas:
# set index to Component column for convenience
data = data.set_index('Component')
LK = input('What is the Light Key?: ') #Answer should be Benzene in this case
# check if your lookup is in the index
if LK in data.index:
# grab the row by the index using .loc
row = data.loc[LK]
# if the column name has spaces, you need to access as key
Tcrit = row['TC, (K)']
Pcrit = row['PC, (bar)']
# if the column name doesn't have a space, you can access as attribute
A = row.A
B = row.B
C = row.C
D = row.D
else:
print('False')
This is a great case for an Index. Set 'Component' to the Index, and then you can use a very fast loc call to look up the data. Instead of the if-else use a try-except as a KeyError is going to tell you that the LK doesn't exist, without requiring the slower check of first checking whether it's in the index.
I also highly suggest you keep the values as a single Series, instead of floating around as 6 different varibales. It's simple to access each item by the Series index, i.e. Series['A'].
df = df.set_index('Component')
def grab_data(df, LK):
try:
return df.loc[LK]
except KeyError:
return False
grab_data(df, 'Benzene')
#TC, (K) 562.200
#PC, (bar) 48.900
#A -6.983
#B 1.332
#C -2.629
#D -3.333
#Name: Benzene, dtype: float64
grab_data(df, 'foo')
#False
Overview: I am working with pandas dataframes of census information, while they only have two columns, they are several hundred thousand rows in length. One column is a census block ID number and the other is a 'place' value, which is unique to the city in which that census block ID resides.
Example Data:
BLOCKID PLACEFP
0 60014001001000 53000
1 60014001001001 53000
...
5844 60014099004021 53000
5845 60014100001000
5846 60014100001001
5847 60014100001002 53000
Problem: As shown above, there are several place values that are blank, though they have a census block ID in their corresponding row. What I found was that in several instances, the census block ID that is missing a place value, is located within the same city as the surrounding blocks that do not have a missing place value, especially if the bookend place values are the same - as shown above, with index 5844 through 5847 - those two blocks are located within the same general area as the surrounding blocks, but just seem to be missing the place value.
Goal: I want to be able to go through this dataframe, find these instances and fill in the missing place value, based on the place value before the missing value and the place value that immediately follows.
Current State & Obstacle: I wrote a loop that goes through the dataframe to correct these issues, shown below.
current_state_blockid_df = pandas.DataFrame({'BLOCKID':[60014099004021,60014100001000,60014100001001,60014100001002,60014301012019,60014301013000,60014301013001,60014301013002,60014301013003,60014301013004,60014301013005,60014301013006],
'PLACEFP': [53000,,,53000,11964,'','','','','','',11964]})
for i in current_state_blockid_df.index:
if current_state_blockid_df.loc[i, 'PLACEFP'] == '':
#Get value before blank
prior_place_fp = current_state_blockid_df.loc[i - 1, 'PLACEFP']
next_place_fp = ''
_n = 1
# Find the end of the blank section
while next_place_fp == '':
next_place_fp = current_state_blockid_df.loc[i + _n, 'PLACEFP']
if next_place_fp == '':
_n += 1
# if the blanks could likely be in the same city, assign them the city's place value
if prior_place_fp == next_place_fp:
for _i in range(1, _n):
current_state_blockid_df.loc[_i, 'PLACEFP'] = prior_place_fp
However, as expected, it is very slow when dealing with hundreds of thousands or rows of data. I have considered using maybe ThreadPool executor to split up the work, but I haven't quite figured out the logic I'd use to get that done. One possibility to speed it up slightly, is to eliminate the check to see where the end of the gap is and instead just fill it in with whatever the previous place value was before the blanks. While that may end up being my goto, there's still a chance it's too slow and ideally I'd like it to only fill in if the before and after values match, eliminating the possibility of the block being mistakenly assigned. If someone has another suggestion as to how this could be achieved quickly, it would be very much appreciated.
You can use shift to help speed up the process. However, this doesn't solve for cases where there are multiple blanks in a row.
df['PLACEFP_PRIOR'] = df['PLACEFP'].shift(1)
df['PLACEFP_SUBS'] = df['PLACEFP'].shift(-1)
criteria1 = df['PLACEFP'].isnull()
criteria2 = df['PLACEFP_PRIOR'] == df['PLACEFP_AFTER']
df.loc[criteria1 & criteria2, 'PLACEFP'] = df.loc[criteria1 & criteria2, 'PLACEFP_PRIOR']
If you end up needing to iterate over the dataframe, use df.itertuples. You can access the column values in the row via dot notation (row.column_name).
for idx, row in df.itertuples():
# logic goes here
Using your dataframe as defined
def fix_df(current_state_blockid_df):
df_with_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] == '']
df_no_blanks = current_state_blockid_df[current_state_blockid_df['PLACEFP'] != '']
sections = {}
last_i = 0
grouping = []
for i in df_with_blanks.index:
if i - 1 == last_i:
grouping.append(i)
last_i = i
else:
last_i = i
if len(grouping) > 0:
sections[min(grouping)] = {'indexes': grouping}
grouping = []
grouping.append(i)
if len(grouping) > 0:
sections[min(grouping)] = {'indexes': grouping}
for i in sections.keys():
sections[i]['place'] = current_state_blockid_df.loc[i-1, 'PLACEFP']
l = []
for i in sections:
for x in sections[i]['indexes']:
l.append(sections[i]['place'])
df_with_blanks['PLACEFP'] = l
final_df = pandas.concat([df_with_blanks, df_no_blanks]).sort_index(axis=0)
return final_df
df = fix_df(current_state_blockid_df)
print(df)
Output:
BLOCKID PLACEFP
0 60014099004021 53000
1 60014100001000 53000
2 60014100001001 53000
3 60014100001002 53000
4 60014301012019 11964
5 60014301013000 11964
6 60014301013001 11964
7 60014301013002 11964
8 60014301013003 11964
9 60014301013004 11964
10 60014301013005 11964
11 60014301013006 11964
I have a for-loop looks like this, but it takes a long time to run once a large dataset being passed into it.
for i in range(0,len(data_sim.index)):
for j in range(1,len(data_sim.columns)):
user = data_sim.index[i]
activity = data_sim.columns[j]
if dt_full.loc[i][j] != 0:
data_sim.loc[i][j] = 0
else:
activity_top_names = data_neighbours.loc[activity][1:dt_length]
activity_top_sims = data_corr.loc[activity].sort_values(ascending=False)[1:dt_length]
user_purchases = data_activity.loc[user,activity_top_names]
data_sim.loc[i][j] = getScore(user_purchases,activity_top_sims)
In for-loop, data_sim looks like this:
CustomerId A B C D E
1 NAs NAs NAs NAs NAs
2 ..
I tried to reproduce the same process in apply function, which looks like this:
def test(cell):
user = cell.index
activity = cell
activity_top_names = data_neighbours.loc[activity][1:dt_length]
activity_top_sims = data_corr.loc[activity].sort_values(ascending=False)[1:dt_length]
user_purchase = data_activity_index.loc[user, activity_top_names]
if dt_full.loc[user][activity] != 0:
return cell.replace(cell, 0)
else:
re = getScore(user_purchase, activity_top_sims)
return cell.replace(cell, re)
In function, data_sim2 looks like this, I set the 'CustomerId' column to index column and duplicated the activity name to each activity column.
CustomerId(Index) A B C D E
1 A B C D E
2 A B C D E
Inside of the function 'def test(cell)', if the cell is in data_sim2[1][0],
cell.index = 1 # userId
cell # activity name
The whole idea of this for-loop is to fit the scoring data into 'data_sim' table based on position of each cell. And I used the same idea in creating function, used the same calculation in each cell, then apply this to the data table 'data_sim',
data_test = data_sim2.apply(lambda x: test(x))
it gave me a error said
"sort_values() missing 1 required positional argument: 'by'"
which is odd, because this issue was not happening inside of for loop. It sounds like the 'data_corr.loc[activity]' is still a Dataframe istead of a Series.
I use python and I have data of 35 000 rows I need to change values by loop but it takes too much time
ps: I have columns named by succes_1, succes_2, succes_5, succes_7....suces_120 so I get the name of the column by the other loop the values depend on the other column
exemple:
SK_1 Sk_2 Sk_5 .... SK_120 Succes_1 Succes_2 ... Succes_120
1 0 1 0 1 0 0
1 1 0 1 2 1 1
for i in range(len(data_jeux)):
for d in range (len(succ_len)):
ids = succ_len[d]
if data_jeux['SK_%s' % ids][i] == 1:
data_jeux.iloc[i]['Succes_%s' % ids]= 1+i
I ask if there is a way for executing this problem with the faster way I try :
data_jeux.values[i, ('Succes_%s' % ids)] = 1+i
but it returns me the following error maybe it doesn't accept string index
You can define columns and then use loc to increment. It's not clear whether your columns are naturally ordered; if they aren't you can use sorted with a custom function. String-based sorting will cause '20' to come before '100'.
def splitter(x):
return int(x.rsplit('_', maxsplit=1)[-1])
cols = df.columns
sk_cols = sorted(cols[cols.str.startswith('SK')], key=splitter)
succ_cols = sorted(cols[cols.str.startswith('Succes')], key=splitter)
df.loc[df[sk_cols] == 1, succ_cols] += 1