For loop pandas and numpy: Performance

I have coded the following for loop. The main idea is that, for each occurrence of 'D' in the column 'A_D', it looks for all the possible cases where some specific conditions hold. When all the conditions are verified, a value is added to a list.
a = []
for i in df.index:
    if df['A_D'][i] == 'D':
        if df['TROUND_ID'][i] == ' ':
            vb = df[(df['O_D'] == df['O_D'][i])
                    & (df['A_D'] == 'A')
                    & (df['Terminal'] == df['Terminal'][i])
                    & (df['Operator'] == df['Operator'][i])]
            number = df['number_ac'][i]
            try:  ## if all the conditions above are verified, a value is added to the list
                x = df.START[i] - pd.Timedelta(int(number), unit='m')
                value = vb.loc[(vb.START - x).abs().idxmin()].FlightID
            except:  ## if they are not verified, a placeholder string is added instead
                value = 'No_link_found'
        else:
            value = 'Has_link'
    else:
        value = 'IsArrival'
    a.append(value)
My main problem is that df has millions of rows, so this for loop is far too time-consuming. Is there any vectorized solution where I do not need to use a for loop?

An initial set of improvements: use apply rather than a loop; create, at the start, a second dataframe containing only the rows where df["A_D"] == "A"; and vectorise the computation of x.
arr = df[df["A_D"] == "A"]

# if the next line is slow, apply it only to those rows where x is needed
df["x"] = df.START - pd.to_timedelta(df["number_ac"], unit='m')

def link_func(row):
    if row["A_D"] != "D":
        return "IsArrival"
    if row["TROUND_ID"] != " ":
        return "Has_link"
    vb = arr[(arr["O_D"] == row["O_D"])
             & (arr["Terminal"] == row["Terminal"])
             & (arr["Operator"] == row["Operator"])]
    try:
        return vb.loc[(vb.START - row["x"]).abs().idxmin()].FlightID
    except ValueError:  # no matching arrival found
        return "No_link_found"

df["a"] = df.apply(link_func, axis=1)
Using apply is somewhat more efficient than a raw loop, but it does not vectorise the calculation. And looking up a value in arr for each row of df is inherently time-consuming, however efficiently it is implemented. Consider whether the two parts of the original dataframe (where df["A_D"] == "A" and df["A_D"] == "D", respectively) can be reshaped into a wide format or joined directly, as in the sketch below.
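For instance, if START is a datetime64 column, pd.merge_asof with direction="nearest" can replace the per-row nearest-START search entirely. A minimal sketch, assuming number_ac is numeric and FlightID exists on the arrival rows:
import pandas as pd

# departures that still need a link, and the arrivals to match against
dep = df[(df["A_D"] == "D") & (df["TROUND_ID"] == " ")].copy()
arr = df[df["A_D"] == "A"].copy()

# vectorised target time (the x of the loop version)
dep["target"] = dep["START"] - pd.to_timedelta(dep["number_ac"], unit="m")

# merge_asof requires both sides to be sorted on the join key
dep = dep.reset_index().sort_values("target")  # keep original row labels in "index"
arr = arr.sort_values("START")

matched = pd.merge_asof(
    dep,
    arr[["O_D", "Terminal", "Operator", "START", "FlightID"]],
    left_on="target", right_on="START",
    by=["O_D", "Terminal", "Operator"],   # match only within the same group
    direction="nearest",
    suffixes=("", "_arr"),
)

df["a"] = "IsArrival"
df.loc[df["A_D"] == "D", "a"] = "Has_link"
df.loc[matched["index"].to_numpy(), "a"] = matched["FlightID_arr"].fillna("No_link_found").to_numpy()
This finds the single nearest arrival per departure in one sorted pass instead of filtering arr once per row.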
EDIT: You might be able to speed up the querying of arr by storing query strings in df, like this:
df["query_string"] = ('O_D == "' + df["O_D"]
+ '" & Terminal == "' + df["Terminal"]
+ '" & Operator == "' + df["Operator"] + '"')
def link_func(row):
vb = arr.query(row["query_string"])
try:
row["a"] = vb.loc[(vb.START - row["x"]).abs().idxmin()].FlightID
except:
row["a"] = "No_link_found"
df.query('(A_D == "D") & (TROUND_ID == " ")').apply(link_func, axis=1)


How to define a function based on values of multiple columns

I have a datafile as follows:
Activity    Hazard    Condition   consequence
welding     fire      chamber     high
painting    falling   none        medium
I need to create a fifth column based on the values in the Activity, Hazard, Condition, and consequence columns. The conditions are as follows:
if the Activity column includes "working" or "performing", return 'none'
if the Condition column includes 'none', return 'none'
else return datafile.Hazard.map(str) + " is " + datafile.Condition.map(str) + " impact " + datafile.consequence.map(str)
I wrote the following code using regular expressions and dictionaries, but it did not produce the right answer. I would really appreciate it if someone could help.
Dt1 = ['working', 'performing duties']
Dt2 = ["none"]
Dt1_regex = "|".join(Dt1)
Dt2_regex = "|".join(Dt2)
def clean(x):
    if datafile.Activity.str.contains(Dt1_regex, regex=True) | datafile.Condition.str.contains(Dt2_regex, regex=True):
        return 'none'
    else:
        return datafile.Hazard.map(str) + " is " + datafile.Condition.map(str) + " impact " + datafile.consequence.map(str)
datafile['combined'] = datafile.apply(clean)
You can create a bunch of conditions, and a list of the associated values when the conditions are true. You can then create the new column as shown in the example below:
Conditions:
When Activity == 'working' OR Activity == 'performing', set to 'none'
When Condition == 'none', set to 'none'
Otherwise set to:
df.Hazard + ' is ' + df.Condition + ' impact ' + df.consequence
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Activity': ['welding', 'painting', 'working'],
                   'Hazard': ['fire', 'falling', 'drowning'],
                   'Condition': ['chamber', 'none', 'wet'],
                   'consequence': ['high', 'medium', 'high']})

# Create your conditions
conditions = [
    (df['Activity'] == 'working') | (df['Activity'] == 'performing'),
    (df['Condition'] == 'none')
]

# create a list of the values we want to assign for each condition
values = ['none', 'none']

# Create the new column based on the conditions and values
df['combined'] = np.select(conditions, values, default='x')
df.loc[df['combined'] == 'x', 'combined'] = df.Hazard + ' is ' + df.Condition + ' impact ' + df.consequence
print(df)
Output:
   Activity    Hazard Condition consequence                     combined
0   welding      fire   chamber        high  fire is chamber impact high
1  painting   falling      none      medium                         none
2   working  drowning       wet        high                         none
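As a side note, np.select also accepts an array-like default rather than only a scalar, so the 'x' placeholder and the follow-up .loc assignment can be folded into a single step; a minimal sketch:
# pass the combined-string Series directly as the default value
df['combined'] = np.select(
    conditions,
    values,
    default=df.Hazard + ' is ' + df.Condition + ' impact ' + df.consequence
)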
I tried the following code, which also gave me the correct answer.
Dt1 = ["working", "performing duties"]
Dt1_regex = "|".join(Dt1)
conditions = [
(datafile.Activity.str.contains(Dt1_regex,regex=True)),
(datafile.Condition.str.contains('none')==True),
(datafile.Activity.str.contains(Dt1_regex,regex=False))|(datafile.Condition.str.contains('none')==False)
]
values = ['none', 'none', datafile.Hazard + ' is ' + datafile.Condition + " impact " + datafile.consequence]
datafile['combined'] = np.select(conditions, values)

Accessing pyomo variables with two indices

I have started using pyomo to solve optimization problems, and I have a bit of an issue with accessing variables that use two indices. I can easily print the solution, but I want to store the index-dependent variable values in a pd.DataFrame to analyze the results further. I have written the following code, but it takes forever to store the variables. Is there a faster way?
df_results = pd.DataFrame()
df_variables = pd.DataFrame()
results.write()
instance.solutions.load_from(results)
for v in instance.component_objects(Var, active=True):
    print("Variable", v)
    varobject = getattr(instance, str(v))
    frequency = np.empty([len(price_dict)])
    for index in varobject:
        exists = False
        two = False
        if index is not None:
            if type(index) is int:
                # For time index t (0:8760 hours of year)
                exists = True  # does an index exist
                frequency[index] = float(varobject[index].value)
            else:
                # For components (names)
                if type(index) is str:
                    print(index)
                    print(varobject[index].value)
                else:
                    # for all indices with two components
                    two = True  # is an index with two components
                    if index[1] in df_variables.columns:
                        df_variables[index[0], str(index[1]) + '_' + str(v)] = varobject[index].value
                    else:
                        df_variables[index[1]] = np.nan
                        df_variables[index[0], str(index[1]) + '_' + str(v)] = varobject[index].value
        else:
            # If no index exists, simply print the variable value
            print(varobject.value)
    if not exists:
        if not two:
            df_variable = pd.Series(frequency, name=str(v))
            df_results = pd.concat([df_results, df_variable], axis=1)
            df_variable.drop(df_variable.index, inplace=True)
        else:
            df_results = pd.concat([df_results, df_variable], axis=1)
            df_variable.drop(df_variable.index, inplace=True)
With some more work and fewer DataFrame operations, I solved the issue with the following code. Thanks to BlackBear for the comment.
df_results = pd.DataFrame()
df_variables = pd.DataFrame()
results.write()
instance.solutions.load_from(results)
for v in instance.component_objects(Var, active=True):
    print("Variable", v)
    varobject = getattr(instance, str(v))
    frequency = np.empty([20, len(price_dict)])
    exist = False
    two = False
    list_index = []
    dict_position = {}
    count = 0
    for index in varobject:
        if index is not None:
            if type(index) is int:
                # For time index t (0:8760 hours of year)
                exist = True  # does an index exist
                frequency[0, index] = float(varobject[index].value)
            else:
                # For components (names)
                if type(index) is str:
                    print(index)
                    print(varobject[index].value)
                else:
                    # for all indices with two components
                    exist = True
                    two = True  # is an index with two components
                    if index[1] in list_index:
                        position = dict_position[index[1]]
                        frequency[position, index[0]] = varobject[index].value
                    else:
                        dict_position[index[1]] = count
                        list_index.append(index[1])
                        print(list_index)
                        frequency[count, index[0]] = varobject[index].value
                        count += 1
        else:
            # If no index exists, simply print the variable value
            print(varobject.value)
    if exist:
        if not two:
            frequency = np.transpose(frequency)
            df_variable = pd.Series(frequency[:, 0], name=str(v))
            df_results = pd.concat([df_results, df_variable], axis=1)
            df_variable.drop(df_variable.index, inplace=True)
        else:
            for i in range(count):
                df_variable = pd.Series(frequency[i, :], name=str(v) + '_' + list_index[i])
                df_results = pd.concat([df_results, df_variable], axis=1)
                df_variable.drop(df_variable.index, inplace=True)
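For reference, a more compact pattern is possible if your pyomo version provides Var.extract_values(); this is a sketch under that assumption, not a drop-in replacement for the code above. pd.Series turns the index-to-value dict into a series, and for variables with two indices the resulting MultiIndex can be unstacked into a wide frame:
import pandas as pd
from pyomo.environ import Var

frames = []
for v in instance.component_objects(Var, active=True):
    s = pd.Series(v.extract_values(), name=v.name)  # {index: value} -> Series
    if isinstance(s.index, pd.MultiIndex):
        # two indices (time, component): pivot the component level into columns
        wide = s.unstack(level=1)
        wide.columns = [str(v) + '_' + str(c) for c in wide.columns]
        frames.append(wide)
    else:
        frames.append(s.to_frame())
df_results = pd.concat(frames, axis=1)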

Check result using the 4 basic operations with Python

I'm struggling to make a Python program that can solve riddles such as:
get 23 using [1,2,3,4] and the 4 basic operations however you'd like.
I expect the program to output something such as
# 23 reached by 4*(2*3)-1
So far I've come up with the following approach: reduce the input list by one item at a time, by checking every possible 2-combo that can be picked and every possible result you can get from it.
With [1,2,3,4] you can pick:
[1,2],[1,3],[1,4],[2,3],[2,4],[3,4]
With x and y you can get to:
(x+y),(x-y),(y-x),(x*y),(x/y),(y/x)
Then I'd store the operation computed so far in a variable, and run the 'reducing' function again on every result it returns, until the arrays are just 2 items long: then I can just run the x,y -> possible-outcomes function.
My problem is that this "recursive" approach isn't working at all, because my function ends as soon as I return an array.
If I input [1,2,3,4] I'd get
[(1+2),3,4] -> [3,3,4]
[(3+3),4] -> [6,4]
# [10,2,-2,24,1.5,0.6666666666666666]
My code so far:
from collections import Counter

def genOutputs(x, y, op=None):
    results = []
    if op == None:
        op = str(y)
    else:
        op = "(" + str(op) + ")"
    ops = ['+', '-', '*', '/', 'rev/', 'rev-']
    z = 0
    # will do every operation to x and y now.
    # op stores the last computed bit (of other functions)
    while z < len(ops):
        if z == 4:
            try:
                results.append(eval(str(y) + "/" + str(x)))
                #yield eval(str(y) + "/" + str(x)), op + "/" + str(x)
            except:
                continue
        elif z == 5:
            results.append(eval(str(y) + "-" + str(x)))
            #yield eval(str(y) + "-" + str(x)), op + "-" + str(x)
        else:
            try:
                results.append(eval(str(x) + ops[z] + str(y)))
                #yield eval(str(x) + ops[z] + str(y)), str(x) + ops[z] + op
            except:
                continue
        z = z + 1
    return results
def pickTwo(array):
    # returns an array with every 2-combo
    # from the input array
    vomit = []
    a, b = 0, 1
    while a < (len(array) - 1):
        choice = [array[a], array[b]]
        vomit.append((choice, list((Counter(array) - Counter(choice)).elements())))
        if b < (len(array) - 1):
            b = b + 1
        else:
            b = a + 2
            a = a + 1
    return vomit
def reduceArray(array):
    if len(array) == 2:
        print("final", array)
        return genOutputs(array[0], array[1])
    else:
        choices = pickTwo(array)
        print(choices)
        for choice in choices:
            opsofchoices = genOutputs(choice[0][0], choice[0][1])
            for each in opsofchoices:
                newarray = list([each] + choice[1])
                print(newarray)
                return reduceArray(newarray)

reduceArray([1, 2, 3, 4])
The largest issue when dealing with problems like this is handling operator precedence and parenthesis placement so as to produce every possible number from a given set. The easiest way to do this is to handle operations on a stack corresponding to the reverse Polish notation of the infix expression. Once you do this, you can draw numbers and/or operations recursively until all n numbers and n-1 operations have been exhausted, and store the result.

The code below generates all possible permutations of numbers (without replacement), operators (with replacement), and parenthesis placements to produce every possible value. Note that this is highly inefficient, since operators such as addition and multiplication commute, so a + b equals b + a and only one of the two is necessary. Similarly, by the associative property a + (b + c) equals (a + b) + c. The algorithm below is meant as a simple example, and as such does not make such optimizations.
def expr_perm(values, operations="+-*/", stack=[]):
    solution = []
    if len(stack) > 1:
        for op in operations:
            new_stack = list(stack)
            new_stack.append("(" + new_stack.pop() + op + new_stack.pop() + ")")
            solution += expr_perm(values, operations, new_stack)
    if values:
        for i, val in enumerate(values):
            new_values = values[:i] + values[i+1:]
            solution += expr_perm(new_values, operations, stack + [str(val)])
    elif len(stack) == 1:
        return stack
    return solution
Usage:
result = expr_perm([4,5,6])
print("\n".join(result))

How to optimize the below code to run faster; my dataframe has almost 100,000 data points

def encoder(expiry_dt, expiry1, expiry2, expiry3):
    if expiry_dt == expiry1:
        return 1
    if expiry_dt == expiry2:
        return 2
    if expiry_dt == expiry3:
        return 3

FINAL['Expiry_encodings'] = FINAL.apply(
    lambda row: '{0}_{1}_{2}_{3}_{4}'.format(
        row['SYMBOL'], row['INSTRUMENT'], row['STRIKE_PR'], row['OPTION_TYP'],
        encoder(row['EXPIRY_DT'], row['Expiry1'], row['Expiry2'], row['Expiry3'])),
    axis=1)
The code runs fine, but it is too slow. Is there any alternative that achieves this in less time?
Give the following a try:
FINAL['expiry_number'] = '0'
for c in '321':
    FINAL.loc[FINAL['EXPIRY_DT'] == FINAL['Expiry' + c], 'expiry_number'] = c

FINAL['Expiry_encodings'] = FINAL['SYMBOL'].astype(str) + '_' + \
    FINAL['INSTRUMENT'].astype(str) + '_' + FINAL['STRIKE_PR'].astype(str) + \
    '_' + FINAL['OPTION_TYP'].astype(str) + '_' + FINAL['expiry_number']
This avoids the three if statements, has a default value ('0') for when none of the comparisons is true, and avoids all the string formatting; on top of that, it also avoids the apply method with a lambda.
A note on the '321' order: it reflects the order in which the if-chain in the original code is evaluated. 'Expiry3' has the lowest priority; in the code given here it is first overridden by #2 and then by #1, whereas the original if-chain would short-circuit at #1, giving that the highest priority. For example, if 'Expiry1' and 'Expiry3' have the same value (equal to 'EXPIRY_DT'), the assigned value is 1, not 3.
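The same first-match priority can also be expressed with np.select, which evaluates conditions in order; a minimal equivalent sketch:
import numpy as np

# conditions listed in priority order: Expiry1 beats Expiry2 beats Expiry3
conditions = [FINAL['EXPIRY_DT'] == FINAL['Expiry' + c] for c in '123']
FINAL['expiry_number'] = np.select(conditions, list('123'), default='0')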

Speeding up iteration over dataframe

I have a data frame that basically consists of three columns: group, timestamp, value.
I created the following for loop that will iterate through the data frame and run tests to see if the values are acceptable or not. For example, if not enough time has passed between the timestamps to account for the value, then it is tagged as potentially bad data.
The only caveat here is that values should not always be compared to the previous value, but rather to the last 'good' value within the group. That is why I went with the loop.
I'm wondering if there is a better way to do this without the loop, or are there inefficiencies in the loop that would help speed it up?
dfy = pd.DataFrame(index=dfx.index, columns=['gvalue', 'quality'])

prevgroup = goodtimestamp = goodvalue = None  # track the last 'good' row
for row in dfx.itertuples():
    thisgroup = row[1]
    thistimestamp = row[2]
    thisvalue = row[3]
    qualitytag = ''
    qualitytest = True
    if prevgroup == thisgroup:
        ts_gap = thistimestamp - goodtimestamp
        hour_gap = (thisvalue - goodvalue) * 3600
        if hour_gap < 0:
            qualitytag = 'H'
            qualitytest = False
        elif hour_gap > ts_gap:
            qualitytag = 'A'
            qualitytest = False
        elif hour_gap >= 86400:
            qualitytag = 'U'
            qualitytest = False
    # if tests pass, update good values
    if qualitytest:
        goodvalue = thisvalue
        goodtimestamp = thistimestamp
    # save good values to the y dataframe
    dfy.iat[row[0], 0] = goodvalue
    dfy.iat[row[0], 1] = qualitytag
    prevgroup = thisgroup
df = dfx.join(dfy)
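Because each row's test depends on the last good value within its group, the logic is inherently sequential and hard to vectorise outright. Most of the loop's cost, however, is per-cell DataFrame access (itertuples plus .iat). A sketch of the same logic on plain NumPy arrays, assuming the three columns are named group, timestamp, and value, typically gives a large constant-factor speedup:
import numpy as np
import pandas as pd

groups = dfx['group'].to_numpy()
timestamps = dfx['timestamp'].to_numpy()
values = dfx['value'].to_numpy()

gvalues = np.empty(len(dfx), dtype=values.dtype)
tags = np.empty(len(dfx), dtype=object)

prevgroup = goodtimestamp = goodvalue = None
for i in range(len(dfx)):
    tag, ok = '', True
    if groups[i] == prevgroup:
        ts_gap = timestamps[i] - goodtimestamp
        hour_gap = (values[i] - goodvalue) * 3600
        if hour_gap < 0:
            tag, ok = 'H', False
        elif hour_gap > ts_gap:
            tag, ok = 'A', False
        elif hour_gap >= 86400:
            tag, ok = 'U', False
    if ok:
        goodvalue, goodtimestamp = values[i], timestamps[i]
    gvalues[i], tags[i] = goodvalue, tag
    prevgroup = groups[i]

df = dfx.join(pd.DataFrame({'gvalue': gvalues, 'quality': tags}, index=dfx.index))
If that is still too slow, the loop body is a good candidate for numba, with the string tags encoded as small integers.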
