How to define a function based on values of multiple columns - python

I have a datafile as follows:

Activity   Hazard    Condition   consequence
welding    fire      chamber     high
painting   falling   none        medium
I need to create a fifth column based on the values in the Activity, Hazard, Condition, and consequence columns. The conditions are as follows:
if the Activity column includes "working" or "performing", return 'none'
if the Condition column includes 'none', return 'none'
else return datafile.Hazard.map(str) + " is " + datafile.Condition.map(str) + " impact " + datafile.consequence.map(str)
I wrote the following code using regular expressions and dictionaries, but it didn't produce an answer. I would really appreciate it if someone could give an answer.
Dt1 = ['working', 'performing duties']
Dt2 = ["none"]
Dt1_regex = "|".join(Dt1)
Dt2_regex = "|".join(Dt2)

def clean(x):
    if datafile.Activity.str.contains(Dt1_regex, regex=True) | datafile.Condition.str.contains(Dt2_regex, regex=True):
        return 'none'
    else:
        return datafile.Hazard.map(str) + " is " + datafile.Condition.map(str) + " impact " + datafile.consequence.map(str)

datafile['combined'] = datafile.apply(clean)
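For what it's worth, apply is not needed here at all: the str.contains masks are already whole-column operations, so they can feed np.where directly. A minimal sketch, with sample data assumed from the table above:

```python
import numpy as np
import pandas as pd

# Sample data mirroring the question's table (an assumption).
datafile = pd.DataFrame({
    'Activity': ['welding', 'painting', 'working'],
    'Hazard': ['fire', 'falling', 'drowning'],
    'Condition': ['chamber', 'none', 'wet'],
    'consequence': ['high', 'medium', 'high'],
})

Dt1_regex = "|".join(['working', 'performing duties'])

# Boolean masks are computed once for the whole column, so no apply() is needed.
mask = (datafile.Activity.str.contains(Dt1_regex, regex=True)
        | datafile.Condition.str.contains('none'))

datafile['combined'] = np.where(
    mask,
    'none',
    datafile.Hazard + ' is ' + datafile.Condition + ' impact ' + datafile.consequence,
)
```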

You can create a list of conditions and a list of the values to assign when each condition is true. You can then create the new column as shown in the example below:
Conditions:
When Activity == 'working' OR Activity == 'performing' - set to none
When Condition == 'none' - set to none
Otherwise set to:
df.Hazard + ' is ' + df.Condition + ' impact ' + df.consequence
Code:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Activity': ['welding', 'painting', 'working'],
                   'Hazard': ['fire', 'falling', 'drowning'],
                   'Condition': ['chamber', 'none', 'wet'],
                   'consequence': ['high', 'medium', 'high']})

# Create your conditions
conditions = [
    (df['Activity'] == 'working') | (df['Activity'] == 'performing'),
    (df['Condition'] == 'none')
]

# create a list of the values we want to assign for each condition
values = ['none', 'none']

# Create the new column based on the conditions and values
df['combined'] = np.select(conditions, values, default="x")
df.loc[df['combined'] == 'x', 'combined'] = df.Hazard + ' is ' + df.Condition + " impact " + df.consequence
print(df)
Output:

   Activity    Hazard Condition consequence                     combined
0   welding      fire   chamber        high  fire is chamber impact high
1  painting   falling      none      medium                         none
2   working  drowning       wet        high                         none
I tried the following code which also gave me the correct answer.
import numpy as np

Dt1 = ["working", "performing duties"]
Dt1_regex = "|".join(Dt1)

conditions = [
    (datafile.Activity.str.contains(Dt1_regex, regex=True)),
    (datafile.Condition.str.contains('none') == True),
    (datafile.Activity.str.contains(Dt1_regex, regex=False)) | (datafile.Condition.str.contains('none') == False)
]
values = ['none', 'none', datafile.Hazard + ' is ' + datafile.Condition + " impact " + datafile.consequence]
datafile['combined'] = np.select(conditions, values)
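As a small simplification of either version, np.select's default argument also accepts an array, so the fallback string column can be passed directly instead of a placeholder or a third catch-all condition. A sketch with assumed sample data:

```python
import numpy as np
import pandas as pd

# Assumed sample data, mirroring the question's table.
df = pd.DataFrame({'Activity': ['welding', 'painting', 'working'],
                   'Hazard': ['fire', 'falling', 'drowning'],
                   'Condition': ['chamber', 'none', 'wet'],
                   'consequence': ['high', 'medium', 'high']})

conditions = [
    df['Activity'].str.contains('working|performing', regex=True),
    df['Condition'] == 'none',
]

# default= takes the already-built Series, so no 'x' placeholder
# and no explicit catch-all condition are needed
df['combined'] = np.select(
    conditions, ['none', 'none'],
    default=df.Hazard + ' is ' + df.Condition + ' impact ' + df.consequence,
)
```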

Related

create a table and different columns within that table based on input of other columns using python

I have been working on a Python program to try to create the output below in the terminal. Basically, I want the user to input the numbers for Columns A and B, and then the code outputs Columns C, D and E. In Column C, I show how I want the code to compute that particular column and so on (in other words, I don't want it to actually show "2 + 3 = 5"; I only want it to show 5. The expression is only there to show you how I calculated it). Also, at the bottom of the code, I am trying to calculate and print the averages of Columns C and D.
I have been struggling with this for quite some time. Below is an example of how I want the output to look, and also the code that I have. Sorry in advance for the code; some things I know how to do and other things I don't, which is why I am asking the Python community for help. Any help would be greatly appreciated.
def solve(columnA, columnB):
    print(" Class  ColumnA  ColumnB  ColumnC  ColumnD  ColumnE")
    i = 0
    columnC = columnB
    columnD = columnC - columnB
    columnE = []
    empty1 = []
    empty2 = []
    empty3 = []
    while(i < len(columnA)):
        columnC += columnB[i]
        print(str(i + 1) + " " + str(columnA[i]) + " " + str(columnB) +
              " " + str(columnC) + " " + str(columnD) + " " +
              str(columnE))
        empty1.append(columnC)
        empty2.append(columnD)
        empty2.append(columnE)
        i += 1
    return (empty1, empty2, empty3)

np = int(input("Enter the number of Classes: "))
column_A = []
column_B = []
for i in range(np):
    column_A.append(int(input("Enter the numbers for Column A " + str(i + 1) + ": ")))
    column_B.append(int(input("Enter the numbers for Column B " + str(i + 1) + ": ")))
lis = solve(column_A, column_B)
print("Average of Column C: " + str(sum(lis[0]) / len(lis[0])))
print("Average of Column D: " + str(sum(lis[1]) / len(lis[1])))
You could use something like this:
import itertools
import statistics

number_of_classes = int(input("Enter the number of Classes: "))

def ask_numbers(name, amount_to_ask):
    print('Please enter the numbers for Column', name)
    for how_many_asked_yet in range(amount_to_ask):
        yield (int(input(f'{how_many_asked_yet + 1}) ')))
    print()

columns_names = ('A', 'B')
columns = {name: tuple(ask_numbers(name, number_of_classes)) for name in columns_names}

def compute_c_values(source_column):
    values_accumulator = zip(source_column, itertools.accumulate(source_column))
    current_value, previous_value = next(values_accumulator)
    yield (f'{current_value}', current_value)
    for current_value, sum_value in values_accumulator:
        yield (f'{previous_value}+{current_value}={sum_value}', sum_value)
        previous_value = sum_value

columns['C'] = tuple(compute_c_values(columns['A']))
columns['D'] = tuple(c_value - b_value for (b_value, (_, c_value)) in zip(columns['B'], columns['C']))
columns['E'] = tuple(0 if d_value < 0 else 1 for d_value in columns['D'])

output_header_footer_format = "{:6} | {:10} | {:10} | {:15} | {:10} | {:10}"
print(output_header_footer_format.format('Class', 'Column A', 'Column B', 'Column C', 'Column D', 'Column E'))
print('-------|------------|------------|-----------------|------------|-----------')
output_line_format = "{0:6} | {1:10} | {2:10} | {3[0]:^15} | {4:10} | {5:^10}"
for class_nb, output_values in enumerate(zip(*columns.values())):
    print(output_line_format.format(class_nb, *output_values))
print(output_header_footer_format.format('', '', '', '', '', ''))

averages = {
    'C': statistics.mean(c_value for (_, c_value) in columns['C']),
    'D': statistics.mean(columns['D'])
}
print(output_header_footer_format.format('', '', 'Averages =', averages['C'], averages['D'], ''))
To fully understand this code you will need to read about:
Generators
Generator expressions
Dictionary comprehensions
The .format() method of string objects
Tuple unpacking
Unpacking argument lists
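Of these, itertools.accumulate may be the least familiar; a minimal standalone illustration of how it pairs with zip in compute_c_values:

```python
import itertools

column_a = [2, 3, 5]

# accumulate yields the running totals of the input: 2, 2+3, 2+3+5
running = list(itertools.accumulate(column_a))

# zip pairs each original value with its running total, which is
# the stream compute_c_values consumes in the answer above
pairs = list(zip(column_a, running))
```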

Need to Compare DataFrames in Pandas using < and > operators for specific data in a column

I am trying to compare the following dataframes:
I have a pair of Z Scores with a specific ENST number here:

Z_SCORE_Raw:
   ENST00000547849  ENST00000587894
0       -1.3099506      21.56600492

I have to compare each of these numbers to their corresponding ENST code in this DataFrame:

df_new:
   ENST00000547849High_Avg  ENST00000587894 High_Avg  ENST00000547849 Low_Avg  ENST00000587894 Low_Avg
0       0.0026421609368421       -0.0457525087368421      -0.0400150745882353      -0.0414085310714286
I am given the following formula:

if Z_Score[given ENST code] > Avg_High[ENST code]:
    return 1
elif Z_Score[given ENST code] < Avg_Low[ENST code]:
    return 0
elif Avg_High > Z_Score > Avg_Low:
    return 0.5
I currently have the following code to gather the correct ENST code and compare that ZScore to the corresponding High and Low average of each ENST Code:
for x in Z_score_raw:
    if Z_score_raw[x].any() > df_new[x + ' High_Avg'].any():
        print('1')
    elif Z_score_raw[x].any() < df_new[x + ' Low_Avg'].any():
        print('0')
    elif df_new[x + ' High_Avg'].any() > Z_score_raw[x].any() > df_new[x + ' Low_Avg']:
        print('0.5')
The expected output would be for
ENST00000547849: 0 (as -1.309 < -0.0400150745882353)
ENST00000587894: 1 (as 21.56600492 > -0.0457525087368421)
My current code gives me no results and skips by all of the checks. How can I get this to work properly?
The problem is that you are iterating correctly, but then you are comparing boolean values, as returned by .any(), using > or <.
What is True > False or True < True?
So that doesn't make sense.
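A minimal demonstration of the pitfall, with toy values:

```python
import pandas as pd

z = pd.Series([-1.31])       # toy Z score
high = pd.Series([0.0026])   # toy High_Avg

# .any() collapses each Series to a single boolean...
lhs, rhs = z.any(), high.any()   # both True, since the values are non-zero

# ...so the "comparison" is between booleans, not numbers:
bool_compare = lhs > rhs         # True > True is False

# selecting the scalar with [0] compares actual numbers:
value_compare = z[0] > high[0]   # -1.31 > 0.0026 is also False, but for the right reason
```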
If you only have one value per column, just use [0] to select the value at the index 0.
Also, make sure your column naming pattern is consistent (e.g. no stray spaces).
Your Example:
ENST00000547849High_Avg ENST00000587894 High_Avg
My Correction (no Space):
ENST00000547849High_Avg ENST00000587894High_Avg
This will provide your desired result:
import pandas as pd

d = {"ENST00000547849": [-1.3099506], "ENST00000587894": [21.56600492]}
d_2 = {"ENST00000547849High_Avg": [0.0026421609368421000], "ENST00000587894High_Avg": [-0.0457525087368421], "ENST00000547849Low_Avg": [-0.040015074588235300], "ENST00000587894Low_Avg": [-0.04140853107142860]}
Z_score_raw = pd.DataFrame(data=d)
df_new = pd.DataFrame(data=d_2)

for x in Z_score_raw:
    if Z_score_raw[x][0] > df_new[x + 'High_Avg'][0]:
        print(f"{x}: 1")
    elif Z_score_raw[x][0] < df_new[x + 'Low_Avg'][0]:
        print(f"{x}: 0")
    elif df_new[x + 'High_Avg'][0] > Z_score_raw[x][0] > df_new[x + 'Low_Avg'][0]:
        print(f"{x}: 0.5")
Output:
ENST00000547849: 0
ENST00000587894: 1
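If there were more rows per column, the same rule could also be applied without scalar indexing by using np.select; a minimal sketch reusing the answer's example data (same column names):

```python
import numpy as np
import pandas as pd

d = {"ENST00000547849": [-1.3099506], "ENST00000587894": [21.56600492]}
d_2 = {"ENST00000547849High_Avg": [0.0026421609368421], "ENST00000587894High_Avg": [-0.0457525087368421],
       "ENST00000547849Low_Avg": [-0.0400150745882353], "ENST00000587894Low_Avg": [-0.0414085310714286]}
Z_score_raw = pd.DataFrame(d)
df_new = pd.DataFrame(d_2)

scores = {}
for x in Z_score_raw:
    # vectorized over all rows of the column at once
    scores[x] = np.select(
        [Z_score_raw[x] > df_new[x + 'High_Avg'],
         Z_score_raw[x] < df_new[x + 'Low_Avg']],
        [1.0, 0.0],
        default=0.5,  # between the two averages
    )
```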

How to apply a complex lambda function in Pandas DataFrame with long list of elements per row

I have a pandas DataFrame with one long string per row in one column (see the variable 'dframe'). In a separate list I stored all the keywords, which I have to compare with every word in each string from the DataFrame. If a keyword is found, I have to store it as a success and mark which sentence it was found in. I am using a complex for-loop with a few 'if' statements, which gives me correct output but is not very efficient: it takes nearly 4 hours to run on my whole set, where I have 130 keywords and thousands of rows to iterate over.
I thought to apply some lambda function for optimization, and this is something I am struggling with. Below I present the idea of my data set and my current code.
import pandas as pd
from fuzzywuzzy import fuzz

dframe = pd.DataFrame({'Email': ['this is a first very long e-mail about fraud and money',
                                 'this is a second e-mail about money',
                                 'this would be a next message where people talk about secret information',
                                 'this is a sentence where someone misspelled word frad',
                                 'this sentence has no keyword']})
keywords = ['fraud', 'money', 'secret']
keyword_set = set(keywords)
dframe['Flag'] = False
dframe['part_word'] = 0
output = []

for k in range(0, len(keywords)):
    count_ = 0
    dframe['Flag'] = False
    for j in range(0, len(dframe['Email'])):
        row_list = []
        print(str(k) + ' / ' + str(len(keywords)) + ' || ' + str(j) + ' / ' + str(len(dframe['Email'])))
        for i in dframe['Email'][j].split():
            if dframe['part_word'][j] != 0:
                row_list = dframe['part_word'][j]
            fuz_part = fuzz.partial_ratio(keywords[k].lower(), i.lower())
            fuz_set = fuzz.token_set_ratio(keywords[k], i)
            if ((fuz_part > 90) | (fuz_set > 85)) & (len(i) > 3):
                if keywords[k] not in row_list:
                    row_list.append(keywords[k])
                    print(keywords[k] + ' found as : ' + i)
                    dframe['Flag'][j] = True
                    dframe['part_word'][j] = row_list
    count_ = dframe['Flag'].values.sum()
    if count_ > 0:
        y = keywords[k] + ' ' + str(count_)
        output.append(y)
    else:
        y = keywords[k] + ' ' + '0'
        output.append(y)
Maybe someone who has experience with lambda functions could give me a hint how I could apply one to my DataFrame to perform a similar operation?
It would require somehow applying fuzzy matching in the lambda after splitting each row's sentence into separate words, and choosing the value with the highest match score, with the condition that it should be bigger than 85 or 90. This is something I am confused about. Thanks in advance for any help.
I don't have a lambda function for you, but a function which you can apply to dframe.Email:
import pandas as pd
from fuzzywuzzy import fuzz
At first create the same example dataframe like you:
dframe = pd.DataFrame({'Email': ['this is a first very long e-mail about fraud and money',
                                 'this is a second e-mail about money',
                                 'this would be a next message where people talk about secret information',
                                 'this is a sentence where someone misspelled word frad',
                                 'this sentence has no keyword']})
keywords = ['fraud', 'money', 'secret']
This is the function to apply:
def fct(sntnc, kwds):
    mtch = []
    for kwd in kwds:
        fuz_part = [fuzz.partial_ratio(kwd.lower(), w.lower()) > 90 for w in sntnc.split()]
        fuz_set = [fuzz.token_set_ratio(kwd, w) > 85 for w in sntnc.split()]
        bL = [len(w) > 3 for w in sntnc.split()]
        mtch.append(any([(p | s) & l for p, s, l in zip(fuz_part, fuz_set, bL)]))
    return mtch
For each keyword:
it calculates fuz_part > 90 for all words in the sentence,
the same with fuz_set > 85,
and the same with word length > 3.
Finally, for each keyword it records in a list whether any word of the sentence satisfies ((fuz_part > 90) | (fuz_set > 85)) & (word length > 3).
And this is how it is applied and how the result is created:
s = dframe.Email.apply(fct, kwds=keywords)
s = s.apply(pd.Series).set_axis(keywords, axis=1, inplace=False)
dframe = pd.concat([dframe, s], axis=1)
Result:
result = dframe.drop('Email', 1)
# fraud money secret
# 0 True True False
# 1 False True False
# 2 False False True
# 3 True False False
# 4 False False False
result.sum()
# fraud 2
# money 2
# secret 1
# dtype: int64
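As a side note, if installing fuzzywuzzy is not an option, the standard library's difflib can serve as a rough substitute for per-word similarity scoring; note its ratio() is on a 0-1 scale, so the thresholds need retuning. A minimal sketch (the helper name best_ratio is made up for illustration):

```python
from difflib import SequenceMatcher

def best_ratio(keyword, sentence):
    # highest similarity between the keyword and any single word longer than 3 chars
    return max(
        (SequenceMatcher(None, keyword.lower(), w.lower()).ratio()
         for w in sentence.split() if len(w) > 3),
        default=0.0,
    )

# the misspelling 'frad' still scores high against 'fraud'
found = best_ratio('fraud', 'someone misspelled word frad') > 0.85
```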

How to optimize below code to run faster, size of my dataframe is almost 100,000 data points

def encoder(expiry_dt, expiry1, expiry2, expiry3):
    if expiry_dt == expiry1:
        return 1
    if expiry_dt == expiry2:
        return 2
    if expiry_dt == expiry3:
        return 3

FINAL['Expiry_encodings'] = FINAL.apply(lambda row: '{0}_{1}_{2}_{3}_{4}'.format(
    row['SYMBOL'], row['INSTRUMENT'], row['STRIKE_PR'], row['OPTION_TYP'],
    encoder(row['EXPIRY_DT'], row['Expiry1'], row['Expiry2'], row['Expiry3'])), axis=1)
The code runs totally fine, but it's too slow. Is there any alternative way to achieve this in less time?
Give the following a try:
FINAL['expiry_number'] = '0'
for c in '321':
    FINAL.loc[FINAL['EXPIRY_DT'] == FINAL['Expiry' + c], 'expiry_number'] = c

FINAL['Expiry_encodings'] = FINAL['SYMBOL'].astype(str) + '_' + \
    FINAL['INSTRUMENT'].astype(str) + '_' + FINAL['STRIKE_PR'].astype(str) + \
    '_' + FINAL['OPTION_TYP'].astype(str) + '_' + FINAL['expiry_number']
This avoids the three if statements, has a default value ('0') if none of the if statements evaluates to True, and avoids all the string formatting; above that, it also avoids the apply method with a lambda.
Note on the '321' order: this reflects the order in which the if-chain in the original code section is evaluated: 'Expiry3' has the lowest priority, and in the code given here it is first overridden by #2 and then by #1. The original if-chain would short-circuit at #1, giving that the highest priority. For example, if 'Expiry1' and 'Expiry3' have the same value (equal to 'EXPIRY_DT'), the assigned value is 1, not 3.
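To make the behaviour concrete, here is a toy run of the loop above on a hypothetical FINAL frame (the sample symbols and dates are invented; column names follow the question):

```python
import pandas as pd

# Invented sample data with the question's column names.
FINAL = pd.DataFrame({
    'SYMBOL': ['NIFTY', 'BANKNIFTY'],
    'INSTRUMENT': ['OPTIDX', 'OPTIDX'],
    'STRIKE_PR': [17000, 38000],
    'OPTION_TYP': ['CE', 'PE'],
    'EXPIRY_DT': ['2021-09-30', '2021-10-07'],
    'Expiry1': ['2021-09-30', '2021-09-30'],
    'Expiry2': ['2021-10-07', '2021-10-07'],
    'Expiry3': ['2021-10-28', '2021-10-28'],
})

# row 0 matches Expiry1 -> '1'; row 1 matches Expiry2 -> '2'
FINAL['expiry_number'] = '0'
for c in '321':
    FINAL.loc[FINAL['EXPIRY_DT'] == FINAL['Expiry' + c], 'expiry_number'] = c

FINAL['Expiry_encodings'] = (FINAL['SYMBOL'].astype(str) + '_'
    + FINAL['INSTRUMENT'].astype(str) + '_' + FINAL['STRIKE_PR'].astype(str)
    + '_' + FINAL['OPTION_TYP'].astype(str) + '_' + FINAL['expiry_number'])
```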

For loop pandas and numpy: Performance

I have coded the following for loop. The main idea is that in each occurrence of 'D' in the column 'A_D', it looks for all the possible cases where some specific conditions should happen. When all the conditions are verified, a value is added to a list.
a = []
for i in df.index:
    if df['A_D'][i] == 'D':
        if df['TROUND_ID'][i] == ' ':
            vb = df[(df['O_D'] == df['O_D'][i])
                    & (df['A_D'] == 'A')
                    & (df['Terminal'] == df['Terminal'][i])
                    & (df['Operator'] == df['Operator'][i])]
            number = df['number_ac'][i]
            try:  ## if all the conditions above are verified a value is added to a list
                x = df.START[i] - pd.Timedelta(int(number), unit='m')
                value = vb.loc[(vb.START - x).abs().idxmin()].FlightID
            except:  ## if they are not verified, a string is added to the list
                value = 'No_link_found'
        else:
            value = 'Has_link'
    else:
        value = 'IsArrival'
    a.append(value)
My main problem is that df has millions of rows, therefore this for loop is way too time consuming. Is there any vectorized solution where I do not need to use a for loop?
An initial set of improvements: use apply rather than a loop; create, at the start, a second dataframe containing the rows where df["A_D"] == "A"; and vectorise the value x.
arr = df[df["A_D"] == "A"]

# if the next line is slow, apply it only to those rows where x is needed
df["x"] = df.START - pd.to_timedelta(df["number_ac"].astype(int), unit='m')

def link_func(row):
    if row["A_D"] != "D":
        return "IsArrival"
    if row["TROUND_ID"] != " ":
        return "Has_link"
    vb = arr[(arr["O_D"] == row["O_D"])
             & (arr["Terminal"] == row["Terminal"])
             & (arr["Operator"] == row["Operator"])]
    try:
        return vb.loc[(vb.START - row["x"]).abs().idxmin()].FlightID
    except:
        return "No_link_found"

df["a"] = df.apply(link_func, axis=1)
Using apply is apparently more efficient but does not automatically vectorise the calculation. But finding a value in arr based on each row of df is inherently time consuming, however efficiently it is implemented. Consider whether the two parts of the original dataframe (where df["A_D"] == "A" and df["A_D"] == "D", respectively) can be reshaped into a wide format somehow.
EDIT: You might be able to speed up the querying of arr by storing query strings in df, like this:
df["query_string"] = ('O_D == "' + df["O_D"]
                      + '" & Terminal == "' + df["Terminal"]
                      + '" & Operator == "' + df["Operator"] + '"')

def link_func(row):
    vb = arr.query(row["query_string"])
    try:
        row["a"] = vb.loc[(vb.START - row["x"]).abs().idxmin()].FlightID
    except:
        row["a"] = "No_link_found"

df.query('(A_D == "D") & (TROUND_ID == " ")').apply(link_func, axis=1)
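As a further alternative: if each departure just needs the arrival with the nearest START within its (O_D, Terminal, Operator) group, that lookup is exactly what pd.merge_asof with direction='nearest' and by= does, with no per-row querying at all. A sketch on invented toy data, assuming the question's column names:

```python
import pandas as pd

# Invented toy data with the question's column names.
df = pd.DataFrame({
    'A_D':       ['A', 'A', 'D', 'D'],
    'O_D':       ['LIS', 'LIS', 'LIS', 'OPO'],
    'Terminal':  ['T1', 'T1', 'T1', 'T1'],
    'Operator':  ['TAP', 'TAP', 'TAP', 'TAP'],
    'TROUND_ID': [' ', ' ', ' ', ' '],
    'number_ac': [30, 30, 30, 45],
    'FlightID':  ['F1', 'F2', 'F3', 'F4'],
    'START': pd.to_datetime(['2019-01-01 10:00', '2019-01-01 12:00',
                             '2019-01-01 12:20', '2019-01-01 09:00']),
})

dep = df[(df['A_D'] == 'D') & (df['TROUND_ID'] == ' ')].copy()
arr = df[df['A_D'] == 'A'].copy()

# target time: departure START minus number_ac minutes
dep['x'] = dep['START'] - pd.to_timedelta(dep['number_ac'], unit='m')

# merge_asof needs both sides sorted on their keys
linked = pd.merge_asof(
    dep.sort_values('x'),
    arr[['O_D', 'Terminal', 'Operator', 'START', 'FlightID']]
       .sort_values('START')
       .rename(columns={'FlightID': 'linked_FlightID'}),
    left_on='x', right_on='START',
    by=['O_D', 'Terminal', 'Operator'],
    direction='nearest',
    suffixes=('', '_arr'),
)
# departures with no arrival in their group get NaN, i.e. "No_link_found"
```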
