How to create a join condition using a loop? - python

I am writing a generic function for comparing two dataframes that share the same key columns and the same structure, as in the code below. My first idea was to build the join condition as a string, since strings are easy to concatenate in a loop. However, it turns out the join does not accept a string condition built this way. Could somebody please guide me on this?
import pyspark.sql.functions as F

key = "col1 col2 col3"

def CompareData(df1, df2, key):
    key_list = key.split(" ")
    key_con = ""
    for col in key_list:
        # trying to generate a generic condition
        condi = "(F.col(\"" + col + "\") == F.col(\"" + "x_" + col + "\"))"
        key_con = key_con + "&" + condi
    key_condition = key_con.replace('&', '', 1)
    df1_tmp = df1.select([F.col(c).alias("x_" + c) for c in df1.columns])
    # The problem is here: key_condition raises an error. If I copy the
    # condition string below and paste it into the join directly, it works fine.
    df_compare = df2.join(df1_tmp, key_condition, "left")
    # key_condition = (F.col("col1") == F.col("x_col1")) & (F.col("col2") == F.col("x_col2")) & (F.col("col3") == F.col("x_col3"))

Try this:
key_con = F.lit(True)
for col in key_list:
    condi = (F.col(col) == F.col(f"x_{col}"))
    key_con = key_con & condi
In your attempt, the condition is a string. But join's on argument only accepts a string if it is a plain column name. What you want to pass to on is a column expression. A column expression is not the same thing as a string, so you need a slightly different way of building a composite column expression.
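For what it's worth, the loop in the answer is a fold, and the same accumulation can be written with functools.reduce; the pattern carries over to PySpark because the & operator is overloaded on Column objects. Below it is demonstrated on plain Python booleans so it runs without a Spark session (the row dictionaries are illustrative stand-ins for Column comparisons):

```python
from functools import reduce
import operator

key_list = ["col1", "col2", "col3"]

# Stand-ins for one pair of rows; in PySpark each comparison would be a
# Column expression like F.col(c) == F.col("x_" + c).
left = {"col1": 1, "col2": 2, "col3": 3}
right = {"x_col1": 1, "x_col2": 2, "x_col3": 3}
conditions = [left[c] == right["x_" + c] for c in key_list]

# reduce combines the list with "&", exactly what the loop above does
match = reduce(operator.and_, conditions)
print(match)  # True: every key column agrees
```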

Related

Apply function on multiple columns and create new column based on condition

I am trying to apply a function to multiple columns in a pandas dataframe, comparing the values of two columns to create a third new column based on the comparison. The code runs, but the output is not correct. For example, this code:
def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"

lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]
i = 0
for item in lst2:
    df[str(item) + "_2"] = df.apply(lambda x: conditions(x, column1=x[item], column2=x[lst1[i]]), axis=1)
    i = i + 1
The first row contains incorrect instances, but the output marks them as correct: col4_4_2 and col5_5_2 should be marked as incorrect. Is it not possible to apply a function this way on multiple columns, passing the column names as arguments in pandas? If so, how should it be done?
You didn't provide a df, so I used this:
df = pd.DataFrame([[0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]],
                  columns=['col1', 'col2', 'col3', 'col4', 'col5',
                           'col1_1', 'col2_2', 'col3_3', 'col4_4', 'col5_5',
                           'col1_1_2', 'col2_2_2', 'col3_3_2', 'col4_4_2', 'col5_5_2'])
Your conditions function expects a row and the names of two of its columns, but you are supplying the row and then two values. One way to solve your problem is to change your comparison function to something like this (note that it no longer actually needs the row itself):
def conditions(x, column1, column2):
    print(column1, column2)
    if column1 != column2:
        return "incorrect"
    else:
        return "correct"
Alternatively, you could change the line with the lambda in it to something like this:
df[str(item)+"_2"] = df.apply(lambda x: conditions(x, lst2[i], lst1[i]) , axis=1)
I first had to add the columns and fill them with zeros, then apply the function.
def conditions(x, column1, column2):
    if x[column1] != x[column2]:
        return "incorrect"
    else:
        return "correct"

lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]

for item in lst2:
    df[str(item) + "_2"] = 0

i = 0
for item in df.columns[-5:]:
    df[item] = df.apply(lambda x: conditions(x, column1=lst1[i], column2=lst2[i]), axis=1)
    i = i + 1
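As a side note, the row-wise apply can be avoided entirely: a vectorized comparison per column pair does the same job in one pass. A sketch on a small sample frame (column values chosen to mirror the example above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[0, 0, 0, 1, 0, 0, 0, 0, 0, 1]],
                  columns=['col1', 'col2', 'col3', 'col4', 'col5',
                           'col1_1', 'col2_2', 'col3_3', 'col4_4', 'col5_5'])
lst1 = ["col1", "col2", "col3", "col4", "col5"]
lst2 = ["col1_1", "col2_2", "col3_3", "col4_4", "col5_5"]

# compare each column pair elementwise; no Python-level row loop needed
for a, b in zip(lst1, lst2):
    df[b + "_2"] = np.where(df[a] != df[b], "incorrect", "correct")

print(df[['col4_4_2', 'col5_5_2']].iloc[0].tolist())  # ['incorrect', 'incorrect']
```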

pandas - If partial string match exists, put value in new column

I've got a tricky problem in pandas to solve. I was previously referred to this thread as a solution but it is not what I am looking for.
Take this example dataframe with two columns:
df = pd.DataFrame([['Mexico', 'Chile'], ['Nicaragua', 'Nica'], ['Colombia', 'Mex']], columns = ["col1", "col2"])
I first want to check each row in column 2 to see if that value exists in column 1. This is checking full and partial strings.
df['compare'] = df['col2'].apply(lambda x: 'Yes' if df['col1'].str.contains(x).any() else 'No')
I can check that I have a match of a partial or full string, which is good but not quite what I need. Here is what the dataframe looks like now:
        col1   col2 compare
0     Mexico  Chile      No
1  Nicaragua   Nica     Yes
2   Colombia    Mex     Yes
What I really want is the value from column 1 that the value in column 2 matched with; I have not been able to figure out how to associate them. My desired result looks like this:
        col1   col2    compare
0     Mexico  Chile       None
1  Nicaragua   Nica  Nicaragua
2   Colombia    Mex     Mexico
Here's a "pandas-less" way to do it. Probably not very efficient but it gets the job done:
def compare_cols(match_col, partial_col):
    series = []
    for partial_str in partial_col:
        for match_str in match_col:
            if partial_str in match_str:
                series.append(match_str)
                break  # matches the first value found in match_col
        else:  # for loop did not break = no match found
            series.append(None)
    return series

df = pd.DataFrame([['Mexico', 'Chile'], ['Nicaragua', 'Nica'], ['Colombia', 'Mex']], columns=["col1", "col2"])
df['compare'] = compare_cols(match_col=df.col1, partial_col=df.col2)
Note that if a string in col2 matches to more than one string in col1, the first occurrence is used.
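The same first-match lookup can also be written inline with apply and a generator expression; a sketch of that alternative:

```python
import pandas as pd

df = pd.DataFrame([['Mexico', 'Chile'], ['Nicaragua', 'Nica'], ['Colombia', 'Mex']],
                  columns=["col1", "col2"])

# for each col2 value, take the first col1 value that contains it, else None
df['compare'] = df['col2'].apply(
    lambda x: next((m for m in df['col1'] if x in m), None))

print(df['compare'].tolist())  # [None, 'Nicaragua', 'Mexico']
```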

Python: Write a nested loop to test whether a series of string values is present in the column of a dataframe

I have two dataframes, df1 and df2. df1 has a column called 'comments' that contains strings, and df2 has a column called 'labels' that contains shorter strings. I am trying to write a function that searches df1['comments'] for the strings contained in df2['labels'] and creates a new column df1['match'] that is True if df1['comments'] contains any of the strings in df2['labels'] and False otherwise.
I'm trying to use df.str.contains('word', na=False) to solve this problem and I have managed to create the column df1['match'] searching for one specific string using the following function:
df1['match'] = df1['comment'].str.contains('mystring', na=False)
However, I struggle to write a function that iterates over all the words in df2['label'] and creates a df1['match'] with True if any of the words in df2['label'] are present and False otherwise.
This is my attempt at writing the loop:
for comment in df1['comment']:
    for word in df2['label']:
        if df1['comment'].str.contains(word, na=False)=True:
            df1['match']=True
            # (would need something to continue to the next comment if there is a match)
        else:
            df1['match']=False  # (False if none of the items in df2['label'] is contained in df1['comment'])
Any help would be greatly appreciated.
You can do a multiple-substring search with a single regex by joining the search words with a pipe. See this post.
df1['match'] = df1['comment'].str.contains('|'.join(df2['label'].values), na=False)
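A runnable sketch of that approach on made-up data; re.escape is added as a precaution in case any label contains regex metacharacters:

```python
import re
import pandas as pd

df1 = pd.DataFrame({'comment': ["great product", "bad service", None]})
df2 = pd.DataFrame({'label': ["great", "awful"]})

# build one alternation pattern from all labels; escape each label so any
# regex metacharacters in it are matched literally
pattern = '|'.join(re.escape(w) for w in df2['label'])
df1['match'] = df1['comment'].str.contains(pattern, na=False)

print(df1['match'].tolist())  # [True, False, False]
```

na=False makes missing comments count as "no match" instead of propagating NaN.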
Try this, if it helps:
df2['match'] = "False"
for idx, word in enumerate(df2['labels']):
    q = df1['comment'][idx:].str.contains(word)
    df2['match'][idx] = q[idx]
I don't know how much it will help, but a more efficient way to compare is shown below. If you want df1['match'] filled in row by row, the code will need some changes, but I think this captures what you wanted.
test1 = df2['label'].to_list()
test2 = df1['comments'].to_list()
flag = 0
if set(test1).issubset(set(test2)):
    flag = 1
if flag:
    df1['match'] = True
else:
    df1['match'] = False
Here is the complete code; let me know if this is what you are asking for:
import pandas as pd

d = {'comment': ["abcd efgh ijk", "lmno pqrst uvwxyz", "123456789 4567895062"]}
df1 = pd.DataFrame(data=d)
print(df1)

d = {'labels': ["efgh", "pqrst", "12389"]}
df2 = pd.DataFrame(data=d)
print(df2)

df2['match'] = "False"
for idx, word in enumerate(df2['labels']):
    q = df1['comment'][idx:].str.contains(word)
    df2['match'][idx] = q[idx]

print("final df2")
print(df2)

A better solution to check if a dataframe value is in another dataframe and within specific date boundaries or other specifications

I want to check if each value in a column exists in another dataframe (df2) and if its date is within 3 days of the date in that second dataframe (df2), or if other conditions are met.
The code I've written works, but I want to know if there is a better or more efficient solution to this problem.
Example:
def check_answer(df):
    if df.ticket_count == 1:
        return 'Yes'
    elif (df.ticket_count > 0) and (df.occurrences == 1):
        return 'Yes'
    elif any(
        df2[df2.partnumber == df.partnumber]['ticket_date'] >= df['date']
    ) and any(
        df2[df2.partnumber == df.partnumber]['ticket_date'] <= df['date'] + pd.DateOffset(days=3)
    ):
        return 'Yes'
    else:
        return 'No'

df['result'] = df.apply(check_answer, axis=1)
You could try using a list comprehension.
Here's an example: list comprehension in pandas
And if you need to create a copy of your dataframe with new columns containing the results of your conditions, you can check this example: Pandas DataFrame Comprehensions
I hope I could help. Best regards.
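For the date-window condition specifically, a merge followed by one vectorized comparison avoids the per-row filtering inside apply. This is only a sketch on made-up data, and it deliberately leaves out the ticket_count and occurrences branches from the question:

```python
import pandas as pd

# hypothetical frames matching the column names in the question
df = pd.DataFrame({'partnumber': ['A', 'B'],
                   'date': pd.to_datetime(['2023-01-01', '2023-01-10'])})
df2 = pd.DataFrame({'partnumber': ['A', 'B'],
                    'ticket_date': pd.to_datetime(['2023-01-02', '2023-01-20'])})

# merge on partnumber, then test the 3-day window in one vectorized pass
m = df.merge(df2, on='partnumber', how='left')
in_window = (m['ticket_date'] >= m['date']) & \
            (m['ticket_date'] <= m['date'] + pd.DateOffset(days=3))

# a part matches if any of its tickets falls inside the window
hit = in_window.groupby(m['partnumber']).any()
df['result'] = df['partnumber'].map(hit).map({True: 'Yes', False: 'No'})

print(df['result'].tolist())  # ['Yes', 'No']
```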

if else condition pandas

I have a df with columns start, end, and strand:
   start     end       strand
3  90290834  90290905  +
3  90290834  90291149  +
3  90291019  90291149  +
3  90291239  90291381  +
5  33977824  33984550  -
5  33983577  33984550  -
5  33984631  33986386  -
What I am trying to do is add new columns (5ss and 3ss) based on the strand column:
f = pd.read_clipboard()
f
def addcolumns(row):
    if row['strand'] == "+":
        row["5ss"] == row["start"]
        row["3ss"] == row["end"]
    else:
        row["5ss"] == row["end"]
        row["3ss"] == row["start"]
    return row

f = f.apply(addcolumns, axis=1)
KeyError: ('5ss', u'occurred at index 0')
which part of the code is wrong? or there is an easier way to do this?
Instead of using .apply() I'd suggest using np.where():
f.loc[:, '5ss'] = np.where(f.strand == '+', f.start, f.end)
f.loc[:, '3ss'] = np.where(f.strand == '+', f.end, f.start)
np.where() creates a new array from three arguments:
- A logical condition (in this case f.strand == '+')
- A value to take where the condition is true
- A value to take where the condition is false
As for the error: apply() with axis=1 does pass each row to your function, as intended. The KeyError comes from row["5ss"] == row["start"]: the == is a comparison, not an assignment, and the row has no '5ss' key yet to compare against. Writing = instead of == would make the apply version work, but given what you're trying to do, it is simpler to use np.where(), which lets you express the conditional logic for the column assignment directly.
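A runnable sketch of the np.where() version on two of the rows above:

```python
import numpy as np
import pandas as pd

f = pd.DataFrame({'start':  [90290834, 33977824],
                  'end':    [90290905, 33984550],
                  'strand': ['+', '-']})

# '+' strand: 5ss is the start, 3ss is the end; '-' strand: the reverse
f['5ss'] = np.where(f.strand == '+', f.start, f.end)
f['3ss'] = np.where(f.strand == '+', f.end, f.start)

print(f['5ss'].tolist())  # [90290834, 33984550]
print(f['3ss'].tolist())  # [90290905, 33977824]
```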
