pandas - If partial string match exists, put value in new column - python

I've got a tricky problem in pandas to solve. I was previously referred to this thread as a solution but it is not what I am looking for.
Take this example dataframe with two columns:
df = pd.DataFrame([['Mexico', 'Chile'], ['Nicaragua', 'Nica'], ['Colombia', 'Mex']], columns = ["col1", "col2"])
I first want to check each row in column 2 to see if that value exists in column 1. This is checking full and partial strings.
df['compare'] = df['col2'].apply(lambda x: 'Yes' if df['col1'].str.contains(x).any() else 'No')
I can check to see that I have a match of a partial or full string, which is good but not quite what I need. Here is what the dataframe looks like now:
        col1   col2 compare
0     Mexico  Chile      No
1  Nicaragua   Nica     Yes
2   Colombia    Mex     Yes
What I really want is the value from column 1 that the value in column 2 matched with, but I have not been able to figure out how to associate them.
My desired result looks like this:
        col1   col2    compare
0     Mexico  Chile       None
1  Nicaragua   Nica  Nicaragua
2   Colombia    Mex     Mexico

Here's a "pandas-less" way to do it. Probably not very efficient but it gets the job done:
def compare_cols(match_col, partial_col):
    series = []
    for partial_str in partial_col:
        for match_str in match_col:
            if partial_str in match_str:
                series.append(match_str)
                break  # matches to the first value found in match_col
        else:  # for loop did not break = no match found
            series.append(None)
    return series
df = pd.DataFrame([['Mexico', 'Chile'], ['Nicaragua', 'Nica'], ['Colombia', 'Mex']], columns = ["col1", "col2"])
df['compare'] = compare_cols(match_col=df.col1, partial_col=df.col2)
Note that if a string in col2 matches to more than one string in col1, the first occurrence is used.
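A more pandas-flavoured sketch of the same idea, keeping the first-match semantics (still a nested scan underneath, and assuming plain substring matching rather than regex):
# For each value in col2, return the first col1 value that contains it,
# or None when nothing matches
df['compare'] = df['col2'].apply(
    lambda x: next((m for m in df['col1'] if x in m), None)
)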

Related

How to clean dataframe column filled with names using Python?

I have the following dataframe:
df = pd.DataFrame( columns = ['Name'])
df['Name'] = ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']
I want to clean the column in order to achieve the following:
df['Name Corrected'] = ['adam','adam','adam','adam','adam','beth','beth','beth','beth','beth']
df
Cleaned names are based on the following reference table:
ref = pd.DataFrame( columns = ['Cleaned Names'])
ref['Cleaned Names'] = ['adam','beth']
I am aware of fuzzy matching but I'm not sure if that's the most efficient way of solving the problem.
You can try:
lst=['adam','beth']
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
#but in certain conditions ffill() can give you wrong values
Explanation:
lst=['adam','beth']
#created a list of words
out=pd.concat([df['Name'].str.contains(x,case=False).map({True:x}) for x in lst],axis=1)
#Check whether the 'Name' column contains each word from the list, one word at a time. Each check gives a boolean Series, and map() turns True into the word itself and False into NaN. Concatenating the resulting Series along axis=1 turns them into a DataFrame.
df['Name corrected']=out.bfill(axis=1).iloc[:,0]
#Backward-filling values along axis=1 and taking the first column
#Finally:
df['Name corrected']=df['Name corrected'].ffill()
#Forward filling the missing values
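Plain substring matching misses entries like 'beht' (a transposition), which is exactly the case where the ffill() above papers over a missing value. As an alternative, here is a minimal fuzzy-matching sketch using only the standard library's difflib; the 0.6 cutoff is an assumption you may need to tune:
import difflib
import pandas as pd

df = pd.DataFrame({'Name': ['Aadam','adam','AdAm','adammm','Adam.','Bethh','beth.','beht','Beeth','Beth']})
ref = ['adam', 'beth']

def closest(name, choices=ref, cutoff=0.6):
    # get_close_matches returns the best matches above the cutoff, or [] if none qualify
    matches = difflib.get_close_matches(name.lower(), choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None

df['Name Corrected'] = df['Name'].map(closest)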

Checking string for words in list stored in Pandas Dataframe

I have a pandas dataframe containing a list of strings in a column called contains_and. Now I want to select the rows from that dataframe whose words in contains_and are all contained in a given string, e.g.
example: str = "I'm really satisfied with the quality and the price of product X"
df: pd.DataFrame = pd.DataFrame({"columnA": [1,2], "contains_and": [["price","quality"],["delivery","speed"]]})
resulting in a dataframe like this:
columnA contains_and
0 1 [price, quality]
1 2 [delivery, speed]
Now, I would like to select only the first row (columnA == 1), as example contains all the words in its contains_and list.
My initial instinct was to do the following:
df.loc[
    all([word in example for word in df["contains_and"]])
]
However, doing that results in the following error:
TypeError: 'in <string>' requires string as left operand, not list
I'm not quite sure how to best do this, but it seems like something that shouldn't be all too difficult. Does someone know of a good way to do this?
One way:
df = df[df.contains_and.apply(lambda x: all(i in example for i in x))]
OUTPUT:
columnA contains_and
0 1 [price, quality]
another way is exploding the list of candidate words and checking (per row) if they are all among the words of example, which are found with str.split:
# a Series of words
ex = pd.Series(example.split())
# boolean array reduced with `all`
to_keep = df["contains_and"].explode().isin(ex).groupby(level=0).all()
# keep only "True" rows
new_df = df[to_keep]
to get
>>> new_df
columnA contains_and
0 1 [price, quality]
Based on #Nk03 answer, you could also try:
df = df[df.contains_and.apply(lambda x: all(q in example for q in x))]
In my opinion it is more intuitive to check whether the words are in example, rather than the opposite, as your first attempt shows.
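If whole-word matching is acceptable, the word set of example can also be built once and reused across all rows; a small sketch:
# Build the set of words once, then keep rows whose word list it fully covers
example_words = set(example.split())
new_df = df[df['contains_and'].map(example_words.issuperset)]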

Pandas - contains from other DF

I have 2 dataframes:
DF A:
and DF B:
I need to check every row in the DFA['item'] if it contains some of the values in the DFB['original'] and if it does, then add new column in DFA['my'] that would correspond to the value in DFB['my'].
So here is the result I need:
I thought of converting DFB['original'] into a list and then using regex, but that way I won't get the matching result from column 'my'.
Ok, maybe not the best solution, but it seems to be working.
I did a cartesian join and then checked which records contain the data needed:
dfa['join'] = 1
dfb['join'] = 1
dfFull = dfa.merge(dfb, on='join').drop('join' , axis=1)
dfFull['match'] = dfFull.apply(lambda x: x['original'] in x['item'], axis=1)
dfFull[dfFull['match']]
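Note that bracket access (x['item']) matters in the lambda above: attribute access x.item would resolve to the built-in Series.item method rather than the 'item' column. On pandas 1.2+, the dummy join column can also be dropped in favour of a cross merge. A minimal sketch with made-up data, since the original DF A / DF B tables did not survive:
import pandas as pd

# Hypothetical stand-ins for DF A and DF B
dfa = pd.DataFrame({'item': ['red apple pie', 'green pear', 'banana bread']})
dfb = pd.DataFrame({'original': ['apple', 'banana'], 'my': ['fruit1', 'fruit2']})

# cross join every row of dfa against every row of dfb, then keep substring hits
dfFull = dfa.merge(dfb, how='cross')
matches = dfFull[dfFull.apply(lambda x: x['original'] in x['item'], axis=1)]

# bring the matched 'my' value back onto DF A
result = dfa.merge(matches[['item', 'my']], on='item', how='left')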

Python: Write a nested loop to test whether a series of string values is present in the column of a dataframe

I have two dataframes, df1 and df2. df1 has a column called 'comments' that contains a string, and df2 has a column called 'labels' that contains smaller strings. I am trying to write a function that searches df1['comments'] for the strings contained in df2['labels'], and creates a new column df1['match'] that is True if df1['comments'] contains any of the strings in df2['labels'] and False otherwise.
I'm trying to use df.str.contains('word', na=False) to solve this problem, and I have managed to create the column df1['match'] for one specific string with the following line:
df1['match'] = df1['comment'].str.contains('mystring', na=False)
However, I struggle to write a function that iterates over all the words in df2['labels'] and creates df1['match'] with True if any of the words in df2['labels'] are present and False otherwise.
This is my attempt at writing the loop:
for comment in df1['comment']:
    for word in df2['labels']:
        if df1['comment'].str.contains(word, na=False) == True:
            df1['match'] = True
            # (would need something to continue to the next comment if there is a match)
        else:
            df1['match'] = False  # (put value as False if none of the items in df2['labels'] is contained in df1['comment'])
Any help would be greatly appreciated.
You can do a multiple substring search through a regex search using pipe. See this post
df1['match'] = df1['comment'].str.contains('|'.join(df2['labels'].values), na=False)
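One caveat worth adding: if any label contains regex metacharacters, str.contains will interpret them as a pattern, so escaping keeps the match literal (a sketch):
import re

pattern = '|'.join(map(re.escape, df2['labels']))
df1['match'] = df1['comment'].str.contains(pattern, na=False)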
Try this if it helps:
df2['match'] = "False"
for idx, word in enumerate(df2['labels']):
    q = df1['comment'][idx:].str.contains(word)
    df2.loc[idx, 'match'] = q[idx]
I don't know how much it will help, but a more efficient way to compare is shown below. If you want df1['match'] filled row by row the code will need some changes, but I think this captures what you wanted.
test1 = df2['labels'].to_list()
test2 = df1['comments'].to_list()
flag = 0
if set(test1).issubset(set(test2)):
    flag = 1
if flag:
    df1['match'] = True
else:
    df1['match'] = False
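For the row-by-row variant hinted at above, a small sketch (assuming substring matching is what is actually wanted, since issubset only catches exact equality between whole strings):
labels = df2['labels'].to_list()
# True where at least one label appears as a substring of the comment
df1['match'] = df1['comments'].apply(lambda c: any(w in c for w in labels))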
Here is the complete code; let me know if this is what you are asking for:
import pandas as pd
d = {'comment': ["abcd efgh ijk", "lmno pqrst uvwxyz", "123456789 4567895062"]}
df1 = pd.DataFrame(data=d)
print(df1)
d = {'labels': ["efgh", "pqrst", "12389"]}
df2 = pd.DataFrame(data=d)
print(df2)
df2['match'] = "False"
for idx, word in enumerate(df2['labels']):
    q = df1['comment'][idx:].str.contains(word)
    df2.loc[idx, 'match'] = q[idx]
print("final df2")
print(df2)

Iterating over multiIndex dataframe

I have a data frame as shown below
I have a problem iterating over the rows: for every row fetched, I want to return the key value. For example, in the second row, the 2016-08-31 00:00:01 entry has a compass value of 4.0 for both df1 and df3, so I want to return the keys that share the same compass value, which is df1 & df3 in this case.
I have been iterating over the rows using
for index,row in df.iterrows():
Update
Okay, now that I understand your question better, this will work for you.
First change the shape of your dataframe with
dfs = df.stack().swaplevel(axis=0)
This will make your dataframe look like:
Then you can iterate the rows like before and extract the information you want. I'm just using print statements for everything, but you can put this in some more appropriate data structure.
for index, row in dfs.iterrows():
    dup_filter = row.duplicated(keep=False)
    dfss = row[dup_filter].index.values
    print("Attribute:", index[0])
    print("Index:", index[1])
    print("Matches:", dfss, "\n")
which will print out something like
.....
Attribute: compass
Index: 5
Matches: ['df1' 'df3']
Attribute: gyro
Index: 5
Matches: ['df1' 'df3']
Attribute: accel
Index: 6
Matches: ['df1' 'df3']
....
You could also do it one attribute at a time by
dfs_compass = df.stack().swaplevel(axis=0).loc['compass']
and iterate through the rows with just the index.
Old
If I understand your question correctly, i.e. you want to return the indexes of rows which have matching values on the second level of your columns ('compass', 'accel', 'gyro'), the following will work.
compass_match_indexes = []
for index, row in df.iterrows():
    match_filter = row[:, 'compass'].duplicated()
    if len(row[:, 'compass'][match_filter]) > 0:
        compass_match_indexes.append(index)
You can then select from your dataframe with that list, like df.loc[compass_match_indexes].
--
Another approach: you could take the transpose of your DataFrame with df.T and then use the duplicated function.
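A minimal sketch of that transpose idea (hypothetical, since the frame's exact layout is not shown): after df.T the keys sit on the index, so duplicated(keep=False) flags every key whose full row of values matches another key's.
dfT = df.T
# keys whose entire series of values duplicates another key's
matching_keys = dfT.index[dfT.duplicated(keep=False)]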
