I have two dataframes,
df1,
Names
one two three
Sri is a good player
Ravi is a mentor
Kumar is a cricketer
df2,
values
sri
NaN
sri, is
kumar,cricketer
I am trying to get, for each entry in df2, the row in df1 that contains all of its items.
My expected output is,
values Names
sri Sri is a good player
NaN
sri, is Sri is a good player
kumar,cricketer Kumar is a cricketer
I tried df1["Names"].str.contains("|".join(df2["values"].values.tolist())), but I cannot achieve my expected output because the values contain commas (","). Please help.
Using sets
import numpy as np
import pandas as pd

# turn each name into a set of lowercase words
s1 = df1.Names.dropna()
s1.loc[:] = [set(x.lower().split()) for x in s1.values.tolist()]
a1 = s1.values

# turn each value into a set of lowercase, comma-separated tokens
s2 = df2['values'].dropna()
s2.loc[:] = [set(x.replace(' ', '').lower().split(',')) for x in s2.values.tolist()]
a2 = s2.values

# broadcasted superset test: for each value set, find the first name set that contains it;
# the extra column of True acts as a fallback index meaning "no match"
i = np.column_stack([a1 >= a2[:, None], [True] * len(a2)]).argmax(1)

df2.assign(Names=pd.Series(
    np.append(df1.Names.values, np.nan)[i], s2.index
))
values Names
0 sri Sri is a good player
1 NaN NaN
2 sri, is Sri is a good player
3 kumar,cricketer Kumar is a cricketer
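For reference, the same superset check can be written with plain pandas, which may be easier to read. A sketch of my own, assuming df1 and df2 as above (the match helper name is hypothetical):
name_sets = df1['Names'].str.lower().str.split().apply(set)

def match(v):
    # NaN entries in df2 stay NaN
    if pd.isna(v):
        return np.nan
    wanted = set(v.replace(' ', '').lower().split(','))
    hits = df1['Names'][name_sets.apply(lambda s: s >= wanted)]
    return hits.iloc[0] if len(hits) else np.nan

df2['Names'] = df2['values'].map(match)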
import pandas as pd
names = [
'one two three',
'Sri is a good player',
'Ravi is a mentor',
'Kumar is a cricketer'
]
values = [
'sri',
'NaN',
'sri, is',
'kumar,cricketer',
]
names = pd.Series(names)
values = pd.DataFrame(values, columns=['values'])
def foo(words):
    names_copy = names.copy()
    for word in words.split(','):
        names_copy = names_copy[names_copy.str.contains(word, case=False)]
    return names_copy.values
values['names'] = values['values'].map(foo)
values
values names
0 sri [Sri is a good player]
1 NaN []
2 sri, is [Sri is a good player]
3 kumar,cricketer [Kumar is a cricketer]
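One caveat: str.contains treats the pattern as a regular expression by default, so if the search terms could contain characters such as '.' or '+', escaping them first avoids surprises. A small variation on foo (my own addition, assuming the same names series):
import re

def foo_safe(words):
    names_copy = names.copy()
    for word in words.split(','):
        # re.escape makes each term match literally instead of as a regex
        names_copy = names_copy[names_copy.str.contains(re.escape(word), case=False)]
    return names_copy.values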
What I have
I have a column 'Student' with students name and their personalities.
I have a list named 'qualities' that consists of the qualities required for filtering purposes.
What I want
I want a column next to the 'Student' column that returns the matching string from the list.
What I have
import pandas as pd
Personality = {'Student':["Aysha is clever", "Ben is stronger", "Cathy is clever and strong", "Dany is intelligent", "Ella is naughty", "Fred is quieter"]}
index_labels=['1','2','3','4','5','6']
df = pd.DataFrame(Personality,index=index_labels)
qualities = ['calm', 'clever', 'quiet', 'bold', 'strong', 'cute']
What I want
Output
Use str.findall and then split by ','.
df['ex'] = df['Student'].str.findall('|'.join(qualities)).apply(set).str.join(', ')
new = df["ex"].str.split(pat=",", expand=True)[1]
df = pd.concat([df, new], axis=1)
df = df.fillna('')
print(df)
Gives:
Student ex 1
1 Aysha is clever clever
2 Ben is stronger strong
3 Cathy is clever and strong clever, strong strong
4 Dany is intelligent
5 Ella is naughty
6 Fred is quieter quiet
You can use str.extractall and unstack, then join to the original DataFrame:
import re
pattern = '|'.join(map(re.escape, qualities))
out = df.join(df['Student'].str.extractall(f'({pattern})')[0].unstack())
Output:
Student 0 1
1 Aysha is clever clever NaN
2 Ben is stronger strong NaN
3 Cathy is clever and strong clever strong
4 Dany is intelligent NaN NaN
5 Ella is naughty NaN NaN
6 Fred is quieter quiet NaN
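If you prefer a single comma-separated column rather than one column per match, the extractall result can be aggregated before joining. A sketch along the same lines (the out2 and matched names are my own):
out2 = df.join(
    df['Student'].str.extractall(f'({pattern})')[0]
                 .groupby(level=0)      # group the matches back by the original row
                 .agg(', '.join)
                 .rename('matched')
)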
I have a big series of strings that I want to split into a dataframe.
The series looks like this:
s = pd.Series({"1":"Name=Marc-Age=48-Car=Tesla",
"2":"Name=Ben-Job=Pilot-Car=Porsche",
"3":"Name=Tom-Age=24-Car=Ford"})
I want to split this into a dataframe looking like this:
Name Age Job Car
1 Marc 48 NaN Tesla
2 Ben NaN Pilot Porsche
3 Tom 24 NaN Ford
I tried to split the strings first by "-" and then by "=", but I don't understand how to continue after that.
df = s.str.split("-", expand=True)
for col in df.columns:
    df[col] = df[col].str.split("=")
I get this:
  0                 1                 2
1 ['Name', 'Marc']  ['Age', '48']     ['Car', 'Tesla']
2 ['Name', 'Ben']   ['Job', 'Pilot']  ['Car', 'Porsche']
3 ['Name', 'Tom']   ['Age', '24']     ['Car', 'Ford']
I don't know how to continue from here, and I can't loop through the rows because my dataset is really big. Can anyone help with the next step?
If you split, then explode and split again, you can then use a pivot.
import pandas as pd
s = pd.Series({"1":"Name=Marc-Age=48-Car=Tesla",
"2":"Name=Ben-Job=Pilot-Car=Porsche",
"3":"Name=Tom-Age=24-Car=Ford"})
s = s.str.split('-').explode().str.split('=', expand=True).reset_index()
s = s.pivot(index='index', columns=0, values=1).reset_index(drop=True)
Output
Age Car Job Name
0 48 Tesla NaN Marc
1 NaN Porsche Pilot Ben
2 24 Ford NaN Tom
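For comparison, a plain-Python variant that builds one dict per row and lets the DataFrame constructor align the columns. A sketch of my own, not part of the answer above (raw and df_alt are hypothetical names):
import pandas as pd

# the original series again (the answer above reassigned s to the pivoted frame)
raw = pd.Series({"1": "Name=Marc-Age=48-Car=Tesla",
                 "2": "Name=Ben-Job=Pilot-Car=Porsche",
                 "3": "Name=Tom-Age=24-Car=Ford"})

# one dict per row; missing keys become NaN when the frame is built
rows = [dict(item.split('=') for item in row.split('-')) for row in raw]
df_alt = pd.DataFrame(rows, index=raw.index)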
I have a dataset with several columns.
Now what I want is to calculate a similarity score based on a particular column ("name"), but grouped on the "id" column.
_id fName lName age
0 ABCD Andrew Schulz
1 ABCD Andreww 23
2 DEFG John boy
3 DEFG Johnn boy 14
4 CDGH Bob TANNA 13
5 ABCD. Peter Parker 45
6 DEFGH Clark Kent 25
So what I am looking for is whether, for the same id, I am getting similar entries, so that I can remove those entries based on a threshold score. For example, if I run it for the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
5 ABCD Peter Parker 45
6 DEFG Clark Kent 25
I intend to use pyjarowinkler.
If I had two independent columns to check (without all the group-by stuff), this is how I would use it:
from pyjarowinkler import distance

df['score'] = [distance.get_jaro_distance(x, y) for x, y in zip(df['name_1'], df['name_2'])]
df = df[df['score'] > 0.87]
Can someone suggest a pythonic and fast way of doing this?
UPDATE
So, I have tried using the recordlinkage library for this, and I have ended up with a dataframe containing pairs of indexes that are similar, called 'matches'. Now I just want to combine the data.
# Indexation step
indexer = recordlinkage.Index()
indexer.block(left_on='_id')
candidate_links = indexer.index(df)
# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.string('fName', 'fName', method='jarowinkler', threshold=threshold, label='full_name')
features = compare_cl.compute(candidate_links, df)
# Classification step
matches = features[features.sum(axis=1) >= 1]
print(len(matches))
This is how matches looks:
index1 index2 fName
0 1 1.0
2 3 1.0
I need someone to suggest a way to combine the similar rows so that the data from all of them is kept.
Just wanted to clear up some doubts regarding your question; I couldn't ask them in the comments due to low reputation.
For example, if I run it for the column "fName", I should be able to reduce this dataframe, based on a score threshold, to:
So basically your function would return the DataFrame containing the first row in each group (by ID)? That would result in the DataFrame listed below.
_id fName lName age
0 ABCD Andrew Schulz 23
2 DEFG John boy 14
4 CDGH Bob TANNA 13
I hope this code answers your question:
r0 =['ABCD','Andrew','Schulz', '' ]
r1 =['ABCD','Andrew', '' , '23' ]
r2 =['DEFG','John' ,'boy' , '' ]
r3 =['DEFG','John' ,'boy' , '14' ]
r4 =['CDGH','Bob' ,'TANNA' , '13' ]
Rx =[r0,r1,r2,r3,r4]
print(Rx)
print()
Dict = dict()
for i in Rx:
    if i[0] in Dict:
        # keep the id's first row; overwrite its lName / age with any non-empty values from later rows
        if i[2] != '':
            Dict[i[0]][2] = i[2]
        if i[3] != '':
            Dict[i[0]][3] = i[3]
    else:
        Dict[i[0]] = i

Rx[:] = Dict.values()
print(Rx)
I am lost with the 'score' part of your question, but if what you need is to fill the gaps in data with values from other rows and then drop the duplicates by id, maybe this can help:
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
First make sure that empty values are replaced with nulls. Then use fillna to 'back fill' the data, and drop duplicates keeping the first occurrence of each Id. fillna will take values from the next non-null value found in the column, which may correspond to another Id, but since you will discard the duplicated rows, I believe drop_duplicates keeping the first occurrence will do the job. (This assumes that at least one value is provided in every column for every Id.)
I've tested with this dataset and code:
import numpy as np
import pandas as pd

data = [
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', 'Schulz'],
['AABBCC', 'Andrew', '', 23],
['AABBCC', 'Andrew', '',],
['AABBCC', 'Andrew', '',],
['DDEEFF', 'Karl', 'boy'],
['DDEEFF', 'Karl', ''],
['DDEEFF', 'Karl', '', 14],
['GGHHHH', 'John', 'TANNA', 13],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', ''],
['HLHLHL', 'Bob', 'Blob'],
['HLHLHL', 'Bob', 'Blob', 15],
['HLHLHL', 'Bob','', 15],
['JLJLJL', 'Nick', 'Best', 20],
['JLJLJL', 'Nick', '']
]
df = pd.DataFrame(data, columns=['Id', 'fName', 'lName', 'Age'])
df.replace('', np.nan, inplace=True)
df_filled = df.fillna(method='bfill').drop_duplicates('Id', keep='first')
Output:
Id fName lName Age
0 AABBCC Andrew Schulz 23.0
5 DDEEFF Karl boy 14.0
8 GGHHHH John TANNA 13.0
9 HLHLHL Bob Blob 15.0
14 JLJLJL Nick Best 20.0
Hope this helps and apologies if I misunderstood the question.
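If you want to be strict about never borrowing a value from a different Id (the caveat mentioned above), one option is to fill within each group before dropping duplicates. A sketch of my own, assuming the same df and that '' has already been replaced with NaN:
cols = ['fName', 'lName', 'Age']
df_grouped = df.copy()
# back/forward fill within each Id group only, then keep one row per Id
df_grouped[cols] = df_grouped.groupby('Id')[cols].transform(lambda s: s.bfill().ffill())
df_grouped = df_grouped.drop_duplicates('Id', keep='first')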
I have two dataframes, A and B. I want to iterate through certain columns of df B, check whether the values in each of its rows exist in one of the columns of A, and if so fill B's null values with values from A's other columns.
df A:
country region product
USA NY apple
USA NY orange
UK LON banana
UK LON chocolate
CANADA TOR syrup
CANADA TOR fish
df B:
country ID product1 product2 product3 product4 region
USA 123 other stuff other stuff apple NA NA
USA 456 orange other stuff other stuff NA NA
UK 234 banana other stuff other stuff NA NA
UK 766 other stuff other stuff chocolate NA NA
CANADA 877 other stuff other stuff syrup NA NA
CANADA 109 NA fish NA other stuff NA
So I want to iterate through dfB and, for example, see if dfA.product (apple) appears in any of the columns dfB.product1-product4. If it does, as in the first row of dfB, then I want to copy the value from dfA.region into dfB's region column, which is currently NA.
Here is the code I have; I am not sure if it is right:
import pandas as pd
from tqdm import tqdm

def fill_null_value(dfA, dfB):
    for i, row in tqdm(dfA.iterrows()):
        for index, row in tqdm(dfB.iterrows()):
            if dfB['product1'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            elif dfB['product2'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            elif dfB['product3'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            elif dfB['product4'][index] == dfA['product'][i]:
                dfB['region'] = dfA['region'][i]
            else:
                dfB['region'] = "not found"
    print('outputting data')
    return dfB.to_excel('test.xlsx')
If I were you, I would create some joins, then concat them and drop the duplicates:
df_1 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product1'], how='right')
df_2 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product2'], how='right')
df_3 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product3'], how='right')
df_4 = df_A.merge(df_B, left_on=['country', 'product'], right_on=['country', 'product4'], how='right')
df = pd.concat([df_1, df_2, df_3, df_4]).drop_duplicates()
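One thing to watch: both frames have a region column, so each merge will produce suffixed columns (region_x from df_A, region_y from df_B, assuming pandas' default suffixes). A small follow-up of my own to coalesce them could look like this:
# prefer df_B's existing region, fall back to the region looked up from df_A
df['region'] = df['region_y'].fillna(df['region_x'])
df = df.drop(columns=['region_x', 'region_y'])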
The main issue here seems to be finding a single column for products in your second data set that you can do your join on. It's not clear how exactly you are deciding what values in the various product columns in df_b are meant to be used as keys to lookup vs. the ones that are ignored.
Assuming, though, that your df_a contains an exhaustive list of product values and each of those values only ever occurs in a row once, you could do something like this (simplifying your example):
import pandas as pd
df_a = pd.DataFrame({'Region':['USA', 'Canada'], 'Product': ['apple', 'banana']})
df_b = pd.DataFrame({'product1': ['apple', 'xyz'], 'product2': ['xyz', 'banana']})
product_cols = ['product1', 'product2']
df_b['Product'] = df_b[product_cols].apply(lambda x: x[x.isin(df_a.Product)].iloc[0], axis=1)
df_b = df_b.merge(df_a, on='Product')
The big thing here is generating a column that you can join on for your lookup.
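If some rows in df_b contain none of the known products, the one-liner above will raise an IndexError, so a slightly more defensive variant (my own sketch, not the original answer; first_known_product is a hypothetical helper) returns NaN and uses a left merge:
import numpy as np

def first_known_product(row):
    # keep only the cells that appear in df_a['Product']; NaN if none do
    hits = row[row.isin(df_a['Product'])]
    return hits.iloc[0] if len(hits) else np.nan

df_b['Product'] = df_b[product_cols].apply(first_known_product, axis=1)
df_b = df_b.merge(df_a, on='Product', how='left')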
I have a simple dataframe (df1) where I am replacing values with the replace function (see below). Instead of always having to change the names of the items I want to replace in the code, I would like this to be driven by an excel sheet, where either the columns or the rows give the different names that should be replaced. I would import the excel file as a dataframe (df2). All I am missing is the script that would turn the info from df2 into the replace function.
df1 = pd.DataFrame({'Product':['Tart', 'Cookie', 'Black'],
'Quantity': [1234, 4, 333]})
print(df1)
Product Quantity
0 Tart 1234
1 Cookie 4
2 Black 333
This is what I have used so far:
sales = sales.replace(["Tart", "Tart2", "Cookie", "Cookie2"], "Tartlet")
sales = sales.replace(["Ham and cheese Sandwich", "Chicken focaccia"], "Sandwich")
After replacement:
print(df1)
Product Quantity
0 Tartlet 1234
1 Tartlet 4
2 Black 333
This is what my dataframe 2 could look like (I am flexible about how to design it) after importing it from an excel file:
df2 = pd.read_excel(setup_folder / "Product Replacements.xlsx", index_col=0)
print (df2)
Tartlet Sandwich
0 Tart Ham and cheese Sandwich
1 Tart2 Chicken Focaccia
2 Cookie2 nan
Use:
df2 = pd.DataFrame({'Tartlet':['Tart', 'Tart2', 'Cookie'],
'Sandwich': ['Ham and Cheese Sandwich', 'Chicken Focaccia', 'another']})
#swap key values in dict
#http://stackoverflow.com/a/31674731/2901002
d1 = {k: oldk for oldk, oldv in df2.items() for k in oldv}
print (d1)
{'Tart': 'Tartlet', 'Tart2': 'Tartlet', 'Cookie': 'Tartlet', 'Ham and Cheese Sandwich':
'Sandwich', 'Chicken Focaccia': 'Sandwich', 'another': 'Sandwich'}
df1['Product'] = df1['Product'].replace(d1)
#for improve performance
#df1['Product'] = df1['Product'].map(d1).fillna(df1['Product'])
print (df1)
Product Quantity
0 Tartlet 1234
1 Tartlet 4
2 Black 333
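One small caveat: if df2 is read from Excel and its columns have different lengths, it will contain NaN cells, and the comprehension above would then add NaN as a key. Dropping those cells first is a minor variation of the same idea:
# drop NaN cells before building the mapping
d1 = {k: col for col, vals in df2.items() for k in vals.dropna()}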