Match strings of 2 dataframe columns in Python

I have two data frames:
Df1:
Original df has 1000+ Name
Id Name
1 Paper
2 Paper Bag
3 Scissors
4 Mat
5 Cat
6 Good Cat
2nd Df:
Original df has 1000+ Item_Name
Item_ID Item_Name
1 Paper Bag
2 wallpaper
3 paper
4 cat cage
5 good cat
Expected Output:
Id Name Item_ID
1 Paper 1,2,3
2 Paper Bag 1,2,3
3 Scissors NA
4 Mat NA
5 Cat 4,5
6 Good Cat 4,5
My Code:
def matcher(x):
    res = df2.loc[df2['Item_Name'].str.contains(x, regex=False, case=False), 'Item_ID']
    return ','.join(res.astype(str))
df1['Item_ID'] = df1['Name'].apply(matcher)
Current Challenges
str.contains works when Name is "Paper" and Item_Name is "Paper Bag", but not the other way around. In my example it works for rows 1, 3, 4 and 5 of df1 but not for rows 2 and 6; for instance, it does not map row 2 of df1 to row 3 of df2.
Ask
How can I modify the code so that it also matches the other way around?

You can modify your custom matcher function and use apply():
def matcher(query):
    # df2's text column is Item_Name; compare each query word against it, case-insensitively
    matches = [i['Item_ID'] for i in df2[['Item_ID','Item_Name']].to_dict('records') if any(q in i['Item_Name'].lower() for q in query.lower().split())]
    if matches:
        return ','.join(map(str, matches))
    else:
        return 'NA'
df1['Item_ID'] = df1['Name'].apply(matcher)
Returns:
Id Name Item_ID
0 1 Paper 1,2,3
1 2 Paper Bag 1,2,3
2 3 Scissors NA
3 4 Mat NA
4 5 Cat 4,5
5 6 Good Cat 4,5
Explanation:
We use apply() to run the custom matcher() function on each value of the df1['Name'] column. Inside matcher(), to_dict('records') converts df2 into a list of dictionaries, one per row, keyed by column name. For each record we check whether any() word of the current query (lowercased via lower() and split on whitespace) appears in the lowercased Item_Name from df2; if so, that record's Item_ID is added to the list that is returned. If nothing matches, the function returns 'NA'.
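For reference, here is a minimal self-contained sketch of the same word-by-word matching, using the sample data from the question. The helper lists item_names and item_ids are illustrative names that do not appear in the original answer; they only avoid rebuilding the records on every call.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Name": ["Paper", "Paper Bag", "Scissors", "Mat", "Cat", "Good Cat"],
})
df2 = pd.DataFrame({
    "Item_ID": [1, 2, 3, 4, 5],
    "Item_Name": ["Paper Bag", "wallpaper", "paper", "cat cage", "good cat"],
})

# Lowercase the item names once instead of on every call (illustrative helpers).
item_names = df2["Item_Name"].str.lower().tolist()
item_ids = df2["Item_ID"].tolist()

def matcher(query):
    """Return comma-joined Item_IDs whose name contains any word of the query."""
    words = query.lower().split()
    matches = [item_id for item_id, name in zip(item_ids, item_names)
               if any(word in name for word in words)]
    return ",".join(map(str, matches)) if matches else "NA"

df1["Item_ID"] = df1["Name"].apply(matcher)
print(df1)
```

Note that this is plain substring matching, which is why "Paper" also picks up "wallpaper", as in the expected output.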

Related

How to compare two DataFrames and return a matrix containing values where columns matched

I have two data frames as follows:
df1
id start end label item
0 1 0 3 food burger
1 1 4 6 drink cola
2 2 0 3 food fries
df2
id start end label item
0 1 0 3 food burger
1 1 4 6 food cola
2 2 0 3 drink fries
I would like to compare the two data frames (by checking where they match in the id, start, end columns) and create a matrix of size 2 x (number of items) for each id. The cells should contain the label corresponding to an item. In this example:
M_id1: [[food, drink],
        [food, food]]
M_id2: [[food],
        [drink]]
I tried looking at the pandas documentation but didn't really find anything that could help me.
You can merge df1 and df2 on the columns id, start and end, then group the merged dataframe by id. A dict comprehension then builds one key-value pair per group, where the key is derived from the id and the value is the corresponding matrix of labels:
m = df1.merge(df2, on=['id', 'start', 'end'])
dct = {f'M_id{k}': g.filter(like='label').to_numpy().T for k, g in m.groupby('id')}
To access the matrix of labels use dictionary lookup:
>>> dct['M_id1']
array([['food', 'drink'], ['food', 'food']], dtype=object)
>>> dct['M_id2']
array([['food'], ['drink']], dtype=object)
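The merge-and-group approach can be reproduced end to end with the sample frames from the question. This is a sketch that selects the suffixed columns label_x and label_y explicitly (pandas adds these suffixes to overlapping column names by default) instead of using filter(like='label').

```python
import pandas as pd

cols = ["id", "start", "end", "label", "item"]
df1 = pd.DataFrame([[1, 0, 3, "food", "burger"],
                    [1, 4, 6, "drink", "cola"],
                    [2, 0, 3, "food", "fries"]], columns=cols)
df2 = pd.DataFrame([[1, 0, 3, "food", "burger"],
                    [1, 4, 6, "food", "cola"],
                    [2, 0, 3, "drink", "fries"]], columns=cols)

# Rows are paired where id, start and end all agree.
# Overlapping columns get the suffixes _x (from df1) and _y (from df2).
m = df1.merge(df2, on=["id", "start", "end"])

# One 2 x (number of items) matrix per id: row 0 from df1, row 1 from df2.
dct = {f"M_id{k}": g[["label_x", "label_y"]].to_numpy().T
       for k, g in m.groupby("id")}

for name, matrix in dct.items():
    print(name, matrix.tolist())
```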

How do I combine data accurately using python pandas between columns with shared column values?

Python may be the best bet for this. My attempted solution is below.
sheet 1:
Names Trophies
Scott 3
Jim 3
Ron 2
Bob 1
Jack 1
sheet 2:
Names Age Hobby
Bob 1 fishing
Scott 4 math
Jim 6 chess
Ron 2 tennis
The desired result is this:
Names Trophies Age Hobby
Bob 1 1 fishing
Scott 3 4 math
Jim 3 6 chess
Ron 2 2 tennis
Jack 1 1
Basically, I want to match the names from both sheets together, and combine their data accurately.
python code:
import pandas as pd
df1 = pd.read_csv('test.csv',index_col=0, usecols=[0, 1])
print(df1.head())
df2 = pd.read_csv('test.csv',index_col=0, usecols=[0,1,2,3,4,6])
print(df2.head())
df = pd.merge(df2, df1, right_on=['Names'], left_on=['Names'], how='inner')
This gives me this error:
raise KeyError(key) KeyError: 'Names'
Data inside the csv:
A combination of the INDEX and MATCH formulas will work in MS Excel.
For example, to get the AGE values in your desired result you conceptually do this:
=INDEX(X; MATCH(Y;Z))
X stands for the range of values that you want to fetch from sheet 2, so in this case your column 'AGE' from sheet 2
Y stands for the name value in sheet 1
Z stands for the range of values in sheet 2 in which you want to look for Y. So in this case your column 'NAMES' from sheet 2
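In pandas, the same lookup can be sketched as a left merge on the shared Names column. The sketch below builds the two sheets in memory from the data in the question rather than reading the CSV, and it assumes Names is an ordinary column in both frames; the variable names sheet1, sheet2 and combined are illustrative.

```python
import pandas as pd

# The two sheets from the question, built in memory for illustration.
sheet1 = pd.DataFrame({
    "Names": ["Scott", "Jim", "Ron", "Bob", "Jack"],
    "Trophies": [3, 3, 2, 1, 1],
})
sheet2 = pd.DataFrame({
    "Names": ["Bob", "Scott", "Jim", "Ron"],
    "Age": [1, 4, 6, 2],
    "Hobby": ["fishing", "math", "chess", "tennis"],
})

# how="left" keeps every name from sheet1; names missing from sheet2
# (Jack here) get NaN in the Age and Hobby columns.
combined = sheet1.merge(sheet2, on="Names", how="left")
print(combined)
```

With how="inner", as in the question's attempt, names that appear in only one sheet would be dropped from the result.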

String Matching and get more than 1 column in Pandas

I need to match Name from df1 to Item_Name from df2. Wherever the name matches I also need Item_Id and Material_Name from df2.
I have two data frames:
Df1:
Original df has 1000+ Name
Id Name
1 Paper
2 Paper Bag
3 Scissors
4 Mat
5 Cat
6 Good Cat
2nd Df:
Original df has 1000+ Item_Name
Item_ID Item_Name Material_Name
1 Paper Bag Office
2 wallpaper Decor
3 paper Office
4 cat cage Animal Misc
5 good cat Animal
Expected Output:
Id Name Item_ID Material_Name
1 Paper 1,2,3 Office,Decor,Office
2 Paper Bag 1,2,3 Office,Decor,Office
3 Scissors NA NA
4 Mat NA NA
5 Cat 4,5 Animal Misc,Animal
6 Good Cat 4,5 Animal Misc,Animal
Code:
def matcher(query):
    # df2's text column is Item_Name; compare each query word against it, case-insensitively
    matches = [i['Item_ID'] for i in df2[['Item_ID','Item_Name']].to_dict('records') if any(q in i['Item_Name'].lower() for q in query.lower().split())]
    if matches:
        return ','.join(map(str, matches))
    else:
        return 'NA'
df1['Item_ID'] = df1['Name'].apply(matcher)
This works properly when I need a single column, so currently I run the code twice to get Item_ID and Material_Name.
ASK:
Is there a way to avoid running the function twice and get 2 or 3 columns in one go?
Here's one way using pd.DataFrame.loc and reusing Boolean masks:
def matcher(x):
    # construct 2-way mask
    m1 = df2['Item_Name'].str.contains(x, regex=False, case=False)
    # wrap the list in a Series so that | combines the two masks on df2's index
    m2 = pd.Series([any(w in i.lower() for w in x.lower().split()) for i in df2['Item_Name']], index=df2.index)
    mask = m1 | m2
    # apply 2-way mask
    res_id = df2.loc[mask, 'Item_ID']
    res_mat = df2.loc[mask, 'Material_Name']
    return ','.join(res_id.astype(str)), ','.join(res_mat.astype(str))
df1[['Item_ID', 'Material_Name']] = pd.DataFrame(df1['Name'].apply(matcher).tolist(), index=df1.index)
print(df1)
Id Name Item_ID Material_Name
0 1 Paper 1,2,3 Office,Decor,Office
1 2 Paper Bag 1,2,3 Office,Decor,Office
2 3 Scissors
3 4 Mat
4 5 Cat 4,5 Animal Misc,Animal
5 6 Good Cat 4,5 Animal Misc,Animal
You can try collecting both Item_ID and Material_Name as tuples from your query, then build each column's value with [i[0] for i in matches] or [i[1] for i in matches].
def matcher(query):
    words = query.lower().split()
    # one (Item_ID, Material_Name) tuple per df2 row whose Item_Name contains any query word
    matches = [(i['Item_ID'], i['Material_Name']) for i in df2[['Item_ID','Item_Name','Material_Name']].to_dict('records') if any(q in i['Item_Name'].lower() for q in words)]
    if matches:
        return ','.join(map(str, [i[0] for i in matches])), ','.join(map(str, [i[1] for i in matches]))
    else:
        return 'NA', 'NA'
# matcher returns a pair per row; expand the pairs into the two new columns
df1[['Item_ID', 'Material_Name']] = pd.DataFrame(df1['Name'].apply(matcher).tolist(), index=df1.index)
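As a variation on the ideas above, here is a minimal sketch that returns a pd.Series from the matcher, so that a single apply() call yields one output column per requested df2 column. The columns parameter and the names_lower and result variables are illustrative and are not part of either answer; the sample data comes from the question.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6],
    "Name": ["Paper", "Paper Bag", "Scissors", "Mat", "Cat", "Good Cat"],
})
df2 = pd.DataFrame({
    "Item_ID": [1, 2, 3, 4, 5],
    "Item_Name": ["Paper Bag", "wallpaper", "paper", "cat cage", "good cat"],
    "Material_Name": ["Office", "Decor", "Office", "Animal Misc", "Animal"],
})

# Lowercase once, outside the function.
names_lower = df2["Item_Name"].str.lower()

def matcher(query, columns=("Item_ID", "Material_Name")):
    """Return one comma-joined string per requested df2 column."""
    words = query.lower().split()
    # True for every df2 row whose name contains any word of the query.
    mask = names_lower.apply(lambda name: any(w in name for w in words))
    hits = df2.loc[mask, list(columns)]
    if hits.empty:
        return pd.Series({c: "NA" for c in columns})
    return pd.Series({c: ",".join(hits[c].astype(str)) for c in columns})

# apply() with a Series-returning function produces a DataFrame,
# which is joined back onto df1 by index.
result = df1["Name"].apply(matcher)
df1 = df1.join(result)
print(df1)
```

Adding a third column would then only require extending the columns tuple.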

Matching partial expressions between dataframes

I am attempting to perform a partial string match between columns in data frames for example:
df_A:
Items_A
purse
string
hat
glue
gum
cherry
cherry
cherry pie
and
df_B:
1 2 3
string gum cherry
glue
desired output:
df_matched:
matched Items_A
0 purse
1 string
0 hat
1 glue
2 gum
3 cherry
3 cherry
3 cherry pie
Note that numbers in the matched columns are the labels from the column that is matched, either 1, 2, or 3. If there is no match, then the label is 0.
I was able to use Regular expression matching with several nested loops but was wondering if there was a way to use the panda's libraries to perform the operation more efficiently.
Reshape df_B to get this:
level_0 level_1 0
0 0 1 string
1 0 2 gum
2 0 3 cherry
3 1 1 glue
Rename the df_B columns.
Get the list of unique words in df_B.
Create a new column in df_A that holds the matching word from df_B.
Merge and filter.
import re
df_B = df_B.stack().reset_index()
df_B = df_B.rename(columns={"level_1": "matched", 0: "Items_A"})
items = df_B.Items_A.unique()
def partial_match(x, items):
    # return the first df_B word found anywhere inside x, or 0 if there is none
    for item in items:
        if re.search(re.escape(item), x):
            return item
    return 0
df_A["matching_item"] = df_A["Items_A"].apply(lambda x: partial_match(x, items))
df_A = df_A.merge(df_B, how="left", left_on="matching_item", right_on="Items_A", suffixes=('', '_y'))
# rows without a match come back as NaN from the left merge; label them 0
df_A["matched"] = df_A["matched"].fillna(0).astype(int)
df_A = df_A.loc[:,["matched", "Items_A"]]
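A self-contained sketch of the same idea follows, using melt() and a word-to-label dictionary instead of stack() and a merge. It assumes df_B has the integer column labels 1, 2, 3 and NaN in its empty cells, which the question does not state explicitly; long_B, label_by_word and label_for are illustrative names.

```python
import re

import numpy as np
import pandas as pd

df_A = pd.DataFrame({"Items_A": ["purse", "string", "hat", "glue", "gum",
                                 "cherry", "cherry", "cherry pie"]})
# Assumed layout of df_B: integer column labels, NaN for empty cells.
df_B = pd.DataFrame({1: ["string", "glue"],
                     2: ["gum", np.nan],
                     3: ["cherry", np.nan]})

# Long format: one row per non-missing cell, with its column label.
long_B = df_B.melt(var_name="matched", value_name="word").dropna()
label_by_word = dict(zip(long_B["word"], long_B["matched"]))

def label_for(text):
    """Return the label of the first df_B word found inside text, else 0."""
    for word, label in label_by_word.items():
        if re.search(re.escape(word), text):
            return label
    return 0

df_A["matched"] = df_A["Items_A"].apply(label_for)
print(df_A[["matched", "Items_A"]])
```

The Python-level loop still runs once per row of df_A, so this mainly tidies the logic rather than vectorizing it.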

Split a pandas dataframe into rows based on an integer column

Not an ideal title, but I don't know how to describe it better.
I have a dataframe (df1) and want to split it on the column "chicken" so that:
each chicken that laid an egg becomes a distinct row
the chickens that didn't lay an egg are aggregated in a unique row.
The output I need is df2, example:
In farm "A", there are 5 chickens, of which 2 laid an egg, so there are 2 rows with egg = "True" and weight = 1 each, and 1 row with egg = "False" and weight = 3 (the 3 chickens that didn't lay an egg).
The code I came up with is messy, can you guys think of a cleaner way of doing it? Thanks!!
#code to create df1:
df1 = pd.DataFrame({'farm':["A","B","C"],"chicken":[5,10,5],"eggs":[2,3,0]})
df1=df1[["farm","chicken","eggs"]]
#code to transform df1 to df2:
df2 = pd.DataFrame()
for i in df1.index:
    number_of_trues = df1.iloc[i]["eggs"]
    number_of_falses = df1.iloc[i]["chicken"] - number_of_trues
    col_farm = [df1.iloc[i]["farm"]]*(number_of_trues+1)
    col_egg = ["True"]*number_of_trues + ["False"]*1
    col_weight = [1]*number_of_trues + [number_of_falses]
    mini_df = pd.DataFrame({"farm":col_farm,"egg":col_egg,"weight":col_weight})
    df2=df2.append(mini_df)
df2 = df2[["farm","egg","weight"]]
df2
This is a custom solution: create two separate sub-dataframes, then concat them back together to get the expected output. Key method: repeat
s=pd.DataFrame({'farm':df1.farm.repeat(df1.eggs),'egg':[True]*df1.eggs.sum(),'weight':[1]*df1.eggs.sum()})
t=pd.DataFrame({'farm':df1.farm,'egg':[False]*len(df1.farm),'weight':df1.chicken-df1.eggs})
pd.concat([t,s]).sort_values(['farm','egg'],ascending=[True,False])
Out[847]:
egg farm weight
0 True A 1
0 True A 1
0 False A 3
1 True B 1
1 True B 1
1 True B 1
1 False B 7
2 False C 5
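The repeat-and-concat approach can be written out as a self-contained sketch. It uses the df1 from the question; the names laid and not_laid, and the final reset_index, are illustrative additions rather than part of the answer.

```python
import pandas as pd

df1 = pd.DataFrame({"farm": ["A", "B", "C"],
                    "chicken": [5, 10, 5],
                    "eggs": [2, 3, 0]})

# One row per chicken that laid an egg: repeat each farm label "eggs" times.
farm_laid = df1["farm"].repeat(df1["eggs"]).reset_index(drop=True)
laid = pd.DataFrame({"farm": farm_laid, "egg": True, "weight": 1})

# One aggregated row per farm for the chickens that did not lay an egg.
not_laid = pd.DataFrame({"farm": df1["farm"],
                         "egg": False,
                         "weight": df1["chicken"] - df1["eggs"]})

# Stack both parts, listing the True rows first within each farm.
df2 = (pd.concat([laid, not_laid])
         .sort_values(["farm", "egg"], ascending=[True, False])
         .reset_index(drop=True))
print(df2)
```

Series.repeat accepts a per-element count, so a farm with 0 eggs (farm C here) contributes no True rows.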
