String matching and getting more than 1 column in Pandas - python

I need to match Name from df1 to Item_Name from df2. Wherever the name matches I also need Item_Id and Material_Name from df2.
I have two data frames:
Df1:
Original df has 1000+ Name
Id Name
1 Paper
2 Paper Bag
3 Scissors
4 Mat
5 Cat
6 Good Cat
2nd Df:
Original df has 1000+ Item_Name
Item_ID Item_Name Material_Name
1 Paper Bag Office
2 wallpaper Decor
3 paper Office
4 cat cage Animal Misc
5 good cat Animal
Expected Output:
Id Name Item_ID Material_Name
1 Paper 1,2,3 Office,Decor,Office
2 Paper Bag 1,2,3 Office,Decor,Office
3 Scissors NA NA
4 Mat NA NA
5 Cat 4,5 Animal Misc, Animal
6 Good Cat 4,5 Animal Misc,Animal
Code:
def matcher(query):
    matches = [i['Item_ID'] for i in df2[['Item_ID', 'Item_Name']].to_dict('records')
               if any(q in i['Item_Name'].lower() for q in query.lower().split())]
    if matches:
        return ','.join(map(str, matches))
    else:
        return 'NA'

df1['Item_ID'] = df1['Name'].apply(matcher)
This works properly when I only need one column, but currently I am running the code twice to get Item_ID and Material_Name.
ASK:
Is there a way to avoid running the function twice, so that I can get 2 or even 3 columns in one go?

Here's one way using pd.DataFrame.loc and reusing Boolean masks:
def matcher(x):
    # construct the two masks: full-string containment and word-level containment
    m1 = df2['Item_Name'].str.contains(x, regex=False, case=False)
    m2 = [any(w in i.lower() for w in x.lower().split()) for i in df2['Item_Name']]
    # apply the combined mask
    res_id = df2.loc[m1 | m2, 'Item_ID']
    res_mat = df2.loc[m1 | m2, 'Material_Name']
    return ','.join(res_id.astype(str)), ','.join(res_mat.astype(str))

df1[['Item_ID', 'Material_Name']] = pd.DataFrame(df1['Name'].apply(matcher).tolist())
print(df1)
Id Name Item_ID Material_Name
0 1 Paper 1,2,3 Office,Decor,Office
1 2 Paper Bag 1,2,3 Office,Decor,Office
2 3 Scissors
3 4 Mat
4 5 Cat 4,5 Animal Misc,Animal
5 6 Good Cat 4,5 Animal Misc,Animal
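For reference, the answer above can be run end to end against the sample frames from the question (the frame construction below is an assumption based on the tables shown; non-matching rows come back as empty strings rather than NA with this version):

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                    'Name': ['Paper', 'Paper Bag', 'Scissors', 'Mat', 'Cat', 'Good Cat']})
df2 = pd.DataFrame({'Item_ID': [1, 2, 3, 4, 5],
                    'Item_Name': ['Paper Bag', 'wallpaper', 'paper', 'cat cage', 'good cat'],
                    'Material_Name': ['Office', 'Decor', 'Office', 'Animal Misc', 'Animal']})

def matcher(x):
    # full-phrase containment mask plus word-level containment mask
    m1 = df2['Item_Name'].str.contains(x, regex=False, case=False)
    m2 = [any(w in name.lower() for w in x.lower().split()) for name in df2['Item_Name']]
    res_id = df2.loc[m1 | m2, 'Item_ID']
    res_mat = df2.loc[m1 | m2, 'Material_Name']
    return ','.join(res_id.astype(str)), ','.join(res_mat.astype(str))

# one pass over Name, two columns assigned at once
df1[['Item_ID', 'Material_Name']] = pd.DataFrame(df1['Name'].apply(matcher).tolist())
```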

You can try returning both Item_ID and Material_Name as a tuple from your query, then populate each column with [i[0] for i in matches] and [i[1] for i in matches].
def matcher(query):
    records = df2[['Item_ID', 'Item_Name', 'Material_Name']].to_dict('records')
    matches = [(i['Item_ID'], i['Material_Name']) for i in records
               if any(q in i['Item_Name'].lower() for q in query.lower().split())]
    if matches:
        return (','.join(map(str, [i[0] for i in matches])),
                ','.join(map(str, [i[1] for i in matches])))
    else:
        return 'NA', 'NA'

df1[['Item_ID', 'Material_Name']] = df1['Name'].apply(matcher).tolist()

Related

How to compare two DataFrames and return a matrix containing values where columns matched

I have two data frames as follows:
df1
id start end label item
0 1 0 3 food burger
1 1 4 6 drink cola
2 2 0 3 food fries
df2
id start end label item
0 1 0 3 food burger
1 1 4 6 food cola
2 2 0 3 drink fries
I would like to compare the two data frames (by checking where they match in the id, start, end columns) and create a matrix of size 2 x (number of items) for each id. The cells should contain the label corresponding to an item. In this example:
M_id1: [[food, drink], M_id2: [[food],
[food, food]] [drink]]
I tried looking at the pandas documentation but didn't really find anything that could help me.
You can merge df1 and df2 on the columns id, start, end, then group the merged dataframe on id and build key-value pairs in a dict comprehension, where the key is the id and the value is the corresponding matrix of labels:
m = df1.merge(df2, on=['id', 'start', 'end'])
dct = {f'M_id{k}': g.filter(like='label').to_numpy().T for k, g in m.groupby('id')}
To access the matrix of labels use dictionary lookup:
>>> dct['M_id1']
array([['food', 'drink'], ['food', 'food']], dtype=object)
>>> dct['M_id2']
array([['food'], ['drink']], dtype=object)

String manipulation within a column (pandas): split, replace, join

I'd like to create a new column based on the following conditions:
if the row contains dogs/dog/chien/chiens, then add -00
if the row contains cats/cat/chat/chats, then add 00-
A sample of data is as follows:
Animal
22 dogs
1 dog
1 cat
3 dogs
32 chats
and so on.
I'd like as output a column with only numbers (numerical):
Animal New column
22 dogs 22-00
1 dog 1-00
1 cat 00-1
3 dogs 3-00
32 chats 00-32
I think I should use an if condition to check the words, then .split and .join. It's a string-manipulation problem, but I'm having trouble breaking it down.
PRES = {"cats", "cat", "chat", "chats"}
POSTS = {"dogs", "dog", "chien", "chiens"}

def fun(words):
    # words comes in as e.g. "22 dogs"
    num, ani = words.split()
    if ani in PRES:
        return "00-" + num
    elif ani in POSTS:
        return num + "-00"
    else:
        # you might want to handle this..
        return "unexpected"

df["New Column"] = df["Animal"].apply(fun)
where df is your dataframe. For a fast lookup, we turn the condition lists into sets. Then we apply a function to values of the Animal column of df and act accordingly.
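With the sample column from the question, the whole thing can be run end to end like so:

```python
import pandas as pd

# lookup sets for the prefix ("00-") and postfix ("-00") animals
PRES = {"cats", "cat", "chat", "chats"}
POSTS = {"dogs", "dog", "chien", "chiens"}

def fun(words):
    num, ani = words.split()  # e.g. "22 dogs" -> "22", "dogs"
    if ani in PRES:
        return "00-" + num
    if ani in POSTS:
        return num + "-00"
    return "unexpected"

df = pd.DataFrame({"Animal": ["22 dogs", "1 dog", "1 cat", "3 dogs", "32 chats"]})
df["New Column"] = df["Animal"].apply(fun)
```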
You could do this, first extract the number, then use np.where to conditionally add characters to the string:
import numpy as np

df['New Col'] = df['Animal'].str.extract(r'([0-9]+)')
df['New Col'] = np.where(df['Animal'].str.contains('dogs|dog|chiens|chien'), df['New Col']+'-00', df['New Col'])
df['New Col'] = np.where(df['Animal'].str.contains('cats|cat|chat|chats'), '00-'+df['New Col'], df['New Col'])
print(df)
Animal New Col
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
Since your data is well-formatted, you can use a basic substitution and apply it to the row:
import pandas as pd
import re
def replacer(s):
return re.sub(r" (chiens?|dogs?)", "-00",
re.sub(r"(\d+) ch?ats?", r"00-\1", s))
df = pd.DataFrame({"Animal": ["22 dogs", "1 dog", "1 cat", "3 dogs", "32 chats"]})
df["New Column"] = df["Animal"].apply(replacer)
Output:
Animal New Column
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
Using re:
import re

list1 = ['dogs', 'dog', 'chien', 'chiens']
list2 = ['cats', 'cat', 'chat', 'chats']

def tag(val):
    num = re.search(r'(\w+)', val).group(1)        # the leading number
    word = re.search(r'([a-zA-Z]+)', val).group(1)  # the animal word
    if word in list1:
        return num + "-00"
    if word in list2:
        return "00-" + num
    return val

df['New_col'] = [tag(val) for val in df['Animal']]
print(df)
Output:
Animal New_col
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32
Create tuples of search words:
dog = ('dogs', 'dog', 'chien', 'chiens')
cat = ('cats', 'cat', 'chat', 'chats')
Create a condition for each tuple with its corresponding replacement, then apply the conditions to the column using numpy.select:
import numpy as np

num = df.Animal.str.split().str[0]  # the numbers
# conditions
cond1 = df.Animal.str.endswith(dog)
cond2 = df.Animal.str.endswith(cat)
condlist = [cond1, cond2]
# what should be returned for each successful condition
choicelist = [num + "-00", "00-" + num]
df['New Column'] = np.select(condlist, choicelist)
df
Animal New Column
0 22 dogs 22-00
1 1 dog 1-00
2 1 cat 00-1
3 3 dogs 3-00
4 32 chats 00-32

Match Strings of 2 dataframe columns in Python

I have two data frames:
Df1:
Original df has 1000+ Name
Id Name
1 Paper
2 Paper Bag
3 Scissors
4 Mat
5 Cat
6 Good Cat
2nd Df:
Original df has 1000+ Item_Name
Item_ID Item_Name
1 Paper Bag
2 wallpaper
3 paper
4 cat cage
5 good cat
Expected Output:
Id Name Item_ID
1 Paper 1,2,3
2 Paper Bag 1,2,3
3 Scissors NA
4 Mat NA
5 Cat 4,5
6 Good Cat 4,5
My Code:
def matcher(x):
    res = df2.loc[df2['Item_Name'].str.contains(x, regex=False, case=False), 'Item_ID']
    return ','.join(res.astype(str))

df1['Item_ID'] = df1['Name'].apply(matcher)
Current Challenges
str.contains works when Name has Paper and Item_Name has Paper Bag, but it doesn't work the other way around. So in my example it works for rows 1, 3, 4, 5 of df1 but not for rows 2 and 6; it will not map row 2 of df1 to row 3 of df2.
Ask
Can you help me modify the code so that it matches the other way around as well?
You can modify your custom matcher function and use apply():
def matcher(query):
    matches = [i['Item_ID'] for i in df2[['Item_ID', 'Item_Name']].to_dict('records')
               if any(q in i['Item_Name'].lower() for q in query.lower().split())]
    if matches:
        return ','.join(map(str, matches))
    else:
        return 'NA'

df1['Item_ID'] = df1['Name'].apply(matcher)
Returns:
Id Name Item_ID
0 1 Paper 1,2,3
1 2 Paper Bag 1,2,3
2 3 Scissors NA
3 4 Mat NA
4 5 Cat 4,5
5 6 Good Cat 4,5
Explanation:
We use apply() to run the custom matcher() function on each value of the df1['Name'] column. Inside matcher(), df2 is converted with to_dict('records') into a list of row dictionaries keyed by column name. For each row we check whether any() word of the current query occurs in that row's Item_Name (both lowercased via lower()); if so, the row's Item_ID is added to the list to be returned.
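As a self-contained check, the sample frames can be built from the tables above and the function applied in one go:

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                    'Name': ['Paper', 'Paper Bag', 'Scissors', 'Mat', 'Cat', 'Good Cat']})
df2 = pd.DataFrame({'Item_ID': [1, 2, 3, 4, 5],
                    'Item_Name': ['Paper Bag', 'wallpaper', 'paper', 'cat cage', 'good cat']})

def matcher(query):
    # a row of df2 matches if any word of the query occurs in its Item_Name
    records = df2[['Item_ID', 'Item_Name']].to_dict('records')
    matches = [r['Item_ID'] for r in records
               if any(q in r['Item_Name'].lower() for q in query.lower().split())]
    return ','.join(map(str, matches)) if matches else 'NA'

df1['Item_ID'] = df1['Name'].apply(matcher)
```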

Matching partial expressions between dataframes

I am attempting to perform a partial string match between columns in data frames for example:
df_A:
Items_A
purse
string
hat
glue
gum
cherry
cherry
cherry pie
and
df_B:
1 2 3
string gum cherry
glue
desired output:
df_matched:
matched Items_A
0 purse
1 string
0 hat
1 glue
2 gum
3 cherry
3 cherry
3 cherry pie
Note that numbers in the matched columns are the labels from the column that is matched, either 1, 2, or 3. If there is no match, then the label is 0.
I was able to do this with regular-expression matching and several nested loops, but I was wondering if there is a way to use the pandas library to perform the operation more efficiently.
Reshape df_B to get this :
level_0 level_1 0
0 0 1 string
1 0 2 gum
2 0 3 cherry
3 1 1 glue
rename df_B's columns
get the list of unique words in df_B
create a new column in df_A holding the matching word from df_B
merge and filter
import re

df_B = df_B.stack().reset_index()
df_B = df_B.rename(columns={"level_1": "matched", 0: "Items_A"})
items = df_B.Items_A.unique()

def partial_match(x, items):
    # return the first word from df_B that occurs in x, else 0
    for item in items:
        if re.search(re.escape(item), x):
            return item
    return 0

df_A["matching_item"] = df_A["Items_A"].apply(lambda x: partial_match(x, items))
df_A = df_A.merge(df_B, how="left", left_on="matching_item", right_on="Items_A", suffixes=('', '_y'))
df_A["matched"] = df_A["matched"].fillna(0).astype(int)
df_A = df_A.loc[:, ["matched", "Items_A"]]
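A runnable version of the same reshape-and-merge approach, using re from the standard library and building the frames from the question's tables (None marks the empty cells in df_B):

```python
import re
import pandas as pd

df_A = pd.DataFrame({'Items_A': ['purse', 'string', 'hat', 'glue',
                                 'gum', 'cherry', 'cherry', 'cherry pie']})
df_B = pd.DataFrame({1: ['string', 'glue'], 2: ['gum', None], 3: ['cherry', None]})

# long format: one row per (column label, word); dropna discards the empty cells
long_B = df_B.stack().dropna().reset_index()
long_B = long_B.rename(columns={'level_1': 'matched', 0: 'Items_A'})
items = long_B['Items_A'].unique()

def partial_match(x, items):
    # return the first df_B word that occurs in x as a substring, else 0
    for item in items:
        if re.search(re.escape(item), x):
            return item
    return 0

df_A['matching_item'] = df_A['Items_A'].apply(lambda x: partial_match(x, items))
out = df_A.merge(long_B, how='left', left_on='matching_item',
                 right_on='Items_A', suffixes=('', '_y'))
out['matched'] = out['matched'].fillna(0).astype(int)  # 0 label for no match
out = out[['matched', 'Items_A']]
```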

Split a pandas dataframe into rows based on an integer column

Not an ideal title, but I don't know how to describe it better.
I have a dataframe (df1) and want to split it on the column "chicken" so that:
each chicken that laid an egg becomes a distinct row
the chickens that didn't lay an egg are aggregated in a unique row.
The output I need is df2, example:
In farm "A", there are 5 chicken, of which 2 chicken laid an egg, so there are 2 rows with egg = "True" and weight = 1 each, and 1 row with egg = "False" and weight = 3 (the 3 chicken that didn't lay an egg).
The code I came up with is messy, can you guys think of a cleaner way of doing it? Thanks!!
# code to create df1:
df1 = pd.DataFrame({'farm': ["A", "B", "C"], "chicken": [5, 10, 5], "eggs": [2, 3, 0]})
df1 = df1[["farm", "chicken", "eggs"]]

# code to transform df1 to df2:
df2 = pd.DataFrame()
for i in df1.index:
    number_of_trues = df1.iloc[i]["eggs"]
    number_of_falses = df1.iloc[i]["chicken"] - number_of_trues
    col_farm = [df1.iloc[i]["farm"]] * (number_of_trues + 1)
    col_egg = ["True"] * number_of_trues + ["False"] * 1
    col_weight = [1] * number_of_trues + [number_of_falses]
    mini_df = pd.DataFrame({"farm": col_farm, "egg": col_egg, "weight": col_weight})
    df2 = df2.append(mini_df)
df2 = df2[["farm", "egg", "weight"]]
df2
This is a customized solution: create two sub-dataframes, then concat them back together to achieve the expected output. The key method is repeat:
s = pd.DataFrame({'farm': df1.farm.repeat(df1.eggs), 'egg': [True] * df1.eggs.sum(), 'weight': [1] * df1.eggs.sum()})
t = pd.DataFrame({'farm': df1.farm, 'egg': [False] * len(df1.farm), 'weight': df1.chicken - df1.eggs})
pd.concat([t, s]).sort_values(['farm', 'egg'], ascending=[True, False])
Out[847]:
egg farm weight
0 True A 1
0 True A 1
0 False A 3
1 True B 1
1 True B 1
1 True B 1
1 False B 7
2 False C 5
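For a version that runs on current pandas (DataFrame.append was removed in pandas 2.0), the same repeat/concat idea can be sketched as:

```python
import pandas as pd

df1 = pd.DataFrame({'farm': ['A', 'B', 'C'], 'chicken': [5, 10, 5], 'eggs': [2, 3, 0]})

# one row per laid egg, weight 1
s = pd.DataFrame({'farm': df1['farm'].repeat(df1['eggs']), 'egg': True, 'weight': 1})
# one aggregated row per farm for the chickens without an egg
t = pd.DataFrame({'farm': df1['farm'], 'egg': False,
                  'weight': df1['chicken'] - df1['eggs']})
# True rows first within each farm
df2 = (pd.concat([s, t])
         .sort_values(['farm', 'egg'], ascending=[True, False])
         .reset_index(drop=True))
```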
