Matching partial expressions between dataframes - python

I am attempting to perform a partial string match between columns of two data frames. For example:
df_A:

Items_A
purse
string
hat
glue
gum
cherry
cherry
cherry pie

and df_B:

1       2    3
string  gum  cherry
glue
desired output, df_matched:

matched  Items_A
0        purse
1        string
0        hat
1        glue
2        gum
3        cherry
3        cherry
3        cherry pie
Note that the numbers in the matched column are the labels of the df_B column that matched: 1, 2, or 3. If there is no match, the label is 0.
I was able to do this with regular-expression matching and several nested loops, but I was wondering whether the pandas library can perform the operation more efficiently.

Reshape df_B with stack to get this:

   level_0  level_1       0
0        0        1  string
1        0        2     gum
2        0        3  cherry
3        1        1    glue

then rename the df_B columns, get the list of unique words in df_B, create a new column in df_A holding the matching word from df_B, and finally merge and filter.
import re

df_B = df_B.stack().reset_index()
df_B = df_B.rename(columns={"level_1": "matched", 0: "Items_A"})
items = df_B.Items_A.unique()

def partial_match(x, items):
    # return the first df_B word found inside x, else 0
    for item in items:
        if re.search(re.escape(item), x):
            return item
    return 0

df_A["matching_item"] = df_A["Items_A"].apply(lambda x: partial_match(x, items))
df_A = df_A.merge(df_B, how="left", left_on="matching_item", right_on="Items_A", suffixes=('', '_y'))
df_A = df_A.loc[:, ["Items_A", "matched"]]
df_A["matched"] = df_A["matched"].fillna(0)  # rows with no match get label 0
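If the Python-level loop over items becomes a bottleneck, a more vectorized sketch (my own variation, not part of the answer above) can build one alternation pattern over all df_B words and use str.extract; the sample frames here are reconstructed from the question:

```python
import re
import pandas as pd

df_A = pd.DataFrame({"Items_A": ["purse", "string", "hat", "glue",
                                 "gum", "cherry", "cherry", "cherry pie"]})
df_B = pd.DataFrame({"1": ["string", "glue"],
                     "2": ["gum", None],
                     "3": ["cherry", None]})

# Long format: one row per (label, word) pair; stack drops the NaN cells.
long_B = df_B.stack().reset_index(level=1)
long_B.columns = ["matched", "word"]

# One alternation pattern over all words; str.extract pulls out the first hit.
pattern = "(" + "|".join(map(re.escape, long_B["word"])) + ")"
hit = df_A["Items_A"].str.extract(pattern, expand=False)

# Map each extracted word back to its column label; 0 where nothing matched.
word_to_label = dict(zip(long_B["word"], long_B["matched"]))
df_matched = pd.DataFrame({
    "matched": hit.map(word_to_label).fillna(0),
    "Items_A": df_A["Items_A"],
})
```

Note the labels come back as strings here because the df_B column names are strings; if one word can be a prefix of another, sort the words longest-first before joining the pattern.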

Related

Get pandas column where two column values are equal

I want to subset a DataFrame by two columns in different dataframes if the values in the columns are the same. Here is an example of df1 and df2:
df1
A
0 apple
1 pear
2 orange
3 apple
df2
B
0 apple
1 orange
2 orange
3 pear
I would like the output to be a subsetted df1 based upon the df2 column:
A
0 apple
2 orange
I tried
df1 = df1[df1.A == df2.B] but get the following error:
ValueError: Can only compare identically-labeled Series objects
I do not want to rename the column in either.
What is the best way to do this? Thanks
If you need to compare the index values together with the column values, create a MultiIndex and use Index.isin:
df = df1[df1.set_index('A', append=True).index.isin(df2.set_index('B', append=True).index)]
print (df)
A
0 apple
2 orange
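Assembled into a runnable sketch with the sample data from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['apple', 'pear', 'orange', 'apple']})
df2 = pd.DataFrame({'B': ['apple', 'orange', 'orange', 'pear']})

# Append the column values to the index so each row becomes an
# (index, value) pair, then keep the rows whose pair also exists in df2.
df = df1[df1.set_index('A', append=True).index.isin(
         df2.set_index('B', append=True).index)]
```

This keeps rows 0 (apple) and 2 (orange): pear appears in both frames but at different positions, so its (index, value) pair never matches.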

Left joining multiple datasets with same column headers

I'm not sure whether this question has already been answered; please help me solve it. I have tried my best to explain it; please refer to the images to understand my query.
I need to left merge a dataframe with 3 other dataframes. The tricky part is that all the dataframes have the same column headers, and in the output I want each later dataframe's columns to overwrite those of the preceding one. But when I use a left merge in Python, the shared columns of all the dataframes appear with the suffixes "_x" and "_y".
Below are my 4 dataframes:

df1 = pd.DataFrame({"Fruits": ['apple', 'banana', 'mango', 'strawberry'],
                    "Price": [100, 50, 60, 70],
                    "Count": [1, 2, 3, 4],
                    "shop_id": ['A', 'A', 'A', 'A']})
df2 = pd.DataFrame({"Fruits": ['apple', 'banana', 'mango', 'chicku'],
                    "Price": [10, 509, 609, 1],
                    "Count": [8, 9, 10, 11],
                    "shop_id": ['B', 'B', 'B', 'B']})
df3 = pd.DataFrame({"Fruits": ['apple', 'banana', 'chicku'],
                    "Price": [1000, 5090, 10],
                    "Count": [5, 6, 7],
                    "shop_id": ['C', 'C', 'C']})
df4 = pd.DataFrame({"Fruits": ['apple', 'strawberry', 'mango', 'chicku'],
                    "Price": [50, 51, 52, 53],
                    "Count": [11, 12, 13, 14],
                    "shop_id": ['D', 'D', 'D', 'D']})
Now I want to left join df1 , with df2 , df3 and df4.
from functools import reduce

data_frames = [df1, df2, df3, df4]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['Fruits'], how='left'),
                   data_frames)
But this produces an output like the one below: the shared columns appear in the output dataset with the suffixes _x and _y.
I want only a single Price, Count and shop_id column, like below:
It looks like what you want is combine_first, not merge:
from functools import reduce

data_frames = [df1, df2, df3, df4]
df_merged = reduce(lambda left, right: right.set_index('Fruits')
                                            .combine_first(left.set_index('Fruits'))
                                            .reset_index(),
                   data_frames)
output:
Fruits Price Count shop_id
0 apple 50 11 D
1 banana 5090 6 C
2 chicku 53 14 D
3 mango 52 13 D
4 strawberry 51 12 D
To filter the output to get only the keys from df1:
df_merged.set_index('Fruits').loc[df1['Fruits']].reset_index()
output:
Fruits Price Count shop_id
0 apple 50 11 D
1 banana 5090 6 C
2 mango 52 13 D
3 strawberry 51 12 D
NB. everything would actually be easier if you set Fruits as index
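Following that note, a sketch (my own variation, assuming the four frames from the question) that sets Fruits as the index once up front, so each combine_first step becomes a one-liner:

```python
from functools import reduce
import pandas as pd

df1 = pd.DataFrame({"Fruits": ['apple', 'banana', 'mango', 'strawberry'],
                    "Price": [100, 50, 60, 70],
                    "Count": [1, 2, 3, 4],
                    "shop_id": ['A', 'A', 'A', 'A']})
df2 = pd.DataFrame({"Fruits": ['apple', 'banana', 'mango', 'chicku'],
                    "Price": [10, 509, 609, 1],
                    "Count": [8, 9, 10, 11],
                    "shop_id": ['B', 'B', 'B', 'B']})
df3 = pd.DataFrame({"Fruits": ['apple', 'banana', 'chicku'],
                    "Price": [1000, 5090, 10],
                    "Count": [5, 6, 7],
                    "shop_id": ['C', 'C', 'C']})
df4 = pd.DataFrame({"Fruits": ['apple', 'strawberry', 'mango', 'chicku'],
                    "Price": [50, 51, 52, 53],
                    "Count": [11, 12, 13, 14],
                    "shop_id": ['D', 'D', 'D', 'D']})

# Index once, fold right-over-left: later frames overwrite earlier ones.
dfs = [d.set_index('Fruits') for d in (df1, df2, df3, df4)]
df_merged = reduce(lambda left, right: right.combine_first(left), dfs).reset_index()
```

Note that combine_first aligns on the union of the indexes, so the result is sorted by fruit name, and columns that pick up missing values along the way may come back as floats.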

How to compare two DataFrames and return a matrix containing values where columns matched

I have two data frames as follows:
df1
id start end label item
0 1 0 3 food burger
1 1 4 6 drink cola
2 2 0 3 food fries
df2
id start end label item
0 1 0 3 food burger
1 1 4 6 food cola
2 2 0 3 drink fries
I would like to compare the two data frames (by checking where they match in the id, start, end columns) and create a matrix of size 2 x (number of items) for each id. The cells should contain the label corresponding to an item. In this example:
M_id1: [[food, drink], [food, food]]
M_id2: [[food], [drink]]
I tried looking at the pandas documentation but didn't really find anything that could help me.
You can merge df1 and df2 on the columns id, start, end, then group the merged dataframe by id and build key-value pairs inside a dict comprehension, where the key is the id and the value is the corresponding matrix of labels:
m = df1.merge(df2, on=['id', 'start', 'end'])
dct = {f'M_id{k}': g.filter(like='label').to_numpy().T for k, g in m.groupby('id')}
To access the matrix of labels use dictionary lookup:
>>> dct['M_id1']
array([['food', 'drink'],
       ['food', 'food']], dtype=object)
>>> dct['M_id2']
array([['food'],
       ['drink']], dtype=object)
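A self-contained sketch reconstructing the sample frames from the question and applying the merge-and-groupby approach above:

```python
import pandas as pd

df1 = pd.DataFrame({'id':    [1, 1, 2],
                    'start': [0, 4, 0],
                    'end':   [3, 6, 3],
                    'label': ['food', 'drink', 'food'],
                    'item':  ['burger', 'cola', 'fries']})
df2 = pd.DataFrame({'id':    [1, 1, 2],
                    'start': [0, 4, 0],
                    'end':   [3, 6, 3],
                    'label': ['food', 'food', 'drink'],
                    'item':  ['burger', 'cola', 'fries']})

# Align the two frames on the key columns; label_x/label_y hold the
# labels from df1 and df2 for the same (id, start, end) row.
m = df1.merge(df2, on=['id', 'start', 'end'])

# One 2 x (number of items) matrix per id: filter keeps only the two
# label columns, and .T turns (items x 2) into (2 x items).
dct = {f'M_id{k}': g.filter(like='label').to_numpy().T
       for k, g in m.groupby('id')}
```

This assumes each (id, start, end) triple is unique in both frames; duplicated keys would multiply rows in the merge.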

Faster way to identify and compare rows based on matching conditions within a dataframe having millions of rows

I have a dataframe as below.
Date Fruit level_0 Num Color
0 2013-11-25 Apple DF2 22.1 Red
1 2013-11-24 Banana DF1 22.1 Yellow
2 2013-11-24 Banana DF2 122.1 Yellow
3 2013-11-23 Celery DF1 10.2 Green
4 2013-11-24 Orange DF1 8.6 Orange
5 2013-11-24 Orange DF2 8.6 Orange1
6 2013-11-25 Orange DF1 8.6 Orange
I need to find and compare rows within the dataframe and see in which columns the data mismatches. Only rows that have the same "Date" and "Fruit" values but different "level_0" values should be compared. So in this dataframe I need to compare the rows with index 1 and 2, since they have the same "Date" and "Fruit" but different "level_0" values. Since these differ in the "Num" column, we need to suffix a label (say "-NM") to the value in both rows. Rows whose "Date" & "Fruit" combination occurs only once need a label (say "-Miss") suffixed to the value in the "Fruit" column.
Example of expected output below:
1.) Is it possible to get such an output?
2.) Is there a fast way to get it, as my actual dataset contains millions of rows and 20-25 columns?
This is pretty complex, since there are a lot of different filters you want to apply. If I understand you correctly, you want:
for rows that have the same "Date" and "Fruit" values, and
of those rows, those that have different "level_0" values, and
of those rows, those that have different "Num" values, to get -NM (and, judging from your example, the same for the "Color" column);
rows that are the only occurrence of a "Date" and "Fruit" combination to get -Miss.
First, you'll need to make Num a string column, since we are adding suffixes. Then we group by Date and Fruit (1). Since you want the groups to have different level_0 values, we build a filter on that, called diff_frames (2). Then we add the suffixes using transform on both columns where a group has two unique elements (3).
df['Num'] = df['Num'].astype(str)
g = df.groupby(['Date', 'Fruit'])
diff_frames = g['level_0'].transform(lambda s: s.nunique() == 2)
# assign through .loc with the mask so rows outside diff_frames keep their values
df.loc[diff_frames, ['Num', 'Color']] = (
    df[diff_frames].groupby(['Date', 'Fruit'])[['Num', 'Color']]
    .transform(lambda s: s + '-NM' if s.nunique() == 2 else s))
Then, for the second part, we get the non-duplicated rows in Date and Fruit, and add -Miss to the Fruit column. (4)
df.loc[~df.duplicated(subset=['Date', 'Fruit'], keep=False), 'Fruit'] += '-Miss'
print(df)

         Date        Fruit level_0       Num       Color
0  2013-11-25   Apple-Miss     DF2      22.1         Red
1  2013-11-24       Banana     DF1   22.1-NM      Yellow
2  2013-11-24       Banana     DF2  122.1-NM      Yellow
3  2013-11-23  Celery-Miss     DF1      10.2       Green
4  2013-11-24       Orange     DF1       8.6   Orange-NM
5  2013-11-24       Orange     DF2       8.6  Orange1-NM
6  2013-11-25  Orange-Miss     DF1       8.6      Orange
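Put together, a runnable sketch on the sample data from the question (using .loc with the diff_frames mask so rows outside the mask keep their values):

```python
import pandas as pd

df = pd.DataFrame({
    'Date':    ['2013-11-25', '2013-11-24', '2013-11-24', '2013-11-23',
                '2013-11-24', '2013-11-24', '2013-11-25'],
    'Fruit':   ['Apple', 'Banana', 'Banana', 'Celery',
                'Orange', 'Orange', 'Orange'],
    'level_0': ['DF2', 'DF1', 'DF2', 'DF1', 'DF1', 'DF2', 'DF1'],
    'Num':     [22.1, 22.1, 122.1, 10.2, 8.6, 8.6, 8.6],
    'Color':   ['Red', 'Yellow', 'Yellow', 'Green',
                'Orange', 'Orange1', 'Orange'],
})

df['Num'] = df['Num'].astype(str)  # suffixes need string values
diff_frames = df.groupby(['Date', 'Fruit'])['level_0'].transform(
    lambda s: s.nunique() == 2)

# Suffix -NM only inside groups that span both frames and disagree.
df.loc[diff_frames, ['Num', 'Color']] = (
    df[diff_frames].groupby(['Date', 'Fruit'])[['Num', 'Color']]
    .transform(lambda s: s + '-NM' if s.nunique() == 2 else s))

# Single-occurrence (Date, Fruit) combinations get -Miss on Fruit.
df.loc[~df.duplicated(['Date', 'Fruit'], keep=False), 'Fruit'] += '-Miss'
```

For millions of rows, the lambda-based transforms are the slow part; replacing them with vectorized nunique comparisons per group would be the next optimization step.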

Match Strings of 2 dataframe columns in Python

I have two data frames:
Df1 (the original df has 1000+ names):
Id Name
1 Paper
2 Paper Bag
3 Scissors
4 Mat
5 Cat
6 Good Cat
Df2 (the original df has 1000+ item names):
Item_ID Item_Name
1 Paper Bag
2 wallpaper
3 paper
4 cat cage
5 good cat
Expected Output:
Id Name Item_ID
1 Paper 1,2,3
2 Paper Bag 1,2,3
3 Scissors NA
4 Mat NA
5 Cat 4,5
6 Good Cat 4,5
My Code:
def matcher(x):
    res = df2.loc[df2['Item_Name'].str.contains(x, regex=False, case=False), 'Item_ID']
    return ','.join(res.astype(str))

df1['Item_ID'] = df1['Name'].apply(matcher)
Current challenge:
str.contains works when Name has "Paper" and Item_Name has "Paper Bag", but it doesn't work the other way around. So in my example it works for rows 1, 3, 4 and 5 of df1 but not for rows 2 & 6; it will not map row 2 of df1 to row 3 of df2.
Ask:
Can you help me modify the code so that it also matches the other way around?
You can modify your custom matcher() function and use apply():

def matcher(query):
    matches = [i['Item_ID'] for i in df2[['Item_ID', 'Item_Name']].to_dict('records')
               if any(q in i['Item_Name'].lower() for q in query.lower().split())]
    if matches:
        return ','.join(map(str, matches))
    else:
        return 'NA'

df1['Item_ID'] = df1['Name'].apply(matcher)
Returns:
Id Name Item_ID
0 1 Paper 1,2,3
1 2 Paper Bag 1,2,3
2 3 Scissors NA
3 4 Mat NA
4 5 Cat 4,5
5 6 Good Cat 4,5
Explanation:
We use apply() to run the custom matcher() function on each value of the df1['Name'] column. Inside matcher(), we convert df2 into a list of records, each holding an Item_ID and an Item_Name. We then check whether any() word of the current query (lowercased via lower()) is contained in the Item_Name value, and if so, we add that Item_ID to the list to be returned.
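An alternative sketch (my own variation, not from the answer above) that makes the two-way matching explicit with a cross join; the sample frames are reconstructed from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                    'Name': ['Paper', 'Paper Bag', 'Scissors',
                             'Mat', 'Cat', 'Good Cat']})
df2 = pd.DataFrame({'Item_ID': [1, 2, 3, 4, 5],
                    'Item_Name': ['Paper Bag', 'wallpaper', 'paper',
                                  'cat cage', 'good cat']})

# Cross join every Name with every Item_Name (constant-key trick),
# keep the pairs where any word of Name is a substring of Item_Name,
# then collect the matching Item_IDs per Id.
cross = df1.assign(k=1).merge(df2.assign(k=1), on='k').drop(columns='k')
hit = cross.apply(lambda r: any(w in r['Item_Name'].lower()
                                for w in r['Name'].lower().split()), axis=1)
ids = cross[hit].groupby('Id')['Item_ID'].agg(lambda s: ','.join(map(str, s)))
df1['Item_ID'] = df1['Id'].map(ids).fillna('NA')
```

The cross join materializes len(df1) * len(df2) rows, so with 1000+ rows on each side this trades memory for clarity; for much larger frames the per-row matcher may be the safer option.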
