Pandas dataframe "df1" has a column ("Receiver") with string values.
df1

     Receiver
44   BANK
106  restaurant
149  Tax office
63   house
55   car insurance
I want to go through each row of that column, check whether it matches any of the values (mostly one- or two-word search terms) in another dataframe ("df2"), and return the matching column's title on the correct rows. I'm trying to do it with the following function:
df1.Receiver.apply(lambda x:
    ''.join([i for i in df2.columns
             if df2.loc[:, i].str.contains(x).any()])
)
Problem:
However, this only works for values in df1's "Receiver" column that consist of just one word (so "BANK", "restaurant" and "house" work in this case).
Values with two or more words do not work ("Tax office" and "car insurance" in this case).
Isn't str.contains() supposed to find partial matches too? How can I find partial matches also for values in the "Receiver" column that have two or more words?
edit: here's what df2 looks like; it has the different categories as column titles, and each column has the search terms as values
df2

  Banks Restaurants  Car  House
0  BANK  restaurant  car  house
1  bank   mcdonalds
2          Subway
Here is the whole problem in a single image: the output can be seen on the right, and the categories "Car" and "Tax office" are not found because the receivers "car insurance" and "Tax office" (Receiver column in df1) are only partial matches with the search terms "car" and "Tax" (values in df2's columns "Car" and "Tax office").
Your apply asks whether df2's search terms contain the whole Receiver string, so a multi-word receiver like "Tax office" can never match the single term "Tax"; the containment check has to run the other way around. Instead of iterating your dataframe rows, you can iterate the columns of df2 and use regex with pd.Series.str.contains:
df1 = pd.DataFrame({'Receiver': ['BANK', 'restaurant house', 'Tax office', 'mcdonalds car']})
df1['Receiver_new'] = ''

for col in df2:
    # join this column's search terms into one regex pattern: term1|term2|...
    values = '|'.join(df2[col].dropna())
    # True for rows whose Receiver contains any of the terms
    bool_series = df1['Receiver'].str.contains(values)
    df1.loc[bool_series, 'Receiver_new'] += f'{col}|'

print(df1)
#            Receiver        Receiver_new
# 0              BANK              Banks|
# 1  restaurant house  Restaurants|House|
# 2        Tax office
# 3     mcdonalds car    Restaurants|Car|
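If the trailing "|" separator is unwanted, or the search terms might contain regex metacharacters, here is a small hedged variant of the same loop (re.escape and the final rstrip are the only additions):

import re

df1['Receiver_new'] = ''  # start again from a clean column

for col in df2:
    # escape each term so characters like '(' or '+' are matched literally
    values = '|'.join(re.escape(v) for v in df2[col].dropna())
    bool_series = df1['Receiver'].str.contains(values)
    df1.loc[bool_series, 'Receiver_new'] += f'{col}|'

# drop the trailing '|' left by the loop
df1['Receiver_new'] = df1['Receiver_new'].str.rstrip('|')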
I have a table with some company information that we're trying to clean up. In the first column is a clean company name, but not necessarily the correct one. In the second column, there is the correct company name, but it is often messy or missing. Here is an example.
Name       Info
Nike       Nike, a footwear manufacturer is headquartered in Oregon.
ASG Shoes  Reebok
Adidas     None
We're working with this dataset in Pandas. We'd like to follow the rules below.
1. If the Name column is equal to the left side of the Info column, keep the Name column. This should be dynamic with the length of the name: for "Nike", it should check the first 4 characters of the Info column; for "ASG Shoes", the first 9 characters.
2. If rule 1 is false, use the Info column.
3. If Info is None, use the Name column.
The output we seek is a 3rd column that is the output of these rules. I am hoping someone can help me with writing this code in an efficient manner. There's a lot going on here and I want to ensure I'm doing this properly. How can I achieve this output with the most efficient Python code possible?
Name       Info                                                        Clean
Nike       Nike, a footwear manufacturer is headquartered in Oregon.  Nike
ASG Shoes  Reebok                                                      Reebok
Adidas     None                                                        Adidas
You can start by creating another column that contains the length of your Name column; this is straightforward. Let us call the new column Slicers. You can then create a function that slices a string by a certain number and map this function over your columns Info and Slicers, where Info is the string column to be sliced and Slicers defines the slicing number. (There may even be a pandas implementation for this, but I do not know of one.) After that, you can compare the sliced Info with your Name column and assign all matches to your Clean column. Then just apply a pandas coalesce over your desired columns.
The code implementation is given below:
import pandas as pd

def slicer(strings, slicers):
    # slice a string to the given length; pass None and other non-strings through
    return strings[:slicers] if isinstance(strings, str) else strings

df = pd.DataFrame({
    "Name": ["Nike", "ASG Shoes", "Adidas"],
    "Info": ["Nike, a footwear manufacturer is headquartered in Oregon.", "Reebok", None]
})

# Define length column
df["Slicers"] = df["Name"].str.len()
# Slice Info column by length column and overwrite
df["Slicers"] = list(map(slicer, df["Info"], df["Slicers"]))
# Check whether the sliced Info column and the Name column are equal
mask = df["Name"].eq(df["Slicers"])
# Overwrite if they are equal
df.loc[mask, "Clean"] = df.loc[mask, "Name"]
# Apply coalesce: first non-null of Clean, Info, Name
coalesce_rules = ["Clean", "Info", "Name"]
df.drop(columns=["Slicers"]).assign(Clean=df[coalesce_rules].bfill(axis=1).iloc[:, 0])
Output:
        Name                                               Info   Clean
0       Nike  Nike, a footwear manufacturer is headquartered...    Nike
1  ASG Shoes                                             Reebok  Reebok
2     Adidas                                               None  Adidas
It only needs around five seconds for 3 million rows. Obviously, I do not know whether this is the most efficient way to solve your problem, but I think it is an efficient one.
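For comparison, a more compact sketch of the same three rules (not necessarily faster, and assuming the df built above):

import numpy as np

# rule 1: does Info start with Name? (non-strings such as None count as no)
starts = [
    isinstance(info, str) and info.startswith(name)
    for name, info in zip(df["Name"], df["Info"])
]
# rules combined: keep Name on a prefix match or missing Info, otherwise take Info
df["Clean"] = np.where(df["Info"].isna() | np.array(starts), df["Name"], df["Info"])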
I am trying to use pandas to group sales information based on category and a criterion.
For example in "Table 1" below, I want sales totals for each category excluding those with a "Not Stated" in the Reg/Org column. My ideal output would be in "Table 2" below. My actual data set has 184 columns, and I am trying to capture the sales volume by category across any values excluding those that are "Not Stated".
Thank you for any help or direction that you can provide.
TABLE 1

Category  Reg/Org     Sales
Apple     Regular     10
Apple     Organic     5
Apple     Not Stated  5
Banana    Regular     15
Banana    Organic     5
TABLE 2

Category  Reg/Org
Apple     15
Banana    20
The first part was to summarize the values by column for the entire data set. I utilized the code below to gather that info for each of the 184 columns. Now I want to create a further summary where I create those column totals again, but split by the 89 categories I have. Ideally, I am trying to create a cross tab, where the categories are listed down the rows, and each of the 184 columns contains the sales. (e.g. the column "Reg/Org" would no longer show "Organic" or "Regular", it would just show the sales volume for all values that are not "Not Stated".)
att_list = att.columns.tolist()
ex_list = ['NOT STATED', 'NOT COLLECTED']
sales_list = []
for att_col in att_list:
    sales_list.append(att[~att[att_col].isin(ex_list)]['$'].sum())
Try
df[df["Reg/Org"] != "Not Stated"].groupby("Category")["Sales"].sum()
Or, if you would rather drop the excluded rows via the index:
df.set_index("Reg/Org").drop(index="Not Stated").groupby("Category")["Sales"].sum()
try using "YourDataframe.loc[]" with a filter inside:
import pandas as pd

data = pd.read_excel('Test_excel.xlsx')
sales_volume = data.loc[data["Reg/Org"] != "Not Stated"]
print(sales_volume)
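Extending that filter to the crosstab described in the question, a hedged sketch: it reuses att, ex_list and the '$' sales column from the question's code, and assumes the category column is literally named 'Category'.

import pandas as pd

ex_list = ['NOT STATED', 'NOT COLLECTED']
# every attribute column except the grouping and sales columns
att_cols = [c for c in att.columns if c not in ('Category', '$')]

# one column per attribute: sales per category, excluding the "not stated" values
summary = pd.DataFrame({
    col: att[~att[col].isin(ex_list)].groupby('Category')['$'].sum()
    for col in att_cols
})
print(summary)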
I am new to Python. The code below filters rows in a dataframe df based on substrings (keys) in a list, and I want it to add a new column, say 'Key', containing the matching substrings (all of them). The dataframe contains a student's name, age, and sport; the Sport column contains all sports played by the student. The list contains two sports names. The code here extracts the names of students that play any of the sports mentioned in the list. I want another field 'Key' in the dataframe that records the matches from the list, like "hockey" or "football" or "hockey football", depending on the match.
import pandas as pd

data = {'Name': ['Tom', 'Joseph', 'Krish', 'Mohan', 'Ram'],
        'Age': [20, 21, 19, 18, 29],
        'Sport': ['football', 'hockey football badminton', 'cricket',
                  'tennis football', 'hocey cricet']}  # note: the last entry is misspelled, so it should not match
df = pd.DataFrame(data)
print(df)

sports = ['football', 'hockey']  # list of sports to filter on

# Filter rows from df whose Sport contains any entry of sports
dff = df[df['Sport'].str.contains('|'.join(sports))]
print(dff)
While it might be possible in this case to just use substrings, a more robust approach would be to make a new DataFrame that maps each Name to each associated Sport, and select the desired Names from there.
>>> name_to_sport = (
...     df[['Name']]
...     .join(df['Sport'].str.split())
...     .explode('Sport')
... )
>>> name_to_sport
     Name      Sport
0     Tom   football
1  Joseph     hockey
1  Joseph   football
1  Joseph  badminton
2   Krish    cricket
3   Mohan     tennis
3   Mohan   football
4     Ram      hocey
4     Ram     cricet
>>> name_to_sport.loc[name_to_sport['Sport'].isin(['football','hockey']), 'Name'].unique()
array(['Tom', 'Joseph', 'Mohan'], dtype=object)
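To get the 'Key' column the question actually asks for, one hedged sketch reusing name_to_sport from above (the column names and filter list come from the question):

wanted = ['football', 'hockey']
# per name, join the sports that appear in the filter list, e.g. 'hockey football'
keys = (
    name_to_sport[name_to_sport['Sport'].isin(wanted)]
    .groupby('Name')['Sport']
    .agg(' '.join)
)
df['Key'] = df['Name'].map(keys)
dff = df.dropna(subset=['Key'])
print(dff)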
I have two dataframes. I am trying to Vlookup the 'Hobby' column from the 2nd dataframe and update the 'Interests' column of the 1st dataframe. Please note that the columns Key, Employee and Industry should match exactly between the two dataframes, but for City, it is acceptable if only the first part of the city matches between the two dataframes. Though it is straightforward in Excel, it looks a bit complicated to implement in Python. Any cue on how to proceed will be really helpful.
import pandas as pd

data1 = [['AC32456', 'NYC-URBAN', 'JON', 'BANKING', 'SINGING'],
         ['AD45678', 'WDC-RURAL', 'XING', 'FINANCE', 'DANCING'],
         ['DE43216', 'LONDON-URBAN', 'EDWARDS', 'IT', 'READING'],
         ['RT45327', 'SINGAPORE-URBAN', 'WOLF', 'SPORTS', 'WALKING'],
         ['Rs454457', 'MUMBAI-RURAL', 'NEMBIAR', 'IT', 'ZUDO']]
data2 = [['AH56245', 'NYC', 'MIKE', 'BANKING', 'BIKING'],
         ['AD45678', 'WDC', 'XING', 'FINANCE', 'TREKKING'],
         ['DE43216', 'LONDON-URBAN', 'EDWARDS', 'FINANCE', 'SLEEPING'],
         ['RT45327', 'SINGAPORE', 'WOLF', 'SPORTS', 'DANCING'],
         ['RS454457', 'MUMBAI', 'NEMBIAR', 'IT', 'ZUDO']]
List1 = ['Key', 'City', 'Employee', 'Industry', 'Interests']
List2 = ['Key', 'City', 'Employee', 'Industry', 'Hobby']
df1 = pd.DataFrame(data1, columns=List1)
df2 = pd.DataFrame(data2, columns=List2)
Set the index of df1 to the matching columns (you can set the index to whatever you want to match on) and then use update:
# get the first part of the city
df1['City_key'] = df1['City'].str.split('-', expand=True)[0]
df2['City_key'] = df2['City'].str.split('-', expand=True)[0]
# set index
df1 = df1.set_index(['Key', 'Employee', 'Industry', 'City_key'])
# update
df1['Interests'].update(df2.set_index(['Key', 'Employee', 'Industry', 'City_key'])['Hobby'])
# reset index and drop the City_key column
new_df = df1.reset_index().drop(columns=['City_key'])
        Key Employee Industry             City Interests
0   AC32456      JON  BANKING        NYC-URBAN   SINGING
1   AD45678     XING  FINANCE        WDC-RURAL  TREKKING
2   DE43216  EDWARDS       IT     LONDON-URBAN   READING
3   RT45327     WOLF   SPORTS  SINGAPORE-URBAN   DANCING
4  Rs454457  NEMBIAR       IT     MUMBAI-RURAL      ZUDO
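As a hedged alternative sketch, the same lookup can be done with merge instead of update, starting again from the question's df1/df2 with the City_key helper columns added as above:

merged = df1.merge(
    df2[['Key', 'Employee', 'Industry', 'City_key', 'Hobby']],
    on=['Key', 'Employee', 'Industry', 'City_key'],
    how='left',
)
# where a Hobby was found, it replaces Interests; otherwise keep the original
merged['Interests'] = merged['Hobby'].fillna(merged['Interests'])
new_df = merged.drop(columns=['City_key', 'Hobby'])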
I would like to group the strings in the column called 'tipology' and insert them in a plotly bar chart. The problem is that from the new table created with groupby I can't extract the x and y to define them in the graph:
tipol1 = df.groupby(['tipology']).nunique()
tipol1
the output gives me tipology as the index and the grouping based on how many times the values repeat:
          number  data
tipology
one            2   113
two           33    33
three         12    88
four          44   888
five          11    66
In the number column (in which I have other values) it gives me the correct grouping of the tipology column. The data column also gives me values (I think it is grouping the dates, but not in the correct format).
I also found:
tipol=df.groupby(['tipology']).nunique()
tipol2 = tipol[['number']]
tipol2
to take only the number column, but no luck: I would need the tipology column (not as the index) and the column with the tipology grouping counts, to get the x and y axes to import into plotly!
One last try I made (making a big mess):
tipol = df.groupby(['tipology'], as_index=False).nunique()
tipol2 = tipol[['number']]
fig = go.Figure(data=[
    go.Bar(name='test', x=df['tipology'], y=tipol2)
])
fig.update_layout(barmode='stack')
fig.show()
Any suggestions? Thanks!
UPDATE
I would have too much code to give a full example; it would be difficult for me and would waste your time too. Basically I would need a groupby with an added column that shows the grouping count, e.g.:
tipology  Date
home      10/01/18
home      11/01/18
garden    12/01/18
garden    12/01/18
garden    13/01/18
bathroom  13/01/18
bedroom   14/01/18
bedroom   15/01/18
kitchen   16/01/18
kitchen   16/01/18
kitchen   17/01/18
I wish this would happen:
by deleting the date column and inserting the value column in the DataFrame that does the count
tipology  value
home      2
garden    3
bathroom  1
bedroom   2
kitchen   3
Then (I'm working with jupyer notebook)
leaving the date column and adding the corresponding values to the value column based on their grouping:
tipology  Date      value
home      10/01/18  1
home      11/01/18  1
garden    12/01/18  2
garden    12/01/18
garden    13/01/18  1
bathroom  13/01/18  1
bedroom   14/01/18  1
bedroom   15/01/18  1
kitchen   16/01/18  2
kitchen   16/01/18
kitchen   17/01/18  1
I would need both columns available to assign them to the x and y axes and import them into a graph, so neither of the columns should be the index.
By default the groupby method will return a dataframe where the fields you are grouping on are in the index. You can adjust this behaviour by setting as_index=False in the groupby. Then tipology will still be a column in the dataframe that is returned:
tipol1 = df.groupby('tipology', as_index=False).nunique()
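From there, a hedged sketch of the full chart: the column names 'tipology' and 'Date' are taken from your update, and groupby(...).size() is used for the row counts your update describes (swap in nunique() if you really want distinct counts):

import plotly.graph_objects as go

# one row per tipology with its row count, tipology kept as a regular column
counts = df.groupby('tipology', as_index=False).size().rename(columns={'size': 'value'})

fig = go.Figure(data=[
    go.Bar(name='test', x=counts['tipology'], y=counts['value'])
])
fig.update_layout(barmode='stack')
fig.show()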