I have a df with two columns (company_name and sales).
The company_name column includes the name of the company plus a short description (e.g. company X - medical insurance; company Y - travel and medical insurance; company Z - medical and holiday insurance etc.)
I want to add a third column with a binary classification (medical_insurance or travel_insurance) based on the first matching string value included in the company_name.
I have tried using str.contains, but when matching words from different groups are present in the company_name column (e.g., both medical and travel), str.contains doesn't necessarily classify the row by the first instance (which is what I need).
import re

df.loc[df['company_name'].str.contains(
    'medical|hospital', flags=re.IGNORECASE, na=False), 'classification'] = 'medical_focused'
df.loc[df['company_name'].str.contains(
    'travel|holiday', flags=re.IGNORECASE, na=False), 'classification'] = 'travel_focused'
How can I force str.contains to stop at the first instance?
Thanks!
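For reference, a sketch of one way to get first-match behaviour: Series.str.extract returns the first occurrence of any alternative in the pattern, which can then be mapped to a class (the keyword-to-class mapping below is an assumption):

import re

# First keyword that appears in company_name wins (re.search semantics)
first_kw = df['company_name'].str.extract(
    r'(medical|hospital|travel|holiday)', flags=re.IGNORECASE, expand=False)
df['classification'] = first_kw.str.lower().map({
    'medical': 'medical_focused', 'hospital': 'medical_focused',
    'travel': 'travel_focused', 'holiday': 'travel_focused',
})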
I need to analyze a dataset with enterprises from more than 80 industries, grouped by their respective industries. Specifically, I need a for loop or a def function with which I can summarize the following step for all industries, to get nice, short code:
HighTech = data.loc[data['MacrIndustry'] == "High Technology", ['Value']]
Preferably, I would like to separate the enterprises by industry into separate DataFrames, each with its Value column.
Use DataFrame.groupby. The following will get you a dictionary whose keys are all the MacrIndustry unique values, and the values are the Value column (as a DataFrame) of the corresponding industry group.
groups = {industry: df[['Value']] for industry, df in data.groupby('MacrIndustry')}
# or just (less readable)
# groups = dict(iter(data.groupby('MacrIndustry')[['Value']]))
According to your example, HighTech = groups['High Technology'].
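A minimal runnable demo of the dict-of-groups approach, with made-up data (only the column names come from the question):

import pandas as pd

data = pd.DataFrame({
    'MacrIndustry': ['High Technology', 'Energy', 'High Technology'],
    'Value': [10, 20, 30],
})
groups = {industry: df[['Value']] for industry, df in data.groupby('MacrIndustry')}
print(groups['High Technology'])
#    Value
# 0     10
# 2     30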
I have a dataset that shows the share of every disease from total diseases.
I want to find the names of the countries where the share of AIDS is bigger than that of any other disease.
Try with
df.index[df.AIDS.eq(df.drop(columns='Total').max(axis=1))]
Have a look at the pandas max function (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.max.html).
Here you can get the max for each row, like this:
most_frequent_disease = df.drop(columns=['Total', 'Other']).max(axis=1)
Then you can create a condition to check whether AIDS is the most frequent disease, and apply it to your dataframe:
is_aids_most_frequent_disease = df.loc[:, 'AIDS'].eq(most_frequent_disease)
df[is_aids_most_frequent_disease]
You could get the country names by adding .index at the end of the expression, too.
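A toy end-to-end example of this approach (the country names and disease shares are made up):

import pandas as pd

df = pd.DataFrame(
    {'AIDS': [0.4, 0.1], 'Malaria': [0.3, 0.5], 'Total': [1.0, 1.0]},
    index=['CountryA', 'CountryB'])
most_frequent_disease = df.drop(columns=['Total']).max(axis=1)
print(df.index[df['AIDS'].eq(most_frequent_disease)])
# Index(['CountryA'], dtype='object')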
I have a df with many columns, and each column has repeated values because it's survey data. As an example, my data look like this:
df:
Q36r9: sales platforms - Before purchasing a new car    Q36r32: Advertising letters - Before purchasing a new car
Not Selected                                            Selected
So I want to strip the text from the column names. For example, from the first column I want to get the text between ":" and "-", so it should become "sales platforms". In the second part I want to convert the values of each column: "Selected" should be replaced with the name of the column and "Not Selected" with NaN.
so desired output would be like this:
sales platforms    Advertising letters
NaN                Advertising letters
Edited: another problem, if I have a column name like:
Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
If I just want to get what's in between ":" and the first "-", it should extract "WeChat".
IIUC, we can take advantage of regex and greedy matching using .*, which matches everything between a defined pattern:
import re
df.columns = [re.search(':(.*)-', i).group(1).strip() for i in df.columns]
print(df)

  sales platforms Advertising letters
0    Not Selected                None
Edit:
with lazy (non-greedy) matching we can use +?
+? quantifier: matches between one and unlimited times, as few times as possible, expanding as needed (lazy)
0    Q36r9: sales platforms - Before purchasing a new car
1    Q40r1c3: WeChat - Looking for a new car - And now if you think again - Which social media platforms or sources would you use in each situation?
import re
[re.search(':(.+?)-',i).group(1).strip() for i in df.columns]
['sales platforms', 'WeChat']
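The second half of the question (converting the cell values) isn't covered above; here is a minimal sketch, assuming the columns have already been renamed as shown. map leaves "Not Selected" (and anything else unmapped) as NaN automatically:

df = df.apply(lambda s: s.map({'Selected': s.name}))
print(df)
#   sales platforms  Advertising letters
# 0             NaN  Advertising letters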
Pandas dataframe "df1" has a column ("Receiver") with string values.
df1
Receiver
44 BANK
106 restaurant
149 Tax office
63 house
55 car insurance
I want to go through each row of that column, check if they match with values (mostly one- or two-word search terms) in another dataframe ("df2") and return the matching column's title on the correct rows. I'm trying to do it with the following function:
df1.Receiver.apply(lambda x:
    ''.join([i for i in df2.columns
             if df2.loc[:, i].str.contains(x).any()])
)
Problem:
However, this only works for values in df1's "Receiver" column that consist of just one word (so "BANK", "restaurant" and "house" work in this case).
Values with two or more words do not work ("Tax office" and "car insurance" in this case).
Isn't str.contains() supposed to also find partial matches? How can I find partial matches for values in the "Receiver" column that have two or more words?
Edit: here's what df2 looks like. It has different categories as column titles, and each column has the search terms as values:
df2
  Banks Restaurants  Car  House
0  BANK  restaurant  car  house
1  bank   mcdonalds  NaN    NaN
2   NaN      Subway  NaN    NaN
To put the whole problem in one picture: the categories "Car" and "Tax office" are not found because the receivers "car insurance" and "Tax office" (Receiver column in df1) are only partial matches with the search terms "car" and "Tax" (values in df2's columns "Car" and "Tax office").
Instead of iterating your dataframe rows, you can iterate the columns of df2 and use regex with pd.Series.str.contains. (Note that your df2.loc[:, i].str.contains(x) tests containment in the wrong direction: it asks whether a search term like "car" contains the receiver "car insurance", which can never be true.)
df1 = pd.DataFrame({'Receiver': ['BANK', 'restaurant house', 'Tax office', 'mcdonalds car']})
df1['Receiver_new'] = ''
for col in df2:
    values = '|'.join(df2[col].dropna())
    bool_series = df1['Receiver'].str.contains(values)
    df1.loc[bool_series, 'Receiver_new'] += f'{col}|'
print(df1)
# Receiver Receiver_new
# 0 BANK Banks|
# 1 restaurant house Restaurants|House|
# 2 Tax office
# 3 mcdonalds car Restaurants|Car|
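One caveat that is not part of the original answer: if the search terms in df2 can contain regex metacharacters (., +, (, etc.), escape them before joining:

import re

values = '|'.join(re.escape(v) for v in df2[col].dropna())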
I have 3 datasets of customers with 7 columns:
CustomerName
Address
Phone
StoreName
Mobile
Longitude
Latitude
Every dataset has 13000-18000 records. I am trying to fuzzy match for deduplication between them. My dataset's columns don't all have the same weight in this matching. How can I handle that?
Do you know a good library for my case?
I think the recordlinkage library would suit your purposes.
You can use the Compare object to require various kinds of matches:
import recordlinkage

compare_cl = recordlinkage.Compare()
compare_cl.exact('CustomerName', 'CustomerName', label='CustomerName')
compare_cl.string('StoreName', 'StoreName', method='jarowinkler', threshold=0.85, label='StoreName')
compare_cl.string('Address', 'Address', threshold=0.85, label='Address')
Then, when computing the matches, you can customize how you want the results, e.g. requiring that at least 2 features match:
features = compare_cl.compute(pairs, df)
matches = features[features.sum(axis=1) >= 2]
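For completeness, a fuller sketch showing where pairs comes from and one way to give the columns unequal weight, since that was part of the question. The blocking key, the weights, and the threshold below are all assumptions to tune, not part of the original answer:

import recordlinkage

# Candidate pairs: block on one field so we don't compare all ~18000^2 pairs
indexer = recordlinkage.Index()
indexer.block('CustomerName')  # assumption: exact-name blocking is acceptable
pairs = indexer.index(df)      # df: the concatenated customer datasets

features = compare_cl.compute(pairs, df)

# Weighted score instead of a plain feature count (weights are made up)
weights = {'CustomerName': 2.0, 'StoreName': 1.0, 'Address': 0.5}
score = sum(features[col] * w for col, w in weights.items())
matches = features[score >= 2.5]  # threshold is an assumption to tune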