pandas join with condition using LIKE operator - python

I have 2 dataframes:
users
  user_id                 position
0     201          Senior Engineer
1     207  Senior System Architect
2     223    Senior account manage
3     212           Junior Manager
4     112          junior Engineer
5     311  junior python developer

df1 = pd.DataFrame({'user_id': ['201', '207', '223', '212', '112', '311'],
                    'position': ['Senior Engineer', 'Senior System Architect',
                                 'Senior account manage', 'Junior Manager',
                                 'junior Engineer', 'junior python developer']})
roles
  role_id role_position
0      10      %senior%
1      20      %junior%

df2 = pd.DataFrame({'role_id': ['10', '20'],
                    'role_position': ['%senior%', '%junior%']})
I want to join them to get a role_id for each row in df1, using a condition something like this:

lower(df1.position) LIKE df2.role_position

That is, I want to use the LIKE operator (as in SQL). The result should look like this (or even better, without the role_position column):
  user_id                 position role_id role_position
0     201          Senior Engineer      10      %senior%
1     207  Senior System Architect      10      %senior%
2     223    Senior account manage      10      %senior%
3     212           Junior Manager      20      %junior%
4     112          junior Engineer      20      %junior%
5     311  junior python developer      20      %junior%
How can I do this? Thank you for your help!

You can use str.extract() + merge():

# build an alternation pattern from the role keywords: '(senior|junior)'
pat = '(' + '|'.join(df2['role_position'].str.strip('%').unique()) + ')'
# extract the keyword from the lower-cased position, then wrap it back in '%'
df1['role_position'] = '%' + df1['position'].str.lower().str.extract(pat, expand=False) + '%'
# a plain equi-join on role_position now works
df1 = df1.merge(df2, on='role_position', how='left')
Output of df1:

  user_id                 position role_id role_position
0     201          Senior Engineer      10      %senior%
1     207  Senior System Architect      10      %senior%
2     223    Senior account manage      10      %senior%
3     212           Junior Manager      20      %junior%
4     112          junior Engineer      20      %junior%
5     311  junior python developer      20      %junior%

You can save some trouble by merging directly, if the seniority level always starts at the front:

print(pd.merge(df1, df2,
               left_on=df1["position"].str.split().str[0].str.lower(),
               right_on=df2["role_position"].str.strip("%")).drop("key_0", axis=1))

Otherwise, you can do a pd.Series.str.extract during the merge:

import re

pat = f'({"|".join(df2["role_position"].str.strip("%"))})'
print(pd.merge(df1, df2,
               left_on=df1["position"].str.extract(pat, flags=re.IGNORECASE, expand=False).str.lower(),
               right_on=df2["role_position"].str.strip("%")).drop("key_0", axis=1))
Both yield the same result:

  user_id                 position role_id role_position
0     201          Senior Engineer      10      %senior%
1     207  Senior System Architect      10      %senior%
2     223    Senior account manage      10      %senior%
3     212           Junior Manager      20      %junior%
4     112          junior Engineer      20      %junior%
5     311  junior python developer      20      %junior%

Possibilities:

fuzzy word matching
SequenceMatcher
.extract

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

df1['Similarity'] = 0
df1['Role'] = 0

for index, row in df1.iterrows():
    for x in df2['role_position']:
        z = similar(row['position'], x)
        # keep only the best-scoring pattern above the threshold
        if z >= 0.20 and z > df1.loc[index, "Similarity"]:
            df1.loc[index, "Similarity"] = z
            df1.loc[index, "Role"] = x

You can generate a dict of mappings and then map the values:
df2['role_position'] = df2['role_position'].str.strip('%')
mappings = df2.set_index('role_position').to_dict('dict')['role_id']
>> mappings
{'senior': '10', 'junior': '20'}
Using a regular expression, we can extract the role for each position:

import re

re_roles = f"({df2['role_position'].str.cat(sep='|')})"
position = df1['position'].str.extract(re_roles, flags=re.I).iloc[:, 0].str.lower()
>> position
0    senior
1    senior
2    senior
3    junior
4    junior
5    junior
Name: 0, dtype: object
And finally map the role_id using the mappings dictionary:
df1['role_id'] = position.map(mappings)
>> df1
  user_id                 position role_id
0     201          Senior Engineer      10
1     207  Senior System Architect      10
2     223    Senior account manage      10
3     212           Junior Manager      20
4     112          junior Engineer      20
5     311  junior python developer      20

Related

Defining Parent For a Dataset with Several Conditions in Pandas

I have a CSV file with more than 10,000,000 rows of data with the structure below. I have an ID as my unique ID per group:
Data format:

ID    Type    Name
1     Head    abc-001
1     Senior  abc-002
1     Junior  abc-003
1     Junior  abc-004
2     Head    abc-005
2     Senior  abc-006
2     Junior  abc-007
3     Head    abc-008
3     Junior  abc-009
...
For defining the parent relationship, the following conditions hold:

Each group MUST have 1 Head.
It is OPTIONAL to have ONLY 1 Senior in each group.
Each group MUST have AT LEAST one Junior.
EXPECTED RESULT

ID    Type    Name     Parent
1     Senior  abc-002  abc-001
1     Junior  abc-003  abc-002
1     Junior  abc-004  abc-002
2     Senior  abc-006  abc-005
2     Junior  abc-007  abc-006
3     Junior  abc-009  abc-008
The code below works when I have one Junior; I want to know if there is any way to define the parent for more than one Junior:

order = ['Head', 'Senior', 'Junior']
key = pd.Series({x: i for i, x in enumerate(order)})
df2 = df.sort_values(by='Type', key=key.get)
df4 = df.join(df2.groupby('ID')['Type'].shift().dropna().rename('Parent'), how='right')
print(df4)
You could pivot the Type and Name columns, then forward-fill within each ID group, and finally take the rightmost two non-NaN entries to get the Name and Parent.

Pivot and forward-fill:

dfn = pd.concat([df[['ID', 'Type']], df.pivot(columns='Type', values='Name')], axis=1) \
        .groupby('ID').apply(lambda x: x.ffill())[['ID', 'Type', 'Head', 'Senior', 'Junior']]
print(dfn)
   ID    Type     Head   Senior   Junior
0   1    Head  abc-001      NaN      NaN
1   1  Senior  abc-001  abc-002      NaN
2   1  Junior  abc-001  abc-002  abc-003
3   1  Junior  abc-001  abc-002  abc-004
4   2    Head  abc-005      NaN      NaN
5   2  Senior  abc-005  abc-006      NaN
6   2  Junior  abc-005  abc-006  abc-007
7   3    Head  abc-008      NaN      NaN
8   3  Junior  abc-008      NaN  abc-009
A function to pull the last two non-NaN entries:

import numpy as np

def get_np(x):
    rc = [np.nan, np.nan]
    if x.isna().sum() != 2:          # Head-only rows keep NaNs and are dropped later
        if x.isna().sum() == 0:
            rc = [x['Junior'], x['Senior']]
        elif pd.isna(x['Junior']):
            rc = [x['Senior'], x['Head']]
        else:
            rc = [x['Junior'], x['Head']]
    return pd.concat([x[['ID', 'Type']], pd.Series(rc, index=['Name', 'Parent'])])
Apply it and drop the non-applicable rows:
dfn.apply(get_np, axis=1).dropna()
   ID    Type     Name   Parent
1   1  Senior  abc-002  abc-001
2   1  Junior  abc-003  abc-002
3   1  Junior  abc-004  abc-002
5   2  Senior  abc-006  abc-005
6   2  Junior  abc-007  abc-006
8   3  Junior  abc-009  abc-008

Convert row data into column data in pandas

I have data that looks like this:
        Field                                              Value
0         CRD                                             146099
1   LegalName                                  CHUNG, BUCK CHWEE
2     BusName                        PRINCIPA FINANCIAL ADVISORS
3         URL  https://adviserinfo.sec.gov/IAPD/content/ViewF...
4         CRD                                             170701
5   LegalName                        MESSINA AND ASSOCIATES, INC
6     BusName                          FINANCIAL RESOURCES GROUP
7         URL  https://adviserinfo.sec.gov/IAPD/content/ViewF...
8         CRD                                             133630
9   LegalName                                       ALAN EDELMAN
10    BusName                                      EDELMAN, ALAN
11        URL  https://adviserinfo.sec.gov/IAPD/content/ViewF...
12        CRD                                             131792
13  LegalName                            RESOURCE MANAGEMENT LLC
14    BusName                            RESOURCE MANAGEMENT LLC
15        URL  https://adviserinfo.sec.gov/IAPD/content/ViewF...
How can I convert it so that CRD, LegalName, BusName, and URL become the columns? I tried pd.melt, but it doesn't seem to be what I'm looking for.

Use split to get 2 columns first, then create a counter Series with cumcount, build a MultiIndex with set_index, and reshape with unstack:

df[['Field', 'Value']] = df['Value'].str.split(n=1, expand=True)
groups = df.groupby('Field').cumcount()
df = df.set_index([groups, 'Field'])['Value'].unstack()
print(df)
Field                      BusName     CRD                    LegalName  \
0      PRINCIPA FINANCIAL ADVISORS  146099            CHUNG, BUCK CHWEE
1        FINANCIAL RESOURCES GROUP  170701  MESSINA AND ASSOCIATES, INC
2                    EDELMAN, ALAN  133630                 ALAN EDELMAN
3          RESOURCE MANAGEMENT LLC  131792      RESOURCE MANAGEMENT LLC

Field                                                URL
0      https://adviserinfo.sec.gov/IAPD/content/ViewF...
1      https://adviserinfo.sec.gov/IAPD/content/ViewF...
2      https://adviserinfo.sec.gov/IAPD/content/ViewF...
3      https://adviserinfo.sec.gov/IAPD/content/ViewF...
I think you're looking for DataFrame.transpose
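
For a single record, that idea looks like this (a sketch, assuming the frame has the Field/Value columns shown above and that the first four rows form one record; transpose alone does not regroup the repeated Field blocks):

# take one record's rows, index them by Field, and flip the frame on its side
print(df.head(4).set_index('Field')[['Value']].T)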

How to perform groupby and mean on categorical columns in Pandas

I'm working on a dataset called gradedata.csv in Python Pandas, where I've created a new binned column called 'status': 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here are the first five rows of the dataset:
     fname     lname  gender  age  exercise  hours  grade  \
0   Marcia      Pugh  female   17         3     10   82.4
1   Kadeem  Morrison    male   18         4      4   78.2
2     Nash    Powell    male   18         5      9   79.3
3  Noelani    Wagner  female   14         2      7   83.2
4  Noelani    Cherry  female   18         4     15   87.4

                                    address status
0   9253 Richardson Road, Matawan, NJ 07747   Pass
1          33 Spring Dr., Taunton, MA 02780   Pass
2          41 Hill Avenue, Mentor, OH 44060   Pass
3        8839 Marshall St., Miami, FL 33125   Pass
4  8304 Charles Rd., Lewis Center, OH 43035   Pass
Now, how do I compute the mean hours of exercise of female students with a 'status' of 'Pass'?

I've tried the code below, but it isn't working:

print(df.groupby('gender', 'status')['exercise'].mean())

I'm new to Pandas; any help is appreciated.
You are very close. Note that your groupby key must be one of: a mapping, a function, a label, or a list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))
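
For this one cell, a boolean mask gives the same number (a sketch, equivalent to the get call above, assuming the column names shown in the question):

# equivalent boolean-mask formulation of the same query
mask = (df['gender'] == 'female') & (df['status'] == 'Pass')
print(df.loc[mask, 'exercise'].mean())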

pandas sort values to get top 5 for each column in a groupby

I have a dataframe with city, name and members columns. I need to find the top 5 groups (name) by member count ('members') per city.
This is what I get when I use:
clust.groupby(['city','name']).agg({'members':sum})
                                                       members
city        name
Bath        AWS Bath User Group                            346
            Agile Bath & Bristol                           957
            Bath Crypto Chat                                47
            Bath JS                                        142
            Bath Machine Learning Meetup                   435
Belfast     4th Industrial Revolution Challenge            609
            Belfast Adobe Meetup                            66
            Belfast Azure Meetup                           205
Southampton Crypto Currency Trading SouthCoast              50
            Southampton Bitcoin and Altcoin Meetup          50
            Southampton Functional Programming Meetup       28
            Southampton Virtual Reality Meetup             248
Sunderland  Sunderland Digital                             287
I need the top 5, but as you can see the member count isn't ordered, e.g. 346 comes before 957.

I've also tried sorting the values beforehand:

clust.sort_values(['city', 'name'], axis=0).groupby('city').head(5)

But that returns a similar series. I've also used clust.groupby(['city', 'name']).head(5), but it gives me all the rows, not the top 5, and it isn't in alphabetical order either.

Please help. Thanks!
I think you need to add ascending=[True, False] to sort_values and sort on the members column:

clust = clust.groupby(['city', 'name'], as_index=False)['members'].sum()
df = clust.sort_values(['city', 'members'], ascending=[True, False]).groupby('city').head(5)
print(df)
           city                                       name  members
1          Bath                       Agile Bath & Bristol      957
4          Bath               Bath Machine Learning Meetup      435
0          Bath                        AWS Bath User Group      346
3          Bath                                    Bath JS      142
2          Bath                           Bath Crypto Chat       47
5       Belfast        4th Industrial Revolution Challenge      609
7       Belfast                       Belfast Azure Meetup      205
6       Belfast                       Belfast Adobe Meetup       66
11  Southampton         Southampton Virtual Reality Meetup      248
8   Southampton         Crypto Currency Trading SouthCoast       50
9   Southampton     Southampton Bitcoin and Altcoin Meetup       50
10  Southampton  Southampton Functional Programming Meetup       28
12   Sunderland                         Sunderland Digital      287
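
A per-group shortcut that returns the same rows (a sketch, not from the original answer; it assumes clust has already been aggregated as above) is nlargest inside groupby:

# hypothetical alternative: take the 5 largest 'members' rows per city directly
top5 = clust.groupby('city', group_keys=False).apply(lambda g: g.nlargest(5, 'members'))
print(top5)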

Selecting data based on number of occurences using Python / Pandas

My dataset is based on the results of Food Inspections in the City of Chicago.
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
   Inspection ID                        DBA Name  \
0        1609238  JR'SJAMAICAN TROPICAL CAFE,INC
1        1609245                     BURGER KING
2        1609237   DUNKIN DONUTS / BASKIN ROBINS
3        1609258          CHIPOTLE MEXICAN GRILL
4        1609244      ATARDECER ACAPULQUENO INC.

                        AKA Name  License # Facility Type             Risk
0                            NaN  2442496.0    Restaurant    Risk 1 (High)
1                    BURGER KING  2411124.0    Restaurant  Risk 2 (Medium)
2  DUNKIN DONUTS / BASKIN ROBINS  1717126.0    Restaurant  Risk 2 (Medium)
3         CHIPOTLE MEXICAN GRILL  1335044.0    Restaurant    Risk 1 (High)
4     ATARDECER ACAPULQUENO INC.  1910118.0    Restaurant    Risk 1 (High)
Here is how often each facility type appears in the dataset:

df['Facility Type'].value_counts()
Out[3]:
Restaurant                         14304
Grocery Store                       2647
School                              1155
Daycare (2 - 6 Years)                367
Bakery                               316
Children's Services Facility         262
Daycare Above and Under 2 Years      248
Long Term Care                       169
Daycare Combo 1586                   142
Catering                             123
Liquor                                78
Hospital                              68
Mobile Food Preparer                  67
Golden Diner                          65
Mobile Food Dispenser                 51
Special Event                         25
Shared Kitchen User (Long Term)       22
Daycare (Under 2 Years)               18
I am trying to create a new dataset containing only the rows whose Facility Type has over 50 occurrences in the dataset. How would I approach this?

Please note that the real list of facility counts is MUCH LARGER; I have cut out most of it since it did not contribute to the question (so simply removing occurrences of "Special Event", "Shared Kitchen User", and "Daycare" is not what I'm looking for).
IIUC then you want to filter:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
Example:
In [9]:
df = pd.DataFrame({'type': list('aabcddddee'), 'value': np.random.randn(10)})
df

Out[9]:
  type     value
0    a -0.160041
1    a -0.042310
2    b  0.530609
3    c  1.238046
4    d -0.754779
5    d -0.197309
6    d  1.704829
7    d -0.706467
8    e -1.039818
9    e  0.511638

In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)

Out[10]:
  type     value
0    a -0.160041
1    a -0.042310
4    d -0.754779
5    d -0.197309
6    d  1.704829
7    d -0.706467
8    e -1.039818
9    e  0.511638
Not tested, but should work:

# count each facility type, then keep only the rows whose type occurs more than 50 times
FT = df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT > 50])]
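
As a quick sanity check, this pattern applied to the toy frame from the first answer (with a threshold of 1 instead of 50) keeps the same rows as the filter version:

# same toy example as above: keeps types 'a', 'd' and 'e', whose counts exceed 1
counts = df['type'].value_counts()
print(df[df['type'].isin(counts.index[counts > 1])])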
