I have a CSV file with more than 10,000,000 rows of data with the structure below.
The ID column is my unique ID per group:
Data Format
ID Type Name
1 Head abc-001
1 Senior abc-002
1 Junior abc-003
1 Junior abc-004
2 Head abc-005
2 Senior abc-006
2 Junior abc-007
3 Head abc-008
3 Junior abc-009
...
For defining the parent relationship, the following conditions hold:
Each group MUST have exactly 1 Head.
Each group MAY have at most 1 Senior (it is optional).
Each group MUST have AT LEAST one Junior.
EXPECTED RESULT
ID Type Name Parent
1 Senior abc-002 abc-001
1 Junior abc-003 abc-002
1 Junior abc-004 abc-002
2 Senior abc-006 abc-005
2 Junior abc-007 abc-006
3 Junior abc-009 abc-008
The code below works when I have one Junior; I want to know if there is any way to define the parent when there is more than one Junior:
order = ['Head', 'Senior', 'Junior']
key = pd.Series({x: i for i, x in enumerate(order)})
df2 = df.sort_values(by='Type', key=key.get)
df4 = df.join(df2.groupby('ID')['Type'].shift().dropna().rename('Parent'), how='right')
print(df4)
You could pivot the Type and Name columns, then forward fill within each ID group. Then take the right-most two non-NaN entries to get the Name and Parent.
Pivot and forward-fill:
dfn = pd.concat([df[['ID','Type']], df.pivot(columns='Type', values='Name')], axis=1) \
.groupby('ID').apply(lambda x: x.ffill())[['ID','Type','Head','Senior','Junior']]
print(dfn)
ID Type Head Senior Junior
0 1 Head abc-001 NaN NaN
1 1 Senior abc-001 abc-002 NaN
2 1 Junior abc-001 abc-002 abc-003
3 1 Junior abc-001 abc-002 abc-004
4 2 Head abc-005 NaN NaN
5 2 Senior abc-005 abc-006 NaN
6 2 Junior abc-005 abc-006 abc-007
7 3 Head abc-008 NaN NaN
8 3 Junior abc-008 NaN abc-009
A function to pull the last two non-NaN entries:
import numpy as np

def get_np(x):
    # default when only the Head is present (Senior and Junior are both NaN)
    rc = [np.nan, np.nan]
    if x.isna().sum() != 2:
        if x.isna().sum() == 0:
            rc = [x['Junior'], x['Senior']]
        elif pd.isna(x['Junior']):
            rc = [x['Senior'], x['Head']]
        else:
            rc = [x['Junior'], x['Head']]
    return pd.concat([x[['ID', 'Type']], pd.Series(rc, index=['Name', 'Parent'])])
Apply it and drop the non-applicable rows:
dfn.apply(get_np, axis=1).dropna()
ID Type Name Parent
1 1 Senior abc-002 abc-001
2 1 Junior abc-003 abc-002
3 1 Junior abc-004 abc-002
5 2 Senior abc-006 abc-005
6 2 Junior abc-007 abc-006
8 3 Junior abc-009 abc-008
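For a file this size, a fully vectorized variant may be worth considering. This is a sketch, not part of the answer above; it assumes the same ID/Type/Name columns and the group rules from the question (one Head and at most one Senior per ID):
import numpy as np
import pandas as pd

# One Head/Senior name per ID; 'first' is safe because a group has at most one of each
wide = df.pivot_table(index='ID', columns='Type', values='Name', aggfunc='first')

# A Junior's parent is the group's Senior if one exists, otherwise the Head
junior_parent = wide['Senior'].fillna(wide['Head']) if 'Senior' in wide else wide['Head']

out = df[df['Type'] != 'Head'].copy()
out['Parent'] = np.where(out['Type'].eq('Senior'),
                         out['ID'].map(wide['Head']),
                         out['ID'].map(junior_parent))
print(out)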
I tried to merge two tables on person_skills, but received a merged table with a lot of NaN values.
I'm sure the second table has no duplicate values, and I tried to rule out possible issues caused by data type or NA values, but I still get the same wrong result.
Please take a look at the following code and help me find the problem.
Table 1
lst_col = 'person_skills'
skills = skills.assign(**{lst_col:skills[lst_col].str.split(',')})
skills = skills.explode(['person_skills'])
skills['person_id'] = skills['person_id'].astype(int)
skills['person_skills'] = skills['person_skills'].astype(str)
skills.head(10)
person_id person_skills
0 1 Talent Management
0 1 Human Resources
0 1 Performance Management
0 1 Leadership
0 1 Business Analysis
0 1 Policy
0 1 Talent Acquisition
0 1 Interviews
0 1 Employee Relations
Table 2
standard_skills = df["person_skills"].str.split(',', expand=True)
series1 = pd.Series(standard_skills[0])
standard_skills = series1.unique()
standard_skills = pd.DataFrame(standard_skills, columns=["person_skills"])
standard_skills.insert(0, 'skill_id', range(1, 1 + len(standard_skills)))
standard_skills['skill_id'] = standard_skills['skill_id'].astype(int)
standard_skills['person_skills'] = standard_skills['person_skills'].astype(str)
standard_skills = standard_skills.drop_duplicates(subset='person_skills').reset_index(drop=True)
standard_skills = standard_skills.dropna(axis=0)
standard_skills.head(10)
skill_id person_skills
0 1 Talent Management
1 2 SEM
2 3 Proficient with Microsoft Windows: Word
3 4 Recruiting
4 5 Employee Benefits
5 6 PowerPoint
6 7 Marketing
7 8 nan
8 9 Human Resources (HR)
9 10 Event Planning
Merged table
combine_skill = skills.merge(standard_skills,on='person_skills', how='left')
combine_skill.head(10)
person_id person_skills skill_id
0 1 Talent Management 1.0
1 1 Human Resources NaN
2 1 Performance Management NaN
3 1 Leadership NaN
4 1 Business Analysis NaN
5 1 Policy NaN
6 1 Talent Acquisition NaN
7 1 Interviews NaN
8 1 Employee Relations NaN
9 1 Staff Development NaN
Please let me know where I made mistakes, thanks!
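A likely culprit, assuming the raw person_skills strings are comma-plus-space separated (e.g. "Talent Management, Human Resources"): str.split(',') keeps the space after each comma, so every skill except the first carries a leading space and never matches the lookup table. Note also that standard_skills is built only from column 0 of the expanded split, so any skill that never appears first in a list gets no skill_id at all. A minimal sketch of the whitespace fix:
# Strip stray whitespace from the merge key on both sides before joining
# (assumes skills and standard_skills are built as in the question above)
skills['person_skills'] = skills['person_skills'].str.strip()
standard_skills['person_skills'] = standard_skills['person_skills'].str.strip()

combine_skill = skills.merge(standard_skills, on='person_skills', how='left')
print(combine_skill.head(10))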
I have 2 dataframes:
users
user_id position
0 201 Senior Engineer
1 207 Senior System Architect
2 223 Senior account manage
3 212 Junior Manager
4 112 junior Engineer
5 311 junior python developer
df1 = pd.DataFrame({'user_id': ['201', '207', '223', '212', '112', '311'],
'position': ['Senior Engineer', 'Senior System Architect', 'Senior account manage', 'Junior Manager', 'junior Engineer', 'junior python developer']})
roles
role_id role_position
0 10 %senior%
1 20 %junior%
df2 = pd.DataFrame({'role_id': ['10', '20'],
'role_position': ['%senior%', '%junior%']})
I want to join them to get the role_id for each row in df1, using a condition something like this:
lower(df1.position) LIKE df2.role_position
I want to use the LIKE operator (as in SQL).
So it would look like this (or, even better, without role_position):
user_id position role_id role_position
0 201 Senior Engineer 10 %senior%
1 207 Senior System Architect 10 %senior%
2 223 Senior account manage 10 %senior%
3 212 Junior Manager 20 %junior%
4 112 junior Engineer 20 %junior%
5 311 junior python developer 20 %junior%
How can I do this?
Thank you for your help!
You can use str.extract()+merge():
pat = '(' + '|'.join(df2['role_position'].str.strip('%').unique()) + ')'
df1['role_position'] = '%' + df1['position'].str.lower().str.extract(pat, expand=False) + '%'
df1 = df1.merge(df2, on='role_position', how='left')
output of df1:
user_id position role_id role_position
0 201 Senior Engineer 10 %senior%
1 207 Senior System Architect 10 %senior%
2 223 Senior account manage 10 %senior%
3 212 Junior Manager 20 %junior%
4 112 junior Engineer 20 %junior%
5 311 junior python developer 20 %junior%
You can save some trouble by doing a merge directly if the seniority level always starts at the front:
print(pd.merge(df1, df2,
               left_on=df1["position"].str.split().str[0].str.lower(),
               right_on=df2["role_position"].str.strip("%")).drop("key_0", axis=1))
Otherwise you can do a pd.Series.str.extract during the merge:
import re

pat = f'({"|".join(df2["role_position"].str.strip("%"))})'
print(pd.merge(df1, df2,
               left_on=df1["position"].str.extract(pat, flags=re.IGNORECASE, expand=False).str.lower(),
               right_on=df2["role_position"].str.strip("%")).drop("key_0", axis=1))
Both yield the same result:
user_id position role_id role_position
0 201 Senior Engineer 10 %senior%
1 207 Senior System Architect 10 %senior%
2 223 Senior account manage 10 %senior%
3 212 Junior Manager 20 %junior%
4 112 junior Engineer 20 %junior%
5 311 junior python developer 20 %junior%
Possibilities:
fuzzy matching
SequenceMatcher
.extract
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

df1['Similarity'] = 0
df1['Role'] = 0

# record any role_position at least 20% similar to the position (the last match wins)
for index, row in df1.iterrows():
    for x in df2['role_position']:
        z = similar(row['position'], x)
        if z >= 0.20:
            df1.loc[index, "Similarity"] = z
            df1.loc[index, "Role"] = x
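To finish the job (a follow-up that is not part of the snippet above), the matched Role string can be translated back to its role_id with a lookup on df2:
# Map the matched '%...%' pattern back to its role_id; rows that never matched stay NaN
df1['role_id'] = df1['Role'].map(df2.set_index('role_position')['role_id'])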
You can generate a dict of mappings and then map the values:
df2['role_position'] = df2['role_position'].str.strip('%')
mappings = df2.set_index('role_position').to_dict('dict')['role_id']
>> mappings
{'senior': '10', 'junior': '20'}
Using a regular expression we can extract the roles for each position:
import re

re_roles = f"({df2['role_position'].str.cat(sep='|')})"
position = df1['position'].str.extract(re_roles, flags=re.I).iloc[:, 0].str.lower()
>> position
0 senior
1 senior
2 senior
3 junior
4 junior
5 junior
Name: 0, dtype: object
And finally map the role_id using the mappings dictionary:
df1['role_id'] = position.map(mappings)
>> df1
user_id position role_id
0 201 Senior Engineer 10
1 207 Senior System Architect 10
2 223 Senior account manage 10
3 212 Junior Manager 20
4 112 junior Engineer 20
5 311 junior python developer 20
I have lists that are categorized by name, such as:
dining = ['CARLS', 'SUBWAY', 'PIZZA']
bank = ['TRANSFER', 'VENMO', 'SAVE AS YOU GO']
and I want to set a new column to the category name if any of those strings are found in another column. Building on an example from my other question, I have the following data set (an example bank transaction list):
import pandas as pd
import numpy as np
dining = ['CARLS', 'SUBWAY', 'PIZZA']
bank = ['TRANSFER', 'VENMO', 'SAVE AS YOU GO']
data = [
[-68.23 , 'PAYPAL TRANSFER'],
[-12.46, 'RALPHS #0079'],
[-8.51, 'SAVE AS YOU GO'],
[25.34, 'VENMO CASHOUT'],
[-2.23 , 'PAYPAL TRANSFER'],
[-64.29 , 'PAYPAL TRANSFER'],
[-7.06, 'SUBWAY'],
[-7.03, 'CARLS JR'],
[-2.35, 'SHELL OIL'],
[-35.23, 'CHEVRON GAS']
]
df = pd.DataFrame(data, columns=['amount', 'details'])
df['category'] = np.nan
df
amount details category
0 -68.23 PAYPAL TRANSFER NaN
1 -12.46 RALPHS #0079 NaN
2 -8.51 SAVE AS YOU GO NaN
3 25.34 VENMO CASHOUT NaN
4 -2.23 PAYPAL TRANSFER NaN
5 -64.29 PAYPAL TRANSFER NaN
6 -7.06 SUBWAY NaN
7 -7.03 CARLS JR NaN
8 -2.35 SHELL OIL NaN
9 -35.23 CHEVRON GAS NaN
Is there an efficient way for me to update the category column to either 'dining' or 'bank', based on whether any of the strings in each list are found in df['details']?
I.e. Desired Output:
amount details category
0 -68.23 PAYPAL TRANSFER bank
1 -12.46 RALPHS #0079 NaN
2 -8.51 SAVE AS YOU GO bank
3 25.34 VENMO CASHOUT bank
4 -2.23 PAYPAL TRANSFER bank
5 -64.29 PAYPAL TRANSFER bank
6 -7.06 SUBWAY dining
7 -7.03 CARLS JR dining
8 -2.35 SHELL OIL NaN
9 -35.23 CHEVRON GAS NaN
Based on my previous question, so far I'm assuming I need to work with a new list that I create using str.extract.
We can do this with np.select since we have multiple conditions:
dining_pat = '|'.join(dining)
bank_pat = '|'.join(bank)

conditions = [
    df['details'].str.contains(dining_pat),
    df['details'].str.contains(bank_pat)
]
choices = ['dining', 'bank']
df['category'] = np.select(conditions, choices, default=np.nan)
amount details category
0 -68.23 PAYPAL TRANSFER bank
1 -12.46 RALPHS #0079 nan
2 -8.51 SAVE AS YOU GO bank
3 25.34 VENMO CASHOUT bank
4 -2.23 PAYPAL TRANSFER bank
5 -64.29 PAYPAL TRANSFER bank
6 -7.06 SUBWAY dining
7 -7.03 CARLS JR dining
8 -2.35 SHELL OIL nan
9 -35.23 CHEVRON GAS nan
You can do it with findall + a dict map:
sub = {**dict.fromkeys(dining, 'dining'), **dict.fromkeys(bank, 'bank')}
df.details.str.findall('|'.join(sub)).str[0].map(sub)
Out[146]:
0 bank
1 NaN
2 bank
3 bank
4 bank
5 bank
6 dining
7 dining
8 NaN
9 NaN
Name: details, dtype: object
#df['category'] = df.details.str.findall('|'.join(sub)).str[0].map(sub)
My dataset is based on the results of Food Inspections in the City of Chicago.
import pandas as pd
df = pd.read_csv("C:/~/Food_Inspections.csv")
df.head()
Out[1]:
Inspection ID DBA Name \
0 1609238 JR'SJAMAICAN TROPICAL CAFE,INC
1 1609245 BURGER KING
2 1609237 DUNKIN DONUTS / BASKIN ROBINS
3 1609258 CHIPOTLE MEXICAN GRILL
4 1609244 ATARDECER ACAPULQUENO INC.
AKA Name License # Facility Type Risk \
0 NaN 2442496.0 Restaurant Risk 1 (High)
1 BURGER KING 2411124.0 Restaurant Risk 2 (Medium)
2 DUNKIN DONUTS / BASKIN ROBINS 1717126.0 Restaurant Risk 2 (Medium)
3 CHIPOTLE MEXICAN GRILL 1335044.0 Restaurant Risk 1 (High)
4 ATARDECER ACAPULQUENO INC. 1910118.0 Restaurant Risk 1 (High)
Here is how often each of the facilities appear in the dataset:
df['Facility Type'].value_counts()
Out[3]:
Restaurant 14304
Grocery Store 2647
School 1155
Daycare (2 - 6 Years) 367
Bakery 316
Children's Services Facility 262
Daycare Above and Under 2 Years 248
Long Term Care 169
Daycare Combo 1586 142
Catering 123
Liquor 78
Hospital 68
Mobile Food Preparer 67
Golden Diner 65
Mobile Food Dispenser 51
Special Event 25
Shared Kitchen User (Long Term) 22
Daycare (Under 2 Years) 18
I am trying to create a new data set containing only the rows whose Facility Type has over 50 occurrences in the dataset. How would I approach this?
Please note the list of facility counts is MUCH LARGER; I have cut out most of the information as it did not contribute to the question at hand (so simply removing occurrences of "Special Event", "Shared Kitchen User", and "Daycare" is not what I'm looking for).
IIUC then you want to filter:
df.groupby('Facility Type').filter(lambda x: len(x) > 50)
Example:
In [9]:
df = pd.DataFrame({'type':list('aabcddddee'), 'value':np.random.randn(10)})
df
Out[9]:
type value
0 a -0.160041
1 a -0.042310
2 b 0.530609
3 c 1.238046
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
In [10]:
df.groupby('type').filter(lambda x: len(x) > 1)
Out[10]:
type value
0 a -0.160041
1 a -0.042310
4 d -0.754779
5 d -0.197309
6 d 1.704829
7 d -0.706467
8 e -1.039818
9 e 0.511638
Not tested, but should work.
FT = df['Facility Type'].value_counts()
df[df['Facility Type'].isin(FT.index[FT > 50])]
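An equivalent without the Python-level lambda, which is usually faster on large frames (a sketch using the same column name; df_over_50 is a name introduced here):
# transform('size') broadcasts each group's row count back onto every row,
# so a boolean mask keeps only the facility types with more than 50 rows
counts = df.groupby('Facility Type')['Facility Type'].transform('size')
df_over_50 = df[counts > 50]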
So I have a data frame where the headings I want do not currently line up:
In [1]: df = pd.read_excel('example.xlsx')
print (df.head(10))
Out [1]: Portfolio Asset Country Quantity
Unique Identifier Number of fund B24 B65 B35 B44
456 2 General Type A UNITED KINGDOM 1
123 3 General Type B US 2
789 2 General Type C UNITED KINGDOM 4
4852 4 General Type C UNITED KINGDOM 4
654 1 General Type A FRANCE 3
987 5 General Type B UNITED KINGDOM 2
321 1 General Type B GERMANY 1
951 3 General Type A UNITED KINGDOM 2
357 4 General Type C UNITED KINGDOM 3
As we can see, above the first 2 column headings there are 2 blank cells, and below the next 4 column headings are "B" numbers which I don't care about.
So 2 questions; How can I shift up the first 2 columns without having a column heading to identify them with (due to the blank cells above)?
And how can I delete just Row 2 of the remaining columns and have the data below move up to take the place of the "B" numbers?
I found some similar questions already asked (e.g. "python: shift column in pandas dataframe up by one"), but nothing that solves the particular intricacies above, I don't think.
Also I'm quite new to Python and Pandas so if this is really basic I apologise!
IIUC you can use:
# create a DataFrame from the MultiIndex in the columns
df1 = pd.DataFrame([x for x in df.columns.values])
print(df1)
0 1
0 Unique Identifier
1 Number of fund
2 Portfolio B24
3 Asset B65
4 Country B35
5 Quantity B44
# if the string in column 1 is shorter than 4 characters (the "B" codes), copy the value from column 0 into column 1
df1.loc[df1.iloc[:, 1].str.len() < 4, 1] = df1.iloc[:, 0]
print(df1)
0 1
0 Unique Identifier
1 Number of fund
2 Portfolio Portfolio
3 Asset Asset
4 Country Country
5 Quantity Quantity
# set the columns from the second column of df1
df.columns = df1.iloc[:, 1]
print(df)
0 Unique Identifier Number of fund Portfolio Asset Country \
0 456 2 General Type A UNITED KINGDOM
1 123 3 General Type B US
2 789 2 General Type C UNITED KINGDOM
3 4852 4 General Type C UNITED KINGDOM
4 654 1 General Type A FRANCE
5 987 5 General Type B UNITED KINGDOM
6 321 1 General Type B GERMANY
7 951 3 General Type A UNITED KINGDOM
8 357 4 General Type C UNITED KINGDOM
0 Quantity
0 1
1 2
2 4
3 4
4 3
5 2
6 1
7 2
8 3
EDIT by comments:
print(df.columns)
Index(['Portfolio', 'Asset', 'Country', 'Quantity'], dtype='object')
# set the first row to the column names
df.iloc[0, :] = df.columns
# reset the index
df = df.reset_index()
# set the columns from the first row
df.columns = df.iloc[0, :]
df.columns.name = None
# remove the first row
print(df.iloc[1:, :])
Unique Identifier Number of fund Portfolio Asset Country Quantity
1 456 2 General Type A UNITED KINGDOM 1
2 123 3 General Type B US 2
3 789 2 General Type C UNITED KINGDOM 4
4 4852 4 General Type C UNITED KINGDOM 4
5 654 1 General Type A FRANCE 3
6 987 5 General Type B UNITED KINGDOM 2
7 321 1 General Type B GERMANY 1
8 951 3 General Type A UNITED KINGDOM 2
9 357 4 General Type C UNITED KINGDOM 3
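If the Excel file really has a two-row header, another route (a sketch, assuming pandas labels the blank top-left cells with its usual 'Unnamed:' placeholders) is to read both rows as a MultiIndex and flatten it, keeping the second row only where the first row is blank:
import pandas as pd

df = pd.read_excel('example.xlsx', header=[0, 1])

# Keep the top label ('Portfolio', 'Asset', ...) unless it is an 'Unnamed:' placeholder,
# in which case fall back to the bottom label ('Unique Identifier', 'Number of fund')
df.columns = [bottom if str(top).startswith('Unnamed') else top
              for top, bottom in df.columns]
print(df.head(10))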