I have two separate files: one from our service providers and the other is internal (HR).
The service providers write our employees' names in different ways: some use "firstname lastname", some use the first letter of the first name plus the last name, some use "lastname firstname"... while the HR file stores the first and last names in separate columns.
DF1
Full Name
0 B.pitt
1 Mr Nickolson Jacl
2 Johnny, Deep
3 Streep Meryl
DF2
First Last
0 Brad Pitt
1 Jack Nicklson
2 Johnny Deep
3 Streep Meryl
My idea is to use str.contains to look for the first letter of the first name and the last name. I've succeeded in doing it with static values using the following code:
df1[['Full Name']][df1['Full Name'].str.contains('B')
& df1['Full Name'].str.contains('pitt')]
Which gives the following result:
Full Name
0 B.pitt
The challenge is comparing the two datasets... Any advice on that, please?
Regards
If you are just checking whether a name exists or not, this could be useful:
Because it is rare to have two people with exactly the same family name, I recommend splitting your Df1 and comparing family names first; then, to be sure, you can also compare first names.
You can easily do it with a for loop:
for i in range(len(df1_splitted)):
    if df1_splitted[i].str.contains('family you searching for').any():
        print("yes")
if you need to compare in other aspects just let me know
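A runnable sketch of that idea on the sample data from the question (the lowercasing and the exact-substring check are my assumptions about the intended logic):

```python
import pandas as pd

df1 = pd.DataFrame({'Full Name': ['B.pitt', 'Mr Nickolson Jacl', 'Johnny, Deep', 'Streep Meryl']})
df2 = pd.DataFrame({'First': ['Brad', 'Jack', 'Johnny', 'Streep'],
                    'Last': ['Pitt', 'Nicklson', 'Deep', 'Meryl']})

# For each HR last name, check whether it appears (case-insensitively)
# anywhere in a service-provider full name.
for last in df2['Last']:
    mask = df1['Full Name'].str.lower().str.contains(last.lower(), regex=False)
    if mask.any():
        print("yes:", last, "->", df1.loc[mask, 'Full Name'].tolist())
```

Note that this exact-substring check misses the misspelled 'Nicklson' / 'Nickolson' pair, which is exactly the kind of case where fuzzy matching helps.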
I suggest using the nameparser module for parsing names:
pip install nameparser
Then you can process your data frames :
from nameparser import HumanName
import pandas as pd
df1 = pd.DataFrame({'Full Name':['B.pitt','Mr Nickolson Jack','Johnny, Deep','Streep Meryl']})
df2 = pd.DataFrame({'First':['Brad', 'Jack','Johnny', 'Streep'],'Last':['Pitt','Nicklson','Deep','Meryl']})
names1 = [HumanName(name) for name in df1['Full Name']]
names2 = [HumanName(str(row[0]) + " " + str(row[1])) for i, row in df2.iterrows()]
After that you can try comparing HumanName instances, which have parsed fields. A parsed name looks like this:
<HumanName : [
title: ''
first: 'Brad'
middle: ''
last: 'Pitt'
suffix: ''
nickname: '' ]
I have used this approach to process thousands of names and merge them with the same names from other documents, and the results were good.
More about module can be found at https://nameparser.readthedocs.io/en/latest/
Hey, you could use fuzzy string matching with fuzzywuzzy.
First, create a Full Name column for df2:
df2_ = df2[['First', 'Last']].agg(lambda a: a[0] + ' ' + a[1], axis=1).rename('Full Name').to_frame()
Then merge the two dataframes by index
merged_df = df2_.merge(df1, left_index=True, right_index=True)
Now you can apply fuzz.token_sort_ratio to get the similarity (after installing fuzzywuzzy and importing fuzz):
from fuzzywuzzy import fuzz
merged_df['similarity'] = merged_df[['Full Name_x', 'Full Name_y']].apply(lambda r: fuzz.token_sort_ratio(*r), axis=1)
This results in the following dataframe. You can now filter or sort it by similarity.
Full Name_x Full Name_y similarity
0 Brad Pitt B.pitt 80
1 Jack Nicklson Mr Nickolson Jacl 80
2 Johnny Deep Johnny, Deep 100
3 Streep Meryl Streep Meryl 100
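For the filtering step, a small sketch; the merged_df here simulates the output above, and the threshold of 80 is an assumed cut-off you would tune on real data:

```python
import pandas as pd

# Simulated result of the merge + token_sort_ratio steps above.
merged_df = pd.DataFrame({
    'Full Name_x': ['Brad Pitt', 'Jack Nicklson', 'Johnny Deep', 'Streep Meryl'],
    'Full Name_y': ['B.pitt', 'Mr Nickolson Jacl', 'Johnny, Deep', 'Streep Meryl'],
    'similarity': [80, 80, 100, 100]})

# Keep only pairs at or above the (assumed) threshold.
matches = merged_df[merged_df['similarity'] >= 80]
print(matches.sort_values('similarity', ascending=False))
```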
Related
I need to split a column called Creative where each cell contains samples such as:
pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)
Where each two-letter code preceding each parenthesized section is the title of the desired column, and the codes are the same in every row. The only data that changes is what is inside the parentheses. I want the data to look like:
pn    io   ta   pt    cn    cs
2021  302  Yes  Blue  John  Doe
I tried
df[['Creative', 'Creative Size']] = df['Creative'].str.split('cs(',expand=True)
and
df['Creative Size'] = df['Creative Size'].str.replace(')','')
but got an error: missing ), unterminated subpattern at position 2. I assume it has something to do with regular expressions.
Is there an easy way to split these ? Thanks.
Use extract with named capturing groups:
import pandas as pd
# toy example
df = pd.DataFrame(data=[["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)"]], columns=["Creative"])
# extract with a named capturing group
res = df["Creative"].str.extract(
r"pn\((?P<pn>\d+)\)io\((?P<io>\d+)\)ta\((?P<ta>\w+)\)pt\((?P<pt>\w+)\)cn\((?P<cn>\w+)\)cs\((?P<cs>\w+)\)",
expand=True)
print(res)
Output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
I'd use regex to generate a list of dictionaries via comprehensions. The idea is to create a list of dictionaries that each represent rows of the desired dataframe, then constructing a dataframe out of it. I can build it in one nested comprehension:
import re
rows = [{r[0]: r[1] for r in re.findall(r'(\w{2})\((.+?)\)', c)} for c in df['Creative']]
subtable = pd.DataFrame(rows)
for col in subtable.columns:
df[col] = subtable[col].values
Basically, I regex search for instances of ab(*) and capture the two-letter prefix and the contents of the parenthesis and store them in a list of tuples. Then I create a dictionary out of the list of tuples, each of which is essentially a row like the one you display in your question. Then, I put them into a data frame and insert each of those columns into the original data frame. Let me know if this is confusing in any way!
David
Try with extractall:
names = df["Creative"].str.extractall(r"(.*?)\(.*?\)").loc[0][0].tolist()
output = df["Creative"].str.extractall(r"\((.*?)\)").unstack()[0].set_axis(names, axis=1)
>>> output
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
1 2020 301 No Red Jane Doe
Input df:
df = pd.DataFrame({"Creative": ["pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)",
"pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)"]})
We can use str.findall to extract matching column name-value pairs
pd.DataFrame(map(dict, df['Creative'].str.findall(r'(\w+)\((\w+)')))
pn io ta pt cn cs
0 2021 302 Yes Blue John Doe
Using regular expressions, different way of packaging final DataFrame:
import re
import pandas as pd
txt = 'pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)'
data = list(zip(*re.findall(r'([^\(]+)\(([^\)]+)\)', txt)))
df = pd.DataFrame([data[1]], columns=data[0])
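The same findall idea can be applied across a whole column; a sketch that builds each row as a dict instead of transposing:

```python
import re
import pandas as pd

df = pd.DataFrame({'Creative': ['pn(2021)io(302)ta(Yes)pt(Blue)cn(John)cs(Doe)',
                                'pn(2020)io(301)ta(No)pt(Red)cn(Jane)cs(Doe)']})

def parse(txt):
    # findall returns (code, value) tuples; dict() turns them into a row.
    return dict(re.findall(r'([^\(]+)\(([^\)]+)\)', txt))

parsed = pd.DataFrame([parse(t) for t in df['Creative']])
print(parsed)
```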
I'm just wondering how one might overcome the below error.
AttributeError: 'list' object has no attribute 'str'
What I am trying to do is create a new column "PrivilegedAccess"; in this column I want to write "True" if any of the names in the first_name column match the ones outlined in the "Search_for_These_values" list, and "False" if they don't.
Code
## Create list of Privileged accounts
Search_for_These_values = ['Privileged','Diagnostics','SYS','service account'] #creating list
pattern = '|'.join(Search_for_These_values) # joining list for comparison
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF.columns=[['first_name']].str.contains(pattern)
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF['PrivilegedAccess'].map({True: 'True', False: 'False'})
SAMPLE DATA:
uid last_name first_name language role email_address department
0 121 Chad Diagnostics English Team Lead Michael.chad#gmail.com Data Scientist
1 253 Montegu Paulo Spanish CIO Paulo.Montegu#gmail.com Marketing
2 545 Mitchel Susan English Team Lead Susan.Mitchel#gmail.com Data Scientist
3 555 Vuvko Matia Polish Marketing Lead Matia.Vuvko#gmail.com Marketing
4 568 Sisk Ivan English Supply Chain Lead Ivan.Sisk#gmail.com Supply Chain
5 475 Andrea Patrice Spanish Sales Graduate Patrice.Andrea#gmail.com Sales
6 365 Akkinapalli Cherifa French Supply Chain Assistance Cherifa.Akkinapalli#gmail.com Supply Chain
Note that the dtype of the first_name column is "object" and the dataframe columns are a MultiIndex (not sure how to change from MultiIndex).
Many thanks
It seems you need to select one column for str.contains and then use map or convert the boolean to strings:
Search_for_These_values = ['Privileged','Diagnostics','SYS','service account'] #creating list
pattern = '|'.join(Search_for_These_values)
PrivilegedAccounts_DF = pd.DataFrame({'first_name':['Privileged 111',
'aaa SYS',
'sss']})
print (PrivilegedAccounts_DF.columns)
Index(['first_name'], dtype='object')
print (PrivilegedAccounts_DF.loc[0, 'first_name'])
Privileged 111
print (type(PrivilegedAccounts_DF.loc[0, 'first_name']))
<class 'str'>
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF['first_name'].str.contains(pattern).astype(str)
print (PrivilegedAccounts_DF)
first_name PrivilegedAccess
0 Privileged 111 True
1 aaa SYS True
2 sss False
EDIT:
The problem is a one-level MultiIndex in the columns. First, simulate it:
PrivilegedAccounts_DF = pd.DataFrame({'first_name':['Privileged 111',
'aaa SYS',
'sss']})
#simulate problem
PrivilegedAccounts_DF.columns = [PrivilegedAccounts_DF.columns.tolist()]
print (PrivilegedAccounts_DF)
first_name
0 Privileged 111
1 aaa SYS
2 sss
#check columns
print (PrivilegedAccounts_DF.columns)
MultiIndex([('first_name',)],
)
Solution is join values, e.g. by empty string:
PrivilegedAccounts_DF.columns = PrivilegedAccounts_DF.columns.map(''.join)
So now columns names are correct:
print (PrivilegedAccounts_DF.columns)
Index(['first_name'], dtype='object')
PrivilegedAccounts_DF['PrivilegedAccess'] = PrivilegedAccounts_DF['first_name'].str.contains(pattern).astype(str)
print (PrivilegedAccounts_DF)
There might be a more elegant solution, but this should work (without using patterns):
PrivilegedAccounts_DF.loc[PrivilegedAccounts_DF['first_name'].isin(Search_for_These_values), "PrivilegedAccess"]=True
PrivilegedAccounts_DF.loc[~PrivilegedAccounts_DF['first_name'].isin(Search_for_These_values), "PrivilegedAccess"]=False
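Note that isin tests the whole cell value for equality, not substrings, so this only works when the first_name values match the list entries exactly. A quick demonstration (toy rows are made up):

```python
import pandas as pd

Search_for_These_values = ['Privileged', 'Diagnostics', 'SYS', 'service account']
df = pd.DataFrame({'first_name': ['Diagnostics', 'Privileged 111', 'Paulo']})

# isin compares whole cell values: 'Privileged 111' != 'Privileged',
# so it stays False here, unlike a str.contains substring check.
df['PrivilegedAccess'] = df['first_name'].isin(Search_for_These_values)
print(df)
```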
I have some tables from the Bureau of Labor Statistics that I converted to csv files in Python. The 'Item' column has some rows containing multiple '.' characters. I'm trying to iterate through these rows and replace each '.' with ''.
I've tried:
for row in age_df_1989['Item']:
if '.' in row:
age_df_1989['Item'].replace('.','')
Any ideas on what I can do for this?
No, calling age_df_1989['Item'].replace('.','') without assigning the result back won't change the original data. If you want to keep the loop, you need to write the modified value back to the dataframe:
for i, row in age_df_1989.iterrows():
    if '.' in row['Item']:
        age_df_1989.at[i, 'Item'] = row['Item'].replace('.', '')
Try apply
age_df_1989['Item'] = age_df_1989['Item'].apply(lambda x: x.replace('.', ''))
Simple & faster than a for loop
Use the vectorised str method replace: this is much faster than iterrows, a for loop, or apply.
You can do something as simple as
df['column name'] = df['column name'].str.replace('old value', 'new value', regex=False)
For your example, do this (pass regex=False so the '.' is treated as a literal dot rather than a regex wildcard):
age_df_1989['Item'] = age_df_1989['Item'].str.replace('.', '', regex=False)
Here's an example output of this:
c = ['Name','Item']
d = [['Bob','Good. Bad. Ugly.'],
['April','Today. Tomorrow'],
['Amy','Grape. Peach. Banana.'],
['Linda','Pink. Red. Yellow.']]
import pandas as pd
age_df_1989 = pd.DataFrame(d, columns = c)
print (age_df_1989)
age_df_1989['Item'] = age_df_1989['Item'].str.replace('.', '', regex=False)
print (age_df_1989)
Dataframe: age_df_1989 : Original
Name Item
0 Bob Good. Bad. Ugly.
1 April Today. Tomorrow
2 Amy Grape. Peach. Banana.
3 Linda Pink. Red. Yellow.
Dataframe: age_df_1989 : After the replace command
Name Item
0 Bob Good Bad Ugly
1 April Today Tomorrow
2 Amy Grape Peach Banana
3 Linda Pink Red Yellow
Here is my problem (I'm working on python) :
I have a Dataframe with columns: Index(['job_title', 'company', 'job_label', 'description'], dtype='object')
And I have a list of words that contains 300 skills:
keywords = ["C++","Data Analytics","python","R", ............ "Django"]
I need to match those keywords with each of the jobs descriptions and obtain a new dataframe saying if is true or false that C++ is in job description[0]...job description[1], job description[2] and so on.
My new dataframe will be:
columns : ['job_title', 'company', 'description', "C++", "Data Analytics",
....... "Django"]
Where each column of keywords said true or false if it match(is found) or not on the job description.
There might be other ways to structure the dataframe (I'm open to suggestions).
Hope my question is clear. I tried regex but I couldn't make it iterate through each row, and I tried a loop using the "fnmatch" library but couldn't make it work. The best approach so far was:
df["microservice"]= df.description.str.contains("microservice")
df["cloud-based architecture"] = df.description.str.contains("cloud-based architecture")
df["service oriented architecture"] = df.description.str.contains("service oriented architecture")
However, first I could not manage to make it loop through each row of the description column, so I would have to write that line 300 times, once per word (which doesn't make sense). Second, this way I have problems with short words such as "R", because it finds the letter R inside every description, so it returns True for all of them.
Iterate over the list of keywords and build each column from the description one:
for name in keywords:
    df[name] = df['description'].apply(lambda x: name in x)
EDIT:
That doesn't solve the problem with "R", though. You could add spaces around the keyword to isolate it:
for name in keywords:
    df[name] = df['description'].apply(lambda x: ' ' + str(name) + ' ' in x)
But that's really ugly, not optimised, and misses keywords at the very start or end of a description. A regular expression with word boundaries is more appropriate:
for name in keywords:
    df[name] = df['description'].str.contains(r'\b' + re.escape(name) + r'\b')
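A quick demonstration on made-up rows that word boundaries keep "R" from matching inside other words (note \b assumes the keyword begins and ends with a word character, so a keyword like "C++" would need different handling):

```python
import re
import pandas as pd

df = pd.DataFrame({'description': ['Work with R and python',
                                   'Ruby developer',
                                   'Django expert']})
keywords = ['python', 'R', 'Django']

# \b matches a word boundary, so 'R' cannot match inside 'Ruby';
# re.escape guards keywords that contain regex metacharacters.
for name in keywords:
    df[name] = df['description'].str.contains(r'\b' + re.escape(name) + r'\b')
print(df)
```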
One way is to build a regex string to identify any keyword in your string... this example is case insensitive and will find any substring matches - not just whole words...
import pandas as pd
import re
keywords = ['python', 'C++', 'admin', 'Developer']
rx = '(?i)(?P<keyword>{})'.format('|'.join(re.escape(kw) for kw in keywords))
Then with a sample DF of:
df = pd.DataFrame({
'job_description': ['C++ developer', 'traffic warden', 'Python developer', 'linux admin', 'cat herder']
})
You can find all keywords for the relevant column...
matches = df['job_description'].str.extractall(rx)
Which gives:
keyword
match
0 0 C++
1 developer
2 0 Python
1 developer
3 0 admin
Then you want to get a list of "dummies" and take the max (so you always get a 1 where a word was found) using:
dummies = pd.get_dummies(matches).groupby(level=0).max()
Which gives:
keyword_C++ keyword_Python keyword_admin keyword_developer
0 1 0 0 1
2 0 1 0 1
3 0 0 1 0
You then left join that back to your original DF:
result = df.join(dummies, how='left')
And the result is:
job_description keyword_C++ keyword_Python keyword_admin keyword_developer
0 C++ developer 1.0 0.0 0.0 1.0
1 traffic warden NaN NaN NaN NaN
2 Python developer 0.0 1.0 0.0 1.0
3 linux admin 0.0 0.0 1.0 0.0
4 cat herder NaN NaN NaN NaN
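If the NaN rows for unmatched descriptions are unwanted, the dummy columns can be filled with 0 and cast back to integers; a sketch starting from a result frame shaped like the one above (truncated to two rows):

```python
import numpy as np
import pandas as pd

# A result frame shaped like the join output above.
result = pd.DataFrame({'job_description': ['C++ developer', 'traffic warden'],
                       'keyword_C++': [1.0, np.nan],
                       'keyword_developer': [1.0, np.nan]})

# Fill unmatched rows with 0 and restore integer dtype.
dummy_cols = [c for c in result.columns if c.startswith('keyword_')]
result[dummy_cols] = result[dummy_cols].fillna(0).astype(int)
print(result)
```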
Let:
skill = "C++", or any of the others
frame = an instance of Index(['job_title', 'company', 'job_label', 'description'], dtype='object')
jobs = a list/np.array of frames, which is probably your input
A naive implementation could look a bit like this:
for skill in keywords:
for frame in jobs:
if skill in frame["description"]: # or more exact matching, but this is what's in the question
# exists
But you need to put more work into what output structure you are going to use. Just having an output array of 300 columns, most of which contain False, isn't going to be a good plan. I've never worked with pandas myself, but if these were plain numpy arrays (which pandas DataFrames are under the hood), I would add a "skills" column that enumerates them.
You can leverage .apply() like so (#Jacco van Dorp made a solid suggestion of storing all of the found skills inside a single column, which I agree is likely the best approach to your problem):
df = pd.DataFrame([['Engineer','Firm','AERO1','Work with python and Django'],
['IT','Dell','ITD4','Work with Django and R'],
['Office Assistant','Dental','OAD3','Coordinate schedules'],
['QA Engineer','Factory','QA2','Work with R and python'],
['Mechanic','Autobody','AERO1','Love the movie Django']],
columns=['job_title','company','job_label','description'])
Which yields:
job_title company job_label description
0 Engineer Firm AERO1 Work with python and Django
1 IT Dell ITD4 Work with Django and R
2 Office Assistant Dental OAD3 Coordinate schedules
3 QA Engineer Factory QA2 Work with R and python
4 Mechanic Autobody AERO1 Love the movie Django
Then define your skill set and your list comprehension to pass to .apply():
skills = ['python','R','Django']
df['skills'] = df.apply(lambda x: [i for i in skills if i in x['description'].split()], axis=1)
Which yields this column:
skills
0 [python, Django]
1 [R, Django]
2 []
3 [python, R]
4 [Django]
If you are still interested in having individual columns for each skill, I can edit my answer to provide that as well.
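For completeness, one way to expand that skills list column into one indicator column per skill (the follow-up the answer offers) is str.get_dummies on a joined string; a sketch on similar toy data:

```python
import pandas as pd

df = pd.DataFrame({'description': ['Work with python and Django',
                                   'Coordinate schedules',
                                   'Work with R and python']})
skills = ['python', 'R', 'Django']
df['skills'] = df['description'].apply(lambda x: [s for s in skills if s in x.split()])

# Join each skill list into a '|'-delimited string, then one-hot encode.
dummies = df['skills'].apply('|'.join).str.get_dummies(sep='|')
df = df.join(dummies)
print(df)
```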
I would like to create a new column in a dataframe containing pieces of 3 different columns: the first 5 letters of the last name (after removing non-alphabetical characters) if it is that long, else just the last name; then the first 2 letters of the first name; and a code appended to the end.
The code below doesn't work, but that's where I am, and it isn't close to working:
df['namecode'] = df.Last.str.replace('[^a-zA-Z]', '')[:5]+df.First.str.replace('[^a-zA-Z]', '')[:2]+str(jr['code'])
First  Last   code   namecode
jeff   White  0989   Whiteje0989
Zach   Bunt   0798   Buntza0798
ken    Black  5764   Blackke5764
Here is one approach.
Use pandas str.slice instead of trying to do string indexing.
For example:
import pandas as pd
df = pd.DataFrame(
{
'First': ['jeff', 'zach', 'ken'],
'Last': ['White^', 'Bun/t', 'Bl?ack'],
'code': ['0989', '0798', '5764']
}
)
print(df['Last'].str.replace('[^a-zA-Z]', '', regex=True).str.slice(0, 5)
      + df['First'].str.slice(0, 2) + df['code'])
#0 Whiteje0989
#1 Buntza0798
#2 Blackke5764
#dtype: object