With the following dataframe, I'm trying to create a new guest_1 column that takes the first two words in each item of the guest column. At the bottom, you can see my desired output.
Is there some sort of "if doesn't exist, then..." logic I can apply here?
I've tried the following, but the obvious difficulty is accounting for a person with a single word for a name.
df['guest_1'] = df.guest.str.split().str.get(0) + ' ' + df.guest.str.split().str.get(1)
df = pd.DataFrame(
{'date': ['2018-11-21','2018-02-26'],
'guest': ['Anthony Scaramucci & Michael Avenatti', 'Robyn'],
})
df['guest_1'] = ['Anthony Scaramucci', 'Robyn']
You can split, slice, and join. This will gracefully handle out-of-bounds slices:
df.guest.str.split().str[:2].str.join(' ')
df['guest_1'] = df.guest.str.split().str[:2].str.join(' ')
df
date guest guest_1
0 2018-11-21 Anthony Scaramucci & Michael Avenatti Anthony Scaramucci
1 2018-02-26 Robyn Robyn
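The reason the slice never errors: list slicing in Python (which .str[:2] performs element-wise) simply returns whatever exists, so a one-word name passes through untouched. A quick illustration:
words = 'Robyn'.split()   # ['Robyn']
words[:2]                 # ['Robyn']: slicing past the end just returns what exists
' '.join(words[:2])       # 'Robyn'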
I have two data frames:
import pandas as pd
first_df = pd.DataFrame({'Full Name': ['Mulligan Nick & Mary', 'Tsang S C', 'Hattie J A C '],
'Address': ['270 Claude Road', '13 Sunnyridge Place', '18A Empire Road']})
second_df = pd.DataFrame({'Owner' : ['David James Mulligan', 'Brenda Joy Mulligan ', 'Helen Kwok Hattie'],
'Add Match': ['19 Dexter Avenue', 'Claude Road ', 'Building NO 512']})
Is there any way to match only the first word in the Full Name column to the last word in the Owner column?
If there is a match, I then want to compare Address against Add match to see if there are any like values. If the first condition passes but the second condition fails, this would not be added into the new data frame.
Using a left join results in:
new_df = first_df.merge(second_df, how='left', left_on = ['Full Name', 'Address'], right_on = ['Owner', 'Add Match'])
print(new_df.head())
Full Name Address Owner Add Match
0 Mulligan Nick & Mary 270 Claude Road NaN NaN
1 Tsang S C 13 Sunnyridge Place NaN NaN
2 Hattie J A C 18A Empire Road NaN NaN
However the output wanted would look more like this:
new_df
Name Address
---- --------
Brenda Joy Mulligan Claude Road
You could take advantage of the difflib module from the Python standard library to find similarities between different columns.
For instance, you can define the following function:
from difflib import SequenceMatcher
def compare_df(left, right, col: str):
    """For each value of left[col], store the best SequenceMatcher ratio
    against all values of right[col] in a new '<col>_match_ratio' column."""
    left[f"{col}_match_ratio"] = 0
    for value in left[col]:
        best_ratio = 0
        for other in right[col]:
            result = SequenceMatcher(None, str(value), str(other)).ratio()
            if result > best_ratio:
                best_ratio = result
        left.loc[left[col] == value, f"{col}_match_ratio"] = round(best_ratio, 2)
Then:
you just have to make sure that the column you want to compare on has the same name in both dataframes
you call compare_df(second_df, first_df, "Owner"), which will add an "Owner_match_ratio" column to second_df
finally, you filter second_df on the desired minimum match ratio (0.7, i.e. 70 %, for instance): new_df = second_df.loc[second_df["Owner_match_ratio"] > 0.7, :] (put together in the sketch below)
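Put together, a minimal sketch of those three steps (the rename target and the 0.7 threshold are just example choices):
# give both frames a common column name to compare on
first_df_renamed = first_df.rename(columns={'Full Name': 'Owner'})
compare_df(second_df, first_df_renamed, 'Owner')  # adds 'Owner_match_ratio' to second_df
new_df = second_df.loc[second_df['Owner_match_ratio'] > 0.7, :]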
Inspired by this answer you could employ a similar solution.
TL;DR
import difflib

first_df[['last_name', 'start_name']] = first_df['Full Name'].str.split(' ', n=1, expand=True)
second_df['last_name'] = second_df['Owner'].str.split().str[-1]
df_final = first_df.merge(second_df, how='inner', on='last_name')
address_matches = df_final.apply(lambda x: bool(difflib.get_close_matches(x['Address'], [x['Add Match']], n=1, cutoff=0.8)), axis=1)
df_final = df_final[address_matches].drop(columns=['last_name', 'start_name', 'Full Name', 'Address']).rename(columns={'Owner': 'Name', 'Add Match': 'Address'})
Step-by-step
Initially, you extract the last name keys you want.
first_df[['last_name', 'start_name']] = first_df['Full Name'].str.split(' ', n=1, expand=True)
second_df['last_name'] = second_df['Owner'].str.split().str[-1]
PS: Here we use pandas' built-in string methods, per your instructions. But if it fits you better, you could also apply a similarity method (e.g., difflib.get_close_matches, shown below for the address part) to the names as well.
Next, you perform an inner join of these dataframes to match on the last_name key.
df_final = first_df.merge(second_df, how='inner', on='last_name')
Then you apply difflib.get_close_matches with the desired similarity (I used cutoff=0.8 because with a higher cutoff no matches were returned) to mark which rows contain matches, and subsequently keep only the rows you want.
matches_mask = df_final.apply(lambda x: bool(difflib.get_close_matches(x['Address'], [x['Add Match']], n=1, cutoff=0.8)), axis=1)
df_final = df_final[matches_mask].drop(columns=['last_name', 'start_name'])
Full Name Address Owner Add Match
Mulligan Nick & Mary 270 Claude Road Brenda Joy Mulligan Claude Road
Finally, to match the format of the result posted at the end of your question, you drop and rename some columns.
df_final.drop(columns=['Full Name', 'Address']).rename(columns={'Owner':'Name', 'Add Match': 'Address'})
Name Address
Brenda Joy Mulligan Claude Road
I have a question regarding text-file handling. My text file loads as one column. The data is scattered throughout the rows and visually looks fairly uniform, but it is still just one column. Ultimately, I'd like to append each row where a keyword is found to the end of the previous row, until the data forms one long row. Then I'll use str.split() to cut sections into columns as I need.
In Excel (VBA code below), I took this same text file, removed headers, aligned left, and searched for keywords. When a keyword is found, Excel has a nice Offset feature that lets you place or append a cell's value basically anywhere relative to the active cell via Offset(x, y).Value. Once done, I would delete the row. This allowed me to get the data into a tabular column format that I could work with.
What I Need:
The Python code below cycles down through each row looking for the keyword 'Address:'. That part works. Once it finds the keyword, the next line should append the row to the end of the previous row. This is where my problem is: I cannot find a way to get the active row number into a variable so I can use it in place of the word [index] for the active row, or [index-1] for the previous row.
Excel Code of similar task
Do
Set Rng = WorkRng.Find("Address", LookIn:=xlValues)
If Not Rng Is Nothing Then
Rng.Offset(-1, 2).Value = Rng.Value
Rng.Value = ""
End If
Loop While Not Rng Is Nothing
Python Equivalent
import pandas as pd

file = {'Test': ['Last Name: Nobody', 'First Name: Tommy',
                 'Address: 1234 West Juniper St.',
                 'Fav Toy', 'Notes', 'Time Slot']}
df = pd.DataFrame(file)
Test
0 Last Name: Nobody
1 First Name: Tommy
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I've tried the following:
for line in df.Test:
    if line.startswith('Address:'):
        # the line below does not work: `index` is never set to the active row
        df.loc[[index-1], :].values = df.loc[index-1].values + ' ' + df.loc[index].values
    else:
        pass

# df.loc[[1],:] = df.loc[1].values + ' ' + df.loc[2].values  # copies row 2 to the end of row 1;
#                                                            # works with static row numbers only
# df.drop([2,0], inplace=True)  # deletes rows from df
Expected output:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
I am trying to wrap my head around the whole series-vectorization approach, but I'm still stuck on loops, which I'm only semi-familiar with. If there is a way to achieve this, please point me in the right direction.
As always, I appreciate your time and your knowledge. Please let me know if you can help with this issue.
Thank You,
Use Series.shift on Test, then use Series.str.startswith to create a boolean mask, then use boolean indexing with this mask to update the values in the Test column:
s = df['Test'].shift(-1)
m = s.str.startswith('Address', na=False)
df.loc[m, 'Test'] += (' ' + s[m])
Result:
Test
0 Last Name: Nobody
1 First Name: Tommy Address: 1234 West Juniper St.
2 Address: 1234 West Juniper St.
3 Fav Toy
4 Notes
5 Time Slot
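If, as in your Excel routine, you also want to delete the absorbed rows afterwards, a small follow-up (this assumes every row starting with 'Address' has already been appended to its predecessor):
# drop the rows whose text was just appended to the previous row
df = df[~df['Test'].str.startswith('Address')].reset_index(drop=True)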
I have two separate files: one from our service providers and the other internal (HR).
The service providers write our employees' names in different ways: firstname lastname, first initial of the firstname plus the last name, or lastname firstname... while the HR file holds the first and last name in separate columns.
DF1
Full Name
0 B.pitt
1 Mr Nickolson Jacl
2 Johnny, Deep
3 Streep Meryl
DF2
First Last
0 Brad Pitt
1 Jack Nicklson
2 Johnny Deep
3 Streep Meryl
My idea is to use str.contains to look for the first letter of the first name and the last name. I've succeeded in doing it with static values using the following code:
df1[['Full Name']][df1['Full Name'].str.contains('B')
& df1['Full Name'].str.contains('pitt')]
Which gives the following result:
Full Name
0 B.pitt
The challenge is comparing the two datasets... Any advice on that, please?
Regards
If you are just checking whether a name exists or not, this could be useful: because it is rare for two people to share exactly the same family name, I recommend simply splitting your DF1 and comparing family names; then, to be sure, you can compare first names too.
You can easily do it with a for loop:
df1_splitted = df1['Full Name'].str.split()  # one list of words per row
for i in range(len(df1)):
    if 'family name you are searching for' in df1_splitted[i]:
        print("yes")
If you need to compare other aspects, just let me know.
I suggest using the nameparser module for parsing names:
pip install nameparser
Then you can process your data frames :
from nameparser import HumanName
import pandas as pd
df1 = pd.DataFrame({'Full Name':['B.pitt','Mr Nickolson Jack','Johnny, Deep','Streep Meryl']})
df2 = pd.DataFrame({'First':['Brad', 'Jack','Johnny', 'Streep'],'Last':['Pitt','Nicklson','Deep','Meryl']})
names1 = [HumanName(name) for name in df1['Full Name']]
names2 = [HumanName(f"{row['First']} {row['Last']}") for _, row in df2.iterrows()]
After that, you can try comparing the HumanName instances, which have parsed fields. An instance looks like this:
<HumanName : [
title: ''
first: 'Brad'
middle: ''
last: 'Pitt'
suffix: ''
nickname: '' ]
I have used this approach for processing thousands of names and merging them with the same names from other documents, and the results were good.
More about module can be found at https://nameparser.readthedocs.io/en/latest/
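For example, following the question's idea of matching on the first initial plus the last name, a rough sketch over the parsed names (how well it works depends on how nameparser splits each particular string):
matches = []
for n1 in names1:
    for n2 in names2:
        same_last = n1.last.lower() == n2.last.lower()
        same_initial = n1.first[:1].lower() == n2.first[:1].lower()
        if same_last and same_initial:
            matches.append((str(n1), str(n2)))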
Hey, you could use fuzzy string matching with fuzzywuzzy.
First create Full Name for df2
df2_ = df2[['First', 'Last']].agg(lambda a: a[0] + ' ' + a[1], axis=1).rename('Full Name').to_frame()
Then merge the two dataframes by index
merged_df = df2_.merge(df1, left_index=True, right_index=True)
Now you can apply fuzz.token_sort_ratio to get the similarity:
from fuzzywuzzy import fuzz

merged_df['similarity'] = merged_df[['Full Name_x', 'Full Name_y']].apply(lambda r: fuzz.token_sort_ratio(*r), axis=1)
This results in the following dataframe. You can now filter or sort it by similarity.
Full Name_x Full Name_y similarity
0 Brad Pitt B.pitt 80
1 Jack Nicklson Mr Nickolson Jacl 80
2 Johnny Deep Johnny, Deep 100
3 Streep Meryl Streep Meryl 100
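For instance, to keep only the confident matches (the threshold of 90 is an arbitrary choice):
good_matches = merged_df[merged_df['similarity'] >= 90]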
I have a dataframe which has some duplicate tags separated by commas in the "Tags" column. Is there a way to remove the duplicate strings from the series? I want the output in row 400 to have just Museum, Drinking, Shopping.
I can't just split on a comma and drop repeated words, because some tags in the series share words: for example, with [Museum, Art Museum, Shopping], splitting and dropping multiple 'Museum' strings would affect the distinct 'Art Museum' tag.
You can split by comma and convert to a set(), which removes duplicates, after removing leading/trailing whitespace with str.strip(). Then you can df.apply() this to your column. Note that a set does not preserve the original tag order.
df['Tags']=df['Tags'].apply(lambda x: ', '.join(set([y.strip() for y in x.split(',')])))
You can create a function that removes duplicates from a given string. Then apply this function to your column Tags.
def remove_dup(strng):
    '''
    Split the string on ', ' and rejoin the first occurrence of each tag;
    dict.fromkeys drops duplicates while preserving order.
    '''
    return ', '.join(dict.fromkeys(strng.split(', ')))

df['Tags'] = df['Tags'].apply(remove_dup)
DEMO:
import pandas as pd
my_dict = {'Tags':["Museum, Art Museum, Shopping, Museum",'Drink, Drink','Shop','Visit'],'Country':['USA','USA','USA', 'USA']}
df = pd.DataFrame(my_dict)
df['Tags'] = df['Tags'].apply(remove_dup)
df
Output:
Tags Country
0 Museum, Art Museum, Shopping USA
1 Drink USA
2 Shop USA
3 Visit USA
Without a code example from you, I've thrown together something that would work.
import pandas as pd

test = [['Museum', 'Art Museum', 'Shopping', 'Museum']]
df = pd.DataFrame()
df[0] = test
df[0] = df[0].apply(set)
Out[35]:
0
0 {Museum, Shopping, Art Museum}
One approach that avoids apply
# in your code just s = df['Tags']
s = pd.Series(['','', 'Tour',
'Outdoors, Beach, Sports',
'Museum, Drinking, Drinking, Shopping'])
(s.str.split(r',\s+', expand=True)
.stack()
.reset_index()
.drop_duplicates(['level_0',0])
.groupby('level_0')[0]
.agg(','.join)
)
Output:
level_0
0
1
2 Tour
3 Outdoors,Beach,Sports
4 Museum,Drinking,Shopping
Name: 0, dtype: object
There may be fancier ways of doing this kind of thing, but this will do the job.
Make it lower-case:
data['tags'] = data['tags'].str.lower()
Split every row in the tags column on the comma; this returns a list of strings:
data['tags'] = data['tags'].str.split(',')
Map str.strip over every element of each list (to remove leading/trailing spaces), then apply set to drop duplicate words:
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))
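Putting the three steps together on a small frame (sample data borrowed from the question's tags):
import pandas as pd

data = pd.DataFrame({'tags': ['Museum, Drinking, Drinking, Shopping']})
data['tags'] = data['tags'].str.lower()
data['tags'] = data['tags'].str.split(',')
data['tags'] = data['tags'].apply(lambda x: set(map(str.strip, x)))
print(data)  # one set per row, e.g. {'museum', 'drinking', 'shopping'}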
I have a column in a pandas dataframe which has items like the following:
SubBrand
Sam William Mathew
Jonty Rodes
Chris Gayle
I want to create a new column (SubBrand_new) such as
SubBrand_new
0 SWM
1 JR
2 CG
I am using this piece of code,
df1["SubBrand_new"] = "".join([x[0] for x in (df1["SubBrand"].str.split())])
but I'm not able to get what I'm looking for. Can anybody help?
We can do split with expand=True, take the first letter of each column, and sum row-wise, i.e.:
df['SubBrand'].str.split(expand=True).apply(lambda x: x.str[0]).fillna('').sum(axis=1)
0 SWM
1 JR
2 CG
dtype: object
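Step by step, the chain works like this (intermediate results shown as comments; the variable names are just for illustration):
step = df['SubBrand'].str.split(expand=True)   # one word per column, None-padded
#        0        1       2
# 0    Sam  William  Mathew
# 1  Jonty    Rodes    None
# 2  Chris    Gayle    None
initials = step.apply(lambda col: col.str[0])  # first character of each cell, NaN for None
result = initials.fillna('').sum(axis=1)       # row-wise string concatenation: SWM, JR, CG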
You want to apply a function to every line and return a new column with its result. This kind of operation can be done with the .apply() method; a simple = assignment will not do the trick. A solution in the spirit of your code would be:
df = pd.DataFrame({'Name': ['Marcus Livius Drussus',
'Lucius Cornelius Sulla',
'Gaius Julius Caesar']})
df['Abrev'] = df.Name.apply(lambda x: "".join([y[0] for y in (x.split())]))
Which yields
df
Name Abrev
0 Marcus Livius Drussus MLD
1 Lucius Cornelius Sulla LCS
2 Gaius Julius Caesar GJC
EDIT:
I compared it to the other solution, expecting the apply() method with join() to be pretty slow. I was surprised to find that it is in fact faster. Setup:
N = 3000000
bank = pd.util.testing.rands_array(3,N)
vec = [bank[3*i] + ' ' + bank[3*i+1] + ' ' + bank[3*i+2] for i in range(N // 3)]
df = pd.DataFrame({'Name': vec})
I find:
df.Name.apply(lambda x: "".join([y[0] for y in (x.split())]))
executed in 581ms
df.Name.str.split(expand=True).apply(lambda x: x.str[0]).fillna('').sum(axis=1)
executed in 2.81s