I needed to remove the double quote character (") from a column, i.e. replace '"' with ''. So I tried the following:
Approach 1
test.Name = test.Name.replace('"', '')
test_label.name = test_label.name.replace('"', '')
Both dataframes had the same values, so the set difference between the two columns should be empty. To my surprise, it was not. I tried this:
set(test.Name) - set(test_label.name)
{'Assaf Khalil, Mrs. Mariana (Miriam")"',
'Cotterill, Mr. Henry Harry""',
'Coutts, Mrs. William (Winnie Minnie" Treanor)"',
'Daly, Miss. Margaret Marcella Maggie""',
'Dean, Miss. Elizabeth Gladys Millvina""',
'Hocking, Miss. Ellen Nellie""',
'Johnston, Master. William Arthur Willie""',
'Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"',
'Katavelas, Mr. Vassilios (Catavelas Vassilios")"',
'Khalil, Mrs. Betros (Zahie Maria" Elias)"',
'Lindeberg-Lind, Mr. Erik Gustaf (Mr Edward Lingrey")"',
'McCarthy, Miss. Catherine Katie""',
'Moubarek, Mrs. George (Omine Amenia" Alexander)"',
'Nakid, Mrs. Said (Waika Mary" Mowad)"',
'Nourney, Mr. Alfred (Baron von Drachstedt")"',
'Riihivouri, Miss. Susanna Juhantytar Sanni""',
'Riordan, Miss. Johanna Hannah""',
'Rosenshine, Mr. George (Mr George Thorne")"',
'Thomas, Mrs. Alexander (Thamine Thelma")"',
'Wells, Mrs. Arthur Henry (Addie" Dart Trevaskis)"',
'Wheeler, Mr. Edwin Frederick""',
'Willer, Mr. Aaron (Abi Weller")"'}
I could still see " in the values, which means replace didn't work. So I tried another approach.
Approach 2
test.Name = test.Name.str.replace('"', '', regex=False)
test_label.name = test_label.name.str.replace('"', '', regex=False)
set(test.Name) - set(test_label.name)
set()
The second approach returned what I expected. So my question is: why didn't df.col.replace() replace the values?
By inspection we can determine the type of df.Name and df.Name.str:
print(type(df.Name)) # <class 'pandas.core.series.Series'>
print(type(df.Name.str)) # <class 'pandas.core.strings.StringMethods'>
Then, we can look up Series and StringMethods in the pandas documentation. The following are the signatures for their respective replace methods:
Series.str.replace(self, pat, repl, n=-1, case=None, flags=0, regex=True)
Series.replace(self, to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
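Note that in these signatures the default value of Series.replace's regex argument is False, while the one for Series.str.replace is True (pandas 2.0 later changed the Series.str.replace default to False as well). More importantly, the two methods match differently: with regex=False, Series.replace only replaces values that are equal to to_replace as a whole, so a cell that merely contains a double quote is left untouched, whereas Series.str.replace always replaces matching substrings inside each string. So, if you want Series.replace to remove the double quote marks, you have to set its regex argument to True.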
Here is an example comparing the results of Series.replace with regex = False and regex = True with that of Series.str.replace:
import pandas as pd
data = {
'Name':
[
'Assaf Khalil, Mrs. Mariana (Miriam")"',
'Cotterill, Mr. Henry Harry""',
'Coutts, Mrs. William (Winnie Minnie" Treanor)"',
'Daly, Miss. Margaret Marcella Maggie""',
'Dean, Miss. Elizabeth Gladys Millvina""',
'Hocking, Miss. Ellen Nellie""',
'Johnston, Master. William Arthur Willie""',
'Johnston, Mrs. Andrew G (Elizabeth Lily" Watson)"',
'Katavelas, Mr. Vassilios (Catavelas Vassilios")"',
'Khalil, Mrs. Betros (Zahie Maria" Elias)"',
'Lindeberg-Lind, Mr. Erik Gustaf (Mr Edward Lingrey")"',
'McCarthy, Miss. Catherine Katie""',
'Moubarek, Mrs. George (Omine Amenia" Alexander)"',
'Nakid, Mrs. Said (Waika Mary" Mowad)"',
'Nourney, Mr. Alfred (Baron von Drachstedt")"',
'Riihivouri, Miss. Susanna Juhantytar Sanni""',
'Riordan, Miss. Johanna Hannah""',
'Rosenshine, Mr. George (Mr George Thorne")"',
'Thomas, Mrs. Alexander (Thamine Thelma")"',
'Wells, Mrs. Arthur Henry (Addie" Dart Trevaskis)"',
'Wheeler, Mr. Edwin Frederick""',
'Willer, Mr. Aaron (Abi Weller")"'
]
}
df1 = pd.DataFrame.from_dict(data)
df2 = pd.DataFrame.from_dict(data)
df3 = pd.DataFrame.from_dict(data)
df1.Name = df1.Name.replace('"', '', regex = True)
df2.Name = df2.Name.replace('"', '', regex = False)
df3.Name = df3.Name.str.replace('"', '')
print("df1 equals df2?:", df1.equals(df2))
print("df1 equals df3?:", df1.equals(df3))
print(set(df1.Name) - set(df2.Name))
print(set(df1.Name) - set(df3.Name))
Output:
df1 equals df2?: False
df1 equals df3?: True
{'Moubarek, Mrs. George (Omine Amenia Alexander)', 'McCarthy, Miss. Catherine Katie', 'Cotterill, Mr. Henry Harry', 'Katavelas, Mr. Vassilios (Catavelas Vassilios)', 'Coutts, Mrs. William (Winnie Minnie Treanor)', 'Hocking, Miss. Ellen Nellie', 'Wheeler, Mr. Edwin Frederick', 'Thomas, Mrs. Alexander (Thamine Thelma)', 'Johnston, Mrs. Andrew G (Elizabeth Lily Watson)', 'Dean, Miss. Elizabeth Gladys Millvina', 'Willer, Mr. Aaron (Abi Weller)', 'Nourney, Mr. Alfred (Baron von Drachstedt)', 'Wells, Mrs. Arthur Henry (Addie Dart Trevaskis)', 'Assaf Khalil, Mrs. Mariana (Miriam)', 'Daly, Miss. Margaret Marcella Maggie', 'Johnston, Master. William Arthur Willie', 'Riihivouri, Miss. Susanna Juhantytar Sanni', 'Rosenshine, Mr. George (Mr George Thorne)', 'Nakid, Mrs. Said (Waika Mary Mowad)', 'Riordan, Miss. Johanna Hannah', 'Lindeberg-Lind, Mr. Erik Gustaf (Mr Edward Lingrey)', 'Khalil, Mrs. Betros (Zahie Maria Elias)'}
set()
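One way to see why df2 kept its quotes: without regex=True, Series.replace compares each whole cell with to_replace, while Series.str.replace works on substrings. The following is a minimal sketch on a made-up three-element Series (the data is invented for illustration, not taken from the question). regex is passed explicitly everywhere so the result does not depend on version-specific defaults.

```python
import pandas as pd

# Hypothetical data: one cell that is exactly '"', two that only contain it.
s = pd.Series(['"', 'Harry"', 'Katie""'])

# Series.replace without regex compares each whole cell to to_replace.
whole = s.replace('"', '', regex=False)

# Series.replace with regex=True substitutes inside each string.
inner = s.replace('"', '', regex=True)

# Series.str.replace always operates on substrings.
via_str = s.str.replace('"', '', regex=False)

print(whole.tolist())
print(inner.tolist())
print(via_str.tolist())
```

Only the first cell changes in `whole`, because it is the only one whose entire value equals the double quote.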
For example, I want to find all the people who have "Abbott" in their name:
0 Abbing, Mr. Anthony
1 Abbott, Mr. Rossmore Edward
2 Abbott, Mrs. Stanton (Rosa Hunt)
3 Abelson, Mr. Samuel
4 Abelson, Mrs. Samuel (Hannah Wizosky)
...
886 de Mulder, Mr. Theodore
887 de Pelsmaeker, Mr. Alfons
888 del Carlo, Mr. Sebastiano
889 van Billiard, Mr. Austin Blyler
890 van Melkebeke, Mr. Philemon
Name: Name, Length: 891, dtype: object
df.loc[name in df["Name"]]
I tried this and it didn't work; it raised:
'False: boolean label can not be used without a boolean index'
You can use str.contains on the column you are interested in searching:
>>> import pandas as pd
>>> df = pd.DataFrame(data={'Name': ['Smith', 'Jones', 'Smithson']})
>>> df
Name
0 Smith
1 Jones
2 Smithson
>>> df[df['Name'].str.contains('Smith')]
Name
0 Smith
2 Smithson
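As for why the original attempt fails: `in` applied to a Series tests the index labels, not the values, so `name in df["Name"]` evaluates to a single False, and df.loc[False] then raises the error shown in the question. Below is a small sketch using a variant of the toy frame with one missing value added (an assumption, to show the na argument). regex=False and na=False are optional arguments added here to treat the search term literally and to count missing names as non-matches.

```python
import pandas as pd

df = pd.DataFrame(data={'Name': ['Smith', 'Jones', 'Smithson', None]})

# `in` on a Series checks the index labels (0, 1, 2, 3), not the values.
print('Smith' in df['Name'])
print(0 in df['Name'])

# To test for an exact value, look at the underlying values instead.
print('Smith' in df['Name'].values)

# Substring search per row; produces a boolean mask usable for indexing.
mask = df['Name'].str.contains('Smith', regex=False, na=False)
matches = df[mask]
print(matches)
```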
DataFrame
The solution that came to my mind is:
dataset['Name'].loc[dataset['Sex'] == 'female'].value_counts().idxmax()
But this is not straightforward, because for married women the husband's name comes after "Mrs." and the woman's own name is in parentheses, so I need to split the string somehow.
Input data:
df = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry', 'Moran, Mr. James', 'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard', 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)', 'Nasser, Mrs. Nicholas (Adele Achem)'],
'Sex': ['male', 'female', 'female', 'female', 'male', 'male', 'male', 'male', 'female', 'female'],
})
Task 4: Name the most popular female name on the ship.
'some code'
Output: Anna #The most popular female name
Task 5: Name the most popular male name on the ship.
'some code'
Output: Wilhelm #The most popular male name
Quick and dirty would be something like:
from collections import Counter
# Random list of names
your_lst = ["Mrs Braun", "Allen, Mr. Timothy J", "Allen, Mr. Henry William"]
# Split names by space, and flatten the list.
your_lst_flat = [item for sublist in [x.split(" ") for x in your_lst ] for item in sublist]
# Count occurrences. With this you will get a count of all the values, including Mr and Mrs. But you can just ignore these.
Counter(your_lst_flat).most_common()
IIUC, you can use a regex to extract either the first name after the title or, for Mrs., the name right after the opening parenthesis:
s = df['Name'].str.extract(r'((?:(?<=Mr. )|(?<=Miss. )|(?<=Master. ))\w+|(?<=\()\w+)',
expand=False)
s.groupby(df['Sex']).value_counts()
output:
Sex Name
female Adele 1
Elisabeth 1
Florence 1
Laina 1
Lily 1
male Gosta 1
James 1
Owen 1
Timothy 1
William 1
Name: Name, dtype: int64
Once you have s, to get the most frequent female name(s):
s[df['Sex'].eq('female')].mode()
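Putting the extraction and the mode together for both sexes could look like the sketch below. The two rows marked hypothetical are invented so that each group has a clear winner, the dots in the titles are escaped (a small tightening of the pattern above), and `top` is just an illustrative name. Note that mode() returns ties in sorted order, so `.iloc[0]` picks the alphabetically first name when several are equally frequent.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': [
        'Braund, Mr. Owen Harris',
        'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
        'Heikkinen, Miss. Laina',
        'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
        'Allen, Mr. William Henry',
        'Doe, Miss. Lily',          # hypothetical row
        'Roe, Mr. William James',   # hypothetical row
    ],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'female', 'male'],
})

# First word after "Mr. ", "Miss. " or "Master. ", or the first word
# after an opening parenthesis (the woman's own name for "Mrs.").
pattern = r'((?:(?<=Mr\. )|(?<=Miss\. )|(?<=Master\. ))\w+|(?<=\()\w+)'
first = df['Name'].str.extract(pattern, expand=False)

# Most frequent first name within each sex.
top = first.groupby(df['Sex']).agg(lambda names: names.mode().iloc[0])
print(top)
```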
Code sample :
def parse_first_name_female(name):
first = name.str.extract(r"Mrs\.\s+[^(]*\((\w+)", expand=False)
first.loc[first.isna()] = name.str.extract(r"\.\s+(\w+)", expand=False)
return first
female_names = parse_first_name_female(dataset.loc[dataset['Sex']=='female', 'Name'])
This code returns:
print(female_names)
1 Florence
2 Laina
3 Lily
8 Elisabeth
9 Adele
...
880 Imanita
882 Gerda
885 Margaret
887 Margaret
888 Catherine
I need this code to return just the first name in the result ('Florence').
Are you looking for .iloc?
def parse_first_name_female(name):
first = name.str.extract(r"Mrs\.\s+[^(]*\((\w+)", expand=False)
first.loc[first.isna()] = name.str.extract(r"\.\s+(\w+)", expand=False)
return first
female_names = parse_first_name_female(dataset.loc[dataset['Sex']=='female', 'Name'])
print(female_names.iloc[0])
# Output
Florence
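The reason .iloc is the right tool here is that the filtered Series keeps its original index labels (1, 2, 3, 8, ...), so there is no label 0 to look up. A minimal sketch with a hand-built Series standing in for female_names (the three values are copied from the output above):

```python
import pandas as pd

# Index labels mimic the filtered output: they do not start at 0.
female_names = pd.Series(['Florence', 'Laina', 'Lily'], index=[1, 2, 3])

print(female_names.iloc[0])  # first element by position
print(female_names.loc[1])   # element whose index label is 1

# female_names.loc[0] would raise a KeyError, since no row is labelled 0.
```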
Setup:
df = pd.DataFrame({'Name': ['Braund, Mr. Owen Harris', 'Cumings, Mrs. John Bradley (Florence Briggs Thayer)', 'Heikkinen, Miss. Laina', 'Futrelle, Mrs. Jacques Heath (Lily May Peel)', 'Allen, Mr. William Henry', 'Moran, Mr. James', 'McCarthy, Mr. Timothy J', 'Palsson, Master. Gosta Leonard', 'Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)', 'Nasser, Mrs. Nicholas (Adele Achem)'],
'Sex': ['male', 'female', 'female', 'female', 'male', 'male', 'male', 'male', 'female', 'female'],
})
I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.
My dataframe:
Passengers
1 Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff
2 Sally Muller, President, Mark Smith, Vicepresident
3 Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff
4 Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Parker, Special Effects
5 Sally Muller, President, John Doe, Chief of Staff, Peter Parker, Special Effects, Lydia Johnson, Vice Chief of Staff
...
desired form of df:
Passengers
1 Sally Muller, Mark Smith, John Doe
2 Sally Muller, Mark Smith
3 Sally Muller, Mark Smith, John Doe
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller, John Doe, Peter Parker, Lydia Johnson
...
Up to now I did it with endless handmade copy/paste regex list:
df = df.replace(r'President,','', regex=True)
df = df.replace(r'Vicepresident,','', regex=True)
df = df.replace(r'Chief of Staff,','', regex=True)
df = df.replace(r'Special Effects,','', regex=True)
df = df.replace(r'Vice Chief of Staff,','', regex=True)
...
Is there a more comfortable way to do this?
Edit
More accurate example of original df:
Passengers
1 Sally Muller, President, EL Mark Smith, John Doe, Chief of Staff, Peter Gordon, Director of Central Command
2 Sally Muller, President, EL Mark Smith, Vicepresident
3 Sally Muller, President, EL Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Gordon, Dir CC
4 Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Parker, Special Effects
5 President Sally Muller, John Doe Chief of Staff, Peter Parker, Special Effects, Lydia Johnson , Vice Chief of Staff
...
desired form of df:
Passengers
1 Sally Muller, Mark Smith, John Doe, Peter Gordon
2 Sally Muller, Mark Smith
3 Sally Muller, Mark Smith, John Doe, Peter Gordon
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller, John Doe, Peter Parker, Lydia Johnson
...
Up to now I did it with endless handmade copy/paste regex list:
df = df.replace(r'President','', regex=True)
df = df.replace(r'Director of Central Command,','', regex=True)
df = df.replace(r'Dir CC','', regex=True)
df = df.replace(r'Vicepresident','', regex=True)
df = df.replace(r'Chief of Staff','', regex=True)
df = df.replace(r'Special Effects','', regex=True)
df = df.replace(r'Vice Chief of Staff','', regex=True)
...
messy output is like:
Passengers
1 Sally Muller, , Mark Smith, John Doe, , Peter Gordon,
2 Sally Muller, Mark Smith,
3 Sally Muller, Mark Smith,, John Doe, Peter Gordon
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller,, John Doe, Peter Parker , Lydia Johnson,
...
If every passenger has a title, then you can use str.split + explode, select every second item starting from the first, then group by the index and join back (splitting on ', ' rather than ',' avoids leftover leading spaces):
out = df['Passengers'].str.split(', ').explode()[::2].groupby(level=0).agg(', '.join)
or use str.split and apply a lambda that does the selection and the join:
out = df['Passengers'].str.split(', ').apply(lambda x: ', '.join(x[::2]))
Output:
0 Sally Muller, Mark Smith, John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia...
Edit:
If not everyone has a title, then you can create a set of titles, split each row and filter out the titles. If the order of the names in each row doesn't matter, then you can use set difference and cast each set to a list in a list comprehension:
titles = {'President', 'Vicepresident', 'Chief of Staff', 'Special Effects', 'Vice Chief of Staff'}
out = pd.Series([list(set(x.split(', ')) - titles) for x in df['Passengers']])
If order matters, then you can use a nested list comprehension:
out = pd.Series([[i for i in x.split(', ') if i not in titles] for x in df['Passengers']])
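To get back comma-separated strings like the desired output rather than lists, the filtered parts can be joined again. The sketch below is one way to do that under the same assumption as above: every title is its own comma-separated item. `drop_titles` is a hypothetical helper name, and the two rows are copied from the question.

```python
import pandas as pd

df = pd.DataFrame({'Passengers': [
    'Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff',
    'Sally Muller, President, John Doe, Chief of Staff, Peter Parker, '
    'Special Effects, Lydia Johnson, Vice Chief of Staff',
]})

titles = {'President', 'Vicepresident', 'Chief of Staff',
          'Special Effects', 'Vice Chief of Staff'}

def drop_titles(cell):
    """Split on commas, trim whitespace, drop titles, keep the order."""
    parts = [part.strip() for part in cell.split(',')]
    return ', '.join(p for p in parts if p and p not in titles)

out = df['Passengers'].apply(drop_titles)
print(out.tolist())
```

This does not handle titles fused with a name in the same item, such as "President Sally Muller" in the edited example.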
This is one case where apply is actually faster than explode:
df2 = df['Passengers'].apply(lambda x: ', '.join(x.split(', ')[::2])) #.to_frame() # if dataframe needed
output:
Passengers
0 Sally Muller, Mark Smith, John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia Jo...
We can build a single regex pattern that matches every title to be removed and replace the matches with an empty string.
This also handles cases where a passenger does not have a title.
df2 = df['Passengers'].str.replace("(President)|(Vicepresident)|(Chief of Staff)|(Special Effects)|(Vice Chief of Staff)", "",regex=True).replace("( ,)", "", regex=True).str.strip().str.rstrip(",")
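Rather than typing the alternation by hand, the pattern can be assembled from a list. This is a sketch, not the answer's exact method: `titles`, `pattern` and `cleaned` are illustrative names, re.escape guards against regex metacharacters in a title, and sorting longest-first is a precaution so a longer title is preferred over a shorter one it contains. The second row is adapted from the edited example in the question.

```python
import re
import pandas as pd

titles = ['President', 'Vicepresident', 'Chief of Staff',
          'Special Effects', 'Vice Chief of Staff']

# Longest titles first, each escaped, joined into one alternation.
pattern = '|'.join(re.escape(t) for t in sorted(titles, key=len, reverse=True))

df = pd.DataFrame({'Passengers': [
    'Sally Muller, President, Mark Smith, Vicepresident',
    'President Sally Muller, John Doe Chief of Staff, Lydia Johnson , Vice Chief of Staff',
]})

cleaned = (
    df['Passengers']
    .str.replace(pattern, '', regex=True)   # remove every title
    .str.split(',')                         # break into candidate names
    .apply(lambda parts: ', '.join(p.strip() for p in parts if p.strip()))
)
print(cleaned.tolist())
```

Splitting and re-joining after the replacement takes care of the empty items and stray spaces that the removal leaves behind.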
I have dataset. Here is the column of 'Name':
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
...
151 Pears, Mrs. Thomas (Edith Wearne)
152 Meo, Mr. Alfonzo
153 van Billiard, Mr. Austin Blyler
154 Olsen, Mr. Ole Martin
155 Williams, Mr. Charles Duane
and I need to extract the first name, status, and second name. When I try this on a simple string, it works:
full_name="Braund, Mr. Owen Harris"
first_name=full_name.split(',')[0]
second_name=full_name.split('.')[1]
print('First name:',first_name)
print('Second name:',second_name)
status = full_name.replace(first_name, '').replace(',','').split('.')[0]
print('Status:',status)
>First name: Braund
>Second name: Owen Harris
>Status: Mr
But after trying to do this with pandas, I fail with the status:
df['first_Name'] = df['Name'].str.split(',').str.get(0) # this is OK, works well
But after this:
status= df['Name'].str.replace(df['first_Name'], '').replace(',','').split('.').str.get(0)
I get an error:
>>TypeError: 'Series' objects are mutable, thus they cannot be hashed
What are possible solutions?
Edit: Thanks for the answers and the suggestion to extract columns. I tried:
def extract_name_data(row):
row.str.extract('(?P<first_name>[^,]+), (?P<status>\w+.) (?P<second_name>[^(]+\w) ?')
last_name = row['second_name']
title = row['status']
first_name = row['first_name']
return first_name, second_name, status
and get
AttributeError: 'str' object has no attribute 'str'
What can be done? row is meant to be df['Name'].
You could use str.extract with named capturing groups:
df['Name'].str.extract(r'(?P<first_name>[^,]+), (?P<status>\w+.) (?P<second_name>[^(]+\w) ?')
output:
first_name status second_name
0 Braund Mr. Owen Harris
1 Cumings Mrs. John Bradley
2 Heikkinen Miss. Laina
3 Futrelle Mrs. Jacques Heath
4 Allen Mr. William Henry
5 Pears Mrs. Thomas
6 Meo Mr. Alfonzo
7 van Billiard Mr. Austin Blyler
8 Olsen Mr. Ole Martin
9 Williams Mr. Charles Duane
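str.extract returns a new DataFrame with one column per named group, so it has to be called on the whole column (not on a single row string) and then attached to the original frame. A minimal sketch, assuming the three sample names below; `parts` is an illustrative name, and the dot after the title is escaped here, which is slightly stricter than the pattern above:

```python
import pandas as pd

df = pd.DataFrame({'Name': [
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'van Billiard, Mr. Austin Blyler',
]})

# One column per named group: first_name, status, second_name.
parts = df['Name'].str.extract(
    r'(?P<first_name>[^,]+), (?P<status>\w+\.) (?P<second_name>[^(]+\w)'
)

# Attach the extracted columns to the original frame by index.
df = df.join(parts)
print(df[['first_name', 'status', 'second_name']])
```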
You can also put your original code, with slight modifications, into the pandas .apply() function:
Just replace your Python variable names with the pandas column names.
For example, replace full_name with x['Name'] and first_name with x['first_Name'] inside the lambda passed to .apply():
df['status'] = df.apply(lambda x: x['Name'].replace(x['first_Name'], '').replace(',','').split('.')[0], axis=1)
Though it may not be the most efficient way of doing it, it is an easy way to turn your existing Python code into a working pandas version.
Result:
print(df)
Name first_Name status
0 Braund, Mr. Owen Harris Braund Mr
1 Cumings, Mrs. John Bradley (Florence Briggs Th... Cumings Mrs
2 Heikkinen, Miss. Laina Heikkinen Miss
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Futrelle Mrs
4 Allen, Mr. William Henry Allen Mr
151 Pears, Mrs. Thomas (Edith Wearne) Pears Mrs
152 Meo, Mr. Alfonzo Meo Mr
153 van Billiard, Mr. Austin Blyler van Billiard Mr
154 Olsen, Mr. Ole Martin Olsen Mr
155 Williams, Mr. Charles Duane Williams Mr
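If the row-wise apply turns out to be too slow, a vectorized alternative is to pull the status out with a single str.extract call. This is a sketch under the assumption that every name has the "Surname, Title. Rest" layout shown above; the pattern is not from the original answer.

```python
import pandas as pd

df = pd.DataFrame({'Name': [
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'van Billiard, Mr. Austin Blyler',
]})

df['first_Name'] = df['Name'].str.split(',').str.get(0)

# Capture everything between the first comma and the following dot.
df['status'] = df['Name'].str.extract(r',\s*([^.]+)\.', expand=False)
print(df[['first_Name', 'status']])
```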