I am looking for an efficient way to remove unwanted parts from strings in a DataFrame column.
My dataframe:
Passengers
1 Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff
2 Sally Muller, President, Mark Smith, Vicepresident
3 Sally Muller, President, Mark Smith, Vicepresident, John Doe, Chief of Staff
4 Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Parker, Special Effects
5 Sally Muller, President, John Doe, Chief of Staff, Peter Parker, Special Effects, Lydia Johnson, Vice Chief of Staff
...
desired form of df:
Passengers
1 Sally Muller, Mark Smith, John Doe
2 Sally Muller, Mark Smith
3 Sally Muller, Mark Smith, John Doe
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller, John Doe, Peter Parker, Lydia Johnson
...
Up to now I have done it with an endless, hand-made copy/paste list of regex replacements:
df = df.replace(r'President,','', regex=True)
df = df.replace(r'Vicepresident,','', regex=True)
df = df.replace(r'Chief of Staff,','', regex=True)
df = df.replace(r'Special Effects,','', regex=True)
df = df.replace(r'Vice Chief of Staff,','', regex=True)
...
Is there a more comfortable way to do this?
Edit
More accurate example of original df:
Passengers
1 Sally Muller, President, EL Mark Smith, John Doe, Chief of Staff, Peter Gordon, Director of Central Command
2 Sally Muller, President, EL Mark Smith, Vicepresident
3 Sally Muller, President, EL Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Gordon, Dir CC
4 Mark Smith, Vicepresident, John Doe, Chief of Staff, Peter Parker, Special Effects
5 President Sally Muller, John Doe Chief of Staff, Peter Parker, Special Effects, Lydia Johnson , Vice Chief of Staff
...
desired form of df:
Passengers
1 Sally Muller, Mark Smith, John Doe, Peter Gordon
2 Sally Muller, Mark Smith
3 Sally Muller, Mark Smith, John Doe, Peter Gordon
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller, John Doe, Peter Parker, Lydia Johnson
...
Up to now I have done it with an endless, hand-made copy/paste list of regex replacements:
df = df.replace(r'President','', regex=True)
df = df.replace(r'Director of Central Command,','', regex=True)
df = df.replace(r'Dir CC','', regex=True)
df = df.replace(r'Vicepresident','', regex=True)
df = df.replace(r'Chief of Staff','', regex=True)
df = df.replace(r'Special Effects','', regex=True)
df = df.replace(r'Vice Chief of Staff','', regex=True)
...
The messy output looks like:
Passengers
1 Sally Muller, , Mark Smith, John Doe, , Peter Gordon,
2 Sally Muller, Mark Smith,
3 Sally Muller, Mark Smith,, John Doe, Peter Gordon
4 Mark Smith, John Doe, Peter Parker
5 Sally Muller,, John Doe, Peter Parker , Lydia Johnson,
...
If every passenger has a title, you can use str.split + explode, select every second item starting from the first, then groupby the index and join back:
out = df['Passengers'].str.split(', ').explode()[::2].groupby(level=0).agg(', '.join)
or use str.split and apply a lambda that does the selection + join:
out = df['Passengers'].str.split(', ').apply(lambda x: ', '.join(x[::2]))
Output:
0 Sally Muller, Mark Smith, John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia...
Edit:
If not everyone has a title, then you can create a set of titles, split, and filter out the titles. If the order of the names doesn't matter in each row, then you can use set difference and cast each set to a list in a list comprehension:
titles = {'President', 'Vicepresident', 'Chief of Staff', 'Special Effects', 'Vice Chief of Staff'}
out = pd.Series([list(set(x.split(', ')) - titles) for x in df['Passengers']])
If order matters, then you can use a nested list comprehension:
out = pd.Series([[i for i in x.split(', ') if i not in titles] for x in df['Passengers']])
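If you need the same comma-joined string form as in the desired output, a small variation of the comprehension above joins the kept names back together (same titles set as above):
# keep original order, drop any item that is exactly a title, re-join with ', '
out = pd.Series([', '.join(i for i in x.split(', ') if i not in titles) for x in df['Passengers']])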
This is one case where apply is actually faster than explode:
df2 = df['Passengers'].apply(lambda x: ', '.join(x.split(', ')[::2])) #.to_frame() # if dataframe needed
output:
Passengers
0 Sally Muller, Mark Smith, John Doe
1 Sally Muller, Mark Smith
2 Sally Muller, Mark Smith, John Doe
3 Mark Smith, John Doe, Peter Parker
4 Sally Muller, John Doe, Peter Parker, Lydia Jo...
We can create a single regex pattern that matches every string you need to remove, and replace them all at once.
This can also handle situations where a passenger has no title.
df2 = (df['Passengers']
       .str.replace("President|Vicepresident|Chief of Staff|Special Effects|Vice Chief of Staff", "", regex=True)
       .replace(" ,", "", regex=True)
       .str.strip()
       .str.rstrip(","))
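If the list of titles keeps growing, you can also build that pattern programmatically instead of typing the alternation by hand. A minimal sketch, assuming you maintain the titles in a plain list (re.escape guards against regex metacharacters in a title):
import re

titles = ['President', 'Vicepresident', 'Chief of Staff', 'Special Effects', 'Vice Chief of Staff']
# sort longest-first so 'Vice Chief of Staff' is removed before 'Chief of Staff' could match inside it
pattern = '|'.join(map(re.escape, sorted(titles, key=len, reverse=True)))
df2 = (df['Passengers']
       .str.replace(pattern, '', regex=True)
       .str.replace(r'\s*,(?:\s*,)*', ', ', regex=True)  # collapse leftover comma runs
       .str.strip(' ,'))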
Related
Let's say I have a pandas dataframe that looks like this:
import pandas as pd
data = {'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris', 'David, Oscar']}
df = pd.DataFrame(data)
df
name
0 Tom, Jeffrey, Henry
1 Nick, James
2 Chris
3 David, Oscar
I know I can split the names into separate columns using the comma as separator, like so:
df[["name1", "name2", "name3"]] = df["name"].str.split(", ", expand=True)
df
name name1 name2 name3
0 Tom, Jeffrey, Henry Tom Jeffrey Henry
1 Nick, James Nick James None
2 Chris Chris None None
3 David, Oscar David Oscar None
However, if the name column would have a row that contains 4 names, like below, the code above will yield a ValueError: Columns must be same length as key
data = {'name': ['Tom, Jeffrey, Henry', 'Nick, James', 'Chris', 'David, Oscar', 'Jim, Jones, William, Oliver']}
# Create DataFrame
df = pd.DataFrame(data)
df
name
0 Tom, Jeffrey, Henry
1 Nick, James
2 Chris
3 David, Oscar
4 Jim, Jones, William, Oliver
How can I automatically split the name column into an n-number of separate columns based on the ',' separator? The desired output would be this:
name name1 name2 name3 name4
0 Tom, Jeffrey, Henry Tom Jeffrey Henry None
1 Nick, James Nick James None None
2 Chris Chris None None None
3 David, Oscar David Oscar None None
4 Jim, Jones, William, Oliver Jim Jones William Oliver
Use DataFrame.join to join the split-out columns back, with rename for the new column names:
f = lambda x: f'name{x+1}'
df = df.join(df["name"].str.split(", ", expand=True).rename(columns=f))
print(df)
name name1 name2 name3 name4
0 Tom, Jeffrey, Henry Tom Jeffrey Henry None
1 Nick, James Nick James None None
2 Chris Chris None None None
3 David, Oscar David Oscar None None
4 Jim, Jones, William, Oliver Jim Jones William Oliver
I have a dataset. Here is the 'Name' column:
0 Braund, Mr. Owen Harris
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
2 Heikkinen, Miss. Laina
3 Futrelle, Mrs. Jacques Heath (Lily May Peel)
4 Allen, Mr. William Henry
...
151 Pears, Mrs. Thomas (Edith Wearne)
152 Meo, Mr. Alfonzo
153 van Billiard, Mr. Austin Blyler
154 Olsen, Mr. Ole Martin
155 Williams, Mr. Charles Duane
and I need to extract the first name, status, and second name. When I try this on a simple string, it's OK:
full_name="Braund, Mr. Owen Harris"
first_name=full_name.split(',')[0]
second_name=full_name.split('.')[1]
print('First name:',first_name)
print('Second name:',second_name)
status = full_name.replace(first_name, '').replace(',','').split('.')[0]
print('Status:',status)
>First name: Braund
>Second name: Owen Harris
>Status: Mr
But after trying to do this with pandas, I fail with the status:
df['first_Name'] = df['Name'].str.split(',').str.get(0)  # it's OK, works well
But after this:
status= df['Name'].str.replace(df['first_Name'], '').replace(',','').split('.').str.get(0)
I get an error:
>>TypeError: 'Series' objects are mutable, thus they cannot be hashed
What are possible solutions?
Edit: Thanks for the answers. To extract the columns I do
def extract_name_data(row):
    row.str.extract('(?P<first_name>[^,]+), (?P<status>\w+.) (?P<second_name>[^(]+\w) ?')
    last_name = row['second_name']
    title = row['status']
    first_name = row['first_name']
    return first_name, second_name, status
and get
AttributeError: 'str' object has no attribute 'str'
What can be done? row is meant to be df['Name'].
You could use str.extract with named capturing groups:
df['Name'].str.extract(r'(?P<first_name>[^,]+), (?P<status>\w+.) (?P<second_name>[^(]+\w) ?')
output:
first_name status second_name
0 Braund Mr. Owen Harris
1 Cumings Mrs. John Bradley
2 Heikkinen Miss. Laina
3 Futrelle Mrs. Jacques Heath
4 Allen Mr. William Henry
5 Pears Mrs. Thomas
6 Meo Mr. Alfonzo
7 van Billiard Mr. Austin Blyler
8 Olsen Mr. Ole Martin
9 Williams Mr. Charles Duane
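If you want these as new columns on the original dataframe rather than as a separate frame, a small follow-up sketch joins the extracted columns back:
parts = df['Name'].str.extract(r'(?P<first_name>[^,]+), (?P<status>\w+.) (?P<second_name>[^(]+\w) ?')
df = df.join(parts)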
You can also place your original code, with slight modification, into the pandas .apply() function, as follows:
Just replace your Python variable names with the corresponding column names in pandas.
For example, replace full_name with x['Name'] and first_name with x['first_Name'] within the lambda of the .apply() function:
df['status'] = df.apply(lambda x: x['Name'].replace(x['first_Name'], '').replace(',','').split('.')[0], axis=1)
Though it may not be the most efficient way of doing it, it's a way to easily turn your existing Python code into a workable pandas version.
Result:
print(df)
Name first_Name status
0 Braund, Mr. Owen Harris Braund Mr
1 Cumings, Mrs. John Bradley (Florence Briggs Th... Cumings Mrs
2 Heikkinen, Miss. Laina Heikkinen Miss
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Futrelle Mrs
4 Allen, Mr. William Henry Allen Mr
151 Pears, Mrs. Thomas (Edith Wearne) Pears Mrs
152 Meo, Mr. Alfonzo Meo Mr
153 van Billiard, Mr. Austin Blyler van Billiard Mr
154 Olsen, Mr. Ole Martin Olsen Mr
155 Williams, Mr. Charles Duane Williams Mr
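If the row-wise apply ever becomes a bottleneck, a vectorized sketch using str.extract gives the same status column, assuming the status always sits between the first comma and the first period:
# capture everything between the comma and the first '.'
df['status'] = df['Name'].str.extract(r',\s*([^.]+)\.', expand=False)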
I have a report that contains Invoice IDs and approvers. Invoices can have multiple approvers, which results in the IDs being duplicated (this is fine). What I want to do is check each group of Invoice IDs to see if either of 2 approvers is in the list of approvers associated with that ID. If they are, then I want to keep all of the rows of that ID. I think my question is similar to this one: Drop all rows in a group if none of the rows match a specific condition; however, no one has answered that one yet. Below is an example of what I'm going for.
Invoice Id    Approver
149877RV Jane Doe
149877RV Joe Manchin
149877RV Michael Frank
149877RV Kevin Holder
149877RV Michael Frank
149877RV Michael Frank
149877RV James Doe
Michael Frank and Kevin Holder are the names I'm looking for. Since they are both present here (in my scenario it can be either one of them) I want to keep all of these rows.
150210 Jim Halpert
150210 Mike Smith
150210 FP&A
150210 Michael Scott
Since neither Michael Frank nor Kevin Holder are on this list, I want to remove all of these rows.
I haven't been able to find a solution that allows me to keep all rows as I'm describing.
This should work:
import pandas as pd

df = pd.DataFrame({'invoice_ID': ['149877RV', '149877RV', '149877RV', '149877RV', '149877RV', '149877RV', '149877RV'],
                   'Approver': ['Jane Doe', 'Joe Manchin', 'Michael Frank', 'Kevin Holder', 'Michael Frank', 'Michael Frank', 'James Doe']})

# rows whose approver is one of the two names of interest
mask = df['Approver'].isin(['Michael Frank', 'Kevin Holder'])
df1 = df.loc[mask]
# one row per invoice that has at least one matching approver
df1 = df1.drop_duplicates(subset=['invoice_ID'])
# keep every row whose invoice ID is in that list
mask2 = df['invoice_ID'].isin(df1['invoice_ID'])
final_list = df.loc[mask2]
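A more compact equivalent is a sketch using GroupBy.transform, which broadcasts the per-invoice check back onto every row:
wanted = ['Michael Frank', 'Kevin Holder']
# True for every row of an invoice if any of its approvers is in `wanted`
keep = df.groupby('invoice_ID')['Approver'].transform(lambda s: s.isin(wanted).any())
final_list = df.loc[keep]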
I think you should check your condition for each available invoice ID and collect the rows of every ID that satisfies it. I merged your data to make the example a little clearer:
data = {"invoice_id": ["149877RV", "149877RV", "149877RV", "149877RV", "149877RV", "149877RV", "149877RV", "150210", "150210", "150210", "150210"],
"approver": ["Jane Doe", "Joe Manchin", "Michael Frank", "Kevin Holder", "Michael Frank", "Michael Frank", "Jane Doe", "Jim Halpert", "Mike Smith", "FP&A", "Michael Scott"]}
df = pd.DataFrame.from_dict(data)
approver_1, approver_2 = "Michael Frank", "Kevin Holder"
results = []
for invoice_id in set(df["invoice_id"]):
    approvers_per_id = set(df.loc[df["invoice_id"] == invoice_id, "approver"])
    # either approver is enough to keep all rows of this invoice
    if approver_1 in approvers_per_id or approver_2 in approvers_per_id:
        results.append(df.loc[df["invoice_id"] == invoice_id])
print(results)
This returns
[ invoice_id approver
0 149877RV Jane Doe
1 149877RV Joe Manchin
2 149877RV Michael Frank
3 149877RV Kevin Holder
4 149877RV Michael Frank
5 149877RV Michael Frank
6 149877RV Jane Doe]
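To turn that list of per-invoice frames back into a single DataFrame, you could concatenate it:
final = pd.concat(results)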
This question already has answers here:
Python: Random selection per group
(11 answers)
Closed 4 years ago.
Let's say I have a pandas DataFrame named df that looks like this
father_name child_name
Robert Julian
Robert Emily
Robert Dan
Carl Jack
Carl Rose
John Lucy
John Mark
John Alysha
Paul Christopher
Paul Thomas
Robert Kevin
Carl Elisabeth
where I know for sure that each father has at least 2 children.
I would like to obtain a DataFrame where each father has exactly 2 of his children, and those two children are selected at random. An example output would be
father_name child_name
Robert Emily
Robert Kevin
Carl Jack
Carl Elisabeth
John Alysha
John Mark
Paul Thomas
Paul Christopher
How can I do that?
You can apply DataFrame.sample on the grouped data. It takes the parameter n, which you can set to 2:
df.groupby('father_name').child_name.apply(lambda x: x.sample(n=2))\
  .reset_index(level=1, drop=True).reset_index()
father_name child_name
0 Carl Elisabeth
1 Carl Jack
2 John Mark
3 John Lucy
4 Paul Thomas
5 Paul Christopher
6 Robert Emily
7 Robert Julian
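On pandas 1.1 or newer, GroupBy.sample does this directly and keeps both columns without the index juggling:
# two random children per father, original columns preserved
out = df.groupby('father_name').sample(n=2)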
I have data that is in 3 columns (name, question, response) that resulted from judging a student research symposium. There are 2 possible choices for level (graduate/undergraduate), and 5 for college (science, education, etc.)
What I need to do is take the average of the numerical responses for a given name, and sort by average numerical score for that person to output a table containing:
College Level Rank
science graduate 1st
science graduate 2nd
science graduate 3rd
science undergrad 1st
...
education graduate 1st
...
education undergrad 3rd
Here's a sample data table:
name question response
Smith, John Q1 10
Smith, John Q2 7
Smith, John Q3 10
Smith, John Q4 8
Smith, John Q5 10
Smith, John Q8 8
Smith, John level graduate
Smith, John colleg science
Smith, John Q9 4
Jones, Mary Q3 10
Jones, Mary Q2 10
Jones, Mary Q1 10
Jones, Mary Q4 10
Jones, Mary level undergraduate
Jones, Mary college education
Jones, Mary Q6 10
Jones, Mary Q7 10
Jones, Mary Q9 10
A talented student did this for us in Excel using pivot tables, but I'm sure this can be done using pandas, and I'm very curious how to do it (I'm quite new to pandas). The tricky part is that all the 'useful' information is in that 3rd column.
Convert the response column to numeric (the strings become NaN), then groupby and aggregate:
import pandas as pd
import numpy as np
from io import StringIO
data = '''name;question;response
Smith, John;Q1;10
Smith, John;Q2;7
Smith, John;Q3;10
Smith, John;Q4;8
Smith, John;Q5;10
Smith, John;Q8;8
Smith, John;level;graduate
Smith, John;colleg;science
Smith, John;Q9;4
Jones, Mary;Q3;10
Jones, Mary;Q2;10
Jones, Mary;Q1;10
Jones, Mary;Q4;10
Jones, Mary;level;undergraduate
Jones, Mary;college;education
Jones, Mary;Q6;10
Jones, Mary;Q7;10
Jones, Mary;Q9;10'''
df = pd.read_csv(StringIO(data), delimiter=';')
df['response'] = pd.to_numeric(df['response'], errors='coerce')
df.groupby('name')['response'].mean().reset_index().sort_values(by='response')
output
name response
1 Smith, John 8.142857
0 Jones, Mary 10.000000
You can start by pivoting the dataframe to get the college/level info and the question scores into separate dataframes:
pivoted = df.pivot_table(index='name',
                         columns='question',
                         values='response',
                         fill_value=np.nan,  # changing this value to 0 would count
                                             # unanswered questions as a 0 score
                         aggfunc='first')
categories = pivoted[['college', 'level']].copy()  # .copy() avoids SettingWithCopyWarning below
questions = pivoted.drop(['college', 'level'], axis=1)
And assign the average question score to each student in the categories dataframe:
categories['points'] = questions.astype(float).mean(axis=1, skipna=True)
The skipna=True combined with fill_value=np.nan means that unanswered questions do not count toward the average; thus, a student whose only answer is a 10 will have a mean of 10. As noted in the comment, fill_value=0 changes this behaviour.
Finally, the values can be sorted using sort_values to get a ranking within each category:
categories.sort_values(['college','level','points'],ascending=False)
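The question also asks for an explicit 1st/2nd/3rd rank; a minimal sketch, assuming the categories frame built above and that every student answered at least one question, numbers students within each (college, level) group:
# integer rank per (college, level) group, highest points first
categories['rank'] = (categories.groupby(['college', 'level'])['points']
                      .rank(ascending=False, method='first')
                      .astype(int))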