How to split a string in a pandas dataframe with capital letters - python

I am playing around with some NFL data and I have a column in a dataframe that looks like:
0 Lamar JacksonL. Jackson BAL
1 Patrick Mahomes IIP. Mahomes KC
2 Dak PrescottD. Prescott DAL
3 Josh AllenJ. Allen BUF
4 Russell WilsonR. Wilson SEA
There are 3 bits of information in each cell (FullName, ShortName and Team), which I am hoping to create new columns for.
Expected output:
FullName ShortName Team
0 Lamar Jackson L. Jackson BAL
1 Patrick Mahomes II P. Mahomes KC
2 Dak Prescott D. Prescott DAL
3 Josh Allen J. Allen BUF
4 Russell Wilson R. Wilson SEA
I've managed to get the Team, but I'm not quite sure how to do all three in one line.
I was thinking of splitting the string at the character just before the full stop; however, there are some names such as:
Anthony McFarland Jr.A. McFarland PIT
which have multiple full stops.
Anyone have an idea of the best way to approach this? Thanks!

The pandas Series str.extract method is what you're looking for. This regex works for all of the cases you've presented, though there may be some other edge cases.
import pandas as pd
df = pd.DataFrame({
    "bad_col": ["Lamar JacksonL. Jackson BAL", "Patrick Mahomes IIP. Mahomes KC",
                "Dak PrescottD. Prescott DAL", "Josh AllenJ. Allen BUF",
                "Josh AllenJ. Allen SEA", "Anthony McFarland Jr.A. McFarland PIT"],
})
print(df)
bad_col
0 Lamar JacksonL. Jackson BAL
1 Patrick Mahomes IIP. Mahomes KC
2 Dak PrescottD. Prescott DAL
3 Josh AllenJ. Allen BUF
4 Josh AllenJ. Allen SEA
5 Anthony McFarland Jr.A. McFarland PIT
pattern = r"(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)"
new_df = df["bad_col"].str.extract(pattern, expand=True)
print(new_df)
full_name short_name team
0 Lamar Jackson L. Jackson BAL
1 Patrick Mahomes II P. Mahomes KC
2 Dak Prescott D. Prescott DAL
3 Josh Allen J. Allen BUF
4 Josh Allen J. Allen SEA
5 Anthony McFarland Jr. A. McFarland PIT
Breaking down that regex:
(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)
(?P<full_name>.+)(?=[A-Z]\.)
captures any characters UNTIL we see a capital letter followed by a full stop/period. We use a lookahead (?=...) so that the capital letter and full stop are not consumed, because that part of the string belongs to the short name.
(?P<short_name>[A-Z]\.\s.*)\s
captures a capital letter (the player's first initial), then a full stop (the period that comes after the initial), then a space (between the initial and the last name), then the player's last name. Because .* is greedy, it runs up to the last space in the string; that space is matched by the \s outside the capture group, so it is not included in the short name.
(?P<team>[A-Z]+)
captures the run of capital letters after that space (this ends up being the player's team).
You've probably noticed that I've used named capture groups, as denoted by the (?P<name>pattern) structure. In pandas, the name of the capture group becomes the name of the column, and whatever is captured in that group becomes the values in that column.
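To see what each named group captures without involving pandas, the same pattern can be tried with the standard re module. This is only a sketch: split_cell is a hypothetical helper name, and returning None for a non-matching cell is a choice made here (str.extract gives NaN for such rows).

```python
import re

# Same pattern as in the answer; only the helper around it is new.
PATTERN = re.compile(
    r"(?P<full_name>.+)(?=[A-Z]\.)(?P<short_name>[A-Z]\.\s.*)\s(?P<team>[A-Z]+)"
)

def split_cell(cell):
    """Return the named groups as a dict, or None if the cell does not match."""
    match = PATTERN.match(cell)
    return match.groupdict() if match else None

print(split_cell("Anthony McFarland Jr.A. McFarland PIT"))
print(split_cell("no initials here"))
```

The keys of the returned dict are the group names, which is what pandas uses as column names.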
Now to join the new dataframe back to our original one to come full circle:
df = df.join(new_df)
print(df)
bad_col full_name short_name \
0 Lamar JacksonL. Jackson BAL Lamar Jackson L. Jackson
1 Patrick Mahomes IIP. Mahomes KC Patrick Mahomes II P. Mahomes
2 Dak PrescottD. Prescott DAL Dak Prescott D. Prescott
3 Josh AllenJ. Allen BUF Josh Allen J. Allen
4 Josh AllenJ. Allen SEA Josh Allen J. Allen
5 Anthony McFarland Jr.A. McFarland PIT Anthony McFarland Jr. A. McFarland
team
0 BAL
1 KC
2 DAL
3 BUF
4 SEA
5 PIT

My guess is that the short name contains no full stop other than the one after the initial. So you can search for the first full stop from the end of the line. The short name then runs from one character before that full stop up to the last space (where the team starts), and everything before that character is the FullName.
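Searching from the end of the string can be sketched with plain string methods. The function name split_player is made up for illustration, and the sketch assumes every cell contains at least one full stop and ends with a space followed by the team code.

```python
def split_player(cell):
    """Split 'FullNameI. Last TEAM' using the last full stop in the cell."""
    dot = cell.rfind(".")            # last full stop: the one after the initial
    full_name = cell[:dot - 1]       # everything before the initial
    rest = cell[dot - 1:]            # "I. Last TEAM"
    short_name, team = rest.rsplit(" ", 1)  # the last space precedes the team
    return full_name, short_name, team

print(split_player("Anthony McFarland Jr.A. McFarland PIT"))
print(split_player("Patrick Mahomes IIP. Mahomes KC"))
```

On a DataFrame column this could be applied row by row, for example with df["bad_col"].apply(split_player), at the cost of being slower than a vectorised str.extract.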

This might help.
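import re
name = 'Anthony McFarland Jr.A. McFarland PIT'
# The team is whatever follows the last space (2 or 3 letters, e.g. KC or PIT)
rest, team = name.rsplit(' ', 1)
# The short name is the trailing "initial, period, space, last name"
short_name = re.findall(r'\w\.\s\w+$', rest)[0]
full_name = rest[:-len(short_name)]
print(short_name)
print(full_name)
print(team)
Output: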
A. McFarland
Anthony McFarland Jr.
PIT

import pandas as pd
import numpy as np
df = pd.DataFrame({'players': ['Lamar JacksonL. Jackson BAL', 'Patrick Mahomes IIP. Mahomes KC',
                               'Anthony McFarland Jr.A. McFarland PIT']})
def splitName(name):
    # Position of the last full stop, which follows the short name's initial
    last_period_pos = np.max(np.where(np.array(list(name)) == '.'))
    # Everything before the initial is the full name
    full_name = name[:(last_period_pos - 1)]
    short_name_team = name[(last_period_pos - 1):]
    # The last space separates the short name from the team
    team_pos = np.max(np.where(np.array(list(short_name_team)) == ' '))
    short_name = short_name_team[:team_pos]
    team = short_name_team[(team_pos + 1):]
    return full_name, short_name, team
df['full_name'], df['short_name'], df['team'] = zip(*df.players.apply(splitName))

Related

Merging strings of people's names in pandas

I have two datasets that I want to merge based on the person's name. One dataset, player_nationalities, has their full name:
Player, Nationality
Kylian Mbappé, France
Wissam Ben Yedder, France
Gianluigi Donnarumma, Italy
The other dataset player_ratings shortens their first names with a full stop and keeps the other name(s).
Player, Rating
K. Mbappé, 93
W. Ben Yedder, 89
G. Donnarumma, 91
How do I merge these tables based on the column Player and avoid merging people with the same last name? This is my attempt:
df = pd.merge(player_nationalities, player_ratings, on='Player', how='left')
Player, Nationality, Rating
K. Mbappé, France, NaN
W. Ben Yedder, France, NaN
G. Donnarumma, Italy, NaN
You would need to normalize the keys in both DataFrames in order to merge them.
One idea would be to create a function that processes the full name in player_nationalities, and then merge on the processed player name, e.g.:
def convert_player_name(name):
    try:
        first_name, last_name = name.split(' ', maxsplit=1)
        return f'{first_name[0]}. {last_name}'
    except ValueError:
        return name
player_nationalities['processed_name'] = [convert_player_name(name) for name in player_nationalities['Player']]
df_merged = player_nationalities.merge(player_ratings, left_on='processed_name', right_on='Player')
[out]
Player_x Nationality processed_name Player_y Rating
0 Kylian Mbappé France K. Mbappé K. Mbappé 93
1 Wissam Ben Yedder France W. Ben Yedder W. Ben Yedder 89
2 Gianluigi Donnarumma Italy G. Donnarumma G. Donnarumma 91
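A vectorised variant of the same normalisation, with a check for rows that found no partner, might look like the sketch below. The column name short_name and the variable names are assumptions for illustration; the data is copied from the question.

```python
import pandas as pd

player_nationalities = pd.DataFrame({
    "Player": ["Kylian Mbappé", "Wissam Ben Yedder", "Gianluigi Donnarumma"],
    "Nationality": ["France", "France", "Italy"],
})
player_ratings = pd.DataFrame({
    "Player": ["K. Mbappé", "W. Ben Yedder", "G. Donnarumma"],
    "Rating": [93, 89, 91],
})

# Keep the first letter of the first name, add ". ", keep the rest unchanged.
# Names without a space do not match the pattern and are left as they are.
short = player_nationalities["Player"].str.replace(
    r"^(\S)\S*\s+", r"\1. ", regex=True
)

merged = player_nationalities.assign(short_name=short).merge(
    player_ratings.rename(columns={"Player": "short_name"}),
    on="short_name",
    how="left",
    indicator=True,  # adds a "_merge" column describing each row's origin
)

# Rows present only on the left side had no rating to match.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged)
```

Matching on initial plus surname separates players who only share a last name, but two players with the same initial and the same surname would still collide, so a further key (club, for instance) would be needed in that case.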

Replacement Values into the integer on dataset columns

   House Number  Street        First Name  Surname   Age  Relationship to Head of House  Marital Status  Gender  Occupation                         Infirmity  Religion
0  1             Smith Radial  Grace       Patel     46   Head                           Widowed         Female  Petroleum engineer                 None       Catholic
1  1             Smith Radial  Ian         Nixon     24   Lodger                         Single          Male    Publishing rights manager          None       Christian
2  2             Smith Radial  Frederick   Read      87   Head                           Divorced        Male    Retired TEFL teacher               None       Catholic
3  3             Smith Radial  Daniel      Adams     58   Head                           Divorced        Male    Therapist, music                   None       Catholic
4  3             Smith Radial  Matthew     Hall      13   Grandson                       NaN             Male    Student                            None       NaN
5  3             Smith Radial  Steven      Fletcher  9    Grandson                       NaN             Male    Student                            None       NaN
6  4             Smith Radial  Alison      Jenkins   38   Head                           Single          Female  Physiotherapist                    None       Catholic
7  4             Smith Radial  Kelly       Jenkins   12   Daughter                       NaN             Female  Student                            None       NaN
8  5             Smith Radial  Kim         Browne    69   Head                           Married         Female  Retired Estate manager/land agent  None       Christian
9  5             Smith Radial  Oliver      Browne    69   Husband                        Married         Male    Retired Merchandiser, retail       None       None
Hello,
I have the dataset shown above. When I tried to convert Age to int, I got this error: ValueError: invalid literal for int() with base 10: '43.54302670766108'
This means some of the Age values are floats stored as strings. I tried to replace '.' with '0' and then convert, but it failed. Could you help me with that?
df['Age'] = df['Age'].replace('.','0')
df['Age'] = df['Age'].astype('int')
I still get the same error. I think the replace line is not working. Do you know why?
Thanks
Try:
df['Age'] = df['Age'].replace(r'\..*$', '', regex=True).astype(int)
Or, more drastic (any value that is empty or contains a non-digit character becomes 0):
df['Age'] = df['Age'].replace(r'^(?:.*\D.*)?$', '0', regex=True).astype(int)
You do not need to manipulate the strings; you might first convert values to float then to int like:
df["Age"] = df["Age"].astype('float').astype('int')
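As for why the original replace line had no effect: without regex=True, Series.replace compares whole cell values, so replace('.', '0') only changes cells that are exactly '.', not the dot inside '43.54302670766108'. A conversion that also tolerates missing or unparseable values could be sketched with pd.to_numeric. The sample values and the choice to fill gaps with 0 are assumptions here, not part of the question.

```python
import pandas as pd

# Hypothetical Age column: strings, one float-looking value, one missing value.
ages = pd.Series(["46", "24", "43.54302670766108", None])

# errors="coerce" turns anything that cannot be parsed into NaN.
numeric = pd.to_numeric(ages, errors="coerce")

# astype(int) truncates the decimal part; NaN must be filled first.
as_int = numeric.fillna(0).astype(int)
print(as_int.tolist())
```

If 0 is not a sensible stand-in for a missing age, the NaN rows could instead be dropped or inspected before the final conversion.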

Find and replace in dataframe from another dataframe

I have two dataframes; snippets of both are below. I am trying to find and replace the artists' names in the second dataframe with the ids from the first dataframe. Is there a good way to do this?
id fullName
0 1 Colin McCahon
1 2 Robert Henry Dickerson
2 3 Arthur Dagley
Artists
0 Arthur Dagley, Colin McCahon, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 Robert Henry Dickerson
3 Steve Carr
Desired output:
Artists
0 3, 1, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 2
3 Steve Carr
You can do this with replace, using a dictionary that maps each fullName to its id:
df1.Artists.replace(dict(zip(df.fullName,df.id.astype(str))),regex=True)
0 3, 1, Maria Cruz
1 Fiona Gilmore, Peter Madden, Nicholas Spratt, ...
2 2
3 Steve Carr
Name: Artists, dtype: object
Convert your first dataframe into a dictionary that maps each full name to its id (as a string):
d = dict(zip(name_df.fullName, name_df.id.astype(str)))
Then use .replace():
artists_df["Artists"] = artists_df["Artists"].replace(d, regex=True)
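With regex=True the dictionary keys are treated as regular expressions and replaced wherever they occur inside a cell, so a name containing regex metacharacters, or one name contained in another, could give surprising results. A stricter alternative, sketched below, splits each cell on commas and only swaps names that match a key exactly. replace_names and id_by_name are hypothetical names; the data mirrors the question's snippets.

```python
# Mapping from full name to id (as strings), as in the question's first dataframe.
id_by_name = {
    "Colin McCahon": "1",
    "Robert Henry Dickerson": "2",
    "Arthur Dagley": "3",
}

def replace_names(cell, mapping):
    """Swap each comma-separated name for its id when the whole name is a key."""
    names = [name.strip() for name in cell.split(",")]
    return ", ".join(mapping.get(name, name) for name in names)

rows = [
    "Arthur Dagley, Colin McCahon, Maria Cruz",
    "Robert Henry Dickerson",
    "Steve Carr",
]
for row in rows:
    print(replace_names(row, id_by_name))
```

On the dataframe itself this could be applied with artists_df["Artists"].apply(replace_names, args=(d,)).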

randomly select n rows from each block - pandas DataFrame [duplicate]

Let's say I have a pandas DataFrame named df that looks like this
father_name child_name
Robert Julian
Robert Emily
Robert Dan
Carl Jack
Carl Rose
John Lucy
John Mark
John Alysha
Paul Christopher
Paul Thomas
Robert Kevin
Carl Elisabeth
where I know for sure that each father has at least 2 children.
I would like to obtain a DataFrame where each father has exactly 2 of his children, and those two children are selected at random. An example output would be
father_name child_name
Robert Emily
Robert Kevin
Carl Jack
Carl Elisabeth
John Alysha
John Mark
Paul Thomas
Paul Christopher
How can I do that?
You can call sample on each group of the grouped data. It takes a parameter n, which you can set to 2:
df.groupby('father_name').child_name.apply(lambda x: x.sample(n=2))\
.reset_index(1, drop = True).reset_index()
father_name child_name
0 Carl Elisabeth
1 Carl Jack
2 John Mark
3 John Lucy
4 Paul Thomas
5 Paul Christopher
6 Robert Emily
7 Robert Julian
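In pandas 1.1 and later, the groupby object has its own sample method, which avoids the apply and the index juggling. A sketch using the question's data follows; random_state is set only to make the example repeatable and is not needed in practice.

```python
import pandas as pd

df = pd.DataFrame({
    "father_name": ["Robert", "Robert", "Robert", "Carl", "Carl", "John",
                    "John", "John", "Paul", "Paul", "Robert", "Carl"],
    "child_name": ["Julian", "Emily", "Dan", "Jack", "Rose", "Lucy",
                   "Mark", "Alysha", "Christopher", "Thomas", "Kevin", "Elisabeth"],
})

# Draw 2 rows at random from each father's group.
picked = (
    df.groupby("father_name")
      .sample(n=2, random_state=0)
      .reset_index(drop=True)
)
counts = picked["father_name"].value_counts()
print(picked)
```

Because every father has at least 2 children, n=2 never exceeds a group's size; with smaller groups sample would raise unless replace=True is passed.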

How to create a single value out of a list within a 3D dataframe in pandas?

this is my first post here and a noob in programming so I apologize in advance for any unintended lack of syntax and appropriate technical vocabulary.
I have the following dataframe (excerpt) which, correct me if I'm wrong, is a 3D pandas df with lists (...arrays?) within the df having different lengths?
df=
Genre Cast
0 Action, Drama, Comedy, horror Brad P., Denzel W.
1 Crime Al P., Robert De N., Angelina J., Lupita N.
2 Action, Sci-fi, Adventure Mark W., Jamie F., Mila K.
3 Drama, Crime Jessica C.,Emma S.
4 Thriller, Action, Comedy, Romance Jennifer L., Tom H., Charlize H., Vin D., Denzel W.
5 Thriller, Drama, Adventure Tupac, George C., Kevin S.
Now I want to oversimplify the list contained within the Genre to one string that I'll set as the Main_Genre
eg: if Main_Genre=[Action, Crime, Drama, Thriller, etc.] with the most important genre as Action > Crime > Drama > Thriller > etc. then I want my df to look like
df=
Genre Cast
0 Action Brad P., Denzel W.
1 Crime Al P., Robert De N., Angelina J., Lupita N.
2 Action Mark W., Jamie F., Mila K.
3 Crime Jessica C.,Emma S.
4 Action Jennifer L., Tom H., Charlize H., Vin D., Denzel W.
5 Drama Tupac, George C., Kevin S.
Is that even possible in Pandas? Should I "fill" my lists with NaN for eg. so that they all have the same lengths within the df (how would that be done)?
Any help would be appreciated, I'm not even sure where to start on that one!
Thanks
Here is my approach:
df.Genre.str.split(', ').apply(lambda x : sorted(x, key=lambda y: order_list.index(y))[0])
Split each Genre string into a list of genres,
then sort that list by each genre's position in an order list (a list in which all the genres are ordered by importance),
and finally take the first item of the sorted list.
Here is an example of such an order list. Every genre that appears in the column must be in it, otherwise order_list.index raises a ValueError:
order_list = ['Action' , 'Crime' , 'Drama' , 'Thriller', 'Comedy','Romance', 'horror', 'Sci-fi', 'Adventure ', 'Adventure']
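A variation that does not need every genre listed, and that ignores stray spaces around the commas, can be sketched with a rank dictionary and min. PRIORITY, RANK and main_genre are illustrative names, and the priority order is the one given in the question; genres outside the list are assumed to rank last.

```python
# Assumed priority order, most important first (from the question's example).
PRIORITY = ["Action", "Crime", "Drama", "Thriller"]
RANK = {genre: position for position, genre in enumerate(PRIORITY)}

def main_genre(cell):
    """Pick the highest-priority genre from a comma-separated string."""
    genres = [genre.strip() for genre in cell.split(",")]
    # Unlisted genres share the lowest rank; min keeps the first of any tie.
    return min(genres, key=lambda genre: RANK.get(genre, len(RANK)))

print(main_genre("Action, Drama, Comedy, horror"))
print(main_genre("Thriller, Drama, Adventure"))
print(main_genre("Drama, Crime"))
```

It could be applied to the column with df["Genre"].apply(main_genre).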
