I have a dataframe that looks like this:
df = pd.DataFrame({
'name': ['John','William', 'Nancy', 'Susan', 'Robert', 'Lucy', 'Blake', 'Sally', 'Bruce'],
'injury': ['right hand broken', 'lacerated left foot', 'foot broken', 'right foot fractured', '', 'sprained finger', 'chest pain', 'swelling in arm', 'laceration to arms, hands, and foot']
})
name injury
0 John right hand broken
1 William lacerated left foot
2 Nancy foot broken
3 Susan right foot fractured
4 Robert
5 Lucy sprained finger
6 Blake chest pain
7 Sally swelling in arm
8 Bruce laceration to arms, hands, and foot <-- this is a weird case, since there are multiple body parts
Notably, some of the values in the injury column are blank.
I want to replace the values in the injury column with only the affected body part. In my case, that would be hand, foot, finger, chest, and arm. There are dozens more... this is a small example.
The desired dataframe would look like this:
name injury
0 John hand
1 William foot
2 Nancy foot
3 Susan foot
4 Robert
5 Lucy finger
6 Blake chest
7 Sally arm
8 Bruce arm, hand, foot
I could do something like this:
df.loc[df['injury'].str.contains('hand'), 'injury'] = 'hand'
df.loc[df['injury'].str.contains('foot'), 'injury'] = 'foot'
df.loc[df['injury'].str.contains('finger'), 'injury'] = 'finger'
df.loc[df['injury'].str.contains('chest'), 'injury'] = 'chest'
df.loc[df['injury'].str.contains('arm'), 'injury'] = 'arm'
But, this might not be the most elegant way.
Is there a more elegant way to do this? (e.g. using a dictionary)
(any advice on that last case with multiple body parts would be appreciated)
Thank you!
I think you should maintain a list of body parts and use an apply function:
body_parts = ['hand', 'foot', 'finger', 'chest', 'arm']
def test(value):
    body_text = []
    for body_part in body_parts:
        if body_part in value:
            body_text.append(body_part)
    if body_text:
        return ', '.join(body_text)
    return value

df['injury'] = df['injury'].apply(test)
Output:
name injury
0 John hand
1 William foot
2 Nancy foot
3 Susan foot
4 Robert
5 Lucy finger
6 Blake chest
7 Sally arm
8 Bruce hand, foot, arm
The standard way to get the first match of a regex on a string column is to use .str.extract(); see the pandas user guide section on working with text data.
df['injury'].str.extract('(arm|chest|finger|foot|hand)', expand=False)
0 hand
1 foot
2 foot
3 foot
4 NaN
5 finger
6 chest
7 arm
8 arm
Name: injury, dtype: object
Note that row 4 returned NaN rather than '' (it's trivial to apply .fillna('') to the result). More importantly, in row 8 we only get the first match, not all matches. You need to decide how you want to handle this; see .extractall()
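If you do want every match per row, here is a minimal sketch using .str.extractall(), which grows one row per match and can be joined back per original row (variable names are just illustrative):

pattern = r'(arm|chest|finger|foot|hand)'

# .str.extractall() returns one row per match, indexed by (original row, match number),
# so group on the original row index and join the matches back into a single string.
all_parts = (
    df['injury']
    .str.extractall(pattern)[0]
    .groupby(level=0)
    .agg(', '.join)
    .reindex(df.index)   # restore rows with no match (e.g. row 4)
    .fillna('')
)

For row 8 this gives 'arm, hand, foot', matching the desired output.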
selected_words = ["hand", "foot", "finger", "chest", "arms", "arm", "hands"]
df["injury"] = (
df["injury"]
.str.replace(",", "")
.str.split(" ", expand=False)
.apply(lambda x: ", ".join(set([i for i in x if i in selected_words])))
)
print(df)
name injury
0 John hand
1 William foot
2 Nancy foot
3 Susan foot
4 Robert
5 Lucy finger
6 Blake chest
7 Sally arm
8 Bruce arms, foot, hands
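If you also want the singular forms and a stable order in that last row ('arm, hand, foot' rather than 'arms, foot, hands'), a variant of the same snippet could map plurals back to singular and dedupe while preserving word order; the mapping below is only illustrative:

# Illustrative plural-to-singular mapping; extend it for the other body parts as needed.
singular = {"arms": "arm", "hands": "hand"}

df["injury"] = (
    df["injury"]
    .str.replace(",", "")
    .str.split(" ")
    .apply(lambda words: ", ".join(
        dict.fromkeys(singular.get(w, w) for w in words if w in selected_words)
    ))
)

dict.fromkeys keeps the first occurrence of each part in the order it appears, unlike set, which has no guaranteed order.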
Related
So I have a DataFrame with a list of teams and their goals. I have googled different ways to solve this but I can't find a way to do what I want.
I have tried value_counts(), and it seems to list all the goals for each team, but I can't find a way to add them together.
HomeTeam TeamsGoals
Liverpool 4
0
3
6
1
matchdata.groupby("HomeTeam")["FullTimeHomeTeamsGoals"].value_counts()
I have tried many different things but I can't get the right output.
My dataset looks something like this:
HOME AWAY HOMEGOALS AWAYGOALS
Liverpool Man City 5 3
Man u Man City 0 2
LiverPool Man u 6 2
Man u LiverPool 0 2
Man City Man U 7 4
Man City Liverpool 2 2
Wanted output:
HOME TotalScoreHome
Liverpool 11
Man City 9
Man u 0
Just use groupby and sum:
matchdata.groupby('HOME')['HOMEGOALS'].sum()
HOME
LiverPool 6
Liverpool 5
Man City 9
Man u 0
Name: HOMEGOALS, dtype: int64
Or, if LiverPool really is just a different casing of Liverpool, then:
matchdata.groupby(matchdata['HOME'].str.lower())['HOMEGOALS'].sum()
HOME
liverpool 11
man city 9
man u 0
Name: HOMEGOALS, dtype: int64
The reason this does not work as you would expect is that LiverPool != Liverpool, so groupby won't group them together. Convert that, and try again:
df.replace({'LiverPool':'Liverpool'},inplace=True)
df.groupby('HOME',as_index=False)['HOMEGOALS'].sum()
HOME HOMEGOALS
0 Liverpool 11
1 Man City 9
2 Man u 0
Note you might need to do this for your other values too if there are similar misspellings among them.
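For example, a small sketch of cleaning several known variant spellings at once with a mapping dict before grouping (the exact entries here are only placeholders; build them from your real data):

# Hypothetical mapping of variant spellings/casings to canonical team names.
fixes = {'LiverPool': 'Liverpool', 'Man U': 'Man u'}

df = df.replace({'HOME': fixes, 'AWAY': fixes})
df.groupby('HOME', as_index=False)['HOMEGOALS'].sum()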
Assuming I have the following toy model df:
Line Sentence
1 A MAN TAUGHT ME HOW TO DANCE.
2 WE HAVE TO CHOOSE A CAKE.
3 X RAYS CAN BE HARMFUL.
4 MY HERO IS MALCOLM X FROM THE USA.
5 THE BEST ACTOR IS JENNIFER A FULTON.
6 A SOUND THAT HAS A BIG IMPACT.
If I were to do the following:
df['Sentence'] = df['Sentence'].str.replace('A ',' ')
This would replace every occurrence of 'A ' in all sentences. However, I only need the 'A ' removed from sentences that start with 'A '. Similarly, I would like to remove the 'X ' from Line 3, but not from Malcolm X in Line 4.
The final output df should look like the following:
Line Sentence
1 MAN TAUGHT ME HOW TO DANCE.
2 WE HAVE TO CHOOSE A CAKE.
3 RAYS CAN BE HARMFUL.
4 MY HERO IS MALCOLM X FROM THE USA.
5 THE BEST ACTOR IS JENNIFER A FULTON.
6 SOUND THAT HAS A BIG IMPACT.
You can use a regular expression:
df["Sentence"] = df["Sentence"].str.replace(r"^(?:A|X)(?=\s)", "", regex=True)
print(df)
Prints:
Line Sentence
0 1 MAN TAUGHT ME HOW TO DANCE.
1 2 WE HAVE TO CHOOSE A CAKE.
2 3 RAYS CAN BE HARMFUL.
3 4 MY HERO IS MALCOLM X FROM THE USA.
4 5 THE BEST ACTOR IS JENNIFER A FULTON.
5 6 SOUND THAT HAS A BIG IMPACT.
Use a regex to match only at the start of the string:
df['Sentence'] = df['Sentence'].str.replace(r'^([AX] )', '', regex=True)
df:
Line Sentence
0 1 MAN TAUGHT ME HOW TO DANCE.
1 2 WE HAVE TO CHOOSE A CAKE.
2 3 RAYS CAN BE HARMFUL.
3 4 MY HERO IS MALCOLM X FROM THE USA.
4 5 THE BEST ACTOR IS JENNIFER A FULTON.
5 6 SOUND THAT HAS A BIG IMPACT.
Use str.replace with a regex anchored at the start of the string, replacing the leading letter and the following whitespace with the empty string. Code below:
df.Sentence = df.Sentence.str.replace(r'^A\s+|^X\s+', '', regex=True)
Sentence
0 MAN TAUGHT ME HOW TO DANCE.
1 WE HAVE TO CHOOSE A CAKE.
2 RAYS CAN BE HARMFUL.
3 MY HERO IS MALCOLM X FROM THE USA.
4 THE BEST ACTOR IS JENNIFER A FULTON.
5 SOUND THAT HAS A BIG IMPACT.
I want to count paragraphs in a data frame's text column. However, my result comes out as zeros inside the list. Does anybody know how to fix it? Thank you so much.
Here is my code:
def count_paragraphs(df):
    paragraph_count = []
    linecount = 0
    for i in df.text:
        if i in ('\n', '\r\n'):
            if linecount == 0:
                paragraphcount = paragraphcount + 1
    return paragraph_count

count_paragraphs(df)
df.text
0 On Saturday, September 17 at 8:30 pm EST, an e...
1 Story highlights "This, though, is certain: to...
2 Critical Counties is a CNN series exploring 11...
3 McCain Criticized Trump for Arpaio’s Pardon… S...
4 Story highlights Obams reaffirms US commitment...
5 Obama weighs in on the debate\n\nPresident Bar...
6 Story highlights Ted Cruz refused to endorse T...
7 Last week I wrote an article titled “Donald Tr...
8 Story highlights Trump has 45%, Clinton 42% an...
9 Less than a day after protests over the police...
10 I woke up this morning to find a variation of ...
11 Thanks in part to the declassification of Defe...
12 The Democrats are using an intimidation tactic...
13 Dolly Kyle has written a scathing “tell all” b...
14 The Haitians in the audience have some newswor...
15 The man arrested Monday in connection with the...
16 Back when the news first broke about the pay-t...
17 Chicago Environmentalist Scumbags\n\nLeftists ...
18 Well THAT’S Weird. If the Birther movement is ...
19 Former President Bill Clinton and his Clinton ...
Name: text, dtype: object
Use Series.str.count:
def count_paragraphs(df):
    return df.text.str.count(r'\n\n').tolist()

count_paragraphs(df)
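If the source text might use Windows-style line endings, a hedged variant of the same idea counts both separators (this assumes paragraphs are delimited by a blank line):

def count_paragraphs(df):
    # Count blank-line separators whether the text uses '\n' or '\r\n' line endings.
    return df.text.str.count(r'\r\n\r\n|\n\n').tolist()

count_paragraphs(df)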
This is my answer and it works!
def count_paragraphs(df):
    paragraph_count = []
    for i in range(len(df)):
        paragraph_count.append(df.text[i].count('\n\n'))
    return paragraph_count

count_paragraphs(df)
I am trying to concatenate rows based on a categorical variable when the values of that variable in consecutive rows are the same. Here is my data below for example:
SNo user Text
0 1 Sam Hello
1 1 John Hi
2 1 Sam How are you?
3 1 John I am good
4 1 John How about you?
5 1 John How is it going?
6 1 Sam Its going good
7 1 Sam Thanks
8 2 Mary Howdy?
9 2 Jake Hey!!
10 2 Jake What a surprise
11 2 Mary Good to see you here :)
12 2 Jake Ha ha. Hectic life
13 2 Mary I know right..
14 2 Mary How's Amy doing?
15 2 Mary How are the kids?
16 2 Jake All is good! :)
Here, if the previous value in the user column is the same as the current value but different from the next value in that column, I'd like to concatenate the values from the Text column for that user. I need to keep doing this as long as the same user appears in consecutive rows. A sample output is given below:
SNo user Text
1 Sam Hello
1 John Hi
1 Sam How are you?
1 John I am good-How about you?-How is it going?
1 Sam Its going good-Thanks
2 Mary Howdy?
2 Jake Hey!!-What a surprise
2 Mary Good to see you here :)
2 Jake Ha ha. Hectic life
2 Mary I know right..-How's Amy doing?-How are the kids?
2 Jake All is good! :)
I tried using df.groupby() and then .agg() to get the concatenation done, but I was unable to apply the above-mentioned condition to it, so the output combines all occurrences of a user in a chat.
df = sample_data.groupby(["SNo","user"]).agg({'Text': '-'.join}).reset_index() # incorrect though
df
Moreover, I am trying to avoid for loops like the plague and am looking for a vectorised solution.
Sample data:
data_dict = {
    'SNo': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'user': ['Sam', 'John', 'Sam', 'John', 'John', 'John', 'Sam', 'Sam', 'Mary', 'Jake', 'Jake', 'Mary', 'Jake', 'Mary', 'Mary', 'Mary', 'Jake'],
    'Text': ['Hello', 'Hi', 'How are you?', 'I am good', 'How about you?', 'How is it going?', 'Its going good', 'Thanks', 'Howdy?', 'Hey!!', 'What a surprise', 'Good to see you here :)', 'Ha ha. Hectic life', 'I know right..', "How's Amy doing?", 'How are the kids?', 'All is good! :)']
}
sample_data = pd.DataFrame(data_dict)
You want to compare user with its shifted value and take a cumulative sum of the changes to label consecutive blocks. Then you can group by those blocks:
blocks = df['user'].ne(df['user'].shift()).cumsum()
(df.groupby(['SNo', blocks])
   .agg({'user': 'first', 'Text': '-'.join})
   .reset_index('user', drop=True)
)
Output:
user Text
SNo
1 Sam Hello
1 John Hi
1 Sam How are you?
1 John I am good-How about you?-How is it going?
1 Sam Its going good-Thanks
2 Mary Howdy?
2 Jake Hey!!-What a surprise
2 Mary Good to see you here :)
2 Jake Ha ha. Hectic life
2 Mary I know right..-How's Amy doing?-How are the kids?
2 Jake All is good! :)
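To see why this works, here is what the intermediate blocks series looks like on the first few rows of the sample data: the label increments every time user changes, so consecutive messages from the same speaker fall into one group.

blocks = df['user'].ne(df['user'].shift()).cumsum()
print(blocks.head(8).tolist())
# [1, 2, 3, 4, 4, 4, 5, 5]  -> rows 3-5 (John) and rows 6-7 (Sam) each share one label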
I'm doing some research on a dataframe of people that are related, but when I find brothers I can't find a way to write them all down in a specific column. Here follows an example:
cols = ['Name','Father','Brother']
df = pd.DataFrame({'Brother': '',
                   'Father': ['Erick Moon', 'Ralph Docker', 'Erick Moon', 'Stewart Adborn'],
                   'Name': ['John Smith', 'Rodolph Ruppert', 'Mathew Common', 'Patrick French']
                   }, columns=cols)
df
Name Father Brother
0 John Smith Erick Moon
1 Rodolph Ruppert Ralph Docker
2 Mathew Common Erick Moon
3 Patrick French Stewart Adborn
What I want is this:
Name Father Brother
0 John Smith Erick Moon Mathew Common
1 Rodolph Ruppert Ralph Docker
2 Mathew Common Erick Moon John Smith
3 Patrick French Stewart Adborn
I appreciate any help!
Here is an idea you can try: first create a Brother column with all brothers as a list (including the person themselves), then remove the person's own name separately. The code could probably be optimized, but it's somewhere to start from:
import numpy as np
import pandas as pd

df['Brother'] = df.groupby('Father')['Name'].transform(lambda g: [g.values])

def deleteSelf(row):
    row.Brother = np.delete(row.Brother, np.where(row.Brother == row.Name))
    return row

df.apply(deleteSelf, axis=1)
# Name Father Brother
# 0 John Smith Erick Moon [Mathew Common]
# 1 Rodolph Ruppert Ralph Docker []
# 2 Mathew Common Erick Moon [John Smith]
# 3 Patrick French Stewart Adborn []
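If you prefer the plain comma-separated strings shown in the desired output rather than arrays, a small follow-up step (building on the apply above) could be:

result = df.apply(deleteSelf, axis=1)
# Join each array of sibling names into one string; empty arrays become ''.
result['Brother'] = result['Brother'].apply(', '.join)
print(result)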
def same_father(me, data):
    hasdad = data.Father == data.at[me, 'Father']
    notme = data.index != me
    isbro = hasdad & notme
    return data.loc[isbro].index.tolist()

df2 = df.set_index('Name')
getbro = lambda x: same_father(x.name, df2)
df2['Brother'] = df2.apply(getbro, axis=1)
I think this should work (untested).