pandas: is one column a substring of another column - python

I'm trying to test whether one column (surname) is a substring of another column (name) in a dataframe (el). I've tried the following, but Python raises an error:
el.name.str.contains(el.surname)
I can see plenty of examples of how to search for a literal substring, but not where the substring is a column.
Going mad on this one, help please!
Dave

You may use
import pandas as pd
dct = {'surname': ['Smith', 'Miller', 'Mayer'],
'name': ['Dr. John Smith', 'Nobody', 'Prof. Dr. Mayer']}
df = pd.DataFrame(dct)
df['is_part_of_name'] = df.apply(lambda x: x["surname"] in x["name"], axis=1)
print(df)
Which yields
surname name is_part_of_name
0 Smith Dr. John Smith True
1 Miller Nobody False
2 Mayer Prof. Dr. Mayer True
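A list comprehension over the two columns is a common alternative to apply(axis=1). The sketch below assumes the asker's dataframe name el and adds a hypothetical row with a missing surname to show one way of treating missing values as "no match"; it is an illustration, not the only way to do this.

```python
import pandas as pd

# Sample data; the last row (missing surname) is an assumption added for illustration.
el = pd.DataFrame({
    "surname": ["Smith", "Miller", "Mayer", None],
    "name": ["Dr. John Smith", "Nobody", "Prof. Dr. Mayer", "Jane Doe"],
})

# Pair the two columns row by row; non-string values (None/NaN) count as no match.
el["is_part_of_name"] = [
    isinstance(s, str) and isinstance(n, str) and s in n
    for s, n in zip(el["surname"], el["name"])
]
print(el)
```

The isinstance checks avoid the TypeError that `None in "Jane Doe"` would otherwise raise.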

Related

Pandas: a Pythonic way to create a hyperlink from a value stored in another column of the dataframe

I have the following toy dataset df:
import pandas as pd
data = {
'id' : [1, 2, 3],
'name' : ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
    return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
id name google_search
0 1 John Smith https://www.google.com/search?q=John Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones
2 3 William Lee https://www.google.com/search?q=William Lee
Unfortunately, the newly created google_search column contains malformed URLs. Each URL should have a "+" between the first name and last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
foo = df['name'].str.split()
foo
0 [John, Smith]
1 [Sally, Jones]
2 [William, Lee]
Name: name, dtype: object
Now, joining them:
df['foo'] = foo
df['bar'] = ['+'.join(map(str, l)) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and modify your string manually, use a library that's guaranteed to give you the right result:
from urllib.parse import quote_plus
def create_hyperlink(search_string):
    return f"https://www.google.com/search?q={quote_plus(search_string)}"
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
df.name.str.replace(' ','+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee
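To tie the quote_plus answer back to the dataframe, here is a minimal sketch that maps it over the name column. The third name ("Tom & Jerry") is a made-up value, added only to show that characters other than spaces are escaped too, which a plain replace(' ', '+') would not handle.

```python
from urllib.parse import quote_plus

import pandas as pd

# "Tom & Jerry" is a hypothetical value chosen to include a reserved character.
df = pd.DataFrame({"id": [1, 2, 3],
                   "name": ["John Smith", "Sally Jones", "Tom & Jerry"]})

base = "https://www.google.com/search?q="
# quote_plus turns spaces into "+" and percent-encodes reserved characters.
df["google_search"] = base + df["name"].map(quote_plus)
print(df["google_search"].tolist())
```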

How can I add/merge values from one existing column to another column - Python - Pandas - Jupyter Notebook

Good Morning,
This is my code
data = {'Names_Males_GroupA': ['Robert', 'Andrew', 'Gordon', 'Steve'], 'Names_Females_GroupA': ['Brenda', 'Sandra', 'Karen', 'Megan'], 'Name_Males_GroupA': ['David', 'Patricio', 'Noe', 'Daniel']}
df = pd.DataFrame(data)
df
Since Name_Males_GroupA has an error (it is missing an 's'),
I need to move all the values to the correct column which is Names_Males_GroupA
In other words: I want to Add the names David, Patricio, Noe and Daniel below the names Robert, Andrew, Gordon and Steve.
After that I can delete the wrong column.
Thank you.
If I understand you correctly, you can try
df = pd.concat([df.iloc[:, :2], df.iloc[:, 2].to_frame('Names_Males_GroupA')], ignore_index=True)
print(df)
Names_Males_GroupA Names_Females_GroupA
0 Robert Brenda
1 Andrew Sandra
2 Gordon Karen
3 Steve Megan
4 David NaN
5 Patricio NaN
6 Noe NaN
7 Daniel NaN
I would break them apart and put them back together with a pd.concat
data = {'Names_Males_GroupA': ['Robert', 'Andrew', 'Gordon', 'Steve'], 'Names_Females_GroupA': ['Brenda', 'Sandra', 'Karen', 'Megan'], 'Name_Males_GroupA': ['David', 'Patricio', 'Noe', 'Daniel']}
df = pd.DataFrame(data)
df1 = df[['Name_Males_GroupA', 'Names_Females_GroupA']]
df1.columns = ['Names_Males_GroupA', 'Names_Females_GroupA']
df = df[['Names_Males_GroupA', 'Names_Females_GroupA']]
pd.concat([df, df1])
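Another way to express the same idea is to stack the two male columns by name instead of by position. This is only a sketch using the column names from the question; the variable names males and result are made up for the example.

```python
import pandas as pd

data = {'Names_Males_GroupA': ['Robert', 'Andrew', 'Gordon', 'Steve'],
        'Names_Females_GroupA': ['Brenda', 'Sandra', 'Karen', 'Megan'],
        'Name_Males_GroupA': ['David', 'Patricio', 'Noe', 'Daniel']}
df = pd.DataFrame(data)

# Stack the misspelled column under the correct one and renumber the rows.
males = pd.concat([df['Names_Males_GroupA'], df['Name_Males_GroupA']],
                  ignore_index=True).rename('Names_Males_GroupA')

# Put the female names next to it; rows 4-7 have no female name and become NaN.
result = pd.concat([males, df['Names_Females_GroupA']], axis=1)
print(result)
```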

Split a column in Python pandas

I'm sorry if I can't explain the issue properly, since I don't fully understand it myself. I'm starting to learn Python, and to practice I try to redo projects from my day-to-day job in Python. Right now I'm stuck on one of them and would like some help or guidance. I have a dataframe that looks like this:
Index Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
--------------------------------------------
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
(I apologize that I can't create a table in this post, since the separator of the IDs is a |.) But you get the idea: every person has 4 IDs, all in the same "cell" of the dataframe, and each ID is separated from its value by a pipe. I need to split the IDs from their values and put them in separate columns, so I get something like this:
index Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
Now, adding to the complexity of the table itself, I have other issues: for example, the order of the IDs won't be the same for everyone, and some people will be missing some of the IDs.
I honestly have no idea where to begin, the first thing I thought about trying was to split the IDs column by spaces and then split the result of that by pipes, to create a dictionary, convert it to a dataframe and then join it to my original dataframe using the index.
But as I said, my knowledge of Python is quite pathetic, so that failed catastrophically. I only got to the first step of the plan with Client_ids = df.IDs.str.split(), which returns a series with the IDs separated from each other like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|Wave39'], but I can't find a way to split it again because I keep getting an error saying that the list object doesn't have attribute 'split'.
How should I approach this? what alternatives do I have to do it?
Thank you in advance for any help or recommendation
You have a few options to consider to do this. Here's how I would do it.
I split the values in IDs on both \n and |, which gives a flat list of alternating keys and values. I then build a key:value dictionary from that list, join the result back to the dataframe, and drop the IDs and temp columns.
import pandas as pd
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
# Raw string so that \| reaches the regex engine as an escaped pipe
df['temp'] = df['IDs'].str.split(r'\n|\|').apply(lambda x: {k:v for k,v in zip(x[::2],x[1::2])})
df = df.join(pd.DataFrame(df['temp'].values.tolist(), df.index))
df = df.drop(columns=['IDs','temp'])
print(df)
With this approach, it does not matter if a row of data is missing. It will sort itself out.
The output of this will be:
Original DataFrame:
Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
2 CA Jill PERSID|12345
STARTDATE|20210201
WAVE|WAVE41
Updated DataFrame:
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Note that Jill did not have a SSO value. It set the value to NaN by default.
First generate your dataframe
df1 = pd.DataFrame([["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """
PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])
Then split the last cell using lambda
df2 = pd.DataFrame(list(df1.apply(lambda r: {p:q for p,q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))
Lastly concat the dataframes together.
df = pd.concat([df1, df2], axis=1)
Quick solution
remove_word = ["PERSID", "SSO" ,"STARTDATE" ,"WAVE"]
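# Assumes every row has all four IDs in this order; the pattern also strips "WAVE" from values such as "WAVE39"
for i, col in enumerate(remove_word):
    df[col] = df.IDs.str.replace('|'.join(remove_word), '', regex=True).str.split("|").str[i+1]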
Use regex named capture groups with pd.Series.str.extract
def ng(x):
    return rf'(?:{x}\|(?P<{x}>[^\n]+))?\n?'
fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']
pat = ''.join(map(ng, fields))
df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Setup
Credit to @JoeFerndz for sample df.
NOTE: this sample has missing values in some 'IDs'.
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
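Since the question mentions that the IDs may arrive in a different order per person and that some may be missing, here is a small sketch that does not depend on order. It builds one dictionary per row, which also sidesteps the "list object has no attribute 'split'" error from the question by splitting each line inside plain Python. The sample rows are made up (Jane's IDs are reordered and her STARTDATE is dropped), and the names parsed, wide and result are illustrative.

```python
import pandas as pd

# Hypothetical sample: Jane's IDs are in a different order and STARTDATE is missing.
df = pd.DataFrame(
    [
        ["USA", "John", "PERSID|12345\nSSO|John123\nSTARTDATE|20210101\nWAVE|WAVE39"],
        ["UK", "Jane", "WAVE|WAVE40\nPERSID|25478\nSSO|Jane123"],
    ],
    columns=["Country", "Name", "IDs"],
)

# One dict per row: split the cell on whitespace, then split each piece once on the pipe.
parsed = df["IDs"].apply(
    lambda cell: dict(line.split("|", 1) for line in cell.split() if "|" in line)
)

# A list of dicts becomes a wide frame; keys absent from a row turn into NaN.
wide = pd.DataFrame(parsed.tolist(), index=df.index)
result = df.drop(columns="IDs").join(wide)
print(result)
```

Because the columns are keyed by ID name rather than by position, the order of the IDs within a cell does not matter.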

Merging multiple rows into one in dataframe

I have a dataframe which looks like:
df =
|Name Nationality Family etc.....
0|John Born in Spain. Wife
1|nan But live in England son
2|nan nan daughter
Some columns only have one row, but others have multiple answers spread over a few rows. How could I merge the rows into one so that it looks something like the below?
df =
|Name Nationality Family etc....
0|John Born in Spain. But live in England Wife Son Daughter
Perhaps this will do it for you:
import numpy as np
import pandas as pd
# your dataframe
df = pd.DataFrame(
    {'Name': ['John', np.nan, np.nan],
     'Nationality': ['Born in Spain.', 'But live in England', np.nan],
     'Family': ['Wife', 'son', 'daughter']})
def squeeze_df(df):
    new_df = {}
    for col in df.columns:
        # str.cat skips NaN values and joins the rest with a space
        new_df[col] = [df[col].str.cat(sep=' ')]
    return pd.DataFrame(new_df)
squeeze_df(df)
# >> out:
# Name Nationality Family
# 0 John Born in Spain. But live in England Wife son daughter
I made the assumption that you only need to do this for one single person (i.e. squeezing/joining the rows of the dataframe into a single row). Also, what does "etc...." mean? For example, will you have integer or floating point values in the dataframe?
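If the dataframe holds more than one person, one possible extension is to group the rows first. The sketch below assumes that each person's block starts with a non-null Name and that all columns hold strings; the second person ("Mary") is invented for the example, and join_non_null is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two people; a new block starts wherever Name is not NaN.
df = pd.DataFrame({
    "Name": ["John", np.nan, np.nan, "Mary", np.nan],
    "Nationality": ["Born in Spain.", "But live in England", np.nan,
                    "Born in France.", np.nan],
    "Family": ["Wife", "son", "daughter", "Husband", "son"],
})

# Rows belonging to one person share the last non-null Name above them.
key = df["Name"].ffill()

def join_non_null(values):
    """Join the non-missing values of one column within a group."""
    return " ".join(values.dropna().astype(str))

merged = (
    df.drop(columns="Name")
      .groupby(key, sort=False)
      .agg(join_non_null)
      .reset_index()
)
print(merged)
```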

Find words and create new value in different column pandas dataframe with regex

Suppose I have a dataframe which contains:
df = pd.DataFrame({'Name':['John', 'Alice', 'Peter', 'Sue'],
'Job': ['Dentist', 'Blogger', 'Cook', 'Cook'],
'Sector': ['Health', 'Entertainment', '', '']})
I want to find all 'cooks', whether capitalized or not, and set the column 'Sector' to 'gastronomy' for those rows, without overwriting the other entries in 'Sector'. How do I do that? Thanks!
Here's one approach:
df.loc[df.Job.str.lower().eq('cook'), 'Sector'] = 'gastronomy'
print(df)
Name Job Sector
0 John Dentist Health
1 Alice Blogger Entertainment
2 Peter Cook gastronomy
3 Sue Cook gastronomy
Using Series.str.match with a regex and the inline case-insensitive flag (?i):
df.loc[df['Job'].str.match('(?i)cook'), 'Sector'] = 'gastronomy'
Output
Name Job Sector
0 John Dentist Health
1 Alice Blogger Entertainment
2 Peter Cook gastronomy
3 Sue Cook gastronomy
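One detail worth knowing: str.match only anchors the pattern at the start of the string, so '(?i)cook' would also match a job such as "Cookbook author". If only whole-cell matches are wanted, Series.str.fullmatch with case=False is one option. The extra row ("Max") below is a made-up example to show the difference.

```python
import pandas as pd

# "Max" is a hypothetical row whose job merely starts with "cook".
df = pd.DataFrame({
    "Name": ["John", "Alice", "Peter", "Sue", "Max"],
    "Job": ["Dentist", "Blogger", "Cook", "cook", "Cookbook author"],
    "Sector": ["Health", "Entertainment", "", "", "Publishing"],
})

# fullmatch requires the whole string to match; na=False treats missing jobs as no match.
is_cook = df["Job"].str.fullmatch("cook", case=False, na=False)
df.loc[is_cook, "Sector"] = "gastronomy"
print(df)
```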
