I'm trying to test if one column (surname) is a substring of another column (name) in a dataframe (el). I've tried the following but python doesn't like it
el.name.str.contains(el.surname)
I can see plenty of examples of how to search for a literal substring, but not where the substring is a column.
Going mad on this one, help please!
Dave
You may use
import pandas as pd
dct = {'surname': ['Smith', 'Miller', 'Mayer'],
'name': ['Dr. John Smith', 'Nobody', 'Prof. Dr. Mayer']}
df = pd.DataFrame(dct)
df['is_part_of_name'] = df.apply(lambda x: x["surname"] in x["name"], axis=1)
print(df)
Which yields
surname name is_part_of_name
0 Smith Dr. John Smith True
1 Miller Nobody False
2 Mayer Prof. Dr. Mayer True
Related
I have the following toy dataset df:
import pandas as pd
data = {
'id' : [1, 2, 3],
'name' : ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
id name google_search
0 1 John Smith https://www.google.com/search?q=John Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones
2 3 William Lee https://www.google.com/search?q=William Lee
Unfortunately, newly created google_search column is returning a malformed URL. The URL should have a "+" between the first name and last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
foo = df['name'].str.split()
foo
0 [John, Smith]
1 [Sally, Jones]
2 [William, Lee]
Name: name, dtype: object
Now, joining them:
df['bar'] = ['+'.join(map(str, l)) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and modify your string manually, use a library that's guaranteed to give you the right result :
from urllib.parse import quote_plus
def create_hyperlink(search_string):
return f"https://www.google.com/search?q={quote_plus(search_string)}"
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
df.name.str.replace(' ','+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee
Good Morning,
This is my code
data = {'Names_Males_GroupA': ['Robert', 'Andrew', 'Gordon', 'Steve'], 'Names_Females_GroupA': ['Brenda', 'Sandra', 'Karen', 'Megan'], 'Name_Males_GroupA': ['David', 'Patricio', 'Noe', 'Daniel']}
df = pd.DataFrame(data)
df
Since Name_Males_GroupA has an error (missing and 's')
I need to move all the values to the correct column which is Names_Males_GroupA
In other words: I want to Add the names David, Patricio, Noe and Daniel below the names Robert, Andrew, Gordon and Steve.
After that I can delete the wrong column.
Thank you.
If I understand you correctly, you can try
df = pd.concat([df.iloc[:, :2], df.iloc[:, 2].to_frame('Names_Males_GroupA')], ignore_index=True)
print(df)
Names_Males_GroupA Names_Females_GroupA
0 Robert Brenda
1 Andrew Sandra
2 Gordon Karen
3 Steve Megan
4 David NaN
5 Patricio NaN
6 Noe NaN
7 Daniel NaN
I would break them apart and put them back together with a pd.concat
data = {'Names_Males_GroupA': ['Robert', 'Andrew', 'Gordon', 'Steve'], 'Names_Females_GroupA': ['Brenda', 'Sandra', 'Karen', 'Megan'], 'Name_Males_GroupA': ['David', 'Patricio', 'Noe', 'Daniel']}
df = pd.DataFrame(data)
df1 = df[['Name_Males_GroupA', 'Names_Females_GroupA']]
df1.columns = ['Names_Males_GroupA', 'Names_Females_GroupA']
df = df[['Names_Males_GroupA', 'Names_Females_GroupA']]
pd.concat([df, df1])
I'm sorry if I can't explain properly the issue I'm facing since I don't really understand it that much. I'm starting to learn Python and to practice I try to do projects that I face in my day to day job, but using Python. Right now I'm stuck with a project and would like some help or guidance, I have a dataframe that looks like this
Index Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
--------------------------------------------
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
(I apologize since I can't create a table on this post since the separator of the ids is a | ) but you get the idea, every person has 4 IDs and they are all on the same "cell" of the dataframe, each ID separated from its value by pipes, I need to split those ID's from their values, and put them on separate columns so I get something like this
index
Country
Name
PERSID
SSO
STARTDATE
WAVE
0
USA
John
12345
John123
20210101
WAVE39
1
UK
Jane
25478
Jane123
20210101
WAVE40
Now, adding to the complexity of the table itself, I have another issues, for example, the order of the ID's won't be the same for everyone and some of them will be missing some of the ID's.
I honestly have no idea where to begin, the first thing I thought about trying was to split the IDs column by spaces and then split the result of that by pipes, to create a dictionary, convert it to a dataframe and then join it to my original dataframe using the index.
But as I said, my knowledge in python is quite pathetic, so that failed catastrophically, I only got to the first step of that plan with a Client_ids = df.IDs.str.split(), that returns a series with the IDs separated one from each other like ['PERSID|12345', 'SSO|John123', 'STARTDATE|20210101', 'WAVE|Wave39'] but I can't find a way to split it again because I keep getting an error saying the the list object doesn't have attribute 'split'
How should I approach this? what alternatives do I have to do it?
Thank you in advance for any help or recommendation
You have a few options to consider to do this. Here's how I would do it.
I will split the values in IDs by \n and |. Then create a dictionary with key:value for each split of values of |. Then join it back to the dataframe and drop the IDs and temp columns.
import pandas as pd
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
df['temp'] = df['IDs'].str.split('\n|\|').apply(lambda x: {k:v for k,v in zip(x[::2],x[1::2])})
df = df.join(pd.DataFrame(df['temp'].values.tolist(), df.index))
df = df.drop(columns=['IDs','temp'],axis=1)
print (df)
With this approach, it does not matter if a row of data is missing. It will sort itself out.
The output of this will be:
Original DataFrame:
Country Name IDs
0 USA John PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39
1 UK Jane PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40
2 CA Jill PERSID|12345
STARTDATE|20210201
WAVE|WAVE41
Updated DataFrame:
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Note that Jill did not have a SSO value. It set the value to NaN by default.
First generate your dataframe
df1 = pd.DataFrame([["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """
PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""]], columns=['Country', 'Name', 'IDs'])
Then split the last cell using lambda
df2 = pd.DataFrame(list(df.apply(lambda r: {p:q for p,q in [x.split("|") for x in r.IDs.split()]}, axis=1).values))
Lastly concat the dataframes together.
df = pd.concat([df1, df2], axis=1)
Quick solution
remove_word = ["PERSID", "SSO" ,"STARTDATE" ,"WAVE"]
for i ,col in enumerate(remove_word):
df[col] = df.IDs.str.replace('|'.join(remove_word), '', regex=True).str.split("|").str[i+1]
Use regex named capture groups with pd.String.str.extract
def ng(x):
return f'(?:{x}\|(?P<{x}>[^\n]+))?\n?'
fields = ['PERSID', 'SSO', 'STARTDATE', 'WAVE']
pat = ''.join(map(ng, fields))
df.drop('IDs', axis=1).join(df['IDs'].str.extract(pat))
Country Name PERSID SSO STARTDATE WAVE
0 USA John 12345 John123 20210101 WAVE39
1 UK Jane 25478 Jane123 20210101 WAVE40
2 CA Jill 12345 NaN 20210201 WAVE41
Setup
Credit to #JoeFerndz for sample df.
NOTE: this sample has missing values in some 'IDs'.
df = pd.DataFrame([
["USA", "John","""PERSID|12345
SSO|John123
STARTDATE|20210101
WAVE|WAVE39"""],
["UK", "Jane", """PERSID|25478
SSO|Jane123
STARTDATE|20210101
WAVE|WAVE40"""],
["CA", "Jill", """PERSID|12345
STARTDATE|20210201
WAVE|WAVE41"""]], columns=['Country', 'Name', 'IDs'])
I have a dataframe which looks like:
df =
|Name Nationality Family etc.....
0|John Born in Spain. Wife
1|nan But live in England son
2|nan nan daughter
Some columns only have one row but others have multiple answers over a few rows, how could i merge the rows onto each other so it would look something like the below:
df =
|Name Nationality Family etc....
0|John Born in Spain. But live in England Wife Son Daughter
Perhaps this will do it for you:
import pandas as pd
# your dataframe
df = pd.DataFrame(
{'Name': ['John', np.nan, np.nan],
'Nationality': ['Born in Spain.', 'But live in England', np.nan],
'Family': ['Wife', 'son', 'daughter']})
def squeeze_df(df):
new_df = {}
for col in df.columns:
new_df[col] = [df[col].str.cat(sep=' ')]
return pd.DataFrame(new_df)
squeeze_df(df)
# >> out:
# Name Nationality Family
# 0 John Born in Spain. But live in England Wife son daughter
I made the assumption that you only need to do this for one single person (i.e. squeezing/joining the rows of the dataframe into a single row). Also, what does "etc...." mean? For example, will you have integer or floating point values in the dataframe?
suppose I have a dataframe which contains:
df = pd.DataFrame({'Name':['John', 'Alice', 'Peter', 'Sue'],
'Job': ['Dentist', 'Blogger', 'Cook', 'Cook'],
'Sector': ['Health', 'Entertainment', '', '']})
and I want to find all 'cooks', whether in capital letters or not and assign them to the column 'Sector' with a value called 'gastronomy', how do I do that? And without overwriting the other entries in the column 'Sector'? Thanks!
Here's one approach:
df.loc[df.Job.str.lower().eq('cook'), 'Sector'] = 'gastronomy'
print(df)
Name Job Sector
0 John Dentist Health
1 Alice Blogger Entertainment
2 Peter Cook gastronomy
3 Sue Cook gastronomy
Using Series.str.match with regex and a regex flag for not case sensitive (?i):
df.loc[df['Job'].str.match('(?i)cook'), 'Sector'] = 'gastronomy'
Output
Name Job Sector
0 John Dentist Health
1 Alice Blogger Entertainment
2 Peter Cook gastronomy
3 Sue Cook gastronomy