So let's say I have a pandas DataFrame:
In [1]: import pandas as pd
...
In [4]: df
Out[4]:
Person_ID First_Name Last_Name Phone_Number Email
1 A456 John Doe None None
2 A456 John Doe 123-123-1234 john.doe@test.com
3 A890 Joe Dirt 321-321-4321 None
4 A890 Joe Dirt None joe@email.com
and I would like to cook up some transformation to turn it into this:
Person_ID First_Name Last_Name Phone_Number Email
1 A456 John Doe 123-123-1234 john.doe@test.com
2 A890 Joe Dirt 321-321-4321 joe@email.com
i.e. I would like to take a data frame which may have multiple rows for the same person (sharing a Person_ID) but with missing entries in different places, and combine those entries into a single row containing all of the information.
Importantly, this is not just a task of filtering out the rows with more None values in them: in my toy example, lines 3 and 4 have an equal number of Nones, but the data is populated in different places.
Would anyone have advice on how to go about doing this?
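For anyone who wants to reproduce this, here is the toy frame as code (a sketch on my part, assuming the blanks are None and a 1-based index to match the display):
import pandas as pd
# reconstruction of the toy frame from the question, with None for the gaps
df = pd.DataFrame({
    'Person_ID':    ['A456', 'A456', 'A890', 'A890'],
    'First_Name':   ['John', 'John', 'Joe', 'Joe'],
    'Last_Name':    ['Doe', 'Doe', 'Dirt', 'Dirt'],
    'Phone_Number': [None, '123-123-1234', '321-321-4321', None],
    'Email':        [None, 'john.doe@test.com', None, 'joe@email.com'],
}, index=[1, 2, 3, 4])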
Try groupby with ffill and bfill to fill the missing values within each group, and then drop_duplicates:
df1 = df.groupby('Person_ID').apply(lambda x: x.ffill().bfill()).drop_duplicates()
print(df1)
Person_ID First_Name Last_Name Phone_Number Email
1 A456 John Doe 123-123-1234 john.doe@test.com
3 A890 Joe Dirt 321-321-4321 joe@email.com
You could also keep just the first row of each filled group:
In [1571]: df.groupby('Person_ID', as_index=False).apply(
lambda x: x.ffill().bfill().iloc[0])
Out[1571]:
Person_ID First_Name Last_Name Phone_Number Email
0 A456 John Doe 123-123-1234 john.doe@test.com
1 A890 Joe Dirt 321-321-4321 joe@email.com
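Since GroupBy.first() returns the first non-null value in each column, the same result can be had without the lambda at all. A minimal alternative sketch (my addition, not from the answers above):
# GroupBy.first() skips nulls, so every column picks up its first populated value
df1 = df.groupby('Person_ID', as_index=False).first()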
I have the following toy dataset df:
import pandas as pd
data = {
'id' : [1, 2, 3],
'name' : ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
    return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
id name google_search
0 1 John Smith https://www.google.com/search?q=John Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones
2 3 William Lee https://www.google.com/search?q=William Lee
Unfortunately, the newly created google_search column contains malformed URLs. Each URL should have a "+" between the first name and last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
df['foo'] = df['name'].str.split()
df['foo']
0 [John, Smith]
1 [Sally, Jones]
2 [William, Lee]
Name: foo, dtype: object
Now, joining them:
df['bar'] = ['+'.join(map(str, l)) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and modify your string manually, use a library that's guaranteed to give you the right result:
from urllib.parse import quote_plus
def create_hyperlink(search_string):
    return f"https://www.google.com/search?q={quote_plus(search_string)}"
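For illustration ('Ben & Jerry' is just a made-up input of mine), quote_plus replaces spaces with + and percent-encodes any other reserved characters, which a manual replacement would miss:
from urllib.parse import quote_plus

quote_plus('John Smith')   # -> 'John+Smith'
quote_plus('Ben & Jerry')  # -> 'Ben+%26+Jerry'; the '&' is encoded too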
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
df.name.str.replace(' ','+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee
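Note that str.replace(' ', '+') only handles spaces; if a name can contain other characters that are reserved in URLs, the quote_plus approach above is the safer choice.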
I have two df's, one with user names and another with real names. I'd like to check whether each user name in my first df contains a real name from the other df, and if so replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545', since those user names contain names that appear in my df2. I tried using Series.str.contains, but with it I can only verify whether a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!
You can use loc[row, column] (see the documentation for the loc method) together with the Series.str.contains method to select the user names you need to replace with the real names. In my opinion, this solution is clear in terms of readability.
for real_name in df2['realName'].to_list():
    df1.loc[df1['userName'].str.contains(real_name), 'userName'] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary
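A vectorized sketch of the same idea, on the assumption that at most one real name matches each user name and that the real names are plain words (the pattern-building and fillna step are my additions):
# build one alternation pattern from all real names, extract the first match,
# and fall back to the original user name where nothing matched
pattern = '(' + '|'.join(df2['realName']) + ')'
df1['userName'] = (df1['userName'].str.extract(pattern, expand=False)
                                  .fillna(df1['userName']))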
I have a quick one that I am struggling with.
Table 1 has a lot of user information in addition to an email column and a unique ID column.
Table 2 has only a unique ID column and an email column. These emails can be different from table 1, but do not have to be.
I am attempting to merge them so that table 1 only gains a new row when table 2 has a new email for the same unique id.
Example:
Table 1:
id email first_name last_name
1 jo@ joe king
2 john@ johnny maverick
3 Tom@ Tom J
Table 2:
id email
2 johnmk@
3 TomT@
8 Jared@
Desired Output:
id email first_name last_name
1 jo@ joe king
2 john@ johnny maverick
2 johnmk@ johnny maverick
3 Tom@ Tom J
3 TomT@ Tom J
I would have expected pd.merge(table1, table2, on='id', how='left') to do this, but that just generates email columns with the suffixes _x and _y.
How can I make the merge?
IIUC, you can try pd.concat with a boolean isin mask on df2, then fill in the names with groupby.ffill:
out = pd.concat((df1, df2[df2['id'].isin(df1['id'])]), sort=False)
out.update(out.groupby("id").ffill())
out = out.sort_values("id")  # .reset_index(drop=True)
id email first_name last_name
0 1 jo@ joe king
1 2 john@ johnny maverick
0 2 johnmk@ johnny maverick
2 3 Tom@ Tom J
1 3 TomT@ Tom J
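An alternative sketch using a plain merge (my variant, keeping the answer's df1/df2 names): drop df1's email column before merging so the two email columns don't collide, then stack the new rows under the originals.
# the inner merge copies first_name/last_name onto df2's new emails
# and drops id 8, which exists only in df2
new_rows = df1.drop(columns='email').merge(df2, on='id', how='inner')
out = pd.concat([df1, new_rows]).sort_values('id', kind='stable').reset_index(drop=True)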
I have a df that looks like this
first_name last_name
John Doe
Kelly Stevens
Dorey Chang
and another that looks like this
name email
John Doe jdoe23@gmail.com
Kelly M Stevens kelly.stevens@hotmail.com
D Chang chang79@yahoo.com
I want to merge these 2 tables so that the end result is
first_name last_name email
John Doe jdoe23@gmail.com
Kelly Stevens kelly.stevens@hotmail.com
Dorey Chang chang79@yahoo.com
I can't merge on name, but all emails contain each person's last name, even if the overall format differs. Is there a way to merge these using only a partial string match?
I've tried things like this with no success:
df1['email']= df2[df2['email'].str.contains(df['last_name'])==True]
IIUC, you can do this with a merge on the result of an extract:
df1.merge(df2.assign(last_name=df2['name'].str.extract(r' (\w+)$', expand=False))
          .drop('name', axis=1),
          on='last_name',
          how='left')
Output:
first_name last_name email
0 John Doe jdoe23@gmail.com
1 Kelly Stevens kelly.stevens@hotmail.com
2 Dorey Chang chang79@yahoo.com
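If capitalization could ever differ between last_name and the name column, a hedged variant that joins on a lowercased helper column (the _key name is mine):
# lowercase both sides of the join key so 'Chang' and 'chang' still match
left = df1.assign(_key=df1['last_name'].str.lower())
right = df2.assign(_key=df2['name'].str.extract(r'(\w+)$', expand=False).str.lower())
out = left.merge(right[['_key', 'email']], on='_key', how='left').drop(columns='_key')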
I have two dataframes with the same columns that I need to combine:
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
and
first_name last_name
0 Billy Bonder
1 Brian Black
2 Bran Balwner
When I do this:
df_new = pd.concat([df1, df2])
I get this:
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
0 Billy Bonder
1 Brian Black
2 Bran Balwner
Is there a way to have the left column (the index) hold a unique number for every row, like this?
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
3 Billy Bonder
4 Brian Black
5 Bran Balwner
If not, how can I add a new key column with numbers from 1 to whatever the row count is?
As said earlier by @MaxU, you can use ignore_index=True.
If you don't want to keep the original indexes of the two tables, pass ignore_index=True to pd.concat after the list [dataframe1, dataframe2]; the result is renumbered from 0.
You can check whether any index labels repeat with the parameter verify_integrity=True; it raises a ValueError if they do (you never know when you'll have to check).
But be careful, because this check can be a little slow depending on the size of your DataFrame.
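A minimal sketch putting both parameters together, using the frames from the question:
import pandas as pd

df1 = pd.DataFrame({'first_name': ['Alex', 'Amy', 'Allen'],
                    'last_name': ['Anderson', 'Ackerman', 'Ali']})
df2 = pd.DataFrame({'first_name': ['Billy', 'Brian', 'Bran'],
                    'last_name': ['Bonder', 'Black', 'Balwner']})

# ignore_index=True discards both original indexes and renumbers 0..n-1
df_new = pd.concat([df1, df2], ignore_index=True)

# verify_integrity=True (without ignore_index) would raise a ValueError here,
# because the labels 0, 1, 2 appear in both frames:
# pd.concat([df1, df2], verify_integrity=True)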