How to arrange duplicate rows of data into one row - python

I am trying to arrange duplicate data into one row using Python.
Let me show you an example:
The "Original" dataframe has duplicate data.
The "Goal" is what I am trying to accomplish.
How do I go about doing this?
If I use pandas, what would it look like?
By the way, I am getting the original data from a CSV file.
Original:

PatientID  Model#  Ear    SerNum   FName  LName  PName     PPhone
P99999     300     Left   1234567  John   Doe    Jane Doe  (999) 111-2222
P99999     400     Right  2345678  John   Doe    Jane Doe  (999) 111-2222

Goal:

PID     ModelL  SerNumL  ModelR  SerNumR  FName  LName  PName  PPhone
P99999  300     1234567  400     2345678  John   Doe    J.Doe  (999) 111-2222
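Since the data comes from a CSV file, loading it is a single pd.read_csv call. A minimal sketch, using an in-memory buffer to stand in for the file (the filename patients.csv is an assumption):

```python
import io

import pandas as pd

# Stand-in for the real file; in practice: df = pd.read_csv('patients.csv')
csv_text = """PatientID,Model#,Ear,SerNum,FName,LName,PName,PPhone
P99999,300,Left,1234567,John,Doe,Jane Doe,(999) 111-2222
P99999,400,Right,2345678,John,Doe,Jane Doe,(999) 111-2222"""

# dtype=str keeps IDs and serial numbers as strings instead of integers
df = pd.read_csv(io.StringIO(csv_text), dtype=str)
print(df.shape)  # (2, 8)
```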

First we split the data into left and right. After that we use pandas.DataFrame.merge to bring it back together with the correct suffixes:
df_L = df[df.Ear == 'Left'].drop('Ear',axis=1)
df_R = df[df.Ear == 'Right'].drop('Ear', axis=1)
print(df_L, '\n')
print(df_R)
  PatientID Model#   SerNum FName LName     PName          PPhone
0    P99999    300  1234567  John   Doe  Jane Doe  (999) 111-2222

  PatientID Model#   SerNum FName LName     PName          PPhone
1    P99999    400  2345678  John   Doe  Jane Doe  (999) 111-2222
Now we can merge back and give the correct suffixes:
df = pd.merge(df_L, df_R.iloc[:, :3], on='PatientID', suffixes=['Left', 'Right'])
print(df)

  PatientID Model#Left SerNumLeft FName LName     PName          PPhone Model#Right SerNumRight
0    P99999        300    1234567  John   Doe  Jane Doe  (999) 111-2222         400     2345678
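For reference, the split-and-merge steps above as one self-contained sketch (building the frame inline instead of reading the CSV):

```python
import pandas as pd

columns = ['PatientID', 'Model#', 'Ear', 'SerNum', 'FName', 'LName', 'PName', 'PPhone']
data = [['P99999', '300', 'Left', '1234567', 'John', 'Doe', 'Jane Doe', '(999) 111-2222'],
        ['P99999', '400', 'Right', '2345678', 'John', 'Doe', 'Jane Doe', '(999) 111-2222']]
df = pd.DataFrame(data, columns=columns)

# Split by ear, then merge the right side's ID/model/serial back on PatientID
df_L = df[df.Ear == 'Left'].drop('Ear', axis=1)
df_R = df[df.Ear == 'Right'].drop('Ear', axis=1)
out = pd.merge(df_L, df_R.iloc[:, :3], on='PatientID', suffixes=['Left', 'Right'])
print(out.shape)  # (1, 9)
```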

The best source is the official documentation:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html
You may also want to learn about multiindex, levels, etc.
I prefer join:
import pandas as pd
columns = ['PatientID', 'Model#', 'Ear', 'SerNum', 'FName', 'LName', 'PName', 'PPhone']
data = [['P99999', '300', 'Left', '1234567', 'John', 'Doe', 'Jane Doe', '(999) 111-2222'],
        ['P99999', '400', 'Right', '2345678', 'John', 'Doe', 'Jane Doe', '(999) 111-2222']]
df = pd.DataFrame(data=data, columns=columns)
df = df.set_index('PatientID')
df = (df[df['Ear'] == 'Left'].drop('Ear', axis=1)
      .join(df[df['Ear'] == 'Right'].drop('Ear', axis=1),
            lsuffix='_left', rsuffix='_right')
      .reset_index())
Output:
PatientID Model#_left SerNum_left ... LName_right PName_right PPhone_right
0 P99999 300 1234567 ... Doe Jane Doe (999) 111-2222

This is more like a pivot problem, so I use pivot_table here:
s = df.pivot_table(index=['PatientID', 'FName', 'LName', 'PName', 'PPhone'],
                   columns='Ear', values=['Model#', 'SerNum'], aggfunc='first')
s.columns = s.columns.map(' '.join)
s.reset_index(inplace=True)
s
PatientID FName LName ... Model# Right SerNum Left SerNum Right
0 P99999 John Doe ... 400 1234567 2345678
[1 rows x 9 columns]
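The pivot_table answer, runnable end to end on the question's data:

```python
import pandas as pd

columns = ['PatientID', 'Model#', 'Ear', 'SerNum', 'FName', 'LName', 'PName', 'PPhone']
data = [['P99999', '300', 'Left', '1234567', 'John', 'Doe', 'Jane Doe', '(999) 111-2222'],
        ['P99999', '400', 'Right', '2345678', 'John', 'Doe', 'Jane Doe', '(999) 111-2222']]
df = pd.DataFrame(data, columns=columns)

s = df.pivot_table(index=['PatientID', 'FName', 'LName', 'PName', 'PPhone'],
                   columns='Ear', values=['Model#', 'SerNum'], aggfunc='first')
# Flatten the ('Model#', 'Left')-style MultiIndex columns into single strings
s.columns = s.columns.map(' '.join)
s = s.reset_index()
print(s.shape)  # (1, 9)
```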

Related

Pandas: a Pythonic way to create a hyperlink from a value stored in another column of the dataframe

I have the following toy dataset df:
import pandas as pd
data = {
    'id': [1, 2, 3],
    'name': ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
    return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
   id         name                                google_search
0   1   John Smith   https://www.google.com/search?q=John Smith
1   2  Sally Jones  https://www.google.com/search?q=Sally Jones
2   3  William Lee  https://www.google.com/search?q=William Lee
Unfortunately, the newly created google_search column contains a malformed URL. The URL should have a "+" between the first name and the last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
df['foo'] = df['name'].str.split()
df['foo']
0     [John, Smith]
1    [Sally, Jones]
2    [William, Lee]
Name: foo, dtype: object
Now, joining them:
df['bar'] = ['+'.join(map(str, l)) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and modify your string manually, use a library that's guaranteed to give you the right result:
from urllib.parse import quote_plus

def create_hyperlink(search_string):
    return f"https://www.google.com/search?q={quote_plus(search_string)}"
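A quick check of what quote_plus produces, including a name with a character that plain space-replacement would miss (the second name is just an illustration):

```python
from urllib.parse import quote_plus

import pandas as pd

def create_hyperlink(search_string):
    # quote_plus turns spaces into '+' and percent-encodes other unsafe characters
    return f"https://www.google.com/search?q={quote_plus(search_string)}"

df = pd.DataFrame({'name': ['John Smith', "Conan O'Brien"]})
df['google_search'] = df['name'].apply(create_hyperlink)
print(df['google_search'].tolist())
# ['https://www.google.com/search?q=John+Smith',
#  'https://www.google.com/search?q=Conan+O%27Brien']
```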
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
df.name.str.replace(' ','+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee

Merge based on partial string match in pandas dfs

I have a df that looks like this
first_name last_name
John Doe
Kelly Stevens
Dorey Chang
and another that looks like this
name email
John Doe         jdoe23@gmail.com
Kelly M Stevens  kelly.stevens@hotmail.com
D Chang          chang79@yahoo.com
To merge these 2 tables, such that the end result is
first_name last_name email
John   Doe      jdoe23@gmail.com
Kelly  Stevens  kelly.stevens@hotmail.com
Dorey  Chang    chang79@yahoo.com
I can't merge on name, but all emails contain each persons last name even if the overall format is different. Is there a way to merge these using only a partial string match?
I've tried things like this with no success:
df1['email']= df2[df2['email'].str.contains(df['last_name'])==True]
IIUC, you can merge on the result of an extract:
df1.merge(df2.assign(last_name=df2['name'].str.extract(r' (\w+)$', expand=False))
             .drop('name', axis=1),
          on='last_name',
          how='left')
Output:
  first_name last_name                      email
0       John       Doe           jdoe23@gmail.com
1      Kelly   Stevens  kelly.stevens@hotmail.com
2      Dorey     Chang          chang79@yahoo.com
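The extract-and-merge approach end to end (expand=False makes str.extract return a Series, which is what assign needs here):

```python
import pandas as pd

df1 = pd.DataFrame({'first_name': ['John', 'Kelly', 'Dorey'],
                    'last_name': ['Doe', 'Stevens', 'Chang']})
df2 = pd.DataFrame({'name': ['John Doe', 'Kelly M Stevens', 'D Chang'],
                    'email': ['jdoe23@gmail.com', 'kelly.stevens@hotmail.com',
                              'chang79@yahoo.com']})

# Pull the last word of each name out as a last_name key, then left-merge on it
out = df1.merge(df2.assign(last_name=df2['name'].str.extract(r' (\w+)$', expand=False))
                   .drop('name', axis=1),
                on='last_name', how='left')
print(out['email'].tolist())
```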

How to create conditional clause if column in dataframe is empty?

I have a df that looks like this:
fname          lname
joe            smith
john smith
jane@jane.com
jacky /jax     jack
a@a.com        non
john (jack)    smith
Bob J. Smith
I want to create logic that says: if lname is empty and there are two or three strings in fname, move the second or third string into the lname column. If fname is an email address, leave it as is; and if there are slashes or parentheses in fname and no value in lname, also leave it as is.
new df:
fname          lname
joe            smith
john           smith
jane@jane.com
jacky /jax     jack
a@a.com        non
john (jack)    smith
Bob J.         smith
Code so far to separate two strings:
df[['lname']] = df['fname'].loc[df['fname'].str.split().str.len() == 2].str.split(expand=True)
With the following sample dataframe:
df = pd.DataFrame({'fname': ['joe', 'john smith', 'jane@jane.com', 'jacky /jax',
                             'a@a.com', 'john (jack)', 'Bob J. Smith'],
                   'lname': ['smith', '', '', 'jack', 'non', 'smith', '']})
You can use np.where():
import numpy as np

conditions = (df['lname'] == '') & (df['fname'].str.split().str.len() > 1)
df['lname'] = np.where(conditions, df['fname'].str.split().str[-1].str.lower(), df['lname'])
Yields:
           fname  lname
0            joe  smith
1     john smith  smith
2  jane@jane.com
3     jacky /jax   jack
4        a@a.com    non
5    john (jack)  smith
6   Bob J. Smith  smith
To remove the last string from the fname column of the rows that had their lname column populated:
df['fname'] = np.where(conditions, df['fname'].str.split().str[:-1].str.join(' '), df['fname'])
Yields:
           fname  lname
0            joe  smith
1           john  smith
2  jane@jane.com
3     jacky /jax   jack
4        a@a.com    non
5    john (jack)  smith
6         Bob J.  smith
If I understand correctly you have a dataframe with columns fname and lname. If so then you can modify empty rows in column lname with:
condition = (df.loc[:, 'lname'] == '') & (df.loc[:, 'fname'].str.contains(' '))
df.loc[condition, 'lname'] = df.loc[condition, 'fname'].str.split().str[-1]
The code works for the sample data provided in the question but should be improved for more general use.
To modify column fname you may use:
df.loc[condition, 'fname'] = df.loc[condition, 'fname'].str.split().str[:-1].str.join(sep=' ')
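The steps above can be combined into one runnable sketch on the sample data (using the np.where version; the .loc version behaves the same here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'fname': ['joe', 'john smith', 'jane@jane.com', 'jacky /jax',
                             'a@a.com', 'john (jack)', 'Bob J. Smith'],
                   'lname': ['smith', '', '', 'jack', 'non', 'smith', '']})

# Only touch rows where lname is empty and fname holds more than one word
conditions = (df['lname'] == '') & (df['fname'].str.split().str.len() > 1)
df['lname'] = np.where(conditions, df['fname'].str.split().str[-1].str.lower(), df['lname'])
df['fname'] = np.where(conditions, df['fname'].str.split().str[:-1].str.join(' '), df['fname'])
print(df.loc[6].tolist())  # ['Bob J.', 'smith']
```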

How to combine sparse rows in a pandas dataframe

So let's say I have a pandas data frame:
In [1]: import pandas as pd
...
In [4]: df
Out[3]:
  Person_ID First_Name Last_Name  Phone_Number              Email
1      A456       John       Doe          None               None
2      A456       John       Doe  123-123-1234  john.doe@test.com
3      A890        Joe      Dirt  321-321-4321               None
4      A890        Joe      Dirt          None      joe@email.com
and I would like to cook up some transformation to turn it into this:
  Person_ID First_Name Last_Name  Phone_Number              Email
1      A456       John       Doe  123-123-1234  john.doe@test.com
2      A890        Joe      Dirt  321-321-4321      joe@email.com
i.e. I would like to take a data frame which may have multiple rows corresponding to the same person (sharing a Person_ID) but with entries missing in different places, and combine those entries into a single row containing all of the information.
Importantly, this is not just a task of filtering out the rows with more None values: in my toy example, lines 3 and 4 have an equal number of Nones, but the data is populated in different places.
Would anyone have advice on how to go about doing this?
Try groupby with ffill and bfill, then drop_duplicates:
df1 = df.groupby('Person_ID').apply(lambda x: x.ffill().bfill()).drop_duplicates()
print (df1)
  Person_ID First_Name Last_Name  Phone_Number              Email
1      A456       John       Doe  123-123-1234  john.doe@test.com
3      A890        Joe      Dirt  321-321-4321      joe@email.com
You could also fill within each group and take the first row:
df.groupby('Person_ID', as_index=False).apply(
    lambda x: x.ffill().bfill().iloc[0])

  Person_ID First_Name Last_Name  Phone_Number              Email
0      A456       John       Doe  123-123-1234  john.doe@test.com
1      A890        Joe      Dirt  321-321-4321      joe@email.com
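Another option worth knowing: GroupBy.first() already returns the first non-null value in each column per group, so the whole collapse is a single call. A sketch on the same toy data:

```python
import pandas as pd

df = pd.DataFrame({
    'Person_ID': ['A456', 'A456', 'A890', 'A890'],
    'First_Name': ['John', 'John', 'Joe', 'Joe'],
    'Last_Name': ['Doe', 'Doe', 'Dirt', 'Dirt'],
    'Phone_Number': [None, '123-123-1234', '321-321-4321', None],
    'Email': [None, 'john.doe@test.com', None, 'joe@email.com'],
})

# first() skips nulls column-by-column within each group
out = df.groupby('Person_ID', as_index=False).first()
print(out.values.tolist())
```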

Merging two columns with different information, python

I have a dataframe with one column of last names, and one column of first names. How do I merge these columns so that I have one column with first and last names?
Here is what I have:
First Name (Column 1)
John
Lisa
Jim
Last Name (Column 2)
Smith
Brown
Dandy
This is what I want:
Full Name
John Smith
Lisa Brown
Jim Dandy
Thank you!
Try
df.assign(name=df.apply(' '.join, axis=1)).drop(['first name', 'last name'], axis=1)
You get
name
0 bob smith
1 john smith
2 bill smith
Here's a sample df:
df
first name last name
0 bob smith
1 john smith
2 bill smith
You can do the following to combine columns:
df['combined']= df['first name'] + ' ' + df['last name']
df
first name last name combined
0 bob smith bob smith
1 john smith john smith
2 bill smith bill smith
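A third idiom for the same thing is Series.str.cat, which concatenates two string columns element-wise, shown here on the question's own names:

```python
import pandas as pd

df = pd.DataFrame({'First Name': ['John', 'Lisa', 'Jim'],
                   'Last Name': ['Smith', 'Brown', 'Dandy']})
# str.cat joins the two columns row by row with the given separator
df['Full Name'] = df['First Name'].str.cat(df['Last Name'], sep=' ')
print(df['Full Name'].tolist())  # ['John Smith', 'Lisa Brown', 'Jim Dandy']
```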
