Add UUIDs to a pandas DataFrame - Python

Say I have a pandas DataFrame like so:
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
Name
0 John Doe
1 Jane Smith
2 John Doe
3 Jane Smith
4 Jack Dawson
5 John Doe
And I want to add a column with UUIDs that are the same if the name is the same. For example, the DataFrame above should become:
df:
Name UUID
0 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
1 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
2 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
3 Jane Smith a709bd1a-5f98-4d29-81a8-09de6e675b56
4 Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
5 John Doe 6d07cb5f-7faa-4893-9bad-d85d3c192f52
The UUIDs should be generated with the uuid.uuid4() function.
My current idea is to use groupby("Name").cumcount() to identify which rows have the same name and which don't. Then I'd create a dictionary keyed by the cumcount with a UUID as the value, and use that to add the UUIDs to the DataFrame.
While that would work, I'm wondering if there's a more efficient way to do this?
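For concreteness, a rough sketch of that idea (untested; note that groupby("Name").ngroup(), which gives each distinct name its own label, seems a better fit here than cumcount(), which just numbers the rows within each group):
import uuid
import pandas as pd

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith', 'Jack Dawson', 'John Doe']})
group_ids = df.groupby('Name').ngroup()                          # one integer label per distinct name
id_to_uuid = {gid: uuid.uuid4() for gid in group_ids.unique()}   # label -> fresh UUID
df['UUID'] = group_ids.map(id_to_uuid)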

Grouping the DataFrame and applying uuid.uuid4 per group will be more efficient than looping through the groups yourself. Since you want to keep the original shape of your DataFrame, you should use the pandas transform function.
Using your sample DataFrame, we'll first add a column so that there is a series to apply transform to. Since uuid.uuid4 doesn't take any arguments, it really doesn't matter what that column contains.
df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df.loc[:, "UUID"] = 1
Now to use transform:
import uuid
df.loc[:, "UUID"] = df.groupby("Name").UUID.transform(lambda g: uuid.uuid4())
+----+--------------+--------------------------------------+
| | Name | UUID |
+----+--------------+--------------------------------------+
| 0 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 1 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 2 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
| 3 | Jane Smith | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 4 | Jack Dawson | 6b843d0f-ba3a-4880-8a84-d98c4af09cc3 |
| 5 | John Doe | c032c629-b565-4903-be5c-81bf05804717 |
+----+--------------+--------------------------------------+
uuid.uuid4 is called only once per distinct group, so every row with the same name ends up with the same UUID.
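A quick sanity check (a sketch, assuming the df built above) that every name collapsed to a single UUID:
# each group should hold exactly one distinct UUID,
# and there should be one UUID per distinct name overall
assert df.groupby("Name")["UUID"].nunique().eq(1).all()
assert df["UUID"].nunique() == df["Name"].nunique()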

How about this:
names = df['Name'].unique()
for name in names:
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
You could shorten it to:
for name in df['Name'].unique():
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()
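As a sketch of an alternative (not taken from the answers above), the same idea works without the explicit .loc loop by building a name-to-UUID dictionary once and mapping it onto the column:
import uuid

# one fresh UUID per distinct name, then broadcast back onto the rows
uuid_map = {name: uuid.uuid4() for name in df['Name'].unique()}
df['UUID'] = df['Name'].map(uuid_map)
Like the transform approach, this calls uuid.uuid4() only once per distinct name.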

Related

Pandas: a Pythonic way to create a hyperlink from a value stored in another column of the dataframe

I have the following toy dataset df:
import pandas as pd
data = {
    'id': [1, 2, 3],
    'name': ['John Smith', 'Sally Jones', 'William Lee']
}
df = pd.DataFrame(data)
df
id name
0 1 John Smith
1 2 Sally Jones
2 3 William Lee
My ultimate goal is to add a column that represents a Google search of the value in the name column.
I do this using:
def create_hyperlink(search_string):
    return f'https://www.google.com/search?q={search_string}'
df['google_search'] = df['name'].apply(create_hyperlink)
df
id name google_search
0 1 John Smith https://www.google.com/search?q=John Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones
2 3 William Lee https://www.google.com/search?q=William Lee
Unfortunately, the newly created google_search column contains a malformed URL; it should have a "+" between the first name and the last name.
The google_search column should return the following:
https://www.google.com/search?q=John+Smith
It's possible to do this using split() and join().
df['foo'] = df['name'].str.split()
df['foo']
0 [John, Smith]
1 [Sally, Jones]
2 [William, Lee]
Name: foo, dtype: object
Now, joining them:
df['bar'] = ['+'.join(map(str, l)) for l in df['foo']]
df
id name google_search foo bar
0 1 John Smith https://www.google.com/search?q=John Smith [John, Smith] John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally Jones [Sally, Jones] Sally+Jones
2 3 William Lee https://www.google.com/search?q=William Lee [William, Lee] William+Lee
Lastly, creating the updated google_search column:
df['google_search'] = df['bar'].apply(create_hyperlink)
df
Is there a more elegant, streamlined, Pythonic way to do this?
Thanks!
Rather than reinvent the wheel and modify your string manually, use a library that's guaranteed to give you the right result:
from urllib.parse import quote_plus

def create_hyperlink(search_string):
    return f"https://www.google.com/search?q={quote_plus(search_string)}"
Use Series.str.replace:
df['google_search'] = 'https://www.google.com/search?q=' + \
                      df.name.str.replace(' ', '+')
print(df)
id name google_search
0 1 John Smith https://www.google.com/search?q=John+Smith
1 2 Sally Jones https://www.google.com/search?q=Sally+Jones
2 3 William Lee https://www.google.com/search?q=William+Lee

In Python, trying to remove a duplicate word in a dataframe, but getting an error

I'm trying to remove a duplicate word within a cell:
   Current        Desired
0  John and Jane  John and Jane
1  John and John  John
2  John           John
3  Jane and Jane  Jane
I have tried the following, but the Desired column gets filled with o d i c t _ k e y s ( [ ' n a n ' ] ):
from collections import OrderedDict
df['Current'] = (df['Desired'].astype(str).str.split()
                 .apply(lambda x: OrderedDict.fromkeys(x).keys())
                 .astype(str).str.join(' '))
I have also tried this, but the Desired column gets filled with NaN:
df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1')
Let us do split with set then join back
df['out'] = df.Current.str.split(' and ').map(lambda x : ' and '.join(set(x)))
df
Out[876]:
Current out
0 John and Jane Jane and John
1 John and John John
2 John John
3 Jane and Jane Jane
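Note that set doesn't preserve order, which is why row 0 came back as "Jane and John". If the original order matters, a sketch using dict.fromkeys (which keeps insertion order) instead of set:
df['out'] = df.Current.str.split(' and ').map(lambda x: ' and '.join(dict.fromkeys(x)))
This returns "John and Jane" for row 0 while still collapsing the duplicates.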

How to do text-to-columns in pandas and create new columns?

I have a csv file as shown below. The Name column has names separated by commas; I want to split them on the comma and append them as new columns, then write out the same csv, similar to Text to Columns in Excel. The problem is that rows have a varying number of names.
| Address | Name                |
| 1st st  | John, Smith         |
| 2nd st. | Andrew, Jane, Aaron |
My pandas code looks something like this:
df1 = pd.read_csv('sample.csv')
df1['Name'] = df1['Name'].str.split(',', expand=True)
df1.to_csv('results.csv',index=None)
Of course this doesn't work, because "columns must be same length as key". The expected output is:
| Address | Name   |       |       |
| 1st st  | John   | Smith |       |
| 2nd st. | Andrew | Jane  | Aaron |
Count the maximum number of parts after splitting on the comma, then assign to that many new columns accordingly:
max_commas = df['name'].str.split(',').transform(len).max()
df[[f'name_{x}' for x in range(max_commas)]] = df['name'].str.split(',', expand=True)
input df:
col name
0 1st st john, smith
1 2nd st andrew, jane, aron
2 3rd st harry, philip, anna, james
output:
col name name_0 name_1 name_2 name_3
0 1st st john, smith john smith None None
1 2nd st andrew, jane, aron andrew jane aron None
2 3rd st harry, philip, anna, james harry philip anna james
You can do
out = df.join(df['Name'].str.split(', ',expand=True).add_prefix('name_'))
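Rounding it out with the read/write flow from the question (a sketch; 'sample.csv' and 'results.csv' are just the file names used there):
import pandas as pd

df1 = pd.read_csv('sample.csv')
# split Name into one column per part, keeping the original columns alongside
out = df1.join(df1['Name'].str.split(', ', expand=True).add_prefix('name_'))
out.to_csv('results.csv', index=None)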

Python/Pandas - If Column A equals X or Y, then assign value from Col B. If not, then assign Col C. How to write in Python?

I'm having trouble formulating a statement in pandas that would be very simple in Excel. I have a dataframe sample as follows:
    colA      colB      colC
10  0         27:15     John Doe
11  0         24:33     John Doe
12  1         29:43     John Doe
13  Inactive  John Doe  None
14  N/A       John Doe  None
Obviously the dataframe is much larger than this, with 10,000+ rows, so I'm trying to find an easier way to do this. I want to create a column that checks whether colA is equal to 0 or 1. If so, the new column should equal colC; if not, it should equal colB. In Excel, I would simply create a new column (new_col) and write
=IF(OR(A2<>0,A2<>1),B2,C2)
And then drag fill the entire sheet.
I'm sure this is fairly simple but I cannot for the life of me figure this out.
Result should look like this
    colA      colB      colC      new_col
10  0         27:15     John Doe  John Doe
11  0         24:33     John Doe  John Doe
12  1         29:43     John Doe  John Doe
13  Inactive  John Doe  None      John Doe
14  N/A       John Doe  None      John Doe
np.where should do the trick.
import numpy as np

df['new_col'] = np.where(df['colA'].isin([0, 1]), df['colC'], df['colB'])
Here is a solution that appends your results to a list according to your conditions, then adds the list back to the dataframe as a colD column.
your_results = []
for i, data in enumerate(df["colA"]):
    if data == 0 or data == 1:
        your_results.append(df["colC"].iloc[i])
    else:
        your_results.append(df["colB"].iloc[i])
df["colD"] = your_results

Python, Pandas: Assign cells in dataframe to groups

Good morning, everyone!
I hope the title fits my question. I have a dataframe that looks something like this:
Dataframe (before):
id | name | position
1 | jane doe | position_1
2 | john doe | position_2
3 | john smith | position_3
I also have multiple lists of groups:
group_1 = ['position_3', 'position_18', 'position_45']
group_2 = ['position_2', 'position_9']
group_7 = ['position_1']
Now I'm wondering: what is the best way to add another column to my dataframe with the assigned group? For example:
Dataframe (after):
id | name | position | group
1 | jane doe | position_1 | group_7
2 | john doe | position_2 | group_2
3 | john smith | position_3 | group_1
Notes:
every position is unique and will never appear in more than one group!
You can create a mapping dictionary in which the key is the position and the value is the group name, then map this dictionary onto the position column:
dct = {'group_1': group_1, 'group_2': group_2, 'group_7': group_7}
mapping_dct = {pos:grp for grp, positions in dct.items() for pos in positions}
df['group'] = df['position'].map(mapping_dct)
>>> df
id name position group
0 1 jane doe position_1 group_7
1 2 john doe position_2 group_2
2 3 john smith position_3 group_1
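One hypothetical follow-up (the question guarantees every position belongs to exactly one group, so this is purely defensive): if a position ever turned up that isn't in any list, map would leave NaN in group, which fillna can catch:
df['group'] = df['position'].map(mapping_dct).fillna('unassigned')  # 'unassigned' is just an example label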
id = [1, 2, 3]
name = ['jane doe', 'john doe', 'john smith']
position = ['position_1', 'position_2', 'position_3']
df = pd.DataFrame({'id': id, 'name': name, 'position': position})

group_1 = ['position_3', 'position_18', 'position_45']
group_2 = ['position_2', 'position_9']
group_7 = ['position_1']
dct = {'group_1': group_1, 'group_2': group_2, 'group_7': group_7}

# return every group name whose list contains the given position
def lookup(itemParam):
    keys = []
    for key, item in dct.items():
        if itemParam in item:
            keys.append(key)
    return keys

mylist = [*map(lookup, df['position'])]   # list of matching groups per row
mylist = [x[0] for x in mylist]           # each position is in exactly one group
df['group'] = mylist
print(df.head())
output:
id name position group
0 1 jane doe position_1 group_7
1 2 john doe position_2 group_2
2 3 john smith position_3 group_1
