Pandas Number of Unique Values from 2 Fields - python

I am trying to count the unique values across a combination of 2 fields, for example last name and first name, in a data frame.
When I do the following, I just get the number of unique values in each column separately (in this case, Last and First), not the number of unique combinations.
df[['Last Name','First Name']].nunique()
Thanks!

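Group by both columns first, and then count the groups with ngroups (calling nunique() on the groupby counts unique values of the remaining columns within each group, not the number of combinations)
>>> df.groupby(['First Name', 'Last Name']).ngroups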

IIUC, you could use value_counts() for that:
df[['Last Name','First Name']].value_counts().size
3
For another example, if you start with this extended data frame that contains some dups:
Last Name First Name
0 Smith Bill
1 Johnson Bill
2 Smith John
3 Curtis Tony
4 Taylor Elizabeth
5 Smith Bill
6 Johnson Bill
7 Smith Bill
Then value_counts() gives you the counts by unique composite last-first name:
df[['Last Name','First Name']].value_counts()
Last Name First Name
Smith Bill 3
Johnson Bill 2
Curtis Tony 1
Smith John 1
Taylor Elizabeth 1
Then the size of that result (a Series) will give you the number of unique composite last-first names:
df[['Last Name','First Name']].value_counts().size
5
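As a cross-check on the value_counts() approach above, here is a minimal sketch of two other ways to count the composite names. The data frame is rebuilt from the example rows; the variable names cols, n_by_drop and n_by_groups are just for illustration.

```python
import pandas as pd

# Rebuild the extended example frame from the answer above.
df = pd.DataFrame({
    "Last Name": ["Smith", "Johnson", "Smith", "Curtis",
                  "Taylor", "Smith", "Johnson", "Smith"],
    "First Name": ["Bill", "Bill", "John", "Tony",
                   "Elizabeth", "Bill", "Bill", "Bill"],
})

cols = ["Last Name", "First Name"]

# Option 1: drop duplicate (last, first) pairs and count what is left.
n_by_drop = len(df[cols].drop_duplicates())

# Option 2: count the groups formed by the two columns.
n_by_groups = df.groupby(cols).ngroups

print(n_by_drop, n_by_groups)
```

Both should agree with value_counts().size on this data. One difference to keep in mind: groupby skips rows whose key is missing by default (dropna=True), while drop_duplicates keeps them.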

Related

How do I append a column in a dataframe and give each unique string a number?

I'm looking to append a column in a pandas data frame that is similar to the following "Identifier" column:
Name Age Identifier
Peter Pan 13 PanPe
James Jones 24 JonesJa
Peter Pan 22 PanPe
Chris Smith 19 SmithCh
I need the "Identifier" column to look like:
Identifier
PanPe01
JonesJa01
PanPe02
SmithCh01
How would I number each original string with 01? And if there are duplicates (for example Peter Pan), then the following duplicate strings (after the original 01) will have 02, 03, and so forth?
I've been referred to the following theory:
combo = "PanPe"
counts = {}
if combo in counts:
    count = counts[combo]
    counts[combo] = count + 1
else:
    counts[combo] = 1
However, getting a good example of code would be ideal, as I am relatively new to Python, and would love to know the syntax as how to implement an entire column iterated through this process, instead of just one string as shown above with "PanPe".
You can use cumcount here:
df['new_Identifier']=df['Identifier'] + (df.groupby('Identifier').cumcount() + 1).astype(str).str.pad(2, 'left', '0') #thanks #dm2 for the str.pad part
Output:
Name Age Identifier new_Identifier
0 Peter Pan 13 PanPe PanPe01
1 James Jones 24 JonesJa JonesJa01
2 Peter Pan 22 PanPe PanPe02
3 Chris Smith 19 SmithCh SmithCh01
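For comparison, the dictionary-counter idea from the question can be applied to the whole column with a plain loop. This is only a sketch on the example data; the names counts and numbered are illustrative, and for larger frames the cumcount version is likely the better choice.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "Name": ["Peter Pan", "James Jones", "Peter Pan", "Chris Smith"],
    "Age": [13, 24, 22, 19],
    "Identifier": ["PanPe", "JonesJa", "PanPe", "SmithCh"],
})

counts = {}    # how many times each identifier has been seen so far
numbered = []  # identifier plus zero-padded running number

for combo in df["Identifier"]:
    counts[combo] = counts.get(combo, 0) + 1
    # :02d pads the running number to two digits (01, 02, ...).
    numbered.append(f"{combo}{counts[combo]:02d}")

df["new_Identifier"] = numbered
print(df)
```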
Thank you #dm2 and #Bushmaster

How to append a new row in a dataframe by searching for an existing column value without iterating?

I'm trying to find the best way to replace a single row with several new rows when a column contains a certain value.
Example Dataframe
Index Person Drink_Order
1 Sam Jack and Coke
2 John Coke
3 Steve Dr. Pepper
I'd like to search the DataFrame for Jack and Coke, remove it and add 2 new records as Jack and Coke are 2 different drink sources.
Index Person Drink_Order
2 John Coke
3 Steve Dr. Pepper
4 Sam Jack Daniels
5 Sam Coke
Example code that I want to replace, as my understanding is that you should never modify rows you are iterating over:
for index, row in df.loc[df['Drink_Order'].str.contains('Jack and Coke')].iterrows():
    df.loc[len(df)] = [row['Person'], 'Jack Daniels']
    df.loc[len(df)] = [row['Person'], 'Coke']
df = df[df['Drink_Order'] != 'Jack and Coke']
Split on ' and ' (with the surrounding spaces, so the pieces carry no stray whitespace). That gives a list for each row. Explode the lists so that each element appears as an individual row. Then conditionally rename Jack to Jack Daniels.
import numpy as np
df = df.assign(Drink_Order=df['Drink_Order'].str.split(' and ')).explode('Drink_Order')
df['Drink_Order'] = np.where(df['Drink_Order'].str.contains('Jack'), 'Jack Daniels', df['Drink_Order'])
Index Person Drink_Order
0 1 Sam Jack Daniels
0 1 Sam Coke
1 2 John Coke
2 3 Steve Dr. Pepper
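Here is a self-contained sketch of the same split-and-explode idea on the example data. It assumes the orders are joined by the literal text " and ", and it swaps the np.where step for an exact-match replace, so only a piece that is exactly "Jack" gets renamed. The name out is illustrative.

```python
import pandas as pd

# Example data from the question.
df = pd.DataFrame({
    "Person": ["Sam", "John", "Steve"],
    "Drink_Order": ["Jack and Coke", "Coke", "Dr. Pepper"],
})

# Split each order into a list, then give every list element its own row.
out = (
    df.assign(Drink_Order=df["Drink_Order"].str.split(" and "))
      .explode("Drink_Order")
      .reset_index(drop=True)
)

# Exact-match rename: only values equal to "Jack" are changed.
out["Drink_Order"] = out["Drink_Order"].replace({"Jack": "Jack Daniels"})
print(out)
```

The reset_index(drop=True) call is optional; without it the exploded rows keep the index label of the row they came from, as in the output above.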

Converting 2 columns of names into 4 columns of names using Pandas

I have an Excel file that consists of two columns: last_name, first_name. The list is sorted by years of experience. I would like to create a new Excel file (or text file) that prints the names two-by-two.
Last First
Smith Joe
Jones Mary
Johnson Ken
etc
and converts it to
Smith Joe Jones Mary
Johnson Ken etc.
effectively printing every other name on the same row as the name above.
I have reached the point where the names can be printed into a single set of columns, but I can't move every other name to adjacent columns.
Thanks
TRY:
result = pd.concat([df.iloc[::2].reset_index(drop=True),
                    df.iloc[1::2].reset_index(drop=True)], axis=1)
OUTPUT:
Last First Last First
0 Smith Joe Jones Mary
1 Johnson Ken etc None
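The concat above produces two columns named Last and two named First, which can be awkward to select later. A possible variation, sketched here on three of the example names, adds a suffix to each half; the suffixes _1 and _2 are an arbitrary choice.

```python
import pandas as pd

# Three of the example names, so the last row has no partner.
df = pd.DataFrame({
    "Last": ["Smith", "Jones", "Johnson"],
    "First": ["Joe", "Mary", "Ken"],
})

# Even-positioned rows go left, odd-positioned rows go right.
left = df.iloc[::2].reset_index(drop=True).add_suffix("_1")
right = df.iloc[1::2].reset_index(drop=True).add_suffix("_2")

# Align the two halves side by side on their fresh 0..n-1 index.
result = pd.concat([left, right], axis=1)
print(result)
```

With an odd number of names, the right-hand cells of the last row are left missing.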

Pandas - Generate Unique ID based on row values

I would like to generate an integer-based unique ID for users (in my df).
Let's say I have:
index first last dob
0 peter jones 20000101
1 john doe 19870105
2 adam smith 19441212
3 john doe 19870105
4 jenny fast 19640822
I would like to generate an ID column like so:
index first last dob id
0 peter jones 20000101 1244821450
1 john doe 19870105 1742118427
2 adam smith 19441212 1841181386
3 john doe 19870105 1742118427
4 jenny fast 19640822 1687411973
10 digit ID, but it's based on the value of the fields (john doe identical row values get the same ID).
I've looked into hashing, encrypting, UUID's but can't find much related to this specific non-security use case. It's just about generating an internal identifier.
I can't use groupby/cat code type methods in case the order of the rows changes.
The dataset won't grow beyond 50k rows.
Safe to assume there won't be a first, last, dob duplicate.
Feel like I may be tackling this the wrong way as I can't find much literature on it!
Thanks
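You can try using the hash function.
df['id'] = df[['first', 'last']].sum(axis=1).map(hash)
Please note that the resulting hash is an integer that usually has more than 10 digits and can be negative. Python also randomizes string hashes per process unless PYTHONHASHSEED is set, so these ids are not stable across runs.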
Here's a way of doing it using numpy
import numpy as np
np.random.seed(1)
# create a list of unique names
names = df[['first', 'last']].agg(' '.join, 1).unique().tolist()
# generate ids
ids = np.random.randint(low=1e9, high=1e10, size = len(names))
# maps ids to names
maps = {k:v for k,v in zip(names, ids)}
# add new id column
df['id'] = df[['first', 'last']].agg(' '.join, 1).map(maps)
index first last dob id
0 0 peter jones 20000101 9176146523
1 1 john doe 19870105 8292931172
2 2 adam smith 19441212 4108641136
3 3 john doe 19870105 8292931172
4 4 jenny fast 19640822 6385979058
You can apply the function below to your data frame column.
def generate_id(s):
    return abs(hash(s)) % (10 ** 10)
df['id'] = df['first'].apply(generate_id)
If you find that some values do not have the exact number of digits, you can do something like this:
def generate_id(s, size):
    val = str(abs(hash(s)) % (10 ** size))
    if len(val) < size:
        diff = size - len(val)
        val = str(val) + str(generate_id(s[:diff], diff))
    return int(val)
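Since the built-in hash() of a string changes between Python processes, an id that must stay the same across runs and row orders could instead be derived from a cryptographic digest. The following is a sketch, not the approach of any answer above: stable_id is a hypothetical helper, and the "|" separator and the choice of SHA-256 are assumptions.

```python
import hashlib

import pandas as pd


def stable_id(*fields, digits=10):
    """Hypothetical helper: map field values to an integer below 10**digits."""
    # Join with a separator so ("ab", "c") and ("a", "bc") differ.
    key = "|".join(str(f) for f in fields)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % (10 ** digits)


# Example data from the question.
df = pd.DataFrame({
    "first": ["peter", "john", "adam", "john", "jenny"],
    "last": ["jones", "doe", "smith", "doe", "fast"],
    "dob": [20000101, 19870105, 19441212, 19870105, 19640822],
})

# Identical (first, last, dob) rows always get the same id.
df["id"] = [
    stable_id(f, l, d) for f, l, d in zip(df["first"], df["last"], df["dob"])
]
print(df)
```

Two caveats: an id can come out with fewer than 10 digits, and squeezing a digest into 10 digits makes collisions between different people possible, so it is worth comparing the number of distinct ids against the number of distinct (first, last, dob) combinations.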

How to combine two dataframes and have unique key column using Pandas?

I have two dataframes with the same columns that I need to combine:
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
and
first_name last_name
0 Billy Bonder
1 Brian Black
2 Bran Balwner
When I do this:
df_new = pd.concat([df1, df2])
I get this:
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
0 Billy Bonder
1 Brian Black
2 Bran Balwner
Is there a way to have the left column have a unique number like this?
first_name last_name
0 Alex Anderson
1 Amy Ackerman
2 Allen Ali
3 Billy Bonder
4 Brian Black
5 Bran Balwner
If not, how can I add a new key column with numbers from 1 to whatever the row count is?
As said earlier by #MaxU, you can use ignore_index=True.
If you don't need to keep the original indexes, pass the parameter ignore_index=True after the [dataframe1, dataframe2] list; the combined rows are then numbered from 0 to n-1.
You can check whether index values are being repeated with the parameter verify_integrity=True; it raises a ValueError if the concatenated index contains duplicates (you never know when you'll have to check).
But be careful, because this check can be a little slow depending on the size of your DataFrame.
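A short sketch of both parameters on the example frames, which also covers the second part of the question (a key column counting from 1). The column name key and the flag overlap are illustrative.

```python
import pandas as pd

# Example frames from the question.
df1 = pd.DataFrame({
    "first_name": ["Alex", "Amy", "Allen"],
    "last_name": ["Anderson", "Ackerman", "Ali"],
})
df2 = pd.DataFrame({
    "first_name": ["Billy", "Brian", "Bran"],
    "last_name": ["Bonder", "Black", "Balwner"],
})

# ignore_index=True discards the old labels and numbers the rows 0..5.
df_new = pd.concat([df1, df2], ignore_index=True)

# Optional explicit key column running from 1 to the row count.
df_new["key"] = range(1, len(df_new) + 1)

# verify_integrity=True raises ValueError when index labels repeat,
# which happens here because both frames use the labels 0, 1, 2.
try:
    pd.concat([df1, df2], verify_integrity=True)
    overlap = False
except ValueError:
    overlap = True

print(df_new)
print("duplicate index labels:", overlap)
```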
