I would like to generate an integer-based unique ID for users (in my df).
Let's say I have:
index first last dob
0 peter jones 20000101
1 john doe 19870105
2 adam smith 19441212
3 john doe 19870105
4 jenny fast 19640822
I would like to generate an ID column like so:
index first last dob id
0 peter jones 20000101 1244821450
1 john doe 19870105 1742118427
2 adam smith 19441212 1841181386
3 john doe 19870105 1742118427
4 jenny fast 19640822 1687411973
A 10-digit ID, based on the values of the fields (rows with identical values, like the two john doe rows, get the same ID).
I've looked into hashing, encryption, and UUIDs, but can't find much related to this specific non-security use case; it's just about generating an internal identifier.
I can't use groupby/categorical-code type methods in case the order of the rows changes.
The dataset won't grow beyond 50k rows.
Safe to assume there won't be a first, last, dob duplicate.
Feel like I may be tackling this the wrong way as I can't find much literature on it!
Thanks
You can try using the built-in hash function.
df['id'] = df[['first', 'last']].sum(axis=1).map(hash)
Please note the hash id can be longer than 10 digits (and may be negative), and Python's built-in string hash is randomized per process (PYTHONHASHSEED), so the ids are only stable within a single run.
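If the ids need to be stable across runs, here's a minimal sketch using hashlib.md5, which is deterministic. The column names and sample values are taken from the question; truncating the digest to 10 digits is an assumption on my part, and it increases collision risk, so consider keeping the full digest if exact uniqueness matters.

```python
import hashlib

import pandas as pd

df = pd.DataFrame({
    'first': ['peter', 'john', 'adam', 'john', 'jenny'],
    'last':  ['jones', 'doe', 'smith', 'doe', 'fast'],
    'dob':   [20000101, 19870105, 19441212, 19870105, 19640822],
})

def stable_id(row):
    # md5 is deterministic across runs (unlike built-in hash);
    # truncate the digest to a 10-digit integer.
    key = f"{row['first']}|{row['last']}|{row['dob']}"
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (10 ** 10)

df['id'] = df.apply(stable_id, axis=1)
```

Identical rows hash to the same key, so the two john doe rows get the same id.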
Here's a way of doing it using numpy.
import numpy as np
np.random.seed(1)
# create a list of unique names
names = df[['first', 'last']].agg(' '.join, axis=1).unique().tolist()
# generate 10-digit ids (int64 so the upper bound fits on all platforms)
ids = np.random.randint(low=10**9, high=10**10, size=len(names), dtype=np.int64)
# map each name to its id
maps = dict(zip(names, ids))
# add new id column
df['id'] = df[['first', 'last']].agg(' '.join, axis=1).map(maps)
index first last dob id
0 0 peter jones 20000101 9176146523
1 1 john doe 19870105 8292931172
2 2 adam smith 19441212 4108641136
3 3 john doe 19870105 8292931172
4 4 jenny fast 19640822 6385979058
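One caveat: np.random.randint can draw the same value twice, so two different names could collide. A minimal sketch (using the newer default_rng API) that simply re-draws until all ids are distinct; the name list here is assumed from the question's data:

```python
import numpy as np

rng = np.random.default_rng(1)
names = ['peter jones', 'john doe', 'adam smith', 'jenny fast']

# Re-draw until every name has a distinct id; collisions in a
# 9-billion value range are rare even for 50k names, so this
# loop almost never repeats.
ids = rng.integers(10**9, 10**10, size=len(names))
while len(np.unique(ids)) < len(names):
    ids = rng.integers(10**9, 10**10, size=len(names))

maps = dict(zip(names, ids))
```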
You can apply the below function on your data frame column.
def generate_id(s):
    return abs(hash(s)) % (10 ** 10)

df['id'] = df['first'].apply(generate_id)
In case some values come out with fewer digits than expected, you can do something like this:
def generate_id(s, size):
    val = str(abs(hash(s)) % (10 ** size))
    if len(val) < size:
        diff = size - len(val)
        val = str(val) + str(generate_id(s[:diff], diff))
    return int(val)
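As an alternative sketch, zero-padding with str.zfill keeps a fixed width without the recursion (note that, like the above, this relies on the built-in hash, which is randomized per process, so the ids change between runs):

```python
def generate_id(s, size=10):
    # Zero-pad the truncated hash so the id always has exactly
    # `size` characters; built-in hash() is salted per process.
    return str(abs(hash(s)) % (10 ** size)).zfill(size)
```

This returns a string; converting to int would drop any leading zeros again, so keep it as a string if the fixed width matters.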
I have a dataframe with a single column which contains the names of people as shown below.
name
--------------
john doe
john david doe
doe henry john
john henry
I want to count the number of times each pair of words appears together in a name, regardless of order. In this example, the words john and doe appear together in three names: john doe, john david doe, and doe henry john.
Expected output
name1 | name2 | count
----------------------
david | doe | 1
doe | henry | 1
doe | john | 3
henry | john | 2
Notice that name1 is the word that comes first in alphabetical order. Currently I have a brute force solution.
Create a list of all unique words in the dataframe
For each unique word W in this list, filter the records in the original data frame that contain W
From the filtered records, count the frequency of the other words. This gives the number of times W appears with various other words
Question: This works fine for a small number of records but is not efficient for a large number of records, as it runs in quadratic complexity. How can I generate the output in a faster way? Is there any function or package that can give these counts?
Note: I tried using n-gram extraction from NLP packages, but this overestimates the counts because it internally appends all the names to form one long string, so the last word of one name and the first word of the next name show up as a sequence of words in the appended string, which adds to the count.
Here's a solution which is still quadratic but over smaller n and also hides most of it in compiled code (which should hopefully execute faster):
from collections import Counter
from itertools import combinations

combs = df['name'].apply(lambda x: list(combinations(sorted(x.split()), 2)))
counts = Counter(combs.explode())
res = pd.Series(counts).rename_axis(['name1', 'name2']).reset_index(name='count')
Output for your sample data:
name1 name2 count
0 doe john 3
1 david doe 1
2 david john 1
3 doe henry 1
4 henry john 2
I suggest the below:
import itertools
# 1) Create all unordered pairs of distinct words using itertools.combinations().
words = [w for name in df.name.tolist() for w in name.split()]
pairs = {tuple(sorted(p)) for p in itertools.combinations(set(words), 2)}
df_counts = pd.DataFrame(sorted(pairs), columns=["name1", "name2"])
# 2) Create a counts column by checking, for each candidate pair, how many
#    names contain both words, and summing the boolean outputs.
df_counts["counts"] = df_counts.apply(
    lambda row: sum(
        row["name1"] in name.split() and row["name2"] in name.split()
        for name in df.name.tolist()
    ),
    axis=1,
)
I hope this helps.
This problem is an instance of 'Association Rule Mining' with just 2 items per transaction. It has simple algorithms like 'Apriori' as well as efficient ones like 'FP-Growth', and you can find a lot of resources to learn them.
I am trying to find the number of unique values that cover 2 fields. So for example, a typical example would be last name and first name. I have a data frame.
When I do the following, I just get the number of unique fields for each column, in this case, Last and First. Not a composite.
df[['Last Name','First Name']].nunique()
Thanks!
Group by both columns first, then count the number of groups with ngroups
>>> df.groupby(['First Name', 'Last Name']).ngroups
IIUC, you could use value_counts() for that:
df[['Last Name','First Name']].value_counts().size
3
For another example, if you start with this extended data frame that contains some dups:
Last Name First Name
0 Smith Bill
1 Johnson Bill
2 Smith John
3 Curtis Tony
4 Taylor Elizabeth
5 Smith Bill
6 Johnson Bill
7 Smith Bill
Then value_counts() gives you the counts by unique composite last-first name:
df[['Last Name','First Name']].value_counts()
Last Name First Name
Smith Bill 3
Johnson Bill 2
Curtis Tony 1
Smith John 1
Taylor Elizabeth 1
Then the length of that frame will give you the number of unique composite last-first names:
df[['Last Name','First Name']].value_counts().size
5
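An equivalent sketch using drop_duplicates, with the extended sample frame from above recreated inline:

```python
import pandas as pd

df = pd.DataFrame({
    'Last Name':  ['Smith', 'Johnson', 'Smith', 'Curtis',
                   'Taylor', 'Smith', 'Johnson', 'Smith'],
    'First Name': ['Bill', 'Bill', 'John', 'Tony',
                   'Elizabeth', 'Bill', 'Bill', 'Bill'],
})

# Count distinct (last, first) pairs by dropping duplicate rows first.
n_unique = len(df[['Last Name', 'First Name']].drop_duplicates())
```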
I'm having trouble formulating this statement in Pandas that would be very simple in excel. I have a dataframe sample as follows:
colA colB colC
10 0 27:15 John Doe
11 0 24:33 John Doe
12 1 29:43 John Doe
13 Inactive John Doe None
14 N/A John Doe None
Obviously the dataframe is much larger than this, with 10,000+ rows, so I'm trying to find an easier way to do this. I want to create a column that checks if colA is equal to 0 or 1. If so, the new column equals colC. If not, it equals colB. In excel, I would simply create a new column (new_col) and write
=IF(OR(A2=0,A2=1),C2,B2)
And then drag fill the entire sheet.
I'm sure this is fairly simple but I cannot for the life of me figure this out.
Result should look like this
colA colB colC new_col
10 0 27:15 John Doe John Doe
11 0 24:33 John Doe John Doe
12 1 29:43 John Doe John Doe
13 Inactive John Doe None John Doe
14 N/A John Doe None John Doe
np.where should do the trick.
import numpy as np

df['new_col'] = np.where(df['colA'].isin([0, 1]), df['colC'], df['colB'])
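A self-contained sketch of that approach against the question's sample data (colA is mixed-type here, so isin([0, 1]) only matches the literal 0 and 1 values):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'colA': [0, 0, 1, 'Inactive', 'N/A'],
    'colB': ['27:15', '24:33', '29:43', 'John Doe', 'John Doe'],
    'colC': ['John Doe', 'John Doe', 'John Doe', None, None],
})

# When colA is 0 or 1 take colC, otherwise fall back to colB.
df['new_col'] = np.where(df['colA'].isin([0, 1]), df['colC'], df['colB'])
```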
Here is a solution that appends your results to a list given your conditions, then adds the list back to the dataframe as a colD column.
your_results = []
for i, data in enumerate(df["colA"]):
    if data == 0 or data == 1:
        your_results.append(df["colC"][i])
    else:
        your_results.append(df["colB"][i])
df["colD"] = your_results
I have two df's, one for user names and another for real names. I'd like to know how I can check if I have a real name in my first df using the data of the other, and then replace it.
For example:
import pandas as pd
df1 = pd.DataFrame({'userName':['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName':['alice','peter', 'john', 'francis', 'joe', 'carol']})
df1
userName
0 peterKing
1 john
2 joe545
3 mary
df2
realName
0 alice
1 peter
2 john
3 francis
4 joe
5 carol
My code should replace 'peterKing' and 'joe545', since these names appear in my df2. I tried using str.contains, but I can only verify whether a name appears or not.
The output should be like this:
userName
0 peter
1 john
2 joe
3 mary
Can someone help me with that? Thanks in advance!
You can use loc[row, column] (see the documentation of the loc method) and the Series.str.contains method to select the usernames you need to replace with the real names. In my opinion, this solution is clear in terms of readability.
for real_name in df2['realName'].to_list():
    df1.loc[df1['userName'].str.contains(real_name), 'userName'] = real_name
Output:
userName
0 peter
1 john
2 joe
3 mary
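An alternative sketch that avoids the Python-level loop by building one regex alternation from the real names and using Series.str.extract; it assumes the real names contain no regex metacharacters (otherwise escape them with re.escape):

```python
import pandas as pd

df1 = pd.DataFrame({'userName': ['peterKing', 'john', 'joe545', 'mary']})
df2 = pd.DataFrame({'realName': ['alice', 'peter', 'john',
                                 'francis', 'joe', 'carol']})

# Build one alternation pattern from the real names and extract the
# first match; usernames with no match keep their original value.
pattern = '(' + '|'.join(df2['realName']) + ')'
df1['userName'] = df1['userName'].str.extract(pattern)[0].fillna(df1['userName'])
```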
I have a huge dataset of about 60,000 rows. I first want to group the whole dataset by some criteria, then split it into many small datasets according to those criteria and automatically run a function on each small dataset to get a parameter for it. I have no idea how to do this. Is there any code to make it possible?
This is what I have
Date name number
20100101 John 1
20100102 Kate 3
20100102 Kate 2
20100103 John 3
20100104 John 1
And I want it to be split into two small ones
Date name number
20100101 John 1
20100103 John 3
20100104 John 1
Date name number
20100102 Kate 3
20100102 Kate 2
I think a more efficient way than repeatedly filtering the original data set by subsetting is groupby(). As a demo:
for _, g in df.groupby('name'):
    print(g)
# Date name number
#0 20100101 John 1
#3 20100103 John 3
#4 20100104 John 1
# Date name number
#1 20100102 Kate 3
#2 20100102 Kate 2
So to get a list of small data frames, you can do [g for _, g in df.groupby('name')].
To expand on this answer, we can see more clearly what df.groupby() returns as follows:
for k, g in df.groupby('name'):
    print(k)
    print(g)
# John
# Date name number
# 0 20100101 John 1
# 3 20100103 John 3
# 4 20100104 John 1
# Kate
# Date name number
# 1 20100102 Kate 3
# 2 20100102 Kate 2
Each element returned by groupby() contains a key and a data frame whose name column holds that key's value. In the above solution, we don't need the key, so we can just use a placeholder and discard it.
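A small sketch of keeping the key instead: building a dict of the small data frames keyed by name, using the question's sample data, so each group can be looked up directly.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': [20100101, 20100102, 20100102, 20100103, 20100104],
    'name': ['John', 'Kate', 'Kate', 'John', 'John'],
    'number': [1, 3, 2, 3, 1],
})

# Keep each group in a dict keyed by the group label for direct lookup.
groups = {name: g for name, g in df.groupby('name')}
```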
Unless your function is really slow, this can probably be accomplished by slicing (e.g. df_small = df[a:b] for some indices a and b). The only trick is to choose a and b. I use range in the code below but you could do it other ways:
param_list = []
n = 10000  # size of smaller dataframe

# loop over the frame n rows at a time
for i in range(0, 60000, n):
    # take a slice of the big dataframe and apply the function to get 'param'
    df_small = df[i:i+n]
    param = function(df_small)
    # keep our results in a list
    param_list.append(param)
EDIT: Based on update, you could do something like this:
# loop through names
for i in df.name.unique():
    # take the rows for this name and apply the function to get 'param'
    df_small = df[df.name == i]
    param = function(df_small)
    param_list.append(param)
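Putting it together, a hedged sketch that computes one parameter per name group; my_param here is a hypothetical stand-in for the real function (the group mean is used purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Date': [20100101, 20100102, 20100102, 20100103, 20100104],
    'name': ['John', 'Kate', 'Kate', 'John', 'John'],
    'number': [1, 3, 2, 3, 1],
})

# Hypothetical per-group function; swap in the real computation.
def my_param(g):
    return g['number'].mean()

# One parameter per name, keyed by the group label.
params = {name: my_param(g) for name, g in df.groupby('name')}
```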