I have a Dataframe:
ID Name Salary($)
0 Alex Jones 44,000
1 Bob Smith 65,000
2 Peter Clarke 50,000
In order to protect the privacy of the individuals in this dataset, I want to mask the output of this Dataframe in a Jupyter notebook like this:
ID Name Salary($)
0 AXXX XXXX 44,000
1 BXX XXXXX 65,000
2 PXXXX XXXXX 50,000
Individually replacing characters in each name seems very crude to me. There must be a better approach?
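You can concatenate the first character of each name with the rest of the name after replacing every word character in it, using str.replace with a regular expression (recent pandas versions need regex=True, since str.replace no longer treats the pattern as a regex by default):
In[16]:
df['Name'] = df['Name'].str[0] + df['Name'].str[1:].str.replace(r'\w', 'X', regex=True)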
df
Out[16]:
ID Name Salary($)
0 0 AXXX XXXXX 44,000
1 1 BXX XXXXX 65,000
2 2 PXXXX XXXXXX 50,000
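If the goal is only to mask what the notebook displays, a variant that leaves df untouched may be preferable. The sketch below is one possible way to do that, assuming a regex lookbehind is acceptable; the name `masked` and the way the sample data is built are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [0, 1, 2],
    "Name": ["Alex Jones", "Bob Smith", "Peter Clarke"],
    "Salary($)": ["44,000", "65,000", "50,000"],
})

# (?<=.) requires one preceding character, so the first character of each
# string is kept and every later word character becomes "X".
# Spaces are not word characters, so they are left alone.
masked = df.assign(Name=df["Name"].str.replace(r"(?<=.)\w", "X", regex=True))
print(masked)
```

Because assign returns a new data frame, df still holds the real names and only `masked` needs to be shown in the notebook.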
I'm looking to combine multiple rows in a dataframe into a single row based on one column.
This is what my df looks like:
id Name score
0 1234 jim 34
1 5678 james 45
2 4321 Macy 56
3 1234 Jim 78
4 5678 James 80
I want to combine based on column "score" so the output would look like:
id Name score
0 1234 jim 34,78
1 5678 james 45,80
2 4321 Macy 56
Basically I want to do the reverse of the explode function. How can I achieve this using pandas dataframe?
Try agg with groupby
out = df.groupby('id',as_index=False).agg({'Name':'first','score':lambda x : ','.join(x.astype(str))})
Out[29]:
id Name score
0 1234 jim 34,78
1 4321 Macy 56
2 5678 james 45,80
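Note that the output above is sorted by id, while the desired output in the question keeps the ids in order of first appearance. A minimal sketch of the same aggregation with sort=False, assuming the sample data from the question:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1234, 5678, 4321, 1234, 5678],
    "Name": ["jim", "james", "Macy", "Jim", "James"],
    "score": [34, 45, 56, 78, 80],
})

# sort=False keeps the groups in order of first appearance instead of
# sorting them by the group key.
out = (
    df.groupby("id", as_index=False, sort=False)
      .agg({"Name": "first", "score": lambda s: ",".join(s.astype(str))})
)
print(out)
```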
I would like to generate an integer-based unique ID for users (in my df).
Let's say I have:
index first last dob
0 peter jones 20000101
1 john doe 19870105
2 adam smith 19441212
3 john doe 19870105
4 jenny fast 19640822
I would like to generate an ID column like so:
index first last dob id
0 peter jones 20000101 1244821450
1 john doe 19870105 1742118427
2 adam smith 19441212 1841181386
3 john doe 19870105 1742118427
4 jenny fast 19640822 1687411973
10 digit ID, but it's based on the value of the fields (john doe identical row values get the same ID).
I've looked into hashing, encrypting, UUID's but can't find much related to this specific non-security use case. It's just about generating an internal identifier.
I can't use groupby/cat code type methods in case the order of the
rows change.
The dataset won't grow beyond 50k rows.
Safe to assume there won't be a first, last, dob duplicate.
Feel like I may be tackling this the wrong way as I can't find much literature on it!
Thanks
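You can try using the built-in hash function.
df['id'] = df[['first', 'last']].sum(axis=1).map(hash)
Please note that the hash can be longer than 10 digits and may be negative. Python also randomizes string hashes per process by default (see PYTHONHASHSEED), so these ids are only stable within a single session.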
Here's a way of doing it using numpy
import numpy as np
np.random.seed(1)
# create a list of unique names
names = df[['first', 'last']].agg(' '.join, axis=1).unique().tolist()
# generate random 10-digit ids (int64 so the range also fits where the default int is 32-bit)
ids = np.random.randint(low=10**9, high=10**10, size=len(names), dtype=np.int64)
# map names to ids
maps = {k: v for k, v in zip(names, ids)}
# add new id column
df['id'] = df[['first', 'last']].agg(' '.join, axis=1).map(maps)
index first last dob id
0 0 peter jones 20000101 9176146523
1 1 john doe 19870105 8292931172
2 2 adam smith 19441212 4108641136
3 3 john doe 19870105 8292931172
4 4 jenny fast 19640822 6385979058
You can apply the function below to your data frame column.
def generate_id(s):
    return abs(hash(s)) % (10 ** 10)
df['id'] = df['first'].apply(generate_id)
If you find that some values have fewer digits than required, you can pad them like this:
def generate_id(s, size):
    val = str(abs(hash(s)) % (10 ** size))
    if len(val) < size:
        # too short: append digits derived from a prefix of s
        diff = size - len(val)
        val = str(val) + str(generate_id(s[:diff], diff))
    return int(val)
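Since the question asks for ids that depend only on the field values, a cryptographic digest from hashlib may be a better fit than the built-in hash, whose string results change between Python processes by default. The following is a sketch under that assumption; `stable_id` is a hypothetical helper name, and folding the digest into a 10-digit range is one possible choice, not something taken from the answers above.

```python
import hashlib

import pandas as pd

df = pd.DataFrame({
    "first": ["peter", "john", "adam", "john", "jenny"],
    "last": ["jones", "doe", "smith", "doe", "fast"],
    "dob": [20000101, 19870105, 19441212, 19870105, 19640822],
})


def stable_id(*fields):
    """Hypothetical helper: map field values to a 10-digit integer."""
    # Join with a separator so ("ab", "c") and ("a", "bc") give different keys.
    key = "|".join(str(f) for f in fields).encode("utf-8")
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    # Fold into [10**9, 10**10 - 1] so the result always has 10 digits.
    return 10**9 + digest % (9 * 10**9)


df["id"] = [
    stable_id(f, l, d) for f, l, d in zip(df["first"], df["last"], df["dob"])
]
print(df)
```

The same input always yields the same id, regardless of row order or session. A 10-digit space is small, though: by a birthday-bound estimate the chance of at least one collision among 50k distinct keys is on the order of 10%, so it is worth comparing df['id'].nunique() against the number of distinct rows.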
I come from a SQL background and am new to Python. I have been trying to figure out how to solve this particular problem for a while now and am unable to come up with anything.
Here are my dataframes
from pandas import DataFrame
import numpy as np
Names1 = {'First_name': ['Jon','Bill','Billing','Maria','Martha','Emma']}
df = DataFrame(Names1,columns=['First_name'])
print(df)
names2 = {'name': ['Jo', 'Bi', 'Ma']}
df_2 = DataFrame(names2,columns=['name'])
print(df_2)
Results to this:
First_name
0 Jon
1 Bill
2 Billing
3 Maria
4 Martha
5 Emma
name
0 Jo
1 Bi
2 Ma
This code helps me identify in df which First_name starts with a tuple from df_2
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), 'true', df['First_name'])
results to this:
First_name like_flg
0 Jon true
1 Bill true
2 Billing true
3 Maria true
4 Martha true
5 Emma Emma
I would like the final output of the dataframe to set the like_flg to the value of the tuple in which the First_name field is being conditionally compared against. See below for final desired output:
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
Here's what I've tried so far
df['like_flg'] = np.where(df['First_name'].str.startswith(tuple(list(df_2['name']))), tuple(list(df_2['name'])), df['First_name'])
results to this error:
`ValueError: operands could not be broadcast together with shapes (6,) (3,) (6,)`
I've also tried aligning both dataframes, however, that won't work for the use case that I'm trying to achieve.
Is there a way to conditionally align dataframes to fill in the columns that start with the tuple?
I believe the issue I'm facing is that the tuple or dataframe that I'm using as a comparison is not the same size as the dataframe that I want to append the tuple to. Please see above for the desired output.
Thank you all in advance!
If your starting strings differ in length, you can use .str.extract
df['like_flag'] = df['First_name'].str.extract('^('+'|'.join(df_2.name)+')')
df['like_flag'] = df['like_flag'].fillna(df.First_name) # Fill non matches.
I modified df_2 to be
name
0 Jo
1 Bi
2 Mar
which leads to:
First_name like_flag
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Mar
4 Martha Mar
5 Emma Emma
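Two details may matter when the prefixes come from real data: they can contain regex metacharacters, and one prefix can be the start of another (regex alternation takes the first alternative that matches, not the longest). A hedged sketch of the same str.extract idea that escapes each prefix and tries longer prefixes first; the variable names `prefixes` and `pattern` are illustrative:

```python
import re

import pandas as pd

df = pd.DataFrame(
    {"First_name": ["Jon", "Bill", "Billing", "Maria", "Martha", "Emma"]}
)
df_2 = pd.DataFrame({"name": ["Jo", "Bi", "Ma"]})

# Longest prefixes first, each escaped so it is matched literally.
prefixes = sorted(df_2["name"], key=len, reverse=True)
pattern = "^(" + "|".join(re.escape(p) for p in prefixes) + ")"

# expand=False returns a Series; rows without a match are NaN and fall
# back to the original first name.
df["like_flg"] = (
    df["First_name"].str.extract(pattern, expand=False).fillna(df["First_name"])
)
print(df)
```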
You can use np.where,
df['like_flg'] = np.where(df.First_name.str[:2].isin(df_2.name), df.First_name.str[:2], df.First_name)
First_name like_flg
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
You can do it with numpy's char.find:
v=df.First_name.values.astype(str)
s=df_2.name.values.astype(str)
df_2.name.dot((np.char.find(v,s[:,None])==0))
array(['Jo', 'Bi', 'Bi', 'Ma', 'Ma', ''], dtype=object)
Then we just assign it back
df['New']=df_2.name.dot((np.char.find(v,s[:,None])==0))
df.loc[df['New']=='','New']=df.First_name
df
First_name New
0 Jon Jo
1 Bill Bi
2 Billing Bi
3 Maria Ma
4 Martha Ma
5 Emma Emma
I have data frame with text data like below,
name | address | number
1 Bob bob No.56
2 #gmail.com
3 Carly carly#world.com No.90
4 Gorge greg#yahoo
5 .com
6 No.100
and want to make it like this frame.
name | address | number
1 Bob bob#gmail.com No.56
2 Carly carly#world.com No.90
3 Gorge greg#yahoo.com No.100
I am using pandas to read file but not sure how to use merge or concat.
If the name column consists of unique values,
print(df)
name address number
0 Bob bob No.56
1 NaN #gmail.com NaN
2 Carly carly#world.com No.90
3 Gorge greg#yahoo NaN
4 NaN .com NaN
5 NaN NaN No.100
df['name'] = df['name'].ffill()
print(df.fillna('').groupby(['name'], as_index=False).sum())
name address number
0 Bob bob#gmail.com No.56
1 Carly carly#world.com No.90
2 Gorge greg#yahoo.com No.100
You may need things like ffill(), bfill(), [::-1], .groupby('name').apply(lambda x: ' '.join(x['address'])), strip(), lstrip(), rstrip() and replace() to extend the above code to more complicated data.
If you want to convert a data frame of six rows (with possible NaN entries in each column), there may be no direct pandas method for that.
You will need some code to fill in the name column, so that pandas knows that the split rows bob and #gmail.com belong to the same user, Bob.
You can fill each empty entry in column name with its preceding user using the fillna or ffill methods, see pandas dataframe missing data.
df['name'] = df['name'].ffill()
# gives
name address number
0 Bob bob No.56
1 Bob #gmail.com
2 Carly carly#world.com No.90
3 Gorge greg#yahoo
4 Gorge .com
5 Gorge No.100
Then you can use the groupby and sum as the aggregation function.
df.groupby(['name']).sum().reset_index()
# gives
name address number
0 Bob bob#gmail.com No.56
1 Carly carly#world.com No.90
2 Gorge greg#yahoo.com No.100
You may find converting between NaN and white space useful, see Replacing blank values (white space) with NaN in pandas and pandas.DataFrame.fillna.
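The steps above can be put together in one runnable sketch. It assumes the blanks in the question are read in as NaN, and it uses an explicit ''.join per column instead of sum() to make the string concatenation obvious; that substitution is a stylistic choice, not something the answers require.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["Bob", np.nan, "Carly", "Gorge", np.nan, np.nan],
    "address": ["bob", "#gmail.com", "carly#world.com", "greg#yahoo", ".com", np.nan],
    "number": ["No.56", np.nan, "No.90", np.nan, np.nan, "No.100"],
})

# Propagate each name down to the continuation rows that follow it.
df["name"] = df["name"].ffill()

# Replace the remaining NaN with empty strings, then glue the pieces of
# each column together within every name group.
out = (
    df.fillna("")
      .groupby("name", as_index=False)
      .agg({"address": "".join, "number": "".join})
)
print(out)
```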
I have a huge dataset with about 60,000 rows. I would first group the whole dataset by some criteria, then split it into many small datasets according to those criteria and automatically run a function on each small dataset to get a parameter for each one. I have no idea how to do this. Is there any code to make it possible?
This is what I have
Date name number
20100101 John 1
20100102 Kate 3
20100102 Kate 2
20100103 John 3
20100104 John 1
And I want it to be split into two small ones
Date name number
20100101 John 1
20100103 John 3
20100104 John 1
Date name number
20100102 Kate 3
20100102 Kate 2
I think groupby() is a more efficient way than filtering the original data set by subsetting. As a demo:
for _, g in df.groupby('name'):
    print(g)
# Date name number
#0 20100101 John 1
#3 20100103 John 3
#4 20100104 John 1
# Date name number
#1 20100102 Kate 3
#2 20100102 Kate 2
So to get a list of small data frames, you can do [g for _, g in df.groupby('name')].
To expand on this answer, we can see more clearly what df.groupby() returns as follows:
for k, g in df.groupby('name'):
    print(k)
    print(g)
# John
# Date name number
# 0 20100101 John 1
# 3 20100103 John 3
# 4 20100104 John 1
# Kate
# Date name number
# 1 20100102 Kate 3
# 2 20100102 Kate 2
Each element returned by groupby() is a pair: a key and a data frame containing the rows whose name equals that key. In the above solution, we don't need the key, so we can assign it to a placeholder (_) and discard it.
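When the key is worth keeping, the same loop can collect one parameter per group. The sketch below assumes the sample data from the question; `compute_param` is a hypothetical stand-in for whatever function is run on each small data frame (here it just takes the mean of number).

```python
import pandas as pd

df = pd.DataFrame({
    "Date": [20100101, 20100102, 20100102, 20100103, 20100104],
    "name": ["John", "Kate", "Kate", "John", "John"],
    "number": [1, 3, 2, 3, 1],
})


def compute_param(group):
    """Hypothetical per-group function: mean of the number column."""
    return float(group["number"].mean())


# One entry per group, keyed by the group's name.
params = {key: compute_param(g) for key, g in df.groupby("name")}
print(params)
```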
Unless your function is really slow, this can probably be accomplished by slicing (e.g. df_small = df[a:b] for some indices a and b). The only trick is to choose a and b. I use range in the code below but you could do it other ways:
param_list = []
n = 10000  # size of smaller dataframe
# loop over the whole dataframe, n rows at a time
for i in range(0, len(df), n):
    # take a slice of big dataframe and apply function to get 'param'
    df_small = df[i:i+n]
    param = function(df_small)
    # keep our results in a list
    param_list.append(param)
EDIT: Based on update, you could do something like this:
# loop through names
for i in df['name'].unique():
    # take a slice of big dataframe and apply function to get 'param'
    df_small = df[df['name'] == i]