Let's say I have a pandas dataframe and a column 'name'. I want to anonymize the column and hide the identities. I can do something like,
df['nickname'] = 'P ' + pd.Series(pd.factorize(df['name'])[0] + 1).astype(str)
But it gives me this:
name nickname
frank miller P 1
john cena P 2
john cena P 2
rock P 3
The above is an acceptable anonymization, but NOT what I need. Is there a way I can get the desired table below? Maybe a built-in python function or someone who has already implemented anything like this?
Desired Table (with random nicknames, but same output for the same input):
name nickname
frank miller Tiko
john cena Bozo
john cena Bozo
the rock Hana
You can use the Faker package for this, which generates dummy names for you.
Installation:
# pip
pip install Faker
# anaconda
conda install -c conda-forge faker
Example:
from faker import Faker
faker = Faker()
# seed the random generator to produce the same results
Faker.seed(4321)
dict_names = {name: faker.name() for name in df['name'].unique()}
df['nickname'] = df['name'].map(dict_names)
Output
name nickname
0 frank miller Jason Brown
1 john cena Jacob Stein
2 john cena Jacob Stein
3 rock Cody Brown
You can also initialize Faker with names from certain countries:
faker = Faker(['it_IT', 'de_DE', 'sv_SE'])
dict_names = {name: faker.name() for name in df['name'].unique()}
df['nickname'] = df['name'].map(dict_names)
Output
name nickname
0 frank miller Nadeschda Finke
1 john cena Marcus Warmer
2 john cena Marcus Warmer
3 rock Sophia Squarcione
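If installing Faker is not an option, the same idea (one stable pseudonym per distinct name) can be sketched with the standard library alone. The nickname pool and the helper name build_nickname_map below are made up for illustration and are not part of Faker or pandas:

```python
import random

def build_nickname_map(names, pool, seed=4321):
    """Map each distinct name to one nickname drawn from pool without reuse."""
    unique_names = list(dict.fromkeys(names))  # distinct names, first-seen order
    if len(unique_names) > len(pool):
        raise ValueError("nickname pool is smaller than the number of distinct names")
    rng = random.Random(seed)  # seeded, so repeated runs give the same mapping
    nicknames = rng.sample(pool, k=len(unique_names))
    return dict(zip(unique_names, nicknames))

# Hypothetical pool of nicknames; any list of unique strings would do.
pool = ["Tiko", "Bozo", "Hana", "Miro", "Zuli", "Kepa"]
names = ["frank miller", "john cena", "john cena", "the rock"]

nickname_map = build_nickname_map(names, pool)
nicknames = [nickname_map[name] for name in names]
```

The resulting dict can be handed to df['name'].map(nickname_map) in the same way as dict_names above. Sampling without replacement guarantees that two different people never share a nickname, which repeated calls to a random name generator do not guarantee by themselves.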
Please help me with the python script to filter the below CSV.
Below is the example of the CSV dump for which I have done the initial filtration.
Last_name  Gender  Name      Phone      city
Ford       Male    Tom       123        NY
Rich       Male    Robert    21312      LA
Ford       Female  Jessica   123123     NY
Ford       Male    John      3412       NY
Rich       Other   Linda     12312      LA
Ford       Other   James     4321       NY
Smith      Male    David     123123     TX
Rich       Female  Mary      98689      LA
Rich       Female  Jennifer  86860      LA
Ford       Male    Richard   12123      NY
Smith      Other   Daniel    897097     TX
Ford       Other   Lisa      123123123  NY
import re

def gather_info(L_name):
    dump_filename = "~/Documents/name_report.csv"
    LN = []
    with open(dump_filename, "r") as FH:
        for var in FH.readlines():
            if L_name in var:
                final = var.split(",")
                print(final[1], final[2], final[3])
    return LN

if __name__ == "__main__":
    L_name = input("Enter the Last name: ")
    la_name = gather_info(L_name)
With this, I am able to filter by the last name. For example, if I choose L_name as Ford, then my output is
Gender  Name     Phone
Male    Tom      123
Female  Jessica  123123
Male    John     3412
Other   James    4321
Male    Richard  12123
Other   Lisa     22412
I need help extending the script so that it selects each gender and its values in a list, performs other functions on them, and then moves on to the next gender and its values to perform the same functions. For example, it first selects the gender Male [Tom, John] and performs other functions, then selects the next gender Female [Jessica] and performs the same functions, and then selects the gender Other [James, Lisa] and performs the same functions.
I would recommend using the pandas module, which allows for easy filtering and grouping of data.
import pandas as pd

if __name__ == '__main__':
    data = pd.read_csv('name_reports.csv')
    L_name = input("Enter the last name: ")
    by_last_name = data[data['Last_name'] == L_name]
    groups = by_last_name.groupby(['Gender'])
    for group_name, group_data in groups:
        print(group_name)
        print(group_data)
Breaking this down into its pieces, the first part is
data = pd.read_csv('name_reports.csv')
This reads the data from the CSV and places it into a dataframe.
Second, we have
by_last_name = data[data['Last_name'] == L_name]
This filters the dataframe down to the rows whose Last_name equals L_name.
Next we group the data.
groups = by_last_name.groupby(['Gender'])
This groups the filtered dataframe by gender.
Then we iterate over the groups. Each iteration yields a tuple of the group name and the dataframe associated with that group.
for group_name, group_data in groups:
    print(group_name)
    print(group_data)
This loop just prints out the data. To access fields from it, you can use the iterrows function:
for index, row in group_data.iterrows():
    print(row['city'])
    print(row['Phone'])
    print(row['Name'])
And then you can use those for whatever function you want. I would recommend reading the pandas documentation, since depending on the function you plan to use there may be a better way to do it with the library. Here is the link: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
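As a rough sketch of the "perform other functions for each gender" step, the loop below collects the names of each gender group into a list and hands that list to a placeholder function. process_names is a hypothetical stand-in for whatever you need to run, and the small inline frame replaces the CSV only so that the snippet runs on its own:

```python
import pandas as pd

def process_names(gender, names):
    """Hypothetical placeholder for the real per-gender work."""
    return f"{gender}: {', '.join(names)}"

# A few rows from the question, inlined instead of reading the CSV.
data = pd.DataFrame({
    "Last_name": ["Ford", "Rich", "Ford", "Ford", "Ford", "Ford"],
    "Gender": ["Male", "Male", "Female", "Male", "Other", "Other"],
    "Name": ["Tom", "Robert", "Jessica", "John", "James", "Lisa"],
})

by_last_name = data[data["Last_name"] == "Ford"]

results = {}
# Grouping by a single column name yields the plain group key for each group.
for gender, group in by_last_name.groupby("Gender"):
    results[gender] = process_names(gender, group["Name"].tolist())

for line in results.values():
    print(line)
```

Note that groupby sorts the group keys by default, so the groups arrive as Female, Male, Other rather than in the order they first appear in the file.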
Since you cannot use the pandas module, a method using only the csv module would look like this:
import csv

def has_last_name(row, last_name):
    return row['Last_name'] == last_name

def has_gender(row, current_gender):
    return row['Gender'] == current_gender

if __name__ == '__main__':
    data = None
    genders = ['Male', 'Female', 'Other']
    with open('name_reports.csv') as csvfile:
        data = list(csv.DictReader(csvfile, delimiter=','))
    L_name = input('Enter the Last name: ')
    get_by_last_name = lambda row: has_last_name(row, L_name)
    filtered_by_last_name = list(filter(get_by_last_name, data))
    for gender in genders:
        get_by_gender = lambda row: has_gender(row, gender)
        filtered_by_gender = list(filter(get_by_gender, filtered_by_last_name))
        print(filtered_by_gender)
The important part is the built-in filter function. It takes a function that accepts one item and returns a bool, plus an iterable, and returns an iterator over the items for which that function returns True. The other important part is csv.DictReader, which yields each row of your CSV file as a dictionary, allowing you to access fields by column name instead of by index.
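If you would rather not hard-code the list of genders, one possible variation is to group the matching rows in a single pass with collections.defaultdict. This is only a sketch: names_by_gender is a made-up helper, and the inline sample stands in for name_reports.csv so the snippet runs as is:

```python
import csv
import io
from collections import defaultdict

# Inline sample standing in for name_reports.csv (a few rows from the question).
sample = io.StringIO(
    "Last_name,Gender,Name,Phone,city\n"
    "Ford,Male,Tom,123,NY\n"
    "Rich,Male,Robert,21312,LA\n"
    "Ford,Female,Jessica,123123,NY\n"
    "Ford,Male,John,3412,NY\n"
    "Ford,Other,James,4321,NY\n"
)

def names_by_gender(rows, last_name):
    """Group the Name field by Gender for rows matching last_name."""
    grouped = defaultdict(list)
    for row in rows:
        if row["Last_name"] == last_name:
            grouped[row["Gender"]].append(row["Name"])
    return dict(grouped)

ford = names_by_gender(csv.DictReader(sample), "Ford")
for gender, names in ford.items():
    # Replace print with whatever needs to run per gender.
    print(gender, names)
```

Each gender appears in the order it is first seen in the file, and the list of names for that gender can be passed to any other function.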
I have a dataframe
Name
Joe Smith
Jane Doe
Homer Simpson
I am trying to format this to get to
Name
Smith, Joe
Doe, Jane
Simpson, Homer
I have this code, and it works for ~80% of the users in my list, but some users are not coming through right.
invalid_users = ['Test User', 'Test User2', 'Test User3']
for index, row in df_Users.iterrows():
    gap_pos = df_Users["Name"][index].find(" ")
    if gap_pos > 0 and row["Name"] not in invalid_users:
        row["Name"] = df_Users["Name"][index][len(df_Users["Name"][index])-gap_pos+1:].strip() +', ' + df_Users["Name"][index][:gap_pos]
For the users who are not coming through correctly, the last name is usually truncated somewhere, e.g. Simpson ==> mpson
What am I doing wrong here?
Just split on space, then reverse it (that's what .str[::-1] is doing) and join on , :
>>> df['Name'].str.split(' ').str[::-1].str.join(', ')
0 Smith, Joe
1 Doe, Jane
2 Simpson, Homer
Name: Name, dtype: object
And if your data contains the name like Jr. Joe Smith, then you may do it following way:
df['Name'].str.split(' ').str[::-1].apply(lambda x:(x[0],' '.join(x[1:]))).str.join(', ')
I'm not sure what you were trying to do with len there, but it's not right. You just want to start straight from gap_pos:
row["Name"] = df_Users["Name"][index][gap_pos+1:].strip() +', ' + df_Users["Name"][index][:gap_pos]
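To see the difference between the two slices, here is a small worked example on a single plain string (no dataframe involved), using "Homer Simpson" from the question:

```python
name = "Homer Simpson"
gap_pos = name.find(" ")  # 5, the index of the first space

# Original slice: starts at len(name) - gap_pos + 1 = 13 - 5 + 1 = 9
original = name[len(name) - gap_pos + 1:].strip() + ", " + name[:gap_pos]

# Corrected slice: starts right after the space, at gap_pos + 1 = 6
corrected = name[gap_pos + 1:].strip() + ", " + name[:gap_pos]

print(original)   # pson, Homer
print(corrected)  # Simpson, Homer
```

The original expression only lands on the right index when the string is exactly twice as long as the position of the space, as in "Jane Doe" (length 8, space at index 4), which would explain why some names came through fine.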
I would be tempted to use split for this.
Pandas is a library that takes advantage of vectorized operations, especially for simple transformations and most DataFrame manipulations.
Given your example, here is some code that would work:
import pandas as pd
df = pd.DataFrame({"name": ["Joe Smith", "Jane Doe", "Homer Simpson"]})
# df
# name
# 0 Joe Smith
# 1 Jane Doe
# 2 Homer Simpson
df["name"] = df["name"].apply(lambda x: f"{x.split(' ')[1]}, {x.split(' ')[0]}")
# df
# name
# 0 Smith, Joe
# 1 Doe, Jane
# 2 Simpson, Homer
The apply function takes every value in the column and applies the specified function to each one of them.
Here, the specified function is a lambda function that, supposing the name pattern is "FirstName LastName", does what you want.
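If some names do not follow that pattern exactly, a small named helper is easier to adapt than the lambda. The sketch below is one possible variant, assuming that the last word is the surname and that a single-word name should be left alone; last_first is a made-up name:

```python
def last_first(full_name):
    """Return 'Last, First' for a space-separated name, or the name unchanged."""
    first, _, last = full_name.strip().rpartition(" ")
    if not first:  # no space found: nothing to reorder
        return last
    return f"{last}, {first.strip()}"

for example in ["Homer Simpson", "Mary Jane Watson", "Cher"]:
    print(last_first(example))
```

It plugs into the same call as before: df["name"] = df["name"].apply(last_first).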
I have a large Excel file to clean, around 200000 rows. So I'm using pandas to drop unwanted rows if the conditions are met, but it takes some time to run.
My current code looks like this
import phonenumbers as pn
from tqdm import tqdm

def cleanNumbers(number):  # checks whether number is a valid phone number
    valid = True
    try:
        num = pn.parse('+' + str(number), None)
        if not pn.is_valid_number(num):
            valid = False
    except Exception:
        valid = False
    return valid

for UncleanNum in tqdm(TeleNum):
    valid = cleanNumbers(UncleanNum)  # calling cleanNumbers function
    if valid is False:
        # dropping row if number is not a valid number
        df = df.drop(df[df.telephone == UncleanNum].index)
It takes around 30 min for this code to finish. Is there a more efficient way to drop rows with pandas? If not, can I use numpy to get the same output?
I'm not that acquainted with pandas or numpy, so if you have any tips to share, that would be helpful.
Edit:
I'm using the phonenumbers lib to check if the telephone number is valid. If it's not a valid phone number, I drop the row that number is on.
Example data
address name surname telephone
Street St. Bill Billinson 7398673456897<--let say this is wrong
Street St. Nick Nick 324523452345
Street St. Sam Sammy 234523452345
Street St. Bob Bob 32452345234534<--and this too
Street St. John Greg 234523452345
Output
address name surname telephone
Street St. Nick Nick 324523452345
Street St. Sam Sammy 234523452345
Street St. John Greg 234523452345
This is what my code does, but it is slow.
In my opinion, the main bottleneck here is not drop, but the custom function being called repeatedly for a large number of values.
Create a list of all valid numbers and then filter by boolean indexing with Series.isin:
v = [UncleanNum for UncleanNum in tqdm(TeleNum) if cleanNumbers(UncleanNum)]
df = df[df.telephone.isin(v)]
EDIT:
After some testing, the solution can be simplified, because the function returns a boolean:
df1 = df[df['telephone'].apply(cleanNumbers)]
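If the column contains many repeated numbers, one further variation is to validate each distinct number only once and then look the verdict up for every row. The sketch below uses a stand-in validator (is_valid, which simply accepts 12-digit numbers, purely an assumption for the demo) so it runs without the phonenumbers library; in the real code cleanNumbers would take its place:

```python
import pandas as pd

def is_valid(number):
    """Stand-in validator (assumption): accept numbers with exactly 12 digits."""
    text = str(number)
    return text.isdigit() and len(text) == 12

df = pd.DataFrame({
    "name": ["Bill", "Nick", "Sam", "Bob", "John"],
    "telephone": [7398673456897, 324523452345, 234523452345,
                  32452345234534, 234523452345],
})

# Validate each distinct number once, then look the verdict up per row.
verdicts = {number: is_valid(number) for number in df["telephone"].unique()}
mask = df["telephone"].map(verdicts)

# Keep only the rows whose number passed; no repeated drop calls needed.
cleaned = df[mask].reset_index(drop=True)
print(cleaned)
```

Building one boolean mask and indexing with it once avoids creating a new dataframe for every invalid number, which is what the drop inside the loop does.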
I have a Data Frame which looks like this
Name Surname Country Path
John Snow UK /Home/drive/John
BOB Anderson /Home/drive/BOB
Tim David UK /Home/drive/Tim
Wayne Green UK /Home/drive/Wayne
I have written a script which first checks if country =="UK", if true, changes Path from "/Home/drive/" to "/Server/files/" using gsub in R.
Script
Pattern <- "/Home/drive/"
Replacement <- "/Server/files/"
for (i in 1:nrow(gs_catalog_Staging_123))
{
  if (gs_catalog_Staging_123$country[i] == "UK" && !is.na(gs_catalog_Staging_123$country[i]))
  {
    gs_catalog_Staging_123$Path <- gsub(Pattern, Replacement, gs_catalog_Staging_123$Path, ignore.case = T)
  }
}
The output i get :
Name Surname Country Path
John Snow UK /Server/files/John
*BOB Anderson /Server/files/BOB*
Tim David UK /Server/files/Tim
Wayne Green UK /Server/files/Wayne
The output I want
Name Surname Country Path
John Snow UK /Server/files/John
BOB Anderson /Home/drive/BOB
Tim David UK /Server/files/Tim
Wayne Green UK /Server/files/Wayne
As we can clearly see gsub fails to recognize missing values and appends that row as well.
Many R functions are vectorized, so we can avoid a loop here.
# example data
df <- data.frame(
  name = c("John", "Bob", "Tim", "Wayne"),
  surname = c("Snow", "Ander", "David", "Green"),
  country = c("UK", "", "UK", "UK"),
  path = paste0("/Home/drive/", c("John", "Bob", "Tim", "Wayne")),
  stringsAsFactors = FALSE
)
# fix the path
df$newpath <- ifelse(df$country == "UK" & !is.na(df$country),
                     gsub("/Home/drive/", "/Server/files/", df$path),
                     df$path)
# view result
df
name surname country path newpath
1 John Snow UK /Home/drive/John /Server/files/John
2 Bob Ander /Home/drive/Bob /Home/drive/Bob
3 Tim David UK /Home/drive/Tim /Server/files/Tim
4 Wayne Green UK /Home/drive/Wayne /Server/files/Wayne
In fact, this is the issue with your code. Each time through your loop, you check row i but then you do a full replacement of the whole column. A fix would be to add [i] at appropriate places of your final line of code:
gs_catalog_Staging_123$Path[i] <- gsub(Pattern , Replacement , gs_catalog_Staging_123$Path[i] ,ignore.case=T)
This is my dataframe (where the values in the authors column are comma separated strings):
authors book
Jim, Charles The Greatest Book in the World
Jim An OK book
Charlotte A book about books
Charlotte, Jim The last book
How do I transform it to a long format, like this:
authors book
Jim The Greatest Book in the World
Jim An OK book
Jim The last book
Charles The Greatest Book in the World
Charlotte A book about books
Charlotte The last book
I've tried extracting the individual authors to a list, authors = list(df['authors'].str.split(',')), flatten that list, matched every author to every book, and construct a new list of dicts with every match. But that doesn't seem very pythonic to me, and I'm guessing pandas has a cleaner way to do this.
You can split the authors column into separate columns (expand=True) after setting book as the index, which will get you almost all the way there. Rename and sort the columns to finish.
df.set_index('book').authors.str.split(',', expand=True).stack().reset_index('book')
book 0
0 The Greatest Book in the World Jim
1 The Greatest Book in the World Charles
0 An OK book Jim
0 A book about books Charlotte
0 The last book Charlotte
1 The last book Jim
And to get you all the way home
df.set_index('book')\
.authors.str.split(',', expand=True)\
.stack()\
.reset_index('book')\
.rename(columns={0:'authors'})\
.sort_values('authors')[['authors', 'book']]\
.reset_index(drop=True)
The best option is to use pandas.Series.str.split, and then pandas.DataFrame.explode to expand the resulting lists into rows.
Split on ', ', otherwise values following the comma will be preceded by a whitespace (e.g. ' Charles')
Tested in python 3.10, pandas 1.4.3
import pandas as pd
data = {'authors': ['Jim, Charles', 'Jim', 'Charlotte', 'Charlotte, Jim'], 'book': ['The Greatest Book in the World', 'An OK book', 'A book about books', 'The last book']}
df = pd.DataFrame(data)
# display(df)
authors book
0 Jim, Charles The Greatest Book in the World
1 Jim An OK book
2 Charlotte A book about books
3 Charlotte, Jim The last book
# split authors
df.authors = df.authors.str.split(', ')
# explode the column (with a fresh 0, 1... index)
df = df.explode('authors', ignore_index=True)
# display(df)
authors book
0 Jim The Greatest Book in the World
1 Charles The Greatest Book in the World
2 Jim An OK book
3 Charlotte A book about books
4 Charlotte The last book
5 Jim The last book