Pandas replace strings with fuzzy match in the same column - python

I have a column in a dataframe that is like this:
OWNER
--------------
OTTO J MAYER
OTTO MAYER
DANIEL J ROSEN
DANIEL ROSSY
LISA CULLI
LISA CULLY
LISA CULLY
CITY OF BELMONT
CITY OF BELMONT CITY
Some of the names in my data frame are misspelled or have extra/missing characters. I need a column where each name is replaced by any close match in the same column, so that all similar names are grouped under one and the same name.
For example, this is what I expect from the data frame above:
NAME
--------------
OTTO J MAYER
OTTO J MAYER
DANIEL J ROSEN
DANIEL ROSSY
LISA CULLY
LISA CULLY
LISA CULLY
CITY OF BELMONT
CITY OF BELMONT
OTTO MAYER is replaced with OTTO J MAYER because they are both very similar. The DANIEL's stay the same because they do not match closely enough. The LISA CULL's all get the same value, and so on.
I have some code I got from another post on Stack Overflow that was trying to solve something similar, but it uses a dictionary of names, and I'm having trouble reworking it to produce the output that I need.
Here is what I have currently:
import pandas as pd

d = pd.DataFrame({'OWNER': pd.Series(['OTTO J MAYER', 'OTTO MAYER', 'DANIEL J ROSEN', 'DANIEL ROSSY',
                                      'LISA CULLI', 'LISA CULLY'])})
names = d['OWNER']
names = names.values
names

import difflib

def best_match(tokens, names):
    for i, t in enumerate(tokens):
        closest = difflib.get_close_matches(t, names, n=1)
        if len(closest) > 0:
            return i, closest[0]
    return None

def fuzzy_replace(x, y):
    names = y  # just a simple replacement list
    tokens = x.split()
    res = best_match(tokens, y)
    if res is not None:
        pos, replacement = res
        return u" ".join(tokens)
    return x

d["OWNER"].apply(lambda x: fuzzy_replace(x, names))

Indeed difflib.get_close_matches is fit for the task, but splitting the name into tokens does no good. In order to differentiate the names as specified, we have to raise the cutoff score to about 0.8 and, to make sure that all possible aliases are returned, raise the maximum number of matches to len(names). Then we have two rules to decide which name to prefer:
If a name occurs more often than the others, choose that one.
Otherwise choose the one occurring first.
def fuzzy_replace(x, names):
    aliases = difflib.get_close_matches(x, names, len(names), .8)
    closest = pd.Series(aliases).mode()
    closest = aliases[0] if closest.empty else closest[0]
    d['OWNER'].replace(aliases, closest, inplace=True)

for x in d["OWNER"]: fuzzy_replace(x, d['OWNER'])

Related

Pandas DF: Create New Col by removing last word from existing column

This should be easy, but I'm stumped.
I have a df that includes a column of PLACENAMES. Some of these have multiple word names:
Able County
Baker County
Charlie County
St. Louis County
All I want to do is to create a new column in my df that has just the name, without the "county" word:
Able
Baker
Charlie
St. Louis
I've tried a variety of things:
1. places['name_split'] = places['PLACENAME'].str.split()
2. places['name_split'] = places['PLACENAME'].str.split()[:-1]
3. places['name_split'] = places['PLACENAME'].str.rsplit(' ',1)[0]
4. places = places.assign(name_split = lambda x: ' '.join(x['PLACENAME'].str.split()[:-1]))
The results:
1. Works - splits the names into a list ['St.','Louis','County']
2. The list slice is ignored, resulting in the same list ['St.','Louis','County'] rather than ['St.','Louis']
3. Raises a ValueError: Length of values (2) does not match length of index (41414)
4. Raises a TypeError: sequence item 0: expected str instance, list found
I've also defined a function and called it with .assign():
def processField(namelist):
    words = namelist[:-1]
    name = ' '.join(words)
    return name

places = places.assign(name_split = lambda x: processField(x['PLACENAME']))
This also raises a TypeError: sequence item 0: expected str instance, list found
This seems to be a very simple goal and I've probably overthought it, but I'm just stumped. Suggestions about what I should be doing would be deeply appreciated.
Apply Series.str.rpartition function:
places['name_split'] = places['PLACENAME'].str.rpartition()[0]
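For context, rpartition splits on the last occurrence of the separator (a single space by default) and expands into three columns: the text before that separator, the separator itself, and the last word, which is why [0] keeps everything but the final word. A quick sketch of the intermediate result, with values assumed from the sample data:
parts = places['PLACENAME'].str.rpartition()
# parts[0] = text before the last space, parts[1] = the separator,
# parts[2] = the last word; e.g. "St. Louis County" splits into
# ["St. Louis", " ", "County"]
places['name_split'] = parts[0]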
Use str.replace to remove the last word and the preceding spaces:
places['new'] = places['PLACENAME'].str.replace(r'\s*\w+$', '', regex=True)
# or
places['new'] = places['PLACENAME'].str.replace(r'\s*\S+$', '', regex=True)
# or, only match 'County'
places['new'] = places['PLACENAME'].str.replace(r'\s*County$', '', regex=True)
Output:
PLACENAME new
0 Able County Able
1 Baker County Baker
2 Charlie County Charlie
3 St. Louis County St. Louis

How to filter and sort specific csv using python

Please help me with a Python script to filter the CSV below.
Below is an example of the CSV dump on which I have done the initial filtering.
Last_name  Gender  Name      Phone      city
Ford       Male    Tom       123        NY
Rich       Male    Robert    21312      LA
Ford       Female  Jessica   123123     NY
Ford       Male    John      3412       NY
Rich       Other   Linda     12312      LA
Ford       Other   James     4321       NY
Smith      Male    David     123123     TX
Rich       Female  Mary      98689      LA
Rich       Female  Jennifer  86860      LA
Ford       Male    Richard   12123      NY
Smith      Other   Daniel    897097     TX
Ford       Other   Lisa      123123123  NY
import re

def gather_info(L_name):
    dump_filename = "~/Documents/name_report.csv"
    LN = []
    with open(dump_filename, "r") as FH:
        for var in FH.readlines():
            if L_name in var:
                final = var.split(",")
                print(final[1], final[2], final[3])
    return LN

if __name__ == "__main__":
    L_name = input("Enter the Last name: ")
    la_name = gather_info(L_name)
With this, I am able to filter by last name. For example, if I choose L_name as Ford, then my output is:
Gender  Name     Phone
Male    Tom      123
Female  Jessica  123123
Male    John     3412
Other   James    4321
Male    Richard  12123
Other   Lisa     22412
I need help extending the script so that it selects each gender and the corresponding values in the list, performs some other functions on them, then moves on to the next gender and its values to do the same. For example, it first selects the gender Male [Tom, John] and performs the other functions, then selects Female [Jessica] and performs the same functions, and then selects Other [James, Lisa] and performs the same functions.
I would recommend using the pandas module, which allows for easy filtering and grouping of data.
import pandas as pd

if __name__ == '__main__':
    data = pd.read_csv('name_reports.csv')
    L_name = input("Enter the last name: ")
    by_last_name = data[data['Last_name'] == L_name]
    groups = by_last_name.groupby(['Gender'])
    for group_name, group_data in groups:
        print(group_name)
        print(group_data)
Breaking this down into its pieces, the first part is
data = pd.read_csv('name_reports.csv')
This reads the data from the csv and places it into a dataframe
Second we have
by_last_name = data[data['Last_name'] == L_name]
This filters the dataframe to only have results with Last_name equal to L_name
Next we group the data.
groups = by_last_name.groupby(['Gender'])
This groups the filtered data frame by gender.
Then we iterate over the groups. Each iteration yields a tuple of the group name and the dataframe associated with that group.
for group_name, group_data in groups:
    print(group_name)
    print(group_data)
This loop just prints out the data. To access fields from it you can use the iterrows function:
for index, row in group_data.iterrows():
    print(row['city'])
    print(row['Phone'])
    print(row['Name'])
And then you can use those for whatever function you want. I would recommend reading the pandas documentation, since depending on the function you plan on using there may be a better way to do it with the library. Here is the link to the library: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html
Since you cannot use the pandas module, a method using only the csv module would look like this:
import csv

def has_last_name(row, last_name):
    return row['Last_name'] == last_name

def has_gender(row, current_gender):
    return row['Gender'] == current_gender

if __name__ == '__main__':
    data = None
    genders = ['Male', 'Female', 'Other']
    with open('name_reports.csv') as csvfile:
        data = list(csv.DictReader(csvfile, delimiter=','))
    L_name = input('Enter the Last name: ')
    get_by_last_name = lambda row: has_last_name(row, L_name)
    filtered_by_last_name = list(filter(get_by_last_name, data))
    for gender in genders:
        get_by_gender = lambda row: has_gender(row, gender)
        filtered_by_gender = list(filter(get_by_gender, filtered_by_last_name))
        print(filtered_by_gender)
The important part is the filter built-in function. It takes a function that receives an item from a list and returns a bool; filter takes this function and an iterable and returns an iterator of the items for which the function returns True. The other important part is csv.DictReader, which returns each row of your csv file as a dictionary, allowing you to access fields by key instead of by index.
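To make that concrete, here is a tiny standalone sketch of the same two ideas (the rows below are made up for illustration, not read from the report file): the DictReader-style rows are dictionaries indexed by column name, and filter keeps only the ones for which the predicate returns True.
rows = [{'Last_name': 'Ford', 'Name': 'Tom'},
        {'Last_name': 'Rich', 'Name': 'Robert'}]
# Keep only the rows whose Last_name is 'Ford'
fords = list(filter(lambda row: row['Last_name'] == 'Ford', rows))
print(fords)  # [{'Last_name': 'Ford', 'Name': 'Tom'}]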

Formatting strings in a dataframe

I have a dataframe:
Name
Joe Smith
Jane Doe
Homer Simpson
I am trying to format this to get to:
Name
Smith, Joe
Doe, Jane
Simpson, Homer
I have this code, and it works for ~80% of the users in my list, but some users are not coming through right.
invalid_users = ['Test User', 'Test User2', 'Test User3']
for index, row in df_Users.iterrows():
    gap_pos = df_Users["Name"][index].find(" ")
    if gap_pos > 0 and row["Name"] not in invalid_users:
        row["Name"] = df_Users["Name"][index][len(df_Users["Name"][index])-gap_pos+1:].strip() + ', ' + df_Users["Name"][index][:gap_pos]
For the users who are not coming through correctly, the last name is usually truncated somewhere, e.g. Simpson ==> mpson.
What am I doing wrong here?
Just split on space, then reverse it (that's what .str[::-1] is doing) and join on , :
>>> df['Name'].str.split(' ').str[::-1].str.join(', ')
0 Smith, Joe
1 Doe, Jane
2 Simpson, Homer
Name: Name, dtype: object
And if your data contains a name like Jr. Joe Smith, then you may do it the following way:
df['Name'].str.split(' ').str[::-1].apply(lambda x:(x[0],' '.join(x[1:]))).str.join(', ')
I'm not sure what you were trying to do with len there, but it's not right. You just want to start straight from gap_pos:
row["Name"] = df_Users["Name"][index][gap_pos+1:].strip() +', ' + df_Users["Name"][index][:gap_pos]
I would be tempted to use split for this.
Pandas is a library that takes advantage of vectorized operations, especially for simple transformations and most DataFrame manipulations.
Given your example, here is a code that would work:
import pandas as pd
df = pd.DataFrame({"name": ["Joe Smith", "Jane Doe", "Homer Simpson"]})
# df
# name
# 0 Joe Smith
# 1 Jane Doe
# 2 Homer Simpson
df["name"] = df["name"].apply(lambda x: f"{x.split(' ')[1]}, {x.split(' ')[0]}")
# df
# name
# 0 Smith, Joe
# 1 Doe, Jane
# 2 Simpson, Homer
The apply function takes every row and applies the specified function to each one of them.
Here, the specified function is a lambda function that, supposing the name pattern is "FirstName LastName", does what you want.

How to speed up pandas drop() method?

I have a large Excel file with around 200,000 rows to clean. I'm using pandas to drop unwanted rows when certain conditions are met, but it takes some time to run.
My current code looks like this
import phonenumbers as pn  # library used to validate the numbers
from tqdm import tqdm

def cleanNumbers(number):  # checks whether number is a valid phone number
    valid = True
    try:
        num = pn.parse('+' + str(number), None)
        if not pn.is_valid_number(num):
            valid = False
    except:
        valid = False
    return valid

for UncleanNum in tqdm(TeleNum):
    valid = cleanNumbers(UncleanNum)  # calling the cleanNumbers function
    if valid is False:
        df = df.drop(df[df.telephone == UncleanNum].index)
        # dropping the row if the number is not a valid number
It takes around 30 minutes for this code to finish. Is there a more efficient way to drop rows with pandas? If not, can I use numpy to get the same output?
I'm not that acquainted with pandas or numpy, so if you have any tips to share it would be helpful.
Edit:
I'm using the phonenumbers library to check whether a telephone number is valid. If it's not a valid phone number, I drop the row that number is on.
Example data
address     name  surname    telephone
Street St.  Bill  Billinson  7398673456897   <-- let's say this is wrong
Street St.  Nick  Nick       324523452345
Street St.  Sam   Sammy      234523452345
Street St.  Bob   Bob        32452345234534  <-- and this too
Street St.  John  Greg       234523452345
Output
address     name  surname  telephone
Street St.  Nick  Nick     324523452345
Street St.  Sam   Sammy    234523452345
Street St.  John  Greg     234523452345
This is what my code does, but it is slow.
In my opinion the main bottleneck here is not drop, but the custom function being repeated for a large number of values.
Create a list of all valid numbers and then filter by boolean indexing with Series.isin:
v = [UncleanNum for UncleanNum in tqdm(TeleNum) if cleanNumbers(UncleanNum)]
df = df[df.telephone.isin(v)]
EDIT:
After some testing, the solution can be simplified, because the function returns a boolean:
df1 = df[df['telephone'].apply(cleanNumbers)]
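As a toy illustration of this boolean-mask filtering with a stand-in validity check (the real code would use cleanNumbers with the phonenumbers library, as above):
import pandas as pd

df = pd.DataFrame({'name': ['Bill', 'Nick', 'Bob'],
                   'telephone': [7398673456897, 324523452345, 32452345234534]})

def is_valid(number):
    # Stand-in check for illustration only: here "valid" just means 12 digits.
    return len(str(number)) == 12

df1 = df[df['telephone'].apply(is_valid)]
# df1 keeps only Nick's row in this toy example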

Converting unordered list of tuples to pandas DataFrame

I am using the library usaddress to parse addresses from a set of files I have. I would like my final output to be a data frame where column names represent parts of the address (e.g. street, city, state) and rows represent each individual address I've extracted. For example:
Suppose I have a list of addresses:
addr = ['123 Pennsylvania Ave NW Washington DC 20008',
'652 Polk St San Francisco, CA 94102',
'3711 Travis St #800 Houston, TX 77002']
and I extract them using usaddress
info = [usaddress.parse(loc) for loc in addr]
"info" is a list of a list of tuples that looks like this:
[[('123', 'AddressNumber'),
('Pennsylvania', 'StreetName'),
('Ave', 'StreetNamePostType'),
('NW', 'StreetNamePostDirectional'),
('Washington', 'PlaceName'),
('DC', 'StateName'),
('20008', 'ZipCode')],
[('652', 'AddressNumber'),
('Polk', 'StreetName'),
('St', 'StreetNamePostType'),
('San', 'PlaceName'),
('Francisco,', 'PlaceName'),
('CA', 'StateName'),
('94102', 'ZipCode')],
[('3711', 'AddressNumber'),
('Travis', 'StreetName'),
('St', 'StreetNamePostType'),
('#', 'OccupancyIdentifier'),
('800', 'OccupancyIdentifier'),
('Houston,', 'PlaceName'),
('TX', 'StateName'),
('77002', 'ZipCode')]]
I would like each list (there are 3 lists within the object "info") to represent a row, with the 2nd value of each tuple pair denoting a column and the 1st value of the tuple pair being the cell value. Note: the length of the inner lists will not always be the same, as not every address will have every bit of information.
Any help would be much appreciated!
Thanks
Not sure if there is a DataFrame constructor that can handle info exactly as you have it now. (Maybe from_records or from_items?--still don't think this structure would be directly compatible.)
Here's a bit of manipulation to get what you're looking for:
cols = [j for _, j in info[0]]

# Could use nested list comprehension here, but this is probably
# more readable.
info2 = []
for row in info:
    info2.append([i for i, _ in row])

pd.DataFrame(info2, columns=cols)
AddressNumber StreetName StreetNamePostType StreetNamePostDirectional PlaceName StateName ZipCode
0 123 Pennsylvania Ave NW Washington DC 20008
1 652 Polk St San Francisco, CA 94102
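Since not every address has every tag (as the question notes), the rows of info2 can have different lengths and will not always line up with the columns taken from the first row. As an alternative sketch rather than a fix to the code above, each parsed address could be turned into a {tag: value} dict so that missing tags simply come out as NaN:
import pandas as pd

# One dict per address; tags absent from an address end up as NaN in the
# resulting frame. Note that repeated tags (e.g. the two OccupancyIdentifier
# tuples) keep only the last value with this approach.
rows = [{tag: value for value, tag in row} for row in info]
pd.DataFrame(rows)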
Thank you for your responses! I ended up doing a completely different workaround as follows:
I checked the documentation to see all possible parse_tags from usaddress, created a DataFrame with all possible tags as columns, and one other column with the extracted addresses. Then I proceeded to parse and extract information from the columns using regex. Code below!
parse_tags = ['Recipient','AddressNumber','AddressNumberPrefix','AddressNumberSuffix',
              'StreetName','StreetNamePreDirectional','StreetNamePreModifier','StreetNamePreType',
              'StreetNamePostDirectional','StreetNamePostModifier','StreetNamePostType','CornerOf',
              'IntersectionSeparator','LandmarkName','USPSBoxGroupID','USPSBoxGroupType','USPSBoxID',
              'USPSBoxType','BuildingName','OccupancyType','OccupancyIdentifier','SubaddressIdentifier',
              'SubaddressType','PlaceName','StateName','ZipCode']

addr = ['123 Pennsylvania Ave NW Washington DC 20008',
        '652 Polk St San Francisco, CA 94102',
        '3711 Travis St #800 Houston, TX 77002']

df = pd.DataFrame({'Addresses': addr})
pd.concat([df, pd.DataFrame(columns = parse_tags)])
Then I created a new column that made a string out of the usaddress parse list and called it "Info"
df['Info'] = df['Addresses'].apply(lambda x: str(usaddress.parse(x)))
Now here's the major workaround. I looped through each column name and looked for it in the corresponding "Info" cell and applied regular expressions to extract information where they existed!
for colname in parse_tags:
    df[colname] = df['Info'].apply(lambda x: re.findall("\('(\S+)', '{}'\)".format(colname), x)[0] if re.search(
        colname, x) else "")
This is probably not the most efficient way, but it worked for my purposes. Thanks everyone for providing suggestions!
