I have two different DataFrames whose string columns (names) I am trying to match.
Below are samples of the DataFrames:
df1 (127000,3)
Code Name PostalCode
150 Maarc 47111
250 Kirc 41111
170 Moic 42111
140 Nirc 44111
550 Lacter 47111
df2 (38000,3)
Code NAME POSTAL_CODE
150 Marc 47111
250 Kikc 41111
170 Mosc 49111
140 NiKc 44111
550 Lacter 47111
The aim is to create another DataFrame, DF3, as shown below:
Code NAME Best Match Score
150 Marc Maarc 0.9
250 Kikc Kirc 0.9
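For reference, the samples above can be reconstructed as DataFrames, in case anyone wants to run the snippets below:
import pandas as pd

# Minimal reconstruction of the sample frames shown above
df1 = pd.DataFrame({'Code': [150, 250, 170, 140, 550],
                    'Name': ['Maarc', 'Kirc', 'Moic', 'Nirc', 'Lacter'],
                    'PostalCode': [47111, 41111, 42111, 44111, 47111]})
df2 = pd.DataFrame({'Code': [150, 250, 170, 140, 550],
                    'NAME': ['Marc', 'Kikc', 'Mosc', 'NiKc', 'Lacter'],
                    'POSTAL_CODE': [47111, 41111, 49111, 44111, 47111]})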
The following code gives the expected output
import difflib
from functools import partial
f = partial(difflib.get_close_matches, possibilities=df1['Name'].tolist(), n=1)
matches = df2['NAME'].map(f).str[0].fillna('')
scores = [difflib.SequenceMatcher(None, x, y).ratio()
for x, y in zip(matches, df2['NAME'])]
df3 = df2.assign(best=matches, score=scores)
df3.sort_values(by='score')
The problem
Matching the strings for only 2 rows takes around 30 seconds, and this task has to be done for 1K rows, which will take hours!
The Question
How can I speed up the code?
I was thinking about something like fetchall?
EDIT
Even the fuzzywuzzy library has been tried; it takes longer than difflib with the following code:
from fuzzywuzzy import fuzz

def get_fuzz(df, w):
    # Score w against every name, then keep the row with the best score
    s = df['Name'].apply(lambda y: fuzz.token_set_ratio(y, w))
    # idxmax() returns an index label, so use .loc rather than .iloc
    idx = s.idxmax()
    return {'Name': df['Name'].loc[idx], 'CODE': df['Code'].loc[idx], 'Value': s.max()}

df2 = df2.assign(search=df2['NAME'].apply(lambda x: get_fuzz(df1, x)))
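For reference, fuzzywuzzy also ships a process.extractOne helper that does the same one-best lookup; a minimal sketch against the same frames (note it is still a full pairwise scan, so it will not fix the speed problem by itself):
from fuzzywuzzy import fuzz, process

choices = df1['Name'].tolist()
# extractOne returns a (best_match, score) tuple for each query
df2['search'] = df2['NAME'].apply(
    lambda w: process.extractOne(w, choices, scorer=fuzz.token_set_ratio))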
So I was able to speed up the matching step by using the postal code column as a discriminant, going from 1h40 to 7min of computation.
Below is the code that matches the Name column and retrieves the name with the best score:
%%time
import difflib
from functools import partial

import numpy as np

def difflib_match(df1, df2, set_nan=True):
    # Initialize the result columns with NaN
    df2['best'] = np.nan
    df2['score'] = np.nan
    # Unique postal codes in df2, used as the blocking key
    postal_codes = df2['POSTAL_CODE'].unique()
    # Loop over each postal code and only match rows that share it
    for m, code in enumerate(postal_codes):
        # Print progress every 100 unique postal codes
        if m % 100 == 0:
            print(m, 'of', len(postal_codes))
        df1_block = df1[df1['PostalCode'] == code]
        df2_block = df2[df2['POSTAL_CODE'] == code]
        # Closest df1 name for each df2 name within this postal code
        f = partial(difflib.get_close_matches, possibilities=df1_block['Name'].tolist(), n=1)
        matches = df2_block['NAME'].map(f).str[0].fillna('')
        # Similarity score for each match
        scores = [difflib.SequenceMatcher(None, x, y).ratio()
                  for x, y in zip(matches, df2_block['NAME'])]
        # Write the results back into df2
        for i, name in enumerate(df2_block['NAME']):
            df2['best'].where(df2['NAME'] != name, matches.iloc[i], inplace=True)
            df2['score'].where(df2['NAME'] != name, scores[i], inplace=True)
    return df2

# Apply the function
df_diff = difflib_match(df1, df2)

# Display the result
print('Shape:', df_diff.shape)
df_diff.head()
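The same blocking idea can be written more compactly with a groupby over the postal code, which also avoids the quadratic write-back loop at the end of difflib_match. A rough sketch, assuming the column names above:
import difflib
import pandas as pd

def best_match(name, candidates):
    # Closest candidate (or '' if there is none) plus its similarity ratio
    hits = difflib.get_close_matches(name, candidates, n=1)
    best = hits[0] if hits else ''
    return best, difflib.SequenceMatcher(None, best, name).ratio()

results = []
for code, block in df2.groupby('POSTAL_CODE'):
    # Only df1 names sharing this postal code are candidates
    candidates = df1.loc[df1['PostalCode'] == code, 'Name'].tolist()
    matched = block['NAME'].apply(lambda n: best_match(n, candidates))
    results.append(block.assign(best=matched.str[0], score=matched.str[1]))

# Restore the original row order
df3 = pd.concat(results).sort_index()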
The fastest way I can think of to match strings is using regex.
It's a search language designed to find matches in a string.
You can see an example here:
import re

txt = "The rain in Spain"
x = re.search("^The.*Spain$", txt)
# x is a match object here (truthy), since the pattern matches
*Taken from: https://www.w3schools.com/python/python_regex.asp
Since I don't know anything about DataFrames, I don't know how to implement regex in your code, but I hope the regex functions might help you.
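For what it's worth, running a regex over a whole pandas column is one line with the .str accessor; a minimal sketch (keep in mind a regex only finds exact pattern matches, not approximate ones like difflib):
import pandas as pd

s = pd.Series(['The rain in Spain', 'It never rains here'])
# Boolean mask: True where the anchored pattern matches
mask = s.str.contains(r'^The.*Spain$', regex=True)
print(s[mask])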
Values in my DataFrame look like this:
id val
big_val_167 80
renv_100 100
color_100 200
color_60/write_10 200
I want to remove everything in the values of the id column from the first _ followed by digits onward. So the desired result must look like:
id val
big_val 80
renv 100
color 200
color 200
How can I do that? I know that str.replace() can be used, but I don't understand how to write the regular expression part of it.
You can use regex (re.search) to find the first occurrence of _ + digit, and then you can solve the problem.
Code:
import re
import pandas as pd

def fix_id(id_str):
    # Find the first occurrence of _ followed by a digit in the id
    digit_search = re.search(r"_\d", id_str)
    # Guard against ids that have no such suffix
    return id_str[:digit_search.start()] if digit_search else id_str

# Your df
df = pd.DataFrame({"id": ["big_val_167", "renv_100", "color_100", "color_60/write_10"],
                   "val": [80, 100, 200, 200]})

df["id"] = df["id"].apply(fix_id)
print(df)
Output:
id val
0 big_val 80
1 renv 100
2 color 200
3 color 200
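Since the question mentions str.replace(): the same cut can also be done without apply, in a single vectorized call (a sketch using the same _ + digit pattern):
# Remove everything from the first "_ followed by a digit" onward
df["id"] = df["id"].str.replace(r"_\d.*", "", regex=True)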
I am currently working on a data science project. The idea is to clean the data from "glassdoor_jobs.csv" and present it in a much more understandable manner.
import pandas as pd
df = pd.read_csv('glassdoor_jobs.csv')
#salary parsing
#Removing "-1" Ratings
#Clean up "Founded"
#state field
#Parse out job description
df['hourly'] = df['Salary Estimate'].apply(lambda x: 1 if 'per hour' in x.lower() else 0)
df['employer_provided'] = df['Salary Estimate'].apply(lambda x: 1 if 'employer provided salary' in x.lower() else 0)
df = df[df['Salary Estimate'] != '-1']
Salary = df['Salary Estimate'].apply(lambda x: x.split('(')[0])
minus_Kd = Salary.apply(lambda x: x.replace('K', '').replace('$',''))
minus_hr = minus_Kd.apply(lambda x: x.lower().replace('per hour', '').replace('employer provided salary:', ''))
df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]))
df['max_salary'] = minus_hr.apply(lambda x: int(x.split('-')[1]))
I am getting the error at that last line. After digging a bit, I found out that in minus_hr, some of the 'Salary Estimate' values only have one number instead of a range:
index  Salary Estimate
0      150
1      58
2      130
3      125-150
4      110-140
5      200
6      67- 77
And so on. Now I'm trying to figure out how to work around the "list index out of range" error and make max_salary the same as min_salary for the cells with only one value.
I am also trying to get the average of the min and max salary; if the cell only has a single value, that value should be the average.
So in the end, something like index 0 would look like:
index  min  max  average
0      150  150  150
You'll have to add in a conditional statement somewhere.
df['min_salary'] = minus_hr.apply(lambda x: int(x.split('-')[0]) if '-' in x else int(x))  # cast both branches to int
The above might do it, or you can define a function.
def max_salary(cell_value):
    if '-' in cell_value:
        max_sal = int(cell_value.split('-')[1])
    else:
        max_sal = int(cell_value)
    return max_sal

df['max_salary'] = minus_hr.apply(max_salary)

def avg_salary(cell_value):
    if '-' in cell_value:
        # convert to int before averaging, since the values are strings
        salaries = [int(s) for s in cell_value.split('-')]
        avg = sum(salaries) / len(salaries)
    else:
        avg = int(cell_value)
    return avg

df['avg_salary'] = minus_hr.apply(avg_salary)
Swap in a min_salary version and repeat, as sketched below.
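For completeness, a sketch of that min_salary version (same pattern, index 0 instead of 1):
def min_salary(cell_value):
    if '-' in cell_value:
        min_sal = int(cell_value.split('-')[0])
    else:
        min_sal = int(cell_value)
    return min_sal

df['min_salary'] = minus_hr.apply(min_salary)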
Test the length of x.split('-') before accessing the elements, inside a function you can apply:
def min_max(x):
    salaries = x.split('-')
    if len(salaries) == 1:
        # only one salary number is given, so assign the same value to min and max
        return int(salaries[0]), int(salaries[0])
    # two salary numbers are given
    return int(salaries[0]), int(salaries[1])

df['min_salary'], df['max_salary'] = zip(*minus_hr.apply(min_max))
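As a quick check against the sample values from the question (a hypothetical minus_hr built by hand):
import pandas as pd

minus_hr = pd.Series(['150', '58', '130', '125-150', '110-140', '200', '67- 77'])
df = pd.DataFrame({'Salary Estimate': minus_hr})
df['min_salary'], df['max_salary'] = zip(*minus_hr.apply(min_max))
print(df)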
If you want to avoid .apply()...
Try:
import numpy as np

# extract the one or two numbers from the 'Salary Estimate' column
# (the second group must use \d+ and be optional via ?, so single-number rows yield NaN for max_salary)
sals = df['Salary Estimate'].str.extractall(r'(?P<min_salary>\d+)[^0-9]*(?P<max_salary>\d+)?')
# reset the new frame's index
sals = sals.reset_index()
# join the extracted min/max salary columns to the original dataframe and fill any blanks with nan
df = df.join(sals[['min_salary', 'max_salary']].fillna(np.nan))
# fill any nan values in the 'max_salary' column with values from the 'min_salary' column
df['max_salary'] = df['max_salary'].fillna(df['min_salary'])
# set the type of the columns to int
df['min_salary'] = df['min_salary'].astype(int)
df['max_salary'] = df['max_salary'].astype(int)
# calculate the average
df['average_salary'] = df.loc[:,['min_salary', 'max_salary']].mean(axis=1).astype(int)
# see what you've got
print(df)
Or without using regex:
import numpy as np

# split the 'Salary Estimate' column on '-' (one- or two-element lists)
df['sals'] = df['Salary Estimate'].str.split('-')
# expand the lists in sals into two columns, filling missing values with nan
df[['min_salary', 'max_salary']] = pd.DataFrame(df.sals.tolist()).fillna(np.nan)
# delete the helper sals column
del df['sals']
# fill any nan values in the 'max_salary' column with values from the 'min_salary' column
df['max_salary'] = df['max_salary'].fillna(df['min_salary'])
# set the type of the columns to int
df['min_salary'] = df['min_salary'].astype(int)
df['max_salary'] = df['max_salary'].astype(int)
# calculate the average
df['average_salary'] = df.loc[:, ['min_salary', 'max_salary']].mean(axis=1).astype(int)
# see what you've got
print(df)
Output:
Salary Estimate min_salary max_salary average_salary
0 150 150 150 150
1 58 58 58 58
2 130 130 130 130
3 125-150 125 150 137
4 110-140 110 140 125
5 200 200 200 200
6 67- 77 67 77 72
I have used re.search to get uniqueID strings out of larger strings. For example:
import re

string = 'example string with this uniqueID: 300-350'
combination = r'(\d+)[-](\d+)'
m = re.search(combination, string)
print(m.group(0))
Out: '300-350'
I have created a dataframe with the UniqueID and the Combination as columns.
uniqueID combinations
0 300-350 (\d+)[-](\d+)
1 off-250 (\w+)[-](\d+)
2 on-stab (\w+)[-](\w+)
And a dictionary meaning_combination relating the combination with the variable meaning it represents:
meaning_combination={'(\\d+)[-](\\d+)': 'A-B',
'(\\w+)[-](\\d+)': 'C-A',
'(\\w+)[-](\\w+)': 'C-D'}
I want to create new columns for each variable (A, B, C, D) and fill them with their corresponding values.
The final result should look like this:
uniqueID combinations A B C D
0 300-350 (\d+)[-](\d+) 300 350
1 off-250 (\w+)[-](\d+) 250 off
2 on-stab (\w+)[-](\w+) stab on
I would fix your regexes to:
meaning_combination={'(\d+-\d+)': 'A-B',
'([^0-9\W]+\-\d+)': 'C-A',
'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}
This captures the full match as one group instead of three capturing groups,
i.e. ('300-350', '300', '350') --> ('300-350',).
You don't need the two extra capturing groups: if a specific pattern is satisfied, you already know where the word and digit characters sit (based on how you defined the pattern), and you can split on - to access them individually.
I.e.:
s = 'example string with this uniqueID: 300-350'
values = re.findall(r'(\d+-\d+)', s)
>>>['300-350']
# first number:
values[0].split('-')[0]
>>>'300'
If you use this approach, you can loop over the dictionary keys and the list of strings, testing whether each pattern is satisfied in the string (len(re.findall(pattern, string)) != 0). When a pattern matches, grab the corresponding dictionary value for that key, split both it and the match on -, and assign dictionary_value.split('-')[0]: match[0].split('-')[0] and dictionary_value.split('-')[1]: match[0].split('-')[1] in a new dictionary that you build in the loop; also assign the full match to Unique ID and the matched pattern to Combination. Then use pandas to make a DataFrame.
Altogether:
import re
import pandas as pd
stri= ['example string with this uniqueID: 300-350', 'example string with this uniqueID: off-250', 'example string with this uniqueID: on-stab']
meaning_combination={'(\d+-\d+)': 'A-B',
'([^0-9\W]+\-\d+)': 'C-A',
'([^0-9\W]+\-[^0-9\W]+)': 'C-D'}
values = [
    {'Unique ID': re.findall(x, st)[0],
     'Combination': x,
     y.split('-')[0]: re.findall(x, st)[0].split('-')[0],
     y.split('-')[1]: re.findall(x, st)[0].split('-')[1]}
    for st in stri
    for x, y in meaning_combination.items()
    if len(re.findall(x, st)) != 0
]
df = pd.DataFrame.from_dict(values)
# just to order the columns, since the default is alphabetical
col_val = ['Unique ID', 'Combination', 'A', 'B', 'C', 'D']
df = df.reindex(sorted(df.columns, key=lambda x: col_val.index(x)), axis=1)
print(df)
output:
Unique ID Combination A B C D
0 300-350 (\d+-\d+) 300 350 NaN NaN
1 off-250 ([^0-9\W]+\-\d+) 250 NaN off NaN
2 on-stab ([^0-9\W]+\-[^0-9\W]+) NaN NaN on stab
Also note: I think you have a typo in your expected output, because you have:
'(\\w+)[-](\\d+)': 'C-A'
which would match off-250, but in your final result you have:
uniqueID combinations A B C D
1 off-250 (\w+)[-](\d+) 250 off
Based on your key, these values should be in columns C and A.
I want an input string to be matched against the strings in a file, row by row, and then subtract 1 from the score column of the matching row.
1!! == I think this is the for loop that finds the matching string, line by line, from first to last.
2!! == this is where, once the input string has matched, the score of the matched row should be reduced by 1.
CSV file:
article = pd.read_csv('Customer_List.txt', delimiter=',',
                      names=['ID', 'NAME', 'LASTNAME', 'SCORE', 'TEL', 'PASS'])
y = len(article.ID)
line = article.readlines()
for x in range(0, y):  # 1!!
    if word in line:
        newarticle = int(article.SCORE[x]) - 1  # 2!!
        print(newarticle)
    else:
        x = x + 1
P.S. I have been studying Python for only 5 days, so please give me suggestions. Thank you.
Since I see you are using pandas, I will give a solution without any loops, as it is much easier.
You have, for example:
import pandas as pd

df = pd.DataFrame()
df['ID'] = [216, 217]
df['NAME'] = ['Chatchai', 'Bigm']
df['LASTNAME'] = ['Karuna', 'Koratuboy']
df['SCORE'] = [25, 15]
You need to do:
# input() already returns a string, so no str() wrapper is needed
lookfor = input("Enter the name: ")
df.loc[df.NAME == lookfor, 'SCORE'] -= 1
What happens in the lines above: the entered name is looked up in the NAME column of your DataFrame, and SCORE is reduced by 1 wherever there is a match, which is what you want if I understand your question correctly.
Example:
Now, let's say you look up a person called Alex. Since there is no such person, you get the same DataFrame back:
Enter the name: Alex
ID NAME LASTNAME SCORE
0 216 Chatchai Karuna 25
1 217 Bigm Koratuboy 15
Now, let's say you look up a person called Chatchai. Since there is a match and you want the score to be reduced, you get:
Enter the name: Chatchai
ID NAME LASTNAME SCORE
0 216 Chatchai Karuna 24
1 217 Bigm Koratuboy 15
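If you then want to persist the updated scores back to the file, pandas can write it out again; a sketch assuming the same comma-separated, headerless layout as Customer_List.txt:
# Write the updated DataFrame back in the original comma-separated format
df.to_csv('Customer_List.txt', index=False, header=False)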
I have a bunch of data files with the columns 'Names', 'Gender', 'Count', one file per year. I need to concatenate all the files for some period, sum the counts for all unique names, and add a new column with the number of consonants in each name. I can't extract the string value from 'Names'. How can I implement that?
Here is my code:
import os
import re
import pandas as pd

PATH = ...

def consonants_dynamics(years):
    names_by_year = {}
    for year in years:
        names_by_year[year] = pd.read_csv(PATH + "\\yob{}.txt".format(year),
                                          names=['Names', 'Gender', 'Count'])
    names_all = pd.concat(names_by_year, names=['Year', 'Pos'])
    dynamics = names_all.groupby('Names').sum().sort_values(by='Count', ascending=False).unstack('Names')
    dynamics['Consonants'] = dynamics.apply(count_vowels(dynamics.Names), axis=1)
    return dynamics.head(10)

def count_vowels(name):
    vowels = re.compile('A|E|I|O|U|a|e|i|o|u')
    return len(name) - len(vowels.findall(name))
If I run something like
a = consonants_dynamics(i for i in range (1900, 2001, 10))
I get the following error message:
<ipython-input-9-942fc155267e> in consonants_dynamcis(years)
...
---> 12 dynamics['Consonants'] = dynamics.apply(count_vowels(dynamics.Names), axis = 1)
AttributeError: 'Series' object has no attribute 'Names'
I tried various ways but all failed. How can it be done?
After doing unstack, you converted dynamics to a Series object, where you no longer have a Names column (dynamics.Names). I think it can be fixed by removing .unstack('Names').
After that, recover the names from the index:
dynamics['Consonants'] = dynamics.reset_index()['Names'].apply(count_vowels)
Convert the index to_series and apply the function:
print (dynamics)
Count
Names
James 2
John 3
Robert 10
def count_vowels(name):
    vowels = re.compile('A|E|I|O|U|a|e|i|o|u')
    return len(name) - len(vowels.findall(name))
dynamics['Consonants'] = dynamics.index.to_series().apply(count_vowels)
A solution without a function, using str.len and subtracting the vowel count obtained with str.count:
pat = 'A|E|I|O|U|a|e|i|o|u'
s = dynamics.index.to_series()
dynamics['Consonants_new'] = s.str.len() - s.str.count(pat)
print (dynamics)
Count Consonants_new Consonants
Names
James 2 3 3
John 3 3 3
Robert 10 4 4
EDIT:
A solution without to_series: add as_index=False to groupby so it returns a DataFrame:
names_all = pd.DataFrame({
    'Names': ['James', 'James', 'John', 'John', 'Robert', 'Robert'],
    'Count': [10, 20, 10, 30, 80, 20]
})

# wrap the chain in parentheses so the line break is valid syntax
dynamics = (names_all.groupby('Names', as_index=False).sum()
                     .sort_values(by='Count', ascending=False))

pat = 'A|E|I|O|U|a|e|i|o|u'
dynamics['Consonants'] = dynamics['Names'].str.len() - dynamics['Names'].str.count(pat)
print (dynamics)
Names Count Consonants
2 Robert 100 4
1 John 40 3
0 James 30 3