Speeding up a comparison function for comparing sentences - python

I have a data frame with a shape of (789174, 9). There is a column called resolution that contains a sentence less than 139 characters in length. I built a function to find sentences that have a similarity score above 0.9, using the difflib library. I have a virtual machine with 96 CPUs and 384 GB of RAM. I have been running this function for longer than 2 hours now and it still has not processed past i = 1000. I am concerned that this will take too long to process, and I am wondering if there is a way to speed it up.
def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(len(input_list)):
            if i < j and difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping
Clearly, since we iterate over the column twice, this is O(n^2). I am not sure if there is a way to make it faster. Any suggestions would be greatly appreciated.
EDIT:
I have attempted a speed-up using difflib and fuzzywuzzy. The function only goes through the column once, but I do iterate through the dictionary keys.
def cluster_resolution(df):
    clusters = {}
    for string in df['resolution_modified'].unique():
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:
            clusters[string] = [ string ]
            for m in clusters.keys():
                match2 = fuzz.partial_ratio(string, m)
                if match2 >= 90:
                    clusters[m].append(string)
    return clusters
mappings = cluster_resolution(df_sample)
Is it possible to speed up the latter function?
Here is an example of some data in a dataframe:
d = {'resolution' : ['replaced scanner', 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use', 'tc reimage', 'updated pc', 'deploying replacement scanner', 'upgraded and rebooted station', 'printer has been reconfigured', 'cleared linux print queue and now it is working','user reset her password successfully closing tt','have reset the printer to get it to print again','i plugged usb cable into port and scanner works','reconfigured hand scanner and linked to station','replaced the scanner with station is functional','laptops battery needed to be reset asset serial','reconfigured scanner confirmed that it scans as intended','reimaging laptop corrected the anyconnect software issue','printer was unplugged from usb port working properly now','reconnected usb cable and reassign printer ports on port','reconfigured scanner to base and tested with aa all fine','replaced the defective device with a fresh imaged laptop','reconfigured the printer and the media to print properly','tested printer at station connected and working resolved','red scanner reconfigured and base rebooted via usb joint','station scanner was synced to base and station and is now working','printer offlineswitched usb portprinter is now online and working','replaced the barcode label with one reflecting the tcs ip address','restarted the thin client by using ssh to run the restart command','printer reconfigured and test they are functioning normally again','removed old printer for service installed replacement tested good','tc required reboot rebooted tc had aa signin dp is now functional','resetting the printer to factory settings and then reconfigure it','updated windows os forced update and the laptop operated normally','printer settings are set correct and printer is working correctly','power to printer was disconnected reconnected and is working fine','power cycled equipment and restocked spooler with plastic bubbles','laptop checked ive logged into paskiplacowepl without any problem','reseated scanner cables connection into usb port to resolve issue','the scanner has been replaced and the station is working well now']}
df = pd.DataFrame(data=d)
How I define similarity:
Similarity is really defined by the overall action taken, such as 'replaced scanner' and 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'. The longer string's overall action was replacing the scanner, so the two are very similar, which is why I chose to use the partial_ratio function, since that pair scores 100.
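For example, this is the check I am basing that on (a minimal sketch, assuming fuzzywuzzy is installed):

from fuzzywuzzy import fuzz

short_desc = 'replaced scanner'
long_desc = ('replaced the scanner for the user with a properly working one from the cage '
             'replaced the wire on the damaged one and stored it for later use')

# partial_ratio scores the best-matching substring of the longer string,
# so a short summary can still score very highly against a long description
print(fuzz.partial_ratio(short_desc, long_desc))   # 100 for this pair, per the definition above
print(fuzz.ratio(short_desc, long_desc))           # the plain ratio is much lower for this pair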
Attention:
Please refer to the second function, cluster_resolution, as this is the function I would like to speed up. The first function (replace_similars) is not going to be useful.

Regarding your last edit, I'd make a few changes (mainly using fuzzywuzzy.process rather than fuzzywuzzy.fuzz):
from fuzzywuzzy import fuzz, process

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution'].unique():
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:
            bests = process.extractBests(
                    string,
                    set(clusters.keys())-{string},
                    scorer=fuzz.partial_ratio,
                    score_cutoff=80,
                    limit=1
                    )
            if bests:
                clusters[bests[0][0]].append(string)
            else:
                clusters[string] = [ string ]
But I think you could look into other solutions, like CountVectorizer and whatever metric is suited to it. It is a way to gain speed (as it is vectorized), though the results may be imperfect. Note that CountVectorizer could be a good fit for you, as you have already made the choice of partial_ratio.
For example, something like this:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform
import hdbscan

df = pd.DataFrame(data=d)

cv = CountVectorizer(stop_words="english")
transformed = cv.fit_transform(df['resolution'])
transformed = pd.DataFrame(
        transformed.toarray(),
        columns=cv.get_feature_names(),
        index=df['resolution'])

# keep only columns whose total count is greater than 2
transformed = transformed[transformed.columns[transformed.sum()>2]]

# compute the distance matrix
d = pdist(transformed, metric="hamming") * transformed.shape[1]
s = squareform(d)

clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
clusterer.fit_predict(s)
df['labels'] = clusterer.labels_

print(df.sort_values('labels'))
I think this is still perfectible (this is my first shot at text clustering...). You could also add your own list of stopwords for CountVectorizer, which would help the algorithm. At the very least, it could help you pre-cluster your dataset before using your previous function, for instance this way:
df.groupby('labels')['resolution'].apply(cluster_resolution)
(That way, if your first clustering is roughly OK, you would only check each value against the other values in its cluster, not against all values.)
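A minimal sketch of that two-stage idea (note that cluster_resolution above expects a DataFrame, so a small Series-based variant is assumed here; the helper name is made up):

import difflib

# hypothetical Series-based variant, since groupby(...)['resolution'] passes each
# group to the applied function as a Series rather than a DataFrame
def cluster_resolution_series(series):
    clusters = {}
    for string in series.unique():
        match = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        if match:
            clusters[match[0]].append(string)
        else:
            clusters[string] = [string]
    return clusters

# one fuzzy pass per HDBSCAN label instead of one pass over the whole column
per_label_clusters = df.groupby('labels')['resolution'].apply(cluster_resolution_series)

Each group then only contains a handful of resolutions, so the quadratic fuzzy pass stays cheap.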
Credits to @anon01 for the computation of the distance matrix in this answer, which seems to give slightly better results than the hdbscan default.
Edit:
Another try, including:
a change of metric,
a step with a TF-IDF model,
and a step to lemmatize the words using the nltk package.
So this would be:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

d = {...}
df = pd.DataFrame(d)

lemmatizer = WordNetLemmatizer()

def lemmatization(sentence):
    tag_dict = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV,
    }

    # Tokenize the sentence
    wordsList = nltk.word_tokenize(sentence)

    # Tag each token with its part of speech
    tagged = nltk.pos_tag(wordsList)

    # Convert the list of (token, tag) pairs to lemmatized tokens
    lems = [
        lemmatizer.lemmatize(token, tag_dict.get(tag[0], wordnet.NOUN))
        for token, tag
        in tagged
    ]

    lems = ' '.join(lems)
    return lems

df['lemmatized'] = df['resolution'].apply(lemmatization)

corpus = df['lemmatized']

pipe = Pipeline(
    [
        ('cv', CountVectorizer(stop_words="english")),
        ('tfid', TfidfTransformer())
    ]).fit(corpus)

transformed = pipe.transform(corpus)
transformed = pd.DataFrame(
        transformed.toarray(),
        columns=pipe.named_steps['cv'].get_feature_names(),
        index=df['resolution'])

d = pdist(transformed, metric="cosine") * transformed.shape[1]
s = squareform(d)

clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=2)
clusterer.fit_predict(s)
df['labels'] = clusterer.labels_

print(df.sort_values('labels'))
You could also add some specific code, as your sample seems to be about very specific maintenance logs.
For instance, you could add new features to the transformed dataframe based on a small list of hardware/software:
import numpy as np

# To create a feature about OS:
cols = ['os', 'linux', 'window']
transformed[cols[0]] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

# To create a feature about hardware:
cols = ["laptop", "printer", "scanner"]
transformed["hardware"] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))
This step could help produce better results but may not be necessary. I'm not sure how it will compare to FuzzyWuzzy's performance for matching strings, but I will be interested in your feedback!

def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(i+1, len(input_list)):
            if -15 < len(input_list[i]) - len(input_list[j]) < 15:
                if difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                    input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)

    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]

    return mapping
Even though this might not be a practical solution either, because it would still take about 90 years if every iteration takes 0.1 s, it is a lot more optimised solution.

Related

Speeding up fuzzy match on large list

I am working on a project that uses fuzzy matching on a list of names that could reach about 100,000 unique records. In a recent screening that we conducted, the function we use completed a single name in 2.20 seconds on average. This means that on a list of 10,000 names, this process could take 6 hours, which is really too long.
Is there a way that we can speed up our process? Here's the snippet of the script that we use.
# Importing packages
import pandas as pd
import Levenshtein as lev
# Reading cleaned datasets
df_name_reference = pd.read_csv('path_to_file')
df_name_to_screen = pd.read_csv('path_to_file')
# Function used in name screening
def get_similarity_score(s1, s2):
    ''' Return match percentage between 2 strings disregarding name swapping

    Parameters
    -----------
    s1 : str : name from df_name_reference (to be used within pandas apply)
    s2 : str : name from df_name_to_screen (ref_name variable)

    Return
    -----------
    float
    '''

    # Get sorted names
    s1_sort = ' '.join(sorted(s1.split(' '))).strip() if type(s1)==str else ''
    s2_sort = ' '.join(sorted(s2.split(' '))).strip() if type(s2)==str else ''

    # Get ratios and return the max value
    # THIS COULD BE THE BOTTLENECK OF OUR SCRIPT: MORE DETAILS BELOW
    return max([
        lev.ratio(s1, s2),
        lev.ratio(s1_sort, s2),
        lev.ratio(s1, s2_sort),
        lev.ratio(s1_sort, s2_sort)
    ])

# Returning file
screening_results = []
for row in range(df_name_to_screen.shape[0]):
    # Get name to screen
    ref_name = df_name_to_screen.loc[row, 'fullname']
    # Get scores
    scores = df_name_reference.fullname.apply(lev.ratio, args=(ref_name,))
    # Append results
    screening_results.append(pd.DataFrame({'screened_name':ref_name, 'scores':scores}))
I took four scores from lev.ratio to address variations in the arrangement of names, i.e. firstname-lastname and lastname-firstname formats. I know that the fuzzywuzzy package has token_sort_ratio, but I've noticed that it just splits the name parts and sorts them alphabetically, which leads to lower scores. Plus, fuzzywuzzy is slower than Levenshtein. So I had to manually capture the similarity scores of the sorted and unsorted names.
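For illustration, here is how the helper behaves on a swapped pair (the names are made up; the second score depends on the Levenshtein build):

print(get_similarity_score('ADAMS SEBASTIAN', 'SEBASTIAN ADAMS'))          # 1.0 - the sorted forms are identical
print(get_similarity_score('ADAMS SEBASTIAN', 'ADAMS SEBASTIAN HAUSIKU'))  # lower, the extra surname is not handled by sorting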
Can anyone give an approach that I could try? Thanks!
EDIT: Here's a sample dataset that you may try. This is in Google Drive.
In case you don't need scores for all entries in the reference data but just the top N, you can use difflib.get_close_matches to remove the others before calculating any scores:
screening_results = []
for row in range(df_name_to_screen.shape[0]):
    ref_name = df_name_to_screen.loc[row, 'fullname']
    skimmed = pd.DataFrame({
        'fullname': difflib.get_close_matches(
            ref_name,
            df_name_reference.fullname,
            N_RESULTS,
            0
        )
    })
    scores = skimmed.fullname.apply(lev.ratio, args=(ref_name,))
    screening_results.append(pd.DataFrame({'screened_name': ref_name, 'scores': scores}))
This takes about 50ms per row using the file you provided.
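At that rate, the 10,000-name screening described in the question would come to roughly 10,000 × 0.05 s ≈ 500 s, i.e. around 8 to 9 minutes instead of 6 hours.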

Applying function to pandas dataframe: is there a more efficient way of doing this?

I have a dataframe that has a small number of columns but many rows (about 900K right now, and it's going to get bigger as I collect more data). It looks like this:
  | Author | Title | Date | Category | Text | url
0 | Amira Charfeddine | Wild Fadhila 01 | 2019-01-01 | novel | الكتاب هذا نهديه لكل تونسي حس إلي الكتاب يحكي ... | NaN
1 | Amira Charfeddine | Wild Fadhila 02 | 2019-01-01 | novel | في التزغريت، والعياط و الزمامر، ليوم نتيجة الب... | NaN
2 | 253826 | 1515368_7636953 | 2010-12-28 | /forums/forums/91/ | هذا ما ينص عليه إدوستور التونسي لا رئاسة مدى ا... | https://www.tunisia-sat.com/forums/threads/151...
3 | 250442 | 1504416_7580403 | 2010-12-21 | /forums/sports/ | \n\n\n\n\n\nاعلنت الجامعة التونسية لكرة اليد ا... | https://www.tunisia-sat.com/forums/threads/150...
4 | 312628 | 1504416_7580433 | 2010-12-21 | /forums/sports/ | quel est le résultat final\n,,,,???? | https://www.tunisia-sat.com/forums/threads/150...
The "Text" Column has a string of text that may be just a few words (in the case of a forum post) or it may a portion of a novel and have tens of thousands of words (as in the two first rows above).
I have code that constructs the dataframe from various corpus files (.txt and .json), then cleans the text and saves the cleaned dataframe as a pickle file.
I'm trying to run the following code to analyze how variable the spelling of different words is in the corpus. The functions seem simple enough: one counts the occurrences of a particular spelling variant in each Text row; the other takes a list of such frequencies and computes a Gini coefficient for each lemma (which is just a numerical measure of how heterogeneous the spelling is). It references a spelling_var dictionary that has a lemma as its key and the various ways of spelling that lemma as values (like {'color': ['color', 'colour']}, except not in English).
This code works, but it uses a lot of CPU time. I'm not sure how much, but I use PythonAnywhere for my coding and this code sends me into the tarpit (in other words, it makes me exceed my daily allowance of CPU seconds).
Is there a way to do this so that it's less CPU intensive? Preferably without me having to learn another package (I've spent the past several weeks learning Pandas and am liking it, and need to just get on with my analysis). Once I have the code and have finished collecting the corpus, I'll only run it a few times; I won't be running it everyday or anything (in case that matters).
Here's the code:
import pickle
import pandas as pd
import re

with open('1_raw_df.pkl', 'rb') as pickle_file:
    df = pickle.load(pickle_file)

spelling_var = {
    'illi': ["الي", "اللي"],
    'besh': ["باش", "بش"],
    ...
}

spelling_df = df.copy()

def count_word(df, word):
    pattern = r"\b" + re.escape(word) + r"\b"
    return df['Text'].str.count(pattern)

def compute_gini(freq_list):
    proportions = [f/sum(freq_list) for f in freq_list]
    squared = [p**2 for p in proportions]
    return 1-sum(squared)

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        gini = compute_gini(count_list)
        spelling_df[w] = gini
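To make the intent of compute_gini concrete, a quick worked example with hypothetical counts:

# two spellings used equally often -> maximal heterogeneity for two variants
print(compute_gini([10, 10]))   # 1 - (0.5**2 + 0.5**2) = 0.5

# only one spelling ever used -> perfectly homogeneous
print(compute_gini([20, 0]))    # 1 - (1.0**2 + 0.0**2) = 0.0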
I rewrote two lines in the last double loop; see the comments in the code below. Does this solve your issue?
gini_lst = []

for w, var in spelling_var.items():
    count_list = []
    for v in var:
        count_list.append(count_word(spelling_df, v))
        #gini = compute_gini(count_list)  # don't think you need to compute this at every iteration of the inner loop, right?
        #spelling_df[w] = gini  # having this inside of the loop creates a new column at each iteration, which could crash your CPU
    gini_lst.append(compute_gini(count_list))

# this creates a df with a row for each lemma with its associated gini value
df_lemma_gini = pd.DataFrame(data={"lemma_column": list(spelling_var.keys()), "gini_column": gini_lst})

Python sklearn TfidfVectorizer: Vectorize documents ahead of query for semantic search

I want to run semantic search using TF-IDF.
This code works, but it is really slow when used on a large corpus of documents:
search_terms = "my query"
documents = ["my","list","of","docs"]
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform([search_terms] + documents)
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]
It seems quite inefficient:
Every new search query triggers a re-vectorizing of the entire corpus.
I am wondering how I can do the bulk work of vectorizing my corpus ahead of time, saving the result in an "index file". So that, when I run a query, the only thing left to do is to vectorize the few words from the query, and then to calculate similarity.
I tried vectorizing query and documents separately:
vec_docs = vectorizer.fit_transform(documents)
vec_query = vectorizer.fit_transform([search_terms])
cosine_similarities = linear_kernel(vec_query, vec_docs).flatten()
But it gives me this error:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 3 while Y.shape[1] == 260541
How can I run the corpus vectorization ahead of time without knowing what the query will be?
My main goal is to get blazing fast results even with a large corpus of documents (say, a few GB worth of text), even on a low-powered server, by doing the bulk of the data-crunching ahead of time.
TF/IDF vectors are high-dimensional and sparse. The basic data structure that supports that is an inverted index. You can either implement it yourself or use a standard index (e.g., Lucene).
Nevertheless, if you would like to experiment with modern deep-neural-based vector representations, check out the following semantic search demo. It uses a similarity search service that can handle billions of vectors.
(Note, I am a co-author of this demo.)
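For intuition, a toy inverted index can be sketched with a plain dictionary (this is only an illustration of the data structure, nothing like a production index):

from collections import defaultdict

documents = ["my first doc", "another doc about queries", "totally unrelated text"]

# token -> set of ids of the documents containing it
index = defaultdict(set)
for doc_id, text in enumerate(documents):
    for token in text.lower().split():
        index[token].add(doc_id)

# a query only touches the postings of its own tokens,
# instead of scoring every document in the corpus
print(index.get("doc", set()))   # {0, 1}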
You almost have it right.
In this instance, you can get away with fitting (and transforming) your documents and only transforming your search terms. Here is your code, modified accordingly, using the twenty_newsgroups documents (11k) in place of your list. You can run it as a script and interactively verify you get fast results:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

news = fetch_20newsgroups()

search_terms = "my query"
# documents = ["my", "list", "of", "docs"]
documents = news.data

vectorizer = TfidfVectorizer()

# fit_transform does two things: fits the vectorizer and transforms documents
doc_vectors = vectorizer.fit_transform(documents)

# the vectorizer is already fit; just transform search_terms via vectorizer
search_term_vector = vectorizer.transform([search_terms])

cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()

if __name__ == "__main__":
    while True:
        query_str = input("\n\n\n\nquery string (return to quit): ")
        if not query_str:
            print("bye!")
            break
        search_term_vector = vectorizer.transform([query_str])
        cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()
        best_idx = np.argmax(cosine_similarities)
        best_score = cosine_similarities[best_idx]
        best_doc = documents[best_idx]
        if best_score < 0.1:
            print("no good matches")
        else:
            print(
                f"Best match ({round(best_score, 4)}):\n\n", best_doc[0:200] + "...",
            )
Example output:
query string (return to quit): protocol
Best match 0.239 (0.014 sec):
From: ethan#cs.columbia.edu (Ethan Solomita)
Subject: Re: X protocol packet type
Article-I.D.: cs.C52I2q.IFJ
Organization: Columbia University Department of Computer Science
Lines: 7
In article <9309...
Note: this algorithm finds the best match(es) in O(n_documents) time at best, compared to Lucene (which powers Elasticsearch), whose skip lists can search in O(log(n_documents)). Production search engines also have quite a bit of tuning to optimize performance. The above could be useful with some tweaking but isn't going to topple Google tomorrow :)
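As for the "index file" part of the question, one hedged option is to persist the fitted vectorizer and the document matrix with joblib (the file names below are placeholders), so that a later process only has to load them and transform the query:

from joblib import dump, load

# one-off "indexing" step: fit on the corpus, then save both artifacts to disk
dump(vectorizer, "tfidf_vectorizer.joblib")
dump(doc_vectors, "doc_vectors.joblib")

# later, in the search process: load instead of re-fitting the whole corpus
vectorizer = load("tfidf_vectorizer.joblib")
doc_vectors = load("doc_vectors.joblib")
query_vector = vectorizer.transform(["my query"])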

How to fuzzy match two lists in Python

I have two lists: ref_list and inp_list. How can one make use of FuzzyWuzzy to match the input list from the reference list?
inp_list = pd.DataFrame(['ADAMS SEBASTIAN', 'HAIMBILI SEUN', 'MUTESI JOHN',
                         'SHEETEKELA MATT', 'MUTESI JOHN KUTALIKA',
                         'ADAMS SEBASTIAN HAUSIKU', 'PETERS WILSON',
                         'PETERS MARIO', 'SHEETEKELA MATT NICKY'],
                        columns=['Names'])

ref_list = pd.DataFrame(['ADAMS SEBASTIAN HAUSIKU', 'HAIMBILI MIKE', 'HAIMBILI SEUN',
                         'MUTESI JOHN KUTALIKA', 'PETERS WILSON MARIO',
                         'SHEETEKELA MATT NICKY MBILI'],
                        columns=['Names'])
After some research, I modified some code I found on the internet. The problem with this code is that it works very well on small samples, but in my case inp_list and ref_list are 29k and 18k entries long respectively, and it takes more than a day to run.
Below is the code; first, a helper function is defined.
def match_term(term, inp_list, min_score=0):
    # -1 score in case I don't get any matches
    max_score = -1
    # return empty for no match
    max_name = ''
    # iterate over all names in the other
    for term2 in inp_list:
        # find the fuzzy match score
        score = fuzz.token_sort_ratio(term, term2)
        # checking if I am above my threshold and have a better score
        if (score > min_score) & (score > max_score):
            max_name = term2
            max_score = score
    return (max_name, max_score)

# list for dicts for easy dataframe creation
dict_list = []

# iterating over the sales file
for name in inp_list:
    # use the defined function above to find the best match, also set the threshold to a chosen #
    match = match_term(name, ref_list, 94)

    # new dict for storing data
    dict_ = {}
    dict_.update({'passenger_name': name})
    dict_.update({'match_name': match[0]})
    dict_.update({'score': match[1]})
    dict_list.append(dict_)
Where can this code be improved so it runs faster, and can it perhaps avoid evaluating items that have already been assessed?
You can try to vectorize the operations instead of evaluating the scores in a loop.
Make a df where the first column, ref, holds ref_list and the second column, inp, holds each name in inp_list. Then call df.apply(lambda row: process.extractOne(row['inp'], row['ref']), axis=1). Finally, you'll get the best match name and score in ref_list for each name in inp_list.
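A rough sketch of that suggestion, using the inp_list and ref_list DataFrames from the question (hedged: apply is still a per-row loop under the hood, so the gain comes from extractOne scoring all choices in one call rather than from true vectorization):

import pandas as pd
from fuzzywuzzy import process

# one row per input name, with the whole reference list repeated in the 'ref' column
pairs = pd.DataFrame({
    'inp': inp_list['Names'],
    'ref': [ref_list['Names'].tolist()] * len(inp_list),
})

# process.extractOne returns a (best_match, score) tuple for each input name
results = pairs.apply(lambda row: process.extractOne(row['inp'], row['ref']), axis=1)
pairs['match_name'] = results.map(lambda t: t[0])
pairs['score'] = results.map(lambda t: t[1])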
The measures you are using are computationally demanding with that many pairs of strings. As an alternative to fuzzywuzzy, you could try a library called string_grouper, which exploits a faster TF-IDF method and the cosine similarity measure to find similar words. As an example:
import random, string, time
import pandas as pd
from string_grouper import match_strings

alphabet = list(string.ascii_lowercase)
from_r, to_r = 0, len(alphabet)-1

random_strings_1 = ["".join(alphabet[random.randint(from_r, to_r)]
                    for i in range(6)) for j in range(5000)]
random_strings_2 = ["".join(alphabet[random.randint(from_r, to_r)]
                    for i in range(6)) for j in range(5000)]

series_1 = pd.Series(random_strings_1)
series_2 = pd.Series(random_strings_2)

t_1 = time.time()
matches = match_strings(series_1, series_2,
                        min_similarity=0.6)
t_2 = time.time()
print(t_2 - t_1)
print(matches)
It takes less than one second to do 25,000,000 comparisons! For a surely more useful test of the library, look here: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html where it is claimed that
"Using this approach made it possible to search for near duplicates in a set of 663,000 company names in 42 minutes using only a dual-core laptop".
To tune your matching algorithm further, look at the **kwargs arguments you can give to the match_strings function above.
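For instance, a hedged example using two of the documented options (check your installed string_grouper version for the exact keyword names):

# stricter matching: 3-character n-grams and a higher cosine-similarity threshold
matches = match_strings(series_1, series_2,
                        ngram_size=3,
                        min_similarity=0.8)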

How to optimize retrieval of 10 most frequent words inside a json data object?

I'm looking for ways to make the code more efficient (runtime and memory complexity)
Should I use something like a Max-Heap?
Is the bad performance due to the string concatenation, to sorting the dictionary not in place, or to something else?
Edit: I replaced the dictionary/map object with a Counter applied to a list of all retrieved names (with duplicates)
minimal requirement: the script should take less than 30 seconds
current runtime: it takes 54 seconds
# Try to implement the program efficiently (running the script should take less than 30 seconds)
import requests
# Requests is an elegant and simple HTTP library for Python, built for human beings.
# Requests is the only Non-GMO HTTP library for Python, safe for human consumption.
# Requests is not a built-in module (does not come with the default python installation), so you will have to install it:
# http://docs.python-requests.org/en/v2.9.1/
# installing it for PyCharm is not so easy and takes a lot of troubleshooting (problems with pip's main version)
# use conda/pip install requests instead
import json
# dict subclass for counting hashable objects
from collections import Counter
#import heapq
import datetime

url = 'https://api.namefake.com'
# a "global" list object. TODO: try to make it "static" (local to the file)
words = []

#####################################################################################
# Calls the site http://www.namefake.com 100 times and retrieves random names
# Examples for the format of the names from this site:
# Dr. Willis Lang IV
# Lily Purdy Jr.
# Dameon Bogisich
# Ms. Zora Padberg V
# Luther Krajcik Sr.
# Prof. Helmer Schaden etc....
#####################################################################################

requests.packages.urllib3.disable_warnings()

t = datetime.datetime.now()

for x in range(100):
    # for each name, break it to first and last name
    # no need for authentication
    # http://docs.python-requests.org/en/v2.3.0/user/quickstart/#make-a-request
    responseObj = requests.get(url, verify=False)

    # Decoding JSON data from returned response object text
    # Deserialize ``s`` (a ``str``, ``bytes`` or ``bytearray`` instance
    # containing a JSON document) to a Python object.
    jsonData = json.loads(responseObj.text)
    x = jsonData['name']

    newName = ""
    for full_name in x:
        # make a string from the decoded python object concatenation
        newName += str(full_name)

    # split by whitespaces
    y = newName.split()

    # parse the first name (check first if a header exists (Prof., Dr., Mr., Miss))
    if "." in y[0] or "Miss" in y[0]:
        words.append(y[2])
    else:
        words.append(y[0])
    words.append(y[1])

# Return the top 10 words that appear most frequently, together with the number of times each word appeared.
# Output example: ['Weber', 'Kris', 'Wyman', 'Rice', 'Quigley', 'Goodwin', 'Lebsack', 'Feeney', 'West', 'Marlen']
# (We don't care whether the word was a first or a last name)

# list of tuples
top_ten = Counter(words).most_common(10)

top_names_list = [name[0] for name in top_ten]

print((datetime.datetime.now()-t).total_seconds())
print(top_names_list)
You are calling an endpoint of an API that generates dummy information one person at a time - that takes a considerable amount of time.
The rest of the code is taking almost no time.
Change the endpoint you are using (there is no bulk-name-gathering on the one you use) or use built-in dummy data provided by python modules.
You can clearly see that "counting and processing names" is not the bottleneck here:
from faker import Faker # python module that generates dummy data
from collections import Counter
import datetime
fake = Faker()
c = Counter()
# get 10,000 names, split them and add the 1st part
t = datetime.datetime.now()
c.update( (fake.name().split()[0] for _ in range(10000)) )
print(c.most_common(10))
print((datetime.datetime.now()-t).total_seconds())
Output for 10000 names:
[('Michael', 222), ('David', 160), ('James', 140), ('Jennifer', 134),
('Christopher', 125), ('Robert', 124), ('John', 120), ('William', 111),
('Matthew', 111), ('Lisa', 101)]
in 1.886564 seconds
General advice for code optimization: measure first, then optimize the bottlenecks.
If you need a code review, you can check https://codereview.stackexchange.com/help/on-topic and see if your code fits the requirements of the Code Review Stack Exchange site. As with SO, some effort should be put into the question first, i.e. analyzing where the majority of your time is being spent.
Edit - with performance measurements:
import requests
import json
from collections import defaultdict
import datetime

# defaultdict is (in this case) better than Counter because you add 1 name at a time
# Counter is superior if you update whole iterables of names at a time
d = defaultdict(int)

def insertToDict(n):
    d[n] += 1

url = 'https://api.namefake.com'
api_times = []
process_times = []
requests.packages.urllib3.disable_warnings()

for x in range(10):
    # for each name, break it to first and last name
    try:
        t = datetime.datetime.now()  # start time for API call
        # no need for authentication
        responseObj = requests.get(url, verify=False)
        jsonData = json.loads(responseObj.text)
        # end time for API call
        api_times.append( (datetime.datetime.now()-t).total_seconds() )

        x = jsonData['name']
        t = datetime.datetime.now()  # start time for name processing

        newName = ""
        for name_char in x:
            # make a string from the decoded python object concatenation
            newName = newName + str(name_char)

        # split by whitespaces
        y = newName.split()

        # parse the first name (check first if a header exists (Prof., Dr., Mr., Miss))
        if "." in y[0] or "Miss" in y[0]:
            insertToDict(y[2])
        else:
            insertToDict(y[0])
        insertToDict(y[1])

        # end time for name processing
        process_times.append( (datetime.datetime.now()-t).total_seconds() )
    except:
        continue

newA = sorted(d, key=d.get, reverse=True)[:10]
print(newA)
print(sum(api_times))
print(sum(process_times))
Output:
['Ruecker', 'Clare', 'Darryl', 'Edgardo', 'Konopelski', 'Nettie', 'Price',
'Isobel', 'Bashirian', 'Ben']
6.533625
0.000206
You could make the parsing part better; I did not, because it does not matter.
It is better to use timeit for performance testing (it calls the code multiple times and averages, smoothing out artifacts due to caching/lag/...) (thanks @bruno desthuilliers) - in this case I did not use timeit because I do not want to call the API 100,000 times to average the results.
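A minimal timeit example of that kind of measurement, timing only a hypothetical local name-parsing step and not the API calls:

import timeit

def parse_name(raw="Prof. Helmer Schaden"):
    parts = raw.split()
    # mirror the header check from the code above
    return parts[1:] if "." in parts[0] or "Miss" in parts[0] else parts[:2]

# runs the call many times and reports the total, smoothing out noise
print(timeit.timeit(parse_name, number=100_000))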
