Dataset strings replace not speeding up with threads - python

I recently got into Natural Language Processing for a university project and, given a list of words, I wanted to delete all of those words from a dataset of strings.
My dataset looks like this, but much bigger:
data_set = ['Human machine interface for lab abc computer applications',
'A survey of user opinion of computer system response time',
'The EPS user interface management system',
'System and human system engineering testing of EPS',
'Relation of user perceived response time to error measurement',
'The generation of random binary unordered trees',
'The intersection graph of paths in trees',
'Graph minors IV Widths of trees and well quasi ordering',
'Graph minors A survey']
The list of words to delete looks like this, but again, much longer:
to_remove = ['abc', 'of', 'quasi', 'well']
Since I couldn't find a Python function that deletes words from a string directly, I used the replace() function.
The program should take data_set and, for each word in to_remove, call replace() on every string of data_set, with different strings handled in parallel. I was hoping that threads could speed things up, but unfortunately it takes almost the same time as the program without threads. Am I implementing the threads correctly, or did I miss something?
The code with threads is the following:
from multiprocessing.dummy import Pool as ThreadPool

def remove_words(params):
    changed_data_set = params[0]
    for elem in params[1]:
        changed_data_set = changed_data_set.replace(' ' + elem, ' ')
    return changed_data_set

def parallel_task(params, threads=2):
    pool = ThreadPool(threads)
    results = pool.map(remove_words, params)
    pool.close()
    pool.join()
    return results

parameters = []
for rows in data_set:
    parameters.append((rows, to_remove))

new_data_set = parallel_task(parameters, 8)
The code without threads is the following:
def remove_words(data_set, to_replace):
    for i in range(len(data_set)):
        for word in to_replace:
            data_set[i] = data_set[i].replace(' ' + word, ' ')
    return data_set

changed_data_set = remove_words(data_set, to_remove)
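One thing worth trying (a sketch, not from the original post): the replace() calls are CPU-bound and CPython's GIL lets only one thread run Python code at a time, so a pool of real processes may help where threads did not. Assuming the same data_set and to_remove as above, and the same per-string worker as in the threaded version:

from multiprocessing import Pool

def remove_words(params):
    # same worker as in the threaded version: strip each word from one string
    changed, words = params
    for word in words:
        changed = changed.replace(' ' + word, ' ')
    return changed

if __name__ == '__main__':
    parameters = [(row, to_remove) for row in data_set]
    # real processes instead of threads, so the work is not serialized by the GIL
    with Pool(processes=4) as pool:
        new_data_set = pool.map(remove_words, parameters)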

Related

Speeding up a comparison function for comparing sentences

I have a data frame with shape (789174, 9). It has a column called resolution that contains sentences less than 139 characters long. I built a function that uses the difflib library to find sentences with a similarity score above 0.9. I have a virtual machine with 96 CPUs and 384 GB of RAM. I have been running this function for more than 2 hours now and it still has not reached i = 1000. I am concerned that this will take too long to process, and I am wondering whether there is a way to speed it up.
import difflib
import time

def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(len(input_list)):
            if i < j and difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)
    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]
    return mapping
Clearly, since we iterate over the column twice, this is O(n^2). I am not sure whether there is a way to make it faster. Any suggestions would be greatly appreciated.
EDIT:
I have attempted a speed-up using difflib and fuzzywuzzy. The function only goes through the column once, but I do iterate over the dictionary keys.
from fuzzywuzzy import fuzz

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution_modified'].unique():
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:
            clusters[string] = [ string ]
            for m in clusters.keys():
                match2 = fuzz.partial_ratio(string, m)
                if match2 >= 90:
                    clusters[m].append(string)
    return clusters

mappings = cluster_resolution(df_sample)
Is it possible to speed up the latter function?
Here is an example of some data in a dataframe
d = {'resolution' : ['replaced scanner', 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use', 'tc reimage', 'updated pc', 'deploying replacement scanner', 'upgraded and rebooted station', 'printer has been reconfigured', 'cleared linux print queue and now it is working','user reset her password successfully closing tt','have reset the printer to get it to print again','i plugged usb cable into port and scanner works','reconfigured hand scanner and linked to station','replaced the scanner with station is functional','laptops battery needed to be reset asset serial','reconfigured scanner confirmed that it scans as intended','reimaging laptop corrected the anyconnect software issue','printer was unplugged from usb port working properly now','reconnected usb cable and reassign printer ports on port','reconfigured scanner to base and tested with aa all fine','replaced the defective device with a fresh imaged laptop','reconfigured the printer and the media to print properly','tested printer at station connected and working resolved','red scanner reconfigured and base rebooted via usb joint','station scanner was synced to base and station and is now working','printer offlineswitched usb portprinter is now online and working','replaced the barcode label with one reflecting the tcs ip address','restarted the thin client by using ssh to run the restart command','printer reconfigured and test they are functioning normally again','removed old printer for service installed replacement tested good','tc required reboot rebooted tc had aa signin dp is now functional','resetting the printer to factory settings and then reconfigure it','updated windows os forced update and the laptop operated normally','printer settings are set correct and printer is working correctly','power to printer was disconnected reconnected and is working fine','power cycled equipment and restocked spooler with plastic bubbles','laptop checked ive logged into paskiplacowepl without any problem','reseated scanner cables connection into usb port to resolve issue','the scanner has been replaced and the station is working well now']}
df = pd.DataFrame(data=d)
How I define similarity:
Similarity is really defined by the overall action taken, for example 'replaced scanner' versus 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'. The longer string's overall action was replacing the scanner, so those two are very similar, which is why I chose the partial_ratio function, since those two have a score of 100.
Attention:
Please refer to the second function, cluster_resolution, as this is the function I would like to speed up. The first function is not going to be useful.
Regarding your last edit, I'd make a few changes (mainly using fuzzywuzzy.process rather than fuzzywuzzy.fuzz) :
import difflib
from fuzzywuzzy import fuzz, process

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution'].unique():
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:
            bests = process.extractBests(
                string,
                set(clusters.keys()) - {string},
                scorer=fuzz.partial_ratio,
                score_cutoff=80,
                limit=1
            )
            if bests:
                clusters[bests[0][0]].append(string)
            else:
                clusters[string] = [ string ]
But I think you could look into other solutions, like CountVectorizer with whatever metric suits it. It is a way to gain speed (as it is vectorized), though the results may be imperfect. Note that CountVectorizer could be a good fit for you, as you have already made the choice of partial_ratio.
For example, something like this :
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan

df = pd.DataFrame(d)

cv = CountVectorizer(stop_words="english")
transformed = cv.fit_transform(df['resolution'])
transformed = pd.DataFrame(
    transformed.toarray(),
    columns=cv.get_feature_names(),
    index=df['resolution'])

# keep only tokens whose total count is greater than 2
transformed = transformed[transformed.columns[transformed.sum() > 2]]

# compute the distance matrix
d = pdist(transformed, metric="hamming") * transformed.shape[1]
s = squareform(d)

clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
clusterer.fit_predict(s)
df['labels'] = clusterer.labels_
print(df.sort_values('labels'))
I think this can still be improved (this is my first shot at text clustering...). You could also pass your own list of stopwords to CountVectorizer, which would help the algorithm. At the very least, it could help you pre-cluster your dataset before using your previous function, for instance this way:
df.groupby('labels')['resolution'].apply(cluster_resolution)
(That way, if the first clustering is roughly OK, you only check each value against the other values in its cluster rather than against all values.)
Credit to @anon01 for the computation of the distance matrix in this answer, which seems to give slightly better results than hdbscan's default.
Edit :
Another try, including:
a change of metric,
an added step with a TF-IDF model,
and an added step to lemmatize the words using the nltk package.
So this would be :
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

d = {...}
df = pd.DataFrame(d)

lemmatizer = WordNetLemmatizer()

def lemmatization(sentence):
    tag_dict = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV,
    }
    # Tokenize the sentence
    wordsList = nltk.word_tokenize(sentence)
    # Tag each token with its part of speech
    tagged = nltk.pos_tag(wordsList)
    # Convert the list of (token, tag) pairs to lemmatized tokens
    lems = [
        lemmatizer.lemmatize(token, tag_dict.get(tag[0], wordnet.NOUN))
        for token, tag in tagged
    ]
    lems = ' '.join(lems)
    return lems

df['lemmatized'] = df['resolution'].apply(lemmatization)
corpus = df['lemmatized']

pipe = Pipeline(
    [
        ('cv', CountVectorizer(stop_words="english")),
        ('tfid', TfidfTransformer())
    ]).fit(corpus)
transformed = pipe.transform(corpus)
transformed = pd.DataFrame(
    transformed.toarray(),
    columns=pipe.named_steps['cv'].get_feature_names(),
    index=df['resolution'])

d = pdist(transformed, metric="cosine") * transformed.shape[1]
s = squareform(d)

clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=2)
clusterer.fit_predict(s)
df['labels'] = clusterer.labels_
print(df.sort_values('labels'))
You could also add some specific code, as your sample seems to be about very specific maintenance logs.
For instance, you could add new features to the transformed dataframe based on a small list of hardware/software terms:
import numpy as np

# To create a feature about OS:
cols = ['os', 'linux', 'window']
transformed[cols[0]] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))

# To create a feature about hardware:
cols = ["laptop", "printer", "scanner"]
transformed["hardware"] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))
This step could help produce better results but may not be necessary. I'm not sure how it compares to FuzzyWuzzy's performance for matching strings, but I'd be interested in your feedback!
def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(i+1, len(input_list)):
            if -15 < len(input_list[i]) - len(input_list[j]) < 15:
                if difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                    input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)
    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]
    return mapping
Even though this might not be a practical solution either, because it is still going to take about 90 years if every iteration takes 0.1 s, it is a much more optimised solution.
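A minimal usage sketch for the two functions above (an addition, assuming the strings live in df['resolution'] as in the sample dataframe; resolution_mapped is a hypothetical output column name):

# map each original string to its canonical replacement
unique_strings = df['resolution'].unique().tolist()
mapping = generate_mapping(unique_strings)
df['resolution_mapped'] = df['resolution'].map(mapping)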

Doc2Vec not providing adequate results in most_similar

I'm trying to use Doc2Vec to go through the classic exercise of training on Wikipedia articles, using the article title as the tag.
Here's my code and the results; is there something I'm missing that would explain why most_similar doesn't return matching results? I followed this tutorial, but used the wiki-english-20171001 dataset that comes with gensim.
import gensim.downloader as api
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
import re

def cleanText(text):
    text = re.sub(r'\|\|\|', r' ', text)
    text = re.sub(r'http\S+', r'<URL>', text)
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text

wiki = api.load("wiki-english-20171001")
data = [d for d in wiki]
for i in range(10):
    print(data[i])

def my_create_tagged_docs(data):
    for wikiidx in range(len(data)):
        yield TaggedDocument([i for i in data[wikiidx].get('section_texts') for i in cleanText(i).split()], [data[wikiidx].get('title')])

wiki_data = my_create_tagged_docs(data)
del data
del wiki

model = Doc2Vec(dm=1, dm_mean=1, size=200, window=8, min_count=19, iter=10, epochs=40)
model.build_vocab(wiki_data)
model.train(wiki_data, total_examples=model.corpus_count, epochs=model.epochs)
model.docvecs.most_similar(positive=["Lady Gaga"], topn=10)
[('Chlorothrix', 0.35521823167800903),
("A Child's Garden of Verses", 0.3533579707145691),
('Fish Mooney', 0.35129639506340027),
('2000 Paris–Roubaix', 0.3463437855243683),
('Calvin C. Chaffee', 0.3439667224884033),
('Murders of Eve Stratford and Lynne Weedon', 0.3397218585014343),
('Black Air', 0.3396576941013336),
('Turzyn', 0.3312540054321289),
('Scott Baker', 0.33018186688423157),
('Amongst the Waves', 0.3297169804573059)]
model.docvecs.most_similar(positive=["Machine learning"], topn=10)
[('Wolf Rock, Connecticut', 0.3855834901332855),
('Amália Rodrigues', 0.3349645137786865),
('Victoria Park, Leicester', 0.33312514424324036),
('List of visual anthropology films', 0.3311382532119751),
('Sadqay Teri Mout Tun', 0.3287636637687683),
('T. Damodaran', 0.32876330614089966),
('Urqu Jawira (Aroma)', 0.32281631231307983),
('Tiggy Wiggy', 0.3226730227470398),
('Frédéric Brun (cyclist, born 1988)', 0.32106447219848633),
('Unholy Crusade', 0.3200794756412506)]
It looks like your wiki_data is a single-pass generator, as returned by my_create_tagged_docs(), which can be iterated over only once - not an iterable object capable of many iterations, as the many steps of the Doc2Vec training requires.
You can test your wiki_data object for whether it's multiply-iterable, just after it's been assigned, by executing:
print(sum(1 for _ in wiki_data))
print(sum(1 for _ in wiki_data))
If you see the same number twice – the total number of documents – all's well. If the 2nd number is 0, you've created a single-use iterator instead of a multiple-use iterable.
As a result, the build_vocab() call will work to initialize the known-vocabulary & model – but then the train() will see an empty iterable, completing instantly with no real training happening. (If you run with logging at the INFO level, this may be obvious in the log timestamps for the various steps.)
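A quick sketch of enabling that logging (an addition, using the standard logging module) before build_vocab() and train():

import logging

# INFO-level logging shows per-step progress and timestamps during
# vocabulary building and training
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',
                    level=logging.INFO)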
Two possible fixes:
If you're lucky enough to have enough RAM to hold the whole corpus as Python objects, converting it into an in-memory list would ensure it's multiply-iterable:
wiki_data = list(my_create_tagged_docs(data))
But most won't have that much RAM & shouldn't/needn't take that step. Instead, you can define a class for an iterable view on the data, which can return a fresh iterator every time one is needed. There's an example with further explanation in a blog post by the founder of the gensim project at:
https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/
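A minimal sketch of such a multiply-iterable view (an addition with a hypothetical class name, not the blog post's code), assuming the tagging logic from the question and a source such as the in-memory data list that can itself be iterated repeatedly:

class TaggedWikiCorpus:
    def __init__(self, articles):
        self.articles = articles  # e.g. the in-memory `data` list from the question

    def __iter__(self):
        # A fresh generator is created on every call, so the object can be
        # iterated once by build_vocab() and again by every training pass.
        for article in self.articles:
            tokens = [tok for section in article.get('section_texts')
                      for tok in cleanText(section).split()]
            yield TaggedDocument(tokens, [article.get('title')])

wiki_data = TaggedWikiCorpus(data)  # keep `data` around instead of deleting it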

Iterating Over Numpy Array for NLP Application

I'm building a Word2Vec model and have a vocab_list of about 30k words and a list of about 150k sentences (sentences_list). I am trying to remove from the sentences any tokens (words) that aren't included in vocab_list. The task seemed simple, but with the code below the nested for loops and memory reallocation are slow. It took approx. 1 hr to run, so I don't want to repeat it.
Is there a cleaner way to do this?
import numpy as np
import pandas as pd
from datetime import datetime

start = datetime.now()
timing = []
result = []
counter = 0
for sent in sentences_list:
    counter += 1
    if counter % 1000 == 0 or counter == 1:
        print(counter, 'row of', len(sentences_list), ' Elapsed time: ', datetime.now()-start)
        timing.append([counter, datetime.now()-start])
    final_tokens = []
    for token in sent:
        if token in vocab_list:
            final_tokens.append(token)
    #if len(final_tokens)>0:
    result.append(final_tokens)

print(counter, 'row of', len(sentences_list), ' Elapsed time: ', datetime.now()-start)
timing.append([counter, datetime.now()-start])
sentences = result
del result
timing = pd.DataFrame(timing, columns=['Counter', 'Elapsed_Time'])
Note that typical word2vec implementations (like Google's original word2vec.c or gensim Word2Vec) will often just ignore words in their input that aren't part of their established vocabulary (as specified by vocab_list or enforced via a min_count). So you may not need to perform this filtering at all.
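For example (a sketch, not part of the answer above), gensim's Word2Vec can do that filtering itself via min_count:

from gensim.models import Word2Vec

# Words seen fewer than min_count times are dropped from the vocabulary,
# and tokens missing from the vocabulary are simply ignored during training,
# so no manual pre-filtering of sentences_list is needed.
model = Word2Vec(sentences_list, min_count=5, workers=4)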
Using a more-idiomatic Python list-comprehension might be noticeably faster (and would certainly be more compact). Your code could simply be:
filtered_sentences = [
    [word for word in sent if word in vocab_list]
    for sent in sentences_list
]
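One extra detail (an addition, not part of the answer): if vocab_list is a plain list, every `word in vocab_list` test is a linear scan over roughly 30k entries; converting it to a set once makes each membership test roughly constant-time:

vocab_set = set(vocab_list)  # one-time conversion for O(1) lookups
filtered_sentences = [
    [word for word in sent if word in vocab_set]
    for sent in sentences_list
]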

Parallelizing function in a for loop

I have a function that I'd like to parallelize.
import re
import multiprocessing as mp
from pathos.multiprocessing import ProcessingPool as Pool

cores = mp.cpu_count()
# create the multiprocessing pool
pool = Pool(cores)

def clean_preprocess(text):
    """
    Given a string of text, the function:
    1. Removes all punctuation and numbers and converts the text to lower case
    2. Handles the negation words defined above.
    3. Tokenizes words that are more than one character long
    """
    cores = mp.cpu_count()
    pool = Pool(cores)
    lower = re.sub(r'[^a-zA-Z\s\']', "", text).lower()
    lower_neg_handled = n_pattern.sub(lambda x: n_dict[x.group()], lower)
    letters_only = re.sub(r'[^a-zA-Z\s]', "", lower_neg_handled)
    words = [i for i in tok.tokenize(letters_only) if len(i) > 1]  ## parallelize this?
    return (' '.join(words))
I have been reading the documentation on multiprocessing but am still a little confused about how to parallelize my function appropriately. I would be grateful if somebody could point me in the right direction for parallelizing a function like mine.
For your function, you could decide to parallelize by splitting the text into sub-parts, applying the processing to the sub-parts, then joining the results.
Something along the line of:
text0 = text[:len(text) // 2]
text1 = text[len(text) // 2:]
Then apply your processing to these two parts, using:
# here, I suppose that clean_preprocess is the sequential version,
# and we manage the pool outside of it
with Pool(2) as p:
    words0, words1 = p.map(clean_preprocess, [text0, text1])
words = words0 + words1
# or continue with words0 and words1 separately to save the cost of joining
However, your function seems memory-bound, so it won't see a dramatic speedup (typically a factor of 2 is the most we can hope for on standard computers these days); see e.g. How much does parallelization help the performance if the program is memory-bound? or What do the terms "CPU bound" and "I/O bound" mean?
So you could try to split the text into more than 2 parts, but it may not get any faster. You could even see disappointing performance, because splitting the text could be more expensive than processing it.
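A sketch of that N-way split (a generalization of the idea above, assuming clean_preprocess is the sequential version that takes and returns a string, and text is the string to clean; note the naive split can cut a word in half at chunk boundaries):

from multiprocessing import Pool

def split_text(text, n_parts):
    # naive fixed-size split; boundaries may fall inside a word
    step = max(1, len(text) // n_parts)
    return [text[i:i + step] for i in range(0, len(text), step)]

if __name__ == '__main__':
    parts = split_text(text, 4)
    with Pool(4) as p:
        cleaned_parts = p.map(clean_preprocess, parts)
    cleaned = ' '.join(cleaned_parts)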

Python Gensim how to make WMD similarity run faster with multiprocessing

I am trying to run gensim WMD similarity faster. Typically, this is what is in the docs:
Example corpus:
my_corpus = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]
my_query = 'Human and artificial intelligence software programs'
my_tokenized_query =['human','artificial','intelligence','software','programs']
model is a Word2Vec model trained on about 100,000 documents similar to my_corpus:
model = Word2Vec.load(word2vec_model)
from gensim.models import Word2Vec
from gensim.similarities import WmdSimilarity

def init_instance(my_corpus, model, num_best):
    instance = WmdSimilarity(my_corpus, model, num_best=1)
    return instance

instance = init_instance(my_corpus, model, 1)
instance[my_tokenized_query]
The best-matched document is "Human machine interface for lab abc computer applications", which is great.
However, the instance query above takes an extremely long time. So I thought of breaking the corpus up into N parts, running WMD on each with num_best = 1, and at the end taking the part with the max score as the most similar.
import operator
import gensim
from multiprocessing import Process, Queue, Manager

def main(my_query, global_jobs, process_tmp):
    process_query = gensim.utils.simple_preprocess(my_query)

    def worker(num, process_query, return_dict):
        instance = init_instance\
            (my_corpus[num*chunk+1:num*chunk+chunk], model, 1)
        x = instance[process_query][0][0]
        y = instance[process_query][0][1]
        return_dict[x] = y

    manager = Manager()
    return_dict = manager.dict()
    for num in range(num_workers):
        process_tmp = Process(target=worker, args=(num, process_query, return_dict))
        global_jobs.append(process_tmp)
        process_tmp.start()
    for proc in global_jobs:
        proc.join()

    return_dict = dict(return_dict)
    ind = max(return_dict.iteritems(), key=operator.itemgetter(1))[0]
    print corpus[ind]
>>> "Graph minors A survey"
The problem I have with this is that, even though it outputs something, it doesn't give me a good similar document from my corpus, even though it takes the max similarity over all the parts.
Am I doing something wrong?
Comment: chunk is a static variable: e.g. chunk = 600 ...
If you define chunk statically, then you have to compute num_workers:
10001 / 600 = 16.67 → 17 num_workers
It's common to use no more processes than you have cores.
If you have 17 cores, that's ok.
Since the number of cores is fixed, you should instead do:
import os
num_workers = os.cpu_count()
chunk = chunksize(my_corpus, num_workers)
Not the same result, changed to:
#process_query = gensim.utils.simple_preprocess(my_query)
process_query = my_tokenized_query
Every worker returns result indices 0..n relative to its own chunk.
Therefore, return_dict[x] could be overwritten by a later worker that reports the same index with a lower value. The index in return_dict is NOT the same as the index in my_corpus. Changed to:
#return_dict[x] = y
return_dict[(num * chunk) + x] = y
Using +1 when computing the chunk slice will skip the first document of each chunk.
I don't know how you compute chunk; consider this example:
def chunksize(iterable, num_workers):
    c_size, extra = divmod(len(iterable), num_workers)
    if extra:
        c_size += 1
    if len(iterable) == 0:
        c_size = 0
    return c_size

# Usage
chunk = chunksize(my_corpus, num_workers)
...
#my_corpus_chunk = my_corpus[num*chunk+1:num*chunk+chunk]
my_corpus_chunk = my_corpus[num * chunk:(num+1) * chunk]
Results: 10 cycles, Tuple = (Index from worker num=0, Index from worker num=1)
With multiprocessing, with chunk=5:
02,09:(3, 8), 01,03:(3, 5):
System and human system engineering testing of EPS
04,06,07:(0, 8), 05,08:(0, 5), 10:(0, 7):
Human machine interface for lab abc computer applications
Without multiprocessing, with chunk=5:
01:(3, 6), 02:(3, 5), 05,08,10:(3, 7), 07,09:(3, 8):
System and human system engineering testing of EPS
03,04,06:(0, 5):
Human machine interface for lab abc computer applications
Without multiprocessing, without chunking:
01,02,03,04,06,07,08:(3, -1):
System and human system engineering testing of EPS
05,09,10:(0, -1):
Human machine interface for lab abc computer applications
Tested with Python: 3.4.2
Using Python 2.7:
I used threading instead of multi-processing.
In the WMD-Instance creation thread, I do something like this:
wmd_instances = []
if wmd_instance_count > len(wmd_corpus):
    wmd_instance_count = len(wmd_corpus)
chunk_size = int(len(wmd_corpus) / wmd_instance_count)
for i in range(0, wmd_instance_count):
    if i == wmd_instance_count - 1:
        wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:], wmd_model, num_results)
    else:
        wmd_instance = WmdSimilarity(wmd_corpus[i*chunk_size:(i+1)*chunk_size], wmd_model, num_results)
    wmd_instances.append(wmd_instance)
wmd_logic.setWMDInstances(wmd_instances, chunk_size)
'wmd_instance_count' is the number of threads to use for searching. I also remember the chunk size. Then, when I want to search for something, I start 'wmd_instance_count' threads to search, and they return the sims they found:
def perform_query_for_job_on_instance(wmd_logic, wmd_instances, query, jobID, instance):
    wmd_instance = wmd_instances[instance]
    sims = wmd_instance[query]
    wmd_logic.set_mt_thread_result(jobID, instance, sims)
'wmd_logic' is the instance of a class that then does this:
def set_mt_thread_result(self, jobID, instance, sims):
    res = []
    #
    # We need to scale the found ids back to our complete corpus size...
    #
    for sim in sims:
        aSim = (int(sim[0] + (instance * self.chunk_size)), sim[1])
        res.append(aSim)
I know, the code isn't nice, but it works. It uses 'wmd_instance_count' threads to find results, I aggregate them and then choose the top-10 or something like that.
Hope this helps.
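For completeness, the thread-starting step described above but not shown might look roughly like this (a sketch reusing the answer's perform_query_for_job_on_instance, wmd_instances and wmd_logic names):

import threading

def run_query(wmd_logic, wmd_instances, query, jobID):
    threads = []
    # one thread per WMD instance, each searching its own corpus chunk
    for instance in range(len(wmd_instances)):
        t = threading.Thread(
            target=perform_query_for_job_on_instance,
            args=(wmd_logic, wmd_instances, query, jobID, instance))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    # wmd_logic now holds one rescaled result list per instance,
    # ready to be merged and sorted for the overall top hits.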
