Search engine - rank the output by a weighted mechanism - Python

I am trying to build a semantic search FAQ system using Elasticsearch 7.7.0 and Universal Sentence Encoder (USE4) embeddings. So far I have indexed a set of questions and answers, which I am able to search. I run two searches whenever there is input:
keyword search on the indexed data in Elasticsearch
semantic search using USE4 embeddings
Now I want to combine both to give a robust output, because the results from these individual algorithms are sometimes off. Any good suggestions on how I can combine them? I want to use a weighted mechanism that gives more weight to the semantic search, and/or to be able to match the results up again. The question is how to get the best of both. Please advise.
import time
import sys
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import csv
import tensorflow as tf
import tensorflow_hub as hub

def connect2ES():
    # connect to ES on localhost on port 9200
    es = Elasticsearch([{'host': 'localhost', 'port': 9200}])
    if es.ping():
        print('Connected to ES!')
    else:
        print('Could not connect!')
        sys.exit()
    print("*********************************************************************************")
    return es

def keywordSearch(es, q):
    # Search by keywords
    b = {
        'query': {
            'match': {
                "title": q
            }
        }
    }
    res = es.search(index='questions-index_quora2', body=b)
    print("Keyword Search:\n")
    for hit in res['hits']['hits']:
        print(str(hit['_score']) + "\t" + hit['_source']['title'])
    print("*********************************************************************************")
    return

# Search by vector similarity
def sentenceSimilaritybyNN(embed, es, sent):
    query_vector = tf.make_ndarray(tf.make_tensor_proto(embed([sent]))).tolist()[0]
    b = {
        "query": {
            "script_score": {
                "query": {
                    "match_all": {}
                },
                "script": {
                    "source": "cosineSimilarity(params.query_vector, 'title_vector') + 1.0",
                    "params": {"query_vector": query_vector}
                }
            }
        }
    }
    # print(json.dumps(b, indent=4))
    res = es.search(index='questions-index_quora2', body=b)
    print("Semantic Similarity Search:\n")
    for hit in res['hits']['hits']:
        print(str(hit['_score']) + "\t" + hit['_source']['title'])
    print("*********************************************************************************")

if __name__ == "__main__":
    es = connect2ES()
    embed = hub.load("./data/USE4/")  # this is where my USE4 model is saved
    while True:
        query = input("Enter a Query:")
        start = time.time()
        if query == "END":
            break
        print("Query: " + query)
        keywordSearch(es, query)
        sentenceSimilaritybyNN(embed, es, query)
        end = time.time()
        print(end - start)
My output looks like this:
Enter a Query:what can i watch this weekend
Query: what can i watch this weekend
Keyword Search:
9.6698 Where can I watch gonulcelen with english subtitles?
7.114256 What are some good movies to watch?
6.3105774 What kind of animal did this?
6.2754908 What are some must watch TV shows before you die?
6.0294256 What is the painting on this image?
6.0294256 What the meaning of this all life?
6.0294256 What are your comments on this picture?
5.9638205 Which is better GTA5 or Watch Dogs?
5.9269657 Can somebody explain to me how to do this problem with steps?
*********************************************************************************
Semantic Similarity Search:
1.6078881 What are some good movies to watch?
1.5065247 What are some must watch TV shows before you die?
1.502714 What are some movies that everyone needs to watch at least once in life?
1.4787409 Where can I watch gonulcelen with english subtitles?
1.4713362 What are the best things to do on Halloween?
1.4669418 Which are the best movies of 2016?
1.4554278 What are some interesting things to do when bored?
1.4307204 How can I improve my skills?
1.4261798 What are the best films that take place in one room?
1.4175651 What are the best things to learn in life?
*********************************************************************************
0.05920886993408203
I want a single output that is based on both of these, where we can get more accurate results and rank them accordingly. Please advise, or point me to some good practices around this. Thanks in advance.
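One common way to combine the two result lists, sketched below under the assumption that keywordSearch and sentenceSimilaritybyNN are refactored to return res['hits']['hits'] instead of printing: min-max normalize each score list so the BM25 and cosine scores sit on comparable scales, then blend them with a weight that favors the semantic score. The 0.7/0.3 weights are arbitrary placeholders to tune.

def normalize(hits):
    # min-max normalize Elasticsearch scores so keyword (BM25) and semantic (cosine) scores are comparable
    scores = [h['_score'] for h in hits]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0
    return {h['_source']['title']: (h['_score'] - lo) / span for h in hits}

def combinedSearch(es, embed, q, w_semantic=0.7, w_keyword=0.3):
    # assumes keywordSearch / sentenceSimilaritybyNN return res['hits']['hits'] instead of printing
    kw = normalize(keywordSearch(es, q))
    sem = normalize(sentenceSimilaritybyNN(embed, es, q))
    combined = {}
    for title in set(kw) | set(sem):
        combined[title] = w_keyword * kw.get(title, 0.0) + w_semantic * sem.get(title, 0.0)
    for title, score in sorted(combined.items(), key=lambda x: x[1], reverse=True):
        print(f"{score:.4f}\t{title}")

Reciprocal rank fusion (scoring each title by the sum of 1/(k + rank) over the two lists, with k around 60) is another common way to merge the rankings without worrying about score scales at all.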

Related

How do I implement a model that finds a correlation (not similarity) between query and target sentence?

When building an NLP model (I'm going to use an attention-based one), how can we implement one for finding the correlation, not similarity, between the query and target sentences?
For instance, the two sentences "I am an environmentalist." and "We should raise the gas price, ban combustion-engine vehicles, and promote better public transit." are somehow similar and positively correlated. However, if the first sentence becomes "I am not an environmentalist.", the two sentences are still similar but now negatively correlated.
from sentence_transformers import SentenceTransformer, util
query = ["I am an environmentalist.",
"I am not an environmentalist.",
"I am a tech-savvy person."]
target = ["We should raise the gas price, ban combustion-engine vehicles, and promote better public transit."]
embedder = SentenceTransformer('distilbert-base-nli-stsb-mean-tokens')
query_embedding = embedder.encode(query, convert_to_tensor=True)
target_embedding = embedder.encode(target, convert_to_tensor=True)
searched = util.semantic_search(query_embedding, target_embedding)
print(searched)
# [
# [{'corpus_id': 0, 'score': 0.30188844}],
# [{'corpus_id': 0, 'score': 0.22667089}],
# [{'corpus_id': 0, 'score': 0.05061193}]
# ]
Are there any useful resources or information about this difference and/or finding the correlation by a model? I'm still new to the field of NLP (I have used the sentence transformer for some of my projects) so maybe I simply didn't do a good search on the web.
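What the question describes (negation flipping the relationship while surface similarity stays high) is close to what natural language inference (NLI) models are trained for: classifying a sentence pair as entailment, contradiction, or neutral. Below is a hedged sketch using sentence-transformers' CrossEncoder; the checkpoint name and the label order are assumptions to verify against the model card.

from sentence_transformers import CrossEncoder

# assumed example checkpoint; any NLI cross-encoder should behave similarly
model = CrossEncoder('cross-encoder/nli-deberta-base')

target = "We should raise the gas price, ban combustion-engine vehicles, and promote better public transit."
pairs = [("I am an environmentalist.", target),
         ("I am not an environmentalist.", target),
         ("I am a tech-savvy person.", target)]

scores = model.predict(pairs)  # one row of class logits per pair
labels = ['contradiction', 'entailment', 'neutral']  # typical label order; check the model card
for (premise, _), row in zip(pairs, scores):
    print(labels[row.argmax()], "|", premise)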

Python sklearn TfidfVectorizer: Vectorize documents ahead of query for semantic search

I want to run semantic search using TF-IDF.
This code works, but it is really slow when used on a large corpus of documents:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

search_terms = "my query"
documents = ["my", "list", "of", "docs"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform([search_terms] + documents)
cosine_similarities = linear_kernel(doc_vectors[0:1], doc_vectors).flatten()
document_scores = [item.item() for item in cosine_similarities[1:]]
It seems quite inefficient:
Every new search query triggers a re-vectorizing of the entire corpus.
I am wondering how I can do the bulk work of vectorizing my corpus ahead of time, saving the result in an "index file", so that when I run a query, the only thing left to do is vectorize the few words of the query and then calculate similarity.
I tried vectorizing query and documents separately:
vec_docs = vectorizer.fit_transform(documents)
vec_query = vectorizer.fit_transform([search_terms])
cosine_similarities = linear_kernel(vec_query, vec_docs).flatten()
But it gives me this error:
ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 3 while Y.shape[1] == 260541
How can I run the corpus vectorization ahead of time without knowing what the query will be?
My main goal is to get blazing fast results even with a large corpus of documents (say, a few GB worth of text), even on a low-powered server, by doing the bulk of the data-crunching ahead of time.
TF/IDF vectors are high-dimensional and sparse. The basic data structure that supports that is an inverted index. You can either implement it yourself or use a standard index (e.g., Lucene).
Nevertheless, if you would like to experiment with modern deep-neural-based vector representations, check out the following semantic search demo. It uses a similarity search service that can handle billions of vectors.
(Note, I am a co-author of this demo.)
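For intuition, an inverted index is just a mapping from each term to the set of documents containing it; here is a minimal toy sketch (real engines like Lucene add scoring, compression, and skip lists on top):

from collections import defaultdict

docs = ["my list of docs", "another doc about lists"]

# build: term -> set of document ids containing that term
index = defaultdict(set)
for doc_id, text in enumerate(docs):
    for term in text.lower().split():
        index[term].add(doc_id)

# query: intersect the postings of the query terms
def search(query):
    postings = [index[t] for t in query.lower().split() if t in index]
    return set.intersection(*postings) if postings else set()

print(search("my docs"))  # {0}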
You almost have it right.
In this instance, you can get away with fitting (and transforming) your documents once and only transforming your search terms afterwards. Here is your code, modified accordingly, using the twenty_newsgroups documents (about 11k of them) in place of your toy list. You can run it as a script and interactively verify that you get fast results:
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

news = fetch_20newsgroups()

search_terms = "my query"
# documents = ["my", "list", "of", "docs"]
documents = news.data

vectorizer = TfidfVectorizer()
# fit_transform does two things: fits the vectorizer and transforms documents
doc_vectors = vectorizer.fit_transform(documents)
# the vectorizer is already fit; just transform search_terms via vectorizer
search_term_vector = vectorizer.transform([search_terms])
cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()

if __name__ == "__main__":
    while True:
        query_str = input("\n\n\n\nquery string (return to quit): ")
        if not query_str:
            print("bye!")
            break
        search_term_vector = vectorizer.transform([query_str])
        cosine_similarities = linear_kernel(doc_vectors, search_term_vector).flatten()
        best_idx = np.argmax(cosine_similarities)
        best_score = cosine_similarities[best_idx]
        best_doc = documents[best_idx]
        if best_score < 0.1:
            print("no good matches")
        else:
            print(f"Best match ({round(best_score, 4)}):\n\n", best_doc[0:200] + "...")
Example output:
query string (return to quit): protocol
Best match 0.239 (0.014 sec):
From: ethan#cs.columbia.edu (Ethan Solomita)
Subject: Re: X protocol packet type
Article-I.D.: cs.C52I2q.IFJ
Organization: Columbia University Department of Computer Science
Lines: 7
In article <9309...
Note: this algorithm finds the best match(es) in at best O(n_documents) time, compared to Lucene (which powers Elasticsearch), which uses skip lists and can search in O(log(n_documents)). Production search engines also have quite a bit of tuning to optimize performance. The above could be useful with some tweaking but isn't going to topple Google tomorrow :)
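To address the "index file" part of the question explicitly: the fitted vectorizer and the precomputed document matrix can be persisted once and reloaded at query time, so nothing is re-fit per query. A minimal sketch, continuing from the vectorizer and doc_vectors defined above and assuming joblib and scipy are installed (the file names are arbitrary):

import joblib
from scipy import sparse
from sklearn.metrics.pairwise import linear_kernel

# one-time offline step: save the fitted vectorizer and the document matrix
joblib.dump(vectorizer, "tfidf_vectorizer.joblib")
sparse.save_npz("doc_vectors.npz", doc_vectors)

# at query time: load the "index" instead of re-fitting the corpus
vectorizer = joblib.load("tfidf_vectorizer.joblib")
doc_vectors = sparse.load_npz("doc_vectors.npz")
query_vec = vectorizer.transform(["my query"])
scores = linear_kernel(doc_vectors, query_vec).flatten()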

Speeding up a comparison function for comparing sentences

I have a data frame with a shape of (789174, 9). There is a column called resolution that contains a sentence of fewer than 139 characters. I built a function to find sentences that have a similarity score above 0.9, using the difflib library. I have a virtual machine with 96 CPUs and 384 GB of RAM. I have been running this function for more than 2 hours now and it still has not gotten past i = 1000. I am concerned that this will take too long to process and I am wondering if there is a way to speed it up.
def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(len(input_list)):
            if i < j and difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)
    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]
    return mapping
Clearly since we are iterating twice through the column it is O(n^2). I am not sure if there is a way to make this faster. Any suggestions would be greatly appreciated.
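To make that concrete, here is a quick back-of-the-envelope check; the 100,000 calls per second throughput is an assumed, optimistic figure for SequenceMatcher.ratio() on short strings:

n = 789_174                        # rows in the dataframe
pairs = n * (n - 1) // 2           # unique (i, j) comparisons, roughly 3.1e11
seconds = pairs / 100_000          # assumed throughput of 100k ratio() calls per second
print(pairs, seconds / 86_400, "days")   # on the order of 36 days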
EDIT:
I have attempted a speed-up using difflib and fuzzywuzzy. The function only goes through the column once, but I do iterate through the dictionary keys.
def cluster_resolution(df):
    clusters = {}
    for string in df['resolution_modified'].unique():
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:
            clusters[string] = [ string ]
            for m in clusters.keys():
                match2 = fuzz.partial_ratio(string, m)
                if match2 >= 90:
                    clusters[m].append(string)
    return clusters

mappings = cluster_resolution(df_sample)
Is it possible to speed up the latter function?
Here is an example of some data in a dataframe
d = {'resolution' : ['replaced scanner', 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use', 'tc reimage', 'updated pc', 'deploying replacement scanner', 'upgraded and rebooted station', 'printer has been reconfigured', 'cleared linux print queue and now it is working','user reset her password successfully closing tt','have reset the printer to get it to print again','i plugged usb cable into port and scanner works','reconfigured hand scanner and linked to station','replaced the scanner with station is functional','laptops battery needed to be reset asset serial','reconfigured scanner confirmed that it scans as intended','reimaging laptop corrected the anyconnect software issue','printer was unplugged from usb port working properly now','reconnected usb cable and reassign printer ports on port','reconfigured scanner to base and tested with aa all fine','replaced the defective device with a fresh imaged laptop','reconfigured the printer and the media to print properly','tested printer at station connected and working resolved','red scanner reconfigured and base rebooted via usb joint','station scanner was synced to base and station and is now working','printer offlineswitched usb portprinter is now online and working','replaced the barcode label with one reflecting the tcs ip address','restarted the thin client by using ssh to run the restart command','printer reconfigured and test they are functioning normally again','removed old printer for service installed replacement tested good','tc required reboot rebooted tc had aa signin dp is now functional','resetting the printer to factory settings and then reconfigure it','updated windows os forced update and the laptop operated normally','printer settings are set correct and printer is working correctly','power to printer was disconnected reconnected and is working fine','power cycled equipment and restocked spooler with plastic bubbles','laptop checked ive logged into paskiplacowepl without any problem','reseated scanner cables connection into usb port to resolve issue','the scanner has been replaced and the station is working well now']}
df = pd.DataFrame(data=d)
How I define similarity:
Similarity is really defined by the overall action taken, for example 'replaced scanner' versus 'replaced the scanner for the user with a properly working one from the cage replaced the wire on the damaged one and stored it for later use'. The longer string's overall action was replacing the scanner, so those two are very similar, which is why I chose to use the partial_ratio function, since that pair scores 100.
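For reference, this is what that choice looks like with fuzzywuzzy (a minimal sketch; partial_ratio scores the shorter string against its best-aligned window in the longer one, so a shared action phrase scores high even when the lengths differ a lot):

from fuzzywuzzy import fuzz

short_text = "replaced scanner"
long_text = ("replaced the scanner for the user with a properly working one from the cage "
             "replaced the wire on the damaged one and stored it for later use")

print(fuzz.partial_ratio(short_text, long_text))  # high: the action phrase aligns well
print(fuzz.ratio(short_text, long_text))          # much lower: penalizes the length difference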
Attention:
Please refer to the second function, cluster_resolution, as this is the function I would like to speed up. The first function is not going to be useful.
Regarding your last edit, I'd make a few changes (mainly using fuzzywuzzy.process rather than fuzzywuzzy.fuzz):
from fuzzywuzzy import process

def cluster_resolution(df):
    clusters = {}
    for string in df['resolution'].unique():
        match1 = difflib.get_close_matches(string, clusters.keys(), cutoff=0.9)
        if match1:
            for m in match1:
                clusters[m].append(string)
        else:
            bests = process.extractBests(
                string,
                set(clusters.keys()) - {string},
                scorer=fuzz.partial_ratio,
                score_cutoff=80,
                limit=1
            )
            if bests:
                clusters[bests[0][0]].append(string)
            else:
                clusters[string] = [string]
But I think you could look more into other solutions, like CountVectorizer and whatever metric suits it. It is a way to gain speed (as it is vectorized), though the results may be imperfect. Note that CountVectorizer could be a good solution for you, as you have already made the choice of partial_ratio.
For example, something like this:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan

df = pd.DataFrame(d)

cv = CountVectorizer(stop_words="english")
transformed = cv.fit_transform(df['resolution'])
transformed = pd.DataFrame(
    transformed.toarray(),
    columns=cv.get_feature_names(),
    index=df['resolution'])

# keep only terms that appear more than twice across the corpus
transformed = transformed[transformed.columns[transformed.sum() > 2]]

# compute the distance matrix
d = pdist(transformed, metric="hamming") * transformed.shape[1]
s = squareform(d)

clusterer = hdbscan.HDBSCAN(metric='precomputed', min_cluster_size=2)
clusterer.fit_predict(s)
df['labels'] = clusterer.labels_

print(df.sort_values('labels'))
I think this is still perfectible (this is my first shot at text clustering...). You could also add your own list of stopwords for CountVectorizer which would be a way to help the algorithm. At the very least, it could help you pre-cluster your dataset before using your previous function, this way for instance :
df.groupby('labels')['resolution'].apply(cluster_resolution)
(That way if your first clustering is roughly ok, you would only check each value against all other values in the cluster, not all values).
Credits to @anon01 for the computation of the distance matrix in this answer, which seems to give slightly better results than the default of hdbscan.
Edit:
Another try, including:
a change of metric,
adding a step with a TF-IDF model,
and adding a step to lemmatize the words using the nltk package.
So this would be:
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from scipy.spatial.distance import pdist, squareform
import pandas as pd
import hdbscan
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

d = {...}
df = pd.DataFrame(d)

lemmatizer = WordNetLemmatizer()

def lemmatization(sentence):
    tag_dict = {
        "J": wordnet.ADJ,
        "N": wordnet.NOUN,
        "V": wordnet.VERB,
        "R": wordnet.ADV,
    }
    # Tokenize the sentence
    wordsList = nltk.word_tokenize(sentence)
    # Find the POS tag for each token
    tagged = nltk.pos_tag(wordsList)
    # Convert the list of (token, tag) pairs into lemmatized tokens
    lems = [
        lemmatizer.lemmatize(token, tag_dict.get(tag[0], wordnet.NOUN))
        for token, tag in tagged
    ]
    lems = ' '.join(lems)
    return lems

df['lemmatized'] = df['resolution'].apply(lemmatization)
corpus = df['lemmatized']

pipe = Pipeline(
    [
        ('cv', CountVectorizer(stop_words="english")),
        ('tfid', TfidfTransformer())
    ]).fit(corpus)
transformed = pipe.transform(corpus)
transformed = pd.DataFrame(
    transformed.toarray(),
    columns=pipe.named_steps['cv'].get_feature_names(),
    index=df['resolution'])

d = pdist(transformed, metric="cosine") * transformed.shape[1]
s = squareform(d)

clusterer = hdbscan.HDBSCAN(metric="precomputed", min_cluster_size=2)
clusterer.fit_predict(s)
df['labels'] = clusterer.labels_

print(df.sort_values('labels'))
You could also add some specific code, as your sample seems to be about very specific maintenance logs.
For instance, you could add new features to the transformed dataframe based on a small list of hardware/software terms:
# To create a feature about OS:
cols = ['os', 'linux', 'window']
transformed[cols[0]] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))
# To create a feature about hardware:
cols = ["laptop", "printer", "scanner"]
transformed["hardware"] = np.ceil(transformed[[x for x in cols if x in transformed.columns]].sum(axis=1))
This step could help give better results but may not be necessary. I'm not sure how it will compare to FuzzyWuzzy's performance for matching strings, but I will be interested in your feedback!
def replace_similars(input_list):
    # Replaces strings that are 90% or more similar
    start_time = time.time()
    for i in range(len(input_list)):
        if i % 1000 == 0:
            print(f'time = {time.time()-start_time:.2f} - index = {i}')
        for j in range(i + 1, len(input_list)):
            # only compare strings whose lengths differ by less than 15 characters
            if -15 < len(input_list[i]) - len(input_list[j]) < 15:
                if difflib.SequenceMatcher(None, input_list[i], input_list[j]).ratio() >= 0.9:
                    input_list[j] = input_list[i]

def generate_mapping(input_list):
    new_list = input_list[:]  # copy list
    replace_similars(new_list)
    mapping = {}
    for i in range(len(input_list)):
        mapping[input_list[i]] = new_list[i]
    return mapping
Even though this might not be a practical solution either, since it would still take about 90 years if every iteration takes 0.1 s, it is a much more optimised version.

How can I use regular expressions in my vocabulary for CountVectorizer?

How do I make "First word in the doc was [target word]" a feature?
Consider these two sentences:
example = ["At the moment, my girlfriend is Jenny. She is working as an artist at the moment.",
"My girlfriend is Susie. She is working as an accountant at the moment."]
If I were trying to measure relationship commitment, I'd want to be able to treat the phrase "at the moment" as a feature only when it shows up at the beginning like that.
I would love to be able to use regexes in the vocabulary...
phrases = ["^at the moment", 'work']
vect = CountVectorizer(vocabulary=phrases, ngram_range=(1, 3), token_pattern=r'\w{1,}')
dtm = vect.fit_transform(example)
But that doesn't seem to work.
I have also tried this, but get a 'vocabulary is empty' error...
CountVectorizer(token_pattern = r"(?u)^currently")
What's the right way to do this? Do I need a custom vectorizer? Any simple tutorials you can link me to? This is my first sklearn project, and I've been Googling this for hours. Any help much appreciated!
OK I think I've figured out a way, based on hacking the get_tweet_length() function in this tutorial...
https://ryan-cranfill.github.io/sentiment-pipeline-sklearn-4/
I added this function...
import re

def first_words(text):
    matchesList = re.findall('^at the moment', text, re.I)
    if len(matchesList) > 0:
        return 1
    else:
        return 0
And used it with the sklearn_helper pipelinize_feature() function, which converts the output into the array format expected by sklearn's FeatureUnion.
vect4 = pipelinize_feature(first_words, active=True)
I can then use this along with my normal CountVectorizers via FeatureUnion
unionObj = FeatureUnion([
    ('vect1', vect1),
    ('vect2', vect2),
    ('vect4', vect4)
])
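A self-contained alternative to the external sklearn_helper module, in case that is useful: wrap the regex check in sklearn's FunctionTransformer and union it with an ordinary CountVectorizer. This is a sketch; the phrase and the feature name are placeholders.

import re
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

def starts_with_phrase(docs, phrase=r"^at the moment"):
    # one binary feature per document: 1 if the doc starts with the phrase
    return np.array([[1 if re.search(phrase, d, re.I) else 0] for d in docs])

example = ["At the moment, my girlfriend is Jenny. She is working as an artist at the moment.",
           "My girlfriend is Susie. She is working as an accountant at the moment."]

union = FeatureUnion([
    ("bow", CountVectorizer(ngram_range=(1, 3))),
    ("starts_at_the_moment", FunctionTransformer(starts_with_phrase)),
])

dtm = union.fit_transform(example)
print(dtm.shape)  # bag-of-words columns plus the one custom binary column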

python googlemaps all possible distances between different locations

schools = ['GSGS','GSGL','JKG','JMG','MCGD','MANGD','SLSA','WHGR','WOG','GCG','LP',
           'PGG', 'WVSG', 'ASGE','CZG', 'EAG','GI']

for i in range(1, 17):
    gmaps = googlemaps.Client(key='')
    distances = gmaps.distance_matrix((GSGS), (schools), mode="driving")['rows'][0]['elements'][0]['distance']['text']
    print(distances)
The elements of the list are schools. I didn't want to make the list too long, so I used these abbreviations.
I want to get all the distances between "GSGS" and the schools in the list. I don't know what to write inside the second bracket.
distances = gmaps.distance_matrix((GSGS), (schools)
If I run it like that, it outputs this error:
Traceback (most recent call last):
File "C:/Users/helpmecoding/PycharmProjects/untitled/distance.py", line 31, in
<module>
distances = gmaps.distance_matrix((GSGS), (schools), mode="driving")['rows'][0]['elements'][0]['distance']['text']
KeyError: 'distance'
I could do it one by one, but that's not what I want. If I write another school from the list schools and delete the for loop, it works fine.
I know I have to do a loop so that it cycles through the list, but I don't know how to do it. Behind every variable, for example "GSGS", is the address/location of the school.
I deleted the API key just for safety.
My dad helped me and we solved the problem. Now I have what I want :) Next I have to build a list with all distances between the schools, and once I have that I have to use Dijkstra's algorithm to find the shortest route between them. Thanks for helping!
import googlemaps

GSGS = (address)
GSGL = (address)
. . .
. . .
. . .

schools = (GSGS, GSGL, JKG, JMG, MCGD, MANGD, SLSA, WHGR, WOG, GCG, LP, PGG, WVSG, ASGE, CZG, EAG, GI)
school_names = ("GSGS", "GSGL", "JKG", "JMG", "MCGD", "MANGD", "SLSA", "WHGR", "WOG", "GCG", "LP", "PGG", "WVSG", "ASGE", "CZG", "EAG", "GI")
school_distances = ()

for g in range(0, len(schools)):
    n = 0
    for i in schools:
        gmaps = googlemaps.Client(key='TOPSECRET')
        distances = gmaps.distance_matrix(schools[g], i)['rows'][0]['elements'][0]['distance']['text']
        if school_names[g] != school_names[n]:
            print(school_names[g] + " - " + school_names[n] + " " + distances)
        else:
            print(school_names[g] + " - " + school_names[n] + " " + "0 km")
        n = n + 1
In my experience, it is sometimes difficult to know what is going on when you use a third-party API. Though I am not a proponent of reinventing the wheel, sometimes it is necessary to get a full picture of what is going on. So I recommend taking a shot at building your own request to the API endpoint and seeing if that works.
import requests

schools = ['GSGS','GSGL','JKG','JMG','MCGD','MANGD','SLSA','WHGR','WOG','GCG','LP','PGG', 'WVSG', 'ASGE','CZG', 'EAG','GI']

def gmap_dist(apikey, origins, destinations, **kwargs):
    units = kwargs.get("units", "imperial")
    mode = kwargs.get("mode", "driving")

    baseurl = "https://maps.googleapis.com/maps/api/distancematrix/json?"
    urlargs = {"key": apikey, "units": units, "origins": origins, "destinations": destinations, "mode": mode}

    req = requests.get(baseurl, params=urlargs)
    data = req.json()
    print(data)

    # do this for each key and index pair until you
    # find the one causing the problem if it
    # is not immediately evident from the whole data print
    print(data["rows"])
    print(data["rows"][0])

    # Check if there are elements
    try:
        distances = data['rows'][0]['elements'][0]['distance']
    except KeyError:
        raise KeyError("No elements found")
    except IndexError:
        raise IndexError("API Request Error. No response returned")
    else:
        return distances
Also, as a general rule of thumb, it is good to have a test case to make sure things are working as they should before running the whole list:
# test case
try:
    test = gmap_dist(apikey="", units="imperial", origins="GSGS", destinations="GSGL", mode="driving")
except Exception as err:
    raise Exception(err)
else:
    dists = gmap_dist(apikey="", units="imperial", origins="GSGS", destinations=schools, mode="driving")
    print(dists)
Lastly, if you are testing the distance from "GSGS" to the other schools, you might want to take it out of your list of schools, as the distance will be 0.
Now, I suspect the reason you are getting this exception is that no JSON elements are returned, probably because one of your parameters was improperly formatted.
If this function still returns a KeyError, check the address spelling and make sure your API key is valid, although if it were the API key I would expect them not to bother returning even empty results.
Hope this helps. Comment if it doesn't work.
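As a side note on the googlemaps client itself: distance_matrix accepts lists of origins and destinations, so the whole school-to-school matrix can be fetched in far fewer calls than one request per pair. A hedged sketch (the key and addresses are placeholders, and the API caps the number of elements per request, so large lists need to be chunked):

import googlemaps

gmaps = googlemaps.Client(key='YOUR_KEY')  # placeholder key

# addresses (or lat/lng) for each school; placeholders here
schools = {"GSGS": "address1", "GSGL": "address2", "JKG": "address3"}

names = list(schools)
result = gmaps.distance_matrix(origins=[schools[n] for n in names],
                               destinations=[schools[n] for n in names],
                               mode="driving")

# result['rows'][i]['elements'][j] holds the origin-i -> destination-j entry
for i, row in enumerate(result['rows']):
    for j, element in enumerate(row['elements']):
        if i != j and element.get('status') == 'OK':
            print(names[i], "-", names[j], element['distance']['text'])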
