Given a string S of variable length and a dictionary D of n-grams N, I want to:
1. extract all N in S that match, using fuzzy matching logic (to catch spelling errors)
2. extract all numbers in S
3. show the results in the same order as they appear in S
I accomplished points 1 and 2, but my approach, based on creating n-grams from S and fuzzy matching them against the dictionary (plus matching of numbers), does not maintain the order in which the items appear in S.
from nltk import everygrams
import re

string = "Hello everybody, today we have 2.000 cell phones here"
ngrams = list(everygrams(string.split(), 1, 4))

my_dict = {
    "brand": "ITEM_01",
    "model": "ITEM_02",
    "cell phone": "ITEM_04",
    "today": "ITEM_05"
}

results = []            # list with final results
d = FuzzyDict(my_dict)  # create the dictionary for fuzzy matching
                        # (FuzzyDict is the fuzzy-matching dict wrapper used in my project; import it accordingly)

for k in ngrams:
    candidate = ' '.join(k)
    print(f"Searching for {candidate}")
    try:
        # matching n-gram in dictionary using fuzzy match
        result = d[candidate]
        print(f"Found {result}")
        results.append(result)
    except Exception:
        # no fuzzy match for this n-gram
        print("An exception occurred")
    # matching complex numbers
    numbers = re.findall(r'(?:[+-]|\()?\$?\d+(?:,\d+)*(?:\.\d+)?\)?', candidate)
    # appending numbers to list
    results.extend(numbers)

# NOTE: chronological order is not kept!
# keeping unique values, since this approach extracts several instances of the same item
results_unique = list(set(results))
This should give me "ITEM_05 2.000 ITEM_04", but currently the order is arbitrary.
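For reference, one way to keep the order (a sketch, assuming the FuzzyDict lookup above raises an exception on a miss) is to generate the n-grams together with their starting token index, remember the first position of each extracted value, and sort on it:

tokens = string.split()
positions = {}                      # extracted value -> first token index where it was found

for start in range(len(tokens)):
    for size in range(1, 5):
        candidate = ' '.join(tokens[start:start + size])
        try:
            value = d[candidate]    # fuzzy dictionary lookup, as above
            positions.setdefault(value, start)
        except Exception:
            pass
    for number in re.findall(r'(?:[+-]|\()?\$?\d+(?:,\d+)*(?:\.\d+)?\)?', tokens[start]):
        positions.setdefault(number, start)

ordered = sorted(positions, key=positions.get)
print(' '.join(ordered))            # e.g. "ITEM_05 2.000 ITEM_04"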
Related
I have a pandas data frame that contains two columns named Potential Word and Fixed Word. The Potential Word column contains words of different languages, including misspelled and correct words, and the Fixed Word column contains the correct words corresponding to the Potential Word.
Below I have shared some sample data:
Potential Word | Fixed Word
Exemple        | Example
pipol          | People
pimple         | Pimple
Iunik          | unique
My vocab DataFrame contains 600K unique rows.
My Solution:
key = given_word
glob_match_value = 0
potential_fixed_word = ''
match_threshold = 0.65

for each in df['Potential Word']:
    match_value = match(each, key)  # match is a function that returns a
                                    # similarity value of two strings
    if match_value > glob_match_value and match_value > match_threshold:
        glob_match_value = match_value
        potential_fixed_word = each
Problem
The problem with my code is that it takes a lot of time to fix every word, because the loop runs through the large vocab list. When a word is missing from the vocab, it takes almost 5 or 6 seconds to solve a sentence of 10~12 words. The match function performs decently, so it is not the target of the optimization.
I need an optimized solution; please help me here.
From the perspective of Information Retrieval (IR), you need to reduce the search space. Matching the given_word (as key) against all Potential Words is definitely inefficient.
Instead, you need to match against a reasonable number of candidates.
To find such candidates, you need to index the Potential Words and Fixed Words.
import os

from whoosh.analysis import StandardAnalyzer
from whoosh.fields import Schema, TEXT
from whoosh.index import create_in

os.makedirs("indexdir", exist_ok=True)  # create_in expects the index directory to exist

ix = create_in("indexdir", Schema(
    potential=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True),
    fixed=TEXT(analyzer=StandardAnalyzer(stoplist=None), stored=True)
))

writer = ix.writer()
# each potential word is indexed character by character, so the analyzer treats every
# letter as a token and partially overlapping words can still be retrieved
writer.add_document(potential='E x e m p l e', fixed='Example')
writer.add_document(potential='p i p o l', fixed='People')
writer.add_document(potential='p i m p l e', fixed='Pimple')
writer.add_document(potential='l u n i k', fixed='unique')
writer.commit()
With this index, you can search some candidates.
from whoosh.qparser import SimpleParser

with ix.searcher() as searcher:
    results = searcher.search(SimpleParser('potential', ix.schema).parse('p i p o l'))
    for result in results[:2]:
        print(result)
The output is
<Hit {'fixed': 'People', 'potential': 'p i p o l'}>
<Hit {'fixed': 'Pimple', 'potential': 'p i m p l e'}>
Now you can match the given_word against only a few candidates, instead of all 600K.
It is not perfect; however, this is an inevitable trade-off, and it is essentially how IR works. Try this with different numbers of candidates.
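As a sketch of the remaining step, assuming the match similarity function from the question and the ix index built above (correct, threshold and limit are illustrative names):

from whoosh.qparser import SimpleParser

def correct(given_word, match, threshold=0.65, limit=10):
    # fuzzy-match given_word against only the few retrieved candidates
    spaced = ' '.join(given_word)  # same character-spaced form used when indexing
    best, best_score = '', 0
    with ix.searcher() as searcher:
        hits = searcher.search(SimpleParser('potential', ix.schema).parse(spaced), limit=limit)
        for hit in hits:
            score = match(hit['potential'].replace(' ', ''), given_word)
            if score > best_score and score > threshold:
                best, best_score = hit['fixed'], score
    return best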
Without changing much of your implementation (I believe iterating over the list of potential words for each word is somewhat unavoidable), my intention here is not to optimize the match function itself but to leverage multiple threads to search in parallel.
import concurrent.futures
import time
from concurrent.futures.thread import ThreadPoolExecutor
from typing import Any, Union, Iterator

import pandas as pd

# Replace your dataframe here for testing this
df = pd.DataFrame({'Potential Word': ["a", "b", "c"], "Fixed Word": ["a", "c", "b"]})

# Replace by your match function
def match(w1, w2):
    # Simulate some work is happening here
    time.sleep(1)
    return 1

# This is mostly your function itself
# Using index to recreate the sentence from the returned values
def matcher(idx, given_word):
    key = given_word
    glob_match_value = 0
    potential_fixed_word = ''
    match_threshold = 0.65
    for each in df['Potential Word']:
        match_value = match(each, key)  # match is a function that returns a
                                        # similarity value of two strings
        if match_value > glob_match_value and match_value > match_threshold:
            glob_match_value = match_value
            potential_fixed_word = each
    # potential_fixed_word stays '' (the default case) when nothing clears the
    # threshold; you might want to change this
    return idx, potential_fixed_word

sentence = "match is a function that returns a similarity value of two strings match is a function that returns a " \
           "similarity value of two strings"

start = time.time()

# Using a threadpool executor
# You can increase or decrease the max_workers based on your machine
executor: Union[ThreadPoolExecutor, Any]
with concurrent.futures.ThreadPoolExecutor(max_workers=24) as executor:
    futures: Iterator[Union[str, Any]] = executor.map(matcher, list(range(len(sentence.split()))), sentence.split())

# Joining back the input sentence
out_sentence = " ".join(x[1] for x in sorted(futures, key=lambda x: x[0]))
print(out_sentence)
print(time.time() - start)
Please note that the runtime of this will depend upon
The maximum time taken by a single match call
Number of words in the sentence
Number of worker threads (TIP: Try to see if you can use as many as the number of words in the sentence)
I would use the sortedcollections module. In general, the access time to a SortedList or SortedDict is O(log n) instead of O(n); in your case that is about 19.2 (log2 of 600,000) if/then checks vs 600,000 if/then checks.
from sortedcollections import SortedDict
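For illustration, here is a sketch of how a sorted mapping can narrow the candidates with a prefix range lookup; it uses sortedcontainers (the package sortedcollections is built on), a toy vocab, and an illustrative candidates() helper. Note this only helps when the misspelling keeps the first characters intact:

from sortedcontainers import SortedDict  # sortedcollections is built on this package

# toy vocab: Potential Word -> Fixed Word, kept in sorted key order
vocab = SortedDict({'exemple': 'Example', 'pipol': 'People',
                    'pimple': 'Pimple', 'iunik': 'unique'})

def candidates(word, prefix_len=2):
    # O(log n) seek to the entries sharing a short prefix with word
    prefix = word[:prefix_len].lower()
    return [(k, vocab[k]) for k in vocab.irange(prefix, prefix + '\uffff')]

print(candidates('pimple'))  # [('pimple', 'Pimple'), ('pipol', 'People')]

The handful of returned candidates can then be passed to the match function instead of the full 600K list.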
I am trying to count unigram, bi- and trigram strings, some of which are formed from combinations of the smaller ones. Is there any way to count them individually, without counting the smaller ones when they are part of the larger ones?
text = "the log user should able to identify log entries and domain log entries"
ngramList = ['log', 'log entries','domain log entries']
import re

counts = {}
for ngram in ngramList:
    words = ngram.rsplit()
    pattern = re.compile(r'\s+'.join(words), re.IGNORECASE)
    counts[ngram] = len(pattern.findall(text))

print(counts)
current program output: {'log': 3, 'log entries': 2, 'domain log entries': 1}
expected output: {'log': 1, 'log entries': 1, 'domain log entries': 1}
You can first sort the n-gram list by size and then use re.subn to substitute each n-gram (from largest to smallest) with an empty string, counting the number of substitutions at the same time.
Because you go from larger to smaller n-grams, you ensure that the smaller ones don't get counted 'as part of the larger ones', since those have already been removed from the string earlier in the loop.
import re

s = "the log user should able to identify log entries and domain log entries"
ngramList = ['log', 'log entries', 'domain log entries']
ngramList.sort(key=len, reverse=True)

counts = {}
for ngram in ngramList:
    words = ngram.rsplit()
    pattern = re.compile(r'\s+'.join(words), re.IGNORECASE)
    s, n = re.subn(pattern, '', s)
    counts[ngram] = n

print(counts)
As Wiktor indicates in the comments, you may want to improve your regex pattern though. Now the pattern will also match 'log' in the word 'key logging'. To be sure, you want to wrap the token in word breaks:
pattern = re.compile(r"\b(?:{})\b".format(r"\s+".join(ngram.split())), re.IGNORECASE)
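For reference, dropping that boundary-wrapped pattern into the counting loop would look like this (a sketch, restarting from the original string):

s = "the log user should able to identify log entries and domain log entries"
counts = {}
for ngram in sorted(ngramList, key=len, reverse=True):
    # \b keeps 'log' from matching inside longer words such as 'logging'
    pattern = re.compile(r"\b(?:{})\b".format(r"\s+".join(ngram.split())), re.IGNORECASE)
    s, n = pattern.subn('', s)
    counts[ngram] = n
print(counts)  # {'domain log entries': 1, 'log entries': 1, 'log': 1}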
I am using str.contains for text analytics in Pandas. For the sentence "My latest Data job was an Analyst", I want a combination of the words "Data" & "Analyst", but at the same time I want to specify the number of words between the two words used for the combination (here it is 2 words between "Data" and "Analyst"). Currently I am using (DataFile.XXX.str.contains('job') & DataFile.XXX.str.contains('Analyst')) to get the counts for "job Analyst".
How can I specify the number of words in between the 2 words in the str.contains syntax?
Thanks in advance.
You can't. At least, not in a simple or standardized way.
Even the basics, like how you define a "word," are a lot more complex than you probably imagine. Both word parsing and lexical proximity (e.g. "are two words within distance D of one another in sentence s?") are the realm of natural language processing (NLP). NLP and proximity searches are not part of basic Pandas, nor of Python's standard string processing. You could import something like NLTK, the Natural Language Toolkit, to solve this problem in a general way, but that's a whole 'nother story.
Let's look at a simple approach. First you need a way to parse a string into words. The following is rough by NLP standards, but will work for simpler cases:
import re

def parse_words(s):
    """
    Simple parser to grab English words from string.
    CAUTION: A simplistic solution to a hard problem.
    Many possibly-important edge- and corner-cases
    not handled. Just one example: Hyphenated words.
    """
    return re.findall(r"\w+(?:'[st])?", s, re.I)
E.g.:
>>> parse_words("and don't think this day's last moment won't come ")
['and', "don't", 'think', 'this', "day's", 'last', 'moment', "won't", 'come']
Then you need a way to find all the indices in a list where a target word is found:
def list_indices(target, seq):
    """
    Return all indices in seq at which the target is found.
    """
    indices = []
    cursor = 0
    while True:
        try:
            index = seq.index(target, cursor)
        except ValueError:
            return indices
        else:
            indices.append(index)
            cursor = index + 1
And finally a decision making wrapper:
def words_within(target_words, s, max_distance, case_insensitive=True):
    """
    Determine if the two target words are within max_distance positions of one
    another in the string s.
    """
    if len(target_words) != 2:
        raise ValueError('must provide 2 target words')

    # fold case for case insensitivity
    if case_insensitive:
        s = s.casefold()
        target_words = [tw.casefold() for tw in target_words]
        # for Python 2, replace `casefold` with `lower`

    # parse words and establish their logical positions in the string
    words = parse_words(s)
    target_indices = [list_indices(t, words) for t in target_words]

    # words not present
    if not target_indices[0] or not target_indices[1]:
        return False

    # compute all combinations of distance for the two words
    # (there may be more than one occurrence of a word in s)
    actual_distances = [i2 - i1 for i2 in target_indices[1] for i1 in target_indices[0]]

    # answer whether the minimum observed distance is <= our specified threshold
    return min(actual_distances) <= max_distance
So then:
>>> s = "and don't think this day's last moment won't come at last"
>>> words_within(["THIS", 'last'], s, 2)
True
>>> words_within(["think", 'moment'], s, 2)
False
The only thing left to do is map that back to Pandas:
import pandas as pd

df = pd.DataFrame({'desc': [
    'My latest Data job was an Analyst',
    'some day my prince will come',
    'Oh, somewhere over the rainbow bluebirds fly',
    "Won't you share a common disaster?",
    'job! rainbow! analyst.'
]})

df['ja2'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 2))
df['ja3'] = df.desc.apply(lambda x: words_within(["job", 'analyst'], x, 3))
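For the sample frame this flags only 'job! rainbow! analyst.' in ja2, while ja3 also flags 'My latest Data job was an Analyst', where the two words are three positions apart.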
This is basically how you'd solve the problem. Keep in mind, it's a rough and simplistic solution. Some simply-posed questions are not simply-answered. NLP questions are often among them.
I have the following sentence and dict:
sentence = "I love Obama and David Card, two great people. I live in a boat"
dico = {
    'dict1': ['is', 'the', 'boat', 'tree'],
    'dict2': ['apple', 'blue', 'red'],
    'dict3': ['why', 'Obama', 'Card', 'two'],
}
I want to count the number of elements that are both in the sentence and in a given dict. The heavier method consists of the following procedure:
classe_sentence = []
text_splited = sentence.split(" ")
dic_keys = dico.keys()

for key_dics in dic_keys:
    for values in dico[key_dics]:
        if values in text_splited:
            classe_sentence.append(key_dics)

from collections import Counter
Counter(classe_sentence)
Which gives the following output:
Counter({'dict1': 1, 'dict3': 2})
However it's not efficient at all, since there are two loops and raw comparison. I was wondering if there is a faster way to do that. Maybe using an itertools object? Any idea?
Thanks in advance!
You can use the set data type for all your comparisons, and the set.intersection method to get the number of matches.
It will increase the algorithm's efficiency, but it will only count each word once, even if it shows up in several places in the sentence.
sentence = set("I love Obama and David Card, two great people. I live in a boat".split())

dico = {
    'dict1': {'is', 'the', 'boat', 'tree'},
    'dict2': {'apple', 'blue', 'red'},
    'dict3': {'why', 'Obama', 'Card', 'two'}
}

results = {}
for key, words in dico.items():
    results[key] = len(words.intersection(sentence))
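For the sample sentence this gives results == {'dict1': 1, 'dict2': 0, 'dict3': 2}, matching the Counter from the question (with explicit zeros); note that 'Card' is still missed because the comma in 'Card,' survives the plain split().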
Assuming you want case-sensitive matching:
from collections import defaultdict

sentence_words = defaultdict(lambda: 0)
for word in sentence.split(' '):
    # strip off any trailing or leading punctuation
    word = word.strip('\'";.,!?')
    sentence_words[word] += 1

for name, words in dico.items():
    count = 0
    for x in words:
        count += sentence_words.get(x, 0)
    print('Dictionary [%s] has [%d] matches!' % (name, count,))
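For the question's sentence and dico, this prints a count of 1 for dict1, 0 for dict2 and 3 for dict3; stripping the punctuation lets 'Card' count here, unlike the plain split() above.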
I have a full inverted index in form of nested python dictionary. Its structure is :
{word : { doc_name : [location_list] } }
For example, let the dictionary be called index; then for a word "spam", the entry would look like:
{ spam : { doc1.txt : [102,300,399], doc5.txt : [200,587] } }
so that the documents containing any word can be given by index[word].keys(), and its frequency in that document by len(index[word][document]).
Now my question is: how do I implement a normal query search in this index? That is, given a query containing, let's say, 4 words, find the documents containing all four matches (ranked by total frequency of occurrence), then docs containing 3 matches, and so on...
Added this code, using S. Lott's answer. This is the code I have written. It's working exactly as I want (just some formatting of the output is needed), but I know it could be improved.
from collections import defaultdict
from operator import itemgetter

# Take input
query = input(" Enter the query : ")

# Some preprocessing
query = query.lower()
query = query.strip()

# now real work
wordlist = query.split()
search_words = [x for x in wordlist if x in index]  # list of words that are present in index
print("\nsearching for words ... : ", search_words, "\n")

doc_has_word = [(index[word].keys(), word) for word in search_words]

doc_words = defaultdict(list)
for d, w in doc_has_word:
    for p in d:
        doc_words[p].append(w)

# create a dictionary identifying matches for each document
result_set = {}
for i in doc_words.keys():
    count = 0
    matches = len(doc_words[i])    # number of matches
    for w in doc_words[i]:
        count += len(index[w][i])  # count total occurrences
    result_set[i] = (matches, count)

# Now print in sorted order
print(" Document \t\t Words matched \t\t Total Frequency ")
print('-' * 40)
for doc, (matches, count) in sorted(result_set.items(), key=itemgetter(1), reverse=True):
    print(doc, "\t", doc_words[doc], "\t", count)
Please comment.
Thanks.
Here's a start:
doc_has_word = [ (index[word].keys(),word) for word in wordlist ]
This will build a list of (documents, word) pairs. You can't easily make a dictionary out of that, since each document occurs many times.
But
from collections import defaultdict

doc_words = defaultdict(list)
for d, w in doc_has_word:
    for doc in d:
        doc_words[doc].append(w)  # map each document to the query words it contains
Might be helpful.
import itertools

index = {...}

def query(*args):
    result = []
    doc_count = [(doc, len(index[word][doc])) for word in args for doc in index[word]]
    doc_count.sort(key=lambda item: item[0])  # groupby needs its input sorted on the grouping key
    doc_group = itertools.groupby(doc_count, key=lambda item: item[0])
    for doc, group in doc_group:
        result.append((doc, sum([elem[1] for elem in group])))
    return sorted(result, key=lambda x: x[1])[::-1]
Here is a solution for finding the similar documents (the hardest part):
from functools import reduce  # needed on Python 3

wordList = ['spam', 'eggs', 'toast']  # our list of words to query for
wordMatches = [index.get(word, {}) for word in wordList]
similarDocs = reduce(set.intersection, [set(docMatch.keys()) for docMatch in wordMatches])
wordMatches gets a list where each element is a dictionary of the document matches for one of the words being matched.
similarDocs is a set of the documents that contain all of the words being queried for. This is found by taking just the document names out of each of the dictionaries in the wordMatches list, representing these lists of document names as sets, and then intersecting the sets to find the common document names.
Once you have found the documents that are similar, you should be able to use a defaultdict (as shown in S. Lott's answer) to append all of the lists of matches together for each word and each document.
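A minimal sketch of that last step, reusing wordList, wordMatches and similarDocs from above (matches_per_doc and ranking are illustrative names):

from collections import defaultdict

matches_per_doc = defaultdict(list)  # doc_name -> every location of every query word in that doc
for docMatch in wordMatches:         # one {doc_name: [locations]} dict per query word
    for doc in similarDocs:          # only the documents containing all query words
        matches_per_doc[doc].extend(docMatch.get(doc, []))

# rank those documents by total frequency of the query words
ranking = sorted(matches_per_doc.items(), key=lambda kv: len(kv[1]), reverse=True)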
Related links:
This answer demonstrates defaultdict(int). defaultdict(list) works pretty much the same way.
set.intersection example