Faster and more efficient Python method for fuzzy matching substrings

I want the program to search for all occurrences of crocodile, etc. with fuzzy matching, i.e. if there are any spelling mistakes, it should count those words as well.
s="Difference between a crocodile and an alligator is......." #Long paragraph, >10000 words
to_search=["crocodile","insect","alligator"]
for i in range(len(to_search)):
for j in range(len(s)):
a = s[j:j+len(to_search[i])]
match = difflib.SequenceMatcher(None,a,to_search[I]).ratio()
if(match>0.9): #90% similarity
print(a)
So all of the following should be considered as instances of "crocodile": "crocodile", "crocodil", "crocodele", etc.
The above method works but is too slow if the main string ("s" here) is very large, e.g. >1 million words.
Is there any way to do this that's faster than the above method (splitting the string into substring-sized blocks and then comparing each substring with the reference word)?

One of the reasons it takes so long on a large body of text is that you are repeating the sliding window through the entire text multiple times, once for each word you are searching for. And a lot of the computation goes into comparing your search words to blocks of the same length that may span parts of multiple words.
If you are willing to posit that you are always looking to match individual words, you could split the text into words and just compare against the words - far fewer comparisons (number of words, vs. windows starting at every position in the text), and the splitting only needs to be done once, not for every search term. Here's an example:
to_search= ["crocodile", "insect", "alligator"]
s = "Difference between a crocodile and an alligator is" #Long paragraph, >10000 words
s_words = s.replace(".", " ").split(" ") # Split on spaces, with periods removed
for search_for in to_search:
for s_word in s_words:
match = difflib.SequenceMatcher(None, s_word, search_for).ratio()
if(match > 0.9): #90% similarity
print(s_word)
continue # no longer need to continue the search for this word!
This should give you significant speedup, hope it solves your needs!
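If you only need the near-matching words themselves, difflib.get_close_matches can replace the inner loop entirely. A minimal sketch under the same split-into-words assumption (the cutoff of 0.9 mirrors the ratio threshold above, and n=10 is an arbitrary cap on how many matches to return):

import difflib

s_words = s.replace(".", " ").split()  # split on whitespace, periods removed
for search_for in to_search:
    # One call per search term; cutoff plays the role of the 0.9 ratio check.
    close = difflib.get_close_matches(search_for, s_words, n=10, cutoff=0.9)
    print(search_for, close)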
Happy Coding!

Related

Extract text between two dots containing a specific word

I'm trying to extract a sentence between two dots. All the sentences I want have inflam or Inflam in them, which is my specific word, but I don't know how to make that happen.
what I want is ".The bulk of the underlying fibrous connective tissue consists of diffuse aggregates of chronic inflammatory cells."
or
".The fibrous connective tissue reveals scattered vascular structures and possible chronic inflammation."
from a long paragraph
What I have tried so far is this:
#@title Extract microscopic-inflammation { form-width: "20%" }
import re

def inflammation1(microscopic_description):
    PATTERNS = [
        "(?=\.)(.*)(?<=inflamm)",
        "(?=inflamm)(.*)(?<=.)",
    ]
    for pattern in PATTERNS:
        matches = re.findall(pattern, microscopic_description)
        if len(matches) > 0:
            break
    inflammation1 = ''.join([k for k in matches])
    return (inflammation1)

# `texts` is the list of microscopic descriptions (defined elsewhere)
for index, microscopic_description in enumerate(texts):
    print(inflammation1(microscopic_description))
    print("#" * 79, index)
which hasn't worked for me and gives me an error. When I separate my patterns and run them in different cells, they work. The problem is they don't work together to give me the sentence between "." and "." before inflamm and after inflamm.
import re

string = ''  # replace with your paragraph
print(re.search(r"\.[\s\w]*\.", string).group())  # will print the first matched string
print(re.findall(r"\.[\s\w]*\.", string))  # will print all matched strings
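If you also want to restrict the matches to sentences that actually contain the keyword, a variant pattern along these lines should work (this pattern is a suggestion, not part of the original answer):

import re

string = ''  # replace with your paragraph
# Keep only dot-delimited spans that mention "inflamm" or "Inflamm".
print(re.findall(r"\.[^.]*[Ii]nflamm[^.]*\.", string))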
You can try by checking for the word in every sentence of the text.
for sentence in text.split("."):
    if word in sentence:
        print(sentence[1:])
Here you do exactly that, and if you find the word, you print the sentence without the leading space. You can modify it in any way you want.
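A self-contained sketch of that idea, with a case-insensitive check since the question mentions both inflam and Inflam (the variable values here are placeholders):

text = "..."  # the long paragraph
word = "inflam"

for sentence in text.split("."):
    if word.lower() in sentence.lower():
        # Re-attach the trailing period that split() removed.
        print(sentence.strip() + ".")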

What am I doing wrong using Gensim.Phrases to extract repeating multiple-word terms from within a single sentence?

I would like to first extract repeating n-grams from within a single sentence using Gensim's Phrases, then use those to get rid of duplicates within sentences. Like so:
Input: "Testing test this test this testing again here testing again here"
Desired output: "Testing test this testing again here"
My code seemed to have worked for generating up to 5-grams using multiple sentences but whenever I pass it a single sentence (even a list full of the same sentence) it doesn't work. If I pass a single sentence, it splits the words into characters. If I pass the list full of the same sentence, it detects nonsense like non-repeating words while not detecting repeating words.
I thought my code was working because I used like 30MB of text and produced very intelligible n-grams up to n=5 that seemed to correspond to what I expected. I have no idea how to tell its precision and recall, though. Here is the full function, which recursively generates all n-grams from 2 to n:
def extract_n_grams(documents, maximum_number_of_words_per_group=2, threshold=10, minimum_count=6, should_print=False, should_use_keywords=False):
    from gensim.models import Phrases
    from gensim.models.phrases import Phraser
    tokens = [doc.split(" ") for doc in documents] if type(documents) == list else [documents.split(" ") for _ in range(100)]  # this is what I tried
    final_n_grams = []
    for current_n in range(maximum_number_of_words_per_group - 1):
        # `connecting_words` is defined elsewhere in the original code
        n_gram = Phrases(tokens, min_count=minimum_count, threshold=threshold, connector_words=connecting_words)
        n_gram_phraser = Phraser(n_gram)
        resulting_tokens = []
        for token in tokens:
            resulting_tokens.append(n_gram_phraser[token])
        current_n_gram_final = []
        for token in resulting_tokens:
            for word in token:
                if '_' in word:
                    # no n_gram should have a comma between words
                    if ',' not in word:
                        word = word.replace('_', ' ')
                        if word not in current_n_gram_final and all([word not in gram for gram in final_n_grams]):
                            current_n_gram_final.append(word)
        tokens = n_gram[tokens]
        final_n_grams.append(current_n_gram_final)
    return final_n_grams
In addition to trying to repeat the sentence in the list, I also tried using NLTK's word_tokenize as suggested here. What am I doing wrong? Is there an easier approach?
The Gensim Phrases class is designed to statistically detect when certain pairs of words appear so often together, compared to independently, that it might be useful to combine them into a single token.
As such, it's unlikely to be helpful for your example task, of eliminating the duplicate 3-word ['testing', 'again', 'here'] run-of-tokens.
First, it never eliminates tokens – only combines them. So, if it saw the couplet ['again', 'here'] appearing very often together, rather than as separate 'again' and 'here', it'd turn it into 'again_here' – not eliminate it.
But second, it does these combinations not for every repeated n-token grouping, but only if the large amount of training data implies, based on the threshold configured, that certain pairs stick out. (And it only goes beyond pairs if run repeatedly.) Your example 3-word grouping, ['testing', 'again', 'here'], does not seem likely to stick out as a composition of extra-likely pairings.
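To make the combining behaviour concrete, here is a small illustrative sketch (the toy corpus, counts and threshold are invented for this example and assume gensim 4.x):

from gensim.models import Phrases

# Toy corpus in which 'again' and 'here' co-occur far more often than chance;
# the min_count and threshold values are tuned to this tiny example only.
corpus = [["again", "here"]] * 20 + [["testing"], ["again"], ["here"]] * 5
bigram_model = Phrases(corpus, min_count=5, threshold=0.05)

print(bigram_model[["testing", "again", "here"]])
# Expected output: ['testing', 'again_here'] -- the statistically associated
# pair is merged into one token; nothing is ever removed, only combined.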
If you have a more rigorous definition of which tokens/runs-of-tokens need to be eliminated, you'd probably want to run other Python code on the lists-of-tokens to enforce that de-duplication. Can you describe in more detail, perhaps with more examples, the kinds of n-grams you want removed? (Will they only be at the beginning or end of a text, or also the middle? Do they have to be next-to each other, or can they be spread throughout the text? Why are such duplicates present in the data, & why is it thought important to remove them?)
Update: Based on the comments about the real goal, a few lines of Python that check, at each position in a token-list, whether the next N tokens match the previous N tokens (and thus can be ignored) should do the trick. For example:
def elide_repeated_ngrams(list_of_tokens):
    return_tokens = []
    i = 0
    while i < len(list_of_tokens):
        for candidate_len in range(len(return_tokens)):
            if list_of_tokens[i:i+candidate_len] == return_tokens[-candidate_len:]:
                i = i + candidate_len  # skip the repeat
                break  # begin fresh forward repeat-check
        else:
            # this token not part of any repeat; include & proceed
            return_tokens.append(list_of_tokens[i])
            i += 1
    return return_tokens
On your test case:
>>> elide_repeated_ngrams("Testing test this test this testing again here testing again here".split())
['Testing', 'test', 'this', 'testing', 'again', 'here']

Python Create List of Words Per Sentence and Calculate Mean and Place in CSV File

I'm looking to count the number of words per sentence, calculate the mean words per sentence, and put that info into a CSV file. Here's what I have so far. I probably just need to know how to count the number of words before a period. I might be able to figure it out from there.
# Read the data in the text file as a string
with open("PrideAndPrejudice.txt") as pride_file:
    pnp = pride_file.read()

# Change '!' and '?' to '.'
for ch in ['!', '?']:
    if ch in pnp:
        pnp = pnp.replace(ch, ".")

# Remove the period after Dr., Mr., Mrs. (choosing not to include etc. as that often ends a sentence, although it can also be in the middle)
pnp = pnp.replace("Dr.", "Dr")
pnp = pnp.replace("Mr.", "Mr")
pnp = pnp.replace("Mrs.", "Mrs")
To split a string into a list of strings on some character:
pnp = pnp.split('.')
Then we can split each of those sentences into a list of strings (words)
pnp = [sentence.split() for sentence in pnp]
Then we get the number of words in each sentence
pnp = [len(sentence) for sentence in pnp]
Then we can use statistics.mean to calculate the mean:
statistics.mean(pnp)
To use statistics you must put import statistics at the top of your file. If you don't recognize the ways I'm reassigning pnp, look up list comprehensions.
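Putting those steps together, and adding the CSV output the question asks for, a minimal sketch (the CSV layout and column names are my own assumption):

import csv
import statistics

with open("PrideAndPrejudice.txt") as pride_file:
    pnp = pride_file.read()

# Normalise sentence-ending punctuation and drop abbreviation periods, as above.
for ch in ["!", "?"]:
    pnp = pnp.replace(ch, ".")
for abbrev in ["Dr.", "Mr.", "Mrs."]:
    pnp = pnp.replace(abbrev, abbrev[:-1])

words_per_sentence = [len(sentence.split()) for sentence in pnp.split(".") if sentence.strip()]
mean_words = statistics.mean(words_per_sentence)

# One row per sentence, then a final row with the mean.
with open("sentence_counts.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["sentence_number", "word_count"])
    for number, count in enumerate(words_per_sentence, start=1):
        writer.writerow([number, count])
    writer.writerow(["mean", mean_words])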
You might be interested in the split() function for strings. It seems like you're editing your text to make sure all sentences end in a period and every period ends a sentence.
Thus,
pnp.split('.')
is going to give you a list of all sentences. Once you have that list, for each sentence in the list,
sentence.split() # i.e., split according to whitespace by default
will give you a list of words in the sentence.
Is that enough of a start?
You can try the code below.
numbers_per_sentence = [len(element) for element in (element.split() for element in pnp.split("."))]
mean = sum(numbers_per_sentence)/len(numbers_per_sentence)
However, for real natural language processing I would probably recommend a more robust solution such as NLTK. The text manipulation you perform (replacing "?" and "!", removing periods after "Dr.", "Mr." and "Mrs.") is probably not enough to be 100% sure that a period is always a sentence separator (and that there are no other sentence separators in your text, even if that happens to be true for Pride and Prejudice).
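For reference, a rough sketch of that NLTK route (it assumes the tokenizer data can be downloaded):

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt")  # sentence tokenizer data (newer NLTK releases may ask for "punkt_tab" instead)

with open("PrideAndPrejudice.txt") as f:
    text = f.read()

sentences = sent_tokenize(text)
words_per_sentence = [len(word_tokenize(sentence)) for sentence in sentences]
print(sum(words_per_sentence) / len(words_per_sentence))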

if allwords in title: match

Using Python 3, I have a list of words like:
['foot', 'stool', 'carpet']
These lists vary in length from 1 to 6 or so. I have thousands and thousands of strings to check, and it is required to make sure that all three words are present in a title, where:
'carpet stand upon the stool of foot balls.'
is a correct match, as all the words are present here, even though they are out of order.
I've wondered about this for a long time, and the only thing I could think of was some sort of iteration like:
for word in list: if word in title: match!
But this gives me results like 'carpet cleaner', which is incorrect. I feel as though there is some sort of shortcut to do this, but I can't seem to figure it out without using excessive list(), continue, break or other methods/terminology that I'm not yet familiar with.
You can use all():
words = ['foot', 'stool', 'carpet']
title = "carpet stand upon the stool of foot balls."
matches = all(word in title for word in words)
Or, inverse the logic with not any() and not in:
matches = not any(word not in title for word in words)
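Note that `word in title` is a substring test, so 'foot' would also match 'football'. If whole-word matches are needed, one sketch is to compare sets of whitespace-split tokens (the punctuation stripping below is a simplification):

words = ['foot', 'stool', 'carpet']
title = "carpet stand upon the stool of foot balls."

# Strip trailing punctuation from each token so 'balls.' compares as 'balls'.
title_tokens = {token.strip(".,;:!?") for token in title.split()}
matches = set(words) <= title_tokens  # True only if every word appears as a whole token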

Is there a faster way of looping through a set and replacing MWE in a sentence? - Python

The task is to group expressions that are made up of multiple words (aka Multi-Word Expressions).
Given a dictionary of MWE, I need to add dashes to the input sentences where MWE are detected, e.g.
**Input:** i have got an ace of diamonds in my wet suit .
**Output:** i have got an ace-of-diamonds in my wet-suit .
Currently I loop through the sorted dictionary and check whether each MWE appears in the sentence, replacing it whenever it does. But there are a lot of wasted iterations.
Is there a better way of doing this? One solution is to produce all possible n-grams first, i.e. chunker2():
import re, time, codecs

mwe_list = set([i.strip() for i in codecs.open("wn-mwe-en.dic", "r", "utf8").readlines()])

def chunker(sentence):
    for item in mwe_list:
        if item in sentence or item.replace("-", " ") in sentence:
            #print item
            mwe_item = '-'.join(item.split(" "))
            r = re.compile(re.escape(mwe_item).replace('\\-', '[- ]'))
            sentence = re.sub(r, mwe_item, sentence)
    return sentence

def chunker2(sentence):
    nodes = []
    tokens = sentence.split(" ")
    for i in range(0, len(tokens)):
        for j in range(i, len(tokens)):
            nodes.append(" ".join(tokens[i:j]))
    n = sorted(set([i for i in nodes if i and len(i.split(" ")) > 1]))
    intersect = mwe_list.intersection(n)
    for i in intersect:
        print i
        sentence = sentence.replace(i, i.replace(" ", "-"))
    return sentence

s = "i have got an ace of diamonds in my wet suit ."

time.clock()
print chunker(s)
print time.clock()

time.clock()
print chunker2(s)
print time.clock()
I'd try doing it like this:
For each sentence, construct a set of n-grams up to a given length (the longest MWE in your list).
Now, just do mwe_nmgrams.intersection(sentence_ngrams) and search/replace them.
You won't have to waste time by iterating over all of the items in your original set.
Here's a slightly faster version of chunker2:
def chunker3(sentence):
    tokens = sentence.split(' ')
    len_tokens = len(tokens)
    nodes = set()
    for i in xrange(0, len_tokens):
        for j in xrange(i, len_tokens):
            chunks = tokens[i:j]
            if len(chunks) > 1:
                nodes.add(' '.join(chunks))
    intersect = mwe_list.intersection(nodes)
    for i in intersect:
        print i
        sentence = sentence.replace(i, i.replace(' ', '-'))
    return sentence
First, a 2x improvement: Because you are replacing the MWEs with hyphenated versions, you can pre-process the dictionary (wn-mwe-en.dic) to eliminate all hyphens from the MWEs in the set, eliminating one string comparison. If you allow hyphens within the sentence, then you'll have to pre-process it as well, presumably online, for a minor penalty. This should cut your runtime in half.
Next, a minor improvement: Immutable tuples are generally faster to iterate over than a set or list (which are mutable, so the iterator has to account for possible changes to the underlying storage at each step). The set() conversion will eliminate duplicates, as you intend, and converting the result to a tuple then fixes it in memory, allowing low-level iteration optimizations by the Python interpreter and its compiled libraries.
Finally, you should probably parse both the sentence and the MWEs into words or tokens before doing all your comparisons; this would cut down on the number of string comparisons required by the average length of your words (4x if your words are 4 characters long on average). You'd also be able to nest another loop to search for the first word in each MWE as an anchor for all MWEs that share that first word, reducing the length of the string comparisons required (see the sketch at the end of this answer). But I'll leave this lion's share for your experimentation on real data. And depending on interpreter vs. compiled-library efficiency, doing all this splitting and nested looping at the Python level may actually slow things down.
So here's the result of the first two easy "sure" bets. Should be 2x faster despite the preprocessing, unless your sentence is very short.
mwe_list = set(i.strip() for i in codecs.open("wn-mwe-en.dic", "r", "utf8").readlines())
mwe_list = tuple(mwe.replace('-', ' ').strip() for mwe in mwe_list)
sentence = sentence.replace('-', ' ').strip()

def chunker(sentence):
    for item in mwe_list:
        if item in sentence:
            ...
Couldn't find a .dic file on my system or I'd profile it for you.
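As a sketch of the first-word-anchor idea above, written in Python 3 (the index structure and names are my own, and it has not been profiled against the real .dic file):

from collections import defaultdict

# Index the multi-word expressions by their first word, so each sentence token
# only triggers the candidates that could start at that position.
mwe_index = defaultdict(list)
for mwe in mwe_list:
    words = mwe.split()
    if len(words) > 1:
        mwe_index[words[0]].append(words)

def chunk_with_index(sentence):
    tokens = sentence.split()
    out = []
    i = 0
    while i < len(tokens):
        replaced = False
        # Try the longest candidates first so "ace of diamonds" wins over "ace of".
        for candidate in sorted(mwe_index.get(tokens[i], []), key=len, reverse=True):
            if tokens[i:i + len(candidate)] == candidate:
                out.append("-".join(candidate))
                i += len(candidate)
                replaced = True
                break
        if not replaced:
            out.append(tokens[i])
            i += 1
    return " ".join(out)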
