RegEx: How to find all instances of a collocation? - python

I am trying to write a script in Python to find word collocations in a text. A word collocation is a pair of words that co-occur often across texts. For example, in the collocation "lemon zest", the words lemon and zest co-occur often, so it is a collocation. Now I want to use re.findall to find all occurrences of a given collocation. Unlike "lemon zest", some collocations do not appear next to each other in texts. For example, in the phrase "kind of funny", "of" is a stop word and will already have been removed. So given the collocation "kind funny", the program has to return "kind of funny" as output.
Can anybody tell me how to do this? I should mention that I require a scalable approach, as I am dealing with gigabytes of text.
Edit1:
inputCollocation = "kind funny"
Document1 = "This film is kind of funny"
Document2 = "It is kind of funny"
Document3 = "That film is funny"
ExpectedOutput: Document1, Document2
Thank you in advance.

You can just use string comparison:
inputCollocation = "kind funny"
documents = dict(
Document1 = "This film kind funny",
Document2 = "It kind funny",
Document3 = "That film funny",
)
def remove_stopwords(text):
...
matching = [
document for (document, text) in documents.iteritems()
if inputCollocation in remove_stopwords(text.lower())
]
print 'ExpectedOutput:', ', '.join(matching)
You could also consider using NLTK which has tools for finding collocations.
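For example, a minimal sketch of finding collocations with NLTK (the sample text, the stop-word filtering, and the PMI ranking here are illustrative choices, not part of the question):

import nltk
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

text = "This film is kind of funny. It is kind of funny."  # placeholder corpus
tokens = nltk.word_tokenize(text.lower())

# Drop stop words and punctuation before pairing words up,
# so "kind" and "funny" become adjacent, as in the question.
stop = set(stopwords.words('english'))
finder = BigramCollocationFinder.from_words(
    t for t in tokens if t.isalpha() and t not in stop
)

# Rank candidate collocations by pointwise mutual information.
print(finder.nbest(BigramAssocMeasures.pmi, 10))

For very large corpora you would stream the tokens into the finder rather than load everything into memory at once.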

Related

Extracting sentences containing a keyword using set()

I'm trying to extract sentences that contain selected keywords using set.intersection().
So far I'm only getting sentences that have the word 'van'; I'm not getting the sentences containing the phrases 'blue tinge' or 'off the road', presumably because the code below only handles single-word keywords.
Why is this happening, and what can I do to solve the problem? Thank you.
from textblob import TextBlob
import nltk
nltk.download('punkt')
search_words = set(["off the road", "blue tinge", "van"])
blob = TextBlob("That is the off the road vehicle I had in mind for my adventure. "
                "Which one? The one with the blue tinge. Oh, I'd use the money for a van.")

matches = []
for sentence in blob.sentences:
    blobwords = set(sentence.words)
    if search_words.intersection(blobwords):
        matches.append(str(sentence))
print(matches)
Output: ["Oh, I'd use the money for a van."]
If you want to check for an exact match of the search keywords, this can be accomplished using:

from nltk.tokenize import sent_tokenize

text = "That is the off the road vehicle I had in mind for my adventure. Which one? The one with the blue tinge. Oh, I'd use the money for a van."
search_words = ["off the road", "blue tinge", "van"]
matches = []

sentences = sent_tokenize(text)
for word in search_words:
    for sentence in sentences:
        if word in sentence:
            matches.append(sentence)
print(matches)
The output is:
['That is the off the road vehicle I had in mind for my adventure.',
"Oh, I'd use the money for a van.",
'The one with the blue tinge.']
If you want partial matching then use fuzzywuzzy for percentage matching.
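For example, a rough sketch with fuzzywuzzy's partial_ratio (the threshold of 80 is an arbitrary choice for illustration):

import nltk
from fuzzywuzzy import fuzz
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

text = "That is the off the road vehicle I had in mind for my adventure. Which one? The one with the blue tinge. Oh, I'd use the money for a van."
search_words = ["off the road", "blue tinge", "van"]

matches = []
for keyword in search_words:
    for sentence in sent_tokenize(text):
        # partial_ratio scores the best-matching substring on a 0-100 scale
        if fuzz.partial_ratio(keyword.lower(), sentence.lower()) >= 80:
            matches.append(sentence)
print(matches)

This also tolerates small misspellings in the keywords, at the cost of occasional false positives.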

Is there a way to do fuzzy string matching for words on string?

I want to do fuzzy matching of a multi-word query against a string.
The target string could be something like:
"Hello, I am going to watch a film today."
where the words I want to search for are:
"flim toda".
This should hopefully return "film today" as a search result.
I have used the method below, but it seems to work only with a single word.
import difflib

def matches(large_string, query_string, threshold):
    words = large_string.split()
    matched_words = []
    for word in words:
        s = difflib.SequenceMatcher(None, word, query_string)
        match = ''.join(word[i:i+n] for i, j, n in s.get_matching_blocks() if n)
        if len(match) / float(len(query_string)) >= threshold:
            matched_words.append(match)
    return matched_words

large_string = "Hello, I am going to watch a film today"
query_string = "film"
print(list(matches(large_string, query_string, 0.8)))
This only works with one word, and it only returns a match when there is little noise. Is there any way to do this kind of fuzzy matching with multiple words?
The feature you are thinking of is called "query suggestion". It does rely on spell checking, but mainly it relies on Markov chains built out of search engine query logs.
That being said, you can use an approach similar to the one described in this answer: https://stackoverflow.com/a/58166648/140837
You can simply use fuzzysearch; please see the example below:

from fuzzysearch import find_near_matches

text_string = "Hello, I am going to watch a film today."
matches = find_near_matches('flim toda', text_string, max_l_dist=2)
print([text_string[m.start:m.end] for m in matches])
This will give you the desired output.
['film toda']
Please note that you can choose the value of the max_l_dist parameter based on how much difference you are willing to tolerate.

Extracting titles from a text

I'm working on a project, where I have to extract honorific titles (Mr, Mrs, St, etc.) from a novel. The desired output with the text I'm working with is:
['Col', 'Dr', 'Mr', 'Mrs', 'Otto', 'Rev', 'St']
However, with the code I wrote, the output is this:
{'Tom.', 'Mrs.', 'Otto.', 'Mary.', 'Bots.', 'Come.', 'No.', 'Col.', 'Cain.', 'Dr.', 'Gang.', 'Ike.', 'Kean.', 'St.', 'Hank.', 'Him.', 'Finn.', 'Ann.', 'Jane.', 'Alas.', 'Huck.', 'Sis.', 'Buck.', 'Jim.', 'Sid.', 'Mr.', 'Bill.', 'Rev.', 'Yes.'}
This is the code I have so far:
import re

def get_titles(text):
    pattern = re.compile(r'[A-Z][a-z]{1,3}\.')
    title_tokens = set(re.findall(pattern, text))
    pattern2 = re.compile(r'[A-Z][a-z]{1,3}')
    pseudo_titles = set(re.findall(pattern2, text))
    pseudo_titles = [word.strip() for word in pseudo_titles]
    pseudo_titles = [word.replace('\n', '') for word in pseudo_titles]
    difference = title_tokens.difference(pseudo_titles)
    return difference

test = get_titles(text)
print(test)
As you can notice, the output gives me additional words with periods in them. I believe the issue stems from the regular expressions, but I'm not sure. Any advice or tips are appreciated.
The text can be found here: http://www.gutenberg.org/files/76/76-0.txt
Essentially, you are asking for an algorithm which can tell the difference between a title and one-word sentence. These are lexically indistinguishable; for example, consider the following two strings:
"Do I know who did this? Yes. Smith did it."
"Do I know who did this? Mr. Smith did it."
In the first sentence, "Yes." is a one-word sentence, and in the second, "Mr." is a title. As humans we only know this because we understand the meanings of the tokens "Yes" and "Mr"; so an algorithm which is able to distinguish between these cases requires some information about the meanings of the tokens it's parsing. It cannot work purely lexically like a regex does. This means you must either write a whitelist of allowed titles, or a blacklist of words which are not titles, or otherwise the problem is much more difficult.
Alternatively, if your project doesn't involve parsing titles from very many novels, you could just trim down the results by hand, using your human knowledge that "Tom" and "Yes" aren't titles. It shouldn't be that much work.
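As a rough sketch of the whitelist approach (the set of allowed titles below is just an example, not an exhaustive list):

import re

# Hypothetical whitelist; extend it with whatever titles actually occur in your novels.
ALLOWED_TITLES = {'Mr', 'Mrs', 'Dr', 'Col', 'Rev', 'St'}

def get_whitelisted_titles(text):
    # Find short capitalized words followed by a period, then keep only whitelisted ones.
    candidates = re.findall(r'([A-Z][a-z]{1,3})\.', text)
    return sorted(set(c for c in candidates if c in ALLOWED_TITLES))

sample = "Mr. Smith met Dr. Jones. Yes. Tom said so."
print(get_whitelisted_titles(sample))  # ['Dr', 'Mr']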

Extracting sentence from a dataframe with description column based on a phrase

I have a dataframe with a 'description' column containing details about the product. Each description in the column is a long paragraph, like:
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence which has the phrase "superb product", and place it in a new column?
So for this case, the new column should hold the sentences that mention "superb product".
I have used this:
searched_words = ['superb product', 'SUPERB PRODUCT']

print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
The output for this is not what I want, though it works if I put just one word in the searched_words list.
There are a lot of methods to do this. @ChootsMagoots gave you a good answer, but spaCy is also very efficient: you can simply define the pattern that will lead you to the sentence, but before that you need to add a function that splits the document into sentences. Here's the code:
import spacy
from spacy.matcher import Matcher

def product_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False  # Tell the default sentencizer to ignore this token
    return doc

nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe(product_sentencizer, before="parser")  # Insert before the parser can build its own sentences

text = "This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much."
doc = nlp(text)

matcher = Matcher(nlp.vocab)
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]  # match the two-token phrase, case-insensitively
matcher.add('SUPERB_PRODUCT', None, pattern)

matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)
Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:
for index, paragraph in df['column_name'].iteritems():
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            df.loc[index, 'extracted_sentence'] = sentence
This is going to be quite slow, but idk if there's a better way.
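A slightly tidier variant of the same idea, as a sketch (the dataframe below is a made-up example, and it still splits naively on periods and keeps only the first matching sentence):

import pandas as pd

# Hypothetical data for illustration; the real dataframe has a 'description' column.
df = pd.DataFrame({'description': [
    "This is a superb product. I so so loved this superb product that I wanna gift to all. "
    "This is like the quality and packaging. I like it very much."
]})

def sentence_with_phrase(paragraph, phrase='superb product'):
    # Return the first sentence containing the phrase, ignoring case.
    for sentence in paragraph.split('.'):
        if phrase in sentence.lower():
            return sentence.strip()
    return None

df['extracted_sentence'] = df['description'].apply(sentence_with_phrase)
print(df['extracted_sentence'])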

Modifying corpus by inserting codewords using Python

I have a corpus of about 30,000 customer reviews in a csv file (or a txt file); each customer review is one line in the file. Some examples are:
This bike is amazing, but the brake is very poor
This ice maker works great, the price is very reasonable, some bad
smell from the ice maker
The food was awesome, but the water was very rude
I want to change these texts to the following:
This bike is amazing POSITIVE, but the brake is very poor NEGATIVE
This ice maker works great POSITIVE and the price is very reasonable
POSITIVE, some bad NEGATIVE smell from the ice maker
The food was awesome POSITIVE, but the water was very rude NEGATIVE
I have two separate lists (lexicons) of positive words and negative words. For example, a text file contains such positive words as:
amazing
great
awesome
very cool
reasonable
pretty
fast
tasty
kind
And, a text file contains such negative words as:
rude
poor
worst
dirty
slow
bad
So, I want a Python script that reads the customer reviews: when any of the positive words is found, insert "POSITIVE" after the positive word; when any of the negative words is found, insert "NEGATIVE" after the negative word.
Here is the code I have tested so far. This works (see my comments in the codes below), but it needs improvement to meet my needs described above.
Specifically, my_escaper works (it finds words such as cheap and good and replaces them with cheap POSITIVE and good POSITIVE), but the problem is that I have two files (lexicons), each containing about a thousand positive/negative words. So what I want is for the code to read those word lists from the lexicons, search for them in the corpus, and replace those words in the corpus (for example, "good" becomes "good POSITIVE" and "bad" becomes "bad NEGATIVE").
# adapted from http://stackoverflow.com/questions/6116978/python-replace-multiple-strings
import re

def multiple_replacer(*key_values):
    replace_dict = dict(key_values)
    replacement_function = lambda match: replace_dict[match.group(0)]
    pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
    return lambda string: pattern.sub(replacement_function, string)

def multiple_replace(string, *key_values):
    return multiple_replacer(*key_values)(string)

# my_escaper works for these hard-coded words, but I need the replacements to come
# from the two lexicon files, each with about a thousand positive/negative words.
my_escaper = multiple_replacer(('cheap', 'cheap POSITIVE'), ('good', 'good POSITIVE'), ('avoid', 'avoid NEGATIVE'))

d = []
with open("review.txt", "r") as file:
    for line in file:
        review = line.strip()
        d.append(review)

for line in d:
    print my_escaper(line)
A straightforward way to code this would be to load your positive and negative words from your lexicons into separate sets. Then, for each review, split the sentence into a list of words and look-up each word in the sentiment sets. Checking set membership is O(1) in the average case. Insert the sentiment label (if any) into the word list and then join to build the final string.
Example:
import re

reviews = [
    "This bike is amazing, but the brake is very poor",
    "This ice maker works great, the price is very reasonable, some bad smell from the ice maker",
    "The food was awesome, but the water was very rude"
]

positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
negative_words = set(['poor', 'bad', 'rude'])

for sentence in reviews:
    tagged = []
    for word in re.split(r'\W+', sentence):
        tagged.append(word)
        if word.lower() in positive_words:
            tagged.append("POSITIVE")
        elif word.lower() in negative_words:
            tagged.append("NEGATIVE")
    print ' '.join(tagged)
While this approach is straightforward, there is a downside: you lose the punctuation due to the use of re.split().
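If keeping the punctuation matters, one possible variant (just a sketch; the lexicon filenames positive.txt and negative.txt are assumptions, not taken from the question) is to load the word lists from the lexicon files and tokenize with a pattern that keeps punctuation as separate tokens:

import re

def load_lexicon(path):
    # One word (or phrase) per line, as in the question's lexicon files.
    with open(path) as f:
        return set(line.strip().lower() for line in f if line.strip())

positive_words = load_lexicon("positive.txt")  # assumed filename
negative_words = load_lexicon("negative.txt")  # assumed filename

def tag_review(review):
    tagged = []
    # \w+ grabs words, [^\w\s] grabs each punctuation mark, so nothing is dropped.
    for token in re.findall(r"\w+|[^\w\s]", review):
        tagged.append(token)
        if token.lower() in positive_words:
            tagged.append("POSITIVE")
        elif token.lower() in negative_words:
            tagged.append("NEGATIVE")
    return ' '.join(tagged)

print(tag_review("This bike is amazing, but the brake is very poor"))

Note that multi-word lexicon entries such as "very cool" would still need phrase-level handling, since this tags one token at a time.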
If I understood correctly, you need something like:
if word in POSITIVE_LIST:
    pattern.sub(replacement_function, word + " POSITIVE")
if word in NEGATIVE_LIST:
    pattern.sub(replacement_function, word + " NEGATIVE")
Is it OK with you?
