Modifying corpus by inserting codewords using Python


I have a corpus of about 30,000 customer reviews in a csv file (or a txt file), with one review per line. Some examples are:
This bike is amazing, but the brake is very poor
This ice maker works great, the price is very reasonable, some bad
smell from the ice maker
The food was awesome, but the water was very rude
I want to change these texts to the following:
This bike is amazing POSITIVE, but the brake is very poor NEGATIVE
This ice maker works great POSITIVE and the price is very reasonable
POSITIVE, some bad NEGATIVE smell from the ice maker
The food was awesome POSITIVE, but the water was very rude NEGATIVE
I have two separate lists (lexicons) of positive words and negative words. For example, a text file contains such positive words as:
amazing
great
awesome
very cool
reasonable
pretty
fast
tasty
kind
And, a text file contains such negative words as:
rude
poor
worst
dirty
slow
bad
So, I want a Python script that reads each customer review: when any of the positive words is found, insert "POSITIVE" after the positive word; when any of the negative words is found, insert "NEGATIVE" after the negative word.
Here is the code I have tested so far. It works (see my comments in the code below), but it needs improvement to meet the needs described above.
Specifically, my_escaper works (it finds words such as cheap and good and replaces them with cheap POSITIVE and good POSITIVE), but I have two files (lexicons), each containing about a thousand positive/negative words. What I want is for the code to read the word lists from the lexicons, search for them in the corpus, and replace those words in the corpus (for example, from "good" to "good POSITIVE", from "bad" to "bad NEGATIVE").
#adapted from http://stackoverflow.com/questions/6116978/python-replace-multiple-strings
import re
def multiple_replacer(*key_values):
    replace_dict = dict(key_values)
    replacement_function = lambda match: replace_dict[match.group(0)]
    pattern = re.compile("|".join([re.escape(k) for k, v in key_values]), re.M)
    return lambda string: pattern.sub(replacement_function, string)
def multiple_replace(string, *key_values):
    return multiple_replacer(*key_values)(string)
#this my_escaper works for a few hard-coded words, but the word/replacement pairs need to come from the two lexicon files instead
my_escaper = multiple_replacer(('cheap','cheap POSITIVE'), ('good', 'good POSITIVE'), ('avoid', 'avoid NEGATIVE'))
d = []
with open("review.txt","r") as file:
    for line in file:
        review = line.strip()
        d.append(review)
for line in d:
    print(my_escaper(line))

A straightforward way to code this would be to load your positive and negative words from your lexicons into separate sets. Then, for each review, split the sentence into a list of words and look-up each word in the sentiment sets. Checking set membership is O(1) in the average case. Insert the sentiment label (if any) into the word list and then join to build the final string.
Example:
import re
reviews = [
    "This bike is amazing, but the brake is very poor",
    "This ice maker works great, the price is very reasonable, some bad smell from the ice maker",
    "The food was awesome, but the water was very rude"
]
positive_words = set(['amazing', 'great', 'awesome', 'reasonable'])
negative_words = set(['poor', 'bad', 'rude'])
for sentence in reviews:
    tagged = []
    for word in re.split(r'\W+', sentence):
        tagged.append(word)
        if word.lower() in positive_words:
            tagged.append("POSITIVE")
        elif word.lower() in negative_words:
            tagged.append("NEGATIVE")
    print(' '.join(tagged))
While this approach is straightforward, there is a downside: you lose the punctuation due to the use of re.split().
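One way to keep the punctuation, and also handle multi-word entries such as "very cool", is to compile both lexicons into a single regular expression and append the label with a replacement function. The following is only a minimal sketch under some assumptions: build_tagger and load_lexicon are names made up for illustration, and the lexicon files are assumed to hold one entry per line.

```python
import re

def load_lexicon(path):
    """Read one lexicon entry per line, skipping blank lines (assumed file layout)."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]

def build_tagger(positive_words, negative_words):
    """Return a function that appends POSITIVE/NEGATIVE after each lexicon match."""
    labels = {}
    for word in positive_words:
        labels[word.lower()] = "POSITIVE"
    for word in negative_words:
        labels[word.lower()] = "NEGATIVE"
    # Try longer entries first so multi-word entries win over shorter overlapping ones.
    alternatives = sorted(labels, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in alternatives) + r")\b",
        re.IGNORECASE,
    )

    def tag(text):
        # Keep the matched text as written and append its label.
        return pattern.sub(
            lambda m: m.group(0) + " " + labels[m.group(0).lower()], text
        )

    return tag

# In practice the word lists would come from load_lexicon("positive.txt") etc.
# (hypothetical file names); inline lists keep the sketch self-contained.
tag = build_tagger(
    ["amazing", "great", "awesome", "very cool", "reasonable"],
    ["rude", "poor", "bad"],
)
print(tag("This bike is amazing, but the brake is very poor"))
```

Because the text is never split, commas and other punctuation stay where they were, and the \b anchors keep an entry like "bad" from matching inside a longer word.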

If I understood correctly, you need something like:
if word in POSITIVE_LIST:
    pattern.sub(replacement_function, word+" POSITIVE")
if word in NEGATIVE_LIST:
    pattern.sub(replacement_function, word+" NEGATIVE")
Is it OK with you?

Related

Create a list of alphabetically sorted UNIQUE words and display the first N words in python

I am new to Python, so apologies for the simple question. My task is the following:
Create a list of alphabetically sorted unique words and display the first 5 words.
I have a text variable that contains a lot of text.
I did:
test = text.split()
sorted(test)
As a result, I get a list that starts with symbols like $ and numbers.
How do I get just the words and print the first N of them?

I'm assuming that by "word" you mean strings that consist only of alphabetical characters. In that case, you can use filter to first get rid of the unwanted strings, turn the result into a set, sort it, and then print the first few.
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: x.isalpha(), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', 'of', 'peak']
But the problem with this is that it will still ignore words like mountain's, because of that pesky '. A regex solution might actually be far better in such a case.
For now, we'll go with this regex: ^[A-Za-z']+$, which means the string must contain only letters and '. You may add more to this regex according to what you deem "words". Read more on regexes here.
We'll be using re.match instead of .isalpha this time.
import re
WORD_PATTERN = re.compile(r"^[A-Za-z']+$")
text = "$1523-the king of the 521236 mountain rests atop the king mountain's peak $#"
# Extract only the words that consist of alphabets
words = filter(lambda x: bool(WORD_PATTERN.match(x)), text.split(' '))
# Print the first 5 words
sorted(set(words))[:5]
Output-
['atop', 'king', 'mountain', "mountain's", 'of']
Keep in mind however, this gets tricky when you have a string like hi! What's your name?. hi!, name? are all words except they are not fully alphabetic. The trick to this is to split them in such a way that you get hi instead of hi!, name instead of name? in the first place.
Unfortunately, a true word split is far outside the scope of this question. I suggest taking a look at this question
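As a rough illustration of "splitting so you get hi instead of hi!", one option is to extract the words with re.findall rather than filtering whitespace-separated chunks. This is a simple sketch, not a real tokenizer: first_unique_words is a made-up helper name, and the pattern assumes a word is a run of ASCII letters with optional internal apostrophes.

```python
import re

def first_unique_words(text, n=5):
    """Return the first n alphabetically sorted unique words in text."""
    # Letters, optionally followed by apostrophe + letters (e.g. "what's").
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text.lower())
    return sorted(set(words))[:n]

print(first_unique_words("hi! What's your name?"))
```

Lowercasing first means "What's" and "what's" count as the same word; drop the .lower() call if case should be preserved.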

I am newbie here, apologies for mistakes. Thank you.
test = '''The coronavirus outbreak has hit hard the cattle farmers in Pabna and Sirajganj as they are now getting hardly any customer for the animals they prepared for the last year targeting the Eid-ul-Azha this year.
Normally, cattle traders flock in large numbers to the belt -- one of the biggest cattle producing areas of the country -- one month ahead of the festival, when Muslims slaughter animals as part of their efforts to honour Prophet Ibrahim's spirit of sacrifice.
But the scene is different this year.'''
test = test.lower().split()
# set() removes duplicates so that each word appears only once
test2 = sorted(set(j for j in test if j.isalpha()))
print(test2[:5])

You can slice the sorted list up to position 5:
sorted(test)[:5]
or, if you are looking only for words:
sorted([i for i in test if i.isalpha()])[:5]
or with a regex (this needs import re):
sorted([i for i in test if re.fullmatch(r"[a-zA-Z]+", i)])[:5]
Slicing a list returns all elements up to a specific index, in this case 5.

How to classify derived words that share meaning as the same tokens?

I would like to count unrelated words in an article but I have troubles with grouping words of the same meaning derived from one another.
For instance, I would like gasoline and gas to be treated as the same token in sentences like The price of gasoline has risen. and "Gas" is a colloquial form of the word gasoline in North American English. Conversely, in BE the term would be "petrol". Therefore, if these two sentences comprised the entire article, the count for gas (or gasoline) would be 3 (petrol would not be counted).
I have tried using NLTK's stemmers and lemmatizers but to no avail. Most seem to reproduce gas as gas and gasoline as gasolin which is not helpful for my purposes at all. I understand that this is the usual behaviour. I have checked out a thread that seems to be a little bit similar, however the answers there are not completely applicable to my case as I require the words to be derived from one another.
How to treat derived words of the same meaning as same tokens in order to count them together?

I propose a two-step approach:
First, find synonyms by comparing word embeddings (only for non-stopwords). This should filter out similarly spelled words that mean something else, such as gasoline and gaseous.
Then, check whether the synonyms share part of their stem, essentially whether "gas" is in "gasolin" or the other way around. This should suffice because you only compare your synonyms.
import spacy
import itertools
from nltk.stem.porter import PorterStemmer
threshold = 0.6
#compare the stems of the synonyms
stemmer = PorterStemmer()
def compare_stems(a, b):
    if stemmer.stem(a) in stemmer.stem(b):
        return True
    if stemmer.stem(b) in stemmer.stem(a):
        return True
    return False
candidate_synonyms = {}
#add a candidate to the candidate dictionary of sets
def add_to_synonym_dict(a,b):
    if a not in candidate_synonyms:
        if b not in candidate_synonyms:
            candidate_synonyms[a] = {a, b}
            return
        a, b = b,a
    candidate_synonyms[a].add(b)
nlp = spacy.load('en_core_web_lg')
text = u'The price of gasoline has risen. "Gas" is a colloquial form of the word gasoline in North American English. Conversely in BE the term would be petrol. A gaseous state has nothing to do with oil.'
words = nlp(text)
#compare every word with every other word, if they are similar
for a, b in itertools.combinations(words, 2):
    #check if one of the word pairs are stopwords or punctuation
    if a.is_stop or b.is_stop or a.is_punct or b.is_punct:
        continue
    if a.similarity(b) > threshold:
        if compare_stems(a.text.lower(), b.text.lower()):
            add_to_synonym_dict(a.text.lower(), b.text.lower())
print(candidate_synonyms)
#output: {'gasoline': {'gas', 'gasoline'}}
Then you can count your synonym candidates based on their appearances in the text.
Note: I chose the synonym threshold of 0.6 arbitrarily. You would probably want to test which threshold suits your task. Also, my code is just a quick and dirty example; this could be done a lot more cleanly.
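The counting step mentioned above could look roughly like this. It is a sketch that assumes the synonym groups have the same shape as candidate_synonyms (a dict mapping one word to the set of its variants); count_grouped is a hypothetical helper, and the simple regex tokenizer stands in for whatever tokenization you actually use.

```python
import re
from collections import Counter

def count_grouped(text, synonym_groups):
    """Count tokens, folding every member of a synonym group into its key."""
    canonical = {}
    for head, members in synonym_groups.items():
        for member in members:
            canonical[member] = head
    # Naive tokenization: lowercase runs of ASCII letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(canonical.get(token, token) for token in tokens)

groups = {"gasoline": {"gas", "gasoline"}}
text = ('The price of gasoline has risen. "Gas" is a colloquial form of '
        'the word gasoline in North American English.')
counts = count_grouped(text, groups)
print(counts["gasoline"])
```

Words outside every group, such as petrol or gaseous, keep their own counts.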

Extracting sentence from a dataframe with description column based on a phrase

I have a dataframe with a 'description' column containing details about the product. Each description in the column is a long paragraph, like:
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence which has the phrase "superb product", and place it in a new column?
So for this case the result will be
expected output
I have used this:
searched_words=['superb product','SUPERB PRODUCT']
print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
The output of this is not suitable, though it works if I put just one word in the searched_words list.

There are a lot of ways to do that. @ChootsMagoots gave you a good answer, but spaCy is also very efficient: you can simply choose the pattern that will lead you to that sentence. Before that, you need to define a function that determines the sentence boundaries. Here's the code:
import spacy
from spacy.matcher import Matcher
def product_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]): # The last token cannot start a sentence
        if token.text == ".":
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False # Tell the default sentencizer to ignore this token
    return doc
nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe(product_sentencizer, before="parser") # Insert before the parser can build its own sentences (spaCy v2 style)
text = "This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much."
doc = nlp(text)
matcher = Matcher(nlp.vocab)
# A phrase spans two tokens, so match "superb" followed by "product", ignoring case
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]
# The pattern must be registered before matching (v2 signature; in spaCy v3 use matcher.add('SUPERB_PRODUCT', [pattern]))
matcher.add('SUPERB_PRODUCT', None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)

Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:
for index, paragraph in df['column_name'].items():
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            # .loc avoids chained assignment; a later match in the same row overwrites an earlier one
            df.loc[index, 'extracted_sentence'] = sentence
This is going to be quite slow, but idk if there's a better way.
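A variation on the same idea, sketched with only the standard library so it can be tried without a dataframe: split on sentence-ending punctuation followed by whitespace, then keep the sentences that contain the phrase, ignoring case. The name sentences_with_phrase is made up for this example.

```python
import re

def sentences_with_phrase(paragraph, phrase):
    """Return the sentences of paragraph that contain phrase (case-insensitive)."""
    # Split after ., ! or ? when followed by whitespace; punctuation stays attached.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in sentences if phrase.lower() in s.lower()]

description = ("This is a superb product. I so so loved this superb product "
               "that I wanna gift to all. This is like the quality and packaging. "
               "I like it very much")
matches = sentences_with_phrase(description, "superb product")
print(matches)
```

With pandas this could presumably be applied per row, for example df['description'].apply(lambda t: sentences_with_phrase(t, 'superb product')), which yields a list of matching sentences for each description.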

count all occurences of each word from a list that appear in several thousand records in python

I have a list of reviews and a list of keywords, and I am trying to count how many times each keyword appears in each review. The keyword list has roughly 30 entries and could grow or change. There are currently about 5,000 reviews, each ranging from 3 to several hundred words, and the number of reviews will definitely grow. Right now the keyword list is static and the number of reviews will not grow too much, so any solution that gets the keyword counts for each review will work. Ideally, though, it would be one without a major performance issue if the number of reviews drastically increases, or if the keywords change and all the reviews have to be reanalyzed.
I have been reading through different methods on Stack Overflow and haven't been able to get any to work. I know you can use scikit-learn to get the count of each word, but I haven't figured out whether there is a way to count a phrase. I have also tried various regular expressions. If the keywords were all single words, I know I could very easily use scikit-learn, a loop, or a regex, but I am having issues when a keyword has multiple words.
Two links I have tried
Python - Check If Word Is In A String
Phrase matching using regex and Python
the solution here is close, but it doesn't count all occurrences of the same word
How to return the count of words from a list of words that appear in a list of lists?
Both the list of keywords and the reviews are being pulled from a MySQL DB. All keywords are in lowercase. All text has been made lowercase, and all non-alphanumeric characters except spaces have been stripped from the reviews. My original thought was to use scikit-learn's CountVectorizer to count the words, but not knowing how to handle counting a phrase, I switched. I am currently attempting this with loops and regex, but I am open to any solution.
# Example of what I am currently attempting with regex
keywords = ['test','blue sky','grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing','this pharse contains test and blue sky and look another test','the grass is green test']
for review in reviews:
    for word in keywords:
        results = re.findall(r'\bword\b',review) #this returns no results, the variable word is not getting picked up
        #--also tried variations of this to no avail
        #--tried creating the pattern first and passing it
        # pattern = "r'\\b" + word + "\\b'"
        # results = re.findall(pattern,review) #this errors with the msg: sre_constants.error: multiple repeat at position 9
#The results would be
review1: test=2; 'blue sky'=0;'grass is green'=0
review2: test=2; 'blue sky'=1;'grass is green'=0
review3: test=1; 'blue sky'=0;'grass is green'=1

I would first do it in brute force rather than overcomplicating it and try to optimize it later.
from collections import defaultdict
keywords = ['test','blue sky','grass is green']
reviews = ['this is a test. test should come back twice and not 3 times for testing','this pharse contains test and blue sky and look another test','the grass is green test']
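results = dict()
for i in keywords:
    for j in reviews:
        results[i] = results.get(i, 0) + j.count(i)
print(results)
>{'test': 6, 'blue sky': 1, 'grass is green': 1}
It's important that we query the dict with .get: if a key is not set yet, we don't want to deal with a KeyError exception. Note that str.count matches substrings, so "testing" also counts toward test, and the totals here are summed over all reviews rather than kept per review.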
If you want to go the complicated route, you can build your own trie and counter structure to do searches in large text files.
Parsing one terabyte of text and efficiently counting the number of occurrences of each word

None of the options you tried searches for the value of word:
results = re.findall(r'\bword\b', review) checks for the literal word "word" in the string.
When you try pattern = "r'\\b" + word + "\\b'", you search for the string r'\b[value of word]\b', including the leading r and the quote characters.
You can use the first option, but the pattern should be r'\b%s\b' % word. That will search for the value of word.
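Put together with the sample data from the question, a per-review count might look like the sketch below. count_keywords is a name chosen for this example, and re.escape is added as a precaution in case a keyword ever contains regex metacharacters.

```python
import re

keywords = ['test', 'blue sky', 'grass is green']
reviews = [
    'this is a test. test should come back twice and not 3 times for testing',
    'this pharse contains test and blue sky and look another test',
    'the grass is green test',
]

def count_keywords(review, keywords):
    """Map each keyword to its number of whole-word matches in review."""
    return {
        keyword: len(re.findall(r'\b%s\b' % re.escape(keyword), review))
        for keyword in keywords
    }

per_review = [count_keywords(review, keywords) for review in reviews]
for number, counts in enumerate(per_review, 1):
    print(number, counts)
```

The \b anchors are what keep "testing" from being counted as "test", and multi-word keywords work the same way as single words because the whole phrase is matched as one pattern.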

NER naive algorithm

I never really dealt with NLP but had an idea about NER which should NOT have worked and somehow DOES work exceptionally well in one case. I do not understand why it works, why it doesn't work elsewhere, or whether it can be extended.
The idea was to extract names of the main characters in a story through:
Building a dictionary for each word
Filling for each word a list with the words that appear right next to it in the text
Finding for each word a word with the max correlation of lists (meaning that the words are used similarly in the text)
Given that one name of a character in the story, the words that are used like it, should be as well (Bogus, that is what should not work but since I never dealt with NLP until this morning I started the day naive)
I ran the overly simple code (attached below) on Alice in Wonderland, which for "Alice" returns:
21 ['Mouse', 'Latitude', 'William', 'Rabbit', 'Dodo', 'Gryphon', 'Crab', 'Queen', 'Duchess', 'Footman', 'Panther', 'Caterpillar', 'Hearts', 'King', 'Bill', 'Pigeon', 'Cat', 'Hatter', 'Hare', 'Turtle', 'Dormouse']
Though it filters for upper-case words (and receives "Alice" as the word to cluster around), there are originally ~500 upper-case words, and it's still pretty spot on as far as main characters go.
It does not work that well with other characters and in other stories, though it gives interesting results.
Any idea whether this idea is usable or extendable, or why it works at all in this story for "Alice"?
Thanks!
#English Name recognition
import re
import sys
def mimic_dict(filename):
    # Map each word to the list of words that directly follow it in the text
    dict = {}
    f = open(filename)
    text = f.read()
    f.close()
    prev = ""
    words = text.split()
    for word in words:
        m = re.search(r"\w+",word)
        if m == None:
            continue
        word = m.group()
        if not prev in dict:
            dict[prev] = [word]
        else :
            dict[prev] = dict[prev] + [word]
        prev = word
    return dict
def main():
    if len(sys.argv) != 2:
        print('usage: ./main.py file-to-read')
        sys.exit(1)
    dict = mimic_dict(sys.argv[1])
    upper = []
    for e in dict.keys():
        if len(e) > 1 and e[0].isupper():
            upper.append(e)
    print(len(upper),upper)
    exclude = ["ME","Yes","English","Which","When","WOULD","ONE","THAT","That","Here","and","And","it","It","me"]
    exclude = [ x for x in exclude if x in dict]
    for s in exclude :
        del dict[s]
    # For each word, find the other word whose follower list overlaps the most
    scores = {}
    for key1 in dict.keys():
        max = 0
        for key2 in dict.keys():
            if key1 == key2 : continue
            a = dict[key1]
            k = dict[key2]
            diff = []
            for ia in a:
                if ia in k and ia not in diff:
                    diff.append( ia)
            if len(diff) > max:
                max = len(diff)
                scores[key1]=(key2,max)
    dictscores = {}
    names = []
    # Keep the capitalized words whose best match is "Alice"
    for e in scores.keys():
        if scores[e][0]=="Alice" and e[0].isupper():
            names.append(e)
    print(len(names), names)
if __name__ == '__main__':
    main()

From the looks of your program and previous experience with NER, I'd say this "works" because you're not doing a proper evaluation. You've found "Hare" where you should have found "March Hare".
The difficulty in NER (at least for English) is not finding the names; it's detecting their full extent (the "March Hare" example); detecting them even at the start of a sentence, where all words are capitalized; classifying them as person/organisation/location/etc.
Also, Alice in Wonderland, being a children's novel, is a rather easy text to process. Newswire phrases like "Microsoft CEO Steve Ballmer" pose a much harder problem; here, you'd want to detect
[ORG Microsoft] CEO [PER Steve Ballmer]

What you are doing is building a distributional thesaurus: finding words which are distributionally similar to a query (e.g. Alice), i.e. words that appear in similar contexts. This does not automatically make them synonyms, but means they are in a way similar to the query. The fact that your query is a named entity does not on its own guarantee that the similar words that you retrieve will be named entities. However, since Alice, the Hare and the Queen tend to appear in similar contexts because they share some characteristics (e.g. they all speak, walk, cry, etc.; the details of Alice in Wonderland escape me), they are more likely to be retrieved. It turns out that whether a word is capitalised or not is a very useful piece of information when working out if something is a named entity. If you do not filter out the non-capitalised words, you will see many other neighbours that are not named entities.
Have a look at the following papers to get an idea of what people do with distributional semantics:
Lin 1998
Grefenstette 1994
Schuetze 1998
To put your idea in the terminology used in these papers, Step 2 is building a context vector for the word from a window of size 1. Step 3 resembles several well-known similarity measures in distributional semantics (most notably the so-called Jaccard coefficient).
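To make the two terms concrete, here is a small sketch of a window-of-one context set and the Jaccard coefficient. It is an illustration rather than a reimplementation of the posted program: it follows the question's code in recording only the word that comes directly after each word, and context_sets, jaccard and the toy text are all made up for the example.

```python
import re
from collections import defaultdict

def context_sets(text):
    """Map each word to the set of words that directly follow it (window of size 1)."""
    words = re.findall(r"\w+", text)
    contexts = defaultdict(set)
    for current, following in zip(words, words[1:]):
        contexts[current].add(following)
    return contexts

def jaccard(a, b):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

toy_text = "Alice said hello. Rabbit said hello. Alice ran away. Rabbit ran home."
contexts = context_sets(toy_text)
print(jaccard(contexts["Alice"], contexts["Rabbit"]))
print(jaccard(contexts["Alice"], contexts["said"]))
```

Unlike the raw overlap count in the question, the Jaccard coefficient divides by the size of the union, so very frequent words with large context sets do not automatically score highest.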
As larsmans pointed out, this seems to work so well because you are not doing a proper evaluation. If you ran this against a hand-annotated corpus, you would find it is very bad at identifying the boundaries of named entities, and it does not even attempt to guess whether they are people, places or organisations... Nevertheless, it is a great first attempt at NLP, keep it up!
