I am dealing with a case for which I would like to create my own stemming algorithm. I know that there are excellent libraries for this, but they do not work for this use case.
In essence I would like to import a dictionary so I can loop through the words in a sentence and, if a word is present in a list, reduce it to its base form.
So, for example, reduce 'banker' to 'bank'. I have produced this, but it is not scalable:
list_bank = ('banking', 'banker')
sentence = "There's a banker"
banker_tags = []
for word in sentence.split():
    print(word)
    if word in list_bank:
        pass  # replace word with its base form here
Any suggestion on how I can get this working?
Put the words and their stems in a dictionary and then use that to look up the stemmed form:
dictionary = {'banker': 'bank', 'banking': 'bank'}  # Add the rest of your words and stems
sentence = "There's a banker"
for word in sentence.split():
    if word in dictionary:
        word = dictionary[word]
    print(word)
There's
a
bank
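If you want the rewritten sentence back rather than the individual words, dict.get with a default lets unknown words pass through unchanged; a minimal sketch building on the same dictionary:
dictionary = {'banker': 'bank', 'banking': 'bank'}
sentence = "There's a banker"
stemmed_sentence = " ".join(dictionary.get(word, word) for word in sentence.split())
print(stemmed_sentence)  # There's a bank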
I have a dataframe with text in one of its columns.
I have listed some predefined keywords which I need for analysis, together with the words associated with them (to later make a wordcloud and a counter of occurrences), in order to understand the topics/context associated with such keywords.
Use case:
df.text_column()
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']
Let's say one of the rows of the text column has the text: 'coca cola is expanding its business in soft drinks and aerated water',
and another entry is: 'lime soda is the best selling item in fast food stores'.
My objective is to get bigrams/trigrams like:
'coca_cola', 'coca_cola_expanding', 'soft_drinks', 'aerated_water', 'business_soft_drinks', 'lime_soda', 'food_stores'
Kindly help me to do that [Python only]
First, you can optionally load NLTK's stop word list and remove any stop words from the text (such as "is", "its", "in", and "and"). Alternatively, you can define your own stop word list, or even extend NLTK's list with additional words. Then you can use the nltk.bigrams() and nltk.trigrams() methods to get bigrams and trigrams joined with an underscore _, as you asked. Also, have a look at Collocations (a short sketch follows at the end of this answer).
Edit:
If you haven't already, you need to include the following once in your code, in order to download the stop words list (and the punkt tokenizer models that word_tokenize relies on).
nltk.download('stopwords')
nltk.download('punkt')
Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
word_data = "coca cola is expanding its business in soft drinks and aerated water"
#word_data = "lime soda is the best selling item in fast food stores"
# load nltk's stop word list
stop_words = list(stopwords.words('english'))
# extend the stop words list
#stop_words.extend(["best", "selling", "item", "fast"])
# tokenise the string and remove stop words
word_tokens = word_tokenize(word_data)
clean_word_data = [w for w in word_tokens if not w.lower() in stop_words]
# get bigrams
bigrams_list = ["_".join(item) for item in nltk.bigrams(clean_word_data)]
print(bigrams_list)
# get trigrams
trigrams_list = ["_".join(item) for item in nltk.trigrams(clean_word_data)]
print(trigrams_list)
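For the first word_data string, the two prints should give roughly the following (the exact lists depend on your NLTK version's tokeniser and stop word list):
['coca_cola', 'cola_expanding', 'expanding_business', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water']
['coca_cola_expanding', 'cola_expanding_business', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']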
Update
Once you get the bigram and trigram lists, you can check for matches against your keyword list to keep only the relevant ones.
keywordlist = ['coca', 'food', 'soft', 'aerated', 'soda']
def find_matches(n_grams_list):
    matches = []
    for k in keywordlist:
        matching_list = [s for s in n_grams_list if k in s]
        for m in matching_list:
            if m not in matches:
                matches.append(m)
    return matches
all_matching_bigrams = find_matches(bigrams_list)    # find all matching bigrams
all_matching_trigrams = find_matches(trigrams_list)  # find all matching trigrams
# join the two lists
all_matches = all_matching_bigrams + all_matching_trigrams
print(all_matches)
Output:
['coca_cola', 'business_soft', 'soft_drinks', 'drinks_aerated', 'aerated_water', 'coca_cola_expanding', 'expanding_business_soft', 'business_soft_drinks', 'soft_drinks_aerated', 'drinks_aerated_water']
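On the Collocations pointer above: NLTK's BigramCollocationFinder scores candidate bigrams (e.g. by PMI) rather than taking every adjacent pair, which helps on larger texts. A rough sketch reusing clean_word_data from the code above (the PMI measure and the cut-off of 3 are arbitrary, illustrative choices):
import nltk
from nltk.collocations import BigramCollocationFinder
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(clean_word_data)
# keep the 3 highest-scoring bigrams according to PMI and join them with "_"
top_bigrams = ["_".join(pair) for pair in finder.nbest(bigram_measures.pmi, 3)]
print(top_bigrams)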
Question: A researcher has gathered thousands of news articles. But she wants to focus her attention on articles including a specific word.
The function should meet the following criteria:
Do not include documents where the keyword string shows up only as a part of a larger word. For example, if she were looking for the keyword “closed”, you would not include the string “enclosed.”
She does not want you to distinguish upper case from lower case letters. So the phrase “Closed the case.” would be included when the keyword is “closed”
Do not let periods or commas affect what is matched. “It is closed.” would be included when the keyword is “closed”. But you can assume there are no other types of punctuation.
My code:-
keywords=["casino"]
def multi_word_search(document,keywords):
dic={}
z=[]
for word in document:
i=document.index(word)
token=word.split()
new=[j.rstrip(",.").lower() for j in token]
for k in keywords:
if k.lower() in new:
dic[k]=z.append(i)
else:
dic[k]=[]
return dic
It should return {'casino': [0]} when given document=['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?'] and keywords=['casino'], but I got {'casino': []} instead.
I wonder if someone could help me?
I would build a set from the cleaned tokens in new to speed up the lookup (new is already a list, so there is no need to split it again). If you want a case-insensitive match you need to lower-case both sides:
s = set(new)  # new already holds the cleaned, lower-cased tokens
for k in keywords:
    if k.lower() in s:
        dic.setdefault(k, []).append(i)
    else:
        dic.setdefault(k, [])
return dic
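Plugged into your function in place of the inner loop, this should give {'casino': [0]} for the example document and keywords.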
This is not as trivial as it seems. From an NLP (natural language processing) point of view, splitting a text into words is not trivial (it is called tokenisation).
import nltk
# stemmer = nltk.stem.PorterStemmer()  # optional, see the note on stemming below
def multi_word_search(documents, keywords):
    # Initialize result dictionary
    dic = {kw: [] for kw in keywords}
    for i, doc in enumerate(documents):
        # Preprocess document
        doc = doc.lower()
        tokens = nltk.word_tokenize(doc)
        # tokens = [stemmer.stem(token) for token in tokens]  # optional stemming
        # Search each keyword
        for kw in keywords:
            # target = stemmer.stem(kw.lower())  # optional: stem the keyword too
            target = kw.lower()
            if target in tokens:
                # If found, add to result dictionary
                dic[kw].append(i)
    return dic
documents = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?', 'Some casinos']
keywords = ['casino']
multi_word_search(documents, keywords)
To increase matching you can use stemming (it removes plurals and verb inflections, e.g. running -> run).
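A rough sketch of that idea with NLTK's PorterStemmer (the exact stems depend on which stemmer you pick):
import nltk
stemmer = nltk.stem.PorterStemmer()
print(stemmer.stem('casinos'))  # 'casino' -> the plural now matches the keyword
print(stemmer.stem('running'))  # 'run'
Without stemming, the call above should return {'casino': [0]}; with the three commented stemming lines enabled, 'Some casinos' (index 3) should be matched as well, giving {'casino': [0, 3]}.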
This should work too:
document = ['The Learn Python Challenge Casino', 'They bought a car', 'Casinoville?']
keywords = ['casino', 'car']
def findme(term):
    for x in document:
        val = x.split(' ')
        for v in val:
            if term.lower() == v.lower():
                return document.index(x)
for key in keywords:
    n = findme(key)
    print(f'{key}:{n}')
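For the document and keywords above, this should print:
casino:0
car:1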
I have a dataset consisting of two columns: one is a Myers-Briggs personality type and the other contains the last 50 tweets of that person. I have tokenized the tweets, removed the URLs and the stop words, and lemmatized the words.
I then create a collections.Counter of the most common words and check whether they are valid English words with nltk.
The problem is that checking whether a word exists in the corpus vocabulary takes too much time, and I also think that a lot of words are missing from this vocabulary. This is my code:
import nltk
import collections
from nltk.corpus import words
# nltk.download("words")
# Creating a frequency Counter of all the words
frequency_counter = collections.Counter(df.posts.explode())
sorted_common_words = sorted(frequency_counter.items(), key = lambda pair: -pair[1])
words_lst = []
for i in range(len(sorted_common_words)):
    if sorted_common_words[i][1] > 1000:
        words_lst.append(sorted_common_words[i][0])
valid_words = []
invalid_words = []
valid_words = [word for word in words_lst if word in words.words()]
invalid_words = [word for word in words_lst if word not in words.words()]
My problem is that the invalid_words list contains some valid English words like:
f*ck
changed
surprised
girlfriend
avatar
anymore
And some more, of course. Even checking manually whether those words exist in words.words() returns False. I initially tried to stem my text, but that produced word roots that didn't look right, which is why I decided to lemmatize instead.
Is there a Python library that has all the stemmed versions of English words? I guess this would speed up my script significantly.
My original dataframe is around 9,000 rows, with a bit more than 5M tokenized words and around 110,000 unique words after cleaning the dataset. words.words() contains 236,736 words, so checking whether those 110,000 words are within words.words() takes too much time. I have checked, and processing 1,000 words takes approximately a minute. This is mainly because Python runs on only one core, so I cannot parallelize the operation across all available cores.
I would suggest this solution:
# your code as it was before
words_lst = []
for i in range(len(sorted_common_words)):
    if sorted_common_words[i][1] > 1000:
        words_lst.append(sorted_common_words[i][0])
import numpy as np
words_arr = np.array(words_lst,dtype=str)
words_dictionary = np.array(words.words(),dtype=str)
mask_valid_words = np.in1d(words_arr, words_dictionary)
valid_words = words_arr[mask_valid_words]
invalid_words = words_arr[~mask_valid_words]
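A note on the design choice: np.in1d performs the membership test in vectorised NumPy code, which avoids re-scanning words.words() for every candidate word. If you prefer to stay in plain Python, building a set once gives O(1) lookups and should be comparably fast; a minimal sketch using the same words_lst as above:
valid_word_set = set(words.words())  # build the vocabulary set once
valid_words = [w for w in words_lst if w in valid_word_set]
invalid_words = [w for w in words_lst if w not in valid_word_set]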
I have used the following code to extract a sentence from a file (the sentence should contain some or all of the search keywords):
search_keywords = ['mother', 'sing', 'song']
with open('text.txt', 'r') as in_file:
    text = in_file.read()
sentences = text.split(".")
for sentence in sentences:
    if all(map(lambda word: word in sentence, search_keywords)):
        print(sentence)
The problem with the above code is that it does not print the required sentence if one of the search keywords does not match the sentence's words. I want code that prints the sentences containing some or all of the search keywords. It would be great if the code could also search for a phrase and extract the corresponding sentence.
It seems like you want to count the number of search_keywords in each sentence. You can do this as follows:
sentences = "My name is sing song. I am a mother. I am happy. You sing like my mother".split(".")
search_keywords=['mother','sing','song']
for sentence in sentences:
    print("{} key words in sentence:".format(sum(1 for word in search_keywords if word in sentence)))
    print(sentence + "\n")
# Outputs:
#2 key words in sentence:
#My name is sing song
#
#1 key words in sentence:
# I am a mother
#
#0 key words in sentence:
# I am happy
#
#2 key words in sentence:
# You sing like my mother
Or if you only want the sentence(s) that have the most matching search_keywords, you can make a dictionary and find the maximum values:
dct = {}
for sentence in sentences:
    dct[sentence] = sum(1 for word in search_keywords if word in sentence)
best_sentences = [key for key,value in dct.items() if value == max(dct.values())]
print("\n".join(best_sentences))
# Outputs:
#My name is sing song
# You sing like my mother
If I understand correctly, you should be using any() instead of all().
if any(map(lambda word: word in sentence, search_keywords)):
    print(sentence)
So you want to find sentences that contain at least one keyword. You can use any() instead of all().
EDIT:
If you want to find the sentence which contains the most keywords:
sent_words = []
for sentence in sentences:
    sent_words.append(set(sentence.split()))
num_keywords = [len(sent & set(search_keywords)) for sent in sent_words]
# Find only one sentence
ind = num_keywords.index(max(num_keywords))
# Find all sentences with that number of keywords
ind = [i for i, x in enumerate(num_keywords) if x == max(num_keywords)]
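To get the matching sentences back from those indices (using the sentences list built above), something like this should work:
best_sentences = [sentences[i] for i in ind]
print("\n".join(best_sentences))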
I have a dirty document which includes invalid English words, numbers, etc.
I just want to take all valid English words and then calculate the ratio of my list of words to the total number of valid English words.
For example, if my document has the sentence:
sentence= ['eishgkej he might be a good person. I might consider this.']
I want to count only "he might be a good person. I might consider this" and count "might".
So, I got the answer 2/10.
I am thinking about using the code below. However, I need to change the line features[word] = 1 so that it stores the count of the word rather than just 1...
all_words = nltk.FreqDist(w.lower() for w in reader.words() if w.lower() not in english_sw)
def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        if word in document_words:
            features[word] = 1
        else:
            features[word] = 0
    return features
According to the documentation you can use count(self, sample) to return the count of a word in a FreqDist object. So I think you want something like:
for word in word_features:
    if word in document_words:
        features[word] = all_words.count(word)
    else:
        features[word] = 0
Or you could use indexing, i.e. all_words[word] should return the same as all_words.count(word).
If you want the relative frequency of the word you can use all_words.freq(word).
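For instance, with the example sentence from the question (lower-cased and with punctuation stripped), a FreqDist behaves like this:
import nltk
fd = nltk.FreqDist("he might be a good person i might consider this".split())
print(fd['might'])       # 2   (absolute count)
print(fd.freq('might'))  # 0.2 (relative frequency: 2 out of 10 tokens)
print(fd.N())            # 10  (total number of tokens)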