How to get n-grams from a column in pandas dataframe - python

I have some doubts regarding n-grams.
Specifically, I would like to extract 2-grams, 3-grams and 4-grams from the following column:
Sentences
For each topic, we will explore the words occuring in that topic and its relative weight.
We will check where our test document would be classified.
For each document we create a dictionary reporting how many
words and how many times those words appear.
Save this to ‘bow_corpus’, then check our selected document earlier.
To do this, I used the following function
def n_grams(lines , min_length=2, max_length=4):
lenghts=range(min_length,max_length+1)
ngrams={length:collections.Counter() for length in lengths)
queue= collection.deque(maxlen=max_length)
but it does not work since I got None as output.
Can you please tell me what is wrong in the code?

Your ngrams dictionary has empty Counter() objects because you don't pass anything to count. There are also a few other problems:
Function names can't include - in Python.
collection.deque is invalid, I think you wanted to call collections.deque()
I think there are better options to fix your code than using collections library. Two of them are as follows:
You might fix your function using list comprehension:
def n_grams(lines, min_length=2, max_length=4):
tokens = lines.split()
ngrams = dict()
for n in range(min_length, max_length + 1):
ngrams[n] = [tokens[i:i+n] for i in range(len(tokens)-n+1)]
return ngrams
Or you might use nltk which supports tokenization and n-grams natively.
from nltk import ngrams
from nltk.tokenize import word_tokenize
def n_grams(lines, min_length=2, max_length=4):
tokens = word_tokenize(lines)
ngrams = {n: ngrams(tokens, n) for n in range(min_length, max_length + 1)}
return ngrams

Related

word in words.words() check too slow and inaccurate in Python

I am having a dataset, which is consisting of two columns, one is a Myers-Briggs personality type and the other one is containing the last 50 tweets of that person. I have tokenized, removed the URLs and the stop words from the list, and lemmatized the words.
I am then creating a collections.Counter of the most common words and I am checking whether they are valid English words with nltk.
The problem is that checking if the word exists in the corpora vocabulary takes too much time and I also think that a lot of words are missing from this vocabulary. This is my code:
import nltk
import collections
from nltk.corpus import words
# nltk.download("words")
# Creating a frequency Counter of all the words
frequency_counter = collections.Counter(df.posts.explode())
sorted_common_words = sorted(frequency_counter.items(), key = lambda pair: -pair[1])
words_lst = []
for i in range(len(sorted_common_words)):
if sorted_common_words[i][1] > 1000:
words_lst.append(sorted_common_words[i][0])
valid_words = []
invalid_words = []
valid_words = [word for word in words_lst if word in words.words()]
invalid_words = [word for word in words_lst if word not in words.words()]
My problem is that the invalid_words list is containing some valid English words like:
f*ck
changed
surprised
girlfriend
avatar
anymore
And some more of course. Even checking manually if those words exist in the words.words() it returns False. I tried initially to stem my text but this produced some root of the words, which didn't look right, and that's why decided to lemmatize them.
Is there a library in Python which have all the stemmed versions of the English words? I guess this will speed up significantly my script.
My original dataframe is around 9000 lines, and a bit more than 5M tokenized words and around 110.000 unique words after cleaning the dataset. 'words.words()is containing 236736 words, so checking if those 110.000 words are withinwords.words()` will take too much time. I have checked and checking of 1000 takes approximately a minute. This is mainly due to the limitation of Python to be run on only one core, so I cannot parallelize the operation on all available cores.
I would suggest this solution:
# your code as it was before
words_lst = []
for i in range(len(sorted_common_words)):
if sorted_common_words[i][1] > 1000:
words_lst.append(sorted_common_words[i][0])
import numpy as np
words_arr = np.array(words_lst,dtype=str)
words_dictionary = np.array(words.words(),dtype=str)
mask_valid_words = np.in1d(words_arr, words_dictionary)
valid_words = words_arr[mask_valid_words]
invalid_words = words_arr[~mask_valid_words]

this error comes IndexError: list index out of range

This program is to find similarities between the a sentences and words and how they are similar in synonyms I have downloaded the nltk when i first coded it was run and there were no errors but after some days when i run the program
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.corpus import wordnet as wn
filtered_uploaded_sentences = []
uploaded_sentence_synset = []
database_word_synset = []
uploaded_doc_sentence=" The issue of text semantics, such as word semantics and sentence semantics has received increasing attentions in recent years. However, rare research focuses on the document-level semantic matching due to its complexity. Long documents usually have sophisticated structure and massive information, which causes hardship to measure their semantic similarity. The semantic similarity between words, sentences, texts, and documents is widely studied in various fields, including natural language processing, document semantic comparison, artificial intelligence, semantic web, and semantic search engines. "
database_word=["car","complete",'focus',"semantics"]
stopwords = stopwords.words('english')
uploaded_sentence_words_tokenized = word_tokenize(uploaded_doc_sentence)
#filtering the sentence and synset
for word in uploaded_sentence_words_tokenized:
if word not in stopwords:
filtered_uploaded_sentences.append(word)
print (filtered_uploaded_sentences)
for sentences_are in filtered_uploaded_sentences:
uploaded_sentence_synset.append(wn.synsets(sentences_are))
print(uploaded_sentence_synset)
#for finding similrity in the words
for databasewords in database_word:
database_word_synset.append(wn.synsets(databasewords)[0])
print(database_word_synset)
Index Error: list index out of range
this error comes when the uploaded_doc_sentence is short and long sentence is used the error accorded
check.append(wn.wup_similarity(data,sen[0]))
I want to compare the sentence and the words and the results are stored. this type
#the similarity main function for words
for data in database_word_synset:
for sen in uploaded_sentence_synset :
check.append(wn.wup_similarity(data,sen[0]))
print(check)
The problem is that there are empty lists contained in uploaded_sentence_synset. I'm not sure what you're trying to do, but modify the last block of code to:
for data in database_word_synset:
for sen in uploaded_sentence_synset:
if sen:
check.append(wn.wup_similarity(data, sen[0]))
Without the if-else block, you're essentially trying to index the first element of a list, giving you an IndexError.
by removing the empty [] list blocks from the list and making the multi dimension list into single dimension list the problem was solver
list2 = [x for x in main_sen if x != []]
print(list2)
result=list()
for t in list2:
for x in t:
result.append(x)

Get list of all words that can be stemmed to a particular stemming [duplicate]

I'm familiar with word stemming and completion from the tm package in R.
I'm trying to come up with a quick and dirty method for finding all variants of a given word (within some corpus.) For example, I'd like to get "leukocytes" and "leuckocytic" if my input is "leukocyte".
If I had to do it right now, I would probably just go with something like:
library(tm)
library(RWeka)
dictionary <- unique(unlist(lapply(crude, words)))
grep(pattern = LovinsStemmer("company"),
ignore.case = T, x = dictionary, value = T)
I used Lovins because Snowball's Porter doesn't seem to be aggressive enough.
I'm open to suggestions for other stemmers, scripting languages (Python?), or entirely different approaches.
This solution requires preprocessing your corpus. But once that is done it is a very quick dictionary lookup.
from collections import defaultdict
from stemming.porter2 import stem
with open('/usr/share/dict/words') as f:
words = f.read().splitlines()
stems = defaultdict(list)
for word in words:
word_stem = stem(word)
stems[word_stem].append(word)
if __name__ == '__main__':
word = 'leukocyte'
word_stem = stem(word)
print(stems[word_stem])
For the /usr/share/dict/words corpus, this produces the result
['leukocyte', "leukocyte's", 'leukocytes']
It uses the stemming module that can be installed with
pip install stemming

Built-in function to get the frequency of one word with spaCy?

I'm looking for faster alternatives to NLTK to analyze big corpora and do basic things like calculating frequencies, PoS tagging etc... SpaCy seems great and easy to use in many ways, but I can't find any built-in function to count the frequency of a specific word for example. I've looked at the spaCy documentation, but I can't find a straightforward way to do it. Am I missing something?
What I would like would be the NLTK equivalent of:
tokens.count("word") #where tokens is the tokenized text in which the word is to be counted
In NLTK, the above code would tell me that in my text, the word "word" appears X number of times.
Note that I've come by the count_by function, but it doesn't seem to do what I'm looking for.
I use spaCy for frequency counts in corpora quite often. This is what I usually do:
import spacy
nlp = spacy.load("en_core_web_sm")
list_of_words = ['run', 'jump', 'catch']
def word_count(string):
words_counted = 0
my_string = nlp(string)
for token in my_string:
# actual word
word = token.text
# lemma
lemma_word = token.lemma_
# part of speech
word_pos = token.pos_
if lemma_word in list_of_words:
words_counted += 1
print(lemma_word)
return words_counted
sentence = "I ran, jumped, and caught the ball."
words_counted = word_count(sentence)
print(words_counted)
Python stdlib includes collections.Counter for this kind of purpose. You have not given me an answer if this answer suits your case.
from collections import Counter
text = "Lorem Ipsum is simply dummy text of the ...."
freq = Counter(text.split())
print(freq)
>>> Counter({'the': 6, 'Lorem': 4, 'of': 4, 'Ipsum': 3, 'dummy': 2 ...})
print(freq['Lorem'])
>>> 4
Alright just to give some time reference, I have used this script,
import random, timeit
from collections import Counter
def loadWords():
with open('corpora.txt', 'w') as corpora:
randWords = ['foo', 'bar', 'life', 'car', 'wrong',\
'right', 'left', 'plain', 'random', 'the']
for i in range(100000000):
corpora.write(randWords[random.randint(0, 9)] + " ")
def countWords():
with open('corpora.txt', 'r') as corpora:
content = corpora.read()
myDict = Counter(content.split())
print("foo: ", myDict['foo'])
print(timeit.timeit(loadWords, number=1))
print(timeit.timeit(countWords, number=1))
Results,
149.01646934738716
foo: 9998872
18.093295297389773
Still I am not sure if this is enough for you.
Updating with this answer as this is the page I found when searching for an answer for this specific problem. I find that this is an easier solution than the ones provided before and that only uses spaCy.
As you mentioned spaCy Doc object has the built in method Doc.count_by. From what I understand of your question it does what you ask for but it is not obvious.
It counts the occurances of an given attribute and returns a dictionary with the attributes hash as key in integer form and the counts.
Solution
First of all we need to import ORTH from spacy.attr. ORTH is the exact verbatim text of a token. We also need to load the model and provide a text.
import spacy
from spacy.attrs import ORTH
nlp = spacy.load("en_core_web_sm")
doc = nlp("apple apple orange banana")
Then we create a dictionary of word counts
count_dict = doc.count_by(ORTH)
You could count by other attributes like LEMMA, just import the attribute you wish to use.
If we look at the dictionary we will se that it contains the hash for the lexeme and the word count.
count_dict
Results:
{8566208034543834098: 2, 2208928596161743350: 1, 2525716904149915114: 1}
We can get the text for the word if we look up the hash in the vocab.
nlp.vocab.strings[8566208034543834098]
Returns
'apple'
With this we can create a simple function that takes the search word and a count dict created with the Doc.count_by method.
def get_word_count(word, count_dict):
return count_dict[nlp.vocab.strings[word]]
If we run the function with our search word 'apple' and the count dict we created earlier
get_word_count('apple', count_dict)
We get:
2
https://spacy.io/api/doc#count_by

recursive extract synonym for a new word from NLTK

Assuming I have two small dictionaries
posList=['interesting','novel','creative','state-of-the-art']
negList=['outdated','straightforward','trivial']
I have a new word, say "innovative", which is out of my knowledge and I am trying to figure out its sentiment via finding out its synonyms via NLTK function, if the synonyms fall out my small dictionaries, then I recursively call the NLTK function to find the synonyms of the synonyms from last time
The start input could be like this:
from nltk.corpus import wordnet
innovative = wordnet.synsets('innovative')
for synset in innovative:
print synset
print synset.lemmas
It produces the output like this
Synset('advanced.s.03')
[Lemma('advanced.s.03.advanced'), Lemma('advanced.s.03.forward-looking'), Lemma('advanced.s.03.innovative'), Lemma('advanced.s.03.modern')]
Synset('innovative.s.02')
[Lemma('innovative.s.02.innovative'), Lemma('innovative.s.02.innovational'), Lemma('innovative.s.02.groundbreaking')]
Clearly new words include 'advanced','forward-looking','modern','innovational','groundbreaking' are the new words and not in my dictionary, so now I should use these words as start to call synsets function again until no new lemma word appearing.
Anyone can give me a demo code how to extract these lemma words from Synset and keep them in a set strcutre?
It involves dealing with re module in Python I think but I am quite new to Python. Another point I need to address is that I need to get adjective only, so only 's' and 'a' symbol in the Lemma('advanced.s.03.modern'), not 'v' (verb) or 'n' (noun).
Later I would try to calculate the similarity score for a new word with any dictionary word, I need to define the measure. This problem is difficult since adj words are not arranged in hierarchy way and no available measure according to my knowledge. Anyone can advise?
You can get the synonyms of the synonyms as follows.
(Please note that the code uses the WordNet functions of the NodeBox Linguistics library because it offers an easier access to WordNet).
def get_remote_synonyms(s, pos):
if pos == 'a':
syns = en.adjective.senses(s)
if syns:
allsyns = sum(syns, [])
# if there are multiple senses, take only the most frequent two
if len(syns) >= 2:
syns = syns[0] + syns[1]
else:
syns = syns[0]
else:
return []
remote = []
for syn in syns:
newsyns = en.adjective.senses(syn)
remote.extend([r for r in newsyns[0] if r not in allsyns])
return [unicode(i) for i in list(set(remote))]
As far as I know, all semantic measurement functions of the NLTK are based on the hypernym / hyponym hierarchy, so that they cannot be applied to adjectives. Besides, I found a lot of synonyms to be missing in WordNet if you compare its results with the results from a thesaurus like thesaurus.com.

Categories

Resources