Too much time for spell checking in Python

I have a dataframe with around 200,000 rows, and each row holds approximately 30 tokenized words. I am trying to fix spelling mistakes and then lemmatize the words.
Some words are not in the dictionary, so I compare frequencies: if the original word is more frequent in my corpus than the suggested correction, I keep it; otherwise I replace it with the correction.
from spellchecker import SpellChecker  # assuming the pyspellchecker package
from nltk.stem import WordNetLemmatizer

spell = SpellChecker()

def spelling_mistake_corrector(word):
    checkedWord = spell.correction(word)
    # freqDist is a word-frequency table built elsewhere (not shown here)
    if freqDist[checkedWord] >= freqDist[word]:
        word = checkedWord
    return word

def correctorForAll(text):
    text = [spelling_mistake_corrector(word) for word in text]
    return text

lemmatizer = WordNetLemmatizer()

def lemmatize_words(text):
    text = [lemmatizer.lemmatize(word) for word in text]
    text = [word for word in text if len(word) > 2]  # filtering 1 and 2 letter words out
    return text

def apply_corrector_and_lemmatizer(text):
    return lemmatize_words(correctorForAll(text))

df['tokenized'] = df['tokenized'].apply(apply_corrector_and_lemmatizer)
The problem is that this code has been running on Colab for 3 hours. What can I do to improve the run time? Thanks!
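One possible way to cut the run time, sketched under the assumption that most of the cost comes from calling the spell checker once per token: correct each distinct word only once and reuse the result. The names `build_correction_map`, `apply_corrections` and the stand-in `toy_correct` are hypothetical and only for illustration; in the real pipeline `spelling_mistake_corrector` would take the place of `toy_correct`.

```python
def build_correction_map(token_lists, correct):
    """Call `correct` once per distinct word and return {word: corrected}."""
    unique_words = set()
    for tokens in token_lists:
        unique_words.update(tokens)
    return {word: correct(word) for word in unique_words}


def apply_corrections(token_lists, correction_map):
    """Replace every token by its precomputed correction."""
    return [[correction_map[word] for word in tokens] for tokens in token_lists]


# Stand-in for an expensive spell checker (hypothetical, for illustration only).
calls = []

def toy_correct(word):
    calls.append(word)
    return {"teh": "the", "wrold": "world"}.get(word, word)


rows = [["teh", "cat"], ["teh", "wrold"], ["cat", "wrold"]]
correction_map = build_correction_map(rows, toy_correct)
corrected_rows = apply_corrections(rows, correction_map)
print(corrected_rows)
print(len(calls))  # number of corrector calls: one per distinct word
```

With a dataframe, the same map could be applied with something like `df['tokenized'].apply(lambda tokens: [correction_map[w] for w in tokens])`. How much this helps depends on how often words repeat across rows.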

Related

Stanza: Count words without punctuation

I want to count the words in a text with Stanza, leaving punctuation out of the count but without removing it from the text.
This is what I currently try:
text = """ Q1 revenue reached €12 .7 billion ."""
doc = nlp(text)
words = doc.num_tokens
print(words)
8
Sorry if this is too basic, but I am very new to Stanza.
Could you please explain how I can count words without punctuation?
If you don't want to remove the punctuation, you can pass the keyword argument tokenize_pretokenized=True to the pipeline.
This disables Stanza's own tokenization, so the text is split on whitespace only, and punctuation or symbols attached to a word are no longer counted as separate tokens:
nlp = stanza.Pipeline(tokenize_pretokenized=True)
text = """ Q1 revenue reached €12 .7 billion ."""
doc = nlp(text)
words = doc.num_tokens
print(words) # 7
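As a complementary sketch that does not rely on Stanza at all, tokens that consist only of punctuation can be skipped while counting. The helper names `is_punctuation` and `count_words` are hypothetical, and the token list here comes from a plain whitespace split; with Stanza, a list such as `[token.text for sentence in doc.sentences for token in sentence.tokens]` could be passed instead.

```python
import unicodedata


def is_punctuation(token):
    """True if every character of the token is in a Unicode punctuation category."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def count_words(tokens):
    """Count tokens that are not made up purely of punctuation."""
    return sum(1 for token in tokens if not is_punctuation(token))


tokens = "Q1 revenue reached €12 .7 billion .".split()
word_count = count_words(tokens)
print(word_count)
```

Note that a token like `.7` still counts as a word in this sketch because it contains a digit, and `€` is a currency symbol rather than punctuation.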

To replace internet acronyms in a dataframe using dictionary

I'm working on a text mining project where I'm trying to replace abbreviations, slang words and internet acronyms in text (a dataframe column) using a manually prepared dictionary.
The problem is that the code stops at the first word of the text in the dataframe column and does not replace abbreviations with the lookup words from the dictionary.
Here is the sample dictionary and the code I use:
abbr_dict = {"abt": "about", "b/c": "because"}

def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in abbr_dict:
            word = abbr_dict[word.lower()]
        new_words.append(word)
        new_text = " ".join(new_words)
        return new_text

df['new_text'] = df['text'].apply(_lookup_words)
Example Input:
df['text'] =
However, industry experts are divided abt whether a Bitcoin ETF is necessary or not.
Desired Output:
df['New_text'] =
However, industry experts are divided about whether a Bitcoin ETF is necessary or not.
Current Output:
df['New_text'] =
However
You can do this with a lambda that splits the text, looks up each word, and joins the result:
import pandas as pd
abbr_dict = {"abt": "about", "b/c": "because"}
df = pd.DataFrame({'text': ['However, industry experts are divided abt whether a Bitcoin ETF is necessary or not.']})
# Look up the lower-cased word so that "Abt" is replaced as well
df['new_text'] = df['text'].apply(
    lambda row: " ".join(abbr_dict[w.lower()] if w.lower() in abbr_dict else w
                         for w in row.split()))
Or, to fix the code above, I think you need to move the join for new_text and the return statement outside of the for loop:
def _lookup_words(input_text):
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in abbr_dict:
            word = abbr_dict[word.lower()]
        new_words.append(word)
    new_text = " ".join(new_words)  # ..... change here
    return new_text  # ..... change here also

df['new_text'] = df['text'].apply(_lookup_words)
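Splitting on whitespace misses abbreviations that have punctuation attached, such as "abt,". A minimal sketch of one way around that, assuming a regular expression built from the dictionary keys is acceptable; the function name `expand_abbreviations` and the example sentence are hypothetical:

```python
import re

abbr_dict = {"abt": "about", "b/c": "because"}

# Longest keys first so a longer abbreviation wins over a shorter prefix of it.
_keys = sorted(abbr_dict, key=len, reverse=True)
# The lookarounds require that the match is not glued to other word characters.
_pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in _keys) + r")(?!\w)",
    flags=re.IGNORECASE,
)


def expand_abbreviations(text):
    """Replace every known abbreviation, ignoring case and adjacent punctuation."""
    return _pattern.sub(lambda m: abbr_dict[m.group(0).lower()], text)


example = "Experts are divided abt this, b/c opinions differ. ABT, really?"
expanded = expand_abbreviations(example)
print(expanded)
```

This could be applied to the column with `df['text'].apply(expand_abbreviations)`. The replacement always uses the dictionary's casing, so "ABT" becomes "about".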

extract a sentence that contains a list of keywords or phrase using python

I have used the following code to extract a sentence from a file (the sentence should contain some or all of the search keywords):
search_keywords = ['mother', 'sing', 'song']
with open('text.txt', 'r') as in_file:
    text = in_file.read()
    sentences = text.split(".")
for sentence in sentences:
    if all(map(lambda word: word in sentence, search_keywords)):
        print(sentence)
The problem with the above code is that it does not print the required sentence if one of the search keywords does not match the sentence words. I want code that prints the sentences containing some or all of the search keywords. It would be great if the code could also search for a phrase and extract the corresponding sentence.
It seems like you want to count the number of search_keywords in each sentence. You can do this as follows:
sentences = "My name is sing song. I am a mother. I am happy. You sing like my mother".split(".")
search_keywords=['mother','sing','song']
for sentence in sentences:
    print("{} key words in sentence:".format(sum(1 for word in search_keywords if word in sentence)))
    print(sentence + "\n")
# Outputs:
#2 key words in sentence:
#My name is sing song
#
#1 key words in sentence:
# I am a mother
#
#0 key words in sentence:
# I am happy
#
#2 key words in sentence:
# You sing like my mother
Or if you only want the sentence(s) that have the most matching search_keywords, you can make a dictionary and find the maximum values:
dct = {}
for sentence in sentences:
    dct[sentence] = sum(1 for word in search_keywords if word in sentence)
best_sentences = [key for key,value in dct.items() if value == max(dct.values())]
print("\n".join(best_sentences))
# Outputs:
#My name is sing song
# You sing like my mother
If I understand correctly, you should be using any() instead of all().
if any(map(lambda word: word in sentence, search_keywords)):
    print(sentence)
So you want to find sentences that contain at least one keyword. You can use any() instead of all().
EDIT:
If you want to find the sentence which contains the most keywords:
sent_words = []
for sentence in sentences:
    sent_words.append(set(sentence.split()))
num_keywords = [len(sent & set(search_keywords)) for sent in sent_words]
# Find only one sentence
ind = num_keywords.index(max(num_keywords))
# Find all sentences with that number of keywords
ind = [i for i, x in enumerate(num_keywords) if x == max(num_keywords)]
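The question also asks about searching for a phrase. A minimal sketch of one way to treat keywords and phrases alike, assuming whole-word, case-insensitive matching is what is wanted; the function name `sentences_with_terms` and the sample text are hypothetical:

```python
import re


def sentences_with_terms(text, terms):
    """Return sentences containing at least one term as whole word(s)."""
    patterns = [
        re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for term in terms
    ]
    hits = []
    for sentence in text.split("."):
        sentence = sentence.strip()
        if sentence and any(p.search(sentence) for p in patterns):
            hits.append(sentence)
    return hits


text = "My mother sings. I like to sing a song. Nothing to see here."
found = sentences_with_terms(text, ["mother", "sing a song"])
print(found)
```

Unlike the substring test `word in sentence`, the `\b` boundaries mean that "sing" alone would not match "sings"; whether that is desirable depends on the data.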

Python NLTK: Count list of word and make probability with valid English words

I have a dirty document which includes invalid English words, numbers, etc.
I just want to take all valid English words and then calculate the ratio of my list of words to the total number of valid English words.
For example, if my document has the sentence:
sentence = ['eishgkej he might be a good person. I might consider this.']
I want to keep only the valid words, "he might be a good person. I might consider this", and count the occurrences of "might".
So the answer would be 2/10.
I am thinking about using the code below. However, instead of setting features[word] = 1, I need to store the count of each feature.
all_words = nltk.FreqDist(w.lower() for w in reader.words() if w.lower() not in english_sw)

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        if word in document_words:
            features[word] = 1
        else:
            features[word] = 0
    return features
According to the documentation you can use count(self, sample) to return the count of a word in a FreqDist object. So I think you want something like:
for word in word_features:
    if word in document_words:
        features[word] = all_words.count(word)
    else:
        features[word] = 0
Or you could use indexing, i.e. all_words[word] should return the same as all_words.count(word).
If you want the frequency of the word, you can use all_words.freq(word).
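The 2/10 ratio from the question can be reproduced with a small sketch that filters tokens against a vocabulary before counting. The function name `keyword_ratio` and the tiny hand-written vocabulary are assumptions for illustration; in practice the vocabulary might come from a word list such as `nltk.corpus.words.words()`.

```python
import re
from collections import Counter


def keyword_ratio(text, keywords, vocabulary):
    """Return (keyword occurrences, number of valid words) for the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    valid = [token for token in tokens if token in vocabulary]
    counts = Counter(valid)
    matched = sum(counts[keyword] for keyword in keywords)
    return matched, len(valid)


# Hand-written vocabulary, only large enough for this example.
vocabulary = {"he", "might", "be", "a", "good", "person", "i", "consider", "this"}
text = "eishgkej he might be a good person. I might consider this."

matched, total = keyword_ratio(text, ["might"], vocabulary)
ratio = matched / total if total else 0.0
print(matched, total, ratio)
```

The invalid token "eishgkej" is dropped before counting, so it affects neither the numerator nor the denominator.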

Python code flow does not work as expected?

I am trying to process various texts with regular expressions and Python's NLTK (see http://www.nltk.org/book). I am trying to create a random text generator and I am having a slight problem. First, here is my code flow:
1. Enter a sentence as input. This is called the trigger string and is assigned to a variable.
2. Get the longest word in the trigger string.
3. Search the whole Project Gutenberg corpus for sentences that contain this word, regardless of upper or lower case.
4. Return the longest sentence that contains the word searched for in step 3.
5. Append the sentence from step 4 to the sentence from step 1.
6. Assign the sentence from step 4 as the new trigger sentence and repeat the process. Note that I then have to get the longest word in this second sentence, and so on.
So far, I have been able to do this only once. When I try to keep this to continue, the program only keeps printing the first sentence my search yields. It should actually look for the longest word in this new sentence and keep applying my code flow described above.
Below is my code along with a sample input/output :
Sample input
"Thane of code"
Sample output
"Thane of code Norway himselfe , with terrible numbers , Assisted by that most disloyall Traytor , The Thane of Cawdor , began a dismall Conflict , Till that Bellona ' s Bridegroome , lapt in proofe , Confronted him with selfe - comparisons , Point against Point , rebellious Arme ' gainst Arme , Curbing his lauish spirit : and to conclude , The Victorie fell on vs"
Now this should actually take the sentence that starts with 'Norway himselfe....' and look for the longest word in it and do the steps above and so on but it doesn't. Any suggestions? Thanks.
import nltk
from nltk.corpus import gutenberg

triggerSentence = input("Please enter the trigger sentence: ")  # get input str
split_str = triggerSentence.split()  # split the sentence into words
longestLength = 0
longestString = ""
montyPython = 1
while montyPython:
    # code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)
    listOfSents = gutenberg.sents()  # all sentences of gutenberg are assigned (list of lists)
    listOfWords = gutenberg.words()  # all words in gutenberg books (list)
    # I tip my hat to Mr. Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower()  # lower-cased so the comparison below is case-insensitive
    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key=len)
    # get longest sentence (list format, with every word of the sentence as an element)
    longestSent = [longestSentence]
    for word in longestSent:  # convert the list longestSentence to an actual string
        sstr = " ".join(word)
    print(triggerSentence + " " + sstr)
    triggerSentence = sstr
How about this?
1. You find the longest word in the trigger.
2. You find the longest word in the longest sentence containing the word found in 1.
3. The word from 1. is the longest word of the sentence from 2.
What happens? Hint: the answer starts with "Infinite". To correct the problem, you may find a set of lower-cased words useful.
By the way, when do you expect montyPython to become False so that the program finishes?
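The hint above can be sketched as follows: remember which target words have already been used and stop when one repeats or the sentence no longer changes. This is only an illustration on a tiny in-memory corpus, not the Gutenberg data; the names `chain_sentences` and `longest_word` and the sample sentences are hypothetical.

```python
def longest_word(tokens):
    """Return the longest token (the first one wins on ties)."""
    return max(tokens, key=len)


def chain_sentences(trigger, corpus):
    """Follow longest-word links through `corpus` until nothing new is found."""
    seen = set()          # lower-cased target words already searched for
    chain = [trigger]
    tokens = trigger.split()
    while tokens:
        target = longest_word(tokens).lower()
        if target in seen:
            break         # the same word again: we would loop forever
        seen.add(target)
        matches = [s for s in corpus if target in (w.lower() for w in s)]
        if not matches:
            break
        best = max(matches, key=len)  # longest sentence by token count
        if best == tokens:
            break         # fixed point: the sentence leads back to itself
        tokens = best
        chain.append(" ".join(tokens))
    return chain


corpus = [
    ["The", "Thane", "of", "Cawdor", "began", "a", "dismall", "Conflict"],
    ["A", "short", "Conflict"],
    ["Cawdor", "fell"],
]
chain = chain_sentences("Thane of code", corpus)
print(chain)
```

Here the second sentence's longest word leads straight back to the same sentence, so the loop ends instead of printing it forever.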
Rather than searching the entire corpus each time, it may be faster to construct a single map from word to the longest sentence containing that word. Here's my (untested) attempt to do this.
import collections
from nltk.corpus import gutenberg

def words_in(sentence):
    """Generate all words in the sentence (lower-cased)"""
    for word in sentence.split():
        word = word.strip('.,"\'-:;')
        if word:
            yield word.lower()

def make_sentence_map(sentences):
    """Construct a map from words to the longest sentence containing the word."""
    result = collections.defaultdict(str)
    for tokens in sentences:
        # gutenberg.sents() yields lists of tokens, so join them into a string first
        sentence = " ".join(tokens)
        for word in words_in(sentence):
            if len(sentence) > len(result[word]):
                result[word] = sentence  # was `sent`, an undefined name
    return result

def generate_random_text(sentence, sentence_map):
    while True:
        yield sentence
        longest_word = max(words_in(sentence), key=len)
        sentence = sentence_map[longest_word]

sentence_map = make_sentence_map(gutenberg.sents())
for sentence in generate_random_text('Thane of code.', sentence_map):
    print(sentence)
Mr. Hankin's answer is more elegant, but the following is more in keeping with the approach you began with:
import sys
import nltk
from nltk.corpus import gutenberg

def longest_element(p):
    """return the first element of p which has the greatest len()"""
    max_len = 0
    elem = None
    for e in p:
        if len(e) > max_len:
            elem = e
            max_len = len(e)
    return elem

def downcase(p):
    """returns a list of words in p shifted to lower case"""
    return [word.lower() for word in p]

def unique_words():
    """it turns out unique_words was never referenced so this is here
    for pedagogy"""
    # there are 2.6 million words in the Gutenberg corpus but only ~42k unique
    # ignoring case, let's pare that down a bit
    words = set()  # the set has to exist before add() is called on it
    for word in gutenberg.words():
        words.add(word.lower())
    print('gutenberg.words() has', len(words), 'unique caseless words')
    return words

print('loading gutenburg corpus...')
sentences = []
for sentence in gutenberg.sents():
    sentences.append(downcase(sentence))

trigger = sys.argv[1:]
target = longest_element(trigger).lower()
last_target = None
while target != last_target:
    matched_sentences = []
    for sentence in sentences:
        if target in sentence:
            matched_sentences.append(sentence)
    print('===', target, 'matched', len(matched_sentences), 'sentences')
    longestSentence = longest_element(matched_sentences)
    print(' '.join(longestSentence))
    trigger = longestSentence
    last_target = target
    target = longest_element(trigger).lower()
Given your sample sentence, though, it reaches a fixed point in two cycles:
$ python nltkgut.py Thane of code
loading gutenburg corpus...
=== target thane matched 24 sentences
norway himselfe , with terrible
numbers , assisted by that most
disloyall traytor , the thane of
cawdor , began a dismall conflict ,
till that bellona ' s bridegroome ,
lapt in proofe , confronted him with
selfe - comparisons , point against
point , rebellious arme ' gainst arme
, curbing his lauish spirit : and to
conclude , the victorie fell on vs
=== target bridegroome matched 1 sentences
norway himselfe , with
terrible numbers , assisted by that
most disloyall traytor , the thane of
cawdor , began a dismall conflict ,
till that bellona ' s bridegroome ,
lapt in proofe , confronted him with
selfe - comparisons , point against
point , rebellious arme ' gainst arme
, curbing his lauish spirit : and to
conclude , the victorie fell on vs
Part of the trouble with the response to your previous question is that it did what you asked, but you asked a more specific question than the one you wanted answered. As a result, the response got bogged down in some rather complicated list expressions that I'm not sure you understood. I suggest making more liberal use of print statements and not importing code if you don't know what it does. While unwrapping the list expressions I found (as noted) that you never used the corpus word list. Functions also help.
You are assigning split_str outside of the loop, so it gets the original value and then keeps it. You need to assign it at the beginning of the while loop so that it changes each time. The longest-word trackers are reset there as well, so that a longer word from an earlier sentence does not carry over.
import nltk
from nltk.corpus import gutenberg

triggerSentence = input("Please enter the trigger sentence: ")  # get input str
montyPython = 1
while montyPython:
    # so this is run every time through the loop
    split_str = triggerSentence.split()  # split the sentence into words
    longestLength = 0  # reset for each new trigger sentence
    longestString = ""
    # code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)
    listOfSents = gutenberg.sents()  # all sentences of gutenberg are assigned (list of lists)
    listOfWords = gutenberg.words()  # all words in gutenberg books (list)
    # I tip my hat to Mr. Alex Martelli for this part, which helps me find the longest sentence
    lt = longestString.lower()  # lower-cased so the comparison below is case-insensitive
    longestSentence = max((listOfWords for listOfWords in listOfSents if any(lt == word.lower() for word in listOfWords)), key=len)
    # get longest sentence (list format, with every word of the sentence as an element)
    longestSent = [longestSentence]
    for word in longestSent:  # convert the list longestSentence to an actual string
        sstr = " ".join(word)
    print(triggerSentence + " " + sstr)
    triggerSentence = sstr
