I want to count the words in a text with Stanza. Punctuation should be excluded from the count, but I do not want to remove it from the text.
Currently I am trying:
text = """ Q1 revenue reached €12 .7 billion ."""
doc = nlp ( text )
words = doc.num_tokens
print(words)
8
Sorry if this is too basic, but I am very new to Stanza.
Could you please explain how I can count words without punctuation?
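If you don't want to remove the punctuation, you can pass the keyword argument tokenize_pretokenized=True to the pipeline.
This disables Stanza's tokenizer, so the text is split on whitespace only and punctuation attached to a word is no longer counted as a separate token. Note that a punctuation mark standing on its own between spaces (like the final " ." here) still counts as a token.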
import stanza
nlp = stanza.Pipeline(lang='en', tokenize_pretokenized=True)
text = """ Q1 revenue reached €12 .7 billion ."""
doc = nlp(text)
words = doc.num_tokens
print(words) # 7
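If the goal is to leave the text untouched and simply skip punctuation while counting, one option is to filter on the part-of-speech tag instead. The following is a minimal sketch, assuming the pipeline also runs the pos processor so that every word carries a upos attribute. count_non_punct is a hypothetical helper name, and the SimpleNamespace objects are hand-tagged stand-ins for a Stanza document so that the snippet runs without downloading a model.

```python
from types import SimpleNamespace


def count_non_punct(doc):
    """Count words whose universal POS tag (upos) is not PUNCT."""
    return sum(
        1
        for sentence in doc.sentences
        for word in sentence.words
        if word.upos != "PUNCT"
    )


# Stand-in for a parsed document with hand-assigned tags (assumption).
# With Stanza you would pass the object returned by nlp(text) instead,
# where nlp is a pipeline that includes the tokenize and pos processors.
fake_doc = SimpleNamespace(
    sentences=[
        SimpleNamespace(
            words=[
                SimpleNamespace(text="Revenue", upos="NOUN"),
                SimpleNamespace(text="grew", upos="VERB"),
                SimpleNamespace(text=".", upos="PUNCT"),
            ]
        )
    ]
)

print(count_non_punct(fake_doc))  # 2
```

This keeps the original text intact; only the counting step ignores punctuation.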
Related
I want to count how many times certain (swear) words appear in my database of tweets; a simple example is below. It does not give me the desired output because everything after the "#" in the text is treated as a comment, and I do not know how to solve this.
Thank you.
text = RT #JGalt09: #Trump never owed millions $$$ to the Bank of China. Another hoax from the #FakeNews media.
word_list = ['fakenews', 'hoax']
swearword_count = 0
text_swear_count = text.lower().replace('.,#?!', ' ').split()
for word in text_swear_count:
    if word in word_list:
        swearword_count += 1
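A minimal sketch of one way to approach this, under two assumptions: the tweet is stored as a quoted string (a "#" inside quotes is ordinary text, not a comment), and a regular expression is an acceptable way to split off punctuation. Note that str.replace('.,#?!', ' ') only replaces that exact five-character sequence, not each character individually.

```python
import re

# The tweet must be a quoted string; inside quotes '#' is just a character.
text = ("RT #JGalt09: #Trump never owed millions $$$ to the Bank of China. "
        "Another hoax from the #FakeNews media.")
word_list = {"fakenews", "hoax"}

# Keep runs of letters, digits and apostrophes; everything else separates words.
tokens = re.findall(r"[a-z0-9']+", text.lower())

swearword_count = sum(1 for token in tokens if token in word_list)
print(swearword_count)  # 2
```

Using a set for word_list keeps the membership test fast when the list of words grows.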
I need to delete all the proper nouns from the text.
result is the DataFrame.
I'm using TextBlob. Below is the code.
from textblob import TextBlob
strings = []
for col in result:
    for i in range(result.shape[0]):
        text = result[col][i]
        Txtblob = TextBlob(text)
        for word, pos in Txtblob.noun_phrases:
            print (word, pos)
        if tag != 'NNP'
            print(' '.join(edited_sentence))
It recognizes only one NNP.
To remove all words tagged with 'NNP' from the following text (from the documentation), you can do the following:
from textblob import TextBlob
# Sample text
text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''
blob = TextBlob(text)
# Create a list of words that are tagged with 'NNP'
# In this case it will only be 'Blob'
words_to_remove = [word for word, tag in blob.tags if tag == 'NNP']
# Remove the words from the sentence, using words_to_remove
edited_sentence = ' '.join([word for word in text.split(' ') if word not in words_to_remove])
# Show the result
print(edited_sentence)
Output:
# Notice the lack of the word 'Blob'
'\nThe titular threat of The has always struck me as the ultimate
movie\nmonster: an insatiably hungry, amoeba-like mass able to
penetrate\nvirtually any safeguard, capable of--as a doomed doctor
chillingly\ndescribes it--"assimilating flesh on contact.\nSnide
comparisons to gelatin be damned, it\'s a concept with the
most\ndevastating of potential consequences, not unlike the grey goo
scenario\nproposed by technological theorists fearful of\nartificial
intelligence run rampant.\n'
Comments for your sample
from textblob import TextBlob
strings = []  # This variable is not used anywhere
for col in result:
    for i in range(result.shape[0]):
        text = result[col][i]
        txt_blob = TextBlob(text)
        # txt_blob.noun_phrases returns a list of noun phrases.
        # To get the position of each phrase in that list, use the built-in function 'enumerate', like this:
        for pos, phrase in enumerate(txt_blob.noun_phrases):
            # Now you can print the position and the phrase
            print(pos, phrase)
            # This will give you something like the following:
            # 0 titular threat
            # 1 blob
            # 2 ultimate movie monster
        # The following line does not make any sense, because tag has not yet been assigned
        # and you are not iterating over the words from the previous step
        # (it is also missing the trailing colon)
        if tag != 'NNP'
            # You are not assigning anything to edited_sentence, so this would not work either.
            print(' '.join(edited_sentence))
Your sample with new code
from textblob import TextBlob
for col in result:
    for i in range(result.shape[0]):
        text = result[col][i]
        txt_blob = TextBlob(text)
        # Create a list of words that are tagged with 'NNP'
        # In this case it will only be 'Blob'
        words_to_remove = [word for word, tag in txt_blob.tags if tag == 'NNP']
        # Remove the words from the sentence, using words_to_remove
        edited_sentence = ' '.join([word for word in text.split(' ') if word not in words_to_remove])
        # Show the result
        print(edited_sentence)
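One caveat of splitting on spaces is that a proper noun followed directly by punctuation (for example "Blob,") will not match the entry in words_to_remove. A minimal sketch of one way around this is shown below: it strips surrounding punctuation from each token before comparing. remove_tagged_words is a hypothetical helper, and the hand-written sample_tags list stands in for txt_blob.tags so that the snippet runs without TextBlob installed.

```python
import string


def remove_tagged_words(text, tagged_words, tag_to_remove="NNP"):
    """Drop every whitespace-separated token whose bare word carries the given tag."""
    to_remove = {word for word, tag in tagged_words if tag == tag_to_remove}
    kept = [
        token
        for token in text.split()
        # Compare without leading/trailing punctuation, e.g. "Blob," -> "Blob"
        if token.strip(string.punctuation) not in to_remove
    ]
    return " ".join(kept)


sample = "I watched The Blob, then called Alice."
# Hand-assigned (word, tag) pairs; with TextBlob these would come from txt_blob.tags.
sample_tags = [
    ("I", "PRP"), ("watched", "VBD"), ("The", "DT"), ("Blob", "NNP"),
    ("then", "RB"), ("called", "VBD"), ("Alice", "NNP"),
]

print(remove_tagged_words(sample, sample_tags))  # I watched The then called
```

Note that this version removes the attached punctuation together with the word and collapses newlines into single spaces, which may or may not be what you want.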
I have a dataframe with around 200,000 rows, and each row has approximately 30 tokenized words. I am trying to fix spelling mistakes and then lemmatize the words.
Some words are not in the dictionary, so if a word's frequency is high enough I leave it as it is; otherwise I correct it.
from spellchecker import SpellChecker
from nltk.stem import WordNetLemmatizer

# freqDist is assumed to be a word-frequency mapping built earlier (not shown here)
spell = SpellChecker()
def spelling_mistake_corrector(word):
    checkedWord = spell.correction(word)
    if freqDist[checkedWord] >= freqDist[word]:
        word = checkedWord
    return word
def correctorForAll(text):
    text = [spelling_mistake_corrector(word) for word in text]
    return text
lemmatizer = WordNetLemmatizer()
def lemmatize_words(text):
    text = [lemmatizer.lemmatize(word) for word in text]
    text = [word for word in text if len(word) > 2]  # filtering 1 and 2 letter words out
    return text
def apply_corrector_and_lemmatizer(text):
    return lemmatize_words(correctorForAll(text))
df['tokenized'] = df['tokenized'].apply(apply_corrector_and_lemmatizer)
The thing is, this code has been running on Colab for 3 hours. What should I do to improve the run time? Thanks!
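One common way to reduce the run time in a situation like this is to avoid correcting the same word more than once: with 200,000 rows of about 30 words each, most words repeat many times. The sketch below illustrates the idea with functools.lru_cache. slow_correct is a hypothetical stand-in for spelling_mistake_corrector, and the tiny rows list stands in for the tokenized column; how much time this saves on the real data is not something the sketch can show.

```python
from functools import lru_cache

calls = []  # records every word that actually reaches the slow function


def slow_correct(word):
    """Stand-in for an expensive per-word correction (hypothetical)."""
    calls.append(word)
    return {"teh": "the", "wrod": "word"}.get(word, word)


# Wrap the slow function so each distinct word is corrected only once.
fast_correct = lru_cache(maxsize=None)(slow_correct)

rows = [["teh", "cat"], ["teh", "wrod"], ["cat", "wrod"]]
corrected = [[fast_correct(word) for word in row] for row in rows]

print(corrected)   # [['the', 'cat'], ['the', 'word'], ['cat', 'word']]
print(len(calls))  # 3 distinct words were corrected, not 6
```

The same wrapper could be applied to the lemmatizer call, since lemmatizing a given word always returns the same result.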
I am currently attempting to search through multiple PDFs for certain pieces of equipment. I have figured out how to parse the PDF file in Python along with the equipment list, but I am having trouble with the actual search function. The best approach I found online was to tokenize the text and then search through it with the keywords (code below). Unfortunately, some of the equipment names are multiple words long, so those names get tokenized into individually meaningless words like "blue" and "evaporate", which occur many times in the text and thus saturate the results. The only workaround I have thought of is to look only for the unique words in the equipment names and remove the more common ones, but I was wondering if there is a more elegant solution, as even the unique words tend to produce multiple false hits per document.
Mainly, I am looking for a way to search through a text file for multi-word phrases such as "Blue Transmitter 3", without splitting that phrase into ["Blue", "Transmitter", "3"].
Here is what I have so far:
import PyPDF2
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
import re
# open up pdf and get text
pdfName = 'example.pdf'
read_pdf = PyPDF2.PdfFileReader(pdfName)
text = ""
for i in range(read_pdf.getNumPages()):
    page = read_pdf.getPage(i)
    text += "Page No - " + str(1+read_pdf.getPageNumber(page)) + "\n"
    page_content = page.extractText()
    text += page_content + "\n"
# tokenize pdf text
tokens = word_tokenize(text)
punctuations = ['(',')',';',':','[',']',',','.']
stop_words = stopwords.words('english')
keywords = [word for word in tokens if not word in stop_words and not word in punctuations]
# take out the endline symbol and join the whole equipment data set into one long string
lines = [line.rstrip('\n') for line in open('equipment.txt')]
totalEquip = " ".join(lines)
tokens = word_tokenize(totalEquip)
trash = ['Black', 'furnace', 'Evaporation', 'Evaporator', '500', 'Chamber', 'A']
searchWords = [word for word in tokens if not word in stop_words and not word in punctuations and not word in trash]
# compare every equipment word against every keyword found in the pdf
for i in searchWords:
    for word in keywords:
        if i.lower() in word.lower():
            print(i)
            print(word + "\n")
Any help or ideas y'all might have would be much appreciated.
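One possible approach to the phrase search described above is to skip tokenization and match each full equipment name against the text with a regular expression. The following is a minimal sketch, assuming the equipment names are available as a list of strings (one per line of equipment.txt) and that matching should ignore case and line breaks. find_phrases is a hypothetical helper name, and the sample text is made up for illustration.

```python
import re


def find_phrases(text, phrases):
    """Return {phrase: number of occurrences} for phrases found as whole units."""
    # Collapse line breaks and repeated spaces, which PDF extraction often introduces.
    normalized = " ".join(text.split())
    hits = {}
    for phrase in phrases:
        words = [re.escape(part) for part in phrase.split()]
        # (?<!\w) and (?!\w) make sure the phrase is not part of a longer word.
        pattern = r"(?<!\w)" + r"\s+".join(words) + r"(?!\w)"
        count = len(re.findall(pattern, normalized, flags=re.IGNORECASE))
        if count:
            hits[phrase] = count
    return hits


sample_text = (
    "The Blue Transmitter 3 sits near the blue\n"
    "transmitter 3. A blue evaporator is elsewhere."
)
equipment = ["Blue Transmitter 3", "Evaporation Chamber"]

print(find_phrases(sample_text, equipment))  # {'Blue Transmitter 3': 2}
```

Because the whole name has to match, a lone "blue" elsewhere in the document no longer produces a hit.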
I have a dataframe with a 'description' column containing details about the product. Each description in the column is a long paragraph, like:
"This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much"
How do I locate/extract the sentence which has the phrase "superb product", and place it in a new column?
So for this case the result will be
expected output
I have tried this:
searched_words=['superb product','SUPERB PRODUCT']
print(df['description'].apply(lambda text: [sent for sent in sent_tokenize(text)
                                            if any(True for w in word_tokenize(sent)
                                                   if stemmer.stem(w.lower()) in searched_words)]))
The output is not what I need, although it works if I put just one word in the searched_words list.
There are many ways to do this. #ChootsMagoots gave you a good answer, but spaCy is also efficient: you can define a pattern that leads you to the sentence. Before that, you need to define a function that sets the sentence boundaries. Here is the code:
import spacy
from spacy.language import Language
from spacy.matcher import Matcher

# Written against the spaCy v3 API (custom components are registered by name)
@Language.component("product_sentencizer")
def product_sentencizer(doc):
    ''' Look for sentence start tokens by scanning for periods only. '''
    for i, token in enumerate(doc[:-2]):  # The last token cannot start a sentence
        if token.text == ".":
            doc[i+1].is_sent_start = True
        else:
            doc[i+1].is_sent_start = False  # Tell the default sentencizer to ignore this token
    return doc
nlp = spacy.load('en_core_web_sm', disable=['ner'])
nlp.add_pipe("product_sentencizer", before="parser")  # Insert before the parser can build its own sentences
text = "This is a superb product. I so so loved this superb product that I wanna gift to all. This is like the quality and packaging. I like it very much."
doc = nlp(text)
matcher = Matcher(nlp.vocab)
# One dict per token; LOWER makes the match case-insensitive
pattern = [{'LOWER': 'superb'}, {'LOWER': 'product'}]
matcher.add("SUPERB_PRODUCT", [pattern])
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
    print(matched_span.sent)
Assuming the paragraphs are neatly formatted into sentences with ending periods, something like:
for index, paragraph in df['column_name'].items():
    for sentence in paragraph.split('.'):
        if 'superb prod' in sentence:
            print(sentence)
            # .loc avoids chained assignment; a later match overwrites an earlier one
            df.loc[index, 'extracted_sentence'] = sentence
This is going to be quite slow, but idk if there's a better way.
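A slightly more general version of the same idea can be written as a plain function and then applied to the column. This is a minimal sketch, assuming sentences end with ".", "!" or "?" followed by whitespace and that the match should ignore case. sentences_with_phrase is a hypothetical helper name.

```python
import re


def sentences_with_phrase(paragraph, phrase):
    """Return every sentence of the paragraph that contains the phrase."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [s for s in sentences if phrase.lower() in s.lower()]


description = (
    "This is a superb product. I so so loved this superb product that I "
    "wanna gift to all. This is like the quality and packaging. "
    "I like it very much"
)

matches = sentences_with_phrase(description, "superb product")
print(matches)
# ['This is a superb product.',
#  'I so so loved this superb product that I wanna gift to all.']
```

With pandas, something like df['description'].apply(lambda t: sentences_with_phrase(t, 'superb product')) would put the list of matching sentences for each row into a new column.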