train_data = ["Consultant, changing, Waiting"]
I'm trying to apply the stemmer to the data with the following code, but it keeps the original data:
stemmer = stem.porter.PorterStemmer()
train_stemmer = train_data
for i in range(len(train_stemmer)):
    train_stemmer[i] = stemmer.stem(train_stemmer[i])
The code runs fine but does not produce my expected result, which is:
["Consult, change, Wait"]
Two things jump out:
train_data in your question is a list containing one comma-separated string (and so is your expected result ["Consult, change, Wait"]), rather than a list of three separate strings ["Consult", "change", "Wait"]
Stemming converts to lowercase automatically
If you intended for the list to contain one string, this should work fine:
from nltk.stem import porter
stemmer = porter.PorterStemmer()
# List of one string
string_in_list = ["Consult, change, Wait"]
for word in string_in_list:
    print(stemmer.stem(word))
print("----")
If you wanted a list of three strings, quote each word separately:
# List of three strings
individual_words = ["Consult", "change", "Wait"]
for word in individual_words:
    print(stemmer.stem(word))
print("----")
Handling the upper vs. lowercase at the start of the word requires passing a parameter, but can make sense if you're trying to handle proper nouns (e.g. distinguish stemmed change from the name Chang).
# Stem but do not convert first character to lowercase
for word in individual_words:
    print(stemmer.stem(word, to_lowercase=False))
Expected output when all three run:
consult, change, wait
----
consult
chang
wait
----
Consult
chang
Wait
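A minimal sketch that applies the same ideas to your original train_data: split its single string on ", ", stem each piece without lowercasing the first character, and rejoin (note that the Porter algorithm yields chang rather than change):
from nltk.stem import porter

stemmer = porter.PorterStemmer()
train_data = ["Consultant, changing, Waiting"]

# Split each comma-separated string, stem the individual words, rejoin
stemmed = [
    ", ".join(stemmer.stem(word, to_lowercase=False) for word in item.split(", "))
    for item in train_data
]
print(stemmed)  # ['Consult, chang, Wait']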
I would like to first extract repeating n-grams from within a single sentence using Gensim's Phrases, then use those to get rid of duplicates within sentences. Like so:
Input: "Testing test this test this testing again here testing again here"
Desired output: "Testing test this testing again here"
My code seemed to work for generating up to 5-grams from multiple sentences, but whenever I pass it a single sentence (even a list full of the same sentence) it doesn't work. If I pass a single sentence, it splits the words into characters. If I pass a list full of the same sentence, it detects nonsense like non-repeating words while not detecting the repeating words.
I thought my code was working because I used about 30MB of text and it produced very intelligible n-grams up to n=5 that seemed to correspond to what I expected. I have no idea how to tell its precision and recall, though. Here is the full function, which generates all n-grams from 2 to n:
def extract_n_grams(documents, maximum_number_of_words_per_group=2, threshold=10, minimum_count=6, should_print=False, should_use_keywords=False):
    from gensim.models import Phrases
    from gensim.models.phrases import Phraser

    tokens = [doc.split(" ") for doc in documents] if type(documents) == list else [documents.split(" ") for _ in range(100)]  # this is what I tried
    final_n_grams = []
    for current_n in range(maximum_number_of_words_per_group - 1):
        n_gram = Phrases(tokens, min_count=minimum_count, threshold=threshold, connector_words=connecting_words)
        n_gram_phraser = Phraser(n_gram)
        resulting_tokens = []
        for token in tokens:
            resulting_tokens.append(n_gram_phraser[token])
        current_n_gram_final = []
        for token in resulting_tokens:
            for word in token:
                if '_' in word:
                    # no n_gram should have a comma between words
                    if ',' not in word:
                        word = word.replace('_', ' ')
                        if word not in current_n_gram_final and all([word not in gram for gram in final_n_grams]):
                            current_n_gram_final.append(word)
        tokens = n_gram[tokens]
        final_n_grams.append(current_n_gram_final)
In addition to repeating the sentence in the list, I also tried using NLTK's word_tokenize as suggested here. What am I doing wrong? Is there an easier approach?
The Gensim Phrases class is designed to statistically detect when certain pairs of words appear so often together, compared to independently, that it might be useful to combine them into a single token.
As such, it's unlikely to be helpful for your example task, of eliminating the duplicate 3-word ['testing', 'again', 'here'] run-of-tokens.
First, it never eliminates tokens – only combines them. So, if it saw the couplet ['again', 'here'] appearing very often together, rather than as separate 'again' and 'here' tokens, it'd turn it into 'again_here' – not eliminate it.
But second, it does these combinations not for every repeated n-token grouping, but only if the large amount of training data implies, based on the threshold configured, that certain pairs stick out. (And it only goes beyond pairs if run repeatedly.) Your example 3-word grouping, ['testing', 'again', 'here'], does not seem likely to stick out as a composition of extra-likely pairings.
If you have a more rigorous definition of which tokens/runs-of-tokens need to be eliminated, you'd probably want to run other Python code on the lists-of-tokens to enforce that de-duplication. Can you describe in more detail, perhaps with more examples, the kinds of n-grams you want removed? (Will they only be at the beginning or end of a text, or also the middle? Do they have to be next-to each other, or can they be spread throughout the text? Why are such duplicates present in the data, & why is it thought important to remove them?)
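To make the first point concrete, here is a small illustrative sketch on a toy corpus (not your data); the exact pairs that get joined depend on the frequency scoring, but nothing is ever dropped:
from gensim.models import Phrases

# Toy corpus: 'again here' co-occurs in every sentence, while 'testing'
# is followed by varied words.
corpus = [['testing', w, 'again', 'here']
          for w in ('alpha', 'beta', 'gamma', 'delta', 'epsilon')] * 10

bigrams = Phrases(corpus, min_count=5, threshold=0.1)  # low threshold so the toy corpus triggers a merge
print(bigrams[['testing', 'again', 'here', 'testing', 'again', 'here']])
# likely ['testing', 'again_here', 'testing', 'again_here'] -- the frequent
# pair is merged into one token, but the repeated run is still there.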
Update: Based on the comments about the real goal, a few lines of Python that check, at each position in a token-list, whether the next N tokens match the previous N tokens (and thus can be ignored) should do the trick. For example:
def elide_repeated_ngrams(list_of_tokens):
    return_tokens = []
    i = 0
    while i < len(list_of_tokens):
        for candidate_len in range(len(return_tokens)):
            if list_of_tokens[i:i+candidate_len] == return_tokens[-candidate_len:]:
                i = i + candidate_len  # skip the repeat
                break  # begin fresh forward repeat-check
        else:
            # this token not part of any repeat; include & proceed
            return_tokens.append(list_of_tokens[i])
            i += 1
    return return_tokens
On your test case:
>>> elide_repeated_ngrams("Testing test this test this testing again here testing again here".split())
['Testing', 'test', 'this', 'testing', 'again', 'here']
I'm trying to remove punctuations from a tokenized text in python like so:
word_tokens = nltk.word_tokenize(text)
w = word_tokens
for e in word_tokens:
    if e in punctuation_marks:
        w.remove(e)
This works somewhat: I manage to remove a lot of the punctuation marks, but for some reason many punctuation marks are still left in word_tokens.
If I run the code another time, it again removes some more of the punctuations. After running the same code 3 times all the marks are removed. Why does this happen?
It doesn't seem to matter whether punctuation_marks is a list, a string or a dictionary. I've also tried iterating over word_tokens.copy(), which does a bit better: it removes almost all marks the first time, and all of them the second time.
Is there a simple way to fix this problem so that it is sufficient to run the code only once?
You are removing elements from the same list that you are iterating over. It seems that you are aware of the potential problem; that's why you added the line:
w = word_tokens
However, that line doesn't actually create a copy of the object referenced by word_tokens, it only makes w reference the same object. In order to create a copy you can use the slicing operator, replacing the above line by:
w = word_tokens[:]
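Putting that fix into the original loop, a minimal sketch (the sample text and the contents of punctuation_marks are placeholders here, and the NLTK tokenizer data is assumed to be installed):
import string
import nltk

text = "Hello, world! How are you today?"   # placeholder input
punctuation_marks = set(string.punctuation)  # any container works for the membership test

word_tokens = nltk.word_tokenize(text)
w = word_tokens[:]            # a real copy, not just a second name for the same list
for e in word_tokens:         # iterate over the original...
    if e in punctuation_marks:
        w.remove(e)           # ...and remove from the copy

print(w)  # ['Hello', 'world', 'How', 'are', 'you', 'today']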
Why don't you add the tokens that are not punctuation marks instead?
word_tokens = nltk.word_tokenize(text)
w = list()
for e in word_tokens:
    if e not in punctuation_marks:
        w.append(e)
Suggestions:
I see you are creating word tokens. If that's the case, I would suggest you remove punctuation before tokenizing the text. You can use the str.translate method together with string.punctuation, both already available in the standard library.
# Import the library
import string
# Build a translation table that removes punctuation
tr = str.maketrans("", "", string.punctuation)
# Remove punctuation
text = text.translate(tr)
# Get the word tokens
word_tokens = nltk.word_tokenize(text)
If you want to do sentence tokenization, then you may do something like the below:
from nltk.tokenize import sent_tokenize
texts = sent_tokenize(text)
for i in range(0, len(texts)):
    texts[i] = texts[i].translate(tr)
I suggest you try a regex and append your results to a new list instead of manipulating word_tokens directly:
import re

word_tokens = nltk.word_tokenize(text)
w_ = list()
for e in word_tokens:
    w_.append(re.sub('[.!?\\-]', '', e))
You are modifying the actual word_tokens while iterating over it, which is the problem.
For instance, say you have something like A?!B, indexed as A:0, ?:1, !:2, B:3. Your for loop has an internal counter (say i) that increases on each iteration. Say you remove the ? at i=1: the list indexes shift back (new indexes: A:0, !:1, B:2) while the counter still increments to i=2. So the ! character gets skipped!
Best not to mess with the original list; simply copy what you want to keep into a new one.
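A tiny demonstration of that skipping effect (an illustrative toy, not the asker's code):
tokens = ['A', '?', '!', 'B']
for t in tokens:          # iterating over and mutating the same list
    if t in ('?', '!'):
        tokens.remove(t)

print(tokens)  # ['A', '!', 'B'] -- the '!' was skipped because the indexes shifted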
I need to add further conditions when cleaning the data, which include removing stopwords, days of the week, months, and numbers.
For days of the week and months I created separate lists (I do not know if there is an already built-in package in Python that includes them). For numbers I would consider isdigit.
So something like this:
days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
# need to put into lower case
months=['January','February','March', 'April','May','June','July','August','September','October','November','December']
# need to put into lower case
cleaned = [w for w in remove_punc.split() if w.lower() not in stopwords.words('english')]
How could I include these in the code above? I know it is a matter of extra if statements to take into account, but I am struggling with it.
You could convert all your lists to sets and take their union for the final set. Then it's only about checking the membership of your word in the set. Something like the following would work:
# existing code
from nltk.corpus import stopwords
days=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
# need to put into lower case
months=['January','February','March', 'April','May','June','July','August','September','October','November','December']
# need to put into lower case
# add these lines
stop_words = set(stopwords.words('english'))
lowercase_days = {item.lower() for item in days}
lowercase_months = {item.lower() for item in months}
exclusion_set = lowercase_days.union(lowercase_months).union(stop_words)
# now do the final check
cleaned = [w for w in remove_punc.split() if w.lower() not in exclusion_set and not w.isdigit()]
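For example, with a throwaway string standing in for remove_punc (which comes from the asker's earlier preprocessing and is not shown here):
remove_punc = "We met on Monday 21 January and again in March"

cleaned = [w for w in remove_punc.split()
           if w.lower() not in exclusion_set and not w.isdigit()]
print(cleaned)  # ['met'] -- stopwords, day/month names and digits are all dropped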
The following code is for an assignment that asks that a string of sentences be entered by the user and that the beginning of each sentence be capitalized by a function.
For example, if a user enters: 'hello. these are sample sentences. there are three of them.'
The output should be: 'Hello. These are sample sentences. There are three of them.'
I have created the following code:
def main():
    sentences = input('Enter sentences with lowercase letters: ')
    capitalize(sentences)

#This function capitalizes the first letter of each sentence
def capitalize(user_sentences):
    sent_list = user_sentences.split('. ')
    new_sentences = []
    count = 0
    for count in range(len(sent_list)):
        new_sentences = sent_list[count]
        new_sentences = (new_sentences +'. ')
        print(new_sentences.capitalize())
main()
This code has two issues that I am not sure how to correct. First, it prints each sentence as a new line. Second, it adds an extra period at the end. The output from this code using the sample input from above would be:
Hello.
These are sample sentences.
There are three of them..
Is there a way to format the output to be one line and remove the final period?
The following works for reasonably clean input:
>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(x.capitalize() for x in s.split('. '))
'Hello. These are sample sentences. There are three of them.'
If there is more varied whitespace around the full-stop, you might have to use some more sophisticated logic:
>>> '. '.join(x.strip().capitalize() for x in s.split('.'))
This normalizes the whitespace, which may or may not be what you want.
def main():
    sentences = input('Enter sentences with lowercase letters: ')
    capitalizeFunc(sentences)

def capitalizeFunc(user_sentences):
    sent_list = user_sentences.split('. ')
    print(".".join((i.capitalize() for i in sent_list)))

main()
Output:
Enter sentences with lowercase letters: "hello. these are sample sentences. there are three of them."
Hello.These are sample sentences.There are three of them.
I think this might be helpful:
>>> sentence = input()
>>> '. '.join(map(lambda s: s.strip().capitalize(), sentence.split('.')))
>>> s = 'hello. these are sample sentences. there are three of them.'
>>> '. '.join(map(str.capitalize, s.split('. ')))
'Hello. These are sample sentences. There are three of them.'
This code has two issues that I am not sure how to correct. First, it prints each sentence as a new line.
That’s because you’re printing each sentence with a separate call to print. By default, print adds a newline. If you don’t want it to, you can override what it adds with the end keyword parameter. If you don’t want it to add anything at all, just use end=''
Second, it adds an extra period at the end.
That’s because you’re explicitly adding a period to every sentence, including the last one.
One way to fix this is to keep track of the index as well as the sentence as you’re looping over them—e.g., with for index, sentence in enumerate(sentences):. Then you only add the period if index isn’t the last one. Or, slightly more simply, you add the period at the start, if the index is anything but zero.
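A minimal sketch of that second variant, combining it with the end='' fix from above (my rendering, using the sent_list from the question):
# Print the '. ' separator before every sentence except the first,
# so no extra period lands at the end.
for index, sentence in enumerate(sent_list):
    if index != 0:
        print('. ', end='')
    print(sentence.capitalize(), end='')
print()  # finish the single output line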
However, there's a better way out of both of these problems. You split the string into sentences by splitting on '. '. You can join those sentences back into one big string by doing the exact opposite:
sentences = '. '.join(sentences)
Then you don’t need a loop (there’s one hidden inside join of course), you don’t need to worry about treating the last or first one special, and you only have one print instead of a bunch of them so you don’t need to worry about end.
A different trick is to put the cleverness of print to work for you instead of fighting it. Not only does it add a newline at the end by default, it also lets you print multiple things and adds a space between them by default. For example, print(1, 2, 3) or, equivalently, print(*[1, 2, 3]) will print out 1 2 3. And you can override that space separator with anything else you want. So you can print(*sentences, sep='. ', end='') to get exactly what you want in one go. However, this may be a bit opaque or over-clever to people reading your code. Personally, whenever I can use join instead (which is usually), I do that even though it’s a bit more typing, because it makes it more obvious what’s happening.
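Putting the join suggestion together, here is a sketch of the fixed function (assembled from the suggestions above, not code from the original question):
def capitalize(user_sentences):
    sent_list = user_sentences.split('. ')
    # capitalize each sentence, then join them back with the same separator
    print('. '.join(sentence.capitalize() for sentence in sent_list))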
As a side note, a bit of your code is misleading:
new_sentences = []
count = 0
for count in range(len(sent_list)):
    new_sentences = sent_list[count]
    new_sentences = (new_sentences +'. ')
    print(new_sentences.capitalize())
The logic of that loop is fine, but it would be a lot easier to understand if you called the one-new-sentence variable new_sentence instead of new_sentences, and didn’t set it to an empty list at the start. As it is, the reader is led to expect that you’re going to build up a list of new sentences and then do something with it, but actually you just throw that list away at the start and handle each sentence one by one.
And, while we’re at it, you don’t need count here; just loop over sent_list directly:
for sentence in sent_list:
    new_sentence = sentence + '. '
    print(new_sentence.capitalize())
This does the same thing as the code you had, but I think it's easier to tell at a quick glance that it does so.
(Of course you still need the fixes for your two problems.)
Use nltk.sent_tokenize to tokenize the string into sentences. And capitalize each sentence, and join them again.
A sentence doesn't always end with a .; there can be other things too, like a ? or !. Also, three consecutive dots (...) don't end a sentence. sent_tokenize handles all of these.
from nltk.tokenize import sent_tokenize

def capitalize(user_sentences):
    sents = sent_tokenize(user_sentences)
    capitalized_sents = [sent.capitalize() for sent in sents]
    joined_ = ' '.join(capitalized_sents)
    print(joined_)
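For example, with the sample input from the question (this assumes the punkt tokenizer data is available, e.g. via nltk.download('punkt')):
capitalize('hello. these are sample sentences. there are three of them.')
# Hello. These are sample sentences. There are three of them.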
The reason your sentences were being printed on separate lines was that print always ends its output with a newline, so printing the sentences separately in a loop puts each one on its own line. You should print them all at once after joining them, or specify end='' in the print call so it doesn't add a newline after each sentence.
The second issue, the output ending with an extra period, is because you're appending '. ' to every sentence. The good thing about sent_tokenize is that it doesn't remove '.', '?', etc. from the ends of the sentences, so you don't have to append '. ' manually again. Instead, you can just join the sentences with a space character, and you'll be good to go.
If you get an error for nltk not being recognized, you can install it by running pip install nltk on the terminal/cmd.
I'm looking to count the number of words per sentence, calculate the mean words per sentence, and put that info into a CSV file. Here's what I have so far. I probably just need to know how to count the number of words before a period. I might be able to figure it out from there.
#Read the data in the text file as a string
with open("PrideAndPrejudice.txt") as pride_file:
pnp = pride_file.read()
#Change '!' and '?' to '.'
for ch in ['!','?']:
    if ch in pnp:
        pnp = pnp.replace(ch,".")
#Remove the period after Dr., Mr., Mrs. (choosing not to include etc. as that often ends a sentence, although it can also be in the middle)
pnp = pnp.replace("Dr.","Dr")
pnp = pnp.replace("Mr.","Mr")
pnp = pnp.replace("Mrs.","Mrs")
To split a string into a list of strings on some character:
pnp = pnp.split('.')
Then we can split each of those sentences into a list of strings (words)
pnp = [sentence.split() for sentence in pnp]
Then we get the number of words in each sentence
pnp = [len(sentence) for sentence in pnp]
Then we can use statistics.mean to calculate the mean:
statistics.mean(pnp)
To use statistics you must put import statistics at the top of your file. If you don't recognize the ways I'm reassigning pnp, look up list comprehensions.
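Put together, the whole calculation looks something like this sketch (empty chunks, such as the one after the final period, are filtered out so they don't drag the mean down):
import statistics

with open("PrideAndPrejudice.txt") as pride_file:
    pnp = pride_file.read()

# ... the '!'/'?'/'Dr.'/'Mr.'/'Mrs.' replacements from the question go here ...

word_counts = [len(chunk.split()) for chunk in pnp.split('.') if chunk.strip()]
print(statistics.mean(word_counts))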
You might be interested in the split() function for strings. It seems like you're editing your text to make sure all sentences end in a period and every period ends a sentence.
Thus,
pnp.split('.')
is going to give you a list of all sentences. Once you have that list, for each sentence in the list,
sentence.split() # i.e., split according to whitespace by default
will give you a list of words in the sentence.
Is that enough of a start?
You can try the code below.
numbers_per_sentence = [len(element) for element in (element.split() for element in pnp.split("."))]
mean = sum(numbers_per_sentence)/len(numbers_per_sentence)
However, for real natural language processing I would probably recommend a more robust solution such as NLTK. The text manipulation you perform (replacing "?" and "!", removing the periods after "Dr.", "Mr." and "Mrs.") is probably not enough to be 100% sure that a period is always a sentence separator (and that there are no other sentence separators in your text), even if that happens to be true for Pride and Prejudice.
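For reference, a sketch of that more robust NLTK route (it needs the punkt tokenizer data, and word_tokenize also counts punctuation as tokens, so the numbers will differ slightly from a plain split):
import statistics
from nltk.tokenize import sent_tokenize, word_tokenize

with open("PrideAndPrejudice.txt") as pride_file:
    pnp = pride_file.read()

counts = [len(word_tokenize(sentence)) for sentence in sent_tokenize(pnp)]
print(statistics.mean(counts))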