Extract non-content English language words from a string - python [duplicate]

This question already has answers here:
How to remove stop words using nltk or python
(13 answers)
Closed 8 years ago.
I am working on a Python script in which I want to remove common English words like "the", "an", "and", "for" and many more from a string. Currently I have made a local list of all such words, and I just call remove() to remove them from the string. But I would like a more Pythonic way to achieve this. I have read about nltk and wordnet but am totally clueless about whether that is what I should use, and how to use it.
Edit
Well, I don't understand why this was marked as a duplicate, as my question does not in any way imply that I already knew about stop words and just wanted to know how to use them. The question was about what I could use in my scenario, and the answer to that turned out to be stop words; but when I posted this question I didn't know anything about stop words.

Do this.
vocabulary = set(english_dictionary)
unique_words = [word for word in source_text.split() if word not in vocabulary]
It is as simple and efficient as can be. If you don't need the positions of the unique words, make them a set too! The in operator is extremely fast on sets (and slow on lists and other containers).
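For illustration, here is the same idea with a toy word list (english_dictionary above stands in for whatever dictionary word list you have on hand):
english_dictionary = {"the", "an", "and", "for", "is", "made", "it's", "fast"}
source_text = "an elevator is made for five people and it's fast"

vocabulary = set(english_dictionary)
unique_words = [word for word in source_text.split() if word not in vocabulary]
print(unique_words)  # ['elevator', 'five', 'people']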

This will also work:
yourString = "an elevator is made for five people and it's fast"
wordsToRemove = ["the ", "an ", "and ", "for "]
for word in wordsToRemove:
    yourString = yourString.replace(word, "")

I have found that what I was looking for is this:
from nltk.corpus import stopwords
my_stop_words = stopwords.words('english')
Now I can remove or replace words in my list/string wherever I find a match in my_stop_words, which is a list.
For this to work I had to download NLTK for Python and then, using its downloader, I downloaded the stopwords package.
It also contains many other packages which can be used in different situations for NLP, like words, brown, wordnet, etc.
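A minimal sketch of that filtering step, assuming the stopwords corpus has already been fetched with nltk.download('stopwords'):
from nltk.corpus import stopwords

my_stop_words = set(stopwords.words('english'))  # a set makes membership tests fast
text = "the quick brown fox jumps over the lazy dog"
filtered = [w for w in text.split() if w.lower() not in my_stop_words]
print(filtered)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']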

Finding occurrences in multiple sentences

I'm new to Python; however, after scouring the internet and going back over my study, I cannot seem to find how to find duplicates of a word within multiple sentences. My aim is to count how many times the word "python" occurs within these strings. I have tried the split() method and count("python"), and even tried to make a dictionary and word counter, which I was initially taught to do as part of the basics, but nothing in my study has shown me anything similar to this before. I need to be able to display the frequency of the word: python occurs 4 times. Any help would be very appreciated.
python_occurs = ["welcome to our Python program", "Python is my favorite language!", "I am afraid of Pythons", "I love Python"]
A straightforward approach is to iterate over every word using split(). Each word is converted to lowercase, and the number of times "python" occurs in it is counted using count().
I suspect the reason your approach did not work is that you forgot to change the letters to lowercase.
python_occurs = ["welcome to our Python program", "Python is my favorite language!", "I am afraid of Pythons", "I love Python"]
count = 0
for sentence in python_occurs:
    for word in sentence.split():
        # lower() is necessary because we want to be case-insensitive
        count += word.lower().count("python")
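For reference, the same count can be written as a one-liner with sum() and a generator expression; this sketch is equivalent to the loop above:
count = sum(sentence.lower().count("python") for sentence in python_occurs)
print("python occurs %d times" % count)  # python occurs 4 times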

Split string on one use of a word and not the other [duplicate]

This question already has answers here:
Python split string based on conditional
(2 answers)
Closed 4 years ago.
Fairly novice here. I'm looking for an effective way to use split() to split a string after a certain word.
I'm working on a voice-controlled filter using the Csound API in Python. Say my input command is "Set cutoff to 440"; I'd like to split the string after the word "to", basically meaning that I can say the command however I like and it will still find the frequency I'm looking for. I hope that makes sense.
So at the moment, my code for this is
string = "set cutoff to 440"
split = string.split("to")
print(split)
and my output is
['set', 'cu', 'ff', '440']
The problem is the 'to' in 'cutoff'. I know I could fix this by just changing "cutoff" to "frequency", but that seems like giving in too easily. My suspicion is that there's a way to do this with regular expressions, but I could easily be wrong. Any advice would be really helpful, and I hope my post adhered to all the guidelines and such; I'm pretty new to Stack Overflow.
An easy way to do this is to split on the word "to" surrounded by spaces:
string = "set cutoff to 440"
split = string.split(" to ")
print(split)
returns
['set cutoff', '440']
Using regex to do this is much less efficient than a simple split on the word surrounded by spaces.
If you want to use regex for other reasons, here is how to do it: you can find all runs of non-whitespace characters:
import re
string = "set cutoff to 440"
split = re.findall(r'\S+', string)
print(split)
returns
['set', 'cutoff', 'to', '440']
from jamylak on this post: Split string based on a regular expression
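If you do want a regex, splitting on a word boundary sidesteps the 'to'-inside-'cutoff' problem as well; a minimal sketch using re.split:
import re

string = "set cutoff to 440"
# \bto\b matches "to" only as a whole word, so the "to" inside "cutoff" is ignored
parts = [part.strip() for part in re.split(r'\bto\b', string)]
print(parts)  # ['set cutoff', '440']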

Can't get a word out of a string using regex [duplicate]

This question already has answers here:
Print one word from a string in python
(6 answers)
Closed 5 years ago.
I've created a regex in Python to get the first word out of a string. However, is there any way I could find a specific word, in this case PONY? They are both four-letter words and the latter one is in capitals, so I think it should be possible to find PONY using regex. I could only make an expression for the first one, though!
What I tried to find the first word:
import re
arg_str = "Jony is after PONY not phoney"
item = re.findall(r'([a-zA-Z]...+?)',arg_str)
print(item[0])
Any specific word? What about the following?
words = re.findall(r" *\w+ *", arg_str)
for word in words:
    print(word)
Output:
Jony
is
after
PONY
not
phoney
If you want to find the first occurrence of a word in a string, use str.find:
arg_str = "Jony is after pony not phoney"
print(arg_str.find("pony"))
If you want to find the first word:
print(arg_str.split()[0])
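To look for one specific word such as PONY directly, a word-boundary search is a common approach; a minimal sketch:
import re

arg_str = "Jony is after PONY not phoney"
match = re.search(r'\bPONY\b', arg_str)  # \b keeps it from matching inside longer words
if match:
    print(match.group(), match.start())  # PONY 14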

Reading sentences from a text file and appending into a list with Python 3 [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I'm having trouble figuring out how I would take a text file of a lengthy document and append each sentence within that text file to a list. Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.' within a sentence, so I couldn't just cut off searching through a sentence at a period. I'm assuming this could be fixed by adding a condition that the period should be followed by a space, but I have no idea how to set this up so that each sentence from the text file is put into the list as an element.
The program I'm writing is essentially going to allow for user input of a keyword search (key), and input for a number of sentences to be returned (value) before and after the sentence where the keyword is found. So it's more or less a research assistant so the user won't have to read a massive text file to find the information they want.
From what I've learned so far, putting the sentences into a list would be the easiest way to go about this, but I can't figure out the first part of it. If I could figure out this part, the rest should be easy to put together.
So I guess in short,
If I have a document of Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence.
I need a list of the document contents in the form of:
sentence_list = [Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence]
That's a pretty hard problem, and it doesn't have an easy answer. You could try to write a regular expression that captures all of the known cases, but complex regular expressions tend to be hard to maintain and debug. There are a number of existing libraries that may help you with this. Most notable is the Natural Language Toolkit (NLTK), which has many tokenizers built in. You can install it with pip, e.g.
pip install nltk
Then getting your sentences is a fairly straightforward (although highly customizable) affair. Here's a simple example using the provided sentence tokenizer:
import nltk  # nltk.sent_tokenize also needs the punkt model: nltk.download('punkt')

with open('text.txt', 'r') as in_file:
    text = in_file.read()
sents = nltk.sent_tokenize(text)
I'm not entirely clear how your sentences are delimited if not by normal punctuation, but running the above code on your text I get:
[
"I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list.",
"Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.'",
"within a sentence, so I couldn't just cutoff searching through a sentence at a period.",
"I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.\n\n"
]
But it fails on inputs like: ["This is a sentence with.", "a period right in the middle."]
while passing on inputs like: ["This is a sentence wit.h a period right in the middle"]
I don't know if you're going to get much better than that right out of the box, though. From the nltk code:
A sentence tokenizer which uses an unsupervised algorithm to build
a model for abbreviation words, collocations, and words that start
sentences; and then uses that model to find sentence boundaries.
This approach has been shown to work well for many European
languages.
So the nltk solution is actually using machine learning to build a model of a sentence. Much better than a regular expression, but still not perfect. Damn natural languages. >:(
Hope this helps :)
First read the text file into a container.
Then use regular expressions to parse the document.
This is just a sample of how the split() method can be used for breaking up strings:
import re

with open("test.txt", "r") as file:
    doclist = [line for line in file]   # read the file into a list of lines
docstr = ''.join(doclist)               # join the lines back into one string
sentences = re.split(r'[.!?]', docstr)  # naive split on sentence-ending punctuation
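As a rough sketch of the key/value lookup the question describes (the context function and its arguments here are illustrative, not from the original post), once the sentence list exists:
def context(sentences, key, value):
    """Return `value` sentences before and after each sentence containing `key`."""
    matches = []
    for i, sentence in enumerate(sentences):
        if key.lower() in sentence.lower():
            matches.append(sentences[max(0, i - value):i + value + 1])
    return matches

sentence_list = ["One.", "Two.", "The key sentence.", "Four.", "Five."]
print(context(sentence_list, "key", 1))  # [['Two.', 'The key sentence.', 'Four.']]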

Tokenizing unsplit words from OCR using NLTK

I'm using NLTK to process some text that is extracted from PDF files. I can recover the text mostly intact, but there are lots of instances where spaces between words are not captured, so I get words like ifI instead of if I, or thatposition instead of that position, or andhe's instead of and he's.
My question is this: how can I use NLTK to look for words it does not recognize/has not learned, and see if there are "nearby" word combinations that are much more likely to occur? Is there a more graceful way to implement this kind of check than simply marching through the unrecognized word, one character at a time, splitting it, and seeing if it makes two recognizable words?
I would suggest that you consider using pyenchant instead, since it is a more robust solution for this sort of problem. You can install pyenchant with pip (pip install pyenchant). Here is an example of how you would obtain your results after you install it:
>>> text = "IfI am inthat position, Idon't think I will." # note the lack of spaces
>>> from enchant.checker import SpellChecker
>>> checker = SpellChecker("en_US")
>>> checker.set_text(text)
>>> for error in checker:
...     for suggestion in error.suggest():
...         # accept a suggestion only if it has exactly the same characters as
...         # the error, in the same order, ignoring spaces
...         if error.word.replace(' ', '') == suggestion.replace(' ', ''):
...             error.replace(suggestion)
...             break
>>> checker.get_text()
"If I am in that position, I don't think I will." # text is now fixed
