So I'm trying to do a cosine similarity with a text file I have. https://lms.uwa.edu.au/bbcswebdav/pid-1143173-dt-content-rid-16133365_1/courses/CITS1401_SEM-2_2018/CITS1401_SEM-2_2018_ImportedContent_20180713092326/CITS1401_SEM-1_2018/Unit%20Content/Resources/Project2_2018/sample.txt
I'm wondering how to read it sentence by sentence, rather than using readline() to read it line by line.
I'm trying to create the sentence variables. For example
s1 = "the mississippi is well worth reading about"
s2 = "it is not a commonplace river, but on the contrary is in all ways remarkable"
Is this the right way to go about it? If it is, my next step, which I know how to do, is to remove the common words from the sentences and leave only the unique words to compare.
How do I stop at each full stop and store that sentence in a variable while looping through the text?
Thanks
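Do you mean this?
with open("file.txt", 'r') as in_f:
    # replace newlines with spaces so words on adjacent lines are not glued together
    sentences = in_f.read().replace('\n', ' ').split('.')
for s in sentences:
    print(s.strip())  # your code goes here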
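A minimal sketch of the same idea on an in-memory string, assuming every sentence ends in a full stop. The helper name split_sentences is hypothetical, and the sample text is just the two sentences from the question with a line break added. It keeps the sentences in a list; s1 and s2 are then simply the first two elements.

```python
def split_sentences(text):
    """Split text on full stops and return the non-empty, stripped sentences."""
    # Collapse newlines and repeated whitespace into single spaces first.
    flattened = " ".join(text.split())
    return [part.strip() for part in flattened.split(".") if part.strip()]


sample = (
    "the mississippi is well worth reading about. it is not a\n"
    "commonplace river, but on the contrary is in all ways remarkable."
)
sentences = split_sentences(sample)
s1, s2 = sentences[0], sentences[1]
print(s1)
print(s2)
```

A list scales to any number of sentences, whereas separate variables only work when the count is known in advance.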
For example, I want to map inevitable, unavoidable, certain and sure to "necessary". If any of these words is used in a sentence I give it, my program should automatically change it to "necessary" and return the sentence.
For example:
it is inevitable or unavoidable or certain or sure, that person age should be 18
My Python program should automatically detect these words and convert the sentence to
"it is necessary that person age should be 18"
Your issue isn't very clear; tell us what you want to do and what you can't figure out.
I think you should split your sentence to get a list of all the words in it. Then check whether each word belongs to your list of "changeable" words (inevitable, unavoidable, certain, sure); if so, replace it with the word you want ("necessary" in your example).
But I'm not sure I understood your problem.
sen = "this is unavoidable that the kids must be 18"
words = sen.split()
new_words = []
for word in words:
    if word in ['inevitable', 'unavoidable', 'certain', 'sure']:
        word = 'necessary'
    new_words.append(word)
new_sen = " ".join(new_words)
print(new_sen)
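The split-based loop above misses a word with punctuation attached (such as "sure,") and replaces each synonym separately. A hedged alternative sketch uses a regular expression with word boundaries and treats a run of synonyms joined by "or" as a single match. The names SYNONYMS, PATTERN and normalize are hypothetical, and collapsing the "or" chain is an assumption about what the asker wants, based on their example.

```python
import re

# Words to be replaced (taken from the question).
SYNONYMS = ("inevitable", "unavoidable", "certain", "sure")

_alternatives = "|".join(re.escape(word) for word in SYNONYMS)
# One synonym, optionally followed by "or <synonym>" any number of times.
PATTERN = re.compile(
    rf"\b(?:{_alternatives})\b(?:\s+or\s+(?:{_alternatives})\b)*",
    re.IGNORECASE,
)


def normalize(sentence, replacement="necessary"):
    """Replace each synonym, or 'or'-joined run of synonyms, with replacement."""
    return PATTERN.sub(replacement, sentence)


print(normalize("it is inevitable or unavoidable or certain or sure, that person age should be 18"))
print(normalize("this is unavoidable that the kids must be 18"))
```

The comma after the run is left in place, so the first result reads "it is necessary, that person age should be 18"; removing it would need an extra rule.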
I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list. Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.' within a sentence, so I couldn't just cutoff searching through a sentence at a period. I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.
The program I'm writing is essentially going to allow for user input of a keyword search (key), and input for a number of sentences to be returned (value) before and after the sentence where the keyword is found. So it's more or less a research assistant so the user won't have to read a massive text file to find the information they want.
From what I've learned so far, putting the sentences into a list would be the easiest way to go about this, but I can't figure out the first part to it. If I could figure out this part, the rest should be easy to put together.
So I guess in short,
If I have a document of Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence.
I need a list of the document contents in the form of:
sentence_list = [Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence]
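Once the sentences are in a list, the keyword lookup described above could be sketched as follows. The function name keyword_context and the sample sentences are made up for illustration, and the match is a plain case-insensitive substring test, which is an assumption about how the search should behave.

```python
def keyword_context(sentences, keyword, n):
    """For each sentence containing keyword, return it with n neighbours on each side."""
    hits = []
    needle = keyword.lower()
    for i, sentence in enumerate(sentences):
        if needle in sentence.lower():
            start = max(0, i - n)  # do not run off the front of the list
            hits.append(sentences[start:i + n + 1])
    return hits


# Illustrative data only; a real list would come from the text file.
sentence_list = [
    "The river is long.",
    "It floods in spring.",
    "Farmers rely on it.",
    "Towns grew along its banks.",
]
result = keyword_context(sentence_list, "floods", 1)
print(result)
```

Slicing handles the end of the list automatically, so only the start index needs clamping.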
That's a pretty hard problem, and it doesn't have an easy answer. You could try to write a regular expression that captures all of the known cases, but complex regular expressions tend to be hard to maintain and debug. There are a number of existing libraries that may help you with this. Most notable is the Natural Language Toolkit (NLTK), which has many tokenizers built in. You can install it with pip, e.g.
pip install nltk
Getting your sentences is then a fairly straightforward (although highly customizable) affair. Here's a simple example using the provided sentence tokenizer:
import nltk

# sent_tokenize needs NLTK's Punkt tokenizer data, installed separately via nltk.download()
with open('text.txt', 'r') as in_file:
    text = in_file.read()
sents = nltk.sent_tokenize(text)
I'm not entirely clear how your sentences are delimited if not by normal punctuation, but running the above code on your text I get:
[
"I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list.",
"Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.'",
"within a sentence, so I couldn't just cutoff searching through a sentence at a period.",
"I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.\n\n"
]
It still fails on some inputs, producing splits like: ["This is a sentence with.", "a period right in the middle."]
while handling others correctly: ["This is a sentence wit.h a period right in the middle"]
I don't know if you're going to get much better than that right out of the box, though. From the nltk code:
A sentence tokenizer which uses an unsupervised algorithm to build
a model for abbreviation words, collocations, and words that start
sentences; and then uses that model to find sentence boundaries.
This approach has been shown to work well for many European
languages.
So the nltk solution is actually using machine learning to build a model of a sentence. Much better than a regular expression, but still not perfect. Damn natural languages. >:(
Hope this helps :)
First read the text file into a container.
Then use regular expressions to parse the document.
This is just a sample of how a split() method can be used to break up the string:
import re

# read the whole file, then split on sentence-ending punctuation
with open("test.txt", "r") as file:
    docstr = file.read()
sentences = re.split(r'[.!?]', docstr)
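Splitting on the bare character class drops the punctuation and also splits on a '.' inside a sentence. A hedged variation, following the asker's idea that the end character should be followed by a space, uses a lookbehind so the split happens on the whitespace after '.', '!' or '?'. The names SENTENCE_END and split_on_terminators and the sample text are hypothetical.

```python
import re

# Split on whitespace that directly follows '.', '!' or '?'.
# The lookbehind keeps the punctuation attached to its sentence.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_on_terminators(text):
    """Return the sentences in text, keeping their end punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


doc = "Version 2.5 is out. Is it stable? Yes! Install it today."
sentences = split_on_terminators(doc)
print(sentences)
```

The period in "2.5" is not followed by whitespace, so it is left alone. Abbreviations such as "Dr. Smith" would still be split, which is the kind of case a trained tokenizer like NLTK's is meant to handle.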
I'm using sklearn to perform cosine similarity.
Is there a way to consider all the words starting with a capital letter as stop words?
The following regex will take as input a string, and remove/replace all sequences of alphanumeric characters that begin with an uppercase character with the empty string. See http://docs.python.org/2.7/library/re.html for more options.
import re
s1 = "The cat Went to The store To get Some food doNotMatch"
r1 = re.compile(r'\b[A-Z]\w+')  # raw string, so \b means a word boundary
r1.sub('', s1)
' cat  to  store  get  food doNotMatch'
Sklearn also has many great facilities for text feature generation, such as sklearn.feature_extraction.text. You might also want to consider NLTK to assist with sentence segmentation, removing stop words, etc.
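To show how stripping capitalised words feeds into the similarity score, here is a dependency-free sketch that computes cosine similarity over plain word counts. This is a stand-in for illustration, not the sklearn API. The names bag_of_words and cosine_similarity are hypothetical, and the pattern uses \w* rather than \w+ so that single-letter words such as "A" are dropped as well.

```python
import math
import re
from collections import Counter

# Any word that starts with an uppercase letter (including one-letter words).
CAPITALIZED = re.compile(r"\b[A-Z]\w*")


def bag_of_words(text):
    """Count the words left after removing capitalised words."""
    stripped = CAPITALIZED.sub("", text)
    return Counter(word.lower() for word in re.findall(r"\w+", stripped))


def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of a and b."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[word] * vb[word] for word in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(count * count for count in va.values()))
    norm_b = math.sqrt(sum(count * count for count in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


score = cosine_similarity("The river is wide", "A river is wide")
print(score)
```

Because "The" and "A" are both removed, the two sentences reduce to the same three words and the score is 1 up to floating-point rounding.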
I want to look for a phrase, match up to a few words following it, but stop early if I find another specific phrase.
For example, I want to match up to three words following "going to the", but stop the matching process if I encounter "to try". So "going to the luna park" should result in "luna park", "going to the capital city of Peru" should result in "capital city of", and "going to the moon to try some cheesecake" should result in "moon".
Can it be done with a single, simple regular expression (preferably in Python)? I've tried all the combinations I could think of, but failed miserably :).
This one matches up to 3 ({1,3}) words following going to the as long as they are not followed by to try ((?!to try)):
import re

# "input" holds one phrase per line
with open("input", "r") as infile:
    for line in infile:
        m = re.match(r"going to the ((?:\w+\s*(?!to try)){1,3})", line)
        if m:
            print(m.group(1).rstrip())
Output
luna park
capital city of
moon
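The same pattern can be checked without an input file by wrapping it in a small function and running it on the three phrases from the question. The function name words_after is hypothetical; the regular expression is the one from the answer, written as a raw string.

```python
import re

# Up to three words after "going to the", stopping before "to try".
PATTERN = re.compile(r"going to the ((?:\w+\s*(?!to try)){1,3})")


def words_after(line):
    """Return the captured words, or None if the phrase is not at the start of line."""
    m = PATTERN.match(line)
    return m.group(1).rstrip() if m else None


examples = [
    "going to the luna park",
    "going to the capital city of Peru",
    "going to the moon to try some cheesecake",
]
results = [words_after(example) for example in examples]
print(results)
```

re.match only matches at the start of the string; re.search would be the choice if the phrase can appear mid-sentence.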
I think you are looking for a way to extract proper nouns from sentences. You should look at NLTK for a proper approach. Regular expressions are only helpful for limited, regular patterns. You, on the other hand, seem to be asking for the ability to parse human language, which is non-trivial (for computers).
Should I use NLTK or regular expressions to split it?
How can I select pronouns (he/she)? I want to select any sentence that contains a pronoun.
This is a part of a larger project and I am new to Python. Could you please point me to any helpful code?
I am working on an NLP project which has similar needs. I suggest you use NLTK, since it makes things really easy and gives you a lot of flexibility. Since you need to collect all sentences containing pronouns, you can split the text into sentences and hold them in a list. Then you can iterate over the list and look for sentences containing pronouns. Also make sure you note down the index of each matching sentence (in the list), or build a new list of matches.
Sample code below:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']
sentences_with_pronouns = []
for sentence in sentences:
    words = word_tokenize(sentence)
    for word in words:
        word_pos = pos_tag([word])  # tag each word on its own
        if word_pos[0][1] == 'PRP':  # PRP = personal pronoun
            sentences_with_pronouns.append(sentence)
            break  # one pronoun is enough; avoid adding the sentence twice
print(sentences_with_pronouns)
Output:
['she also loves to play chess with him']
P.S. Also check out the PyLucene and Whoosh libraries, which are pretty useful Python packages for text indexing and search.
NLTK is your best bet. Given a string of sentences as input, you can obtain a list of those sentences containing pronouns by doing:
from nltk import pos_tag, sent_tokenize, word_tokenize

paragraph = "This is a sentence with no pronouns. Take it or leave it."
print([sentence for sentence in sent_tokenize(paragraph)
       if 'PRP' in {pos for _, pos in pos_tag(word_tokenize(sentence))}])
Returns:
['Take it or leave it.']
Basically, we split the string into a list of sentences, split each sentence into a list of words, and convert the list of words for each sentence into a set of part-of-speech tags. The set is important: without it, a sentence with multiple pronouns would appear in the result more than once.
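Where installing NLTK and its tagger data is not an option, a rough approximation is to compare each sentence's words against a fixed pronoun set. This is a swapped-in technique, not part-of-speech tagging: the PRONOUNS set below is a hand-picked assumption, the function name is hypothetical, and ambiguous words such as "her" (which can also be possessive) are counted regardless of their role.

```python
import re

# Hand-picked personal pronouns; extend or trim to suit the text.
PRONOUNS = {"i", "me", "you", "he", "him", "she", "her", "it", "we", "us", "they", "them"}


def sentences_with_pronouns(sentences):
    """Return the sentences that contain at least one word from PRONOUNS."""
    selected = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z']+", sentence.lower()))
        if words & PRONOUNS:  # a set intersection means each sentence is added once
            selected.append(sentence)
    return selected


sentences = ['alice loves to read crime novels.', 'she also loves to play chess with him']
matches = sentences_with_pronouns(sentences)
print(matches)
```

As with the set of tags above, the set intersection guarantees a sentence with several pronouns is only returned once.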