I have a text file containing clean tweets and after every 15th term I need to insert a period.
In Python, how do I add a character after a specific word using regex? Right now I am parsing each line word by word, and I don't understand regex well enough to write the code.
Basically, the idea is that each line becomes its own string after a period.
Or is there an alternative way to split a paragraph into individual sentences?
Splitting paragraphs into sentences can be achieved with functions in the nltk package. Please refer to this answer: Python split text on sentences
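For instance, a minimal sketch along those lines (assuming nltk and its punkt data are installed; the file name is made up):
import nltk
# nltk.download('punkt')  # uncomment for a one-time download of the sentence tokenizer data
with open('tweets.txt', 'r') as f:  # hypothetical file name
    text = f.read()
sentences = nltk.sent_tokenize(text)  # one string per sentence
print(sentences)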
I have a string with sentences that I want to separate into individual sentences. The string has a lot of subtleties that are difficult to capture and split. I cannot use the nltk library either. My current regex does the best job among all the others I have tried, but misses some sentences that start on a new line (implying a new paragraph). I was wondering if there was an easy way to modify the current expression to also split when there is a new line.
import re
with open('data.txt', 'r') as file:
    text = file.read()
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
The current regexp is as follows:
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
I would essentially need to modify the expression to also split when there is a new line.
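One way (a sketch, not tested against all your edge cases) is to add an alternation for newlines so that any line break also counts as a split point:
import re
with open('data.txt', 'r') as f:
    text = f.read()
# same lookbehinds as before, plus '|\n+' so any run of newlines is also a split point
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s|\n+', text)
sentences = [s for s in sentences if s.strip()]  # drop empty pieces left by the split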
I'm writing a cipher in Python. I'm confused about how to use a regular expression to find matching words in a text dictionary.
For example, there is a dictionary.txt with many English words in it. I need to find the words that start with "th", like they, them, the, their, and so on.
What kind of regular expression should I use to find "th" at the beginning of a word?
Thank you!
If you have a list of words (so that every word is a string), you can find the words beginning with 'th' with this:
import re
yourRegEx = re.compile(r'^th\w*')  # ^ anchors the match to the beginning of the string
^(th\w*)
gives you all results where the string begins with th. If there is more than one word in the string, you will only get the first.
(^|\s)(th\w*)
will give you all the words beginning with th, even if there is more than one such word in the string.
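For completeness, a small sketch of how that could be applied to the dictionary.txt file, assuming one word per line (the file layout and variable names are assumptions):
import re
th_words = []
with open('dictionary.txt', 'r') as f:  # assuming one word per line
    for line in f:
        word = line.strip()
        if re.match(r'^th\w*', word):  # word starts with 'th'
            th_words.append(word)
print(th_words)  # e.g. ['the', 'they', 'them', 'their', ...]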
(th)\w*
Note that there is a great online tool for generating Python code and testing regexes:
REGEX WEBSITE
I have a large stockpile of PDFs of documents. I use Apache Tika to convert them to text, and now I'd like to split them into paragraphs. I can't use regular expressions because the text conversion makes the distinction between paragraphs impossible: some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs (using Tika's conversion to HTML instead of text does not help).
Python's NLTK book has a way of splitting sentences using machine learning, so I thought of trying something similar with paragraphs, but I couldn't find training data for that.
Is there training data for that? Should I try some complex regular expression that might work?
I will try to give an easier way to deal with your problem: check whether the file contains a double \nl; if it does, split the data on that, and if it does not, split on a single \nl instead.
Another thing: I don't think \nl is a special character, since I could not find an ASCII value for it; it is probably the newline character, but since you asked about \nl I am writing the example accordingly (if it is indeed \n, you just need to change the part that checks for a double \nl).
A rough example to detect which paragraph convention the file uses:
f = open('yourfile', 'r')
a = f.read()
f.close()
temp = 0
for z in range(len(a) - 4):
    if a[z:z+4] == '\nl\nl':
        temp = 1
        break
# temp = 1 if the formatting uses a double \nl, otherwise 0
After this you can use simple string operations to check for a single \nl or a double \nl and replace them as needed to distinguish a new line from a new paragraph. (Please read the file in chunks if it is very large; otherwise you might run into memory problems or slow code.)
You say
some documents have the standard way of a \n between paragraphs, but some have a \n between lines in the same paragraph and then a double \n between paragraphs
so I would preprocess all the files to detect which ones use a double newline between paragraphs. The files with double \n need to be stripped of all single newline characters, and all double newlines reduced to single ones.
You can then pass all the files to the next stage where you detect paragraphs using a single \n character.
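A rough sketch of that preprocessing idea (the double-newline heuristic and the file name are assumptions, not tested rules):
def normalize_paragraph_breaks(text):
    # Heuristic: if the document contains blank lines, treat a double newline
    # as the paragraph break and single newlines as wrapped lines inside a paragraph.
    if '\n\n' in text:
        text = text.replace('\n\n', '\x00')  # protect the paragraph breaks
        text = text.replace('\n', ' ')       # join wrapped lines within a paragraph
        text = text.replace('\x00', '\n')    # paragraph breaks become a single \n
    return text

with open('document.txt', 'r') as f:  # hypothetical file name
    paragraphs = normalize_paragraph_breaks(f.read()).split('\n')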
from nltk import tokenize

a = 'para here'
sentences = tokenize.sent_tokenize(a)
# output: a list of sentences - that's all you need
I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list. Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.' within a sentence, so I couldn't just cutoff searching through a sentence at a period. I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.
The program I'm writing is essentially going to allow for user input of a keyword search (key), and input for a number of sentences to be returned (value) before and after the sentence where the keyword is found. It's more or less a research assistant, so the user won't have to read a massive text file to find the information they want.
From what I've learned so far, putting the sentences into a list would be the easiest way to go about this, but I can't figure out the first part to it. If I could figure out this part, the rest should be easy to put together.
So I guess in short,
If I have a document of Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence. Sentence.
I need a list of the document contents in the form of:
sentence_list = [Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence, Sentence]
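Once sentence_list exists, the key/value lookup you describe is fairly mechanical. A rough sketch (the function name and the case-insensitive match are my own assumptions):
def find_context(sentence_list, key, value):
    # Return, for every sentence containing the keyword, that sentence together
    # with `value` sentences before and after it.
    results = []
    for i, sentence in enumerate(sentence_list):
        if key.lower() in sentence.lower():
            start = max(0, i - value)
            end = min(len(sentence_list), i + value + 1)
            results.append(sentence_list[start:end])
    return results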
That's a pretty hard problem, and it doesn't have an easy answer. You could try to write a regular expression that captures all of the known cases, but complex regular expressions tend to be hard to maintain and debug. There are a number of existing libraries that may help you with this. Most notable is the Natural Language Toolkit (NLTK), which has many tokenizers built in. You can install it with pip, e.g.
pip install nltk
And then getting your sentences is a fairly straightforward (although highly customizable) affair. Here's a simple example using the provided sentence tokenizer:
import nltk

with open('text.txt', 'r') as in_file:
    text = in_file.read()

sents = nltk.sent_tokenize(text)
I'm not entirely clear how your sentences are delimited if not by normal punctuation, but running the above code on your text I get:
[
"I'm having trouble figuring out how I would take a text file of a lengthy document, and append each sentence within that text file to a list.",
"Not all sentences will end in a period, so all end characters would have to be taken into consideration, but there could also be a '.'",
"within a sentence, so I couldn't just cutoff searching through a sentence at a period.",
"I'm assuming this could be fixed by also adding a condition where after the period it should be followed by a space, but I have no idea how to set this up so I get each sentence from the text file put into a list as an element.\n\n"
]
But fails on inputs like: ["This is a sentence with.", "a period right in the middle."]
while passing on inputs like: ["This is a sentence wit.h a period right in the middle"]
I don't know if you're going to get much better than that right out of the box, though. From the nltk code:
A sentence tokenizer which uses an unsupervised algorithm to build
a model for abbreviation words, collocations, and words that start
sentences; and then uses that model to find sentence boundaries.
This approach has been shown to work well for many European
languages.
So the nltk solution is actually using machine learning to build a model of a sentence. Much better than a regular expression, but still not perfect. Damn natural languages. >:(
Hope this helps :)
First read the text file into a container.
Then use regular expressions to parse the document.
This is just a sample of how re.split() can be used to break up the string:
import re

with open("test.txt", "r") as in_file:
    doclist = [line for line in in_file]

docstr = ''.join(doclist)
sentences = re.split(r'[.!?]', docstr)
I'm using Python with nltk. I need to process some English text that contains no whitespace, but the word_tokenize function in nltk can't deal with a problem like this. So how do I tokenize text without any whitespace? Are there any tools for this in Python?
I am not aware of such tools, but the solution to your problem depends on the language.
For the Turkish language you can scan the input text letter by letter and accumulate letters into a word. When you are sure that the accumulated letters form a valid word from a dictionary, you save it as a separate token, clear the buffer to start accumulating a new word, and continue the process.
You can try this for English, but I expect you will find situations where the ending of one word is also the beginning of another dictionary word, and this can cause you some problems.
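If you still want to experiment with that idea, here is a very naive greedy sketch (longest dictionary match first; the tiny word list is made up, and real English text will hit exactly the ambiguity problem mentioned above):
def segment(text, dictionary):
    # Greedy longest-match segmentation: at each position take the longest
    # dictionary word that starts there, otherwise keep the single character.
    words = sorted(dictionary, key=len, reverse=True)
    tokens, i = [], 0
    while i < len(text):
        for w in words:
            if text.startswith(w, i):
                tokens.append(w)
                i += len(w)
                break
        else:
            tokens.append(text[i])  # unknown character, keep it as-is
            i += 1
    return tokens

print(segment('thisisatest', {'this', 'is', 'a', 'test'}))  # ['this', 'is', 'a', 'test']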