Need to modify my regexp to split string at every new line - python

I have a string containing sentences that I want to separate into individual sentences. The string has a lot of subtleties that are difficult to capture and split, and I cannot use the nltk library. My current regex does the best job of all the ones I have tried, but it misses some sentences that start on a new line (implying a new paragraph). I was wondering if there is an easy way to modify the current expression so that it also splits when there is a new line.
import re

# Read the whole file and split it into sentences.
with open('data.txt', 'r') as file:
    text = file.read()
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
I would essentially need to modify the expression to also split when there is a new line.
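One minimal way to do this (a sketch, not the only option) is to add an alternative branch so that runs of newlines also count as boundaries, then drop the empty strings the alternation can produce:

import re

with open('data.txt', 'r') as file:
    text = file.read()

# The original pattern plus '|\n+', so a newline also splits, even when
# the preceding line lacks terminal punctuation.
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s|\n+', text)

# The extra branch can yield empty strings; filter them out.
sentences = [s for s in sentences if s.strip()]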

Related

Make a list from the words of fileA and check with that list Against fileB in python

To give a bit more detail: I have a list of common words (around 2000) in a txt file, and I want to check whether any of those words exist in another file (HTML), and if they do, replace them with a constant string ('sssss', for example). Regex didn't help me much; I tried variations such as \b(?:one|two|three)\b, \w, and (?:^|(?<= ))(one|common|word|or|another)(?:(?= )|$).
Now I know how to open a file and import the first list, but I don't know how to check every entry of that list against a huge text and replace their instances. I don't mind if it takes time; I just really need this done and don't know how.
import re

# Read the word list, one word per line.
with open('test2.txt', 'r') as f:
    lines = f.readlines()
print(lines)
Here's a hint for you. Parse each file into a set, where each word is an entry.
Then you can compare the two sets with one of the set operations: union, intersection, difference, or symmetric difference.
Regular expressions are not necessary unless you plan to make additional correlations between words (e.g. comparing cat to cats). But if you plan to go down this road, then you're probably better off generating a trie (prefix tree). I can expand more if you are willing to show some more code (progress).
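A minimal sketch of both steps, assuming the word list lives in test2.txt (one word per line) and the target file is named input.html (both names are taken or inferred from the question):

import re

# Load the ~2000 common words into a set.
with open('test2.txt', 'r') as f:
    common = {line.strip().lower() for line in f if line.strip()}

with open('input.html', 'r') as f:
    text = f.read()

# Set intersection: which common words actually occur in the target text?
found = common & {w.lower() for w in re.findall(r'\w+', text)}

if found:
    # Build one alternation of all found words, anchored on word boundaries.
    pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, sorted(found))) + r')\b',
                         re.IGNORECASE)
    text = pattern.sub('sssss', text)

Compiling a single alternation once and calling sub() once is far faster than looping over 2000 words and scanning the whole text for each one.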

Regex Python: Adding . after every 15 terms

I have a text file containing cleaned tweets, and after every 15th term I need to insert a period.
In Python, how do I add a character after a specific word using regex? Right now I am parsing the line word by word, and I don't understand regex well enough to write the code.
Basically, the goal is for each chunk to become its own string once the periods are in place.
Or is there an alternative way to split a paragraph into individual sentences?
Splitting paragraphs into sentences can be achieved with functions in the nltk package. Please refer to this answer: Python split text on sentences.
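For the insertion itself, a regex is not strictly required. A minimal sketch using plain splitting (assuming terms are whitespace-separated):

def add_periods(line, n=15):
    # Break the line into terms, group them into chunks of n,
    # and join the chunks back together with periods.
    words = line.split()
    chunks = [' '.join(words[i:i + n]) for i in range(0, len(words), n)]
    return ('. '.join(chunks) + '.') if chunks else ''

Each line of the file can then be passed through add_periods(), after which splitting on '. ' (or any sentence splitter) yields one string per 15-term group.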

Need help finding the correct regex pattern for my string pattern

I'm terrible with regex patterns, and I'm writing a simple Python program that requires splitting lines of a file into a 'content' part and a 'tags' part, and then further splitting the tags part into individual tags. Here's a simple example of what one line of my file might look like:
The Beatles <music,rock,60s,70s>
I've opened my file and begun reading lines like this:
def Load(self, filename):
    file = open(filename, 'r')
    for line in file:
        # Ignore comments and empty lines.
        if not line.startswith('#') and line.strip():
            # ...
Forgive my likely terrible Python; it's my first few days with the language. Anyway, next I was thinking it would be useful to use a regex to break my string into sections, with a variable to store the 'content' (for example, "The Beatles") and a list/set to store each of the tags. As such, I need a regex (or two?) that can:
Split the raw part from the <> part.
And split the tags part into a list based on the commas.
Finally, I want to make sure that the content part retains its capitalization and inner spacing. But I want to make sure the tags are all lower-case and without white space.
I'm wondering if any of the regex experts out there can help me find the correct pattern(s) to achieve my goals here?
This is a solution that gets around the problem without regex, by relying on multiple splits instead.
# This separates the string into the content and the remainder
# (assumes the line's trailing newline has already been stripped).
content, tagStr = line.split('<')
# This splits tagStr into individual tags; [:-1] removes the trailing '>'.
tags = tagStr[:-1].split(',')
print(content)
print(tags)
The problem with this is that it leaves a trailing space after the content.
You can remove this with:
content = content[:-1]
(or, more robustly, content = content.rstrip()).
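If you do want the regex route, here is a minimal sketch; the pattern assumes exactly one <...> group at the end of each line:

import re

line = 'The Beatles <music,rock,60s,70s>'

# Group 1 captures the content (capitalization and inner spacing preserved);
# group 2 captures everything between the angle brackets.
m = re.match(r'^(.*?)\s*<([^>]*)>\s*$', line)
if m:
    content = m.group(1)
    tags = [t.strip().lower() for t in m.group(2).split(',')]
    print(content)  # The Beatles
    print(tags)     # ['music', 'rock', '60s', '70s']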

Creating a corpus from data in a custom format

I have hundreds of files containing text I want to use with NLTK. Here is one such file:
বে,বচা ইয়াণ্ঠা,র্চা ঢার্বিত তোখাটহ নতুন, অ প্রবঃাশিত।
তবে ' এ বং মুশায়েরা ' পত্রিব্যায় প্রকাশিত তিনটি লেখাই বইযে
সংব্যজান ব্যরার জনা বিশেষভাবে পরিবর্ধিত। পাচ দাপনিকেব
ড:বন নিয়ে এই বই তৈরি বাবার পরিব্যল্পনাও ম্ভ্রাসুনতন
সামন্তেরই। তার আর তার সহকারীদেব নিষ্ঠা ছাডা অল্প সময়ে
এই বই প্রব্যাশিত হতে পারত না।,তাঁদের সকলকে আমাধ
নমস্কার জানাই।
বতাব্যাতা শ্রাবন্তা জ্জাণ্ণিক
জানুয়ারি ২ ণ্ট ণ্ট ৮
Total characters: 378
Note that the line breaks do not mark sentence boundaries. Rather, the sentence terminator - the equivalent of the period in English - is the '।' symbol.
Could someone please help me create my corpus? If imported into a variable MyData, I would need to access MyData.words() and MyData.sents(). Also, the last line should not appear in the corpus (it merely contains a character count).
Please note that I will need to run operations on data from all the files at once.
Thanks in advance!
You don't need to input the files yourself or to provide words and sents methods.
Read in your corpus with PlaintextCorpusReader, and it will provide those for you.
The corpus reader constructor accepts arguments for the path and filename pattern of the files, and for the input encoding (be sure to specify it).
The constructor also has optional arguments for the sentence and word tokenization functions, so you can pass it your own method to break up the text into sentences. If word and sentence detection is really simple, i.e., if the '।' character has no other uses, you can configure a tokenization function from the nltk's RegexpTokenizer family, or you can write your own from scratch. (Before you write your own, study the docs and code, or write a stub to find out what kind of input it's called with.)
If recognizing sentence boundaries is non-trivial, you can later figure out how to train the nltk's PunktSentenceTokenizer, which uses an unsupervised statistical algorithm to learn which uses of the sentence terminator actually end a sentence.
If the configuration of your corpus reader is fairly complex, you may find it useful to create a class that specializes PlaintextCorpusReader. But much of the time that's not necessary. Take a look at the NLTK code to see how the gutenberg corpus is implemented: it's just a PlaintextCorpusReader instance with appropriate arguments for the constructor.
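A minimal sketch of that non-specialized route; the corpus path, the .txt file pattern, and the assumption that '।' unambiguously ends sentences are all mine:

from nltk.corpus import PlaintextCorpusReader
from nltk.tokenize import RegexpTokenizer

# Words: runs of characters that are neither whitespace nor the danda.
# Sentences: the stretches of text between danda marks (gaps=True).
word_tok = RegexpTokenizer(r'[^\s।]+')
sent_tok = RegexpTokenizer(r'।', gaps=True)

MyData = PlaintextCorpusReader('corpus/', r'.*\.txt',
                               word_tokenizer=word_tok,
                               sent_tokenizer=sent_tok,
                               encoding='utf-8')

print(MyData.words()[:10])
print(MyData.sents()[:2])

This reads the files as-is, so the trailing character-count lines would still need to be stripped beforehand, as the next answer shows.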
1) Getting rid of the last line is rather straightforward:
f = open('corpus.txt', 'r', encoding='utf-8')
for l in f.readlines()[:-1]:
    ...
The [:-1] in the for loop will skip the last line for you.
2) The built-in readlines() method of a file object breaks the file's content into lines using the newline character as a delimiter. So you need to write some code to cache the lines until a '।' is seen. When a '।' is encountered, treat the cached lines as one single sentence and put it in your MyData class.
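A sketch of that caching idea (the filename is an assumption; text accumulates across lines and is cut at each '।'):

sentences = []
buf = ''
with open('corpus.txt', 'r', encoding='utf-8') as f:
    for line in f.readlines()[:-1]:  # skip the character-count line
        buf += line.strip() + ' '
        while '।' in buf:
            # Everything up to the terminator is one complete sentence.
            sent, buf = buf.split('।', 1)
            sentences.append(sent.strip())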

Splitting words in running text using Python?

I am writing a piece of code that will extract words from running text. The text can contain delimiters like \r and \n embedded in it.
I want to discard all these delimiters and extract only the full words. How can I do this in Python? Is there a library available for crunching text in Python?
Assuming your definition of "word" agrees with that of the regular expression module (re), that is, letters, digits and underscores, it's easy:
import re
fullwords = re.findall(r'\w+', thetext)
where thetext is the string in question (e.g., coming from an f.read() of a file object f open for reading, if that's where you get your text from).
If you define words differently (e.g. you want to include apostrophes so for example "it's" will be considered "one word"), it isn't much harder -- just use as the first argument of findall the appropriate pattern, e.g. r"[\w']+" for the apostrophe case.
If you need to be very, very sophisticated (e.g., deal with languages that use no breaks between words), then the problem suddenly becomes much harder and you'll need some third-party package like nltk.
If your delimiters are whitespace characters (space, \r, \n, and the like), then basic str.split() does what you want:
>>> "asdf\nfoo\r\nbar too\tbaz".split()
['asdf', 'foo', 'bar', 'too', 'baz']
