Using regular expressions in Python

I'm struggling to cut the very first sentence out of a string.
It wouldn't be such a problem if there were no abbreviations that end with a dot.
So my example is:
string = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.'
And the result should be:
result = 'I like cheese, cars, etc. but my the most favorite website is stackoverflow.'
Normally I would do it with:
re.findall(r'^(\s*.*?\s*)(?:\.|$)', string)
but I would like to skip some predefined words, like the above-mentioned etc.
I came up with a couple of expressions, but none of them worked.

You could try NLTK's Punkt sentence tokenizer, which does this kind of thing using a real algorithm to figure out what the abbreviations are instead of your ad-hoc collection of abbreviations.
NLTK includes a pre-trained one for English; load it with:
nltk.data.load('tokenizers/punkt/english.pickle')
From the source code:
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print '\n-----\n'.join(sent_detector.tokenize(text.strip()))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.
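Applied to the question's example, a minimal sketch might look like this (it assumes the pretrained punkt model data has been installed, e.g. via nltk.download('punkt')):
import nltk

# Load the pretrained English Punkt model and take the first sentence.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
text = ('I like cheese, cars, etc. but my the most favorite website is '
        'stackoverflow. My new horse is called Randy.')
print(sent_detector.tokenize(text)[0])
Since the pretrained English model already treats "etc." as an abbreviation, this should print the whole first sentence up to "stackoverflow.".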

How about looking for the first capital letter after a sentence-ending character? It's not foolproof, of course.
import re
r = re.compile(r"^(.+?[.?!])\s*[A-Z]")
print(r.match('I like cheese, cars, etc. but my the most favorite website is stackoverflow. My new horse is called Randy.').group(1))
outputs
'I like cheese, cars, etc. but my the most favorite website is stackoverflow.'
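If you also need the sentences after the first one, a variant of the same idea uses a lookahead so the capital letter is not consumed by the match; this is only a sketch and no more foolproof than the original:
import re

# The lookahead (?=[A-Z]|$) checks for the next capital letter (or the
# end of the string) without consuming it, so findall returns every
# sentence instead of just the first.
r = re.compile(r'(.+?[.?!])\s*(?=[A-Z]|$)')
text = ('I like cheese, cars, etc. but my the most favorite website is '
        'stackoverflow. My new horse is called Randy.')
print(r.findall(text))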

Related

Removing "\n"s when printing sentences from text file in python?

I am trying to print a list of sentences from a text file (one of the Project Gutenberg eBooks). When I print the file as a single string, it looks fine:
file = open('11.txt','r+')
alice = file.read()
print(alice[:500])
Output is:
ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d
Now, when I split it into sentences (the assignment was specifically to do this by "splitting at the periods," so it's a very simplified split), I get this:
>>> print(sentences[:5])
["ALICE'S ADVENTURES IN WONDERLAND\n\nLewis Carroll\n\nTHE MILLENNIUM FULCRUM EDITION 3", '0\n\n\n\n\nCHAPTER I', " Down the Rabbit-Hole\n\nAlice was beginning to get very tired of sitting by her sister on the\nbank, and of having nothing to do: once or twice she had peeped into the\nbook her sister was reading, but it had no pictures or conversations in\nit, 'and what is the use of a book,' thought Alice 'without pictures or\nconversations?'\n\nSo she was considering in her own mind (as well as she could, for the\nhot day made her feel very sleepy and stupid), whether the pleasure\nof making a daisy-chain would be worth the trouble of getting up and\npicking the daisies, when suddenly a White Rabbit with pink eyes ran\nclose by her", "\n\nThere was nothing so VERY remarkable in that; nor did Alice think it so\nVERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\nOh dear! I shall be late!' (when she thought it over afterwards, it\noccurred to her that she ought to have wondered at this, but at the time\nit all seemed quite natural); but when the Rabbit actually TOOK A WATCH\nOUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,\nAlice started to her feet, for it flashed across her mind that she had\nnever before seen a rabbit with either a waistcoat-pocket, or a watch\nto take out of it, and burning with curiosity, she ran across the field\nafter it, and fortunately was just in time to see it pop down a large\nrabbit-hole under the hedge", '\n\nIn another moment down went Alice after it, never once considering how\nin the world she was to get out again']
Where do the extra "\n" characters come from and how can I remove them?
If you want to replace all the newlines with one space, do this:
import re
new_sentences = [re.sub(r'\n+', ' ', s) for s in sentences]
You may not want to replace all of the newlines, though; I would do:
import re
new_sentences = []
for s in sentences:
    new_sentences.append(re.sub(r'\n{2,}', '\n', s))
This should replace all instances of two or more '\n' with a single newline, so you still have newlines, but don't have "extra" newlines.
If you want to avoid creating a new list and instead modify the existing one (credit to @gavriel and Andrew L.: I hadn't thought of using enumerate when I first posted my answer):
import re
for i, s in enumerate(sentences):
    sentences[i] = re.sub(r'\n{2,}', '\n', s)
The extra newlines aren't really extra; they are meant to be there and are visible in the text in your question: the more '\n' there are, the more space is visible between the lines of text (i.e., one between the chapter heading and the first paragraph, and many between the edition and the chapter heading).
You'll understand where the \n characters come from with this little example:
alice = """ALICE'S ADVENTURES IN WONDERLAND
Lewis Carroll
THE MILLENNIUM FULCRUM EDITION 3.0
CHAPTER I. Down the Rabbit-Hole
Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, 'and what is the use of a book,' thought Alice 'without pictures or
conversations?'
So she was considering in her own mind (as well as she could, for the
hot d"""
print(len(alice.split(".")))
print(len(alice.split("\n")))
It all depends on the way you're splitting your text; the above example gives this output:
3
19
This means there are 3 substrings if you split the text using . as the separator, or 19 substrings if you split using \n. You can read more in the documentation for str.split.
In your case you've split the text using ., so the 3 substrings contain multiple newline characters \n. To get rid of them you can either split these substrings again or remove them with str.replace.
The text uses newlines as well as full stops to delimit sentences. You have an issue where just replacing the newline characters with an empty string will leave words without spaces between them. Before you split alice on '.', I would use something along the lines of @elethan's solution to replace all of the runs of newlines in alice with a '.'. Then you can do alice.split('.') and the sentences separated by multiple newlines will be split appropriately, along with the sentences separated by '.' initially.
Then your only issue is the decimal point in the version number.
with open('11.txt', 'r+') as f:
    lines = f.read().split('\n')
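Putting the suggestions above together, here is one sketch of the whole pipeline; the lookahead that protects the decimal point in the version number is my own assumption, not something stated in the answers:
import re

with open('11.txt', 'r+') as f:
    alice = f.read()

# Turn runs of blank lines into sentence breaks and collapse the
# remaining single newlines into spaces...
normalized = re.sub(r'\n{2,}', '. ', alice).replace('\n', ' ')
# ...then split on periods not followed by a digit, so "3.0" survives.
sentences = [s.strip() for s in re.split(r'\.(?!\d)', normalized) if s.strip()]
print(sentences[:5])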

NLTK Sentence Tokenizer, custom sentence starters

I'm trying to split a text into sentences with the PunktSentenceTokenizer from nltk. The text contains listings starting with bullet points, but they are not recognized as new sentences. I tried to add some parameters but that didn't work. Is there another way?
Here is some example code:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktParameters
params = PunktParameters()
params.sent_starters = set(['•'])
tokenizer = PunktSentenceTokenizer(params)
tokenizer.tokenize('• I am a sentence • I am another sentence')
['• I am a sentence • I am another sentence']
You can subclass PunktLanguageVars and adapt the sent_end_chars attribute to fit your needs like so:
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktLanguageVars
class BulletPointLangVars(PunktLanguageVars):
    sent_end_chars = ('.', '?', '!', '•')

tokenizer = PunktSentenceTokenizer(lang_vars=BulletPointLangVars())
tokenizer.tokenize(u"• I am a sentence • I am another sentence")
This will result in the following output:
['•', 'I am a sentence •', 'I am another sentence']
However, this makes • a sentence end marker, while in your case it is more of a sentence start marker. Thus this example text:
Look at these sentences:
• I am sentence one
• I am sentence two
But I am one, too!
Would, depending on the details of your text, result in something like the following:
>>> tokenizer.tokenize("""
Look at these sentences:
• I am sentence one
• I am sentence two
But I am one, too!
""")
['\nLook at these sentences:\n\n•', 'I am sentence one\n•', 'I am sentence two\n\nBut I am one, too!\n']
One reason why PunktSentenceTokenizer is used for sentence tokenization, instead of simply employing something like a multi-delimiter split function, is that it can learn to distinguish between punctuation that ends a sentence and punctuation used for other purposes, as in "Mr.", for example.
There should, however, be no such complications for •, so I would advise you to write a simple parser to preprocess the bullet-point formatting instead of abusing PunktSentenceTokenizer for something it is not really designed for.
How this might be achieved in detail is dependent on how exactly this kind of markup is used in the text.
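For example, a minimal preprocessing sketch (assuming • only ever appears as a list marker) could split on the bullets first and let Punkt handle the ordinary punctuation inside each chunk:
import re
from nltk.tokenize.punkt import PunktSentenceTokenizer

def bullet_sentences(text):
    tokenizer = PunktSentenceTokenizer()
    # Split on the bullet markers first, then tokenize each chunk on
    # normal sentence punctuation.
    chunks = [c.strip() for c in re.split(r'\s*•\s*', text) if c.strip()]
    sentences = []
    for chunk in chunks:
        sentences.extend(tokenizer.tokenize(chunk))
    return sentences

print(bullet_sentences('• I am a sentence • I am another sentence'))
# ['I am a sentence', 'I am another sentence']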

Split Sentence on Punctuation or Camel-Case

I have a very long string in Python and I'm trying to break it up into a list of sentences. Only some of the sentences are missing punctuation and spaces between them.
Example
I have 9 sheep in my garageVideo games are super cool.
I can't figure out the regex to separate the two! It's driving me nuts.
There are properly punctuated sentences as well, so I thought I'd make several different regex patterns, each splitting off different styles of combination.
Input
I have 9 sheep in my garageVideo games are super cool. Some peanuts can sing, though they taste a whole lot better than they sound!
Output
['I have 9 sheep in my garage',
 'Video games are super cool.',
 'Some peanuts can sing, though they taste a whole lot better than they sound!']
Thanks!
Position Split: Use the regex module
I will give you both a "Split" and a "Match All" option. Let's start with "Split".
In many engines you can split at a position defined by a zero-width match; Python's re module only gained that ability in Python 3.7.
In Python, to split on a position, I would use Matthew Barnett's outstanding regex module, whose features far outstrip those of Python's default re engine. That is my default regex engine in Python.
With your input, you can use this regex:
(?V1)(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])
Note that if you had strangely-formatted acronyms such as B. B. C., we would need to tweak this.
Sample Python code:
import regex

string = "I have 9 sheep in my garageVideo games are super cool. Some peanuts can sing, though they taste a whole lot better than they sound!"
result = regex.split("(?V1)(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])", string)
print(result)
Output:
['I have 9 sheep in my garage',
'Video games are super cool.',
'Some peanuts can sing, though they taste a whole lot better than they sound!']
Explanation
(?V1) instructs the engine to use the new behavior, where we can split on zero-width matches.
(?<=[a-z])(?=[A-Z]) matches a position where the lookbehind (?<=[a-z]) can assert that what precedes is a lower-case letter and the lookahead (?=[A-Z]) can assert that what follows is an uppercase letter.
| OR...
(?<=[.!?]) +(?=[A-Z]) matches one or more spaces, where the lookbehind (?<=[.!?]) asserts that what precedes is a dot, bang, or question mark, and the lookahead (?=[A-Z]) asserts that what follows is a capital letter.
Option 2: Use findall (again with the regex module)
Since the "Split" and "Match All" operations are two sides of the same coin, you can do this:
print(regex.findall(r".+?(?:(?<=[.!?])|(?<=[a-z])(?=[A-Z]))", string))
Since both lookbehinds here are fixed-width, this findall also works with the standard re module; it is only the zero-width split above that needs regex (or Python 3.7+).
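If you would rather stay with the standard library, here is a sketch of the same split using only re (it relies on re.split accepting zero-width patterns, which requires Python 3.7 or later):
import re

string = ("I have 9 sheep in my garageVideo games are super cool. "
          "Some peanuts can sing, though they taste a whole lot better "
          "than they sound!")

# Split either at a lowercase-to-uppercase boundary or on the
# whitespace that follows a sentence-ending character.
parts = re.split(r"(?<=[a-z])(?=[A-Z])|(?<=[.!?])\s+(?=[A-Z])", string)
print(parts)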

String Replacement and Matching in Python 2

I have user posts that I would like to match up with a predetermined list of patterns (see example). If a post matches a pattern, I would like to write the post and the pattern to a file. What is the best way to do this? So far I've only thought of brute-forcing it with four for loops and then doing some comparisons. I already have the lists of all the data I need; below are just some very simple examples to give you an idea of what I am looking for.
Example
Posts:
posts = ['When I ate at McDonald\'s, I felt sick.',
         'I like eating at Burger King.',
         'Wendy\'s made me feel happy.']
Pattern:
patterns = ['When I ate at [RESTAURANT]',
            'I like eating at [RESTAURANT]',
            '[RESTAURANT] made me feel [FEELING]',
            'I felt [FEELING]']
Lists:
restaurant_names = ['McDonald\'s', 'Burger King', 'Wendy\'s']
feelings = ['happy', 'sick', 'tired']
OutputFile:
When I ate at [RESTAURANT], When I ate at McDonald's, I felt sick.
I felt [FEELING], When I ate at McDonald's, I felt sick.
[RESTAURANT] made me feel [FEELING], Wendy's made me feel happy.
I like eating at [RESTAURANT], I like eating at Burger King.
Sorry for the formatting, but this is my first post on Stack Overflow after lurking for a while. Thanks in advance for the help!
How about something like this:
>>> sentences = ["When I ate at McDonald's, I felt sick.",
...              'I like eating at Burger King.',
...              "Wendy's made me feel happy."]
>>> patterns = {"McDonald's": "[RESTAURANT]", "Burger King": "[RESTAURANT]",
...             "Wendy's": "[RESTAURANT]", "happy": "[FEELING]", "sick": "[FEELING]",
...             "tired": "[FEELING]"}
Then you can do
>>> for sentence in sentences:
...     replaced = sentence
...     for pattern in patterns:
...         if pattern in sentence:
...             replaced = replaced.replace(pattern, patterns[pattern])
...     print sentence
...     print replaced
...
...
When I ate at McDonald's, I felt sick.
When I ate at [RESTAURANT], I felt [FEELING].
I like eating at Burger King.
I like eating at [RESTAURANT].
Wendy's made me feel happy.
[RESTAURANT] made me feel [FEELING].
This still needs some work (for example, right now, the word carsick would become car[FEELING]), and you might want to avoid all the repetition in the patterns values by creating another list of replacement texts that you can refer to by index, but perhaps this is enough to get you started?
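To deal with the carsick problem, one sketch is to build a single word-boundary-anchored alternation per placeholder; the replacements dict below is illustrative, not part of the original question:
import re

# \b ensures 'sick' no longer matches inside 'carsick'.
replacements = {'[RESTAURANT]': ["McDonald's", 'Burger King', "Wendy's"],
                '[FEELING]': ['happy', 'sick', 'tired']}

def templatize(sentence):
    for placeholder, words in replacements.items():
        pattern = r'\b(?:' + '|'.join(re.escape(w) for w in words) + r')\b'
        # The placeholders contain no backslashes, so they are safe to
        # use directly as replacement strings.
        sentence = re.sub(pattern, placeholder, sentence)
    return sentence

print(templatize("When I ate at McDonald's, I felt carsick."))
# When I ate at [RESTAURANT], I felt carsick.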
I'm not sure I understand. Could you please post the exact code you have so far, what you intend to do and why? Thanks.
In general, there are 4 alternatives:
1) Use a single, but complex, RegEx pattern and strict lists
r"(When I ate at (?P<rest1>McDonald's|Burger King|Wendy's), I felt (?P<feel1>happy|sick|tired)\.)|(I like eating at (?P<rest2>McDonald's|Burger King|Wendy's)\.)"
Analysis of the named capture groups rest1, feel1, and rest2 lets you determine which sentence type was matched, if you need that; otherwise you can output the whole match. The pattern can, of course, be assembled programmatically from your lists (see the sketch after this list); just be careful to apply re.escape() when concatenating elements.
2) Use a single, but complex, RegEx pattern and loose lists
r"(When I ate at (?P<rest1>[^,]+), I felt (?P<feel1>[a-z]+).)|(I like eating at (?P<rest2>[^.]+)\.)"
This has the advantage that you can capture new restaurant names, feelings, etc. The disadvantage is the dependency on punctuation and grammar. For example, the first pattern would not recognize a restaurant name with an embedded comma.
3) Do what you are probably already doing. Natural-language analysis is much more complex than what regexes can do by themselves.
4) If it's not just about a few fixed patterns, but about analyzing the meaning of a post regardless of specific wording, then you should use NLTK as other posters have suggested.
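For option 1, here is a sketch of assembling the strict-list pattern programmatically from the question's lists (re.escape keeps each list entry literal):
import re

restaurant_names = ["McDonald's", 'Burger King', "Wendy's"]
feelings = ['happy', 'sick', 'tired']

rest = '|'.join(re.escape(name) for name in restaurant_names)
feel = '|'.join(re.escape(word) for word in feelings)
pattern = re.compile(
    r"(When I ate at (?P<rest1>%s), I felt (?P<feel1>%s)\.)"
    r"|(I like eating at (?P<rest2>%s)\.)" % (rest, feel, rest))

m = pattern.search("When I ate at McDonald's, I felt sick.")
if m:
    print(m.group('rest1'), m.group('feel1'))  # McDonald's sick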

Regular Expressions: Match up to a word or a maximum number of words

I want to look for a phrase, match up to a few words following it, but stop early if I find another specific phrase.
For example, I want to match up to three words following "going to the", but stop the matching process if I encounter "to try". So for example "going to the luna park" will result with "luna park"; "going to the capital city of Peru" will result with "capital city of" and "going to the moon to try some cheesecake" will result with "moon".
Can it be done with a single, simple regular expression (preferably in Python)? I've tried all the combinations I could think of, but failed miserably :).
This one matches up to 3 ({1,3}) words following going to the as long as they are not followed by to try ((?!to try)):
import re

infile = open("input", "r")
for line in infile:
    m = re.match(r"going to the ((?:\w+\s*(?!to try)){1,3})", line)
    if m:
        print(m.group(1).rstrip())
Output
luna park
capital city of
moon
I think you are looking for a way to extract proper nouns out of sentences. You should look at NLTK for a proper approach. Regexes are only helpful for limited, rigid grammars; you seem to be asking for the ability to parse human language, which is non-trivial (for computers).
