Split Sentence on Punctuation or Camel-Case - python

I have a very long string in Python and I'm trying to break it up into a list of sentences. Only, some of these sentences are missing punctuation and spaces between them.
Example
I have 9 sheep in my garageVideo games are super cool.
I can't figure out the regex to separate the two! It's driving me nuts.
There are properly punctuated sentences as well, so I thought I'd make several different regex patterns, each splitting off a different style of combination.
Input
I have 9 sheep in my garageVideo games are super cool. Some peanuts can sing, though they taste a whole lot better than they sound!
Output
['I have 9 sheep in my garage',
'Video games are super cool.',
'Some peanuts can sing, though they taste a whole lot better than they sound!']
Thanks!

Position Split: Use the regex module
I will give you both a "Split" and a "Match All" option. Let's start with "Split".
In many engines, but not Python's re module before Python 3.7, you can split at a position defined by a zero-width match.
In Python, to split on a position, I would use Matthew Barnett's outstanding regex module, whose features far outstrip those of Python's default re engine. That is my default regex engine in Python.
With your input, you can use this regex:
(?V1)(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])
Note that if you had strangely-formatted acronyms such as B. B. C., we would need to tweak this.
Sample Python Code:
import regex  # third-party module (pip install regex), not the built-in re
string = "I have 9 sheep in my garageVideo games are super cool. Some peanuts can sing, though they taste a whole lot better than they sound!"
result = regex.split(r"(?V1)(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])", string)
print(result)
Output:
['I have 9 sheep in my garage',
'Video games are super cool.',
'Some peanuts can sing, though they taste a whole lot better than they sound!']
Explanation
(?V1) instructs the engine to use the new behavior, where we can split on zero-width matches.
(?<=[a-z])(?=[A-Z]) matches a position where the lookbehind (?<=[a-z]) can assert that what precedes is a lower-case letter and the lookahead (?=[A-Z]) can assert that what follows is an uppercase letter.
| OR...
(?<=[.!?]) +(?=[A-Z]) matches one or more spaces ( +) where the lookbehind (?<=[.!?]) can assert that what precedes is a dot, bang or question mark, and where the lookahead (?=[A-Z]) can assert that what follows is a capital letter.
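For readers who cannot install the third-party regex module, here is a minimal sketch that uses only the standard library. It assumes Python 3.7 or later, where re.split accepts patterns that can match an empty string, and it drops the (?V1) flag, which is specific to regex. The function name split_sentences is made up for illustration.

```python
import re

# Same two alternatives as above, minus the regex-only (?V1) flag.
BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z])|(?<=[.!?]) +(?=[A-Z])")

def split_sentences(text):
    """Split at a lower-to-upper case change, or at spaces that follow . ! ?"""
    # Assumption: Python 3.7+; older versions do not split on empty matches.
    return BOUNDARY.split(text)

sample = ("I have 9 sheep in my garageVideo games are super cool. "
          "Some peanuts can sing, though they taste a whole lot better "
          "than they sound!")

for sentence in split_sentences(sample):
    print(sentence)
```

Note that the first alternative also splits inside camel-case words such as product names, so this is only a starting point for real text.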
Option 2: Use findall (again with the regex module)
Since the "Split" and "Match All" operations are two sides of the same coin, you can do this:
print(regex.findall(r".+?(?:(?<=[.!?])|(?<=[a-z])(?=[A-Z]))",string))
Again, this would not work with re (which would skip the V that starts the second sentence Video).

Related

Python regex A|B|C matches C even though B should match

I've been sitting on this problem for several hours now and I really don't know anymore...
Essentially, I have an A|B|C - type separated regex and for whatever reason C matches over B, even though the individual regexes should be tested from left to right and stopped in a non-greedy fashion (i.e. once a match is found, the other regexes are not tested anymore).
This is my code:
import re
text = 'Patients with end stage heart failure fall into stage D of the ABCD classification of the American College of Cardiology (ACC)/American Heart Association (AHA), and class III–IV of the New York Heart Association (NYHA) functional classification; they are characterised by advanced structural heart disease and pronounced symptoms of heart failure at rest or upon minimal physical exertion, despite maximal medical treatment according to current guidelines.'
expansion = "American Heart Association"
# Three alternatives: exact phrase | phrase after a non-word char | wildcard between first and last word
re_exp = re.compile(expansion + "|" + r"(?<=\W)" + expansion + "|"
                    + expansion.split()[0] + r"[-\s].*?\s*?" + expansion.split()[-1])
m = re_exp.search(text)
print(m.group(0))
I want the regex to find the "expansion" string. In my dataset, sometimes the text has the expansion string slightly edited, for example having articles or prepositions like "for" or "the" between the main nouns. This is why I first try to just match the string as is, then try to match it if it comes after any non-word character (i.e. parentheses or, like in the example above, a whole lot of stuff as the space was omitted) and finally, I just go full wild-card to find the string by searching for the beginning and ending of the string with wildcards in between.
Either way, with the example above I would expect to get the following output:
American Heart Association
but what I'm getting is
American College of Cardiology (ACC)/American Heart Association
which is the match for the final regex.
If I delete the final regex or just call re.findall(r"(?<=\W)"+ expansion, text), I get the output I want, meaning the regex is in fact matching properly.
What gives?
The regex looks like this:
American Heart Association|(?<=\W)American Heart Association|American[-\s].*?\s*?Association
The first 2 alternatives match the same text, only the second one has a positive lookbehind prepended.
You can omit that second alternative, as the first alternative without any assertions has either already matched it, or the second part will also not match it if the first one did not match it.
As the pattern matches from left to right and encounters the first occurrence with American, the first and the second alternatives can not match American College of Cardiology.
Then the third alternation can match it, and due to the .*? it can stretch until the first occurrence of Association.
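The point about leftmost positions can be checked with a small sketch. The text below is a shortened version of the question's sentence, and the variable names are illustrative only. It compares where each alternative can start matching on its own, then shows that the combined pattern reports the leftmost start.

```python
import re

text = ("the American College of Cardiology (ACC)/American Heart "
        "Association (AHA)")

exact = re.compile(r"American Heart Association")
wildcard = re.compile(r"American[-\s].*?\s*?Association")
combined = re.compile(exact.pattern + "|" + wildcard.pattern)

exact_match = exact.search(text)
wildcard_match = wildcard.search(text)
combined_match = combined.search(text)

# The wildcard alternative can already start at the first "American",
# which lies to the left of where the exact phrase begins.
print(exact_match.start(), wildcard_match.start(), combined_match.start())
print(combined_match.group(0))
```

The order of the alternatives only matters when several of them can match at the same starting position; an alternative that succeeds at an earlier position wins regardless of where it appears in the pattern.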
What you might do is for example exclude possible characters to match using a negated character class:
\bAmerican\b[^/,.]*\bAssociation\b
Regex demo
Or you might use a tempered greedy token approach to not allow specific words between the first and last part:
\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b
Regex demo
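As a rough check of the two patterns above, the following sketch runs them against a shortened version of the question's text and against a hypothetical edited expansion ("American National Heart Association", invented here for illustration).

```python
import re

text = ("the American College of Cardiology (ACC)/American Heart "
        "Association (AHA), and the New York Heart Association (NYHA)")
edited = "the American National Heart Association"  # made-up variant

# Negated character class: the gap may not contain / , or .
negated_class = re.compile(r"\bAmerican\b[^/,.]*\bAssociation\b")

# Tempered greedy token: the gap may not run into another
# "American" or "Association".
tempered = re.compile(
    r"\bAmerican\b(?:(?!American\b|Association\b).)*\bHeart Association\b"
)

for pattern in (negated_class, tempered):
    print(pattern.search(text).group(0))
    print(pattern.search(edited).group(0))
```

Both patterns skip the first "American" in the original text because the gap they allow cannot reach an "Association" from there.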
So re.findall(r"(?<=\W)"+ expansion, text) works because the character before the match, "/", is a non-word character (denoted \W). Your full regex, however, will also match "American [whatever random stuff here] Association". This means you match "American College of Cardiology (ACC)/American Heart Association" before you match the inner string "American Heart Association". E.g. if you deleted the first "American" in your string, you would get the match you are looking for with your regex.
You need to be more restrictive with your regex to rule out situations like these.

I don't understand the usefulness of len() in this exercise

I started learning Python a few days ago and I'm training myself on Codewars. One of the exercises was to calculate how many times a given word appears in the sentences. I did it my way, but in the solutions some people are doing it this way:
import re
def sum_of_a_beach(beach):
    return len(re.findall('Sand|Water|Fish|Sun', beach, re.IGNORECASE))
I understand most of it but I don't understand why len() is used.
re.findall('Sand|Water|Fish|Sun', beach, re.IGNORECASE) finds all the occurrences of the words (no word boundary, that is...).
len just counts those occurrences.
Using count on beach would work too, but you'd have to lowercase the string and loop over the words. The regex avoids the lowercase conversion, and the loop is done by the | alternation.
If you test it with:
s = "The sand is touching the water, it's fishy"
You'll get 3 occurrences. Maybe it's not what you want. So while you're using regular expressions, maybe you want to add the "word only" feature:
def sum_of_a_beach(beach):
    return len(re.findall(r'\b(Sand|Water|Fish|Sun)\b', beach, re.IGNORECASE))
This will only match whole words, thanks to the \b word boundary delimiter.
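To make the role of len() concrete, here is a small sketch that prints the list findall returns and compares three ways of counting. The helper names sum_of_a_beach_count and sum_of_a_beach_words are made up for this illustration; they are not part of the kata.

```python
import re

def sum_of_a_beach(beach):
    # findall returns a list of matched substrings; len() counts them.
    matches = re.findall("Sand|Water|Fish|Sun", beach, re.IGNORECASE)
    return len(matches)

def sum_of_a_beach_count(beach):
    # The str.count alternative: lowercase once, then loop over the words.
    lowered = beach.lower()
    return sum(lowered.count(word) for word in ("sand", "water", "fish", "sun"))

def sum_of_a_beach_words(beach):
    # Whole words only, using \b on both sides.
    return len(re.findall(r"\b(?:Sand|Water|Fish|Sun)\b", beach, re.IGNORECASE))

s = "The sand is touching the water, it's fishy"
print(re.findall("Sand|Water|Fish|Sun", s, re.IGNORECASE))
print(sum_of_a_beach(s), sum_of_a_beach_count(s), sum_of_a_beach_words(s))
```

The first two functions count "fish" inside "fishy"; the word-boundary version does not.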

Regular expression pattern questions?

I am having a hard time understanding regular expression patterns. Could someone help me write a regular expression pattern to match all words ending in s, and one to match words that start with a and end with a (like ana)?
How do I write ending?
Word boundaries are given by \b, so the following regex matches words ending with ing or s: "\b(\w+?(?:ing|s))\b", where \b is a word boundary, \w+ is one or more "word characters" and (?:ing|s) is a non-capturing group of either ing or s.
As you asked "how to develop a regex":
First: Don't use regex for complex tasks. They are hard to read, write and maintain. For example, there is a regex that validates email addresses, but it's computer generated and nothing you should use in practice.
Start simple and add edge cases. At the beginning, plan what characters you need to use: You said you need words ending with s or ing. So you probably need something to represent a word, endings of words and the literal characters s and ing. What is a word? This might change from case to case, but it includes at least every alphabetical character. Looking in the Python documentation on regexes, you can find \w, which is [a-zA-Z0-9_] for ASCII text (in Python 3 it also matches other Unicode letters and digits unless you pass re.ASCII), which fits my impression of a word character. There you can also find \b, which is a word boundary.
So the "first pseudo code try" is something like \b\w...\w\b, which matches a word. We still need to "formalize" the ..., which should mean "one or more characters"; that translates directly to \b\w+\b. We can now match a word! We still need the s or ing. | translates to or, so how about the following: \b\w+ing|s\b? If you test this, you'll see that it matches confusing things, like the sing in singer or just the final s of words, which is not what we want. What is happening? As you probably already saw, the | can't know "which part it should or", so we need to introduce parentheses: \b\w+(ing|s)\b. Congratulations, you have now arrived at a working regex!
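A quick sketch of why the parentheses matter, using a made-up test string. Without a group, | separates the whole left side (\b\w+ing) from the whole right side (s\b).

```python
import re

text = "singer words"

# Read as: (\b\w+ing) OR (s\b)
ungrouped = re.findall(r"\b\w+ing|s\b", text)

# The group limits the alternation to the two endings.
grouped = re.findall(r"\b\w+(?:ing|s)\b", text)

print(ungrouped)
print(grouped)
```

The ungrouped pattern picks up "sing" from the start of "singer" (there is no trailing \b on that side) and a lone "s" from the end of "words"; the grouped pattern only returns "words".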
Why (and how) does this differ from the example I gave first? First, I wrote \w+? instead of \w+; the ? turns the + into a non-greedy version. If you know the difference between greedy and non-greedy, skip this paragraph. Consider the string AaAAbA, where we want to match the things enclosed by a capital A. A naive try is A\w+A: one or more word characters enclosed by A. This can match AaA, but also AaAAbA, because A itself is something \w can match. Without further configuration, the quantifiers *, + and ? all try to match as much as possible. Sometimes, as in the A example, you don't want that; you can then put a ? after the quantifier to signal that you want the non-greedy version, which matches as little as possible.
But in our case this isn't needed: the words are well separated by whitespace, which is not part of \w. So in fact you can just let + be greedy and everything will be all right. If you use . (any character), you often need to be careful not to match too much.
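The greedy versus non-greedy difference on the AaAAbA example can be seen directly with findall. This is just an illustrative sketch; the variable names are arbitrary.

```python
import re

text = "AaAAbA"

greedy = re.findall(r"A\w+A", text)   # \w+ grabs as much as it can
lazy = re.findall(r"A\w+?A", text)    # \w+? stops at the first closing A

print(greedy)
print(lazy)

# For whitespace-separated words the two forms agree, as the text says.
words = "fishing words"
same = (re.findall(r"\b\w+?(?:ing|s)\b", words)
        == re.findall(r"\b\w+(?:ing|s)\b", words))
print(same)
```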
The other difference is using (?:s|ing) instead of (s|ing). What does the ?: do here? It changes a capturing group to a non capturing group. Generally you don't want to get "everything" from the regex. Consider the following regex: I want to go to \w+. You are not interested in the whole sentence, but only in the \w+, so you can capture it in a group: I want to go to (\w+). This means that you are interested in this specific piece of information and want to retrieve it later. Sometimes (like when using |) you need to group expressions together, but are not interested in their content, you can then declare it as non capturing. Otherwise you will get the group (s or ing) but not the actual word!
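Here is a short sketch of how capturing and non-capturing groups change what re.findall and a match object give back. The sentence with "Lisbon" is a made-up input for the I want to go to (\w+) example.

```python
import re

text = "fishing words"

# With a capturing group, findall returns only the group's content.
captured = re.findall(r"\b\w+(ing|s)\b", text)

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r"\b\w+(?:ing|s)\b", text)

print(captured)
print(whole)

# Capturing is useful when only part of the match is of interest.
m = re.search(r"I want to go to (\w+)", "I want to go to Lisbon")
print(m.group(0))  # the full match
print(m.group(1))  # just the captured destination
```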
So to summarize:
* start small
* add one case after another
* always test with examples
In fact I just tried re.findall(\b\w+(?:ing|s)\b, "fishing words") and it didn't work. \w+(?:ing|s) works. I've no idea why; maybe someone else can explain that. Regexes are an arcane thing; only use them for easy and easy-to-test tasks.
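One likely explanation, assuming the pattern was written as an ordinary (non-raw) string literal: in a normal Python string, "\b" is the backspace character, so the regex engine never sees a word-boundary token. The sketch below uses a simpler pattern to show the difference between the two kinds of literal.

```python
import re

text = "fishing words"

plain = "\bwords\b"   # "\b" becomes a backspace character (\x08)
raw = r"\bwords\b"    # the backslash and the b both reach the regex engine

print(len(plain), len(raw))
print(re.findall(plain, text))
print(re.findall(raw, text))
```

This would also explain why \w+(?:ing|s) on its own worked: \w is not a recognized string escape, so it is passed through to the regex engine unchanged. Writing patterns as raw strings (r"...") avoids the problem.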
Generally speaking I'd use \b to match "word boundaries" with \w which matches word components (short cut for [A-Za-z0-9_]). Then you can do an or grouping to match "s" or "ing". Result is:
/\b\w+(s|ing)\b/
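The question also asks about words that start with a and end with a. A possible sketch follows, assuming a "word" means a run of \w characters and that a lone "a" should not count; the sample sentence is made up.

```python
import re

text = "ana has an agenda and a banana, plus cats and dogs"

# Words ending in s: any word characters, then a final s.
ends_with_s = re.findall(r"\b\w*s\b", text)

# Words that start with a and end with a (at least two letters).
a_to_a = re.findall(r"\ba\w*a\b", text)

print(ends_with_s)
print(a_to_a)
```

If a single-letter "a" should match as well, the second pattern could be written as \ba(?:\w*a)?\b instead.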

How to add tags to negated words in strings that follow "not", "no" and "never"

How do I add the tag NEG_ to all words that follow not, no and never until the next punctuation mark in a string (used for sentiment analysis)? I assume that regular expressions could be used, but I'm not sure how.
Input: It was never going to work, he thought. He did not play so well, so he had to practice some more.
Desired output: It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more.
Any idea how to solve this?
To make up for Python's re regex engine's lack of some Perl abilities, you can use a lambda expression in a re.sub function to create a dynamic replacement:
import re
string = "It was never going to work, he thought. He did not play so well, so he had to practice some more. Not foobar !"
# Outer sub finds "keyword ... punctuation"; the lambda tags each word inside it.
transformed = re.sub(r'\b(?:not|never|no)\b[\w\s]+[^\w\s]',
                     lambda match: re.sub(r'(\s+)(\w+)', r'\1NEG_\2', match.group(0)),
                     string,
                     flags=re.IGNORECASE)
print(transformed)
Will print (demo here)
It was never NEG_going NEG_to NEG_work, he thought. He did not NEG_play NEG_so NEG_well, so he had to practice some more. Not NEG_foobar !
Explanation
The first step is to select the parts of your string you're interested in. This is done with
\b(?:not|never|no)\b[\w\s]+[^\w\s]
Your negative keyword (\b is a word boundary, (?:...) a non-capturing group), followed by alphanumerics and spaces (\w is [0-9a-zA-Z_], \s is any kind of whitespace), up until something that's neither an alphanumeric nor a space (acting as punctuation).
Note that the punctuation is mandatory here, but you could safely remove [^\w\s] to match end of string as well.
Now you're dealing with never going to work, kind of strings. Just select the words preceded by spaces with
(\s+)(\w+)
And replace them with what you want
\1NEG_\2
I would not do this with regexp. Rather, I would:
* Split the input on punctuation characters.
* For each fragment:
  * Set the negation counter to 0.
  * Split the fragment into words.
  * For each word:
    * Add negation-counter-many NEG_ prefixes to the word (or mod 2, or 1 if greater than 0).
    * If the original word is in {No, Never, Not}, increase the negation counter by one.
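The steps above can be sketched roughly as follows. This is one possible reading, not the answerer's code: it uses the "1 if greater than 0" variant (a simple flag instead of a counter), still relies on re.split to keep the punctuation and whitespace so the text can be reassembled, and the function name and punctuation set are assumptions.

```python
import re

NEGATIONS = {"no", "never", "not"}

def tag_negations_by_splitting(text):
    # Split on punctuation; the capturing group keeps the marks in the list.
    fragments = re.split(r"([.,:;!?])", text)
    rebuilt = []
    for fragment in fragments:
        negated = False  # reset for every fragment
        # Split into words, keeping the whitespace so spacing is preserved.
        tokens = re.split(r"(\s+)", fragment)
        for i, token in enumerate(tokens):
            if not token or token.isspace():
                continue
            if negated:
                tokens[i] = "NEG_" + token
            if token.lower() in NEGATIONS:
                negated = True
        rebuilt.append("".join(tokens))
    return "".join(rebuilt)

sample = ("It was never going to work, he thought. "
          "He did not play so well, so he had to practice some more.")
print(tag_negations_by_splitting(sample))
```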
You will need to do this in several steps (at least in Python - .NET languages can use a regex engine that has more capabilities):
First, match a part of a string starting with not, no or never. The regex \b(?:not?|never)\b([^.,:;!?]+) would be a good starting point. You might need to add more punctuation characters to that list if they occur in your texts.
Then, use the match result's group 1 as the target of your second step: Find all words (for example by splitting on whitespace and/or punctuation) and prepend NEG_ to them.
Join the string together again and insert the result in your original string in the place of the first regex's match.
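A minimal sketch of these three steps, letting re.sub with a replacement function handle the reinsertion. The function names and the re.IGNORECASE flag are assumptions added for illustration; the pattern itself is the one suggested above.

```python
import re

# Step 1: group 1 is the text between the keyword and the next punctuation.
NEGATED_SPAN = re.compile(r"\b(?:not?|never)\b([^.,:;!?]+)", re.IGNORECASE)

def tag_span(match):
    scope = match.group(1)
    # Everything in the match before group 1 is the negation keyword itself.
    keyword = match.group(0)[:len(match.group(0)) - len(scope)]
    # Step 2: prepend NEG_ to every word in the scope.
    tagged = re.sub(r"\w+", lambda word: "NEG_" + word.group(0), scope)
    # Step 3: return the rebuilt piece; re.sub puts it back in place.
    return keyword + tagged

def tag_negations_two_step(text):
    return NEGATED_SPAN.sub(tag_span, text)

sample = ("It was never going to work, he thought. "
          "He did not play so well, so he had to practice some more.")
print(tag_negations_two_step(sample))
```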

Simple Python Regex Find pattern

I have a sentence. I want to find all occurrences of a word that start with a specific character in that sentence. I am very new to programming and Python, but from the little I know, this sounds like a Regex question.
What is the pattern match code that will let me find all words that match my pattern?
Many thanks in advance,
Brock
import re
print(re.findall(r'\bv\w+', thesentence))
will print every word in the sentence that starts with 'v', for example.
Using the split method of strings, as another answer suggests, would not identify words, but space-separated chunks that may include punctuation. This re-based solution does identify words (letters and digits, net of punctuation).
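The difference between the two approaches shows up as soon as the sentence contains punctuation. A small sketch with a made-up sentence:

```python
import re

sentence = "a very vivid view, vividly!"

# Regex: words are runs of word characters, so punctuation is left out.
by_regex = re.findall(r"\bv\w+", sentence)

# str.split: chunks are separated by whitespace only, so punctuation sticks.
by_split = [chunk for chunk in sentence.split() if chunk.startswith("v")]

print(by_regex)
print(by_split)
```

Both versions are case-sensitive as written; a word starting with a capital V would need re.IGNORECASE or a lowercased comparison.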
I second the Dive Into Python recommendation. But it's basically:
m = re.findall(r'\bf.*?\b', 'a fast and friendly dog')
print(m)
\b means word boundary, and .*? ensures we store the whole word, but back off to avoid going too far (technically, ? is called a lazy operator).
You could do (doesn't use re though):
matching_words = [x for x in sentence.split() if x.startswith(CHAR_TO_FIND)]
Regular expressions work too (see the other answers) but I think this solution will be a little more readable, and as a beginner learning Python, you'll find list comprehensions (like the solution above) important to gain a comfort level with.
>>> sentence="a quick brown fox for you"
>>> pattern="fo"
>>> for word in sentence.split():
...     if word.startswith(pattern):
...         print(word)
...
fox
for
Split the sentence on spaces, use a loop to search for the pattern and print them out.
import re
s = "Your sentence that contains the word ROAD"
s = re.sub(r'\bROAD', 'RD.', s)
print(s)
Read: http://diveintopython3.org/regular-expressions.html
