How to match patterns in one sentence using regex in python? - python

Here are 2 examples,
1. I need to take this apple. I just finished the first one.
2. I need to get some sleep. apple is not working.
I want to match the text with need and apple in the same sentence.
Using need.*apple matches both examples, but I want it to match only the first one. How do I change the pattern, or are there other string methods in Python that would work?

The comment posted by @ctwheels about splitting on . and then testing whether each piece contains both apple and need is a good one, and it does not require regular expressions. I would, however, first split each piece again on white space and test the words against the resulting list, so that you do not match against applesauce. But here is a regex solution:
import re
text = """I need to take this apple. I just finished the first one.
I need to get some sleep. apple is not working."""
regex = re.compile(r"""
[^.]* # match 0 or more non-period characters
(
\bneed\b # match 'need' on a word boundary
[^.]* # match 0 or more non-period characters
\bapple\b # match 'apple' on a word boundary
| # or
\bapple\b # match 'apple' on a word boundary
[^.]* # match 0 or more non-period characters
\bneed\b # match 'need' on a word boundary
)
[^.]* # match 0 or more non-period characters
\. # match a period
""", flags=re.VERBOSE)
for m in regex.finditer(text):
    print(m.group(0))
Prints:
I need to take this apple.
Both of these solutions break down if a sentence contains a period that does not end the sentence, as in I need to take John Q. Public's apple. In that case you need a more powerful mechanism for dividing the text into sentences. The regex applied to each sentence then becomes simpler, although splitting on white space still seems to make the most sense.
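The split-based approach described above could look roughly like the following. This is only a sketch: the function name sentences_with_words and the set of punctuation stripped from each token are assumptions made for illustration, and the naive split on . has the same John Q. Public limitation.

```python
def sentences_with_words(text, words):
    """Return the sentences (split naively on '.') that contain every word in `words`."""
    required = set(words)
    matches = []
    for sentence in text.split("."):
        # Split on white space and strip common punctuation, so "apple," counts
        # as "apple" while "applesauce" does not.
        tokens = {tok.strip(",;:!?\"'") for tok in sentence.split()}
        if required <= tokens:
            matches.append(sentence.strip() + ".")
    return matches

text = """I need to take this apple. I just finished the first one.
I need to get some sleep. apple is not working."""
print(sentences_with_words(text, ["need", "apple"]))
# ['I need to take this apple.']
```

Comparing whole tokens is what keeps applesauce from matching; no regular expression is involved.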

Related

Split string from digits/number according to sentence length

I have cases where I need to separate chars/words from digits/numbers that are written consecutively, but only when the char/word length is more than 3.
For example,
input
ferrari03
output must be:
ferrari 03
However, it shouldn't do any action for the followings:
fe03, 03fe, 03ferrari etc.
Can you help me with this one? I'm trying to do this without coding any logic myself, using only the re library in Python.
Using re.sub() we can try:
import re
inp = ["ferrari03", "fe03", "03ferrari", "03fe"]
output = [re.sub(r'^([A-Za-z]{3,})([0-9]+)$', r'\1 \2', i) for i in inp]
print(output) # ['ferrari 03', 'fe03', '03ferrari', '03fe']
Given an input word, the above regex matches if that word begins with 3 or more letters and ends in 1 or more digits. In that case, we capture the letters and digits in the \1 and \2 capture groups, respectively, and replace the match with the two groups separated by a space. If "more than 3" is meant strictly, change {3,} to {4,}.
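If the words appear inside a longer string rather than as separate list items, one possible variation is to swap the ^ and $ anchors for word boundaries. This is a sketch under that assumption; the sample sentence is made up for illustration.

```python
import re

# \b replaces ^ and $ so the pattern can match words inside a longer string.
pattern = re.compile(r'\b([A-Za-z]{3,})([0-9]+)\b')

sentence = "I drove a ferrari03 and a fe03 past 03ferrari"  # hypothetical input
result = pattern.sub(r'\1 \2', sentence)
print(result)  # I drove a ferrari 03 and a fe03 past 03ferrari
```

The leading \b stops the pattern from starting in the middle of 03ferrari, because there is no word boundary between a digit and a letter.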

How to replace single digit by same digit followed by punctuation?

I want to replace any single digit with the same digit followed by punctuation (a comma) using Python regex.
text = 'I am going at 5pm to type 3 and the 9 later'
I want this to be converted to
text = 'I am going at 5pm to type 3, and the 9, later'
My attempt:
match = re.search(r'\s\d{1}\s', x)
I am able to detect them but don't know how to replace each with the same digit followed by a comma.
Regex #1
(?<=\b\d)\b
Replace with ,
How it works:
(?<=\b\d) positive lookbehind ensuring the following precedes:
\b assert position as a word boundary
\d match a digit
\b assert position as a word boundary
To prevent it from matching positions that are already followed by a comma (as in "3,"), simply append (?!,) to the regex.
Regex #2
To prevent matching a single digit at the start and end of the string, you can use the following regex:
(?<=(?<!^)\b\d)\b(?!$)
Same as above regex, but adds following:
(?<!^) ensures the word boundary \b that it precedes doesn't match the start of the line
(?!$) ensures the word boundary \b that it follows doesn't match the end of the line
You can remove either token if that's not the behaviour you want.
To prevent it from matching positions that are already followed by a comma (as in "3,"), change the negative lookahead to (?!,|$) or append (?!,) to the regex.
Regex #3
If \b can't be used (e.g. if you have some numbers like 3.3), you can use the following instead:
(?:(?<=\s\d)|(?<=^\d))(?=\s)
How it works:
(?:(?<=\s\d)|(?<=^\d)) match either of the following:
(?<=\s\d) positive lookbehind ensuring what precedes is a whitespace character followed by a digit
(?<=^\d) positive lookbehind ensuring what precedes is the start of the line followed by a digit
(?=\s) positive lookahead ensuring what follows is a whitespace character
Regex #4
If you don't need to match digits at the start of the string, modify Regex #3 by removing the second lookbehind as such:
(?<=\s\d)(?=\s)
Code
Sample code (replace regex pattern with whichever pattern works best for you):
import re
x = 'I am going at 5pm to type 3 and the 9 later'
r = re.sub(r'(?<=\b\d)\b', ',', x)
print(r)
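To see how the word-boundary and whitespace-based variants differ, they can be run side by side. This is a rough comparison only: the variable names and the second input string (with a decimal) are assumptions added for illustration.

```python
import re

word_boundary = re.compile(r'(?<=\b\d)\b(?!,)')                # Regex #1 plus (?!,)
whitespace_only = re.compile(r'(?:(?<=\s\d)|(?<=^\d))(?=\s)')  # Regex #3

x = 'I am going at 5pm to type 3 and the 9 later'
y = 'Pay 3.3 or 4 now'  # hypothetical input containing a decimal

print(word_boundary.sub(',', x))    # I am going at 5pm to type 3, and the 9, later
print(whitespace_only.sub(',', x))  # I am going at 5pm to type 3, and the 9, later
print(word_boundary.sub(',', y))    # Pay 3,.3, or 4, now
print(whitespace_only.sub(',', y))  # Pay 3.3 or 4, now
```

Both patterns agree on the original sentence; only the whitespace-based one leaves 3.3 intact, because \b also matches between a digit and a period.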
You could use a word boundary and a capture group to achieve this:
import re
text = 'I am going at 5pm to type 3 and the 9 later'
re.sub(r'\b(\d)\b', r"\1,", text)
# => 'I am going at 5pm to type 3, and the 9, later'

Regex to match single dots but not numbers or ellipses

I'm working on a sentencizer and tokenizer for a tutorial. This means splitting a document string into sentences and sentences into words. Examples:
#Sentencizing
"This is a sentence. This is another sentence! A third..."=>["This is a sentence.", "This is another sentence!", "A third..."]
#Tokenization
"Tokens are 'individual' bits of a sentence."=>["Tokens", "are", "'individual'", "bits", "of", "a", "sentence", "."]
As seen, something more than just a string.split() is needed. I'm using re.sub() to append a 'special' tag to each match (and later splitting on this tag), first for sentences and then for tokens.
So far it works great, but there's a problem: how to make a regex that can split at dots, but not at (...) or at numbers (3.14)?
I've been working with these options with lookahead (I need to match the group and then be able to recall it for appending), but none works:
#Do a negative look behind for preceding numbers or dots, central capture group is a dot, do the same as first for a look ahead.
(?![\d\.])(\.)(?<![\d\.])
The application is:
sentence = re.sub(pattern, r'\g<0>'+special_tag, raw_sentence)
I used the following to find the periods that looked relevant:
import re
m = re.compile(r'[0-9]\.[^0-9.]|[^0-9]\.[^0-9.]|[!?]')
st = "This is a sentence. This is another sentence! A third... Pi is 3.14. This is 1984. Hello?"
m.findall(st)
# if you want to use lookahead, you can use something like this:
m = re.compile(r'(?<=[0-9])\.(?=[^0-9.])|(?<=[^0-9])\.(?=[^0-9.])|[!?]')
It's not particularly elegant, but I also tried to deal with the case of "We have a .1% chance of success."
Good luck!
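As a rough sketch of how the lookaround pattern above could be plugged into the tag-and-split approach from the question: the tag value, the function name sentencize, and the stripping of white space are assumptions for illustration, not part of the original answer.

```python
import re

SPECIAL_TAG = "<SPLIT>"  # hypothetical marker; any string absent from the text works

# Lookaround pattern from the answer above: only the terminator itself is matched.
boundary = re.compile(r'(?<=[0-9])\.(?=[^0-9.])|(?<=[^0-9])\.(?=[^0-9.])|[!?]')

def sentencize(raw):
    # Append the tag after each matched terminator, then split on the tag.
    tagged = boundary.sub(lambda m: m.group(0) + SPECIAL_TAG, raw)
    return [s.strip() for s in tagged.split(SPECIAL_TAG) if s.strip()]

st = "This is a sentence. This is another sentence! A third... Pi is 3.14. This is 1984. Hello?"
print(sentencize(st))
# ['This is a sentence.', 'This is another sentence!', 'A third...',
#  'Pi is 3.14.', 'This is 1984.', 'Hello?']
```

A function is used as the replacement so that no backslash escapes are needed in the replacement string. Only the last dot of an ellipsis matches, since it is the only one not followed by another dot.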
This might be overkill, or need a bit of cleanup, but here is the best regex I could come up with:
((([^\.\n ]+|(\.+\d+))\b[^\.]? ?)+)([\.?!\)\"]+)
To break it down:
[^\.\n ]+ // Matches 1+ times any char that isn't a dot, newline or space.
(\.+\d+) // Captures the special case of decimal numbers
\b[^\.]? ? // \b is a word boundary. This may be optionally
// followed by any non-dot character, and optionally a space.
All of the previous parts are matched 1+ times. To determine that a sentence is finished, we use the following:
[\.?!\)\"]+ // Matches any of the common sentence terminators 1+ times

insert space between regex match

I want to separate run-together typos in my string by locating them with a regex and inserting a space character inside the matched expression.
I tried the solution to a similar question (Insert space between characters regex), which is to use '\1 \2' as the replacement string in re.sub, but it did not work for me.
import re
corpus = '''
This is my corpus1a.I am looking to convert it into a 2corpus 2b.
'''
clean = re.compile('\.[^(\d,\s)]')
corpus = re.sub(clean,' ', corpus)
clean2 = re.compile('\d+[^(\d,\s,\.)]')
corpus = re.sub(clean2,'\1 \2', corpus)
EXPECTED OUTPUT:
This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b.
You need to put the capture group parentheses around the patterns that match each string that you want to copy to the result.
There's also no need to use + after \d. You only need to match the last digit of the number.
clean = re.compile(r'(\d)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
I'm not sure about other possible inputs, but we might be able to add spaces using an expression similar to:
(\d+)([a-z]+)\b
After that we would replace any run of two or more spaces with a single space. It might work, though I'm not sure:
import re
print(re.sub(r"\s{2,}", " ", re.sub(r"(\d+)([a-z]+)\b", " \\1 \\2", "This is my corpus1a.I am looking to convert it into a 2corpus 2b")))
Capture groups, marked by parentheses ( and ), should be around the patterns you want to reuse in the replacement.
So this should work for you:
clean = re.compile(r'(\d+)([^\d,\s])')
corpus = re.sub(clean, r'\1 \2', corpus)
The regex (\d+)([^\d,\s]) reads: match 1 or more digits (\d+) as group 1 (the first set of parentheses), then match one character that is not a digit, comma, or whitespace as group 2. The replacement must be a raw string (r'\1 \2'); without the r prefix, Python reads \1 and \2 as control characters rather than group references.
The reason yours didn't work is that you did not have parentheses surrounding the patterns you want to reuse.
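The capture-group fix only inserts a space after a digit, so on its own it does not produce the expected output from the question. One way to get there, as a sketch only, is to chain three substitutions; the second and third patterns are assumptions added for illustration, and the third would also split decimals such as 3.14.

```python
import re

corpus = "This is my corpus1a.I am looking to convert it into a 2corpus 2b."

# Step 1: space after a digit followed by a non-digit (pattern from the answers above).
step1 = re.sub(r'(\d)([^\d,\s])', r'\1 \2', corpus)
# Step 2 (assumption): space between a letter and the digit that follows it.
step2 = re.sub(r'([A-Za-z])(\d)', r'\1 \2', step1)
# Step 3 (assumption): space after a period that is directly followed by a non-space.
step3 = re.sub(r'\.(?=\S)', '. ', step2)

print(step3)
# This is my corpus 1 a. I am looking to convert it into a 2 corpus 2 b.
```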

regex continue only if positive lookahead has been matched at least once

Using Python: how do I get the regex to continue only if a positive lookahead has been matched at least once?
I'm trying to match:
Clinton-Orfalea-Brittingham Fellowship Program
Here's the code I'm using now:
dp2= r'[A-Z][a-z]+(?:-\w+|\s[A-Z][a-z]+)+'
print(np.unique(re.findall(dp2, tt)))
I'm matching the word, but it's also matching a bunch of other extraneous words.
My thought was that I'd like the \s[A-Z][a-z] to kick in ONLY IF -\w+ has been hit at least once (or maybe twice). I would appreciate any thoughts.
To clarify: I'm not aiming to match specifically this set of words, but to be able to generically match Proper noun- Proper noun- (indefinite number of times) and then a non-hyphenated Proper noun.
eg.
Noun-Noun-Noun Noun Noun
Noun-Noun Noun
Noun-Noun-Noun Noun
THE LATEST ITERATION:
dp5= r'(?:[A-Z][a-z]+-?){2,3}(?:\s\w+){2,4}'
The {m,n} notation can be used to force the regex to ONLY MATCH if the previous expression exists between m and n times. Maybe something like
(?:[A-Z][a-z]+-?){2,3}\s\w+\s\w+ # matches 'Clinton-Orfalea-Brittingham Fellowship Program'
If you're SPECIFICALLY looking for "Clinton-Orfalea-Brittingham Fellowship Program", why are you using Regex to find it? Just use word in string. If you're looking for things of the form: Name-Name-Name Noun Noun, this should work, but be aware that Name-Name-Name-Name Noun Noun won't, nor will Name-Name-Name Noun Noun Noun (In fact, something like "Alice-Bob-Catherine Program" will match not only that but whatever word comes after it!)
# Explanation
RE = r"""(?: # Begins the group so we can repeat it
[A-Z][a-z]+ # Matches one cap letter then any number of lowercase
-? # Allows a hyphen at the end of the word w/o requiring it
){2,3} # Ends the group and requires the group match 2 or 3 times in a row
\s\w+ # Matches a space and the next word
\s\w+ # Does so again
# those last two lines could just as easily be (?:\s\w+){2}
"""
RE = re.compile(RE, re.VERBOSE) # compiles the expression as written, ignoring whitespace and comments
If you're looking specifically for hyphenated proper nouns followed by non-hyphenated proper nouns, I would do this:
[A-Z][a-z]+-(?:[A-Z][a-z]+(?:-|\s))+
# Explanation
RE = r"""[A-Z][a-z]+- # Cap letter+small letters ending with a hyphen
(?: # start a non-capturing group so we can repeat it
[A-Z][a-z]+ # As before, but doesn't require a hyphen
(?:
-|\s # but if it doesn't have a hyphen, it MUST have a space
) # (this group is just to give precedence to the |)
)+ # can match multiple of these.
"""

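For the generic case in the question (one or more hyphen-joined proper nouns, then non-hyphenated proper nouns), another possible formulation makes the hyphenated part mandatory with its own + group. This is a sketch, not the pattern from either answer, and the sample text tt is invented for illustration.

```python
import re

pattern = re.compile(r"""
    [A-Z][a-z]+           # first proper noun
    (?:-[A-Z][a-z]+)+     # at least one hyphen-joined proper noun (required)
    (?:\s[A-Z][a-z]+)+    # then one or more space-separated proper nouns
""", re.VERBOSE)

# Hypothetical sample text
tt = ("She joined the Clinton-Orfalea-Brittingham Fellowship Program last year. "
      "The New York office and the Smith-Jones Award were mentioned too.")
print(pattern.findall(tt))
# ['Clinton-Orfalea-Brittingham Fellowship Program', 'Smith-Jones Award']
```

Because the space-separated group only comes after the required hyphenated group, plain capitalized runs such as New York are skipped, and no trailing space is needed after the last noun.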