Regex replace multiple punctuation in Python

I would like to find multiple occurrences of exclamation marks, question marks and periods (such as !!?!, ...?, ...!) and replace them with just the final punctuation.
For example, !?!?!? would become ? and ....! would become !
Is this possible?

text = re.sub(r'[\?\.\!]+(?=[\?\.\!])', '', text)
That is, remove any sequence of ?!. characters that are going to be followed by another ?!. character.
[...] is a character class. It matches any character inside the brackets.
+ means "1 or more of these".
(?=...) is a lookahead. It looks to see what is going to come next in the string.
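As a minimal sketch of the lookahead approach described above (the helper name collapse_punctuation and the sample sentence are made up for illustration; the pattern is the one from the answer):

```python
import re

def collapse_punctuation(text):
    # Hypothetical helper: delete every run of ?, ! or . that is still
    # followed by another such character, so only the last one survives.
    return re.sub(r'[\?\.\!]+(?=[\?\.\!])', '', text)

print(collapse_punctuation("Really!?!?!? Wait....! Fine."))
```

A single punctuation mark is left alone, because the lookahead finds nothing after it to justify a removal.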

text = re.sub(r'[.?!]*([.?!])', r'\1', text)
The way this works is that the parentheses create a capture group around the final punctuation character of each run, and \1 in the replacement puts back only that captured character.

Related

Match between two single quotes and continue matching if two single quotes appear in a row or if '.' appears in the middle?

I'm trying to set up a regex pattern that will match 'hello' twice in 'hello'blah blah blah 'hello', but that will match the full string 'hello''hello' as well as the full string 'hello'.'hello'.
To put it simply, I want to start a match when there is a single quote, and continue the match until I encounter another single quote, unless there is another single quote immediately after it or if there is a .' after it, in which case I want to continue matching until I encounter a single quote that doesn't match those conditions.
This is what I have to match values between single quotes currently:
\'[^\']*\'
I already read the solution in "How to replace single quote to two single quotes, but do nothing if two single quotes are next to each other in perl", but it doesn't quite fit what I'm looking for and I can't get it to match the in-between stuff.
You can use this regex:
('[^']+'(?:\.?'[^']+')*)
It looks for a set of characters enclosed in single quotes, followed optionally by some number of sets of characters enclosed in single quotes, possibly preceded by a period.
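A quick way to check this pattern against the three inputs from the question, as a sketch (the name QUOTED and the sample list are illustrative only):

```python
import re

# Pattern from the answer above; QUOTED is just an illustrative name.
QUOTED = re.compile(r"('[^']+'(?:\.?'[^']+')*)")

samples = [
    "'hello'blah blah blah 'hello'",  # two separate matches expected
    "'hello''hello'",                 # doubled quote keeps the match going
    "'hello'.'hello'",                # a period between quotes also continues it
]

for sample in samples:
    print(QUOTED.findall(sample))
```

Because the pattern has a single capture group wrapping everything, findall returns the full matched text for each hit.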
'.*?'(\.?'.*?')*
If I understand your need correctly, it's easy to construct the above regex.

Match only final character of regex

I have some Python code that involves a lot of re.sub() commands. In some cases, I want to replace a character but only if it comes after certain other characters. The following is an example of how I currently am doing this in python:
secStress = "[aeiou],"[-1]
So my input for this would be a string like "a,s I walk, I hum." And I want to replace that first comma but not the "a" that comes before it.
The problem is that Python doesn't like when I give it a variable as input for re.sub(). Is there a way I can write a regex that specifies that only its final character is supposed to be matched?
You are looking for either a capturing group/backreference or a positive lookbehind solution:
import re
s = "a,s I walk, I hum."
# Capturing group / backreference
print(re.sub(r"([aeiou]),", r"\1", s))
# Positive lookbehind
print(re.sub(r"(?<=[aeiou]),", "", s))
First approach details
The ([aeiou]) is a capturing group that matches a vowel and stores it in a special memory buffer that you can refer to from the replacement pattern using backreferences. Here, the Group ID is 1, so you can access that value using r"\1".
Second approach details
The (?<=[aeiou]) is a positive lookbehind: it checks that a vowel sits immediately before the current position, but does not add that vowel to the match value. So only commas preceded by a vowel are matched, and replacing the match with an empty string is enough to get rid of the comma, since the comma is the only character in the match.
If I understand you correctly,
>>> import re
>>> def doit(matchobj):
... return matchobj.group()[0]
...
>>> re.sub(r'[aeiou],', doit, "a,s I walk, I hum.")
'as I walk, I hum.'
If the regex matches, doit is called with the match object. Whatever string doit returns (and it must be a string) is put in place of the match.
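To see that the three approaches above (backreference, lookbehind, and replacement function) are interchangeable for this input, here is a small side-by-side sketch. The function name keep_first_char is hypothetical; it plays the role of doit.

```python
import re

s = "a,s I walk, I hum."

def keep_first_char(match):
    # Hypothetical stand-in for doit: keep the vowel, drop the comma.
    return match.group()[0]

with_backref = re.sub(r"([aeiou]),", r"\1", s)
with_lookbehind = re.sub(r"(?<=[aeiou]),", "", s)
with_function = re.sub(r"[aeiou],", keep_first_char, s)

print(with_backref)
print(with_lookbehind)
print(with_function)
```

The comma after "walk" is untouched in all three, because "k" is not a vowel.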

Regex to ignore pattern found in quotes (Python or R)

I am trying to create a regex that allows me to find instances of a string where I have an unspaced /
eg:
some characters/morecharacters
I have come up with the expression below which allows me to find word characters or closing parenthesis before my / and word characters or open parenthesis characters afterwards.
(\w|\))/(\(|\w)
This works great for most situations, however I am coming unstuck when I have a / enclosed in quotes. In this case I'd like it to be ignored. I have seen a few different posts here and here. However, I can't quite get them to work in my situation.
What I'd like is for the first three cases identified below to match and the last case to be ignored, allowing me to extract item 1 and item 3.
some text/more text
(formula)/dividethis
divideme/(byme)
"dont match/me"
It ain't pretty, but this will do what you want:
(?<!")(?:\(|\b)[^"\n]+\/[^"\n]+(?:\)|\b)(?!")
Let's break it down a bit:
(?<!")(?:\(|\b) will match either an open bracket or a word boundary, as long as it's not preceded by a quotation mark. It does this by employing a negative lookbehind.
[^"\n]+ will match one or more characters, as long as they're neither a quotation mark nor a line break (\n).
\/ will match a literal slash character.
Finally, (?:\)|\b)(?!") will match either a closing bracket or a word boundary as long as it's not followed by a quotation mark. It does this by employing a negative lookahead. Note that the (?:\)|\b) will only work 100% correctly in this order - if you reverse them, it'll drop the match on the bracket, because it encounters a word boundary before it gets to the bracket.
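The pattern broken down above can be tried against the four sample lines from the question with a sketch like the following (the names SLASH and lines are illustrative). Note that, as written, the pattern matches the whole run of text around the slash, for example "some text/more text" rather than just "text/more".

```python
import re

# Pattern from the answer above; SLASH is just an illustrative name.
SLASH = re.compile(r'(?<!")(?:\(|\b)[^"\n]+\/[^"\n]+(?:\)|\b)(?!")')

lines = [
    'some text/more text',
    '(formula)/dividethis',
    'divideme/(byme)',
    '"dont match/me"',
]

for line in lines:
    found = SLASH.search(line)
    # Print the matched text, or None when the line is skipped.
    print(line, '->', found.group(0) if found else None)
```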
This will only match word/word which is not inside quotation marks.
import re
text = """
some text/more text "dont match/me" divideme/(byme)
(formula)/dividethis
divideme/(byme) "dont match/me hel d/b lo a/b" divideme/(byme)
"dont match/me"
"""
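# Quoted spans are consumed by the first alternative without being captured,
# so they show up as empty strings that filter(None, ...) drops.
groups = re.findall(r'(?:".*?")|(\S+/\S+)', text, flags=re.MULTILINE)
print(list(filter(None, groups)))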
Output:
['text/more', 'divideme/(byme)', '(formula)/dividethis', 'divideme/(byme)', 'divideme/(byme)']
(?:\".*?\") This will match everything inside quotes but this group won't be captured.
(\S+/\S+) This will match word/word only outside the quotations and this group will be captured.

Pattern for '.' separated words with arbitrary number of whitespaces

It's the first time that I'm using regular expressions in Python and I just can't get it to work.
Here is what I want to achieve: I want to find all strings, where there is a word followed by a dot followed by another word. After that an unknown number of whitespaces followed by either (off) or (on). For example:
word1.word2 (off)
Here is what I have come up with so far.
string_group = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', analyzed_string)
\w+ for the first word
\. for the dot
\w+ for the second word
\s+ for the whitespaces
[(\(on\))(\(off\))] for the (off) or (on)
I think that the last expression might not be doing what I need it to. With the implementation right now, the program does find the right place in the string, but the output of
string_group.group(0)
Is just
word1.word2 (
instead of the whole expression I'm looking for. Could you please give me a hint what I am doing wrong?
[ ... ] denotes a character class, and will match any one character inside it unless you add a quantifier: [ ... ]+ for one or more times.
But simply adding that won't work...
\w+\.\w+\s+[(\(on\))(\(off\))]+
Will match garbage stuff like word1.word2 )(fno(nofn too, so you actually don't want to use a character class, because it'll match the characters in any order. What you can use is a capturing group, and a non-capturing group along with an OR operator |:
\w+\.\w+\s+(\((?:on|off)\))
(?:on|off) will match either on or off
Now, if you don't want the parentheses to be captured in the first group too, you can change that to:
\w+\.\w+\s+\((on|off)\)
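The difference between the character class and the alternation can be checked with a short sketch (the sample strings are made up; the patterns are the ones discussed above):

```python
import re

sample = "word1.word2   (off)"

# Original attempt: the brackets form a character class, so only ONE of
# the characters ( ) o n f is matched after the whitespace.
broken = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', sample)
print(repr(broken.group(0)))

# Adding + to the class accepts those characters in any order.
garbage = re.fullmatch(r'\w+\.\w+\s+[(\(on\))(\(off\))]+', "word1.word2 )(fno(nofn")
print(garbage is not None)

# Alternation inside escaped literal parentheses matches exactly (on) or (off).
fixed = re.search(r'\w+\.\w+\s+\((on|off)\)', sample)
print(repr(fixed.group(0)), repr(fixed.group(1)))
```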
You've got your logical OR mixed up.
[(\(on\))(\(off\))]
should be
\((?:on|off)\)
[]s are just for matching single characters.
The square brackets are a character class, which matches any one of the characters in the brackets. You appear to be trying to use it to match one of the sub-regexes (\(on\)) and (\(off\)). The way to do that is with an alternation operation, the pipe symbol: (\(on\)|\(off\)).
I think your problem may be with the square brackets []
they indicate a set of single characters to match. So your expression would match a single instance of any of the following chars: "()ofn"
So for the string "word1.word2 (on)", you are matching only this part: "word1.word2 ("
Try using this one instead:
re.search(r'\w+\.\w+\s+\((on|off)\)', analyzed_string)
This match assumes that the () will be there, and looks for either "on" or "off" inside the parentheses.

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> import re
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam". We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(".
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings: r"spam" instead of just "spam". That tells Python not to interpret the backslashes as string escapes, so they reach the regex engine unchanged.
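A small sketch of why raw strings matter, using \b as the example (this sequence is not from the answer above; it is chosen because Python's string syntax and the regex syntax both give it a meaning):

```python
import re

plain = "\bham"   # Python turns \b into a single backspace character
raw = r"\bham"    # backslash and b stay as two characters for the regex engine

print(len(plain), len(raw))

# The plain string looks for a literal backspace followed by "ham".
print(re.search(plain, "spam ham"))

# The raw string is read by the regex engine as a word boundary before "ham".
print(re.search(raw, "spam ham").group())
```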
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
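As a sketch of the capturing-group idea for this URL (the pattern below is one possible way to write it, not necessarily the one the other answer uses):

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# Match the literal "(K(", capture everything up to the next ")",
# then require the two closing parentheses.
match = re.search(r"\(K\(([^)]*)\)\)", url)

print(match.groups())   # tuple of all captured groups
print(match.group(1))   # just the first group
```

Here .groups() returns a tuple, so a pattern with one group yields a one-element tuple.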
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
import re
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile/view\.aspx", s)
print(mo.group(1))
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to match special characters literally in a regex, you need to escape them with a backslash, such as \( or \\. Escaping / as \/ is harmless in Python but not required.
Matching things inside nested parentheses is quite a pain in regex. If the format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find an open paren, match characters until you find another open paren, group characters until you see a close paren, then make sure there is one more close paren somewhere after it.
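Applied to the sample URL, that pattern can be tried like this (a sketch; it assumes the format is always the same, as the answer says, and the variable names are illustrative):

```python
import re

url = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"

# Lazy quantifiers (.*?) stop at the first possible paren each time.
nested = re.search(r"\(.*?\((.*?)\).*?\)", url)

print(nested.group(1))
```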
import re
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
print(re.sub(r'^.*\((\w+)\).*', r'\1', mystr))
