Regex to ignore pattern found in quotes (Python or R)

Regex to ignore pattern found in quotes (Python or R) - python

I am trying to create a regex that allows me to find instances of a string where I have an unspaced /
eg:
some characters/morecharacters
I have come up with the expression below which allows me to find word characters or closing parenthesis before my / and word characters or open parenthesis characters afterwards.
(\w|\))/(\(|\w)
This works great for most situations, however I am coming unstuck when I have a / enclosed in quotes. In this case I'd like it to be ignored. I have seen a few different posts here and here. However, I can't quite get them to work in my situation.
What I'd like is for first three cases identified below to match and the last cast to be ignored allowing me to extract item 1 and item 3.
some text/more text
(formula)/dividethis
divideme/(byme)
"dont match/me"

It ain't pretty, but this will do what you want:
(?<!")(?:\(|\b)[^"\n]+\/[^"\n]+(?:\)|\b)(?!")
Demo on Regex101
Let's break it down a bit:
(?<!")(?:\(|\b) will match either an open bracket or a word boundary, as long as it's not preceded by a quotation mark. It does this by employing a negative lookbehind.
[^"\n]+ will match one or more characters, as long as they're neither a quotation mark or a line break (\n).
\/ will match a literal slash character.
Finally, (?:\)|\b)(?!") will match either a closing bracket or a word boundary as long as it's not followed by a quotation mark. It does this by employing a negative lookahead. Note that the (?:\)|\b) will only work 100% correctly in this order - if you reverse them, it'll drop the match on the bracket, because it encounters a word boundary before it gets to the bracket.

This will only match word/word which is not inside quotation marks.
import re
text = """
some text/more text "dont match/me" divideme/(byme)
(formula)/dividethis
divideme/(byme) "dont match/me hel d/b lo a/b" divideme/(byme)
"dont match/me"
"""
groups=re.findall("(?:\".*?\")|(\S+/\S+)", text, flags=re.MULTILINE)
print filter(None,groups)
Output:
['text/more', 'divideme/(byme)', '(formula)/dividethis', 'divideme/(byme)', 'divideme/(byme)']
(?:\".*?\") This will match everything inside quotes but this group won't be captured.
(\S+/\S+) This will match word/word only outside the quotations and this group will be captured.
Demo on Regex101

Related

How to ignore comments inside string literals

I'm doing a lexer as a part of a university course. One of the brain teasers (extra assignments that don't contribute to the scoring) our professor gave us is how could we implement comments inside string literals.
Our string literals start and end with exclamation mark. e.g. !this is a string literal!
Our comments start and end with three periods. e.g. ...This is a comment...
Removing comments from string literals was relatively straightforward. Just match string literal via /!.*!/ and remove the comment via regex. If there's more than three consecutive commas, but no ending commas, throw an error.
However, I want to take this even further. I want to implement the escaping of the exclamation mark within the string literal. Unfortunately, I can't seem to get both comments and exclamation mark escapes working together.
What I want to create are string literals that can contain both comments and exclamation mark escapes. How could this be done?
Examples:
!Normal string!
!String with escaped \! exclamation mark!
!String with a comment ... comment ...!
!String \! with both ... comments can have unescaped exclamation marks!!!... !
This is my current code that can't ignore exclamation marks inside comments:
def t_STRING_LITERAL(t):
r'![^!\\]*(?:\\.[^!\\]*)*!'
# remove the escape characters from the string
t.value = re.sub(r'\\!', "!", t.value)
# remove single line comments
t.value = re.sub(r'\.\.\.[^\r\n]*\.\.\.', "", t.value)
return t

Perhaps this might be another option.
Match 0+ times any character except a backslash, dot or exclamation mark using the first negated character class.
Then when you do match a character that the first character class does not matches, use an alternation to match either:
repeat 0+ times matching either a dot that is not directly followed by 2 dots
or match from 3 dots to the next first match of 3 dots
or match only an escaped character
To prevent catastrophic backtracking, you can mimic an atomic group in Python using a positive lookahead with a capturing group inside. If the assertion is true, then use the backreference to \1 to match.
For example
(?<!\\)![^!\\.]*(?:(?:\.(?!\.\.)|(?=(\.{3}.*?\.{3}))\1|\\.)[^!\\.]*)*!
Explanation
(?<!\\)! Match ! not directly preceded by \
[^!\\.]* Match 1+ times any char except ! \ or .
(?: Non capture group
(?:\.(?!\.\.) Match a dot not directly followed by 2 dots
| Or
(?=(\.{3}.*?\.{3}))\1 Assert and capture in group 1 from ... to the nearest ...
| Or
\\. Match an escaped char
) Close group
[^!\\.]* Match 1+ times any char except ! \ or .
)*! Close non capture group and repeat 0+ times, then match !
Regex demo

Look at this regex to match string literals: https://regex101.com/r/v2bjWi/2.
(?<!\\)!(?:\\!|(?:\.\.\.(?P<comment>.*?)\.\.\.)|[^!])*?(?<!\\)!.
It is surrounded by two (?<!\\)! meaning unescaped exclamation mark,
It consists of alternating escaped exclamation marks \\!, comments (?:\.\.\.(?P<comment>.*?)\.\.\.) and non-exclamation marks [^!].
Note that this is about as much as you can achieve with a regular expression. Any additional request, and it will not be sufficient any more.

regular expression ending

I have ONE string as plain text and want to extract phone numbers of any format from it.
Here is my regex:
r = re.compile(r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)[-\s*]\d{3}[-\.\s]??\d{4})")
It extracts the following matches correctly:
617.933.6444
(880)-567-4565
(880) 567-4565
222-333-8888
555 666 4444
9999999999
But how can I avoid getting 7986815059 when I have 798681505951 in the text?
How to make an ending for my regex? (it should not contain letters and digits after and before, exact number count must be 10)
!!!!
Decision
If somebody needs to find US phone numbers in string, use link from the last Wiktor Stribiżew comment.

You need to use word boundaries, but placing them into your pattern is not obvious. It is due to the fact that the second alternative starts with a non-word char, \(. Thus, the first \b must be added at the beginning of the first alternative, and the trailing one at the very end of the pattern:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
^^ ^^
See the regex demo
You may also require a non-word char or start of string before (. Then add \B at the second alternative start:
r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\B\(\d{3}\)[-\s*]\d{3}[-.\s]?\d{4})\b'
^^
See another demo
Also, note that there is no need escaping a . inside a character class, it is already parsed as a literal dot in [.]. And no need using a lazy ?? quantifier, it does not make sense here and a greedy version, ?, will work equally well and will look "cleaner".

Regex replace multiple punctuation in python

I would like to find multiple occurrences of exclamation marks, question marks and periods (such as !!?!, ...?, ...!) and replace them with just the final punctuation.
i.e. !?!?!? would become ?
and ....! would become !
Is this possible?

text = re.sub(r'[\?\.\!]+(?=[\?\.\!])', '', text)
That is, remove any sequence of ?!. characters that are going to be followed by another ?!. character.
[...] is a character class. It matches any character inside the brackets.
+ means "1 or more of these".
(?=...) is a lookahead. It looks to see what is going to come next in the string.

text = re.search('[.?!]*([.?!])', text).group(1)
The way this works is that the parentheses create a capture group, allowing you to access the matched text via the group function.

Python, regex negative lookbehind behavior

I have a regular experssion that should find up to 10 words in a line. THat is, it should include the word just preceding the line feed but not the words after the line feed. I am using a negative lookbehind with "\n".
a = re.compile(r"((\w)+[\s /]){0,10}(?<!\n)")
r = a.search("THe car is parked in the garage\nBut the sun is shining hot.")
When I execute this regex and call the method r.group(), I am getting back the whole sentence but the last word that contains a period. I was expecting only the complete string preceding the new line. That is, "THe car is parked in the garage\n".
What is the mistake that I am making here with the negative look behind...?

I don't know why you would use negative lookahead. You are saying that you want a maximum of 10 words before a linefeed. The regex below should work. It uses a positive lookahead to ensure that there is a linefeed after the words. Also when searching for words use `b\w+\b` instead of what you were using.
/(\b\w+\b)*(?=.*\\n)/
Python :
result = re.findall(r"(\b\w+\b)*(?=.*\\n)", subject)
Explanation :
# (\b\w+\b)*(?=.*\\n)
#
# Match the regular expression below and capture its match into backreference number 1 «(\b\w+\b)*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Note: You repeated the capturing group itself. The group will capture only the last iteration. Put a capturing group around the repeated group to capture all iterations. «*»
# Assert position at a word boundary «\b»
# Match a single character that is a “word character” (letters, digits, etc.) «\w+»
# Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
# Assert position at a word boundary «\b»
# Assert that the regex below can be matched, starting at this position (positive lookahead) «(?=.*\\n)»
# Match any single character that is not a line break character «.*»
# Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
# Match the character “\” literally «\\»
# Match the character “n” literally «n»
You may also wish to consider the fact that there could be no \n at your string.

If I read you right, you want to read up to 10 words, or the first newline, whichever comes first:
((?:(?<!\n)\w+\b[\s.]*){0,10})
This uses a negative lookbehind, but just before the word match, so it blocks getting any word after a newline.
This will need some tuning for imperfect input, but it's a start.

For this task there is the anchor $ to find the the end of the string and together with the modifier re.MULTILINE/re.M it will find the end of the line. So you would end up with something like this
(\b\w+\b[.\s /]{0,2}){0,10}$
See it here on Regexr
The \b is a word boundary. I included [.\s /]{0,2} to match a dot followed by a whitespace in my example. If you don't want the dots you need to make this part at least optional like this [\s /]? otherwise it will be missing at the last word and then the \s is matching the \n.
Update/Idea 2
OK, maybe I misunderstood your question with my first solution.
If you just want to not match a newline and continue in the second row, then just don't allow it. The problem is that the newline is matched by the \s in your character class. The \s is a class for whitespace and this includes also the newline characters \r and \n
You already have a space in the class then just replace the \s with \t in case you want to allow tab and then you should be fine without lookbehind. And of course, make the character class optional otherwise the last word will also not be matched.
((\w)+[\t /]?){0,10}
See it here on Regexr

I think you shouldn't be using a lookbehind at all. If you want to match up to ten words not including a newline, try this:
\S+(?:[ \t]+\S+){0,9}
A word is defined here as one or more non-whitespace characters, which includes periods, apostrophes, and other sentence punctuation as well as letters. If you know the text you're matching is regular prose, there's no point limiting yourself to \w+, which isn't really meant to match natural-language words anyway.
After the first word, it repeatedly matches one or more horizontal whitespace characters (space or TAB) followed by another word, for a maximum of ten words. If it encounters a newline before the tenth word, it simply stops matching at that point. There's no need to mention newlines in the regex at all.

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.

>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam. We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(.
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python to ignore the \'s.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.

It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile.view\.aspx", s);
print mo.group(1)
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.

If you want to have special characters in a regex, you need to escape them, such as \(, \/, \\.
Matching things inside of nested parenthesis is quite a bit of a pain in regex. if that format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find a open paren, match characters until you find another open paren, group characters until I see a close paren, then make sure there are two more close paren somewhere in there.

mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
import re
re.sub(r'^.*\((\w+)\).*',r'\1',mystr)

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Regex to ignore pattern found in quotes (Python or R) - python

Related

How to ignore comments inside string literals

regular expression ending

Regex replace multiple punctuation in python

Python, regex negative lookbehind behavior

Python: Regex to extract part of URL found between parentheses

Categories

Resources