Python Regex (Search Multiple values in one string) - python

In Python regex, how would I match against a large string of text and flag if any one of several regex values matches? I have tried this with "|" (or) statements and I have tried making a regex list; neither worked for me. Here is an example of what I am trying to do with the or.
I think my "or" gets commented out:
patterns=re.compile(r'[\btext String1\b] | [\bText String2\b]')
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")
The above code always says it matches regardless of whether the string appears, and if I change it around a bit I get matches on the first regex but it never checks the second. I believe this is because the "raw" prefix is commenting out my or statement, but how would I get around this?
I also tried to get around this by taking out the "raw" prefix and putting double slashes on my \b for escaping, but that didn't work either :(
patterns=re.compile(\\btext String1\\b | \\bText String2\\b)
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")
I then tried to do 2 separate raw strings with the or between them, and the interpreter complains about unsupported str operands:
patterns=re.compile(r'\btext String1\b' | r'\bText String2\b')
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")

patterns=re.compile(r'(\btext String1\b)|(\bText String2\b)')
You want a group (optionally capturing), not a character class. Technically, you don't need a group here:
patterns=re.compile(r'\btext String1\b|\bText String2\b')
will also work (without any capture).
The way you had it, it checked for either one of the characters between the first square brackets, or one of those between the second pair. You may find a regex tutorial helpful.
It should be clear where the "unsupported str operands" error comes from. You can't OR strings, and you have to remember the | is processed before the argument even gets to compile.
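A minimal sketch of the alternation approach, including one way to build the pattern from a list. The sample text and the `phrases` list are made up for illustration, and `re.escape` is used on the assumption that the list holds literal phrases rather than regex fragments.

```python
import re

# Hypothetical sample data, not taken from the question.
text = "first line\nthis file mentions Text String2 somewhere\n"
phrases = ["text String1", "Text String2"]

# Escape each phrase, wrap it in word boundaries, and join with |.
pattern = re.compile("|".join(r"\b" + re.escape(p) + r"\b" for p in phrases))

match = pattern.search(text)
if match:
    print("YAY one of your text patterns is in this file:", match.group())

# Text containing none of the phrases gives None.
no_match = pattern.search("nothing relevant here")
```

The `|` lives inside the pattern string, so Python never tries to apply its own `|` operator to two strings.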

This part, [\btext String1\b], means "is a backspace character or one of the characters in 'text String1' present?" (inside square brackets \b stands for a backspace, not a word boundary). So that matches nearly any non-empty text, I think.

In a RE pattern, square brackets [ ] indicate a "character class" (depending on what's inside them, "any one of these characters" or "any character except one of these", the latter indicated by a caret ^ as the first character after the opening [). This is what you're expressing and it has absolutely nothing to do with what you want -- just remove the brackets and you should be fine;-).
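To make the difference concrete, here is a small sketch comparing the bracketed pattern with the same pattern without brackets. The test strings are invented for illustration.

```python
import re

char_class = re.compile(r"[\btext String1\b]")
plain = re.compile(r"\btext String1\b")

# The class matches any single character from the set, so unrelated text hits.
hit = char_class.search("exit")            # matches the "e"

# Inside [...] the \b escape is a backspace character, not a word boundary.
backspace_hit = char_class.search("\x08")

# Without brackets the whole phrase has to appear.
miss = plain.search("exit")
found = plain.search("some text String1 here")
```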

Related

Do character classes count as groups in regular expressions?

A small project I got assigned is supposed to extract website URLs from given text. Here's what the most relevant portion of it looks like:
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+-\\/_]+
)''',re.VERBOSE)
This does do its job properly, but I noticed that it also includes the ','s and '.' in URL strings it prints. So my first question is, how do I make it exclude any punctuation symbols in the end of the string it detects ?
My second question is referring to the title itself ( finally ), but doesn't really seem to affect this particular program I'm working on : Do character classes ( in this case [a-zA-Z0-9.%+-\/_]+ ) count as groups ( group[3] in this case ) ?
Thanks in advance.
To exclude some symbols at the end of the string you can use a negative lookbehind. For example, to disallow a trailing . or ,:
.*(?<![.,])$
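A hedged sketch of how that lookbehind could be attached to a URL pattern. The name `url_re`, the character class, and the sample sentence are assumptions made for this illustration, not the asker's exact regex.

```python
import re

# Hypothetical URL pattern: the class deliberately allows . and , in the
# middle, while the lookbehind refuses a match that ends on either one.
url_re = re.compile(r"https?://[a-zA-Z0-9.,%+/_-]+(?<![.,])")

sample = "See http://example.com/a,b, or https://example.org."
urls = url_re.findall(sample)
print(urls)
```

The greedy class first swallows the trailing punctuation, the lookbehind then fails, and the engine backtracks one character at a time until the match ends on an allowed character.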
Answering in reverse:
No, a character class just matches one character out of the set listed inside the brackets. It doesn't create a group the way surrounding something with parentheses would. It only lets the regular expression engine select the specified characters -- nothing more, nothing less.
With regards to finding comma and dot: Actually, I see the problem here, though the below may still be valuable, so I'll leave it. Essentially, you have this: [a-zA-Z0-9.%+-\\/_]+. Inside a class the - character has special meaning: everything between the two characters on either side of it, by ASCII code. So [A-a] is a valid range. It includes A-Z, but also a bunch of other characters that aren't A-Z. If you want a literal - in the class, the simplest fix is to make it the last character: [a-zA-Z0-9.%+\\/_-]+ should work.
For comma, I don't see it listed explicitly in your regex, but the accidental range from + to \ covers it (, comes right after + in ASCII), so fixing the - should take care of it. It shouldn't be allowed anywhere in the url. In general though, you'll just want to add more groups/more conditions.
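A quick check of what the accidental range lets through, as a sketch. The sample string is made up, and the two classes are cut down to just the part around the hyphen.

```python
import re

accidental = re.compile(r"[+-\\]")   # a range from "+" to "\"
literal = re.compile(r"[+\\-]")      # "+", "\" and a literal "-"

sample = ",:A-"
accidental_hits = accidental.findall(sample)
literal_hits = literal.findall(sample)
print(accidental_hits, literal_hits)
```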
First, break apart the url into the specific groups you'll want:
(scheme)://(domain)(endpoint)
Each section gets a different set of requirements: e.g. maybe domain needs to end with a slash:
[a-zA-Z0-9]+\.com/ should match any domain that uses alphanumeric characters and ends -- specifically -- with .com (note the \., otherwise it'll capture any single character followed by com/).
For the endpoint section, you'll probably still want to allow special characters, but if you're confident you don't want the url to end with, say, a dot, then you could finish with something like [A-Za-z0-9] -- note the lack of a dot here, plus its length -- only a single character. This will change the rest of your regex, so you need to think about that.
A couple of random thoughts:
If you're confident you want to match the whole line, add a $ to the end of the regex, to signify the end of the line. One possibility here is that your regex does match some portion of the text, but ignores the junk at the end, since you didn't say to read the whole line.
Regexes get complicated really fast -- they're kind of write-only code. Add some comments to help. E.g.
import re

web_url_regex = re.compile(
    r'(http://|https://)'        # Capture the scheme name
    r'([a-zA-Z0-9.%+\\/_-]+)'    # Everything else, apparently (- moved last so it is literal)
)
Do not try to be exhaustive in your validation -- as noted, urls are hard to validate because you can't know for sure that one is valid. But the form is pretty consistent, as laid out above: scheme, domain, endpoint (and query string)
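As a rough sketch of the scheme/domain/endpoint split described above, using named groups. The group names, the character sets, and the sample sentence are assumptions for illustration; this is not meant as a complete URL validator.

```python
import re

url_re = re.compile(r"""
    (?P<scheme>https?)://                  # scheme
    (?P<domain>[a-zA-Z0-9.-]+)             # domain: letters, digits, dots, hyphens
    (?P<endpoint>/[a-zA-Z0-9.%+/_-]*)?     # optional path
""", re.VERBOSE)

m = url_re.search("see https://example.com/path/to for details")
parts = m.groupdict() if m else {}
print(parts)
```

Giving each section its own group makes it easier to tighten one part (say, the last character of the endpoint) without touching the others.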
To answer the second question first: no, a character class is not a group (unless you explicitly make it into one by putting it in parentheses).
Regarding the first question of how to make it exclude the punctuation symbols at the end, the code below should answer that.
Firstly though, your regex had an issue separate from the fact that it was matching the final punctuation, namely that the last - does not appear to be intended as defining a range of characters (see footnote below re why I believe this to be the case), but was doing so. I've moved it to the end of the character class to avoid this problem.
Now a character class to match the final character is added at the end of the regexp, which is the same as the previous character class except that it does not include . (other punctuation is now already not included). So the matched pattern cannot end in .. The + (one or more) on the previous character class is now reduced to * (zero or more).
If for any reason the exact set of characters matched needs tweaking, then the same principle can still be employed: match a single character at the end from a reduced set of possibilities, preceded by any number of characters from a wider set which includes characters that are permitted to be included but not at the end.
import re
webURLregex = re.compile(r'''(
(https://|http://)
[a-zA-Z0-9.%+\\/_-]*
[a-zA-Z0-9%+\\/_-]
)''',re.VERBOSE)
text = "... at http://www.google.com/. It says"
m = webURLregex.search(text)
if m:
    print(m.group())
Outputs:
http://www.google.com/
[*] The observation that the last - does not appear to be intended to define a character range is based on the fact that, if it were, the range would run from + to \ (053-134 octal), which also covers the digits and the uppercase letters, making 0-9 and A-Z redundant.

regex: replace hyphens with en-dashes with re.sub

I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
def mychanger(fileName):
    with open(fileName,'r') as file:
        str = file.read()
        str = str.decode("utf-8")
        str = re.sub(r"[^{]{1,4}(-)","–", str).encode("utf-8")
    with open(fileName,'wb') as file:
        file.write(str)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.
In this case there is a { close (within 3-4 characters max) to the hyphen and to the left of it. Of course, this hyphen should not be changed into an en-dash otherwise the latex code will break.
I think the left part condition of the exclusion is important to write the correct exception in regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { on the close right of the hyphen) and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two step replacement:
import re

str = r"-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
# Ranges such as 1-5 that are not inside curly braces
str = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])", r"\1–\2", str)
# Negative numbers preceded by a non-digit or the start of the string
str = re.sub(r"(^|[^0-9])-(\d+)", r"\1–\2", str)
print(str)
The first replacement targets all ranges which are not of the LaTeX form {1-9}, i.e. are not contained within curly braces. The second replacement targets all numbers preceded by a non-digit or the start of the string.
re.sub replaces the entire match. In this case that includes the non-{ characters preceding your -. You can wrap that bit in parentheses to create a \1 group and include that in your substitution (you also don't need parentheses around your -):
re.sub(r"([^{]{1,4})-",r"\1–", str)
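A small sketch of the difference on the (-1.2) style cell from the question. The variable names are only for illustration.

```python
import re

cell = "(-1.2)"

# Original pattern: the "(" is part of the match, so it is replaced too.
dropped = re.sub(r"[^{]{1,4}(-)", "–", cell)

# Captured prefix put back via \1: only the hyphen changes.
kept = re.sub(r"([^{]{1,4})-", r"\1–", cell)

print(dropped, kept)
```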

Regular expression matching lines that are not commented out

Given the following code
print("aaa")
#print("bbb")
# print("ccc")
def doSomething():
    print("doSomething")
How can I use regular expression in Atom text editor to find all the print functions that are not commented out? I mean I only want to match the prints in print("aaa") and print("doSomething").
I've tried [^#]print, but this also matches the print in # print("ccc"), which is something that is not desired.
[^# ]print doesn't match any line here.
The reason I want to do this is that I want to disable the log messages inside a legacy project written by others.
Since you confirm my first suggestion (^(?![ \t]*#)[ \t]*print) worked for you (I deleted that first comment), I believe you just want to find the print on single lines.
The \s matches any whitespace, incl. newline symbols. If you need to just match tabs or spaces, use a [ \t] character class.
Use
^[ \t]*print
or (a bit safer in order not to find any printers):
^[ \t]*print\(
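For illustration only, the same pattern can be tried out in Python with the MULTILINE flag so that ^ anchors at every line start. Atom uses its own regex engine, so treat this as a sketch of the idea rather than a statement about how Atom behaves; `source` simply reproduces the snippet from the question.

```python
import re

source = '''print("aaa")
#print("bbb")
# print("ccc")
def doSomething():
    print("doSomething")
'''

active_print = re.compile(r"^[ \t]*print\(", re.MULTILINE)

# 1-based line numbers of the print calls that are not commented out.
line_numbers = [source.count("\n", 0, m.start()) + 1
                for m in active_print.finditer(source)]
print(line_numbers)
```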
If you want to match only the print (and not all the arguments), you can use:
^\s*(print)
See this live sample : http://refiddle.com/refiddles/57b56c8075622d22e8080000

regex noob questions

so this is my string:
"""$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com"""
and i know that this is the proper regex formula to give me what I want (output follows):
age = re.match(r'\$([\d.]+)\. (.+), ([\d-]+)', example)
print(age.groups())
output ====> ('10', '2109 W. Chicago Ave.', '773-772-0406')
but i have some questions about the regex formula even after reading the doc:
When grouped with the ()parenthesis, those are the separate tuple values the regex is ultimately returning, right?
If I delete the $ sign, why does the whole thing completely break down with error:unbalanced parenthesis? shouldn't the regular expression be able to grab the price after the $ regardless of if I specified $ beforehand? And building off that, if I want the output to be $10, not 10, why can't i move the $ inside and simply run r'\($[\d.]+)? it throws me another unbalanced parenthesis error.
after the (.+), in the middle, is the comma the only way python knows we are done with the value to be slotted into the second tuple value slot? So, (.+) doesn't really mean 'any character' does it? a comma would move it on to the next character if it happened to be follow by a digit, right?
could someone explain the placement of the + signs inside the parenthesis rather than outside and how that makes a difference?
sorry for the terribly noob questions. ill get good one day. thanks in advance.
When grouped with the ()parenthesis, those are the separate tuple values the regex is ultimately returning, right?
Correct
If I delete the $ sign, why does the whole thing completely break down with error:unbalanced parenthesis? shouldn't the regular expression be able to grab the price after the $ regardless of if I specified $ beforehand?
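If you delete the dollar sign, your escape character \ now escapes the opening parenthesis (, telling the regex engine to treat it as a literal character it needs to search for in your string rather than as the start of a group. The closing ) then has no opening partner, hence the unbalanced parenthesis error.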
after the (.+), in the middle, is the comma the only way python knows we are done with the value to be slotted into the second tuple value slot?
Yes, it tells Python to capture 1 or more of almost any character up until the last comma that can still be followed by the rest of the pattern. . matches almost any single character (anything except a newline by default). .+ matches 1 or more of those characters.
Note that .+ is greedy, meaning it will keep capturing, commas included, until just before that last comma. If you want it to stop before the first comma that works, you can make it lazy using .+?
could someone explain the placement of the + signs inside the parenthesis rather than outside and how that makes a difference?
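It doesn't change the meaning of the + (still "one or more"), whether it's on the inside or outside. It changes what gets captured into the group: with the + inside, the group captures the whole run of characters; with the + outside, the group itself is repeated and only the last repetition is kept.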
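A short sketch of both points, greedy versus lazy and + inside versus outside the group. The test strings are made up to keep the effect visible.

```python
import re

# Greedy vs lazy: how far (.+) runs before the comma.
greedy = re.match(r"(.+),", "a, b, c").group(1)
lazy = re.match(r"(.+?),", "a, b, c").group(1)

# + inside the group captures the whole run;
# + outside repeats the group and keeps only the last repetition.
plus_inside = re.match(r"([\d.]+)", "10.5").group(1)
plus_outside = re.match(r"([\d.])+", "10.5").group(1)

print(greedy, lazy, plus_inside, plus_outside)
```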
EDIT:
Why can't i move the $ inside and simply run r'($[\d.]+)? it throws me another unbalanced parenthesis error.
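This is because $ also has a special meaning (it matches at end-of-line), just like ( and ) in regex, so you need to escape it if you want to match the literal character, just like you escaped your parenthesis: \$. The backslash has to stay attached to the $, as in r'(\$[\d.]+)'; written as r'\($[\d.]+)' it escapes the ( instead, which is what leaves the ) unbalanced.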
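Both effects can be reproduced with the string from the question; this is a sketch, and the variable names are only for illustration.

```python
import re

example = "$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com"

# Dropping the $ leaves \( as a literal parenthesis, so the ) has no partner.
try:
    re.compile(r'\([\d.]+)\. (.+), ([\d-]+)')
    error_message = None
except re.error as exc:
    error_message = str(exc)

# Moving the escaped \$ inside the group keeps the dollar sign in the capture.
with_dollar = re.match(r'(\$[\d.]+)\. (.+), ([\d-]+)', example).groups()

print(error_message)
print(with_dollar)
```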

Regular Expressions in unicode strings

I have some unicode text that I want to clean up using regular expressions. For example I have cases like u'(2'. This exists because for formatting reasons the closing paren ends up in an adjacent html cell. My initial solution to this problem was to look ahead at the contents of the next cell and, using a string function, determine if it held the closing paren. I knew this was not a great solution but it worked. Now I want to fix it but I can't seem to make the regular expression work.
missingParen=re.compile(r"^\(\d[^\)]$")
My understanding of what I think I am doing:
^ at the beginning of the string I want to find
( an open paren, the paren has to be backslashed because it is a special character
\d I also want to find a single digit
[ I am creating a special character class
^ I don't want to find what follows
) which is a close paren
$ at the end of the string
And of course the plot thickens: I made a silly assumption that because I used a single \d I would not match (33, but I was wrong. I added a {1} to my regular expression and that did not help; it matched (3333, so my problem is more complicated than I thought. I want the string to be only an open paren and a single digit. Is this the more clever approach?
missingParen=re.compile(r"^\(\d$")
And note to S Lott: I already tagged it beginner, so you can't pick up any cheap points. Not that I don't appreciate your insights; I keep meaning to read your book, it probably has the answer.
Okay, sorry for using this as a stream-of-consciousness thinking stimulator, but it appears that writing out my original question got me on the path. It seems to me that this is a solution for what I am trying to do:
missingParen=re.compile(r"^\(\d$")
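A small check of the two patterns, as a sketch with made-up candidate strings. The point it illustrates is that [^\)] is not an assertion: it consumes one character that is not a close paren, which is why the first pattern accepts (33 but not (2.

```python
import re

original = re.compile(r"^\(\d[^\)]$")
simplified = re.compile(r"^\(\d$")

candidates = ["(2", "(33", "(2)"]
original_hits = [s for s in candidates if original.search(s)]
simplified_hits = [s for s in candidates if simplified.search(s)]

print(original_hits, simplified_hits)
```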
