regex noob questions

regex noob questions - python

so this is my string:
"""$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com"""
and i know that this is the proper regex formula to give me what I want (output follows):
age = re.match(r'\$([\d.]+)\. (.+), ([\d-]+)', example)
print age.groups()
output ====> ('10', '2109 W. Chicago Ave.', '773-772-0406')
but i have some questions about the regex formula even after reading the doc:
When grouped with the ()parenthesis, those are the separate tuple values the regex is ultimately returning, right?
If I delete the $ sign, why does the whole thing completely break down with error:unbalanced parenthesis? shouldn't the regular expression be able to grab the price after the $ regardless of if I specified $ beforehand? And building off that, if I want the output to be $10, not 10, why can't i move the $ inside and simply run r'\($[\d.]+)? it throws me another unbalanced parenthesis error.
after the (.+), in the middle, is the comma the only way python knows we are done with the value to be slotted into the second tuple value slot? So, (.+) doesn't really mean 'any character' does it? a comma would move it on to the next character if it happened to be follow by a digit, right?
could someone explain the placement of the + signs inside the parenthesis rather than outside and how that makes a difference?
sorry for the terribly noob questions. ill get good one day. thanks in advance.

When grouped with the ()parenthesis, those are the separate tuple values the regex is ultimately returning, right?
Correct
If I delete the $ sign, why does the whole thing completely break down with error:unbalanced parenthesis? shouldn't the regular expression be able to grab the price after the $ regardless of if I specified $ beforehand?
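If you delete the dollar sign, the escape character \ now escapes the opening parenthesis (, telling the regex engine to treat it as a literal character it needs to search for in your string instead of as the start of a group. The closing ) is then left without a partner, hence the unbalanced parenthesis error.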
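Both effects can be checked with a small sketch, assuming Python 3 and the string from the question (the variable names are only for illustration). Compiling the pattern with just the $ deleted raises re.error. Dropping the whole \$ gives a valid pattern, but it only finds the price with re.search, because re.match insists on matching from the first character.

```python
import re

example = "$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com"

# With the $ deleted, \( is a literal "(" and the first ")" has no partner.
try:
    re.compile(r'\([\d.]+)\. (.+), ([\d-]+)')
    compile_error = None
except re.error as exc:
    compile_error = exc
print("compile error:", compile_error)

# Dropping "\$" entirely is valid, but re.match only tries position 0
# (where the "$" sits), so it finds nothing; re.search scans ahead.
no_dollar = r'([\d.]+)\. (.+), ([\d-]+)'
print(re.match(no_dollar, example))
print(re.search(no_dollar, example).groups())
```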
after the (.+), in the middle, is the comma the only way python knows we are done with the value to be slotted into the second tuple value slot?
Yes. It tells Python to capture 1 or more of almost any character, up until the last comma after which the rest of the pattern can still match. . matches almost any single character (anything except a newline). .+ matches 1 or more of almost any character.
Note that .+ is greedy, meaning it will keep capturing, commas included, up until just before the last comma that still lets the rest of the pattern match. If you want it to stop before the first comma, you can make it lazy using .+?
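A minimal sketch of the greedy versus lazy difference, using only the address part of the string (the variable names are made up for the example):

```python
import re

address = "2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com"

# Greedy: .+ grabs as much as it can, so the comma in the pattern
# ends up being the last comma in the string.
greedy = re.match(r'(.+),', address).group(1)

# Lazy: .+? grabs as little as it can, so it stops at the first comma.
lazy = re.match(r'(.+?),', address).group(1)

print(greedy)
print(lazy)
```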
could someone explain the placement of the + signs inside the parenthesis rather than outside and how that makes a difference?
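It doesn't change the behaviour of the + itself, whether it's on the inside or outside. It just changes what gets captured into the group: with the + inside, the group holds the whole repeated run; with the + outside, the group is repeated and only its last repetition is kept.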
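To see what ends up in the group in each case, here is a small sketch with a made-up price string:

```python
import re

# + inside the parentheses: the group spans every repeated character.
inside = re.match(r'\$([\d.]+)', '$10.50').group(1)

# + outside the parentheses: the group matches one character at a time,
# and only the final repetition is kept.
outside = re.match(r'\$([\d.])+', '$10.50').group(1)

print(inside)
print(outside)
```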
EDIT:
Why can't i move the $ inside and simply run r'($[\d.]+)? it throws me another unbalanced parenthesis error.
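This is because $ also has a special meaning in regex (it matches at the end of the string, or at the end of a line in multiline mode) just like ( and ), meaning you need to escape it if you want to match the literal character, just like you escaped your parenthesis: \$. Note also that in r'\($[\d.]+)' the backslash is escaping the ( again, which is what actually triggers the unbalanced parenthesis error; what you want is r'(\$[\d.]+)'.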
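As a sketch of the fix, moving an escaped \$ inside the first group makes the dollar sign part of the captured value, while an unescaped $ inside the group compiles but can never be followed by digits (same example string as in the question):

```python
import re

example = "$10. 2109 W. Chicago Ave., 773-772-0406, theoldoaktap.com"

# Escaped dollar sign inside the group: it is captured with the price.
with_dollar = re.match(r'(\$[\d.]+)\. (.+), ([\d-]+)', example).groups()
print(with_dollar)

# Unescaped $ means "end of string", so nothing can follow it.
unescaped = re.search(r'($[\d.]+)', example)
print(unescaped)
```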

Related

Python regex OR expression

I have a file named Document.pdf and sometimes it is called Document-12345678.pdf where -12345678 is a random number.
I want to check whether a file has been downloaded into a folder. While the download is not finished, the file shows up as Document.pdf.fkasfmq or Document-12345678.pdf.fkasfmq, where .fkasfmq is a random hash from the downloader, and I don't want that to match.
I tried to make a regex like r'Document(?:[\-0-9]+).pdf', but when I test it with either Document.pdf or Document-12345678.pdf it always returns false.
From my understanding, (?:[\-0-9]+) means the part before .pdf may or may not be present, and that it matches any hyphens and digits; is that correct? I am very very rusty with regex...

The parentheses only perform grouping, not optionality. If you want to make the expression optional, the ? quantifier does that (and actually the parentheses are unnecessary, as the character class is a single expression). Though as @anubhava notes in a comment, you might as well use the * quantifier then.
r'Document[-0-9]*\.pdf'
Notice also the backslash to match a literal dot; an unescaped . matches any character (other than newline). Inside a character class, an initial or final hyphen does not need to be backslash-escaped.
On the other hand, perhaps prefer a more precise expression:
r'^Document(-\d+)?\.pdf$'
which says, optionally, a hyphen followed by one or more digits, and nothing before or after.
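The difference between the loose character-class pattern and the more precise one can be sketched as follows. The precise pattern is written with \d+ so that any number of digits is accepted; the file names and the helper names are invented for the example.

```python
import re

loose = re.compile(r'Document[-0-9]*\.pdf')
precise = re.compile(r'^Document(-\d+)?\.pdf$')

names = [
    "Document.pdf",
    "Document-12345678.pdf",
    "Document--1-.pdf",           # odd mix of hyphens and digits
    "Document.pdf.fkasfmq",       # unfinished download
]

# re.match anchors only at the start, so the loose pattern also
# accepts names that merely begin with a valid-looking file name.
loose_hits = [n for n in names if loose.match(n)]
precise_hits = [n for n in names if precise.match(n)]

print(loose_hits)
print(precise_hits)
```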

You should mark it as optional with the "?" symbol. Otherwise, you are requiring that the name contain the hyphen and/or digits part.
r'Document(?:[\-0-9]+)?\.pdf'
Or as @anubhava pointed out in the comments, it can be simplified to:
r'Document[\-0-9]*\.pdf'
This way, it will also match e.g. "Document.pdf"
Also, you should consider putting the mark "$" to signify end of string so that it doesn't match e.g. "Document.pdf.fkasfmq"
r'^Document(?:[\-0-9]+)?\.pdf$'
Or
r'^Document[\-0-9]*\.pdf$'

You can just use (\d{8}) to see if there's a document there with 8 digits in the filename.

Pattern for '.' separated words with arbitrary number of whitespaces

It's the first time that I'm using regular expressions in Python and I just can't get it to work.
Here is what I want to achieve: I want to find all strings, where there is a word followed by a dot followed by another word. After that an unknown number of whitespaces followed by either (off) or (on). For example:
word1.word2 (off)
Here is what I have come up so far.
string_group = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', analyzed_string)
\w+ for the first word
\. for the dot
\w+ for the second word
\s+ for the whitespaces
[(\(on\))(\(off\))] for the (off) or (on)
I think that the last expression might not be doing what I need it to. With the implementation right now, the program does find the right place in the string, but the output of
string_group.group(0)
Is just
word1.word2 (
instead of the whole expression I'm looking for. Could you please give me a hint what I am doing wrong?

[ ... ] is used for a character class, and will match any one character inside it unless you add a quantifier: [ ... ]+ for one or more times.
But simply adding that won't work...
\w+\.\w+\s+[(\(on\))(\(off\))]+
Will match garbage stuff like word1.word2 )(fno(nofn too, so you actually don't want to use a character class, because it'll match the characters in any order. What you can use is a capturing group, and a non-capturing group along with an OR operator |:
\w+\.\w+\s+(\((?:on|off)\))
(?:on|off) will match either on or off
Now, if you don't like the parentheses, to be caught too in the first group, you can change that to:
\w+\.\w+\s+\((on|off)\)
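The three behaviours described above can be compared side by side. This is only a sketch with made-up sample strings:

```python
import re

analyzed_string = "word1.word2    (off)"

# Character class: matches exactly one of the characters ( ) o n f.
char_class = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]', analyzed_string)
print(char_class.group(0))

# Character class with +: happily matches the same characters in any order.
garbage = re.search(r'\w+\.\w+\s+[(\(on\))(\(off\))]+', "word1.word2 )(fno(nofn")
print(garbage.group(0))

# Alternation inside literal parentheses: matches (on) or (off) only.
alternation = re.search(r'\w+\.\w+\s+\((on|off)\)', analyzed_string)
print(alternation.group(0), alternation.group(1))
```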

You've got your logical OR mixed up.
[(\(on\))(\(off\))]
should be
\((?:on|off)\)
[]s are just for matching single characters.

The square brackets are a character class, which matches any one of the characters in the brackets. You appear to be trying to use it to match one of the sub-regexes (\(on\)) and (\(off\)). The way to do that is with an alternation operation, the pipe symbol: (\(on\)|\(off\)).

I think your problem may be with the square brackets []: they indicate a set of single characters to match. So your expression would match a single instance of any of the following chars: "()ofn"
So for the string "word1.word2 (on)", you are matching only this part: "word1.word2 ("
Try using this one instead:
re.search(r'\w+\.\w+\s+\((on|off)\)', analyzed_string)
This match assumes that the () will be there, and looks for either "on" or "off" inside the parenthesis.

How could I get regex to start when it has reached a specific point within a string?

Say I have a string like {{ComputersRule}} and a regex like: [^\}]+. How would I get regular expressions to start at a specified point in the string, i.e. Once it has reached the third character in the string. If it's relevant, and I doubt it is, I'm working in Python version 2.7.3. Thank you.

I'd recommend using Python to grab the substring from the third character onwards, and then apply the regex to the rest.
Otherwise, you could just use the regex . (any character except newline) to gobble up the first n characters:
^.{3}([^\}]+)
Notice the ^.{3} which forces the [^\}]+ to not include the first three characters of the string (the ^ anchors to the start of the string/line). The brackets capture the bit you want to extract (so get capturing group 1).
In your particular case, if it's just a case of "I want the text inside the {{ and }}" you could do \{\{([^\}]+)\}\} or [^\{\}]+.
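These options can be sketched as below. The pos argument of a compiled pattern's search method is an extra alternative not mentioned above: it starts the scan at a given index without slicing. Variable names are illustrative only.

```python
import re

text = "{{ComputersRule}}"
pattern = re.compile(r'[^}]+')

# Option 1: slice the string first, then apply the regex to the rest.
sliced = pattern.search(text[2:]).group(0)

# Option 2: let a compiled pattern start scanning at index 2.
from_pos = pattern.search(text, 2).group(0)

# Option 3: consume the first three characters inside the regex itself.
# Note this starts capturing at the fourth character.
skipped = re.match(r'^.{3}([^}]+)', text).group(1)

# Option 4: match the braces explicitly and capture what is between them.
inner = re.search(r'\{\{([^}]+)\}\}', text).group(1)

print(sliced, from_pos, skipped, inner)
```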

It appears that what you want to do is to match text within the double braces.
The trick is to specify the braces in the regex but capture the part within. In this case try
\{\{([^}]+)\}\}

beginning and ending sign in regular expression in python

'[A-Za-z0-9-_]*'
'^[A-Za-z0-9-_]*$'
I want to check if a string only contains the sign in the above expression, just want to make sure no more weird sign like #%&/() are in the strings.
I am wondering if there's any difference between these two regular expression? Did the beginning and ending sign matter? Will it affect the result somehow?

In Python, re.match is anchored at the beginning of the string (re.search is not): hence with re.match the ^ sign at the beginning doesn’t make any difference. However, the $ sign does very much make one: if you don’t include it, you’re only going to match the beginning of your string, and the end could contain anything, including the characters you want to exclude. Just try re.match("[a-z0-9]", "abcdef/%&").
In addition to that, you may want to use a regular expression that simply excludes the characters you’re testing for; it’s much safer (hence [^#%&/()]; parentheses don’t need escaping inside a character class).

The beginning and end sign match the beginning and end of a String.
The first will match any String that contains zero or more occurrences of the class [A-Za-z0-9-_] (basically any string whatsoever...).
The second will match an empty String, but not one that contains characters not defined in [A-Za-z0-9-_]
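A short sketch of how the two expressions behave. The hyphen is moved to the end of the class here, which matches the same characters. re.fullmatch is an alternative to the explicit anchors and assumes Python 3.4 or later.

```python
import re

unanchored = r'[A-Za-z0-9_-]*'
anchored = r'^[A-Za-z0-9_-]*$'

samples = ["abc_123", "abc#%&", "#%&", ""]

# A match object is always truthy; None means "no match".
unanchored_ok = [bool(re.match(unanchored, s)) for s in samples]
anchored_ok = [bool(re.match(anchored, s)) for s in samples]
fullmatch_ok = [bool(re.fullmatch(unanchored, s)) for s in samples]

print(unanchored_ok)  # the * lets it match an empty prefix of anything
print(anchored_ok)
print(fullmatch_ok)
```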

Yes it will. A regex can match anywhere in its input, so a string containing # will still match your first regex.

Python Regex (Search Multiple values in one string)

In python regex how would I match against a large string of text and flag if any one of the regex values are matched... I have tried this with "|" or statements and i have tried making a regex list.. neither worked for me.. here is an example of what I am trying to do with the or..
I think my "or" gets commented out
patterns=re.compile(r'[\btext String1\b] | [\bText String2\b]')
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")
The above code always says it matches regardless if the string appears and if I change it around a bit I get matches on the first regex but never checks the second.... I believe this is because the "Raw" is commenting out my or statement but how would I get around this??
I also tried to get around this by taking out the "Raw" statement and putting double slashes on my \b for escaping but that didn't work either :(
patterns=re.compile(\\btext String1\\b | \\bText String2\\b)
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")
I then tried to do 2 separate raw statements with the or and the interpreter complains about unsupported str opperands...
patterns=re.compile(r'\btext String1\b' | r'\bText String2\b')
if(patterns.search(MyTextFile)):
    print ("YAY one of your text patterns is in this file")

patterns=re.compile(r'(\btext String1\b)|(\bText String2\b)')
You want a group (optionally capturing), not a character class. Technically, you don't need a group here:
patterns=re.compile(r'\btext String1\b|\bText String2\b')
will also work (without any capture).
The way you had it, it checked for either one of the characters between the first square brackets, or one of those between the second pair. You may find a regex tutorial helpful.
It should be clear where the "unsupported str operands" error comes from. You can't OR strings, and you have to remember the | is processed before the argument even gets to compile.
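The difference between the bracketed pattern and plain alternation can be checked with a sketch like the one below. The sample texts are invented. Building the pattern from a list with re.escape and '|'.join is just one possible way to handle many phrases, not something the question requires.

```python
import re

bracketed = re.compile(r'[\btext String1\b] | [\bText String2\b]')
alternation = re.compile(r'\btext String1\b|\bText String2\b')

unrelated = "nothing relevant in here"
wanted = "header\nsome Text String2 here\n"

# The bracketed version only needs one listed character next to a space.
bracketed_hit = bracketed.search(unrelated) is not None
# The alternation needs one of the two full phrases.
unrelated_hit = alternation.search(unrelated) is not None
wanted_hit = alternation.search(wanted) is not None

# Hypothetical helper: build the same alternation from a list of phrases.
phrases = ["text String1", "Text String2"]
from_list = re.compile('|'.join(r'\b' + re.escape(p) + r'\b' for p in phrases))
list_match = from_list.search(wanted).group(0)

print(bracketed_hit, unrelated_hit, wanted_hit, list_match)
```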

This part [\btext String1\b] means: is there a backspace character (which is what \b means inside square brackets, rather than a word boundary) or one of the characters in "text String1" present. So that matches anything but an empty line I think.

In a RE pattern, square brackets [ ] indicate a "character class" (depending on what's inside them, "any one of these characters" or "any character except one of these", the latter indicated by a caret ^ as the first character after the opening [). This is what you're expressing and it has absolutely nothing to do with what you want -- just remove the brackets and you should be fine;-).
