I want to parse this particular address with a regex in Python.
address = "16220 Scottsdale Road, Suite 100 Scottsdale, AZ 85254"
Why does this regex return None?
try:
    print(re.search('/[0-9]{1,5} (.*?), (.*?) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?/', address))
except:
    None
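Python patterns are plain strings with no /.../ delimiters, so the slashes are matched literally and the search fails. Remove the leading and trailing slashes and use a raw string instead: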
>>> re.search(r'[0-9]{1,5} (.*?), (.*?) [a-zA-Z]{2} [0-9]{5}(-[0-9]{4})?', address)
<_sre.SRE_Match object; span=(0, 53), match='16220 Scottsdale Road, Suite 100 Scottsdale, AZ 8>
Here is the difference between greedy and non-greedy matching (compare the matched strings):
>>> re.search(r'.*?,', "abcd,abcde,abc, f")
<_sre.SRE_Match object; span=(0, 5), match='abcd,'>
>>> re.search(r'.*,', "abcd,abcde,abc, f")
<_sre.SRE_Match object; span=(0, 15), match='abcd,abcde,abc,'>
I have an issue with the following task.
I have a string:
ABCD[A] or A7D3[A,B,C]
First 4 Characters are 0-9 or A-Z.
5th character is [.
6th to nth character is A-Z followed by , in case there is more than one letter
e.g. A, E,F, A,B,C,D,F I don't know if there is a character limit with the middle part, so I have to assume it is 26 (A-Z).
last character is ].
I need to verify that the structure of the string is as stated above.
ABCD[A,B]
BD1F[E,G,A,R]
S4P5[C]
I tried with a regex (in Python):
r = re.match('^[0-9A-Z]{4}[[A-Z,]+$',text)
Here text is an example of the string; however, it is not working.
A true/false or 0/1 result would be fine.
Any ideas how this could be done? From what I've seen on Google so far, a regex would work, but I'm not proficient enough with it to solve this by myself.
You can use '[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]':
import re
for s in ['ABCD[A,B]', 'BD1F[E,G,A,R]', 'S4P5[C]']:
    print(re.fullmatch(r'[0-9A-Z]{4}\[[A-Z](?:,[A-Z]){,25}\]', s))
Note that the (?:,[A-Z]){,25} limits the number of letters in the square brackets but does not ensure that they are non-duplicates.
Output:
<re.Match object; span=(0, 9), match='ABCD[A,B]'>
<re.Match object; span=(0, 13), match='BD1F[E,G,A,R]'>
<re.Match object; span=(0, 7), match='S4P5[C]'>
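If duplicates should also be rejected, one possible approach is to capture the bracketed part and compare the letters with a set. This is only a sketch; the helper name is_valid and the capture group are additions for illustration, not part of the pattern above:

```python
import re

# Same shape as the pattern above, with the bracket contents captured in group 1.
PATTERN = re.compile(r'[0-9A-Z]{4}\[([A-Z](?:,[A-Z]){,25})\]')

def is_valid(text):
    """Return True if text has the required shape and no repeated letters."""
    m = PATTERN.fullmatch(text)
    if m is None:
        return False
    letters = m.group(1).split(',')
    # A set drops duplicates, so the lengths differ if any letter repeats.
    return len(letters) == len(set(letters))

for s in ['ABCD[A,B]', 'S4P5[C]', 'S4P5[C,C]', 'S4P5[CD]']:
    print(s, is_valid(s))
```

This returns a plain True or False, which is the kind of result the question asks for.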
You can try:
import re
lst = ["ABCD[A,B]", "BD1F[E,G,A,R]", "S4P5[C]", "S4P5[CD]"]
pattern = r"^[A-Z0-9]{4}\[[A-Z](?:,[A-Z])*]$"
for string in lst:
    m = re.match(pattern, string)
    print(bool(m), m)
output:
True <re.Match object; span=(0, 9), match='ABCD[A,B]'>
True <re.Match object; span=(0, 13), match='BD1F[E,G,A,R]'>
True <re.Match object; span=(0, 7), match='S4P5[C]'>
False None
Explanation:
^: beginning of the string.
[A-Z0-9]{4} for getting the first 4 characters.
\[ for escaping the bracket.
[A-Z] first character inside bracket is mandatory.
(?:,[A-Z])* the rest would be optional.
]$: end of the string.
Note-1: You could restrict the inside characters to 25 by changing * to {,25}.
Note-2: I didn't escape the last bracket but doing so doesn't hurt if you want (maybe better).
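One subtlety worth knowing when choosing between re.match with $ and re.fullmatch: $ also matches just before a trailing newline. The sketch below illustrates this with a made-up input that ends in "\n":

```python
import re

anchored = r"^[A-Z0-9]{4}\[[A-Z](?:,[A-Z])*]$"
unanchored = r"[A-Z0-9]{4}\[[A-Z](?:,[A-Z])*\]"

with_newline = "S4P5[C]\n"  # e.g. a line read from a file without stripping

# "$" accepts the position before a final newline, so this still matches.
loose = bool(re.match(anchored, with_newline))
# fullmatch requires the pattern to cover the entire string.
strict = bool(re.fullmatch(unanchored, with_newline))

print(loose, strict)  # True False
```

If the input may carry a trailing newline that should be rejected, re.fullmatch (or \Z instead of $) is the stricter choice.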
I'm trying to define a regex over several lines using re.VERBOSE, but Python is adding a newline symbol, e.g.:
When not using multiline
In [1]: pat = re.compile(r'''(?P<host>(\d{1,3}\.){3}\d{1,3})( - )(?P<user_name>(\w+|-)).''')
...: pat
re.compile(r'(?P<host>(\d{1,3}\.){3}\d{1,3})( - )(?P<user_name>(\w+|-)).',re.UNICODE)
But when trying to define as multiline
In [2]: pat = re.compile(r'''\
...: (?P<host>(\d{1,3}\.){3}\d{1,3})\
...: ( - )(?P<user_name>(\w+|-)).''', re.MULTILINE|re.VERBOSE)
In [4]: pat
re.compile(r'\\n(?P<host>(\d{1,3}\.){3}\d{1,3})\\n( - )(?P<user_name>(\w+|-)).',
re.MULTILINE|re.UNICODE|re.VERBOSE)
I keep getting a \n where the next part of the regex is defined, but it shouldn't be there.
How am I supposed to define a multiline regex?
There's no inherent problem with having newlines in your regex when you use the re.VERBOSE flag, as whitespace is ignored, with an important caveat:
Whitespace within the pattern is ignored, except when in a character
class, or when preceded by an unescaped backslash
Your first problem is that you are adding an unnecessary \ to the end of each of the lines in your regex, and they are then appearing in the regex, making the newlines preceded by an unescaped backslash and thus required for a match. Consider this trivial example:
pat = re.compile(r'''
\d+
-
\d+''', re.VERBOSE)
pat
# re.compile('\n\\d+\n-\n\\d+', re.VERBOSE) - note newlines in the regex
pat.match('24-34')
# <re.Match object; span=(0, 5), match='24-34'> - but it still matches fine
pat = re.compile(r'''\
\d+\
-\
\d+''', re.VERBOSE)
pat
# re.compile('\\\n\\d+\\\n-\\\n\\d+', re.VERBOSE)
pat.match('24-34')
# nothing
pat.match('\n24\n-\n34')
# <re.Match object; span=(0, 8), match='\n24\n-\n34'> - newlines required to be matched
Your other problem is that your regex is attempting to match whitespace in this capture group:
( - )
To match whitespace when you have the re.VERBOSE flag set, you must follow the rules and escape it or put it in a character class. For example:
pat = re.compile(r'( - )', re.VERBOSE)
pat.match(' - ')
# nothing - the spaces in the regex are ignored
pat.match('-')
# <re.Match object; span=(0, 1), match='-'> - matches just the `-`
pat = re.compile(r'(\ -[ ])', re.VERBOSE) # important whitespace treated appropriately
pat.match(' - ')
# <re.Match object; span=(0, 3), match=' - '> - matches the string because whitespace rules followed
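Putting both fixes together, the pattern from the question could be written roughly like this. It is a sketch: the sample log line is hypothetical, the trailing . from the original pattern is left out, and re.MULTILINE is dropped because the pattern uses neither ^ nor $:

```python
import re

pat = re.compile(r'''
    (?P<host>(\d{1,3}\.){3}\d{1,3})  # dotted-quad host
    ([ ]-[ ])                        # literal " - ", spaces kept via character classes
    (?P<user_name>(\w+|-))           # user name, or "-" when absent
''', re.VERBOSE)

line = '127.0.0.1 - frank'  # hypothetical log line, for illustration only
m = pat.match(line)
print(m.group('host'), m.group('user_name'))  # 127.0.0.1 frank
```

No backslashes are needed at the line ends, and re.VERBOSE also allows the inline comments shown here.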
I want to find all occurrences of LocalizedString(...) in a text file. Anything could appear between the parentheses. How can I find these using regular expressions?
I searched online but had no luck.
Thank you
Let's say we have a file with the following string:
class MyClass {
    func myFunc(){
        var text = LocalizedString("arrow.Find")
    }
}
r"LocalizedString\([^)]*\)"
LocalizedString\( matches LocalizedString( literally.
[^)]* matches anything that isn't a closing parenthesis.
\) matches a closing parenthesis literally.
Example:
>>> import re
>>> regex = re.compile(r"LocalizedString\([^)]*\)")
>>> regex.match('LocalizedString(a,b)')
<re.Match object; span=(0, 20), match='LocalizedString(a,b)'>
Note that the match for code like LocalizedString(a(x),b) is not the entire string:
>>> regex.match('LocalizedString(a(x),b)')
<re.Match object; span=(0, 20), match='LocalizedString(a(x)'>
This happens because [^)]* stops at the first closing parenthesis. More generally, the patterns supported by Python's re module can't handle arbitrarily nested parentheses such as a(b(c(x()))).
You could also greedily match everything between the first opening parenthesis and the last closing parenthesis in the input:
>>> regex2 = re.compile(r"LocalizedString\(.*\)")
>>> regex2.match('LocalizedString(a(x),b)')
<re.Match object; span=(0, 23), match='LocalizedString(a(x),b)'>
>>> regex2.match('LocalizedString(a(x),b) && print("Hello!")')
<re.Match object; span=(0, 42), match='LocalizedString(a(x),b) && print("Hello!")'>
To match just what's inside the parentheses:
r"LocalizedString\(([^)]*)\)"
Then the capture group ([^)]*) will contain the required data:
>>> regex3 = re.compile(r"LocalizedString\(([^)]*)\)")
>>> regex3.match('LocalizedString(a,b)').groups()
('a,b',)
>>> regex3.match('LocalizedString(a(x),b)').groups()
('a(x',)
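Since the question asks for all occurrences rather than a match at the start of a string, finditer or findall is probably the better fit. A minimal sketch, using an in-memory string in place of the file contents (the second call in the sample is made up for illustration):

```python
import re

# Stand-in for the text read from the file, e.g. via open(path).read().
source = '''class MyClass {
    func myFunc(){
        var text = LocalizedString("arrow.Find")
        var other = LocalizedString("arrow.Next")
    }
}'''

pattern = re.compile(r'LocalizedString\(([^)]*)\)')

# Whole calls, anywhere in the text.
calls = [m.group(0) for m in pattern.finditer(source)]
# With exactly one capture group, findall returns just the captured arguments.
arguments = pattern.findall(source)

print(calls)
print(arguments)
```

The same limitation applies as before: an argument that itself contains a closing parenthesis is cut off at that point.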
Why does the following pattern result in a match of "A cat" instead of "a hat", given that matching is greedy by default?
>>> m = re.match(r'(\w+) (\w+)', "A cat jumpped over a hat")
>>> m
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
Could someone shed some light on this?
From the official Python documentation on regexes
re.match() checks for a match only at the beginning of the string
From the official documentation:
If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.
As others have alluded to, re.match starts at the beginning of the string and consumes only what the pattern requires. Greediness affects how much each \w+ takes, not where the match starts. Notice that match='A cat' at the end of the object's string representation shows what r'(\w+) (\w+)' matched in "A cat jumpped over a hat".
If you were to add a $ to the end of your pattern, indicating the string-to-match should end there, it will not result in a match. And if you were to take that same pattern and shorten it to only two words, it would match once again:
>>> re.match(r'(\w+) (\w+)', "A cat jumpped over a hat")
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
>>> re.match(r'(\w+) (\w+)$', "A cat jumpped over a hat")
>>> re.match(r'(\w+) (\w+)$', "A cat")
<_sre.SRE_Match object; span=(0, 5), match='A cat'>
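To actually get "a hat", one option is to let re.search scan forward until the anchored pattern fits, or to collect every word pair with re.findall. A small sketch using the sentence from the question:

```python
import re

sentence = "A cat jumpped over a hat"

# search tries each start position in turn; "$" forces the pair to end the string.
last_pair = re.search(r'(\w+) (\w+)$', sentence)
# findall returns all non-overlapping pairs, left to right.
all_pairs = re.findall(r'(\w+) (\w+)', sentence)

print(last_pair.group(0))  # a hat
print(all_pairs)           # [('A', 'cat'), ('jumpped', 'over'), ('a', 'hat')]
```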
I am trying to match the following strings in Python 2.7 using the regular expression package re, and I am having trouble coming up with the regex:
https://www.this.com/john-smith/e5609239
https://www.this.com/jane-johnson/e426609216
https://www.this.com/wendy-saad/e172645609215
https://www.this.com/nick-madison/e7265609214
https://www.this.com/tom-taylor/e17265709211
https://www.this.com/james-bates/e9212
So the prefix is fixed ("https://www.this.com/"), followed by a variable number of lowercase letters, then "-", then more lowercase letters, then "/e", then a variable number of digits.
Here is what I have tried to no avail:
href=re.compile("https://www.this.com/people-search/[a-z]+[\-](?P<firstNumBlock>\d+)/")
href=re.compile("https://www.this.com/people-search/[a-z][\-][a-z]+/e[0-9]+")
Thanks for your help!
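Your patterns don't quite describe the URLs: the first attempt has nothing for the part after the hyphen or for the e prefix, and the second uses [a-z] without +, so it allows only a single letter before the hyphen. You are also not using raw strings, which means backslashes are interpreted by the string literal first; \d and \- happen to pass through unchanged, but a raw string avoids relying on that. Finally, a lone hyphen doesn't need escaping or a character class, so [\-] can be written as -. You can simplify your expression as follows: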
expression = r"https://www\.this\.com/people-search/[a-z]+-[a-z]+/e\d+"
With the following data:
strings = ['https://www.this.com/people-search/john-smith/e5609239',
           'https://www.this.com/people-search/jane-johnson/e426609216',
           'https://www.this.com/people-search/wendy-saad/e172645609215',
           'https://www.this.com/people-search/nick-madison/e7265609214',
           'https://www.this.com/people-search/tom-taylor/e17265709211',
           'https://www.this.com/people-search/james-bates/e9212']
Result:
>>> for s in strings:
...     print(re.match(expression, s))
...
<_sre.SRE_Match object; span=(0, 54), match='https://www.this.com/people-search/john-smith/e56>
<_sre.SRE_Match object; span=(0, 58), match='https://www.this.com/people-search/jane-johnson/e>
<_sre.SRE_Match object; span=(0, 59), match='https://www.this.com/people-search/wendy-saad/e17>
<_sre.SRE_Match object; span=(0, 59), match='https://www.this.com/people-search/nick-madison/e>
<_sre.SRE_Match object; span=(0, 58), match='https://www.this.com/people-search/tom-taylor/e17>
<_sre.SRE_Match object; span=(0, 52), match='https://www.this.com/people-search/james-bates/e9>
href = re.compile(r"https://www\.this\.com/people-search/[a-z]+-[a-z]+/e[0-9]+")
re.compile(r'https://www\.this\.com/[a-z-]+/e\d+')
[a-z-]+ matches john-smith
e\d+ matches e5609239
import re
from pprint import pprint
text = '''https://www.this.com/john-smith/e5609239
https://www.this.com/jane-johnson/e426609216
https://www.this.com/wendy-saad/e172645609215
https://www.this.com/nick-madison/e7265609214
https://www.this.com/tom-taylor/e17265709211
https://www.this.com/james-bates/e9212'''
href = re.compile(r'https://www\.this\.com/[a-zA-Z]+\-[a-zA-Z]+/e[0-9]+')
m = href.findall(text)
pprint(m)
Outputs:
['https://www.this.com/john-smith/e5609239',
'https://www.this.com/jane-johnson/e426609216',
'https://www.this.com/wendy-saad/e172645609215',
'https://www.this.com/nick-madison/e7265609214',
'https://www.this.com/tom-taylor/e17265709211',
'https://www.this.com/james-bates/e9212']
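If the name and the number are needed separately, named groups combined with re.fullmatch are one way to pull them out. This is a sketch only; the helper parse and the group names name and id are made up for illustration:

```python
import re

# Dots are escaped so that "." matches only a literal dot.
URL = re.compile(r'https://www\.this\.com/(?P<name>[a-z]+-[a-z]+)/e(?P<id>\d+)')

def parse(url):
    """Return (name, id) for a well-formed URL, or None otherwise."""
    m = URL.fullmatch(url)
    return (m.group('name'), m.group('id')) if m else None

print(parse('https://www.this.com/john-smith/e5609239'))
print(parse('https://www.this.com/james-bates/e9212/extra'))
```

Using fullmatch rejects URLs with trailing content, which re.match alone would accept.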