Python regex: back reference a named group

I'm attempting to parse phone numbers that can come through in different ways. For example:
(321) 123-4567
(321) 1234567
321-123-4567
321123-4567
I then want to grab each of the three parts separately. My thought is to use named groups and some and/or arrangement, like so:
(^\s*(?P<area>[0-9]{3})\-?(?P<fst>[0-9]{3})\-(?P<lst>[0-9]{4}))|(^\s*\(\area\)\s*(\fst)\-?(\lst))
The problem with that, I believe, is that I am not referencing the named groups properly. I'm trying to use https://regex101.com/ to help, but I am still getting stuck. Because the parentheses around the area code should either both be there or both be absent, I don't want to make each one individually optional with "?", like:
\(?(?P<area>[0-9]{3})\)?
Can anyone help me with this? Thank you so much.
I'm using Python 3.6 and the re package.

There were a few issues with your regex. You didn't make the brackets optional, and you didn't allow optional spaces between the area code and the first part. Without seeing your Python code it's not easy to know exactly how you were doing things, but I did it by compiling the regex once and then running it against the list of numbers.
from __future__ import print_function
import re

phone_numbers = [
    '(321) 123-4567',
    '(321) 1234567',
    '321-123-4567',
    '321123-4567',
]

regex = re.compile(r'^\s*\(?(?P<area>[0-9]{3})[) -]*(?P<fst>[0-9]{3})-?(?P<sec>[0-9]{4})')

for p in phone_numbers:
    print(regex.sub(r'(\g<area>) \g<fst>-\g<sec>', p))
This isn't perfect, as it will also accept strings that aren't valid according to your list, but that shouldn't be a problem. For example, '(321))- - )) 123-4567' would still be parsed and reformatted correctly.

I'd use group testing: ^(\()?(?P<area>\d{3})(?(1)\))[ -]?(?P<fst>\d{3})-?(?P<lst>\d{4})$.
In there:
(\()? captures an opening parenthesis in group 1 when it exists.
(?(1)\)) tests for the existence of captured group 1 and, if so, matches a closing parenthesis.
The rest is pretty straightforward.
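A minimal sketch of this conditional pattern with Python 3.6's re module, run over the numbers from the question:
import re

# (?(1)\)) requires a closing ")" only when group 1 (the opening "(") actually matched
pattern = re.compile(r'^(\()?(?P<area>\d{3})(?(1)\))[ -]?(?P<fst>\d{3})-?(?P<lst>\d{4})$')

for number in ['(321) 123-4567', '(321) 1234567', '321-123-4567', '321123-4567']:
    m = pattern.match(number)
    if m:
        print(m.group('area'), m.group('fst'), m.group('lst'))
# each line prints: 321 123 4567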

Related

Improving accuracy/brevity of regex for inconsistent url filtering

So, for some lulz, a friend and I were playing with the idea of filtering a list (100k+) of urls to retrieve only the parent domain (ex. "domain.com|org|etc"). The only caveat is that they are not all nice and matching in format.
So, to explain, some may be "http://www.domain.com/urlstuff", some have country codes like "www.domain.co.uk/urlstuff", while others can be a bit more odd, more akin to "hello.in.con.sistent.urls.com/urlstuff".
So, story aside, I have a regex that works:
import re
firsturl = 'www.foobar.com/fizz/buzz'
m = re.search('\w+(?=(\..{3}/|\..{2}\..{2}/))\.(.{3}|.{2}\..{2})', firsturl)
m.group(0)
which returns:
foobar.com
It looks ahead for the first "/" that ends the domain part of the URL, then returns the two "."-separated fields before it.
So, my query: would anyone in the Stack hive mind have any wisdom to share on how this could be done with a better/shorter regex, or a regex that doesn't rely on a lookahead for the "/" within the string?
Appreciation for all of the help in this!
I do think that regex is just the right tool for this. Regex is pattern matching, which is put to best use when you have a known pattern that might have several variations, as in this case.
In your explanation of and attempted solution to the problem, I think you are greatly oversimplifying it, though. TLDs come in many more flavors than 2-letter country codes and 3-letter everything else. See ICANN's list of top-level domains for the hundreds currently available, with lengths from 2 letters and up. Also, you may have URLs without any slashes and some with multiple slashes and dots after the domain name.
So here's my solution (see on regex101):
^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})
What you want is captured in the first matching group; see the sketch after the breakdown below.
Breakdown:
^(?:https?://)? matches a possible protocol at the beginning
(?:[^/]+\.)* matches possible multiple non-slash sequences, each followed by a dot
([^/]+\.[a-z]{2,}) matches (and captures) one final non-slash sequence followed by a dot and the TLD (2+ letters)
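A quick check of this pattern in Python; the sample URLs come from the question, and the comments show what this regex actually captures for each:
import re

pattern = re.compile(r'^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})')

for url in ['http://www.domain.com/urlstuff',
            'www.domain.co.uk/urlstuff',
            'hello.in.con.sistent.urls.com/urlstuff']:
    print(pattern.match(url).group(1))
# domain.com
# co.uk
# urls.com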
You can use this regex instead:
import re

firsturl = 'www.foobar.com/fizz/buzz'
# group(1) is the captured text before the first "/"; group() would also include the slash
domain = re.match(r"(.+?)/", firsturl).group(1)
Notice, though, that this will only work without 'http://'.

Find string in possibly multiple parentheses?

I am looking for a regular expression that discriminates between a string that contains a numerical value enclosed in parentheses, and a string that contains one outside of them. The problem is, parentheses may be nested inside each other:
So, for example the expression should match the following strings:
hey(example1)
also(this(onetoo2(hard)))
but(here(is(a(harder)one)maybe23)Hehe)
But it should not match any of the following:
this(one)is22misleading
how(to(go)on)with(multiple)3parent(heses(around))
So far I've tried
\d[A-Za-z] \)
and easy things like that. The problem with this one is that it does not match example 2, because the digit there is followed by a "(".
How could I solve this one?
The problem is not simple pattern matching: you need to keep track of arbitrarily nested parentheses, which means regular expressions are not the right tool for this.
Instead, you need lexical analysis and parsing. There are many libraries available for that job.
You might try the parsing or pyparsing libraries.
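If you just want something that works without pulling in a parser library, here is a minimal hand-rolled sketch (my own illustration, not taken from either library) that tracks parenthesis depth in a single pass; the test strings are the ones from the question:
def digits_only_inside_parens(s):
    """Return True if s contains digits and every digit sits inside parentheses."""
    depth = 0
    found = False
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(depth - 1, 0)
        elif ch.isdigit():
            if depth == 0:
                return False          # a digit outside all parentheses
            found = True
    return found

for s in ['hey(example1)', 'also(this(onetoo2(hard)))',
          'but(here(is(a(harder)one)maybe23)Hehe)',
          'this(one)is22misleading',
          'how(to(go)on)with(multiple)3parent(heses(around))']:
    print(s, digits_only_inside_parens(s))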
These types of regexes are not always easy, but sometimes it's possible to come up with one, provided the input remains somewhat consistent. A pattern generally like this should work:
(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)
Code:
import re

searchtext = "but(here(is(a(harder)one)maybe23)Hehe)"   # one of the strings that should match
p = re.compile(r'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)', re.MULTILINE)  # note: the ur'' prefix is not valid in Python 3
result = re.findall(p, searchtext)
print(result)
Result:
https://regex101.com/r/aL8bB8/1

How to combine multiple regular expressions into one line?

My script works fine doing this:
images = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", doc)
videos = re.findall("\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)", doc)
However, I believe it is inefficient to search through the whole document twice.
Here's a sample document if it helps: http://pastebin.com/5kRZXjij
I would expect the following output from the above:
images = http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg
videos = http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl
Instead it would be better to do something like:
image_and_video_links = re.findall(" <match-image-links-or-video links> ", doc)
How can I combine the two re.findall lines into one?
I have tried using the | character but I always fail to match anything. So I'm sure I'm completely confused as to how to use it properly.
As mentioned in the comments, a pipe (|) should do the trick.
The regular expression
(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))
catches either of the two patterns.
Demo on Regex Tester
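To sketch how that behaves from Python: with capture groups on both sides of the pipe, findall() returns one tuple per match, so the image URL lands in group 2 and the video URL in group 4. The doc string below is only a stand-in for the pastebin sample, built from the expected output in the question:
import re

doc = ('<img src="http://37.media.tumblr.com/tumblr_lnmh4tD3sM1qi02clo1_500.jpg">'
       ' http://bassrx.tumblr.com/video_file/86319903607/tumblr_lo8i76CWSP1qi02cl')

combined = re.compile(r'(src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg))'
                      r'|(\S*?(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*))')

images, videos = [], []
for g1, g2, g3, g4 in combined.findall(doc):
    if g2:
        images.append(g2)
    if g4:
        videos.append(g4)
print(images)
print(videos)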
If you really want efficient...
For starters, I would cut out the \S*? in the second regex. It serves no purpose apart from an opportunity for lots of backtracking.
src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)|(http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*)
Other ideas
You can drop the capture groups by using a small lookbehind in the first one, which lets you remove all the parentheses and directly match what you want. Not faster, but tidier:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-zA-Z0-9]*
Do you intend for the periods after src and media to mean "any character", or to mean "a literal period"? If the latter, escape them: \.
You can use the re.IGNORECASE option and get rid of some letters:
(?<=src.\")\S*?media.tumblr\S*?tumblr_\S*?jpg|http\S*?video_file\S*?tumblr_[a-z0-9]*

trouble understanding regular expressions, python

I am quite new to the re module in Python, but I have been trying to write a regular expression to grab the version number from a file name. In most cases this snippet seems to work:
test = "filename.ver3_576.exr"
print(re.search("(?!(v|ver|version|vers))\d+", test.lower()).group())
but if I change the test string a little, it does not give me the results I would expect:
test2 = "filename.ver_3_576.exr" # expects None, because of the underscore, gets 3
test3 = "filenameVe2_version201_1001.exr" # expects 201, gets2, "ve"(exactly) is not something I want to search for
I am obviously doing something wrong here, but am struggling to identify what that might be.
Any help would be greatly appreciated, cheers
re.search('(version|vers|ver|v)(\d+)', test.lower()).group(2)
To answer your comment: you didn't use a lookbehind expression. That's a negative lookahead expression. The expression you used is identical to '\d+' (not so easy to explain why).
It's not easy to use a positive lookbehind re in this case because it requires a fixed-width pattern. The following re, for example, will throw an error: '(?<=(version|vers|ver|v))\d+', so I suggest you use the re that I posted because it's the most straightforward.
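A quick sketch running that pattern over the file names from the question:
import re

pattern = re.compile(r'(version|vers|ver|v)(\d+)')

for name in ['filename.ver3_576.exr',
             'filename.ver_3_576.exr',
             'filenameVe2_version201_1001.exr']:
    m = pattern.search(name.lower())
    print(name, '->', m.group(2) if m else None)
# filename.ver3_576.exr -> 3
# filename.ver_3_576.exr -> None
# filenameVe2_version201_1001.exr -> 201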

Generate random string from regex character set

I assume there's some beautiful Pythonic way to do this, but I haven't quite figured it out yet. Basically, I'm looking to create a testing module and would like a nice, simple way for users to define a character set to pull from. I could potentially concatenate a list of the various charsets in the string module, but that strikes me as a very unclean solution. Is there any way to get the character set that a regex character class represents?
Example:
def foo(regex_set):
    re.something(re.compile(regex_set))

foo("[a-z]")
>>> abcdefghijklmnopqrstuvwxyz
The compile is of course optional, but in my mind that's what this function would look like.
Paul McGuire, author of Pyparsing, has written an inverse regex parser, with which you could do this:
import invRegex
print(''.join(invRegex.invert('[a-z]')))
# abcdefghijklmnopqrstuvwxyz
If you do not want to install Pyparsing, there is also a regex inverter that uses only modules from the standard library with which you could write:
import inverse_regex
print(''.join(inverse_regex.ipermute('[a-z]')))
# abcdefghijklmnopqrstuvwxyz
Note: neither module can invert all regex patterns.
And there are differences between the two modules:
import invRegex
import inverse_regex
print(repr(''.join(invRegex.invert('.'))))
print(repr(''.join(inverse_regex.ipermute('.'))))
yields
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~'
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~ \t\n\r\x0b\x0c'
Here is another difference, this time pyparsing enumerates a larger set of matches:
x = list(invRegex.invert('[a-z][0-9]?.'))
y = list(inverse_regex.ipermute('[a-z][0-9]?.'))
print(len(x))
# 26884
print(len(y))
# 1100
A regex is not needed here. If you want users to select a character set, just let them pick the characters. As I said in my comment, simply listing all the characters and putting checkboxes by them would be sufficient. If you want something more compact, or that just looks cooler, you could arrange those checkboxes in a grid or a collapsible panel.
Of course, if you actually use this, what you come up with will undoubtedly look better than a rough mock-up (and it will actually include all the characters, not just "A").
If you need, you could include a button to invert the selection, select all, clear selection, save selection, or anything else you need to do.
If it's just simple ranges, you could parse them manually:
def range_parse(rng):
    # renamed from min/max to avoid shadowing the builtins
    lo, hi = rng.split("-")
    return "".join(chr(i) for i in range(ord(lo), ord(hi) + 1))

print(range_parse("a-z") + range_parse("A-Z"))
but it's gross...
Another solution I thought of to simplify the problem:
Stick your own [ and ] on the line as part of the prompt, and disallow those characters in the input. After you scan the input and verify it doesn't contain anything matching [\[\]], you can prepend [ and append ] to the string and use it like a regex against a string of all the characters needed ("abcdefghijklmnopqrstuvwxyz", for instance).
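A rough sketch of that idea; the function name and the use of string.printable as the "all characters" pool are my own choices, purely for illustration:
import re
import string

def charset_from_input(user_input):
    # refuse anything containing [ or ] so the user can't break out of the class
    if re.search(r'[\[\]]', user_input):
        raise ValueError('square brackets are not allowed')
    # wrap the input in [ ] ourselves and run it over a pool of candidate characters
    return ''.join(re.findall('[' + user_input + ']', string.printable))

print(charset_from_input('a-z'))
# abcdefghijklmnopqrstuvwxyz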
