Extract domain using regular expression

Extract domain using regular expression - python

Suppose I have these URLs:
http://abdd.eesfea.domainname.com/b/33tA$/0021/file
http://mail.domainname.org/abc/abc/aaa
http://domainname.edu
I just want to extract "domainname.com", "domainname.org" or "domainname.edu".
How can I do this?
I think I need to find the last dot just before "com|org|edu..." and print the content from the dot before it to the dot after it (if there is one).
I need help with the regular expression.
Thanks a lot!
I am using Python.

why use regex?
http://docs.python.org/library/urlparse.html
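As a minimal sketch of the urlparse route (in Python 3 the module lives at urllib.parse): the helper name last_two_labels is hypothetical, and keeping only the last two labels assumes single-label suffixes such as .com, so it would give the wrong answer for names like example.co.uk.

```python
from urllib.parse import urlparse

def last_two_labels(url):
    """Return the last two dot-separated labels of the URL's host, or ""."""
    host = urlparse(url).hostname  # None when the URL has no network location
    if not host:
        return ""
    return ".".join(host.split(".")[-2:])

for url in (
    "http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    "http://mail.domainname.org/abc/abc/aaa",
    "http://domainname.edu",
    "http://domainname.com:80",
):
    print(last_two_labels(url))
```

The hostname attribute already drops the port and lowercases the host, which is why no extra stripping is needed here.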

If you would like to go the regex route...
RFC-3986 is the authority regarding URIs. Appendix B provides this regex to break one down into its components:
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
# Where:
# scheme = $2
# authority = $4
# path = $5
# query = $7
# fragment = $9
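To see how the group numbers map onto the components, here is a small sketch that applies the Appendix B expression with Python's re module. The sample URL (including its port, query and fragment) is made up for illustration.

```python
import re

# The RFC 3986 Appendix B expression; Python groups are numbered like $1..$9.
re_3986 = r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

# Hypothetical sample URL that exercises every component.
m = re.match(re_3986, "http://mail.domainname.org:8080/abc/abc/aaa?x=1#top")
parts = {
    "scheme": m.group(2),
    "authority": m.group(4),
    "path": m.group(5),
    "query": m.group(7),
    "fragment": m.group(9),
}
print(parts)
```

Note that the authority still contains the port, so extracting the domain needs a second step.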
Here is an enhanced, Python friendly version which utilizes named capture groups. It is presented in a function within a working script:
import re
def get_domain(url):
    """Return top two domain levels from URI"""
    re_3986_enhanced = re.compile(r"""
        # Parse and capture RFC-3986 Generic URI components.
        ^                                  # anchor to beginning of string
        (?: (?P<scheme> [^:/?#\s]+): )?    # capture optional scheme
        (?://(?P<authority> [^/?#\s]*) )?  # capture optional authority
        (?P<path> [^?#\s]*)                # capture required path
        (?:\?(?P<query> [^#\s]*) )?        # capture optional query
        (?:\#(?P<fragment> [^\s]*) )?      # capture optional fragment
        $                                  # anchor to end of string
        """, re.MULTILINE | re.VERBOSE)
    re_domain = re.compile(r"""
        # Pick out top two levels of DNS domain from authority.
        (?P<domain>[^.]+\.[A-Za-z]{2,6})   # $domain: top two domain levels.
        (?::[0-9]*)?                       # Optional port number.
        $                                  # Anchor to end of string.
        """,
        re.MULTILINE | re.VERBOSE)
    result = ""
    m_uri = re_3986_enhanced.match(url)
    if m_uri and m_uri.group("authority"):
        auth = m_uri.group("authority")
        m_domain = re_domain.search(auth)
        if m_domain and m_domain.group("domain"):
            result = m_domain.group("domain")
    return result
data_list = [
    r"http://abdd.eesfea.domainname.com/b/33tA$/0021/file",
    r"http://mail.domainname.org/abc/abc/aaa",
    r"http://domainname.edu",
    r"http://domainname.com:80",
    r"http://domainname.com?query=one",
    r"http://domainname.com#fragment",
]
cnt = 0
for data in data_list:
    cnt += 1
    print("Data[%d] domain = \"%s\"" %
          (cnt, get_domain(data)))
For more information regarding the picking apart and validation of a URI according to RFC-3986, you may want to take a look at an article I've been working on: Regular Expression URI Validation

In addition to Jase's answer:
If you don't want to use urlparse, just split the URLs.
Strip off the protocol (http:// or https://).
Then split the string at the first occurrence of '/'. This leaves you with something like
'mail.domainname.org' for the second URL. Split that on '.' and select the last two items of the list with [-2:].
This will always yield domainname.org or its equivalent, provided the protocol is stripped correctly and the URLs are valid.
I would just use urlparse, but it can be done.
I don't know about the regex, but this is how I would do it.
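The steps described above could look roughly like this. The function name domain_by_splitting is hypothetical, and the sketch deliberately does nothing more than the steps listed: a port, a query string or a fragment directly after the host (as in http://domainname.com:80) would be left attached to the result.

```python
def domain_by_splitting(url):
    """Last two labels of the host, found with str methods only."""
    # Strip the protocol if one is present.
    for scheme in ("http://", "https://"):
        if url.startswith(scheme):
            url = url[len(scheme):]
            break
    host = url.split("/", 1)[0]            # everything before the first '/'
    return ".".join(host.split(".")[-2:])  # keep the last two labels

print(domain_by_splitting("http://mail.domainname.org/abc/abc/aaa"))
```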

Should you need more flexibility than urlparse provides, here's an example to get you started:
import re
def getDomain(url):
    # requires 'http://' or 'https://'
    # pat = r'(https?):\/\/(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    # 'http://' or 'https://' is optional
    pat = r'((https?):\/\/)?(\w+\.)*(?P<domain>\w+)\.(\w+)(\/.*)?'
    m = re.match(pat, url)
    if m:
        domain = m.group('domain')
        return domain
    else:
        return False
I used the named group (?P<domain>\w+) to grab the match, which is then indexed by its name, m.group('domain').
The great thing about learning regular expressions is that once you are comfortable with them, solving even the most complicated parsing problems is relatively simple. This pattern could be improved to be more or less forgiving if necessary -- this one for example will return '678' if you pass it 'http://123.45.678.90', but should work great on just about any other URL you can come up with. Regexr is a great resource for learning and testing regexes.

Related

Convert vimeo link into an embed link in python

I am using Python and django, and I have some vimeo URLs I need to convert to their embed versions. For example, this:
https://vimeo.com/76979871
has to be converted into this:
https://player.vimeo.com/video/76979871
but it is not being converted.
My code is below:
_vm = re.compile(
    r'/(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/([^\/]*)\/videos\/|album\/(\d+)\/video\/|)(\d+)(?:$|\/|\?)/', re.I)
_vm_format = 'https://player.vimeo.com/video/{0}'
def replace(match):
    groups = match.groups()
    print(_vm_format)
    return _vm_format.format(groups[5])
return _vm.sub(replace, text)

The given regular expression fits several variants of Vimeo URL:
https://vimeo.com/76979871
https://vimeo.com/channels/76979871
https://vimeo.com/groups/sdf/videos/76979871
https://vimeo.com/album/12321/video/76979871
The video number, provided it is really the only thing that you need for your player, will be in capture group 1 (match.group(1), which is groups[0] in the tuple returned by match.groups()) after you slightly correct the regular expression: r'(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/(?:[^\/]*)\/videos\/|album\/(?:\d+)\/video\/|)(\d+)(?:$|\/|\?)'. All other parentheses are non-capturing groups.
If, however, the player code is different for different URL types, then you had better split your regular expression into four, with a different replacement for each.
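A sketch of the single-capture-group variant follows. It is not the answer's exact pattern: the names VIMEO_RE and to_embed are hypothetical, the dot in vimeo.com is escaped, the unneeded backslashes before / are dropped, and the trailing group is swapped for a lookahead so that a following / or ? stays in the text instead of being replaced.

```python
import re

# One capturing group only: the numeric video id.
VIMEO_RE = re.compile(
    r'(?:https?://(?:www\.)?)?vimeo\.com/'
    r'(?:channels/|groups/[^/]*/videos/|album/\d+/video/|)'
    r'(\d+)(?=$|[/?])',
    re.I,
)

def to_embed(text):
    """Replace every recognised Vimeo URL in text with its player URL."""
    return VIMEO_RE.sub(
        lambda m: 'https://player.vimeo.com/video/' + m.group(1), text)

for u in ("https://vimeo.com/76979871",
          "https://vimeo.com/groups/sdf/videos/76979871"):
    print(to_embed(u))
```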

You have to remove the / from both ends of the pattern and use capture group 3 (groups[2]) to get the video id:
(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/([^\/]*)\/videos\/|album\/(\d+)\/video\/|)(\d+)(?:$|\/|\?)
Example
import re
_vm = re.compile(
r'(?:https?:\/\/(?:www\.)?)?vimeo.com\/(?:channels\/|groups\/([^\/]*)\/videos\/|album\/(\d+)\/video\/|)(\d+)(?:$|\/|\?)', re.I)
_vm_format = 'https://player.vimeo.com/video/{0}'
def replace(match):
    groups = match.groups()
    return _vm_format.format(groups[2])
urls=["https://vimeo.com/76979871",
      "https://vimeo.com/channels/76979871",
      "https://vimeo.com/groups/sdf/videos/76979871",
      "https://vimeo.com/album/12321/video/76979871"]
for u in urls:
    print(_vm.sub(replace, u))
Output
https://player.vimeo.com/video/76979871
https://player.vimeo.com/video/76979871
https://player.vimeo.com/video/76979871
https://player.vimeo.com/video/76979871

How can I use a recursive regex or another method to recursively validate this BBcode-like markup in Python?

I am attempting to write a program that validates documents written in a markup language similar to BBcode.
This markup language has both matching ([b]bold[/b] text) and non-matching (today is [date]) tags. Unfortunately, using a different markup language is not an option.
However, my regex is not acting the way I want it to. It seems to always stop at the first matching closing tag instead of identifying that nested tag with the recursive (?R).
I am using the regex module, which supports (?R), and not re.
My questions are:
How can I effectively use a recursive regex to match nested tags without terminating on the first tag?
If there's a better method than a regular expression, what is that method?
Here is the regex once I build it:
\[(b|i|u|h1|h2|h3|large|small|list|table|grid)\](?:((?!\[\/\1\]).)*?|(?R))*\[\/\1\]
Here is a test string that doesn't work as expected:
[large]test1 [large]test2[/large] test3[/large] (it should match this whole string but stops before test3)
Here is the regex on regex101.com: https://regex101.com/r/laJSLZ/1
This test doesn't need to finish in milliseconds or even seconds, but it does need to be able to validate about 100 files of 1,000 to 10,000 characters each in a time that is reasonable for a Travis-CI build.
Here is what the logic using this regex looks like, for context:
import io, regex # https://pypi.org/project/regex/
# All the tags that must have opening and closing tags
matching_tags = 'b', 'i', 'u', 'h1', 'h2', 'h3', 'large', 'small', 'list', 'table', 'grid'
# our first part matches an opening tag:
# \[(b|i|u|h1|h2|h3|large|small|list|table|grid)\]
# our middle part matches the text in the middle, including any properly formed tag sets in between:
# (?:((?!\[\/\1\]).)*?|(?R))*
# our last part matches the closing tag for our first match:
# \[\/\1\]
pattern = r'\[(' + '|'.join(matching_tags) + r')\](?:((?!\[\/\1\]).)*?|(?R))*\[\/\1\]'
# (?R) is only supported by the regex module, so compile with regex, not re
myRegex = regex.compile(pattern)
data = ''
with open('input.txt', 'r') as file:
    data = '[br]'.join(file.readlines())
def validate(text):
    valid = True
    for node in all_nodes(text):
        valid = valid and is_valid(node)
    return valid
# (Only important thing here is that I call this on every node, this
# should work fine but the regex to get me those nodes does not.)
# markup should be valid iff opening and closing tag counts are equal
# in the whole file, in each matching top-level pair of tags, and in
# each child all the way down to the smallest unit (a string that has
# no tags at all)
def is_valid(text):
    valid = True
    for tag in matching_tags:
        valid = valid and text.count(f'[{tag}]') == text.count(f'[/{tag}]')
    return valid
# this returns each child of the text given to it
# this call:
# all_nodes('[b]some [large]text to[/large] validate [i]with [u]regex[/u]![/i] love[/b] to use [b]regex to [i]do stuff[/i][/b]')
# should return a list containing these strings:
# [b]some [large]text to[/large] validate [i]with [u]regex[/u]![/i] love[/b]
# [large]text to[/large]
# [i]with [u]regex[/u]![/i]
# [u]regex[/u]
# [b]regex to [i]do stuff[/i][/b]
# [i]do stuff[/i]
def all_nodes(text):
    result = []  # was never initialised in the original snippet
    matches = myRegex.findall(text)
    if len(matches) > 0:
        for m in matches:
            result += all_nodes(m)
    return result
exit(0 if validate(data) else 1)

Your main issue is within the ((?!\[\/\1\]).)*? tempered greedy token.
First, it is inefficient: you quantified it and then quantified the whole group it is in, which makes the regex engine look for many more ways to match a string and makes the pattern rather fragile.
Second, you only stop at the closing tag and do not restrict the opening tag. The first step is to make the / before \1 optional, \/?, so that the token also stops before an opening [tag]. That alone only covers tags with no attributes; to add attribute support, add an optional group after \1, (?:\s[^]]*)?. It matches an optional sequence of a whitespace character followed by any 0+ chars other than ].
A fixed regex will look like
\[([biu]|h[123]|l(?:arge|ist)|small|table|grid)](?:(?!\[/?\1(?:\s[^]]*)?]).|(?R))*\[/\1]
Do not forget to compile it with regex.DOTALL to match across multiple newlines.
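On the question of a method other than a recursive regex: one common alternative, sketched here as an assumption rather than something proposed in the answer above, is to scan the tags once and keep a stack of open tags. The names TAG_RE and is_balanced are hypothetical, tag names are assumed to be lowercase, and the check is stricter than counting tags because it also rejects wrongly nested pairs.

```python
import re

MATCHING_TAGS = {'b', 'i', 'u', 'h1', 'h2', 'h3',
                 'large', 'small', 'list', 'table', 'grid'}
# An opening or closing tag, optionally with attributes after whitespace.
TAG_RE = re.compile(r'\[(/?)([a-z0-9]+)(?:\s[^\]]*)?\]')

def is_balanced(text):
    """True if every matching tag is closed in properly nested order."""
    stack = []
    for m in TAG_RE.finditer(text):
        closing, name = m.group(1) == '/', m.group(2)
        if name not in MATCHING_TAGS:
            continue  # non-matching tags such as [date] or [br]
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # closing tag without a matching opener
    return not stack  # anything left on the stack was never closed

print(is_balanced('[large]test1 [large]test2[/large] test3[/large]'))
```

This runs in a single pass over the text, so files of 1,000 to 10,000 characters should not be a concern for build time.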

Combine compiled Python regexes

Is there any mechanism in Python for combining compiled regular expressions?
I know it's possible to compile a new expression by extracting the plain-old-string .pattern property from existing pattern objects. But this fails in several ways. For example:
import re
first = re.compile(r"(hello?\s*)")
# one-two-three or one/two/three - but not one-two/three or one/two-three
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
# Incorrect - back-reference \1 would refer to the wrong capturing group now,
# and we get an error "redefinition of group name 'r1' as group 3; was
# group 2 at position 47" for the `(?P)` group.
# Result is also now case-sensitive, unlike 'second' which is IGNORECASE
both = re.compile(first.pattern + second.pattern + second.pattern)
The result I'm looking for is achievable like so in Perl:
$first = qr{(hello?\s*)};
# one-two-three or one/two/three - but not one-two/three or one/two-three
$second = qr{one([-/])two\g{-1}three}i;
$both = qr{$first$second$second};
A test shows the results:
test($second, "...one-two-three..."); # Matches
test($both, "...hello one-two-THREEone-two-three..."); # Matches
test($both, "...hellone/Two/ThreeONE-TWO-THREE..."); # Matches
test($both, "...HELLO one/Two/ThreeONE-TWO-THREE..."); # No match
sub test {
    my ($pat, $str) = @_;
    print $str =~ $pat ? "Matches\n" : "No match\n";
}
Is there a library somewhere that makes this use case possible in Python? Or a built-in feature I'm missing somewhere?
(Note - one very useful feature in the Perl regex above is \g{-1}, which unambiguously refers to the immediately preceding capture group, so that there are no collisions of the type that Python is complaining about when I try to compile the combined expression. I haven't seen that anywhere in Python world, not sure if there's an alternative I haven't thought of.)

Ken, this is an interesting problem. I agree with you that the Perl solution is very slick.
I came up with something, but it is not so elegant. Maybe it gives you some idea to further explore the solution using Python. The idea is to simulate the concatenation using Python re methods.
import re
first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
# named 'text' rather than 'str' to avoid shadowing the built-in
text = "...hello one-two-THREEone/two/three..."
#text = "...hellone/Two/ThreeONE-TWO-THREE..."
if re.search(first, text):
    first_end_pos = re.search(first, text).end()
    if re.match(second, text[first_end_pos:]):
        second_end_pos = re.match(second, text[first_end_pos:]).end() + first_end_pos
        if re.match(second, text[second_end_pos:]):
            print('Matches')
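It works in most cases, but not for the case below, because the first search greedily consumes the "o" that the second pattern needs and there is no backtracking between the separate calls: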
...hellone/Two/ThreeONE-TWO-THREE...
So, yes I admit it is not a complete solution to your problem. Hope this helps though.

I'm not a Perl expert, but it doesn't seem like you're comparing apples to apples. You're using named capture groups in Python, but I don't see any named capture groups in the Perl example. This causes the error you mention, because this
both = re.compile(first.pattern + second.pattern + second.pattern)
tries to create two capture groups named r1.
For example, if you use the regex below, then try to access group_one by name, would you get the numbers before "some text" or after?
# Not actually a valid regex
r'(?P<group_one>[0-9]*)some text(?P<group_one>[0-9]*)'
Solution 1
An easy solution is probably to remove the names from the capture groups. Also add the re.IGNORECASE to both. The code below works, although I'm not sure the resulting regex pattern will match what you want it to match.
first = re.compile(r"(hello?\s*)")
second = re.compile(r"one([-/])two([-/])three", re.IGNORECASE)
both = re.compile(first.pattern + second.pattern + second.pattern, re.IGNORECASE)
Solution 2
What I'd probably do instead is define the separate regular expressions as strings, then you can combine them however you'd like.
pattern1 = r"(hello?\s*)"
pattern2 = r"one([-/])two([-/])three"
first = re.compile(pattern1, re.IGNORECASE)
second = re.compile(pattern2, re.IGNORECASE)
both = re.compile(r"{}{}{}".format(pattern1, pattern2, pattern2), re.IGNORECASE)
Or better yet, for this specific example, don't repeat pattern2 twice, just account for the fact that it'll repeat in the regex:
both = re.compile("{}({}){{2}}".format(pattern1, pattern2), re.IGNORECASE)
which gives you the following regex:
r'(hello?\s*)(one([-/])two([-/])three){2}'
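Another possibility, sketched here under assumptions rather than taken from the answers above: Python 3.6+ supports scoped inline flags such as (?i:...), so case-insensitivity can travel with the sub-pattern, and giving each copy its own group name avoids the redefinition error while keeping the back-reference. The helper names scoped and second_part and the group name sep are hypothetical.

```python
import re

def scoped(pattern, flags=""):
    """Wrap pattern in a non-capturing group with optional scoped flags."""
    return f"(?{flags}:{pattern})"

def second_part(suffix):
    # Each copy gets a uniquely named group, so back-references cannot clash.
    return scoped(rf"one(?P<sep{suffix}>[-/])two(?P=sep{suffix})three", "i")

first = r"(hello?\s*)"  # stays case-sensitive
both = re.compile(first + second_part("1") + second_part("2"))

for s in ("...hello one-two-THREEone-two-three...",
          "...hellone/Two/ThreeONE-TWO-THREE...",
          "...HELLO one/Two/ThreeONE-TWO-THREE..."):
    print("Matches" if both.search(s) else "No match")
```

This still works on pattern strings rather than on compiled objects, so it is a convention for building patterns, not a general way to merge arbitrary compiled regexes.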

RegEx: How to match Prefix + Shared OR Shared + Postfix?

Assume I want to match:
PREFIXsomething
or:
somethingPOSTFIX
But certainly NOT:
PREFIXsomethingPOSTFIX
Where something is a certain shared pattern, and PREFIX/POSTFIX are in reality also certain different patterns.
I thought I could solve this in Python. However, this construct works for 'PREFIXabc' but does not work for 'abcPOSTFIX'. How can I solve this?
import re
prefix_pattern = "PREFIX"
postfix_pattern = "POSTFIX"
shared_pattern = "[a-zA-Z]*"
test_pattern ="("+prefix_pattern+shared_pattern+")|("+shared_pattern+postfix_pattern+")$"
pattern = re.compile(test_pattern)
#test = 'PREFIXabc' # Match
test = 'abcPOSTFIX' # No match
x = re.match(pattern,test)
if x:
    print(x.group())
else:
    print("Not found")

Note that your pattern, when used with re.match, effectively follows the scheme ^(alternative1)|^(alternative2)$. That means that the $ end-of-string anchor only affects the second alternative, so with test = 'PREFIXabc123', PREFIXabc will get matched.
There are two ways to solve it depending on your requirements.
Either remove $, in which case you will also match abcPOSTFIX in test = 'abcPOSTFIX123', or group the two alternatives:
test_pattern=r"(?:{0}{1}|{1}{2})$".format(prefix_pattern, shared_pattern, postfix_pattern)
Then, partial matches won't be found any longer.
And FYI: If the prefix_pattern, shared_pattern and postfix_pattern are literal strings, do not forget to use re.escape().
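A small sketch of the difference, assuming the prefix and postfix are literal strings (so they go through re.escape) while the shared part stays a pattern. The variable names ungrouped and grouped are made up for the comparison.

```python
import re

prefix = "PREFIX"     # literal text, hence re.escape below
postfix = "POSTFIX"   # literal text, hence re.escape below
shared = "[a-zA-Z]*"  # a real pattern, so it is not escaped

# Original shape: $ binds only to the second alternative.
ungrouped = re.compile(
    "(" + re.escape(prefix) + shared + ")|(" + shared + re.escape(postfix) + ")$")
# Grouped shape: $ applies to both alternatives.
grouped = re.compile(
    r"(?:{0}{1}|{1}{2})$".format(re.escape(prefix), shared, re.escape(postfix)))

for text in ("PREFIXabc", "abcPOSTFIX", "PREFIXabc123"):
    print(text, bool(ungrouped.match(text)), bool(grouped.match(text)))
```

Only the last input distinguishes the two: the ungrouped pattern accepts the partial match PREFIXabc, the grouped one rejects the string.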

RegEx to match a term before OR after another specific term

I'm looking for a square-metre term in some text using this regular expression:
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
Works pretty well.
Now, this should only be matched if a string like "Wohnfläche"/"Wohnfl"/"Wfl" appears before OR after it. In other words: the latter term is mandatory, but its position is not.
Writing a regex for this is not the issue in general; my problem is how to write it most elegantly. Currently I only see one approach:
^[.]*[Wohnfläche|Wohnfl|Wfl]([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
and a new search, combined with an 'or' statement (I'm using Python):
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2][.]*[Wohnfläche|Wohnfl|Wfl]$
Ugly, isn't it? ;)

You can use alternation like this:
(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)
And check which capture group matched. It is just not possible to use the restrictive strings optionally in the regex on both sides; they would just be ignored.
See the regex demo
IDEONE demo:
import re
pat = re.compile(r'(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)')
strs = ["12,56m qm Wohnfläche", "14.54 mqm Wohnfl", "Wfl 134 m qm"]
for x in strs:
    m = pat.search(x)
    if m:
        if m.group(1):  # First alternative found a match
            print("{} - {}".format(m.group(1), m.group(2)))
        else:  # Second alternative "won"
            print("{} - {}".format(m.group(3), m.group(4)))

Specify a logical conjunction in the controlling application, like (pseudo-code) <area-regex>.match(string) and <text-regex>.match(string).
This assumes that any pair of matches of the two regexen on the same string will never overlap (if they did, you'd get a false positive). Your regexen meet this requirement.
Note that your regex for the textual context contains the additional restriction that your test string either starts or ends with a match, while in your informal description you just require a match to either occur before or after the area spec. This difference is incorporated in pt vs pt_anchored in the code below.
Python fragment (untested):
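import re
...
# pa: <area_regex>
# pt: <text_regex> (written as an alternation; [Wohnfläche|Wohnfl|Wfl] is a
#     character class and would match any single one of those characters)
# pt_anchored: <text_regex>, anchored
#
pa = re.compile(r'([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]')
pt = re.compile(r'(?:Wohnfläche|Wohnfl|Wfl)')
pt_anchored = re.compile(r'^(?:Wohnfläche|Wohnfl|Wfl)|(?:Wohnfläche|Wohnfl|Wfl)$')
teststring = 'Wfl 134 m qm'  # placeholder input
# search() instead of match(): match() only tries at the start of the string
if pa.search(teststring) and pt.search(teststring):
    print('Match found')
else:
    print('No match')
...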
