Regex (Python) to count elements in domain name - python

I would like to parse a URL and count the number of "elements" in its domain name.
If, for example, I had the URL http://news.bbc.co.uk/foo/bar/xyzzy.html, I would be interested in the number 4 (news, bbc, co, uk).
I have always shunned regular expressions as too cryptic. I would normally do this by splitting the string between // and / and counting dots in between. This time I decided to move away from my comfort zone and boldly try some self-improvement and do this with regular expressions, counting the number of match groups.
This is what I tried:
pattern = r"^.*//(([^./]+\.)+)/.*$"
but this does not match anything. I know there is a problem somewhere there, at least in handling the final part of the domain uk/ (should be counted in but then something else than a dot should be consumed), but still breaking the match group pattern so that parsing enters the tail part.
My idea was to first consume everything until // including //. This part probably works. Then I would start matching groups where a group is anything that is not . or /, repeat until a dot, then consume the dot, until all such groups have been consumed. These would be the match groups I am interested in. Then consume / and deal with the rest as I am not interested in it anymore. This goes wrong.
Or is this a futile attempt to use regex somewhere where it is not suitable?

Assuming consistent input, you can do:
^[^:]+://([^/]+)
^[^:]+ matches one or more characters from start till first :
:// matches the characters literally
([^/]+) the captured group contains one or more characters till next /
You would get e.g. news.bbc.co.uk using the above; then it's a matter of a simple str.split('.').
Note: The obvious one: don't use regex for this; use a proper URL parser library (e.g. urlparse).
Example:
In [49]: s = 'http://news.bbc.co.uk/foo/bar/xyzzy.html'
In [50]: re.search(r'^[^:]+://([^/]+)', s).group(1).split('.')
Out[50]: ['news', 'bbc', 'co', 'uk']
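For completeness, here is a minimal sketch of the parser-based route mentioned in the note. It assumes Python 3, where urlparse lives in urllib.parse; the helper name count_domain_labels is made up for illustration.

```python
from urllib.parse import urlsplit

def count_domain_labels(url):
    """Return the number of dot-separated labels in the URL's host name."""
    host = urlsplit(url).hostname  # None when the URL has no network location
    if not host:
        return 0
    return len(host.split("."))

print(count_domain_labels("http://news.bbc.co.uk/foo/bar/xyzzy.html"))  # 4
```

Unlike the regex, this also copes with ports and user info in the URL, since hostname excludes both.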

You can try this regex :
import re
pattern = r'(?:\/\/)(\w+)|(?<=\.)(\w+)'
string = 'http://news.bbc.co.uk/foo/bar/xyzzy.html'
result = []
match = re.finditer(pattern, string)
for i in match:
    if i.group(1) is not None:
        result.append(i.group(1))
    elif i.group(2) is not None and i.group(2) != 'html':
        result.append(i.group(2))
print(result)
output:
['news', 'bbc', 'co', 'uk']
But the cool thing is that you can do this in one line with tldextract:
import tldextract
result=tldextract.extract("http://news.bbc.co.uk/foo/bar/xyzzy.html")
print([i.split('.') for i in result])
output:
[['news'], ['bbc'], ['co', 'uk']]

Related

Python regex expression example

I have an input that is valid if it has these parts:
a part that starts with letters (upper and lower case), numbers and some of the following characters (!,#,#,$,?)
a part that begins with = and contains only numbers
a part that begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
what is the right regex expression in python to identify if the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently am wrong.
For testing regex I can really recommend regex101. Makes it much easier to understand what your regex is doing and what strings it matches.
Now, for your regex pattern and the example you provided, you need to remove the /+ at the end. Then it matches your example string. However, it splits it into four capture groups and not into three, as I understand you want from your list. To split it into three capture groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches anything, so change that back to your approach in case you don't want to match special characters.
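To check validity rather than just split the string, the three-group pattern can be applied to the whole input. This is only a sketch: the names VALID and is_valid are made up, and the use of re.fullmatch (Python 3.4+) to require that the entire input fits the pattern is an assumption about what "valid" should mean here.

```python
import re

# Three parts: allowed characters, '=' plus digits, '<<' plus anything.
VALID = re.compile(r"([a-zA-Z0-9!#$?]{2,})(=[0-9]+)(<<.*)")

def is_valid(text):
    """True when the whole input consists of the three expected parts."""
    return VALID.fullmatch(text) is not None

m = VALID.fullmatch("!!Hel##lo!#=7<<vbnfhfg")
print(m.groups())  # ('!!Hel##lo!#', '=7', '<<vbnfhfg')
```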

Positive Lookbehind Stripping Out Metacharacters

I need to get the sequence at the end of many urls to label csv files. The approach I have taken gives me the result I want, but I am struggling to understand how I might use a positive lookbehind to capture all the characters after the word 'series' in the url while ignoring any metacharacters? I know I can use re.sub() to delete them, however, I am interested in learning how I can complete the whole process in one regex.
I have searched through many posts on how I might do this, and experimented with lots of different approaches but I haven't been able to figure it out. Mainly with replacing the .+ after the (?<=series\-) with something to negate that - but it hasn't worked.
url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
res = re.search(r"(?<=series\-).+", url).group(0)
re.sub('-', '', res)
Which gives the desired result 'kbw10a'
Is it possible to strip out the metacharacter '-' in the positive lookbehind? Is there a better approach to this without the lookaround?
More examples;
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10',
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a',
'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a',
You cannot "ignore" chars in a lookaround the way you describe, because in order to match a part of a string, the regex engine needs to consume the part, from left to right, matching all subsequent subpatterns in your regex.
The only way to achieve that is through an additional step, removing the hyphens once the match is found. Note that you do not need another regex to remove hyphens; .replace('-', '') will suffice:
url = 'https://yanmarshop.com/en-GB/catalog/all/browse/yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a'
resObj = re.search(r"series-(.+)", url)
if resObj:
    res = resObj.group(1).replace('-', '')
Note it is much safer to first run re.search to get the match data object and then access the .group(), else, when there is no match, you may get an exception.
Also, there is no need of any lookarounds in the pattern, a capturing group will work as well.
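Wrapped up as a small helper, the two-step approach might look like the sketch below. The function name series_label is hypothetical, and returning None when there is no match is just one reasonable choice.

```python
import re

def series_label(url):
    """Return the text after 'series-' with hyphens removed, or None if absent."""
    m = re.search(r"series-(.+)", url)
    if m is None:
        return None
    return m.group(1).replace("-", "")

base = "https://yanmarshop.com/en-GB/catalog/all/browse/"
urls = [
    base + "yanmardata-1014416/yanmar-marine-marine-main-engine-small-qm-series-kbw10",
    base + "yanmardata-1019044/yanmar-marine-marine-main-engine-small-qm-series-kbw-10a",
    base + "yanmardata-1018923/yanmar-marine-marine-main-engine-small-qm-series-kh18-a",
]
print([series_label(u) for u in urls])  # ['kbw10', 'kbw10a', 'kh18a']
```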

Python, regular expression matching digits, x,xxx,xxx but not xx,xx,x,

first time posting, I've lurked for a little while, really excited about the helpful community here.
So, working with "Automate the boring stuff" by Al Sweigart
Doing an exercise that requires I build a regex that finds numbers in standard number format. Three digit, comma, three digits, comma, etc...
So hopefully will match 1,234 and 23,322 and 1,234,567 and 12 but not 1,23,1 or ,,1111, or anything else silly.
I have the following.
import re
testStr = '1,234,343'
matches = []
numComma = re.compile(r'^(\d{1,3})*(,\d{3})*$')
for group in numComma.findall(str(testStr)):
    Num = group
    print(str(Num) + '-') #Printing here to test each loop
    matches.append(str(Num[0]))
#if len(matches) > 0:
#    print(''.join(matches))
Which outputs this....
('1', ',343')-
I'm not sure why the middle ",234" is being skipped over. Something wrong with the regex, I'm sure. Just can't seem to wrap my head around this one.
Any help or explanation would be appreciated.
FOLLOW UP EDIT. So after following all your advice that I could assimilate, I got it to work perfectly for several inputs.
import re
testStr = '1,234,343'
numComma = re.compile(r'^(?:\d{1,3})(?:,\d{3})*$')
Num = numComma.findall(testStr)
print(Num)
gives me....
['1,234,343']
Great! BUT! What about when I change the string input to something like
'1,234,343 and 12,345'
Same code returns....
[]
Grrr... lol, this is fun, I must admit.
So the purpose of the exercise is to be able to eventually scan a block of text and pick out all the numbers in this format. Any insight? I thought this would add an additional tuple, not return an empty one...
FOLLOW UP EDIT:
So, a day later(Been busy with 3 daughters and Honey-do lists), I've finally been able to sit down and examine all the help I've received. Here's what I've come up with, and it appears to work flawlessly. Included comments for my own personal understanding. Thanks again for everything, Blckknght, Saleem, mhawke, and BHustus.
My final code:
import re
testStr = '12,454 So hopefully will match 1,234 and 23,322 and 1,234,567 and 12 but not 1,23,1 or ,,1111, or anything else silly.'
numComma = re.compile(r'''
(?:(?<=^)|(?<=\s)) # Looks behind the Match for start of line and whitespace
((?:\d{1,3}) # Matches on groups of 1-3 numbers.
(?:,\d{3})*) # Matches on groups of 3 numbers preceded by a comma
(?=\s|$)''', re.VERBOSE) # Looks ahead of match for end of line and whitespace
Num = numComma.findall(testStr)
print(Num)
Which returns:
['12,454', '1,234', '23,322', '1,234,567', '12']
Thanks again! I have had such a positive first posting experience here, amazing. =)
The issue is due to the fact you're using a repeated capturing group, (,\d{3})* in your pattern. Python's regex engine will match that against both the thousands and ones groups of your number, but only the last repetition will be captured.
I suspect you want to use non-capturing groups instead. Add ?: to the start of each set of parentheses (I'd also recommend, on general principle, to use a raw string, though you don't have escaping issues in your current pattern):
numComma = re.compile(r'^(?:\d{1,3})(?:,\d{3})*$')
Since there are no groups being captured, re.findall will return the whole matched text, which I think is what you wanted. You can also use re.match or re.search and call the group() method on the returned match object to get the whole matched text.
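A quick side-by-side of the behaviours described above, as a sketch using only the patterns already shown in this thread (the variable names are made up):

```python
import re

test_str = "1,234,343"

# Capturing groups: findall returns one tuple per match, and a repeated
# group only keeps its last repetition (',234' is overwritten by ',343').
capturing = re.findall(r"^(\d{1,3})*(,\d{3})*$", test_str)
print(capturing)  # [('1', ',343')]

# Non-capturing groups: findall returns the whole matched text.
non_capturing = re.findall(r"^(?:\d{1,3})(?:,\d{3})*$", test_str)
print(non_capturing)  # ['1,234,343']

# With re.search, group() is the whole match even if groups are captured.
whole = re.search(r"^(\d{1,3})(,\d{3})*$", test_str).group()
print(whole)  # 1,234,343
```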
The problem is:
A regex match will return a tuple item for each group. However, it is important to distinguish a group from a capture. Since you only have two parenthesis-delimited groups, the matches will always be tuples of two: the first group, and the second. But the second group matches twice.
1: first group, captured
,234: second group, captured
,343: also second group, which means it overwrites ,234.
Unfortunately, it seems that vanilla Python does not have a way to access any captures of a group other than the last one in a manner similar to .NET's regex implementation. However, if you are only interested in getting the specific number, your best bet would be to use re.search(number). If it returns a non-None value, then the input string is a valid number. Otherwise, it is not.
Additionally: note that, as Paul Hankin stated, some inputs match even though they shouldn't, due to the first * following the first capturing group, which lets the initial group match any number of times. Otherwise, your regex is correct.
RESPONSE TO EDIT:
The reason your regex now returns an empty list on '1,234,343 and 12,345' is the ^ and $ anchors in your regex. The ^ anchor, at the start of the regex, says 'this point needs to be at the start of the string'. The $ is its counterpart, saying 'this needs to be at the end of the string'. This is good if you want your entire string from start to end to match the pattern, but if you want to pick out multiple numbers, you should do away with them.
HOWEVER!
If you leave the regex in its current form sans anchors, it will now match the individual elements of 1,23,45 as separate numbers. So for this we need to add a zero-width positive lookahead assertion and say, 'make sure that after this number there is either whitespace or the end of a line'. The tail end, (?=\s|$), is our lookahead assertion: it doesn't capture anything, but just makes sure the criteria are met, in this case whitespace (\s) or (|) the end of a line ($).
BUT: In a similar vein, the previous regex would have matched from 2 onward in "1234,567", giving us the number "234,567", which would be bad. So we use a lookbehind assertion similar to our lookahead at the end: only match if the number is at the beginning of the string or there is whitespace before it. Python's re module requires fixed-width lookbehinds, so this is written as (?:(?<=^)|(?<=\s)) rather than (?<=^|\s). With both assertions in place, the regex should soundly satisfy any non-decimal number related needs.
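The three stages described above can be compared directly. This is a sketch, not the exact pattern from the answer: (?<!\S) and (?!\S) are swapped in as a compact way of saying "preceded by start or whitespace" and "followed by whitespace or end", and the names anchored, bare and guarded are made up.

```python
import re

anchored = re.compile(r"^\d{1,3}(?:,\d{3})*$")           # whole string only
bare = re.compile(r"\d{1,3}(?:,\d{3})*")                 # no boundary checks
guarded = re.compile(r"(?<!\S)\d{1,3}(?:,\d{3})*(?!\S)")  # whitespace-delimited

print(anchored.findall("1,234,343 and 12,345"))  # []
print(bare.findall("1,23,45"))                   # ['1', '23', '45']
print(guarded.findall("1,23,45"))                # []
print(guarded.findall("1234,567"))               # []

text = ("12,454 So hopefully will match 1,234 and 23,322 and 1,234,567 "
        "and 12 but not 1,23,1 or ,,1111, or anything else silly.")
print(guarded.findall(text))
# ['12,454', '1,234', '23,322', '1,234,567', '12']
```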
Try:
import re
p = re.compile(r'(?:(?<=^)|(?<=\s))((?:\d{1,3})(?:,\d{3})*)(?=\s|$)', re.DOTALL)
test_str = """1,234 and 23,322 and 1,234,567 1,234,567,891 200 and 12 but
not 1,23,1 or ,,1111, or anything else silly"""
for m in re.findall(p, test_str):
    print(m)
and its output will be
1,234
23,322
1,234,567
1,234,567,891
200
12
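This regex would match any valid number and would never match an invalid one. Note that Python's re module rejects the variable-width lookbehind (?<=^|\s), so in Python replace it with (?<!\S):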
(?<=^|\s)(?:(?:0|[1-9][0-9]{0,2}(?:,[0-9]{3})*))(?=\s|$)
https://regex101.com/r/dA4yB1/1

Findall vs search for overwriting groups in Python

I found topic Capturing group with findall? but unfortunately it is more basic and covers only groups that do not overwrite themselves.
Please let's take a look at the following example:
S = "abcabc" # string used for all the cases below
1. Findall - no groups
print(re.findall(r"abc", S)) # ['abc', 'abc']
General idea: No groups here so I expect findall to return a list of all matches - please confirm.
In this case: Findall is looking for abc, finds it, returns it, then goes on and finds the second one.
2. Findall - one explicit group
print(re.findall(r"(abc)", S)) # ['abc', 'abc']
General idea: Some groups here so I expect findall to return a list of all groups - please confirm.
In this case: Why two results while there is only one group? I understand it this way:
findall is looking for abc,
finds it,
places it in the group memory buffer,
returns it,
findall starts to look for abc again, and so on...
Is this reasoning correct?
3. Findall - overwriting groups
print(re.findall(r"(abc)+", S)) # ['abc']
This looks similar to the above yet returns only one abc. I understand it this way:
findall is looking for abc,
finds it,
places it in the group memory buffer,
does not return it because the RE itself demands to go on,
finds another abc,
places it in the group memory buffer (overwrites previous abc),
string ends so searching ends as well.
Is this reasoning correct? I am very specific here so if there is anything wrong (even tiny detail) then please let me know.
4. Search - overwriting groups
Search scans through a string looking for a single match, so re.search(r"abc", S) and re.search(r"(abc)", S) rather obviously return only one abc, so let me get right to:
m = re.search(r"(abc)+", S)
print(m.group()) # abcabc
print(m.groups()) # ('abc',)
a) Of course the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite name) for m.group()? And that is why nothing gets overwritten for this method?
In fact, this grouping feature of parentheses is completely unnecessary here - in such cases I just want to use parentheses to stress what needs to be taken together when repeating things without creating any regex groups.
b) Can anyone explain a mechanism behind returning abcabc (in terms of buffers and so on) similarly like I did in bullet 3?
At first, let me state some facts:
A match value (match.group()) is the (sub)text that meets the whole pattern defined in a regular expression. Matches can contain zero or more capture groups.
A capture value (match.group(1..n)) is a part of the match (that can also be equal to the whole match if the whole pattern is enclosed into a capture group) that is matched with a parenthesized pattern part (a part of the pattern enclosed into a pair of unescaped parentheses).
Some languages can provide access to the capture collection, i.e. all the values that were captured with a quantified capture group like (\w{3})+. In Python, it is possible with PyPi regex module, in .NET, with a CaptureCollection, etc.
1: No groups here so I expect findall to return a list of all matches - please confirm.
True. Only if capturing groups are defined in the pattern does re.findall return a list of captured submatches. In the case of abc, re.findall returns a list of matches.
2: Why two results while there is only one group?
There are two matches, re.findall(r"(abc)", S) finds two matches in abcabc, and each match has one submatch, or captured substring, so the resulting array has 2 elements (abc and abc).
3: Is this reasoning correct?
re.findall(r"(abc)+", S) looks for a match of the form abc, abcabc, abcabcabc and so on. It matches the run as a whole and keeps the last abc in the capture group 1 buffer. So I think your reasoning is correct. "The RE itself demands to go on" can be stated more precisely: the matching is not yet complete, because there are still characters for the regex engine to test for a match.
4: the whole match is abcabc, but we still have groups here, so can I conclude that groups are irrelevant (despite name) for m.group()?
No, the last group value is kept in this case. If you change your regex to (\w{3})+ and the string to abcedf, you will see the difference: m.group() is still the whole match abcedf, but m.group(1) is edf. As for "And that is why nothing gets overwritten for this method?": that is not right; each earlier capture group value is overwritten by the following one.
5: Can anyone explain a mechanism behind returning abcabc (in terms of buffers and so on) similarly like I did in bullet 3?
The re.search(r"(abc)+", S) will match abcabc (match, not capture) because
abcabc is searched for abc from left to right. RE finds abc at the start and tries to find another abc right from the location after the first c. RE puts the abc into Capture group buffer 1.
RE finds the 2nd abc, rewrites the capture group #1 buffer with it. Tries to find another abc.
No more abc is found - return the matched value found : abcabc.
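The points above can be checked in a few lines. This is a sketch for Python 3 using only the standard re module; the (?:...) non-capturing form is added here because it addresses the wish in the question to use parentheses purely for repetition.

```python
import re

S = "abcabc"

print(re.findall(r"abc", S))       # ['abc', 'abc']  two matches, no groups
print(re.findall(r"(abc)", S))     # ['abc', 'abc']  two matches, one group each
print(re.findall(r"(abc)+", S))    # ['abc']         one match, last capture kept
print(re.findall(r"(?:abc)+", S))  # ['abcabc']      one match, no capture

# The overwrite is easier to see when the repetitions differ.
m = re.search(r"(\w{3})+", "abcedf")
print(m.group())   # abcedf  (the whole match)
print(m.group(1))  # edf     (only the last repetition survives)
```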

Regular expression capturing entire match consisting of repeated groups

I've looked through the forums but could not find exactly how to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've come up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not group the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than a last digit of the last group (3 in this case).
Any help would be greatly appreciated.
First, you need a non-greedy .*?.
Second, you don't need to escape some chars in [ ].
Third, you might just consider it as a sequence of digits AND some allowed characters? Why there is a \d{1,3} but a 4454?
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'
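As a self-contained sketch of the same idea, the pattern can be wrapped in a helper that also handles input with no match. The names NUMBER and extract_number are made up for illustration; the character class is the one from the answer above.

```python
import re

# Non-greedy .*? skips any garbage after 'UDK'; the capturing group then
# takes the longest run of digits and allowed separator characters.
NUMBER = re.compile(r'UDK.*?([\d.,()\[\]=\':"+/-]+)')

def extract_number(text):
    """Return the number following 'UDK', or None when nothing matches."""
    m = NUMBER.search(text)
    return m.group(1) if m else None

print(extract_number("UDK .636.32/38.082.4454.2(575.3)"))
# .636.32/38.082.4454.2(575.3)
```

Because the whole number is matched by a single character class with +, the group is not repeated and so nothing gets overwritten.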
Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')
