Regex to add conditional values in Python

I am working with regex in Python and trying to write a pattern so that if the URL starts with https it must contain www3, and if it starts with http it must contain www. My solution works for https, but for http the match does not include the http prefix. Can anybody help me correct this?
import re
st='''
https://www3.yahoo.com
http://www.yahoo.com
'''
p=re.compile(r'(https)?://(?(1)www3|www)\.\w+\.\w+')

It would seem the simplest solution is just to write out both alternatives:
import re
st = '''
https://www3.yahoo.com
https://www.yahoo.com
http://www3.yahoo.com
http://www.yahoo.com
'''
p = re.compile(r'http(?:s://www3|://www)\.\w+\.\w+')
p.findall(st)
Output:
['https://www3.yahoo.com', 'http://www.yahoo.com']

A simple solution that works:
import re
re.findall(r'(http(?P<s>s)?://www(?(s)3|)\..*)', """
https://www3.yahoo.com
http://www.yahoo.com
http://www3.yahoo.com
https://www34.yahoo.com
""")
[('https://www3.yahoo.com', 's'), ('http://www.yahoo.com', '')]
Explanation
(?P<s>s): (?P<name>...) gives the group a name.
(?(s)...): (?(<id|name>)...) tests whether the referenced group matched earlier in the pattern.
(?(s)3|): (?(<id|name>)yes-pattern|no-pattern) uses the yes-pattern if the group matched and the no-pattern (here empty) otherwise.
Advice
A numeric group id such as (1) is fragile: you have to keep track of the group order and work out the index yourself, which is an easy source of errors.
A named group (name) is a good way to avoid that problem.
Reference
docs.python/re

For the conditional to work, you have to make only the s char optional
http(s)?://(?(1)www3|www)\.\w+\.\w+
Note that \.\w+\.\w+ only matches a limited set of URLs. For a broader match, you could use \S+ to match a run of non-whitespace characters:
http(s)?://(?(1)www3|www)\.\S+
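To see the difference, here is a minimal sketch that runs the pattern from the question and the corrected one against the same sample string. The variable names (broken, fixed and so on) are only for illustration.

```python
import re

st = '''
https://www3.yahoo.com
http://www.yahoo.com
'''

# Pattern from the question: the whole word "https" is optional, so for a
# plain http URL the match can only begin at "://".
broken = re.compile(r'(https)?://(?(1)www3|www)\.\w+\.\w+')

# Corrected pattern: "http" is mandatory and only the "s" is optional.
fixed = re.compile(r'http(s)?://(?(1)www3|www)\.\w+\.\w+')

# finditer + group(0) gives the full matched text instead of the group content.
broken_matches = [m.group(0) for m in broken.finditer(st)]
fixed_matches = [m.group(0) for m in fixed.finditer(st)]

print(broken_matches)
print(fixed_matches)
```

With the original pattern the second match loses its http prefix, which is the behaviour described in the question.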

Related

Regex in python for validating mail

I am learning regex for validating an email id, for which the regex I made is the following:
regex = r'^([a-zA-Z0-9_\\-\\.]+)#([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,})$'
Here someone#...com is showing as valid but it should not, how can fix this?
I would recommend the regular expression suggested on this site, which correctly reports the email someone#...com as invalid. I quickly wrote up an example using their suggestion below. Happy coding!
>>> import re
>>> email = "someone#...com"
>>> regex = re.compile(r"(^[a-zA-Z0-9_.+-]+#[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)")
>>> print(re.match(regex, email))
None
The reason it matches someone#...com is that the dot is inside the character class in #([a-zA-Z0-9_\\-\\.]+), which is repeated one or more times, so the class can also match the run of dots in ...
What you can do is place the dot after the character class, and use that whole part in a repeating group.
If you put the - at the end you don't have to escape it.
Note that that character class at the start also has a dot.
^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$
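As a quick check, the pattern above can be run against a few sample addresses. This is only a sketch: the sample strings are made up, and '#' is kept as the separator exactly as it appears in the patterns in this thread.

```python
import re

# The dot now sits after the character class, inside a repeated group,
# so every dot must be preceded by at least one allowed character.
pattern = re.compile(r'^[a-zA-Z0-9_.-]+#(?:[a-zA-Z0-9_-]+\.)+([a-zA-Z]{2,})$')

samples = [
    "someone#example.com",
    "someone#mail.example.org",
    "someone#...com",
    "someone#example",
]

# Map each sample to True (accepted) or False (rejected).
results = {s: bool(pattern.match(s)) for s in samples}

for sample, ok in results.items():
    print(sample, "valid" if ok else "invalid")
```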

REGEX: Negative lookbehind with multiple whitespaces [duplicate]

I am trying to use lookbehinds in a regular expression and it doesn't seem to work as I expected. So, this is not my real usage, but to simplify I will put an example. Imagine I want to match "example" on a string that says "this is an example". So, according to my understanding of lookbehinds this should work:
(?<=this\sis\san\s*?)example
What this should do is find "this is an", then space characters and finally match the word "example". Now, it doesn't work and I don't understand why, is it impossible to use '+' or '*' inside lookbehinds?
I also tried those two and they work correctly, but don't fulfill my needs:
(?<=this\sis\san\s)example
this\sis\san\s*?example
I am using this site to test my regular expressions: http://gskinner.com/RegExr/
Many regular expression libraries only allow restricted expressions in lookbehind assertions, for example:
only alternatives that all have the same fixed length: (?<=foo|bar|\s,\s) (three characters each)
only alternatives that each have a fixed length: (?<=foobar|\r\n) (each branch with its own fixed length)
only expressions with an upper bound on their length: (?<=\s{,4}) (up to four repetitions)
The main reason for these limitations is that those libraries cannot process regular expressions backwards at all, or can do so only for a limited subset.
Another reason could be to keep authors from building overly complex regular expressions that are expensive to process because of so-called pathological behavior (see also ReDoS).
See also section about limitations of look-behind assertions on Regular-Expressions.info.
If you are not using Python's re module, which does not support \K, you can work around the lack of variable-length lookbehind by discarding what has been matched so far and starting the match over with \K.
This site explains it well: http://www.phpfreaks.com/blog/pcre-regex-spotlight-k
In short, when part of the expression has matched and you only want what comes after it, \K forces the reported match to start over from that point.
Example:
string = '<a this is a tag> with some information <div this is another tag > LOOK FOR ME </div>'
Matching /(\<a).+?(\<div).+?(\>)\K.+?(?=\<\/div)/ makes the match restart after the > that closes the opening div tag, so nothing before that point is included in the result. The lookahead (?=\<\/div) makes the engine take everything up to the closing div tag.
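What Amber said is true, but you can work around it with another approach: keep the lookbehind fixed-length and move the whitespace into a non-capturing group after it.
(?<=this\sis\san)(?:\s*)example
That makes the lookbehind fixed-length, so it should work. Note that the whitespace then becomes part of the match.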
You can use sub-expressions.
(this\sis\san\s*?)(example)
Then retrieve group 2, "example": use $2 in most regex flavors, or \2 in a replacement string (as with Python's re.sub).
Most regex engines don't support variable-length expressions for lookbehind assertions.
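For Python specifically, here is a small sketch of what happens with the pattern from the question and how two of the workarounds mentioned above behave. The sample text (with three spaces before "example") is an assumption chosen for illustration.

```python
import re

text = "this is an   example"

# Python's re module rejects a variable-width lookbehind at compile time.
try:
    re.compile(r'(?<=this\sis\san\s*?)example')
    compiled = True
except re.error as exc:
    compiled = False
    print("re.error:", exc)

# Workaround 1: fixed-width lookbehind, whitespace matched outside of it.
# The reported match then includes the whitespace.
m1 = re.search(r'(?<=this\sis\san)\s*example', text)
print(repr(m1.group(0)))

# Workaround 2: no lookbehind at all, capture only the wanted word.
m2 = re.search(r'this\sis\san\s*(example)', text)
print(m2.group(1), m2.start(1))
```

The second workaround is usually the simplest when only the position or text of "example" is needed, since the capture group excludes the prefix and the whitespace.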

Improving accuracy/brevity of regex for inconsistent url filtering

So, for some lulz, a friend and I were playing with the idea of filtering a list (100k+) of urls to retrieve only the parent domain (ex. "domain.com|org|etc"). The only caveat is that they are not all nice and matching in format.
So, to explain, some may be "http://www.domain.com/urlstuff", some have country codes like "www.domain.co.uk/urlstuff", while others can be a bit more odd, more akin to "hello.in.con.sistent.urls.com/urlstuff".
So, story aside, I have a regex that works:
import re
firsturl = 'www.foobar.com/fizz/buzz'
m = re.search(r'\w+(?=(\..{3}/|\..{2}\..{2}/))\.(.{3}|.{2}\..{2})', firsturl)
m.group(0)
which returns:
foobar.com
It looks up the first "/" at the end of the url, then returns the two "." separated fields before it.
So, my query, would anyone in the stack hive mind have any wisdom to shed on how this could be done with better/shorter regex, or regex that doesn't rely on a forward lookup of the "/" within the string?
Appreciation for all of the help in this!
I do think that regex is just the right tool for this. Regex is pattern matching, which is put to best use when you have a known pattern that might have several variations, as in this case.
In your explanation of and attempted solution to the problem, I think you are greatly oversimplifying it, though. TLDs come in many more flavors than two-letter country codes and three-letter others. See ICANN's list of top-level domains for the hundreds currently available, with lengths from 2 letters and up. Also, you may have URLs without any slashes and some with multiple slashes and dots after the domain name.
So here's my solution (see on regex101):
^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})
What you want is captured in the first matching group.
Breakdown:
^(?:https?://)? matches a possible protocol at the beginning
(?:[^/]+\.)* matches possible multiple non-slash sequences, each followed by a dot
([^/]+\.[a-z]{2,}) matches (and captures) one final non-slash sequence followed by a dot and the TLD (2+ letters)
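Here is a minimal sketch of how this pattern could be applied in Python to the example URLs from the question. The names domain_re and extracted are made up for illustration.

```python
import re

# Pattern from this answer; group 1 holds the last two dot-separated labels.
domain_re = re.compile(r'^(?:https?://)?(?:[^/]+\.)*([^/]+\.[a-z]{2,})')

urls = [
    'www.foobar.com/fizz/buzz',
    'http://www.domain.com/urlstuff',
    'hello.in.con.sistent.urls.com/urlstuff',
    'www.domain.co.uk/urlstuff',
]

extracted = {}
for url in urls:
    m = domain_re.match(url)
    # Store None when the URL does not match at all.
    extracted[url] = m.group(1) if m else None
    print(url, '->', extracted[url])
```

Note that for www.domain.co.uk the capture is co.uk rather than domain.co.uk, because the pattern always keeps the last two labels. Handling such suffixes would need extra knowledge about which suffixes are registrable, for example a public suffix list.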
You can use this regex instead:
import re
firsturl = 'www.foobar.com/fizz/buzz'
domain = re.match(r"(.+?)/", firsturl).group(1)
Notice, though, that this only works without 'http://', and that it returns the whole host (www.foobar.com) rather than just the parent domain.

Python, applying regex negative lookahead recursively

In Python, I am trying to implement a user-defined regex by parsing it into a custom regex. This custom regex is then applied to a space-separated string. The idea is to apply the user regex to the second column without using a for loop.
Stream //streams/sys_util mainline none 'sys_util'
Stream //streams/gta mainline none 'gta'
Stream //streams/gta_client development //streams/gta_cdevelop 'gta_client'
Stream //streams/gta_develop development //streams/gta 'gta_develop'
Stream //streams/gta_infrastructure development //streams/gta 'gta_infrastructure'
Stream //streams/gta_server development //streams/gta_cdevelop 'gta_server'
Stream //streams/0222_ImplAlig1.0 task none '0222_ImplAlig1.0'
Stream //streams/0377_kzo_the_wart task //streams/applications_int '0377_tta'
Expected output should be
//streams/gta
//streams/gta_client
//streams/gta_develop
//streams/gta_infrastructure
//streams/gta_server
Here is my code:
import re
mystring = "..."
match_rgx = r'Stream\s(\/\/streams\/gta.*)(?!\s)'
result = re.findall(match_rgx, mystring, re.M)
NOTE: The expression inside first parenthesis can not be changed (as it is parsed from user input) so \/\/streams\/gta.* must remain as it is.
How can I improve the negative lookahead to get the desired results?
You can use:
match_rgx = r'Stream\s(//streams/gta.*?)\s'
result = re.findall(match_rgx, mystring)
By default, the * operator is greedy, so it tries to capture as much text as possible (for example, without the ? it would match "//streams/gta mainline none"). But you only want the second column, so adding ? makes the operator non-greedy: it stops at the shortest possible match, here at the first occurrence of \s ("//streams/gta").
I hope this is clear; take a look at the docs (https://docs.python.org/2/library/re.html#contents-of-module-re) if it's not.
By the way, you don't have to escape the /; it is not a special character.
And the re.M flag is useless if you don't use ^ or $.
Edit: Since your edit, if you don't want to capture development, some of this information no longer applies.
Edit 2: I didn't see that you don't want to change the pattern. In that case, just do:
match_rgx = r'Stream\s(\/\/streams\/gta.*?)\s'
Edit3: See comment.
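The greedy versus non-greedy difference can be seen on a few of the lines from the question. This is only a sketch: the sample is shortened to four lines, and the names greedy and lazy are made up for illustration.

```python
import re

mystring = """Stream //streams/sys_util mainline none 'sys_util'
Stream //streams/gta mainline none 'gta'
Stream //streams/gta_client development //streams/gta_cdevelop 'gta_client'
Stream //streams/0222_ImplAlig1.0 task none '0222_ImplAlig1.0'"""

# Greedy: .* runs to the end of the line, and the trailing \s then matches
# the newline, so the whole rest of the line ends up in the group.
greedy = re.findall(r'Stream\s(\/\/streams\/gta.*)\s', mystring)

# Non-greedy: .*? stops at the first whitespace, leaving only the 2nd column.
lazy = re.findall(r'Stream\s(\/\/streams\/gta.*?)\s', mystring)

print(greedy)
print(lazy)
```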
Tested on https://regex101.com/, this should work for the 2nd column of every line:
(?:\w+\s([^\s]+)\s.*[\r\n]*)
And this for the 2nd column of the GTA lines only:
(?:\w+\s(\/\/streams\/gta[^\s]*)\s.*[\r\n]*)
For one line it would be just like (2nd col):
\w+\s([^\s]+)\s.*
Gta only for 1 line:
\w+\s(\/\/streams\/gta[^\s]*)\s.*
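As a sketch of how the single-line GTA pattern behaves with re.findall on the full sample from the question (the variable names are assumptions made for illustration):

```python
import re

mystring = """Stream //streams/sys_util mainline none 'sys_util'
Stream //streams/gta mainline none 'gta'
Stream //streams/gta_client development //streams/gta_cdevelop 'gta_client'
Stream //streams/gta_develop development //streams/gta 'gta_develop'
Stream //streams/gta_infrastructure development //streams/gta 'gta_infrastructure'
Stream //streams/gta_server development //streams/gta_cdevelop 'gta_server'
Stream //streams/0222_ImplAlig1.0 task none '0222_ImplAlig1.0'
Stream //streams/0377_kzo_the_wart task //streams/applications_int '0377_tta'"""

# [^\s]* keeps the capture inside the 2nd column; the trailing .* consumes
# the rest of the line so later columns cannot start a new match.
gta_rgx = r'\w+\s(\/\/streams\/gta[^\s]*)\s.*'
second_columns = re.findall(gta_rgx, mystring)

for column in second_columns:
    print(column)
```

Because . does not match a newline by default, each match stays within one line, so no flags are needed here.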

Tuning of a Web handler regex routes configuration

In a web handler routes configuration I have the following Regex:
('/post/(\w+)/.*', foo.app.WebHandlerFooClass)
This regex matches URLs like these:
/post/HUIHUIGgS823SHUIH/this-is-the-slug
/post/HUIHUIGgS823SHUIH/
passing the correct HUIHUIGgS823SHUIH Id parameter to the web handler matched by the (\w+) group.
How could I modify the above Regex to match also this url?
/post/HUIHUIGgS823SHUIH
The handler is coded to accept just one parameter, the base64 Id, so there should be just one group in the Regex that matches the Id.
So, these are the urls that should be matched:
/post/HUIHUIGgS823SHUIH/this-is-the-slug
/post/HUIHUIGgS823SHUIH/
/post/HUIHUIGgS823SHUIH <-- Hey, I wanna this too
'/post/([\w-]+)(?:/([\w-]+))?/?'
This matches the following.
/post/HUIHUIGgS823SHUIH/this-is-the-slug
/post/HUIHUIGgS823SHUIH/this-is-the-slug/
/post/HUIHUIGgS823SHUIH/
/post/HUIHUIGgS823SHUIH
I think this is a better implementation because it captures only the pieces you want, e.g. the slug doesn't capture a trailing /. However, your spec is still slightly unclear to me, so this may not be your intention.
If you don't care about the data at the end, then why not just use this?
'/post/(\w+).*'
Otherwise you'll have to provide more info.
I think you just want:
'/post/([^/]+).*'
But that seems too simple an answer :)
If I guessed your real intention correctly, then you are fine with this one:
'/post/(\w+)'
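To compare the original route with one of the relaxed suggestions, here is a small sketch. It assumes the framework matches the route regex against the entire path, which is emulated here with re.fullmatch; the helper name captured_id is made up for illustration.

```python
import re

urls = [
    '/post/HUIHUIGgS823SHUIH/this-is-the-slug',
    '/post/HUIHUIGgS823SHUIH/',
    '/post/HUIHUIGgS823SHUIH',
]

original = re.compile(r'/post/(\w+)/.*')   # requires a slash after the Id
relaxed = re.compile(r'/post/(\w+).*')     # anything, or nothing, after the Id


def captured_id(pattern, url):
    """Return the single captured Id, or None if the whole URL does not match."""
    m = pattern.fullmatch(url)
    return m.group(1) if m else None


original_ids = [captured_id(original, u) for u in urls]
relaxed_ids = [captured_id(relaxed, u) for u in urls]

print(original_ids)
print(relaxed_ids)
```

The relaxed pattern still has exactly one capturing group, so a handler that accepts a single Id parameter would not need to change.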
