Regex - so close, yet so far away [duplicate] - python

This question already has answers here:
Regular expression to find URLs within a string
(35 answers)
What is the best regular expression to check if a string is a valid URL?
(62 answers)
Closed 4 years ago.
Here is my current regex: (?:ht|f)tps?:[\S]*\/?(?:\w+)
I need to refine it such that it pulls the following link correctly from the quoted text below: http://www.purdue.edu/transcom/index.php
Any thoughts on how I can improve my current regex? Thanks in advance!
Additional information about the experimental protocol and results is
provided in the companion files and the TransCom project web site
(http://www.purdue.edu/transcom/index.php).The results of the Level 1
experiments presented here are grouped into two broad categories

I have not tested your regex thoroughly, and it is not clear from the question how exactly your current regex is failing.
But to catch a URL in general, I would repeat a group made of the characters allowed in a URL minus the slash (something like [a-zA-Z0-9.]) followed by an optional slash,
something like
r'(?:ht|f)tps?://(?:[a-zA-Z0-9.]+/?)*'
and possibly a positive lookahead assertion if the URL is always inside quotes or parentheses...
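Running the question's pattern against the quoted text shows what goes wrong: [\S]* is greedy and swallows every non-whitespace character, so the match runs past the closing parenthesis into ").The". A minimal sketch of the "allowed characters plus optional slash" idea follows. The character class [A-Za-z0-9.-] and the name find_urls are my own assumptions for illustration, not a complete URL grammar:

```python
import re

# Quoted text from the question (note: no space after the closing parenthesis).
TEXT = (
    "provided in the companion files and the TransCom project web site\n"
    "(http://www.purdue.edu/transcom/index.php).The results of the Level 1\n"
)

# Pattern from the question: [\S]* consumes everything up to the next whitespace.
ORIGINAL_RE = re.compile(r'(?:ht|f)tps?:[\S]*\/?(?:\w+)')

# Sketch: scheme, then repeated runs of allowed characters, each optionally
# followed by a slash. ')' is not in the class, so the match stops there.
URL_RE = re.compile(r'(?:ht|f)tps?://(?:[A-Za-z0-9.-]+/?)*')

def find_urls(text):
    """Return every substring of text matched by URL_RE."""
    return URL_RE.findall(text)

print(ORIGINAL_RE.findall(TEXT))  # ['http://www.purdue.edu/transcom/index.php).The']
print(find_urls(TEXT))            # ['http://www.purdue.edu/transcom/index.php']
```

URLs with query strings or other punctuation would need a wider character class; this one only covers the example in the question.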

Url Similar Splitter
matches url similars and splits it into its address and parameters
by deme72
([--:\w?@%&+~#=]*\.[a-z]{2,4}\/{0,2})((?:[?&](?:\w+)=(?:\w+))+|[--:\w?@%&+~#=]+)?
Source: regexr.com community

Related

URL Regex for Python [duplicate]

This question already has answers here:
Regex match everything after question mark?
(7 answers)
Closed 12 months ago.
I am trying to compare webpage URLs using regex. I am using the below method.
regex_url = r'https://www.website.com/books/\w{8}$'
is_read = re.match(regex_url, request.url) is not None
if not is_read:
    add_to_read(token)
Everything works well for the above regex. But there is a new URL pattern now for which I can't seem to get the regex right.
The new URL pattern is
https://www.website.com/books/Ab7us83xI?varient=web
9 characters followed by a question mark and then the word 'varient' and then '=web'. Can anyone help me get the correct regex for this?
Only the first 9 characters change every time. Apologies if this is a stupid question.
Many thanks.
Is this what you need?
https://www.website.com/books/\w{9}\?varient=web$
\w{9} - match 9 characters
\? - match question mark
varient=web - match varient=web
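If both the old and the new URL shapes should be accepted by one check, the two tails can be combined with an alternation. A minimal sketch, assuming the dots in the host name are meant literally (escaping them is my addition) and using a hypothetical helper name is_book_url; re.fullmatch stands in for the trailing $:

```python
import re

# Old shape: 8 word characters. New shape: 9 word characters + "?varient=web".
BOOK_URL_RE = re.compile(
    r'https://www\.website\.com/books/'
    r'(?:\w{8}|\w{9}\?varient=web)'
)

def is_book_url(url):
    """True if the whole URL matches either the old or the new pattern."""
    return BOOK_URL_RE.fullmatch(url) is not None

print(is_book_url('https://www.website.com/books/Ab7us83x'))               # True
print(is_book_url('https://www.website.com/books/Ab7us83xI?varient=web'))  # True
print(is_book_url('https://www.website.com/books/Ab7us83xI?varient=app'))  # False
```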

Email Regex Validation fails in python [duplicate]

This question already has answers here:
How can I validate an email address using a regular expression?
(79 answers)
Closed 2 years ago.
I am using Python to extract emails from the web with the re library. It does its job, but it also extracts links that match the pattern. For example:
/images/paramproofs/services/pgp/logo_black_16@2x.png
/images/paramproofs/services/twitter/logo_black_16@2x.png
/images/paramproofs/services/github/logo_black_16@2x.png
/images/paramproofs/services/reddit/logo_black_16@2x.png
/images/paramproofs/services/web/logo_black_16@2x.png
/images/paramproofs/services/web/logo_black_16@2x.png
/images/paramproofs/services/stellar/logo_black_16@2x.png
/images/badges/install-badge-windows-168-56@2x.png
/images/badges/install-badge-windows-168-56@3x.png
This is the pattern I use:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[ a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
I don't know where you took that regex from, but according to emailregex.com this should suffice for almost all cases (including yours):
(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)
The line anchors (^ for the beginning of the line and $ for the end) are the key here.
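Because of the anchors, this pattern validates a whole string and will not find an address in the middle of a page. One way to use it for extraction is to split the text into whitespace-separated tokens and validate each token. A minimal sketch, assuming whitespace splitting is good enough for the input; extract_emails is a hypothetical helper name:

```python
import re

# Anchored pattern from the answer: the entire token must look like an address.
EMAIL_RE = re.compile(r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$')

def extract_emails(text):
    """Return the whitespace-separated tokens that match EMAIL_RE in full."""
    return [token for token in text.split() if EMAIL_RE.match(token)]

sample = (
    "Contact alice@example.com or see "
    "/images/badges/install-badge-windows-168-56@2x.png"
)
print(extract_emails(sample))  # ['alice@example.com']
```

The image path is rejected because it starts with "/", which the local-part character class does not allow. A bare file name such as logo@2x.png would still pass, so an extra check on the top-level domain may be needed.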

Python finding variable characters in a string with RE? [duplicate]

This question already has answers here:
How to grab number after word in python
(4 answers)
What special characters must be escaped in regular expressions?
(13 answers)
Closed 2 years ago.
Hey I need to search for variable data in a console from a page source
The data will be shown like this:
"data":[13,17]
It will vary a lot with the amount of units inside the table. I have tried out several RE expressions, but the closest I have come to a result, is with a fixed amount of units.
self.driver.get("website.com")
apidata = self.driver.page_source
print(apidata)
datasetbasic = re.search('"data":[[0-99,0-99]+', apidata)
print(datasetbasic)
Instead of having it as a fixed amount, how do I capture anything that is inside the data table?
Before you ask, I cannot use xpath or any other selenium calls to capture this data directly from the webpage (I think), because the element is from a graph, where the data is only visible in the actual console.
Any help is appreciated
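One possible approach, as a sketch only: escape the literal brackets, capture everything between them, and split the captured text on commas. This assumes the values are non-negative integers, as in the example; extract_data is a hypothetical helper name:

```python
import re

# \[ and \] match the literal brackets; the group captures digits, commas
# and optional whitespace between them, however many values there are.
DATA_RE = re.compile(r'"data":\[([\d,\s]*)\]')

def extract_data(page_source):
    """Return the integers inside the first "data":[...] or None if absent."""
    match = DATA_RE.search(page_source)
    if match is None:
        return None
    return [int(item) for item in match.group(1).split(',') if item.strip()]

print(extract_data('{"label":"x","data":[13,17]}'))  # [13, 17]
print(extract_data('"data":[1, 2, 3, 44]'))          # [1, 2, 3, 44]
```

The attempt in the question fails because unescaped [ ] start a character class, and 0-99 inside a class means "0 through 9, or 9", not a numeric range.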

construct a Python regular expression where there is a sharing of prefix [duplicate]

This question already has answers here:
re.findall behaves weird
(3 answers)
Closed 3 years ago.
I want to achieve the following:
Say I have two regexes, regex1 and regex2. I want to construct a new regex equivalent to 'prefix_regex1 | prefix_regex2'. What syntax should I use to share the prefix? I tried 'prefix_(regex1|regex2)', but it's not working; I think the parentheses are being treated as a group rather than just raising the precedence of |.
example:
I have two strings that both should match the pattern:
prefix_123
prefix_abc
I wrote this pattern: prefix_(\d*|\D*) that tries to capture both cases, but when I run it against prefix_abc it's only matching prefix_, not the entire string.
This site might help with this problem (and others). It lets you tinker with the regex and see the result both graphically and in code: https://www.debuggex.com/
For example, I changed your regex to this: prefix_(\d+|\D+), which requires one or more digits or non-digits after "prefix_". Not sure if that's what you are looking for, but it's easy to experiment with the site I shared above.
Hope it helps.
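The behaviour can be checked directly in Python. With \d*, the first alternative is allowed to match the empty string, so it succeeds immediately after "prefix_" and the second alternative is never tried. A small sketch; match_prefixed is a hypothetical helper name, and the non-capturing group (?:...) is my addition for use with re.findall:

```python
import re

def match_prefixed(pattern, text):
    """Return the text matched at the start of the string, or None."""
    m = re.match(pattern, text)
    return m.group(0) if m else None

# \d* matches zero digits, so the match stops right after the prefix.
print(match_prefixed(r'prefix_(\d*|\D*)', 'prefix_abc'))    # prefix_

# With +, \d+ fails on "abc" and the regex falls through to \D+.
print(match_prefixed(r'prefix_(?:\d+|\D+)', 'prefix_abc'))  # prefix_abc
print(match_prefixed(r'prefix_(?:\d+|\D+)', 'prefix_123'))  # prefix_123

# re.findall returns only the group when the pattern has a capturing group.
print(re.findall(r'prefix_(\d+|\D+)', 'prefix_abc'))        # ['abc']
print(re.findall(r'prefix_(?:\d+|\D+)', 'prefix_abc'))      # ['prefix_abc']
```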

Regex for matching email addresses [duplicate]

This question already has an answer here:
Restricting character length in a regular expression
(1 answer)
Closed 4 years ago.
So I have a regex that goes like:
regex1 = re.compile(r'\S+@\S+')
This works perfectly, but I am trying to add a character limit so that the total number of characters has to be less than 20.
I tried re.compile(r'\S+@\S+{5,20}') but it keeps giving me an error. Seems like a simple fix, but I can't see what I am doing wrong.
You can't stack a counted quantifier on top of the + quantifier (i.e., \S+{5,20} is not a valid pattern). If you're doing this in Python, I'd suggest just using the len(...) function on the string in addition to the regex to verify. For example:
if regex1.match(email) and (len(email) < 20):
    ...
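A runnable version of the same idea, as a sketch. The helper name is_short_email and the max_len parameter are my own, and re.fullmatch is used so the whole string has to fit the pattern, which is slightly stricter than match in the snippet above:

```python
import re

EMAIL_LIKE_RE = re.compile(r'\S+@\S+')

def is_short_email(candidate, max_len=20):
    """True if candidate looks like x@y and is shorter than max_len characters."""
    return (
        EMAIL_LIKE_RE.fullmatch(candidate) is not None
        and len(candidate) < max_len
    )

print(is_short_email('a@b.co'))                    # True
print(is_short_email('a' * 20 + '@example.com'))   # False (too long)
print(is_short_email('no-at-sign'))                # False (no @)
```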
