Could anyone help with a regex?
I have a URL like
"http://example.com/ru/path/?id=1234&var=abcd"
I'd like an assertion that checks that the URL has this structure:
"http://example.com/ru/path/?id={id value}&var={var value}"
Surely regex is overkill. If the URL is always laid out like that, you could use:
url = "http://example.com/ru/path/?id=1234&var=abcd"
if url.split('?')[1].startswith('id=') and url.split('&')[1].startswith('var='):
    print("yay!")
import re
s = "http://example.com/ru/path/?id=1234&var=abcd"
# Escape the dot and the question mark; "/" needs no escaping in Python
pattern = r'http://example\.com/ru/path/\?id=\d+&var=\w+'
res = re.findall(pattern, s)
if res:
    print("yes")
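Regex isn't needed, but if you do use one, just check that id is made of digits (\d+) and var of letters ([A-Za-z]+). Note that [A-z] would also accept the punctuation characters that sit between Z and a in ASCII, such as [ and _.
import re
p = re.compile(r'http://example\.com/ru/path/\?id=\d+&var=[A-Za-z]+')
check = p.match("http://example.com/ru/path/?id=1234&var=abcd")
if check:
    print('match')
else:
    print('does not match')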
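If the goal is to assert the structure rather than to match text, the standard library's URL parser is another option. This is only a minimal sketch for Python 3, assuming "structure" means this exact scheme, host and path with exactly the parameters id and var, both non-empty; the function name has_expected_structure is made up for illustration, and parameter order is not checked.

```python
from urllib.parse import urlsplit, parse_qs

def has_expected_structure(url):
    """Return True if url is http://example.com/ru/path/ with exactly id and var set."""
    parts = urlsplit(url)
    if (parts.scheme, parts.netloc, parts.path) != ("http", "example.com", "/ru/path/"):
        return False
    # parse_qs drops parameters with empty values by default
    params = parse_qs(parts.query)
    return set(params) == {"id", "var"}

print(has_expected_structure("http://example.com/ru/path/?id=1234&var=abcd"))  # True
print(has_expected_structure("http://example.com/ru/path/?id=1234"))           # False
```

Unlike the regex versions, this does not constrain what the values look like; add a check such as params["id"][0].isdigit() if that matters.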
Related
Here is my data string:
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL
I use this code, but when I run it, it returns nothing for DATANORMAL:
mydata = re.findall(r'MYDATA=(.*)' r'_.*', mystring)
print(mydata)
and it only shows NOTNORMAL.
I want both to work, and to display the data like this:
DATANORMAL
NOTNORMAL
How do I do it? Thanks.
import re
mystring = """
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL
"""
mydata = re.findall(r'^\s*MYDATA=(?:.+_)?(.+?)\s*$', mystring, re.M)
print(mydata)
If you need the word before the _ rather than the one after it, use the regex r'^\s*MYDATA=(.+?)(?:_.+)?\s*$' in the code above.
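To make the difference between the two variants concrete, here is a small sketch that runs both on the sample data from the question. The variable names after and before are only for illustration.

```python
import re

mystring = """
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL
"""

# Part after the underscore (or the whole value when there is none)
after = re.findall(r'^\s*MYDATA=(?:.+_)?(.+?)\s*$', mystring, re.M)
# Part before the underscore (or the whole value when there is none)
before = re.findall(r'^\s*MYDATA=(.+?)(?:_.+)?\s*$', mystring, re.M)

print(after)   # ['DATANORMAL', 'NOTNORMAL']
print(before)  # ['DATANORMAL', 'DATA']
```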
Based on what you describe, you might want to use an alternation here:
\bMYDATA=((?:DATA|(?:DATA_))\S+)\b
Script:
import re
inp = "some text MYDATA=DATANORMAL more text MYDATA=DATA_NOTNORMAL"
mydata = re.findall(r'\bMYDATA=((?:DATA|(?:DATA_))\S+)\b', inp)
print(mydata)
This prints:
['DATANORMAL', 'DATA_NOTNORMAL']
I guess you need to add flags=re.M?
import re
mystring = """
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL"""
pattern = re.compile(r"MYDATA=(?:DATA_)?(\w+)", flags=re.M)
print(pattern.findall(mystring))
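One thing worth knowing about re.M: it only changes where ^ and $ can match, so it has no effect on a pattern without those anchors. A quick sketch to check this on the same data; the anchored pattern is my own variation, not part of the answer above.

```python
import re

mystring = """
MYDATA=DATANORMAL
MYDATA=DATA_NOTNORMAL"""

unanchored = r"MYDATA=(?:DATA_)?(\w+)"
anchored = r"^MYDATA=(?:DATA_)?(\w+)$"   # assumed variant with line anchors

# No ^ or $ in the pattern: the flag makes no difference
print(re.findall(unanchored, mystring))               # ['DATANORMAL', 'NOTNORMAL']
print(re.findall(unanchored, mystring, flags=re.M))   # ['DATANORMAL', 'NOTNORMAL']

# With anchors, re.M lets ^ and $ match at every line boundary
print(re.findall(anchored, mystring))                 # []
print(re.findall(anchored, mystring, flags=re.M))     # ['DATANORMAL', 'NOTNORMAL']
```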
Here's the scenario: I'd like to extract the secondary path from a URL, so the following URLs should all return 'a-c-d':
/opportunity/a-c-d
/opportunity/a-c-d/
/opportunity/a-c-d/123/456/
/opportunity/a-c-d/?x=1
/opportunity/a-c-d?x=1
My code snippet is as follows:
m = re.match("^/opportunity/([^/]+)[\?|/|$]", "/opportunity/a-c-d")
if m:
    print(m.group(1))
It works for all possible URLs above EXCEPT the first one /opportunity/a-c-d. Could anyone help explain the reason and rectify my regex please? Thanks a lot!
Don't do this. Use the urlparse module (urllib.parse in Python 3) instead.
Here is some test code:
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
urls = [
    '/opportunity/a-c-d',
    '/opportunity/a-c-d/',
    '/opportunity/a-c-d/123/456/',
    '/opportunity/a-c-d/?x=1',
    '/opportunity/a-c-d?x=1',
]
def secondary(url):
    # The path splits into ['', 'opportunity', 'a-c-d', ...]
    try:
        return urlparse(url).path.split('/')[2]
    except IndexError:
        return None
for url in urls:
    print('{0:30s} => {1}'.format(url, secondary(url)))
and here is the output:
/opportunity/a-c-d             => a-c-d
/opportunity/a-c-d/            => a-c-d
/opportunity/a-c-d/123/456/    => a-c-d
/opportunity/a-c-d/?x=1        => a-c-d
/opportunity/a-c-d?x=1         => a-c-d
The $ in your regex sits inside a character class ([...]), so it matches a literal '$' character rather than the end of the string. Instead, you probably want this:
m = re.match(r"^/opportunity/([^/?]+)\/?\??", "/opportunity/a-c-d")
if m:
    print(m.group(1))
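A short check of that explanation, using made-up test strings: inside [...] the characters ?, | and $ lose their special meaning, which is why the original pattern always needs one more real character after the captured segment.

```python
import re

# Inside [...] the characters |, ? and $ are ordinary literals
print(bool(re.search(r"[\?|/|$]", "price: $5")))   # True: matches the literal "$"
print(bool(re.search(r"[\?|/|$]", "a|b")))         # True: matches the literal "|"

# Outside a class, $ is the end-of-string anchor
print(bool(re.match(r"^/opportunity/([^/]+)$", "/opportunity/a-c-d")))   # True

# The original pattern needs one more character after the capture, so this fails
print(re.match(r"^/opportunity/([^/]+)[\?|/|$]", "/opportunity/a-c-d"))  # None
```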
Alternative patterns should go inside (), not [], which matches a single character from a set.
You should also use a raw string, so that escape sequences are passed literally to the re module rather than being interpreted in the Python string. The capture also has to exclude ?, otherwise it swallows the query string in /opportunity/a-c-d?x=1.
m = re.match(r"^/opportunity/([^/?]+)(\?|/|$)", "/opportunity/a-c-d")
or
m = re.match(r"^/opportunity/([^/?]+)([?/]|$)", "/opportunity/a-c-d")
Use () to capture the part you need. The leading .*? is lazy so that it cannot eat the start of the first word:
[re.sub(r'.*?(\w+-\w+-\w+).*', r'\1', x) for x in urls]
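For anyone who wants to stay with a regex, here is a sketch that runs one anchored pattern against all five URLs from the question. The helper name secondary_path is hypothetical, and the pattern is just one way to express "stop at /, ? or the end of the string".

```python
import re

# Capture the segment after /opportunity/, stopping at "/", "?" or end of string
SECONDARY = re.compile(r"^/opportunity/([^/?]+)(?:[/?]|$)")

def secondary_path(url):
    """Return the segment after /opportunity/, or None if the URL does not fit."""
    m = SECONDARY.match(url)
    return m.group(1) if m else None

urls = [
    '/opportunity/a-c-d',
    '/opportunity/a-c-d/',
    '/opportunity/a-c-d/123/456/',
    '/opportunity/a-c-d/?x=1',
    '/opportunity/a-c-d?x=1',
]
for url in urls:
    print(url, "=>", secondary_path(url))  # every line ends in "a-c-d"
```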
https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3/159463108/video/499604330/playlist.m3u8
I want to include everything except playlist.m3u8.
(playlist.[^.]*)
selects "playlist.m3u8"; I need to do exactly the opposite.
Here is a demo: https://regex101.com/r/RONA65/1
You can use a positive lookahead:
(.*)(?=playlist\.[^.]*)
Demo:
https://regex101.com/r/RONA65/4
Or you can try it like this as well:
.*\/
Demo:
https://regex101.com/r/RONA65/2
Regex:
.*\/ selects everything up to and including the last /
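In Python, both patterns can be tried like this. The URL below is a shortened stand-in I made up (the real one has a longer token in the path), and the variable names are only for illustration.

```python
import re

# Shortened stand-in for the URL in the question
s = "https://example.com/159463108/video/499604330/playlist.m3u8"

# Lookahead: capture everything that is followed by "playlist.<ext>"
via_lookahead = re.match(r"(.*)(?=playlist\.[^.]*)", s).group(1)
# Greedy match up to and including the last "/"
via_last_slash = re.match(r".*/", s).group(0)

print(via_lookahead)    # https://example.com/159463108/video/499604330/
print(via_last_slash)   # https://example.com/159463108/video/499604330/
```

Both keep the trailing slash; the "/" needs no backslash in a Python pattern.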
You can use the split function:
>>> s = 'https://fire.vimeocdn.com/.../159463108/video/499604330/playlist.m3u8'
>>> '/'.join(s.split('/')[:-1])
'https://fire.vimeocdn.com/.../159463108/video/499604330'
Or simpler with rsplit:
>>> s = 'https://fire.vimeocdn.com/.../159463108/video/499604330/playlist.m3u8'
>>> s.rsplit('/', 1)[0]
'https://fire.vimeocdn.com/.../159463108/video/499604330'
Use a non-greedy match by adding '?' after '*':
import re
s = 'https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3/159463108/video/499604330/playlist.m3u8'
m = re.match(r'(.*?)(playlist\.[^.]*)', s)
print(m.group(1))
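For the URL in the question the greedy and the non-greedy version give the same prefix, because "playlist." occurs only once. The difference shows up when the marker appears more than once, as in this sketch with a made-up path:

```python
import re

s = "/playlist.old/video/playlist.m3u8"  # made-up path with two "playlist." occurrences

greedy = re.match(r"(.*)(playlist\.[^.]*)", s)
lazy = re.match(r"(.*?)(playlist\.[^.]*)", s)

# Greedy (.*) backs off only as far as the LAST "playlist."
print(greedy.group(1))  # /playlist.old/video/
# Lazy (.*?) stops at the FIRST "playlist."
print(lazy.group(1))    # /
```

So if the file name is always the last path segment, the greedy form is probably the safer of the two.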
I want to match and group any of these listed words:
aboutus/,race/,cruise/,westerlies/,weather/,reach/,gear/ or empty_string
Here is a solution, but it does not match the empty string:
^(aboutus|race|cruise|westerlies|weather|reach|gear)/$
So my question is: how do I include the empty string in this match?
I still haven't found a good solution for this,
so I added one more regex just for the empty string, i.e. ^$.
Note: these regular expressions are for Django's urls.py.
Update: it would be better if the capturing group did not contain the /.
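Try this:
^(aboutus|race|cruise|westerlies|weather|reach|gear)?/$
Edit:
The pattern above still requires the '/', so it matches a bare "/" but not the empty string. If '/' is present in every case except the empty string, try this: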
^((aboutus|race|cruise|westerlies|weather|reach|gear)(/))?$
Use this:
^$|^(aboutus|race|cruise|westerlies|weather|reach|gear)/$
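You can make the group optional. Wrap the word and its '/' in an optional non-capturing group so that the empty string matches and the captured word does not include the '/':
^(?:(aboutus|race|cruise|westerlies|weather|reach|gear)/)?$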
import re
rgx = re.compile('^((aboutus|race|cruise|westerlies'
                 '|weather|reach|gear)/|)$')
# or
li = ['aboutus', 'race', 'cruise', 'westerlies',
      'weather', 'reach', 'gear']
rgx = re.compile('^((%s)/|)$' % '|'.join(li))
for s in ('aboutus/',
          'westerlies/',
          'westerlies/ ',
          ''):
    m = rgx.search(s)
    # Show the whole match, or None when the string is rejected
    print('%-21r%r' % (s, m.group() if m else m))
result:
'aboutus/'           'aboutus/'
'westerlies/'        'westerlies/'
'westerlies/ '       None
''                   ''
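Regarding the update in the question (a capturing group without the '/'), here is a sketch of one way to do it in plain Python: a non-capturing group holds the word plus its slash, and only the word is captured. How a Django URL pattern treats the unmatched group for the empty string is not covered here.

```python
import re

WORDS = ["aboutus", "race", "cruise", "westerlies", "weather", "reach", "gear"]

# (?:...) groups the word and its slash without capturing;
# the inner (...) captures only the word.
rgx = re.compile(r"^(?:(%s)/)?$" % "|".join(WORDS))

for s in ("aboutus/", "westerlies/", "westerlies/ ", "/", ""):
    m = rgx.match(s)
    print(repr(s), "->", repr(m.group(1)) if m else "no match")
```

For the empty string the match succeeds but group(1) is None, since the optional group did not take part in the match.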
I need to find everything between
show_detail&
and
;session_id=1445045
in
https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0
using regex in Python.
I know I need to use lookbehind/lookahead, but I can't seem to make it work!
Please help!
Thanks :)
Why use a regex?
>>> url = 'https://ww.site.gov.....'
>>> start = url.index('show_detail&') + len('show_detail&')
>>> end = url.index(';session_id=')
>>> url[start:end]
'id=4035219;num=1'
>>> s= "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
>>> s.split(";session_id=1445045")[0].split("show_detail&")[-1]
'id=4035219;num=1'
>>>
You can use a non-greedy match (.*?) between your markers.
>>> import re
>>> url = "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
>>> m = re.search("show_detail&(.*?);session_id=1445045", url)
>>> m.group(1)
'id=4035219;num=1'
regex = re.compile(r"(?<=show_detail&).*?(?=;session_id=1445045)")
should work. See the re module documentation for more on lookaround assertions.
import re
url = "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
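# Groups: text before the start marker, the marker, the wanted text, the end marker, the rest.
# The end marker includes the ";" so that group 3 has no trailing semicolon.
pattern = "([^>].+)(show_detail&)([^>].+)(;session_id=1445045)([^>].+)"
reg = re.compile(pattern, flags=re.S)
match = reg.search(url)
print(match.group(3))
This should work, I think.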
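If the markers change from request to request, the lookaround idea can be wrapped in a small helper. This is just a sketch: the function name between is made up, and it returns the text between the first start marker and the next end marker after it.

```python
import re

def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # re.escape keeps characters such as "&" or "?" in the markers literal
    pattern = "(?<=%s)(.*?)(?=%s)" % (re.escape(start), re.escape(end))
    m = re.search(pattern, text, flags=re.S)
    return m.group(1) if m else None

url = ("https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi"
       "?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;"
       "subscription=1;value=0")
print(between(url, "show_detail&", ";session_id=1445045"))  # id=4035219;num=1
```

The lookbehind works here because an escaped literal marker always has a fixed width, which Python's re module requires for (?<=...).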