regex to find position between two markers in string - python

I need to find anything between
show_detail&
and
;session_id=1445045
in
https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0
using regex in Python.
I know I need to use lookbehind/lookahead, but I can't seem to make it work!
Please help!
Thanks :)

Why use a regex?
>>> url = 'https://ww.site.gov.....'
>>> start = url.index('show_detail&') + len('show_detail&')
>>> end = url.index(';session_id=')
>>> url[start:end]
'id=4035219;num=1'

>>> s= "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
>>> s.split(";session_id=1445045")[0].split("show_detail&")[-1]
'id=4035219;num=1'
>>>
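If either marker might be missing, index() raises ValueError and the split() version silently returns the wrong slice. A minimal sketch of a more defensive variant using str.partition follows; the helper name between is hypothetical, not something from the answers above.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # partition() returns (head, sep, tail); sep is '' when the marker is absent.
    _, found_start, rest = text.partition(start)
    if not found_start:
        return None
    middle, found_end, _ = rest.partition(end)
    return middle if found_end else None


url = ("https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi"
       "?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0")
print(between(url, "show_detail&", ";session_id="))  # id=4035219;num=1
```

Returning None for a missing marker is a design choice here; raising an exception would be just as reasonable.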

You can use a non greedy match (.*?) in between your markers.
>>> import re
>>> url = "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
>>> m = re.search("show_detail&(.*?);session_id=1445045", url)
>>> m.group(1)
'id=4035219;num=1'

regex = re.compile(r"(?<=show_detail&).*?(?=;session_id=1445045)")
should work. See here for more info on lookaround assertions.
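If the markers come from variables, they may contain regex metacharacters. A rough sketch of the same lookaround idea with the markers passed through re.escape is below; the function name find_between is made up for illustration. Python's lookbehind needs a fixed-width pattern, which an escaped literal always is.

```python
import re


def find_between(text, start, end):
    # re.escape() makes both markers literal. (?<=...) and (?=...) assert that
    # the markers are present without including them in the match.
    pattern = "(?<=" + re.escape(start) + ").*?(?=" + re.escape(end) + ")"
    m = re.search(pattern, text)
    return m.group(0) if m else None


url = ("https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi"
       "?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0")
print(find_between(url, "show_detail&", ";session_id=1445045"))  # id=4035219;num=1
```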

import re
url = "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
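# Group 3 captures the text between the two markers; the non-greedy .+? and
# the ';' in group 4 keep the trailing semicolon out of the result.
pattern = r"(.+)(show_detail&)(.+?)(;session_id=1445045)(.+)"
reg = re.compile(pattern, flags=re.S)
match = reg.search(url)
print(match.group(3))
This should work, I think.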

Related

How do I extract a value inside quotes using regex in Python?

My text is
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
I am trying to extract value of posted_data which is 2e54eba66f8f2881c8e78be8342428xd
My code :
extract_posted_data = re.search(r'(\"posted_data\": \")(\w*)', my_text)
print (extract_posted_data)
and it prints None
Thank you
This particular example doesn't seem like it needs regular expressions at all.
>>> my_text
'"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
>>> import json
>>> result = json.loads('{%s}' % my_text)
>>> result
{'posted_data': '2e54eba66f8f2881c8e78be8342428xd', 'isropa': False, 'rx': 'NO', 'readal': 'false'}
>>> result['posted_data']
'2e54eba66f8f2881c8e78be8342428xd'
With BeautifulSoup:
>>> import json
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<script type="text/javascript"> "posted_data":"2738273283723hjasda" </script>')
>>> result = json.loads('{%s}' % soup.script.text)
>>> result
{'posted_data': '2738273283723hjasda'}
>>> result['posted_data']
'2738273283723hjasda'
This is because your original code has an additional space. It should be:
extract_posted_data = re.search(r'(\"posted_data\":\")(\w*)', my_text)
And in fact, '\' is unnecessary here. Just:
extract_posted_data = re.search(r'("posted_data":")(\w*)', my_text)
Then:
extract_posted_data.group(2)
is what you want.
>>> my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
>>> extract_posted_data = re.search(r'("posted_data":")(\w*)', my_text)
>>> extract_posted_data.group(2)
'2e54eba66f8f2881c8e78be8342428xd'
You can also change your regex to use lookarounds, as follows:
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
extract_posted_data = re.search(r'(?<="posted_data":")\w*(?=")', my_text)
print(extract_posted_data[0])
This prints 2e54eba66f8f2881c8e78be8342428xd.
re.search() returns a Match object (or None if nothing matches), and index 0 of that object is the entire matched text.
As others have mentioned, json would be a better tool for this data, but you can also use this regex (I added a \s* in case there are spaces after the colon in the future):
regex: "posted_data":\s*"(?P<posted_data>[^"]+)"
import re
my_text = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false,"rx":"NO","readal":"false"'
m = re.search(r'"posted_data":\s*"(?P<posted_data>[^"]+)"', my_text)
if m:
    print(m.group('posted_data'))
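To make the effect of the optional whitespace concrete, here is a small sketch that runs a tolerant pattern against both a compact and a spaced input. The function name posted_data and the spaced sample string are assumptions for illustration; only the compact string comes from the question.

```python
import re

# \s* on both sides of the colon tolerates '"posted_data":"x"' and '"posted_data" : "x"'.
PATTERN = re.compile(r'"posted_data"\s*:\s*"(?P<posted_data>[^"]*)"')


def posted_data(text):
    m = PATTERN.search(text)
    return m.group("posted_data") if m else None


compact = '"posted_data":"2e54eba66f8f2881c8e78be8342428xd","isropa":false'
spaced = '"posted_data": "2e54eba66f8f2881c8e78be8342428xd", "isropa": false'  # hypothetical variant
print(posted_data(compact))
print(posted_data(spaced))
```

Both calls return the same value, which is the behaviour the original pattern with a hard-coded space could not provide.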

How can I select everything in a URL except the filename and extension?

https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3/159463108/video/499604330/playlist.m3u8
I want to include everything except playlist.m3u8.
(playlist.[^.]*)
is selecting "playlist.m3u8"; I need to do exactly the opposite.
Here is a demo: https://regex101.com/r/RONA65/1
You can use positive look ahead:
(.*)(?=playlist\.[^.]*)
Demo:
https://regex101.com/r/RONA65/4
Or you can try it like this as well:
.*\/
Demo:
https://regex101.com/r/RONA65/2
Regex:
.*\/ Select everything till last /
You can use the split function:
>>> s = 'https://fire.vimeocdn.com/.../159463108/video/499604330/playlist.m3u8'
>>> '/'.join(s.split('/')[:-1])
'https://fire.vimeocdn.com/.../159463108/video/499604330'
Or simpler with rsplit:
>>> s = 'https://fire.vimeocdn.com/.../159463108/video/499604330/playlist.m3u8'
>>> s.rsplit('/', 1)[0]
'https://fire.vimeocdn.com/.../159463108/video/499604330'
Use non-greedy match by adding '?' after '*'
import re
s = 'https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3/159463108/video/499604330/playlist.m3u8'
m = re.match(r'(.*?)(playlist\.[^.]*)', s)
print(m.group(1))
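The split and regex answers all assume the last '/' belongs to the path. If the URL could carry a query string that itself contains '/', one option is to parse it first. A minimal sketch with urllib.parse follows; strip_filename is a hypothetical name, and dropping the query and fragment is an assumption about what is wanted.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_filename(url):
    parts = urlsplit(url)
    # Keep everything up to and including the last '/' of the path only.
    directory = parts.path.rsplit("/", 1)[0] + "/"
    # Query and fragment are deliberately dropped (an assumption).
    return urlunsplit((parts.scheme, parts.netloc, directory, "", ""))


s = ("https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3"
     "/159463108/video/499604330/playlist.m3u8")
print(strip_filename(s))
```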

regex to verify url?

Could anyone help with regex?
I have a URL like
"http://example.com/ru/path/?id=1234&var=abcd"
I'd like an assertion that checks that the URL has the structure:
"http://example.com/ru/path/?id={id value}&var={var value}"
Surely regex is overkill. If it's repeatable like that, you could use:
url = "http://example.com/ru/path/?id=1234&var=abcd"
if url.split('?')[1].startswith('id=') and url.split('&')[1].startswith('var='):
    print("yay!")
import re
s = "http://example.com/ru/path/?id=1234&var=abcd"
pattern = r'http://example\.com/ru/path/\?id=\d+&var=\w+'
res = re.findall(pattern, s)
if res:
    print("yes")
Regex isn't needed, but if you use one, just check that id is a number (\d+) and var is made of letters ([A-Za-z]+; note that [A-z] would also match the ASCII punctuation between Z and a, such as [ and _).
import re
p = re.compile(r'http://example\.com/ru/path/\?id=\d+&var=[A-Za-z]+')
check = p.match("http://example.com/ru/path/?id=1234&var=abcd")
if check:
    print('match')
else:
    print('does not match')
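Another option, sketched below, is to parse the URL and check its pieces instead of matching the whole string. The function name has_expected_structure is hypothetical, and the sketch assumes that parameter order does not matter and that each parameter must appear exactly once with a non-empty value, which is looser than the regexes above.

```python
from urllib.parse import urlsplit, parse_qs


def has_expected_structure(url):
    parts = urlsplit(url)
    if (parts.scheme, parts.netloc, parts.path) != ("http", "example.com", "/ru/path/"):
        return False
    # parse_qs drops parameters with blank values by default, so an empty
    # id or var simply does not show up in the resulting dict.
    params = parse_qs(parts.query)
    return set(params) == {"id", "var"} and all(len(v) == 1 for v in params.values())


print(has_expected_structure("http://example.com/ru/path/?id=1234&var=abcd"))
```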

Finding last char appearance in string

I have this input:
/Users/myMac/Desktop/MemoryAccess/BasicTest.asm/someStuff
and I want to find the last time the char "/" appeared, and get the string BasicTest.
What is a good way of doing that?
Thank you!
os.path module provides basic path name manipulations.
>>> from os.path import *
>>> file = '/Users/myMac/Desktop/MemoryAccess/BasicTest.asm/someStuff'
>>> splitext(basename(dirname(file)))[0]
'BasicTest'
>>> s = "/Users/myMac/Desktop/MemoryAccess/BasicTest.asm/someStuff"
>>> ind = s.rfind('/')
>>> ind1 = s[:ind].rfind('/')
>>> print(s[ind1+1:ind].split('.')[0])
BasicTest
Here is an example with os:
>>> import os
>>> p = '/Users/myMac/Desktop/MemoryAccess/BasicTest.asm/someStuff'
>>> os.path.dirname(p)
'/Users/myMac/Desktop/MemoryAccess/BasicTest.asm'
>>> os.path.splitext(os.path.dirname(p))
('/Users/myMac/Desktop/MemoryAccess/BasicTest', '.asm')
>>> os.path.basename(os.path.splitext(os.path.dirname(p))[0])
'BasicTest'
Well, "BasicTest" follows the next-to-last appearance of "/", but beyond that, try rfind.
The following will return BasicTest.asm which is half the battle:
'/Users/myMac/Desktop/MemoryAccess/BasicTest.asm/someStuff'.split('/')[-2]
The same trick can be used to split on the '.'
'BasicTest.asm'.split('.')[0]
With re in Python:
import re
s = "/Users/myMac/Desktop/MemoryAccess/BasicTest.asm/someStuff"
# Capture the name before the extension in the next-to-last path component.
pattern = re.compile(r"/(\w+)\.\w+/\w*$")
match = pattern.search(s)
print(match.group(1))
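On Python 3, pathlib gives the same result as the os.path answers with less nesting. A minimal sketch, assuming the path always uses POSIX separators as in the question:

```python
from pathlib import PurePosixPath

p = PurePosixPath("/Users/myMac/Desktop/MemoryAccess/BasicTest.asm/someStuff")
# .parent drops the last component; .stem drops the final suffix (".asm").
name = p.parent.stem
print(name)  # BasicTest
```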

Replace ",**" with a linebreak using RegEx (or something else)

I'm getting started with RegEx and I was wondering if anyone could help me craft a statement to convert coordinates as follows:
145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16
to
145.00694,-37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
(Strip off the last comma and value and turn it into a line break.)
I can't figure out how to use wildcards to do something like that. Any help would be greatly appreciated! Thanks.
"Some people, when confronted with a
problem, think 'I know, I'll use
regular expressions.' Now they have
two problems." --Jamie Zawinski
Avoid that problem and use string methods:
s = "145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16"
lines = s.split(' ')  # each line is separated by ' '
for line in lines:
    a, b, c = line.split(',')  # three parts, separated by ','
    print(a + ',' + b)  # keep the first two parts, drop the third
Regex have their uses, but this is not one of them.
>>> import re
>>> s="145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16"
>>> print(re.sub(r",\d+(?:\s+|$)", "\n", s).rstrip())
145.00694,-37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
String methods seem to suffice here, regex are overkill:
>>> s='145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16'
>>> print('\n'.join(line.rpartition(',')[0] for line in s.split()))
145.00694,-37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
>>> import re
>>> s = '145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16'
>>> patt = r'(%s,%s),%s' % ((r'[+-]?\d+\.?\d*', )*3)
>>> m = re.findall(patt, s)
>>> m
['145.00694,-37.80421', '145.00686,-37.80382', '145.00595,-37.8035', '145.00586,-37.80301']
>>> print('\n'.join(m))
145.00694,-37.80421
145.00686,-37.80382
145.00595,-37.8035
145.00586,-37.80301
But I prefer not to use regular expressions in this case.
I like SilentGhost's solution.
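If the coordinates are going to be used as numbers afterwards rather than just reprinted, the string-method approach extends naturally to parsing. A rough sketch follows; the function name parse_pairs is made up, and treating the first two fields as longitude and latitude is a guess based on the sample values.

```python
def parse_pairs(text):
    """Turn 'lon,lat,extra lon,lat,extra ...' into a list of (lon, lat) floats."""
    pairs = []
    for triple in text.split():            # triples are separated by whitespace
        lon, lat, _ = triple.split(",")    # discard the third value
        pairs.append((float(lon), float(lat)))
    return pairs


s = "145.00694,-37.80421,9 145.00686,-37.80382,9 145.00595,-37.8035,16 145.00586,-37.80301,16"
for lon, lat in parse_pairs(s):
    print(lon, lat)
```

A malformed triple raises ValueError here, which may be preferable to silently passing bad data through.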
