How can I select everything in a URL except the filename and extension? - python

https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3/159463108/video/499604330/playlist.m3u8
I want to include everything except playlist.m3u8.
(playlist.[^.]*)
selects "playlist.m3u8"; I need to do exactly the opposite.
Here is a demo: https://regex101.com/r/RONA65/1

You can use a positive lookahead:
(.*)(?=playlist\.[^.]*)
Demo:
https://regex101.com/r/RONA65/4
Or you can try it like this as well:
.*\/
Demo:
https://regex101.com/r/RONA65/2
Regex explanation:
.*\/ selects everything up to and including the last /
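In Python, the second pattern could be applied with re.match, roughly as in the sketch below. The variable names are illustrative, and the URL is the one from the question.

```python
import re

url = ('https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3'
       '/159463108/video/499604330/playlist.m3u8')

# Greedy .* backtracks to the last '/', so the match ends just before the filename.
m = re.match(r'.*/', url)
base = m.group(0) if m else url

print(base)
```

Note that the result keeps the trailing slash; call base.rstrip('/') if that is not wanted.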

You can use the split function:
>>> s = 'https://fire.vimeocdn.com/.../159463108/video/499604330/playlist.m3u8'
>>> '/'.join(s.split('/')[:-1])
'https://fire.vimeocdn.com/.../159463108/video/499604330'
Or simpler with rsplit:
>>> s = 'https://fire.vimeocdn.com/.../159463108/video/499604330/playlist.m3u8'
>>> s.rsplit('/', 1)[0]
'https://fire.vimeocdn.com/.../159463108/video/499604330'
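If the URL may carry a query string that itself contains '/', a plain split on '/' could cut in the wrong place. One possible alternative, not part of the answers above, is to split the URL into components first. This is a minimal sketch using the standard library; the example.com URL is made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical URL with a '/' inside the query string.
url = 'https://example.com/video/499604330/playlist.m3u8?next=/a/b'

parts = urlsplit(url)
# dirname drops the last path segment; the query and fragment are discarded.
base = parts._replace(path=posixpath.dirname(parts.path),
                      query='', fragment='').geturl()

print(base)
```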

Use a non-greedy match by adding '?' after '*':
import re
s = 'https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3/159463108/video/499604330/playlist.m3u8'
m = re.match(r'(.*?)(playlist\.[^.]*)', s)
print(m.group(1))

Related

How to count the number of double and triple repetitions of a letter in a string without the two counts overlapping? [duplicate]

I am trying to replace single $ characters with something else, and want to ignore multiple $ characters in a row, and I can't quite figure out how. I tried using lookahead:
s='$a $$b $$$c $d'
re.sub(r'\$(?!\$)','z',s)
This gives me:
'za $zb $$zc zd'
when what I want is
'za $$b $$$c zd'
What am I doing wrong?
Notes, if not using a callable as the replacement:
you would need a lookahead because you must not match if followed by $
you would need a lookbehind because you must not match if preceded by $
Not as elegant, but this is very readable:
>>> def dollar_repl(matchobj):
...     val = matchobj.group(0)
...     if val == '$':
...         val = 'z'
...     return val
...
>>> import re
>>> s = '$a $$b $$$c $d'
>>> re.sub(r'\$+', dollar_repl, s)
'za $$b $$$c zd'
Hmm. It looks like I can get it to work if I use both lookahead and lookbehind. Seems like there should be an easier way, though.
>>> re.sub(r'(?<!\$)\$(?!\$)','z',s)
'za $$b $$$c zd'
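To see why the lookahead alone is not enough, the two patterns can be run side by side. This is a small sketch using the string from the question; the variable names are illustrative.

```python
import re

s = '$a $$b $$$c $d'

# Lookahead alone still matches the last '$' of each run,
# because that final '$' is not followed by another '$'.
lookahead_only = re.sub(r'\$(?!\$)', 'z', s)

# Adding a lookbehind also rejects a '$' that is preceded by another '$'.
both = re.sub(r'(?<!\$)\$(?!\$)', 'z', s)

print(lookahead_only)  # za $zb $$zc zd
print(both)            # za $$b $$$c zd
```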
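OK, without lookaround and without a callback function (the replacement must be a raw string, otherwise '\1' and '\2' are read as control characters):
re.sub(r'(^|[^$])\$([^$]|$)', r'\1z\2', s)
An alternative with re.split:
''.join('z' if x == '$' else x for x in re.split(r'(\$+)', s))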

How to replace an re match with a transformation of that match?

For example, I have a string:
The struct-of-application and struct-of-world
With re.sub, it will replace the matched with a predefined string. How can I replace the match with a transformation of the matched content? To get, for example:
The [application_of_struct](http://application_of_struct) and [world-of-struct](http://world-of-struct)
If I write a simple regex ((\w+-)+\w+) and try to use re.sub, it seems I can't use what I matched as part of the replacement, let alone edit the matched content:
In [10]: p.sub('struct','The struct-of-application and struct-of-world')
Out[10]: 'The struct and struct'
Use a function for the replacement:
import re
s = 'The struct-of-application and struct-of-world'
p = re.compile(r'((\w+-)+\w+)')
def replace(match):
    return 'http://{}'.format(match.group())
    # for Python 3.6+:
    # return f'http://{match.group()}'
>>> p.sub(replace, s)
'The http://struct-of-application and http://struct-of-world'
Try this:
>>> p = re.compile(r"((\w+-)+\w+)")
>>> p.sub('[\\1](http://\\1)','The struct-of-application and struct-of-world')
'The [struct-of-application](http://struct-of-application) and [struct-of-world](http://struct-of-world)'
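A backreference can only reuse the match as-is. To actually transform it, a callable gives full control. The sketch below assumes the desired transformation is reversing the hyphen-separated parts (the question's expected output is ambiguous on this point); to_link is a hypothetical helper name.

```python
import re

pattern = re.compile(r'(?:\w+-)+\w+')

def to_link(match):
    # Assumed transformation: reverse the hyphen-separated parts.
    parts = match.group(0).split('-')
    name = '-'.join(reversed(parts))
    # Wrap the transformed name in a Markdown-style link.
    return '[{0}](http://{0})'.format(name)

text = 'The struct-of-application and struct-of-world'
result = pattern.sub(to_link, text)
print(result)
```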

Extract substring between specific characters

I have some strings like:
\i{}Agrostis\i0{} <L.>
I would like to get rid of the '\i{}' and '\i0{}' markers, so that I get just:
Agrostis <L.>
I've tried the following code (adapted from here):
m = re.search('\i{}(.+?)\i0', item_name)
if m:
    name = m.group(1).strip('\\')
else:
    name = item_name
It works in part, because when I run it I get just:
Agrostis
without the
<L.>
part (which I want to keep).
Any hints?
Thanks in advance for any assistance you can provide!
Use s.replace('\i{}', '') and s.replace('\i0{}', '')
You can do this in different ways.
The simplest one is to use str.replace
s = r'''\i{}Agrostis\i0{} <L.>'''
s2 = s.replace(r'''\i{}''', '').replace(r'''\i0{}''', '')
Another way is to use re.sub()
You need to use the re.sub function.
In [34]: import re
In [35]: s = r"\i{}Agrostis\i0{} <L.>"
In [36]: re.sub(r'\\i\d*{}', '', s)
Out[36]: 'Agrostis <L.>'
You could use a character class along with re.sub()
import re
regex = r'\\i[\d{}]+'
string = r"\i{}Agrostis\i0{} <L.>"
string = re.sub(regex, '', string)
print(string)
See a demo on ideone.com.
You can either use s.replace('\i{}', '') and s.replace('\i0{}', ''), as Julien said, or, continuing with the regex approach, change your pattern so that it also captures the text after the second marker:
re.search(r'\\i\{\}(.+?)\\i0\{\}(.+)', item_name)
And use m.group(1) + m.group(2) as the result.
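When many names need cleaning, the re.sub approach can be wrapped in a small helper. This is a minimal sketch; strip_markup and the second sample name are made up for illustration, and the pattern assumes the markers always look like a backslash, 'i', optional digits, and '{}'.

```python
import re

# Matches a backslash, 'i', optional digits, then a literal '{}'.
MARKUP = re.compile(r'\\i\d*\{\}')

def strip_markup(name):
    # Remove every marker; strings without markers pass through unchanged.
    return MARKUP.sub('', name)

# The first name is from the question; the second is a hypothetical plain name.
names = [r'\i{}Agrostis\i0{} <L.>', 'Poa annua']
cleaned = [strip_markup(n) for n in names]
print(cleaned)
```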

Complex regex in Python

I am trying to write a generic pattern using regex so that it fetches only particular things from the string. Let's say we have strings like GigabitEthernet0/0/0/0 or FastEthernet0/4 or Ethernet0/0.222. The regex should fetch the first 2 characters and all the numerals. Therefore, the fetched result should be something like Gi0000 or Fa04 or Et00222 depending on the above cases.
x = 'GigabitEthernet0/0/0/2'
m = re.search('([\w+]{2}?)[\\\.(\d+)]{0,}',x)
I am not able to understand how I should write the regular expression. The values can also be fetched as a list. I wrote a few more patterns, but they aren't helping.
With regex, you may use the re.findall function.
>>> import re
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join(re.findall(r'\d', s))
'Gi0000'
OR
>>> ''.join(re.findall(r'^..|\d', s))
'Gi0000'
>>> ''.join(re.findall(r'^..|\d', 'Ethernet0/0.222'))
'Et00222'
OR
>>> s = 'GigabitEthernet0/0/0/0 '
>>> s[:2]+''.join([i for i in s if i.isdigit()])
'Gi0000'
z = "Ethernet0/0.222."
print(z[:2] + "".join(re.findall(r"(\d+)(?=[\d\W]*$)", z)))
You can try this. It makes sure only the digits at the end of the string come into play.
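The slicing-plus-findall idea can be checked against all three interface names from the question. This is a minimal sketch; abbreviate is a hypothetical helper name.

```python
import re

def abbreviate(name):
    # First two characters of the name, followed by every digit in it.
    name = name.strip()
    return name[:2] + ''.join(re.findall(r'\d', name))

for full in ('GigabitEthernet0/0/0/0', 'FastEthernet0/4', 'Ethernet0/0.222'):
    print(full, '->', abbreviate(full))
```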
Here is another option:
s = 'Ethernet0/0.222'
"".join(re.findall(r'^\w{2}|\d+', s))

regex to find position between two markers in string

I need to find anything between
show_detail&
and
;session_id=1445045
in
https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0
using regex in Python.
I know I need to use lookbehind/lookahead, but I can't seem to make it work!
Please help!
Thanks :)
Why use a regex?
>>> url = 'https://ww.site.gov.....'
>>> start = url.index('show_detail&') + len('show_detail&')
>>> end = url.index(';session_id=')
>>> url[start:end]
'id=4035219;num=1'
>>> s= "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
>>> s.split(";session_id=1445045")[0].split("show_detail&")[-1]
'id=4035219;num=1'
>>>
You can use a non-greedy match (.*?) in between your markers.
>>> import re
>>> url = "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
>>> m = re.search("show_detail&(.*?);session_id=1445045", url)
>>> m.group(1)
'id=4035219;num=1'
regex = re.compile(r"(?<=show_detail&).*?(?=;session_id=1445045)")
should work. See here for more info on lookaround assertions.
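The lookaround pattern can be generalized to arbitrary literal markers by escaping them first. This is a minimal sketch; between is a hypothetical helper, and it assumes both markers are plain strings (a lookbehind in Python's re must have a fixed width, which an escaped literal does).

```python
import re

def between(text, start, end):
    # Escape the markers so characters like '&' or '?' are treated literally.
    pattern = '(?<={}).*?(?={})'.format(re.escape(start), re.escape(end))
    m = re.search(pattern, text)
    return m.group(0) if m else None

url = ('https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi'
       '?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;'
       'subscription=1;value=0')

print(between(url, 'show_detail&', ';session_id=1445045'))
```

Passing ';session_id=' as the end marker instead avoids hard-coding the session number.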
import re
url = "https://www.site.gov.uk//search/cgi-bin/contract_search/contract_search.cgi?rm=show_detail&id=4035219;num=1;session_id=1445045;start=0;recs=20;subscription=1;value=0"
pattern = "([^>].+)(show_detail&)([^>].+)(session_id=1445045)([^>].+)"
reg = re.compile(pattern, flags=re.S)
match = reg.search(url)
print(match.group(3))
This should work, I think. Note that group 3 keeps the trailing ';' because the end marker in the pattern does not include it.
