Regular expression to match as few characters as possible - python

I want to match x.py in a/b/c/x.py. When I use re:
import re
s = 'a/b/c/x.py'
res = re.search('/(.*.py)?', s).group(1)
print(res) # b/c/x.py
This is not what I need. Any ideas?

You don't need regex, just use str.rsplit, with maxsplit=1, and take the last item:
>>> s.rsplit('/',1)[-1]
'x.py'

When you want to extract a file name from a path, you should use os.path.split. os.path.split() splits a path name into a pair (head, tail), following the path conventions of the OS the code runs on. Here, tail is the last path name component and head is everything leading up to it.
import os
path = 'a/b/c/x.py'
res = os.path.split(path)
print(res[1])
You can also use normpath and os.sep for this solution:
import os
path = 'a/b/c/x.py'
path = os.path.normpath(path)
res = path.split(os.sep)
print(res[-1])
You can use rsplit, as @ThePyGuy said, to avoid unnecessary splitting by changing the line to:
res = path.rsplit(os.sep,1)
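If only the final component is needed, os.path.basename returns the tail directly. A minimal sketch, assuming forward-slash paths like the one in the question; the helper name file_name is made up for illustration:

```python
import os
import posixpath

def file_name(path):
    """Return the last component of a path (hypothetical helper)."""
    return os.path.basename(path)

print(file_name('a/b/c/x.py'))           # x.py
# posixpath always applies POSIX rules, whatever OS the code runs on
print(posixpath.basename('a/b/c/x.py'))  # x.py
```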

If you need to ensure that the element directly follows a / in the path, you can prepend (?<=\/), a positive lookbehind:
>>> s = 'a/b/c/x.py'
>>> el = re.search(r"(?<=\/)(\w+\.py)", s).group(1)
>>> el
'x.py'
Otherwise, if you also need to match a bare filename.py with no directory part, remove the lookbehind:
>>> s2 = 'file.py'
>>> el = re.search(r"(\w+\.py)", s2).group(1)
>>> el
'file.py'
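Both cases can be covered by one pattern. The sketch below is an assumption rather than part of the answer above: it accepts either the start of the string or a / before the name and anchors the match at the end with $, so only the last component can match. The names LAST_PY and last_py_name are hypothetical.

```python
import re

# (?:^|/)    : the name starts at the beginning of the string or right after a slash
# [^/]+\.py$ : no further slash may follow, so this must be the last component
LAST_PY = re.compile(r'(?:^|/)([^/]+\.py)$')

def last_py_name(path):
    """Return the trailing *.py file name, or None if there is none."""
    m = LAST_PY.search(path)
    return m.group(1) if m else None

print(last_py_name('a/b/c/x.py'))  # x.py
print(last_py_name('file.py'))     # file.py
print(last_py_name('a/b.py/c'))    # None
```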

I prefer a splitting approach here:
s = 'a/b/c/x.py'
last = s.split('/')[-1]
print(last) # x.py

import re
s = 'a/b/c/x.py'
res = re.search(r'\w*\.py', s).group() # \w matches letters, digits and underscore
# res = re.search(r'[\w&.-]+$', s).group()
# The regex above matches word characters plus the listed special characters (&, ., -)
EDIT
To match everything after the last /, you can use the following regex:
res = re.search('[^/]+$', s).group()
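Why the pattern in the question returns b/c/x.py: re.search reports the leftmost match, so the pattern locks onto the first / and then extends to the .py. Making the quantifier non-greedy does not help here; excluding / from the matched characters does. A short sketch comparing the three (the variable names are only for illustration):

```python
import re

s = 'a/b/c/x.py'

greedy = re.search(r'/(.*\.py)', s).group(1)
lazy = re.search(r'/(.*?\.py)', s).group(1)
no_slash = re.search(r'[^/]+\.py$', s).group()

print(greedy)    # b/c/x.py
print(lazy)      # b/c/x.py (still starts at the first slash)
print(no_slash)  # x.py
```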

Related

Extract numbers from string from specific point

Example strings:
myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
myString2 = "/desktop/51232132561/Screenshots/photo_3501.png"
myString3 = "/desktop/12321516123/Screenshots/photo_7501.png"
myString4 = "/desktop/5234324324/Screenshots/photo_11501.png"
I had a look around, and couldn't really figure out a proper way to do this. I want to be able to also retrieve the last numbers of my strings after the photo_ part, and store them in another variable (string, not int or float). Furthermore, I don't need the number before /Screenshots. It would also be nice if it can work for any number length. The photo_ will always remain inside the string too.
You can write a regex that only matches at the end of the string:
>>> import re
>>> myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
>>> re.search(r"photo_(\d+)\.png$", myString1).group(1)
'0000'
This calls for a regex solution:
import re
mystring = "/desktop/2512754353/Screenshots/photo_0000.png"
your_value = re.findall(r'(photo_[0-9]+)', mystring)[0]
print(your_value) # photo_0000
Using regex:
import re
data = [
"/desktop/2512754353/Screenshots/photo_0000.png",
"/desktop/51232132561/Screenshots/photo_3501.png",
"/desktop/12321516123/Screenshots/photo_7501.png",
"/desktop/5234324324/Screenshots/photo_11501.png"
]
id_regex = re.compile(r".+photo_(\d+)\.png")
ids = [int(id_regex.match(d).group(1)) for d in data] # drop int() to keep strings such as '0000'
print(ids) # [0, 3501, 7501, 11501]
A simple way to do it with the split function:
myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
first_split = myString1.split('photo_')
number = first_split[1].split('.')[0]
print(number)
Another way, using regex (this relies on the photo number being the second run of digits in the string):
import re
myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
number = re.findall(r'\d+', myString1)[1]
print(number)
A more robust way is to use pathlib in conjunction with re:
import re
from pathlib import Path
myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
pattern = re.compile(r"(?<=photo_)[0-9]*")
print(pattern.search(Path(myString1).name).group(0)) # 0000
Try this:
import re
myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
x=re.findall(r'photo_\d+',myString1.split("/")[-1])
print(x)
Another way, using built-in string methods, is to slice the string between "photo_" and ".png":
>>> strings = [myString1, myString2, myString3, myString4]
>>> [s[s.rfind("photo_") + len("photo_"):s.rfind(".png")] for s in strings]
['0000', '3501', '7501', '11501']
I suggest:
import re
myString3 = "/desktop/12321516123/Screenshots/photo_7501.png"
s = re.findall(r'_\d+', myString3)[0]
print(int(s[1:])) # 7501
You can use pathlib and split:
from pathlib import Path
fn="/desktop/5234324324/Screenshots/photo_11501.png"
print(Path(fn).stem.split('_')[-1])
# 11501
The pathlib property .stem is the name of the path stripped of the path to it and the extension:
>>> Path(fn).stem
'photo_11501'
Then either split or partition on the '_' delimiter:
>>> Path(fn).stem.partition('_')
('photo', '_', '11501')
>>> Path(fn).stem.split('_')
['photo', '11501']
You can use split or partition entirely on strings that represent paths:
>>> fn.partition('.png')[0].partition('_')[-1]
'11501'
But using pathlib lets you take those paths straight from a glob or another method, and it is likely more robust and certainly more cross-platform.
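Putting the pieces together, here is a sketch that takes the file name with pathlib and then requires the whole stem to look like photo_<digits>. It swaps Path for PurePosixPath so that / is treated as the separator on every OS. The function name photo_number and the choice to return None for non-matching names are assumptions, not something the question specifies.

```python
import re
from pathlib import PurePosixPath

PHOTO_STEM = re.compile(r'photo_(\d+)')

def photo_number(path):
    """Return the digits after 'photo_' as a string, or None if the name does not fit."""
    stem = PurePosixPath(path).stem   # e.g. 'photo_0000'
    m = PHOTO_STEM.fullmatch(stem)
    return m.group(1) if m else None

paths = [
    "/desktop/2512754353/Screenshots/photo_0000.png",
    "/desktop/51232132561/Screenshots/photo_3501.png",
    "/desktop/12321516123/Screenshots/photo_7501.png",
    "/desktop/5234324324/Screenshots/photo_11501.png",
]
print([photo_number(p) for p in paths])  # ['0000', '3501', '7501', '11501']
```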

How can I select everything in a URL except the filename and extension?

https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3/159463108/video/499604330/playlist.m3u8
I want to include everything except playlist.m3u8.
(playlist.[^.]*)
is selecting "playlist.m3u8"; I need to do exactly the opposite.
Here is a demo: https://regex101.com/r/RONA65/1
You can use a positive lookahead:
(.*)(?=playlist\.[^.]*)
Demo:
https://regex101.com/r/RONA65/4
Or you can try it like this as well:
.*\/
Demo:
https://regex101.com/r/RONA65/2
Explanation:
.*\/ selects everything up to and including the last /
You can use the split function:
>>> s = 'https://fire.vimeocdn.com/.../159463108/video/499604330/playlist.m3u8'
>>> '/'.join(s.split('/')[:-1])
'https://fire.vimeocdn.com/.../159463108/video/499604330'
Or simpler with rsplit:
>>> s = 'https://fire.vimeocdn.com/.../159463108/video/499604330/playlist.m3u8'
>>> s.rsplit('/', 1)[0]
'https://fire.vimeocdn.com/.../159463108/video/499604330'
Use a non-greedy match by adding '?' after '*':
import re
s = 'https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3/159463108/video/499604330/playlist.m3u8'
m = re.match(r'(.*?)(playlist\.[^.]*)', s) # raw string, with the dot escaped so it matches a literal '.'
print(m.group(1))
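Since the input is a URL, the standard library's urllib.parse can separate the path from the rest before trimming it. A hedged sketch; the helper name url_parent is invented here, and it assumes the file name is simply the last path segment:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def url_parent(url):
    """Return the URL with its last path segment removed (trailing slash kept)."""
    parts = urlsplit(url)
    parent = posixpath.dirname(parts.path).rstrip('/') + '/'
    # Any query string or fragment is dropped along with the file name
    return urlunsplit((parts.scheme, parts.netloc, parent, '', ''))

url = ('https://fire.vimeocdn.com/1485599447-0xf546ac1afe7bce06fa5153973a8b85b1c45051d3'
       '/159463108/video/499604330/playlist.m3u8')
print(url_parent(url))
```

Unlike a plain string split, this keeps the scheme and host intact even when the URL has no directory part at all.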

Regex Python capture string in quotes

I have a file with lines of this form:
ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName
and I would like to capture the names in quotes "" after ClientsName(0) = and ClientsName(1) =.
So far, I came up with this code
import re
f = open('corrected_clients_data.txt', 'r')
result = ''
re_name = "ClientsName\(0\) = (.*)"
for line in f:
    name = re.search(line, re_name)
    print (name)
which is returning None at each line...
Two possible sources of error are the backslashes and the capture group (.*)...
You can do that more easily using re.findall and using \d instead of 0 to make it more general:
>>> import re
>>> s = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> print(re.findall(r'ClientsName\(\d\) = "([^"]*)"', s))
['SUPERBRAND', 'GREATSTUFF']
Also note that the order of your arguments to search() is wrong. It should be re.search(pattern, string).
You can use re.findall and just take the first two matches:
>>> s = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> re.findall(r'\"([^"]+)\"' , s)[:2]
['SUPERBRAND', 'GREATSTUFF']
Try this:
import re
with open("corrected_clients_data.txt", "r") as text_file:
    text = text_file.read()
matches = re.findall(r'"(.+?)"', text)
The question mark (?) makes the match non-greedy, so it stops at the first closing double quote it encounters.
Hope this is helpful.
Use a lookbehind to get the values of ClientsName(0) and ClientsName(1) with re.findall:
>>> import re
>>> s = '''ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": cClientsNames.Add Key:="SUPER", Item:=ClientsName'''
>>> m = re.findall(r'(?<=ClientsName\(0\) = \")[^"]*|(?<=ClientsName\(1\) = \")[^"]*', s)
>>> m
['SUPERBRAND', 'GREATSTUFF']
Explanation:
(?<=ClientsName\(0\) = \") A positive lookbehind sets the match position just after the string ClientsName(0) = "
[^"]* Then matches any character other than " zero or more times, so it matches the first value, i.e. SUPERBRAND
| The alternation (logical OR) operator combines the two regexes.
(?<=ClientsName\(1\) = \")[^"]* Matches the characters just after the string ClientsName(1) = " up to the next ", so it matches the second value, i.e. GREATSTUFF
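For reference, a corrected version of the loop from the question might look like the sketch below. It passes the pattern first and the line second, and uses \d+ so that indices with more than one digit also match. The in-memory list lines stands in for the file, which is an assumption made to keep the example self-contained.

```python
import re

# Stand-in for the lines of corrected_clients_data.txt
lines = [
    'ClientsName(0) = "SUPERBRAND": ClientsName(1) = "GREATSTUFF": '
    'cClientsNames.Add Key:="SUPER", Item:=ClientsName',
]

name_re = re.compile(r'ClientsName\((\d+)\) = "([^"]*)"')

for line in lines:
    # findall returns one (index, name) tuple per match
    for index, name in name_re.findall(line):
        print(index, name)
# 0 SUPERBRAND
# 1 GREATSTUFF
```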

search patterns with variable gaps in python

I am looking for patterns in a list containing different strings as:
names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF']
I would like to select the string that has the pattern 'T--T' (no matter how the string starts), so those elements would be selected and appended to a new list as:
namesSelected = ['TAATGH', 'ATGTTKKKK']
Using grep I could:
grep "T[[:alpha:]]\{2\}T"
Is there a similar mode in re python?
Thanks for any help!
I think this is most likely what you want:
re.search(r'T[A-Z]{2}T', inputString)
The closest equivalent in Python for [[:alpha:]] is [a-zA-Z] (for ASCII letters). You may replace [A-Z] with [a-zA-Z] in the code snippet above if you wish to allow lowercase letters.
Documentation for re.search.
Yep, you can use re.search:
>>> import re
>>> names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF']
>>> reslist = []
>>> for i in names:
...     res = re.search(r'T[A-Z]{2}T', i)
...     if res:
...         reslist.append(i)
...
>>> print(reslist)
['TAATGH', 'ATGTTKKKK']
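import re
def grep(l, pattern):
    r = re.compile(pattern)
    # Search each string in the list, not the pattern itself
    return [s for s in l if r.search(s)]
names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF']
nameSelected = grep(names, r"T\w{2}T")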
Note the use of \w instead of [[:alpha:]]; \w also matches digits and underscores, so it is slightly broader.
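The difference between the two character classes shows up as soon as the gap contains a digit or an underscore. A small sketch (the extra string 'T12T' is made up to demonstrate this):

```python
import re

names = ['TAATGH', 'GHHKLL', 'TGTHA', 'ATGTTKKKK', 'KLPPNF', 'T12T']

letters_only = re.compile(r'T[A-Za-z]{2}T')
word_chars = re.compile(r'T\w{2}T')

print([n for n in names if letters_only.search(n)])  # ['TAATGH', 'ATGTTKKKK']
print([n for n in names if word_chars.search(n)])    # ['TAATGH', 'ATGTTKKKK', 'T12T']
```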

How to delete everything after a certain character in a string?

How would I delete everything after a certain character of a string in Python? For example, I have a string containing a file path and some extra characters. How would I delete everything after .zip? I've tried rsplit and split, but neither kept the .zip when deleting the extra characters.
Any suggestions?
Just take the first portion of the split, and add '.zip' back:
s = 'test.zip.zyz'
s = s.split('.zip', 1)[0] + '.zip'
Alternatively you could use slicing, here is a solution where you don't need to add '.zip' back to the result (the 4 comes from len('.zip')):
s = s[:s.index('.zip')+4]
Or another alternative with regular expressions:
import re
s = re.match(r'^.*?\.zip', s).group(0)
Use str.partition:
>>> s='abc.zip.blech'
>>> ''.join(s.partition('.zip')[0:2])
'abc.zip'
>>> s='abc.zip'
>>> ''.join(s.partition('.zip')[0:2])
'abc.zip'
>>> s='abc.py'
>>> ''.join(s.partition('.zip')[0:2])
'abc.py'
Use slices:
s = 'test.zip.xyz'
s[:s.index('.zip') + len('.zip')]
=> 'test.zip'
And it's easy to pack the above in a little helper function:
def removeAfter(string, suffix):
    return string[:string.index(suffix) + len(suffix)]
removeAfter('test.zip.xyz', '.zip')
=> 'test.zip'
I think it's easy to create a simple lambda function for this.
mystrip = lambda s, ss: s[:s.index(ss) + len(ss)]
Can be used like this:
mystr = "this should stay.zipand this should be removed."
mystrip(mystr, ".zip") # 'this should stay.zip'
You can use the re module:
import re
re.sub(r'\.zip.*', '.zip', 'test.zip.blah') # 'test.zip'
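Note that the index-based helpers raise ValueError when the marker is absent, while partition quietly returns the string unchanged. A sketch of a helper built on partition; the name keep_through is hypothetical:

```python
def keep_through(text, marker):
    """Return text up to and including the first marker; unchanged if marker is absent."""
    head, sep, _tail = text.partition(marker)
    return head + sep

print(keep_through('test.zip.xyz', '.zip'))  # test.zip
print(keep_through('abc.py', '.zip'))        # abc.py

try:
    'abc.py'.index('.zip')
except ValueError:
    print('str.index raises ValueError when the marker is missing')
```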
