How to delete everything after a certain character in a string? - python

How would I delete everything after a certain character of a string in Python? For example, I have a string containing a file path and some extra characters. How would I delete everything after .zip? I've tried rsplit and split, but neither included the .zip when deleting the extra characters.
Any suggestions?

Just take the first portion of the split, and add '.zip' back:
s = 'test.zip.zyz'
s = s.split('.zip', 1)[0] + '.zip'
Alternatively you could use slicing, here is a solution where you don't need to add '.zip' back to the result (the 4 comes from len('.zip')):
s = s[:s.index('.zip')+4]
Or another alternative with regular expressions:
import re
s = re.match(r'^.*?\.zip', s).group(0)

str.partition:
>>> s='abc.zip.blech'
>>> ''.join(s.partition('.zip')[0:2])
'abc.zip'
>>> s='abc.zip'
>>> ''.join(s.partition('.zip')[0:2])
'abc.zip'
>>> s='abc.py'
>>> ''.join(s.partition('.zip')[0:2])
'abc.py'

Use slices:
s = 'test.zip.xyz'
s[:s.index('.zip') + len('.zip')]
=> 'test.zip'
And it's easy to pack the above in a little helper function:
def removeAfter(string, suffix):
    return string[:string.index(suffix) + len(suffix)]
removeAfter('test.zip.xyz', '.zip')
=> 'test.zip'
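Note that str.index raises ValueError when the marker is not present. A minimal sketch of a more forgiving variant built on str.find follows; the name keep_through is hypothetical, and returning the string unchanged when the marker is missing is an assumption about the desired behaviour:

```python
def keep_through(text, marker):
    """Return text up to and including the first occurrence of marker.

    If marker does not occur, text is returned unchanged (an assumption;
    raising an error may suit some callers better).
    """
    pos = text.find(marker)  # find returns -1 instead of raising
    if pos == -1:
        return text
    return text[:pos + len(marker)]


print(keep_through('test.zip.xyz', '.zip'))  # test.zip
print(keep_through('abc.py', '.zip'))        # abc.py
```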

I think it's easy to create a simple lambda function for this.
mystrip = lambda s, ss: s[:s.index(ss) + len(ss)]
Can be used like this:
mystr = "this should stay.zipand this should be removed."
mystrip(mystr, ".zip") # 'this should stay.zip'

You can use the re module:
import re
re.sub(r'\.zip.*', '.zip', 'test.zip.blah')
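If the marker is not a fixed literal such as .zip, the same idea can be generalised. The sketch below is only illustrative: truncate_after is a hypothetical name, re.escape is used so that characters such as '.' in the marker are matched literally, and passing re.DOTALL (so that text after a newline is removed too) is an assumption about what is wanted:

```python
import re


def truncate_after(text, marker):
    """Drop everything that follows the first occurrence of marker."""
    # re.escape turns the marker into a literal pattern ('.' stays a dot).
    pattern = re.escape(marker) + r'.*'
    # A function replacement avoids backslash processing of the marker;
    # re.DOTALL lets '.*' consume newlines as well.
    return re.sub(pattern, lambda m: marker, text, count=1, flags=re.DOTALL)


print(truncate_after('test.zip.blah', '.zip'))  # test.zip
print(truncate_after('abc.py', '.zip'))         # abc.py
```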

Related

Regular expression to match as few characters as possible

I want to match x.py from a/b/c/x.py, when I use re:
s = 'a/b/c/x.py'
res = re.search('/(.*.py)?', s).group(1)
>>> res = b/c/x.py
This is not what I need. Any ideas?
You don't need regex, just use str.rsplit, with maxsplit=1, and take the last item:
>>> s.rsplit('/',1)[-1]
'x.py'
When you want to extract the file name from a path, use os.path.split. It splits a path name into a (head, tail) pair using the path conventions of the current OS: tail is the last path name component and head is everything leading up to it.
import os
path = 'a/b/c/x.py'
res = os.path.split(path)
print(res[1])
You can also use normpath and os.sep for this solution:
import os
path = 'a/b/c/x.py'
path = os.path.normpath(path)
res = path.split(os.sep)
print(res[-1])
As @ThePyGuy said, you can use rsplit in this case to avoid unnecessary splitting by changing the line to:
res = path.rsplit(os.sep,1)
If you need to ensure that the element is the last in a path, you can prepend (?<=\/), a positive lookbehind:
>>> s = 'a/b/c/x.py'
>>> el = re.search(r"(?<=\/)(\w+\.py)", s).group(1)
>>> el
'x.py'
Otherwise, if you also need to match a bare filename.py (with no leading slash), remove the lookbehind:
>>> s2 = 'file.py'
>>> el = re.search(r"(\w+\.py)", s2).group(1)
>>> el
'file.py'
I prefer a splitting approach here:
s = 'a/b/c/x.py'
last = s.split('/')[-1]
print(last) # x.py
import re
s = 'a/b/c/x.py'
res = re.search(r'\w*\.py', s).group() # matches word characters (letters, digits, underscore) followed by .py
# res = re.search(r'[\w&.-]+$', s).group()
# The above regex will match alphanumeric and the given special characters
EDIT
To match everything after the last / you can use following regex
res = re.search('[^/]+$', s).group()
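To see why the pattern in the question returned b/c/x.py, it may help to compare it with a non-greedy variant and with an anchored one. This is a small illustrative sketch; the variable names are made up for the example. The key point is that re.search reports the leftmost match, so making the quantifier non-greedy does not move the start of the match past the first slash:

```python
import re

s = 'a/b/c/x.py'

# Pattern from the question: '/' matches the FIRST slash, then '.*' is greedy.
original = re.search(r'/(.*.py)?', s).group(1)

# Non-greedy does not help: the match still begins at the first slash.
non_greedy = re.search(r'/(.*?\.py)', s).group(1)

# Excluding '/' from the name and anchoring at the end gives the last component.
anchored = re.search(r'([^/]+\.py)$', s).group(1)

print(original)    # b/c/x.py
print(non_greedy)  # b/c/x.py
print(anchored)    # x.py
```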

Extract numbers from string from specific point

Example strings:
myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
myString2 = "/desktop/51232132561/Screenshots/photo_3501.png"
myString3 = "/desktop/12321516123/Screenshots/photo_7501.png"
myString4 = "/desktop/5234324324/Screenshots/photo_11501.png"
I had a look around, and couldn't really figure out a proper way to do this. I want to be able to also retrieve the last numbers of my strings after the photo_ part, and store them in another variable (string, not int or float). Furthermore, I don't need the number before /Screenshots. It would also be nice if it can work for any number length. The photo_ will always remain inside the string too.
You can write a regex that only matches the end of the string
>>> import re
>>> myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
>>> re.search(r"photo_(\d+)\.png$", myString1).group(1)
'0000'
This calls for a regex solution:
import re
mystring = "/desktop/2512754353/Screenshots/photo_0000.png"
your_value = re.findall(r'(photo_[0-9]+)', mystring)[0]
print(your_value) # photo_0000
Using regex:
import re
data = [
"/desktop/2512754353/Screenshots/photo_0000.png",
"/desktop/51232132561/Screenshots/photo_3501.png",
"/desktop/12321516123/Screenshots/photo_7501.png",
"/desktop/5234324324/Screenshots/photo_11501.png"
]
id_regex = re.compile(r".+photo_(\d+)\.png")
ids = [int(id_regex.match(d).groups()[0]) for d in data]
print(ids) # [0, 3501, 7501, 11501]
Simple way to do it with split function :
myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
first_split = myString1.split('photo_')
number = first_split[1].split('.')[0]
print(number)
Other way by using regex :
import re
myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
number = re.findall(r'\d+', myString1)[1]
print(number)
A more robust approach is to use pathlib in conjunction with re:
import re
from pathlib import Path
pattern = re.compile(r"(?<=photo_)[0-9]*")
pattern.search(Path(myString1).name).group(0)
> '0000'
Try this:
import re
myString1 = "/desktop/2512754353/Screenshots/photo_0000.png"
x=re.findall(r'photo_\d+',myString1.split("/")[-1])
print(x)
Another way using inbuilt string functions would be to slice the string between "photo" and ".png":
strings = [myString1, myString2, myString3, myString4]
>>> [s[s.rfind("photo")+6:s.rfind(".png")] for s in strings]
['0000', '3501', '7501', '11501']
I suggest:
import re
myString3 = "/desktop/12321516123/Screenshots/photo_7501.png"
s = re.findall(r'_\d+', myString3)[0]
int(s[1:])
output: 7501
You can use pathlib and split:
from pathlib import Path
fn="/desktop/5234324324/Screenshots/photo_11501.png"
Path(fn).stem.split('_')[-1]
# 11501
The pathlib property .stem is the name of the path stripped of the path to it and the extension:
>>> Path(fn).stem
'photo_11501'
Then either split or partition on the '_' delimiter:
>>> Path(fn).stem.partition('_')
('photo', '_', '11501')
>>> Path(fn).stem.split('_')
['photo', '11501']
You can use split or partition entirely on strings that represent paths:
>>> fn.partition('.png')[0].partition('_')[-1]
'11501'
But using pathlib lets you work with paths produced by a glob or another method, and it is likely more robust and more portable across platforms.
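Putting the pathlib and regex ideas together, here is a minimal sketch that keeps the number as a string, as the question asks. The names PHOTO_RE and photo_number are hypothetical, and PurePosixPath is chosen on the assumption that the paths always use forward slashes:

```python
import re
from pathlib import PurePosixPath

# Digits that directly follow 'photo_' in the file name.
PHOTO_RE = re.compile(r'photo_(\d+)')


def photo_number(path):
    """Return the digits after 'photo_' as a string, or None if absent."""
    # .name is the last path component, so digits in directory names
    # (e.g. /desktop/2512754353/) are never considered.
    match = PHOTO_RE.search(PurePosixPath(path).name)
    return match.group(1) if match else None


paths = [
    "/desktop/2512754353/Screenshots/photo_0000.png",
    "/desktop/51232132561/Screenshots/photo_3501.png",
    "/desktop/12321516123/Screenshots/photo_7501.png",
    "/desktop/5234324324/Screenshots/photo_11501.png",
]
print([photo_number(p) for p in paths])  # ['0000', '3501', '7501', '11501']
```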

How to split a string from the end after a certain number of occurrences of a character

How to split the string below after the 2nd occurrence of '/' from the end:
/u01/dbms/orcl/product/11.2.0.4/db_home
Expected output is :
/u01/dbms/orcl/product/
Thanks.
Do not use split, use rsplit instead! It's much simpler and faster.
s = '/u01/dbms/orcl/product/11.2.0.4/db_home'
result = s.rsplit('/', 2)[0] + '/'
string = "/u01/dbms/orcl/product/11.2.0.4/db_home"
split_string = string.split('/')
expected_output = "/".join(split_string[:-2]) + "/"
You can also change -2 to drop however many trailing path components you need.
If you can parse it as a filepath, I recommend pathlib, try:
from pathlib import Path
p = Path('/u01/dbms/orcl/product/11.2.0.4/db_home')
p.parent.parent # Path object for /u01/dbms/orcl/product (no trailing slash)
path = '/u01/dbms/orcl/product/11.2.0.4/db_home'
output = '/'.join(path.split('/')[:-2]) + '/'
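The rsplit and pathlib answers can be wrapped up for an arbitrary number of trailing components. This is a sketch under the assumption that the path contains at least n separators; drop_last_components is a hypothetical helper name:

```python
from pathlib import PurePosixPath


def drop_last_components(path, n, sep='/'):
    """Remove the last n components, keeping a trailing separator."""
    # rsplit splits at most n times starting from the right, so element 0
    # is everything before the n-th separator counted from the end.
    return path.rsplit(sep, n)[0] + sep


s = '/u01/dbms/orcl/product/11.2.0.4/db_home'
print(drop_last_components(s, 2))  # /u01/dbms/orcl/product/

# pathlib equivalent: parents[0] is the parent, parents[1] the grandparent.
# Note that the string form has no trailing slash.
print(str(PurePosixPath(s).parents[1]))  # /u01/dbms/orcl/product
```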

how to split the text using python?

f_output.write('\n{}, {}\n'.format(filename, summary))
I am printing the file name as output, and I get VCALogParser_output_ARW.log, VCALogParser_output_CZC.log and so on. But I am only interested in printing ARW, CZC and so on. Can someone please tell me how to split this text?
filename.split('_')[-1].split('.')[0]
this will give you : 'ARW'
summary.split('_')[-1].split('.')[0]
and this will give you: 'CZC'
If you are only interested in CZC and ARW without the .log then, you can do it with re.search method:
>>> import re
>>> s1 = 'VCALogParser_output_ARW.log'
>>> s2 = 'VCALogParser_output_CZC.log'
>>> re.search(r'.*_(.*)\.log', s1).group(1)
'ARW'
>>> re.search(r'.*_(.*)\.log', s2).group(1)
'CZC'
Or, better, compile your pattern as p and then call its search method when formatting your string:
>>> p = re.compile(r'.*_(.*)\.log')
>>>
>>> '\n{}, {}\n'.format(p.search(s1).group(1), p.search(s2).group(1))
'\nARW, CZC\n'
Also, it might be helpful to use re.sub with a positive lookbehind and a named group:
>>> p = re.compile(r'.*(?<=_)(?P<mystr>[a-zA-Z0-9]+)\.log$')
>>>
>>>
>>> p.sub(r'\g<mystr>', s1)
'ARW'
>>> p.sub(r'\g<mystr>', s2)
'CZC'
>>>
>>>
>>> '\n{}, {}\n'.format(p.sub(r'\g<mystr>', s1), p.sub(r'\g<mystr>', s2))
'\nARW, CZC\n'
If you cannot or do not want to use the re module, you can compute the lengths of the parts you don't need and slice your strings with them:
>>> i1 = len('VCALogParser_output_')
>>> i2 = len ('.log')
>>>
>>> '\n{}, {}\n'.format(s1[i1:-i2], s2[i1:-i2])
'\nARW, CZC\n'
But keep in mind that the above is valid as long as you have those common strings in all of your string variables.
fname.split('_')[-1]
is rough, but this will give you 'CZC.log', 'ARW.log' and so on, assuming that all files have the same underscore-delimited format.
If the format of the file is always such that it ends with _ARW.log or _CZC.log this is really easy to do just using the standard string split() method, with two consecutive splits:
shortname = filename.split("_")[-1].split('.')[0]
Or, to make it (arguably) a bit more readable, we can use the os module:
shortname = os.path.splitext(filename)[0].split("_")[-1]
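As a small self-contained sketch of the os.path.splitext approach, the helper below (short_name is a hypothetical name) uses str.rpartition, which also copes with names that contain no underscore at all:

```python
import os


def short_name(filename):
    """Return the part between the last underscore and the extension."""
    stem, _ext = os.path.splitext(filename)  # e.g. 'VCALogParser_output_ARW'
    # rpartition returns ('', '', stem) when there is no underscore,
    # so index 2 is then the whole stem.
    return stem.rpartition('_')[2]


print(short_name('VCALogParser_output_ARW.log'))  # ARW
print(short_name('VCALogParser_output_CZC.log'))  # CZC
```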
You can also try:
>>> s1 = 'VCALogParser_output_ARW.log'
>>> s2 = 'VCALogParser_output_CZC.log'
>>> s1.split('_')[2].split('.')[0]
'ARW'
>>> s2.split('_')[2].split('.')[0]
'CZC'
To parse the file name, strip both the .log extension and the VCALogParser_output_ prefix; str.replace is enough for that, and there is no need for str.split.
Use os.linesep if you need the platform's line separator explicitly. Note that for a file opened in text mode (the default), writing '\n' is already translated to the platform's line ending, and the Python documentation advises against writing os.linesep in that mode.
The code below produces the desired result after applying the steps above:
import os
filename = 'VCALogParser_output_SOME_NAME.log'
summary = 'some summary'
fname = filename.replace('VCALogParser_output_', '').replace('.log', '')
linesep = os.linesep
f_output.write('{linesep}{fname}, {summary}{linesep}'
.format(fname=fname, summary=summary, linesep=linesep))
# or if vars in execution scope strictly controlled pass locals() into format
f_output.write('{linesep}{fname}, {summary}{linesep}'.format(**locals()))

How to extract a substring from a string in Python?

Suppose I have a string, text2='C:\Users\Sony\Desktop\f.html', and I want to separate "C:\Users\Sony\Desktop" and "f.html" and store them in different variables. What should I do? I tried regular expressions but wasn't successful.
os.path.split does what you want:
>>> import os
>>> help(os.path.split)
Help on function split in module ntpath:
split(p)
Split a pathname.
Return tuple (head, tail) where tail is everything after the final slash.
Either part may be empty.
>>> os.path.split(r'c:\users\sony\desktop\f.html')
('c:\\users\\sony\\desktop', 'f.html')
>>> path,filename = os.path.split(r'c:\users\sony\desktop\f.html')
>>> path
'c:\\users\\sony\\desktop'
>>> filename
'f.html'
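The session above was run on Windows (the help text mentions ntpath); on other systems os.path.split does not treat backslashes as separators. A sketch that should behave the same on any OS uses pathlib.PureWindowsPath. Note the raw-string prefix: without it, '\U' in the literal from the question is a syntax error in Python 3 and '\f' would be read as a form feed. The variable names directory and filename are just illustrative:

```python
from pathlib import PureWindowsPath

# Raw string so the backslashes are kept literally.
text2 = r'C:\Users\Sony\Desktop\f.html'

p = PureWindowsPath(text2)    # parses Windows-style paths on any platform
directory = str(p.parent)     # everything before the last component
filename = p.name             # the last component

print(directory)  # C:\Users\Sony\Desktop
print(filename)   # f.html
```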
