remove part of path

remove part of path - python

I have the following data:
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA
However, I don't want "Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/" in it; I want the path to be:
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA
With regular expressions I basically want to match anything that holds *.torrent./. How can I accomplish this with a regexp?
Thanks!

You don't even need regular expressions. You can use os.path.dirname and os.path.basename:
os.path.join(os.path.dirname(os.path.dirname(path)),
os.path.basename(path))
where path is the original path to the file.
Alternatively, you can also use os.path.split as follows:
dirname, filename = os.path.split(path)
os.path.join(os.path.dirname(dirname), filename)
Note: this works under the assumption that the part you want to remove is the name of the directory that directly contains the file, as in the example in the question.
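For illustration, the same idea can be sketched with pathlib instead of os.path. This is only a sketch under the same assumption (the unwanted part is the directory that directly contains the file). The helper name drop_parent_dir is made up for this example, and PurePosixPath is used so that the result does not depend on the operating system.

```python
from pathlib import PurePosixPath


def drop_parent_dir(path):
    """Return path without the directory that directly contains its last component."""
    p = PurePosixPath(path)
    # p.parent is the ".torrent" directory; p.parent.parent is the directory above it.
    return str(p.parent.parent / p.name)


path = ("/share/Downloads/Videos/Movies/"
        "Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/"
        "Big.Buck.Bunny.720p.Bluray.x264-BLA")
print(drop_parent_dir(path))
```

Like the os.path version, this removes the parent directory whatever it is called, so check the name first if only ".torrent" directories should be dropped.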

You can do this without using regexp:
>>> x = '/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA'
>>> x.rfind('.torrent')
66
>>> x[:x.rfind('.torrent')]
'/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA'
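One caveat with this approach: rfind returns -1 when the marker is absent, and slicing with -1 would silently drop the last character. A minimal sketch of a guarded version follows; the function name cut_at_last is hypothetical. It also relies on the file name being identical to the directory name without ".torrent", as in the question.

```python
def cut_at_last(path, marker=".torrent"):
    """Cut path at the last occurrence of marker; return it unchanged if marker is absent."""
    idx = path.rfind(marker)
    # rfind returns -1 when marker is not found; path[:-1] would drop a character.
    return path if idx == -1 else path[:idx]


x = ("/share/Downloads/Videos/Movies/"
     "Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/"
     "Big.Buck.Bunny.720p.Bluray.x264-BLA")
print(cut_at_last(x))
print(cut_at_last("/share/Downloads/readme"))
```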

I basically want to match anything that holds *.torrent./, how can I accomplish this in regexp?
You can use:
[^/]*\.torrent/
Assuming the last . was a typo.

Given path='/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA'
You can do it with regular expression as
re.sub(r"/[^/]*\.torrent/", "/", path)
You can also do it without regex as
'/'.join(x for x in path.split("/") if x.find("torrent") == -1)
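A slightly more defensive sketch of the regex approach: the lookahead leaves the slash after the directory in place, so the remaining components stay separated, and only components that actually end in ".torrent" are removed (a directory that merely contains the word "torrent" is kept). The name strip_torrent_dirs and the second test path are made up for illustration.

```python
import re

# A path component ending in ".torrent" that is followed by another component.
TORRENT_DIR = re.compile(r"/[^/]*\.torrent(?=/)")


def strip_torrent_dirs(path):
    """Remove every directory component whose name ends in '.torrent'."""
    return TORRENT_DIR.sub("", path)


path = ("/share/Downloads/Videos/Movies/"
        "Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/"
        "Big.Buck.Bunny.720p.Bluray.x264-BLA")
print(strip_torrent_dirs(path))
# A directory called "torrents" does not end in ".torrent", so it is kept.
print(strip_torrent_dirs("/a/torrents/x.torrent/x"))
```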

Your question is a bit vague, but here's one way to strip off what you want:
import re
s = "/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA"
c = re.compile("(/.*/).*?torrent/(.*)")
m = re.match(c, s)
path = m.group(1)
file = m.group(2)
print(path + file)
This prints:
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA

Related

Replacing parts of a string containing directory paths using Python

I have a large string with potentially many paths in it resembling this structure:
dirA/dirB/a1ed4f3b-a046-4fbf-bb70-0774bd7bfcn
and I need to replace everything before the a1ed4f3b-a046-4fbf-bb70-0774bd7bfcn part of the string with "local/" such that the
result will look like this:
local/a1ed4f3b-a046-4fbf-bb70-0774bd7bfcn
The string could contain more than just dirA/dirB/ at
the start of the string too.
How can I do this string manipulation in Python?

Using regular expressions, you can replace everything up to and including the last "/" with "local/":
import re
s = "dirA/dirB/a1ed4f3b-a046-4fbf-bb70-0774bd7bfcn"
re.sub(r'.*(\/.*)',r'local\1',s)
and you obtain:
'local/a1ed4f3b-a046-4fbf-bb70-0774bd7bfcn'

Use the os module, for example:
import os
path = "dirA/dirB/a1ed4f3b-a046-4fbf-bb70-0774bd7bfcn"
print(os.path.join("local", os.path.basename(path)))

Another alternative is to split the string on "/" and then concatenate "local/" with the last element of the resulting list.
s = "dirA/dirB/a1ed4f3b-a046-4fbf-bb70-0774bd7bfcn"
print("local/" + s.split("/")[-1])
#'local/a1ed4f3b-a046-4fbf-bb70-0774bd7bfcn'

How does this look?
import os
inputstring = 'dirA/dirB/a1ed4f3b-a046-4fbf-bb70-0774bd7bfcn'
filename = os.path.basename(inputstring)
localname = 'local'
os.path.join(localname, filename)
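Since the question mentions many paths, the basename approach could be wrapped in a small helper and applied to each one. This is a sketch only: relocate is a hypothetical name, the second sample path is invented, and posixpath is used so that "/" is always the separator regardless of platform.

```python
import posixpath


def relocate(path, new_root="local"):
    """Replace everything before the last path component with new_root."""
    return posixpath.join(new_root, posixpath.basename(path))


paths = [
    "dirA/dirB/a1ed4f3b-a046-4fbf-bb70-0774bd7bfcn",
    "x/y/z/some-other-id",  # invented second path, for illustration only
]
print([relocate(p) for p in paths])
```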

renaming the filename with regex in python using re

I have a folder containing multiple files with names like the following example:
_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001_ICGCDBDE20130916001.rsem.bam
I want to rename each file to just its last part, here ICGCDBDE20130916001.rsem.bam; that part differs from file to file. In other words, the new name should be the string ending in .rsem.bam that follows the last "_", and every file in the directory should be renamed this way. I thought of using a regular expression and came up with the pattern below:
pat=r'_(.*)_(.*)_(.*)_(.*)_(.\w+)'
This splits my filename as desired, and I could rename the files using a variable that holds only pat[4]. I want to use Python because I am learning it, starting with small tasks such as file renaming and later converting my workflows to Python, but I am unable to get this to work. How should I do this in Python? I am also unsure what the corresponding bash regex would be, since this is a long filename and I am new to such patterns. Below is my code; it does not rename anything yet and only checks whether the pattern matches. How should I change it to actually rename the files?
import re
import os
_src = "path/bam/test/"
_ext = ".rsem.bam"
endsWithNumber = re.compile(r'_(.*)_(.*)_(.*)_(.*)_(.\w+)'+(re.escape(_ext))+'$')
print(endsWithNumber)
for filename in os.listdir(_src):
    m = endsWithNumber.search(filename)
    print(m)
I would appreciate both in python and bash, however, I would prefer python for my own understanding and future learning.

You can use rpartition, which separates the part you want from the rest and returns a three-part tuple.
Given:
>>> fn
'_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001_ICGCDBDE20130916001.rsem.bam'
You can do:
>>> fn.rpartition('_')
('_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001', '_', 'ICGCDBDE20130916001.rsem.bam')
Then:
>>> _,sep,new_name=fn.rpartition('_')
>>> new_name
'ICGCDBDE20130916001.rsem.bam'
If you want to use a regex:
>>> re.search(r'_([^_]+$)', fn).group(1)
'ICGCDBDE20130916001.rsem.bam'
As a practical matter, you would test to see if there was a match before using group(1):
>>> m=re.search(r'_([^_]+$)', fn)
>>> new_name = m.group(1) if m else fn
For sed you can do:
$ echo "$fn" | sed -E 's/.*_([^_]*)$/\1/'
ICGCDBDE20130916001.rsem.bam
Or in Bash, same regex:
$ [[ $fn =~ _([^_]*)$ ]] && echo "${BASH_REMATCH[1]}"
ICGCDBDE20130916001.rsem.bam
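To go from extracting the name to actually renaming files, one possible sketch in Python is to first build a list of (old, new) paths and only then call os.rename. The helper names short_name and plan_renames are hypothetical, and the ".rsem.bam" filter is an assumption taken from the question.

```python
import os


def short_name(filename, sep="_"):
    """Return the part after the last sep, or the whole name if sep is absent."""
    return filename.rpartition(sep)[2]


def plan_renames(directory, ext=".rsem.bam"):
    """List (old_path, new_path) pairs for files in directory that end with ext."""
    plan = []
    for name in sorted(os.listdir(directory)):
        new = short_name(name)
        if name.endswith(ext) and new != name:
            # os.listdir returns bare names, so join them with the directory.
            plan.append((os.path.join(directory, name),
                         os.path.join(directory, new)))
    return plan


fn = ("_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_"
      "C2704ACXX.nopd.AOCS_001_ICGCDBDE20130916001.rsem.bam")
print(short_name(fn))
```

After inspecting the planned pairs, the renaming itself would be a loop such as "for old, new in plan_renames(_src): os.rename(old, new)". Two files that shorten to the same name would collide, which this sketch does not check.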

You can use a list comprehension:
import re
import os
_src = "path/bam/test/"
filenames = os.listdir(_src)
new_s = [re.search(r"[a-zA-Z0-9]+\.rsem\.bam", filename) for filename in filenames]
for first, second in zip(filenames, new_s):
    if second is not None:
        # os.listdir returns bare names, so join them with the directory
        os.rename(os.path.join(_src, first), os.path.join(_src, second.group(0)))

Too much work.
newname = oldname.rsplit('_', 1)[1]

import os
fname = 'YOUR_FILENAME.avi'
fname1 = fname.split('.')
fname2 = str(fname1[0]) + '.mp4'
os.rename('path to your source file' + str(fname), 'path to your destination file' + str(fname2))
fname = fname2

How to get the substring from a String in python

I have a string path='/home/user/Desktop/My_file.xlsx'.
I want to extract the "My_file" substring. I am using Django framework for python.
I have tried to get it with:
re.search('/(.+?).xlsx', path).group(1)
but it returns the whole path again.
Can someone please help.

If you know that the file extension is always the same (e.g. ".xlsx"), I would suggest doing it this way:
import os
filename_full = os.path.basename(path)
filename = filename_full.split(".xlsx")[0]
Hope it helps

More generally:
import os
filename = os.path.basename(os.path.splitext(path)[0])
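As a side note on why the attempt in the question fails: '/(.+?).xlsx' starts matching at the first "/", so the lazy group has to grow across all directories until ".xlsx" is reached. A small sketch of two alternatives follows, one with pathlib and one with a regex anchored to the last path component. The variable names are illustrative only, and PurePosixPath is used so that the result is the same on every platform.

```python
import re
from pathlib import PurePosixPath

path = '/home/user/Desktop/My_file.xlsx'

# pathlib: .stem is the last component without its final suffix.
stem = PurePosixPath(path).stem

# regex: [^/]+ cannot cross a "/", so only the last component is captured.
m = re.search(r'([^/]+)\.xlsx$', path)
regex_stem = m.group(1) if m else None

print(stem, regex_stem)
```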

If you need to match the exact extension:
# (?<=/) ensures that the match is preceded by /
# [^/]*\.xlsx matches anything but / followed by a literal .xlsx
mo1 = re.search(r'(?<=/)[^/]*\.xlsx', path).group(0)
print(mo1)
My_file.xlsx
otherwise:
path='/home/user/Desktop/My_file.xlsx'
with regex:
mo = re.search(r'(?<=/)([\w.]+$)',path)
print(mo.group(1))
My_file.xlsx
with rsplit:
my_file = path.rsplit('/', 1)[-1]
print(my_file)
My_file.xlsx

Python and regex - how to find anytext_NUMBER_svm.pkl

I have file names that are in this format:
anytext_NUMBER_svm.pkl
I need to loop through all the files in a directory and find the ones that match this format. For example, given:
file1.txt
file2.txt
anytext_1_svm.pkl
anytext_2_svm.pkl
anytext_3_svm.pkl
The matched files will be this:
anytext_1_svm.pkl
anytext_2_svm.pkl
anytext_3_svm.pkl
How do I use a Python regex to do this?

An option that:
doesn't use re
makes sure the comparison is only on the filename part - not part of a path
restricts the number of filename patterns to validate further using iglob
Code:
from glob import iglob
import os.path
for fname in iglob('*_*_svm.pkl'):
    path, name = os.path.split(fname)
    anytext, digit, rest = name.split('_', 2)
    if digit.isdigit():  # add criteria for anytext if required...
        pass  # ....

This regex should solve your problem:
>>> import re
>>> regex = re.compile(r'.+_\d+_svm\.pkl')
>>> regex.search('anytext_1_svm.pkl') is not None
True
True
But you should definitely take a look at the documentation: http://docs.python.org/library/re.html

I would suggest a review of this page:
http://docs.python.org/py3k/library/re.html#module-re
It will help you understand how to write regular expressions and make sure that you are matching things properly. For the number, use [0-9]+ (one or more digits; [0-9]* would also accept an empty number), use _ to separate your groups, and add a small conditional that checks for a match; this will be a quick project.

import glob
file_list = glob.glob('anytext_[0-9]_svm.pkl')

The regex to match "anytext_NUMBER_svm.pkl" is very simple:
r'.+_\d+_svm\.pkl'
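A minimal sketch of how that pattern might be applied to a list of names. It uses fullmatch so that the whole file name has to fit the pattern; the function name matching_files and the ".bak" entry are made up for illustration, and in practice the names would come from something like os.listdir.

```python
import re

PATTERN = re.compile(r'.+_\d+_svm\.pkl')


def matching_files(names):
    """Keep only the names that match anytext_NUMBER_svm.pkl in full."""
    return [name for name in names if PATTERN.fullmatch(name)]


names = [
    'file1.txt',
    'file2.txt',
    'anytext_1_svm.pkl',
    'anytext_2_svm.pkl',
    'anytext_3_svm.pkl',
    'anytext_1_svm.pkl.bak',  # invented entry: rejected because of the trailing ".bak"
]
print(matching_files(names))
```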

Getting file extension using pattern matching in python

I am trying to find the extension of a file, given its name as a string. I know I can use the function os.path.splitext but it does not work as expected in case my file extension is .tar.gz or .tar.bz2 as it gives the extensions as gz and bz2 instead of tar.gz and tar.bz2 respectively.
So I decided to find the extension of files myself using pattern matching.
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
>>> gz # I want this to come as 'tar.gz'
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.bz2').group('ext'))
>>> bz2 # I want this to come 'tar.bz2'
I am using (?P<ext>...) in my pattern matching as I also want to get the extension.
Please help.

import os
root, ext = os.path.splitext('a.tar.gz')
if ext in ['.gz', '.bz2']:
    ext = os.path.splitext(root)[1] + ext
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

>>> print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
gz
>>> print(re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
tar.gz
The ? after .* makes the quantifier non-greedy: it matches as little as possible, so instead of .* consuming ".tar" as well, .*? stops at the shortest prefix that still lets tar.gz match.
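To see the difference on a few more inputs, here is a small sketch that runs the greedy and the non-greedy pattern side by side. The helper ext and the extra file names are invented for this example.

```python
import re

GREEDY = re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$')
LAZY = re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$')


def ext(pattern, name):
    """Return the 'ext' group for name, or None if the pattern does not match."""
    m = pattern.match(name)
    return m.group('ext') if m else None


for name in ['a.tar.gz', 'a.b.tar.bz2', 'my.file.txt', 'README']:
    print(name, ext(GREEDY, name), ext(LAZY, name))
```

With the non-greedy prefix, the alternatives are tried at the earliest dot first, so the compound extensions win when present, while ordinary names still yield only their last extension.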

I have an idea that is much easier than racking your brain over a regex, even if it might sound silly:
name="filename.tar.gz"
extensions=('.tar.gz','.py')
[x for x in extensions if name.endswith(x)]

Starting from phihag's answer:
import os
DOUBLE_EXTENSIONS = ['tar.gz', 'tar.bz2']  # Add extra extensions where desired.
def guess_extension(filename):
    """
    Guess the extension of given filename.
    """
    root, ext = os.path.splitext(filename)
    if any(filename.endswith(x) for x in DOUBLE_EXTENSIONS):
        root, first_ext = os.path.splitext(root)
        ext = first_ext + ext
    return root, ext

This is simple and works with both single and multiple extensions (it returns the file name without any extension):
In [1]: '/folder/folder/folder/filename.tar.gz'.split('/')[-1].split('.')[0]
Out[1]: 'filename'
In [2]: '/folder/folder/folder/filename.tar'.split('/')[-1].split('.')[0]
Out[2]: 'filename'
In [3]: 'filename.tar.gz'.split('/')[-1].split('.')[0]
Out[3]: 'filename'

Continuing from phihag's answer: to generically split off all double or triple extensions, such as CropQDS275.jpg.aux.xml, loop while '.' is still in the name:
tempfilename, file_extension = os.path.splitext(filename)
while '.' in tempfilename:
    tempfilename, tempfile_extension = os.path.splitext(tempfilename)
    file_extension = tempfile_extension + file_extension
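The loop above treats every dot as the start of an extension, which may be too eager for names such as release.v1.2.tar.bz2. One alternative sketch checks a short list of known compound extensions first and falls back to os.path.splitext otherwise. The list contents, the function name split_ext, and the sample file names are assumptions made for illustration.

```python
import os

# Compound extensions to treat as a single unit; extend as needed (assumed list).
KNOWN_COMPOUND = ('.tar.gz', '.tar.bz2')


def split_ext(filename):
    """Split filename into (root, ext), keeping known compound extensions whole."""
    lower = filename.lower()
    for ext in KNOWN_COMPOUND:
        if lower.endswith(ext):
            return filename[:-len(ext)], filename[-len(ext):]
    # Anything else gets the standard single-extension split.
    return os.path.splitext(filename)


for name in ['a.tar.gz', 'release.v1.2.tar.bz2', 'notes.txt']:
    print(split_ext(name))
```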
