renaming the filename with regex in python using re - python

I have a folder containing many files whose names look like the following example:
_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001_ICGCDBDE20130916001.rsem.bam
I want to rename each file so that only the last part remains, here ICGCDBDE20130916001.rsem.bam (this part differs from file to file). The part to keep, *.rsem.bam, is the last field when the name is split on "_". Every file in the directory should be renamed this way. I was thinking of using a regular expression, so I came up with the pattern below:
pat=r'_(.*)_(.*)_(.*)_(.*)_(.\w+)'
This splits my filename as desired, and I could rename the files by taking only the last group. I want to use Python because I am learning it, starting with small tasks such as file renaming and later converting my workflows to Python, but I am unable to make this work. How should I do this in Python? I am also stuck on what the corresponding bash regex would be, since the filename is long and I am new to this. Below is my code. It does not rename anything yet; it is only meant to check whether the pattern matches. How should I change it to actually rename the files?
import re
import os
_src = "path/bam/test/"
_ext = ".rsem.bam"
endsWithNumber = re.compile(r'_(.*)_(.*)_(.*)_(.*)_(.\w+)'+(re.escape(_ext))+'$')
print(endsWithNumber)
for filename in os.listdir(_src):
    m = endsWithNumber.search(filename)
    print(m)
I would appreciate both in python and bash, however, I would prefer python for my own understanding and future learning.

You can use rpartition, which separates the part you want from the rest and returns a three-part tuple.
Given:
>>> fn
'_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001_ICGCDBDE20130916001.rsem.bam'
You can do:
>>> fn.rpartition('_')
('_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001', '_', 'ICGCDBDE20130916001.rsem.bam')
Then:
>>> _,sep,new_name=fn.rpartition('_')
>>> new_name
'ICGCDBDE20130916001.rsem.bam'
If you want to use a regex:
>>> re.search(r'_([^_]+$)', fn).group(1)
'ICGCDBDE20130916001.rsem.bam'
As a practical matter, you would test to see if there was a match before using group(1):
>>> m=re.search(r'_([^_]+$)', fn)
>>> new_name = m.group(1) if m else fn
For sed you can do:
$ echo "$fn" | sed -E 's/.*_([^_]*)$/\1/'
ICGCDBDE20130916001.rsem.bam
Or in Bash, same regex:
$ [[ $fn =~ _([^_]*)$ ]] && echo "${BASH_REMATCH[1]}"
ICGCDBDE20130916001.rsem.bam
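To go from extracting the name to actually renaming the files, the rpartition idea can be combined with os.rename. The following is only a minimal sketch: the function name rename_to_last_field and its dry_run flag are hypothetical (they are not from the question), and it assumes you want to skip a file whenever the target name already exists.

```python
import os


def rename_to_last_field(src_dir, ext=".rsem.bam", dry_run=True):
    """Rename files ending in `ext` to the text after their last underscore.

    Returns a list of (old_name, new_name) pairs. With dry_run=True nothing
    is renamed, so the plan can be inspected first.
    """
    planned = []
    for filename in sorted(os.listdir(src_dir)):
        if not filename.endswith(ext):
            continue
        _, sep, new_name = filename.rpartition("_")
        if not sep:
            continue  # no underscore in the name: nothing to strip
        old_path = os.path.join(src_dir, filename)
        new_path = os.path.join(src_dir, new_name)
        if os.path.exists(new_path):
            continue  # assumption: never overwrite an existing file
        planned.append((filename, new_name))
        if not dry_run:
            os.rename(old_path, new_path)
    return planned
```

Calling rename_to_last_field("path/bam/test/") first and looking at the returned pairs, then calling it again with dry_run=False, is one cautious way to use it.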

You can use list comprehension
import re
import os
_src = "path/bam/test/"
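names = os.listdir(_src)
new_s = [re.search(r"[a-zA-Z0-9]+\.rsem\.bam", filename) for filename in names]
for first, second in zip(names, new_s):
    if second is not None:
        # Join with _src so the rename works from any working directory.
        os.rename(os.path.join(_src, first), os.path.join(_src, second.group(0)))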

Too much work.
newname = oldname.rsplit('_', 1)[1]

import os
fname = 'YOUR_FILENAME.avi'
fname1 = fname.split('.')
fname2 = str(fname1[0]) + '.mp4'
os.rename('path to your source file' + str(fname), 'path to your destination file' + str(fname2))
fname = fname2

Related

Python RE Directories and slashes

Let's say I have a string that is a root directory that has been entered
'C:/Users/Me/'
Then I use os.listdir() and join with it to create a list of subdirectories.
I end up with a list of strings that are like below:
'C:/Users/Me/Adir\Asubdir\'
and so on.
I want to split the subdirectories and capture each directory name as its own element. Below is one attempt. I seem to be having issues with the \ and / characters. I assume \ is the escape character, so to me '[\\/]' says "look for \ or /", and '[\\/]([\w\s]+)[\\/]' as a match pattern should find any word between two slashes. But the output is only ['/Users/'] and nothing else is matched. So I then added an escape for the forward slash:
'[\\\/]([\w\s]+)[\\\/]'
However, my output then only becomes ['Users','ADir'] so that is confusing the crud out of me.
My question is namely how do I tokenize each directory from a string using both \ and / but maybe also why is my RE not working as I expect?
Minimal Example:
import re, os
info = re.compile('[\\\/]([\w ]+)[\\\/]')
root = 'C:/Users/i12500198/Documents/Projects/'
def getFiles(wdir=os.getcwd()):
    files = (os.path.join(wdir,file) for file in os.listdir(wdir)
             if os.path.isfile(os.path.join(wdir,file)))
    return list(files)
def getDirs(wdir=os.getcwd()):
    dirs = (os.path.join(wdir,adir) for adir in os.listdir(wdir)
            if os.path.isdir(os.path.join(wdir,adir)))
    return list(dirs)
def walkSubdirs(root,below=[]):
    subdirs = getDirs(root)
    for aDir in subdirs:
        below.append(aDir)
        walkSubdirs(aDir,below)
    return below
subdirs = walkSubdirs(root)
for aDir in subdirs:
    files = getFiles(aDir)
    for f in files:
        finfo = info.findall(f)
        print(f)
        print(finfo)
I want to split the subdirectories and capture each directory name as its own element
Instead of regular expressions, I suggest you use one of Python's standard functions for parsing filesystem paths.
Here is one using pathlib:
from pathlib import PureWindowsPath
# Raw string, so that \A and \2 are not interpreted as escape sequences
p = PureWindowsPath(r"C:/Users/Me/ADir\ASub Dir\2 x 2 Dir")
p.parts
#=> ('C:\\', 'Users', 'Me', 'ADir', 'ASub Dir', '2 x 2 Dir')
Note that the behaviour of pathlib.Path depends on the system running Python. Since I'm on a Linux machine, I actually used pathlib.PureWindowsPath here. I believe the output should be accurate for those of you on Windows.
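As for why the regex in the question skips names: re.findall returns non-overlapping matches, and the pattern consumes the closing slash of each match, so that slash is no longer available as the opening slash of the next directory. Every other name is therefore missed. If you do want a regex, here is a small sketch of two ways around that; the sample path is made up for illustration.

```python
import re

# Hypothetical sample path mixing both separators (raw string keeps the backslashes).
path = r"C:/Users/Me/ADir\ASubDir\file.txt"

# Question-style pattern: each match swallows its closing slash,
# so the following directory has no opening slash left to match.
skipping = re.findall(r"[\\/]([\w ]+)[\\/]", path)

# A lookahead checks for the closing slash without consuming it.
with_lookahead = re.findall(r"[\\/]([\w ]+)(?=[\\/])", path)

# Or simply split on runs of either separator and drop empty pieces.
parts = [piece for piece in re.split(r"[\\/]+", path) if piece]

print(skipping)        # ['Users', 'ADir']
print(with_lookahead)  # ['Users', 'Me', 'ADir', 'ASubDir']
print(parts)           # ['C:', 'Users', 'Me', 'ADir', 'ASubDir', 'file.txt']
```

The split version is usually the least surprising of the three, though the path functions above remain the more robust choice.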

How to get the substring from a String in python

I have a string path='/home/user/Desktop/My_file.xlsx'.
I want to extract the "My_file" substring. I am using Django framework for python.
I have tried to get it with:
re.search('/(.+?).xlsx', path).group(1)
but it returns the whole path again.
Can someone please help.
If you know that the file extension is always the same (e.g. ".xlsx"), I would suggest doing it this way:
import os
filename_full = os.path.basename(path)
filename = filename_full.split(".xlsx")[0]
Hope it helps
More generally:
import os
filename = os.path.basename(os.path.splitext(path)[0])
If you need to match the exact extension:
# (?<=/) ensures that the match is preceded by /
# [^/]*\.xlsx matches anything but / followed by a literal .xlsx
mo1 = re.search(r'(?<=/)[^/]*\.xlsx', path).group(0)
print(mo1)
My_file.xlsx
otherwise:
path='/home/user/Desktop/My_file.xlsx'
with regex:
mo = re.search(r'(?<=/)([\w.]+$)',path)
print(mo.group(1))
My_file.xlsx
with rsplit:
my_file = path.rsplit('/', 1)[-1]
print(my_file)
My_file.xlsx
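Note that the question asks for "My_file" without the extension. The original attempt fails because the search starts at the first / in the path and the lazy .+? then has to grow until it reaches .xlsx, so everything in between is captured. A minimal sketch of two ways to get just the name, assuming a POSIX-style path as in the question:

```python
import re
from pathlib import PurePosixPath

path = "/home/user/Desktop/My_file.xlsx"

# pathlib: .stem is the last path component without its final suffix.
name_only = PurePosixPath(path).stem

# Regex alternative: [^/]+ cannot cross a slash, \. is a literal dot,
# and $ anchors the match at the end of the string.
match = re.search(r"([^/]+)\.xlsx$", path)
via_regex = match.group(1) if match else None

print(name_only, via_regex)  # My_file My_file
```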

remove part of path

I have the following data:
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA
However, I don't want to have "Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/" in it; I want the path to be like:
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA
With regular expressions I basically want to match anything that holds *.torrent./, how can I accomplish this in regexp?
Thanks!
You don't even need regular expressions. You can use os.path.dirname and os.path.basename:
os.path.join(os.path.dirname(os.path.dirname(path)),
os.path.basename(path))
where path is the original path to the file.
Alternatively, you can also use os.path.split as follows:
dirname, filename = os.path.split(path)
os.path.join(os.path.dirname(dirname), filename)
Note: this works under the assumption that what you want to remove is the name of the directory that contains the file, as in the example in the question.
You can do this without using regexp:
>>> x = unicode('/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA')
>>> x.rfind('.torrent')
66
>>> x[:x.rfind('.torrent')]
u'/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA'
I basically want to match anything that holds *.torrent./, how can I accomplish this in regexp?
You can use:
[^/]*\.torrent/
Assuming the last . was a typo.
Given path='/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA'
You can do it with regular expression as
re.sub(r"/[^/]*\.torrent/", "/", path)
You can also do it without regex as
'/'.join(x for x in path.split("/") if x.find("torrent") == -1)
Your question is a bit vague and unclear, but here's one way how to strip off what you want:
import re
s = "/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA"
c = re.compile("(/.*/).*?torrent/(.*)")
m = re.match(c, s)
path = m.group(1)
file = m.group(2)
print(path + file)
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA
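For reference, the two main ideas from the answers above (drop the containing directory, or delete the "*.torrent/" component with a regex) can be sketched in Python 3 as follows. The function names are hypothetical, and the sketch assumes POSIX-style paths like the one in the question.

```python
import re
from pathlib import PurePosixPath


def drop_parent_dir(path):
    """Remove the directory that directly contains the file."""
    p = PurePosixPath(path)
    return str(p.parent.parent / p.name)


def drop_torrent_dir(path):
    """Remove the first path component ending in .torrent, keeping the slash before it."""
    return re.sub(r"[^/]*\.torrent/", "", path, count=1)


path = ("/share/Downloads/Videos/Movies/"
        "Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/"
        "Big.Buck.Bunny.720p.Bluray.x264-BLA")

print(drop_parent_dir(path))
print(drop_torrent_dir(path))
```

Both calls print /share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA for this input. The first version removes the parent directory whatever it is called; the second only touches components that end in .torrent.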

Split filenames with python

I have file paths from which I want to keep only 'foo' and 'bar' after splitting.
dn = "C:\\X\\Data\\"
files
f= C:\\X\\Data\\foo.txt
f= C:\\X\\Data\\bar.txt
I have tried f.split(".",1)[0]
I thought that since dn and .txt are pre-defined I could subtract them, but no.
Split does not work for me.
How about using the proper path handling methods from os.path?
>>> f = 'C:\\X\\Data\\foo.txt'
>>> import os
>>> os.path.basename(f)
'foo.txt'
>>> os.path.dirname(f)
'C:\\X\\Data'
>>> os.path.splitext(f)
('C:\\X\\Data\\foo', '.txt')
>>> os.path.splitext(os.path.basename(f))
('foo', '.txt')
To deal with path and file names, it is best to use the built-in module os.path in Python. Have a look at the functions dirname, basename and split in that module.
Here is a simple example:
import os
from os import path
path_to_directory = "C:\\X\\Data"
for f in os.listdir(path_to_directory):
    name, extension = path.splitext(f)
    print(name)
Output
foo
bar
These two lines return a list of file names without extensions:
import os
[fname.rsplit('.', 1)[0] for fname in os.listdir("C:\\X\\Data\\")]
It seems you've left out some code. From what I can tell you're trying to split the contents of the file.
To fix your problem, you need to operate on a list of the files in the directory. That is what os.listdir does for you. I've also added a more sophisticated split: rsplit works from the right, and with 1 as the second argument it splits only at the last . in the name.
another example:
f.split('\\')[-1].split('.')[0]
Using python3 and pathlib:
import pathlib
f = 'C:\\X\\Data\\foo.txt'
print(pathlib.PureWindowsPath(f).stem)
will print: 'foo'

Getting file extension using pattern matching in python

I am trying to find the extension of a file, given its name as a string. I know I can use the function os.path.splitext but it does not work as expected in case my file extension is .tar.gz or .tar.bz2 as it gives the extensions as gz and bz2 instead of tar.gz and tar.bz2 respectively.
So I decided to find the extension of files myself using pattern matching.
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
>>> gz # I want this to come as 'tar.gz'
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.bz2').group('ext'))
>>> bz2 # I want this to come as 'tar.bz2'
I am using (?P<ext>...) in my pattern matching as I also want to get the extension.
Please help.
import os
root, ext = os.path.splitext('a.tar.gz')
if ext in ['.gz', '.bz2']:
    # Prepend the inner extension (e.g. '.tar') to get '.tar.gz'
    ext = os.path.splitext(root)[1] + ext
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
>>> print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
gz
>>> print(re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
tar.gz
The ? after .* makes the quantifier non-greedy: instead of .* consuming ".tar" as well, .*? matches as little as possible, which leaves tar.gz to be matched by the named group.
Here is an idea that is much easier than wrestling with a regex, even if it sounds too simple:
name="filename.tar.gz"
extensions=('.tar.gz','.py')
[x for x in extensions if name.endswith(x)]
Starting from phihag's answer:
import os
DOUBLE_EXTENSIONS = ['tar.gz','tar.bz2'] # Add extra extensions where desired.
def guess_extension(filename):
    """
    Guess the extension of given filename.
    """
    root,ext = os.path.splitext(filename)
    if any([filename.endswith(x) for x in DOUBLE_EXTENSIONS]):
        root, first_ext = os.path.splitext(root)
        ext = first_ext + ext
    return root, ext
This is simple and works with both single and multiple extensions (it returns the base name rather than the extension):
In [1]: '/folder/folder/folder/filename.tar.gz'.split('/')[-1].split('.')[0]
Out[1]: 'filename'
In [2]: '/folder/folder/folder/filename.tar'.split('/')[-1].split('.')[0]
Out[2]: 'filename'
In [3]: 'filename.tar.gz'.split('/')[-1].split('.')[0]
Out[3]: 'filename'
Continuing from phihag's answer: to generically strip all double or triple extensions, as in CropQDS275.jpg.aux.xml, loop while '.' is still in the name:
import os
filename = 'CropQDS275.jpg.aux.xml'
tempfilename, file_extension = os.path.splitext(filename)
while '.' in tempfilename:
    tempfilename, tempfile_extension = os.path.splitext(tempfilename)
    file_extension = tempfile_extension + file_extension
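One caveat with the while loop above: it treats every dot as the start of an extension, so a name like report.v2.txt ends up with the extension .v2.txt, and it never terminates for names that begin with a dot (os.path.splitext leaves '.bashrc' unchanged, so the dot never goes away). A hedged alternative is to check an explicit list of compound extensions first and fall back to the normal single suffix. The list KNOWN_DOUBLE and the function name split_extension below are assumptions for illustration only.

```python
from pathlib import PurePath

# Assumed list of compound extensions; extend it as needed.
KNOWN_DOUBLE = (".tar.gz", ".tar.bz2")


def split_extension(filename):
    """Return (stem, extension), keeping known compound extensions together."""
    name = PurePath(filename).name
    lowered = name.lower()
    for ext in KNOWN_DOUBLE:
        # Require at least one character before the extension.
        if lowered.endswith(ext) and len(name) > len(ext):
            return name[:-len(ext)], name[-len(ext):]
    # Fall back to the single, final suffix.
    p = PurePath(name)
    return p.stem, p.suffix


print(split_extension("a.tar.gz"))       # ('a', '.tar.gz')
print(split_extension("report.v2.txt"))  # ('report.v2', '.txt')
print(split_extension(".bashrc"))        # ('.bashrc', '')
```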
