Getting file extension using pattern matching in python

I am trying to find the extension of a file, given its name as a string. I know I can use os.path.splitext, but it does not do what I want when the extension is .tar.gz or .tar.bz2: it returns .gz and .bz2 instead of .tar.gz and .tar.bz2.
So I decided to find the extension myself using pattern matching.
import re
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
# prints gz, but I want tar.gz
print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.bz2').group('ext'))
# prints bz2, but I want tar.bz2
I am using the named group (?P<ext>...) in the pattern because I also want to capture the extension.
Please help.

import os
root, ext = os.path.splitext('a.tar.gz')
if ext in ['.gz', '.bz2']:
    ext = os.path.splitext(root)[1] + ext
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
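
The two-step splitext idea can be wrapped in a small helper. This is only a sketch, not part of the original answer: the name split_compound_ext and the COMPRESSION_EXTS set are assumptions made for illustration.

```python
import os.path

# Hypothetical set of compression suffixes that may wrap another extension.
COMPRESSION_EXTS = {'.gz', '.bz2'}

def split_compound_ext(filename):
    """Return (root, ext), treating e.g. '.tar.gz' as one extension."""
    root, ext = os.path.splitext(filename)
    if ext in COMPRESSION_EXTS:
        # Peel one more extension off the root and put it in front.
        root, inner = os.path.splitext(root)
        ext = inner + ext
    return root, ext

print(split_compound_ext('a.tar.gz'))  # ('a', '.tar.gz')
print(split_compound_ext('a.txt'))     # ('a', '.txt')
```

Note that any inner extension is kept, so 'notes.txt.gz' would give '.txt.gz'; whether that is wanted depends on your files.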

>>> import re
>>> print(re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
gz
>>> print(re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext'))
tar.gz
The ? after .* makes the quantifier lazy (non-greedy): it matches as few characters as possible. So instead of the greedy .* consuming ".tar" as well, .*? stops at the first dot that still lets the rest of the pattern, here tar\.gz, match.
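
To see the greedy and lazy behaviour side by side, here is a small sketch. The names GREEDY, LAZY and ext_of are made up for this example; the two patterns are the ones from the session above.

```python
import re

# Same pattern twice; the only difference is .* versus .*?
GREEDY = re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$')
LAZY = re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$')

def ext_of(pattern, filename):
    """Return the 'ext' group, or None when the name has no dot at all."""
    m = pattern.match(filename)
    return m.group('ext') if m else None

print(ext_of(GREEDY, 'a.tar.gz'))  # gz
print(ext_of(LAZY, 'a.tar.gz'))    # tar.gz
print(ext_of(LAZY, 'a.b.txt'))     # txt
```

The lazy version still returns only the last piece for names like 'a.b.txt', because the \w+ alternative cannot cross a dot and the pattern is anchored with $.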

Here is an idea that is much easier than wrestling with a regex, even if it sounds naive: keep a tuple of the extensions you care about and test the name with str.endswith.
name = "filename.tar.gz"
extensions = ('.tar.gz', '.py')
[x for x in extensions if name.endswith(x)]  # ['.tar.gz']
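
If the list of known extensions contains overlapping entries (say both .gz and .tar.gz), the comprehension returns several hits. A possible refinement, sketched here with a hypothetical KNOWN_EXTENSIONS tuple and function name, is to pick the longest one:

```python
# Hypothetical list of extensions; adjust to the files you actually handle.
KNOWN_EXTENSIONS = ('.tar.gz', '.tar.bz2', '.gz', '.py')

def match_extension(name, extensions=KNOWN_EXTENSIONS):
    """Return the longest known extension that name ends with, or ''."""
    candidates = [ext for ext in extensions if name.endswith(ext)]
    # max(..., default='') needs Python 3.4 or later.
    return max(candidates, key=len, default='')

print(match_extension('filename.tar.gz'))  # .tar.gz
print(match_extension('README'))           # prints an empty line
```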

Starting from phihag's answer:
import os
DOUBLE_EXTENSIONS = ('.tar.gz', '.tar.bz2')  # Add extra extensions where desired.
def guess_extension(filename):
    """Guess the extension of the given filename."""
    root, ext = os.path.splitext(filename)
    # str.endswith accepts a tuple of suffixes.
    if filename.endswith(DOUBLE_EXTENSIONS):
        root, first_ext = os.path.splitext(root)
        ext = first_ext + ext
    return root, ext

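This is simple and works for both single and multiple extensions. Note that it returns the base name with every extension removed (not the extension itself), and it assumes the base name contains no dots: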
In [1]: '/folder/folder/folder/filename.tar.gz'.split('/')[-1].split('.')[0]
Out[1]: 'filename'
In [2]: '/folder/folder/folder/filename.tar'.split('/')[-1].split('.')[0]
Out[2]: 'filename'
In [3]: 'filename.tar.gz'.split('/')[-1].split('.')[0]
Out[3]: 'filename'

Continuing from phihag's answer: to generically strip all double or triple extensions, as in CropQDS275.jpg.aux.xml, loop while the name still contains a dot:
import os
tempfilename, file_extension = os.path.splitext(filename)
while '.' in tempfilename:
    tempfilename, tempfile_extension = os.path.splitext(tempfilename)
    if not tempfile_extension:
        break  # remaining dots are not extensions (e.g. '.bashrc' or a dotted directory name)
    file_extension = tempfile_extension + file_extension
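
The same loop can be written as a reusable function that stops as soon as os.path.splitext finds no further extension, so it also terminates for names such as '.bashrc' or for paths whose directories contain dots. The function name below is an assumption for illustration.

```python
import os

def split_all_extensions(filename):
    """Split filename into (root, all_extensions_joined)."""
    root, ext = os.path.splitext(filename)
    all_ext = ''
    while ext:
        # Prepend each extension found, working from right to left.
        all_ext = ext + all_ext
        root, ext = os.path.splitext(root)
    return root, all_ext

print(split_all_extensions('CropQDS275.jpg.aux.xml'))
# ('CropQDS275', '.jpg.aux.xml')
```

Keep in mind that this treats every dot-separated piece as an extension, so a name like report.2017.06.07.tar.gz would lose its date as well.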

Related

Add character in a link

I want to add characters to a file path.
The path is C:\Users\user\Documents\test.csv and I want to get C:\Users\user\Documents\test_new.csv.
As you can see, I added _new to the filename.
Should I extract the name with Path(path).name and then use a regex? What is the best way to do that?

Since you want to add _new to the name rather than rename the file on disk, here is a compact solution: the function body is a single expression. It may look dense because everything is packed into one line. You can change the keyword and the extension through the function's arguments.
PATH = "C:\\User\\Folder\\file.csv"
def new_name(path, ext="csv", keyword="_new"):
    # Rebuild the directory part, then append keyword and extension to the file name up to its first dot.
    print('\\'.join(path.split("\\")[:-1]) + "\\" + path.split("\\")[-1].split(".")[0] + keyword + "." + ext)
new_name(PATH)

Here's a solution using the os module:
import os
path = r"C:\User\Folder\file.csv"
root, ext = os.path.splitext(path)
new_path = f'{root}_new{ext}'
And here's one using pathlib (Path.with_stem requires Python 3.9 or later):
import pathlib
path = pathlib.Path(r"C:\User\Folder\file.csv")
new_path = str(path.with_stem(path.stem + '_new'))
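
For Python versions older than 3.9, with_name can stand in for with_stem. The sketch below uses PureWindowsPath so that the backslash-separated path is parsed the same way on any operating system; the function name add_to_stem is made up for this example.

```python
from pathlib import PureWindowsPath

def add_to_stem(path, addition='_new'):
    """Insert addition between the file's stem and its last suffix."""
    p = PureWindowsPath(path)
    return str(p.with_name(p.stem + addition + p.suffix))

print(add_to_stem(r"C:\Users\user\Documents\test.csv"))
# C:\Users\user\Documents\test_new.csv
```

Only the last suffix is treated as the extension here, so archive.tar.gz would become archive.tar_new.gz.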

renaming the filename with regex in python using re

I have a folder that contains many files with names like the following example:
_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001_ICGCDBDE20130916001.rsem.bam
I want to rename each file so that only the last underscore-separated field remains, here ICGCDBDE20130916001.rsem.bam; that field differs from file to file. All files in the directory should be renamed this way. I thought of using a regular expression and came up with this pattern:
pat=r'_(.*)_(.*)_(.*)_(.*)_(.\w+)'
This splits the filename as desired, and I could rename the files using only the last group. I want to use Python because I am learning it, starting with small tasks such as file renaming and later converting my workflows to Python, but I cannot get this to work. How should I do this in Python? I would also like to know the corresponding Bash regex, since the filename is long and I am new to this. Below is my code, which does not rename anything yet and only checks whether the pattern matches. How can I extend it to rename the files?
import re
import os
_src = "path/bam/test/"
_ext = ".rsem.bam"
endsWithNumber = re.compile(r'_(.*)_(.*)_(.*)_(.*)_(.\w+)' + re.escape(_ext) + '$')
print(endsWithNumber)
for filename in os.listdir(_src):
    m = endsWithNumber.search(filename)
    print(m)
I would appreciate answers for both Python and Bash, but I prefer Python for my own understanding and future learning.

You can use rpartition, which separates the part you want from the rest and returns a three-part tuple.
Given:
>>> fn
'_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001_ICGCDBDE20130916001.rsem.bam'
You can do:
>>> fn.rpartition('_')
('_EGAZ00001018697_2014_ICGC_130906_D81P8DQ1_0153_C2704ACXX.nopd.AOCS_001', '_', 'ICGCDBDE20130916001.rsem.bam')
Then:
>>> _,sep,new_name=fn.rpartition('_')
>>> new_name
'ICGCDBDE20130916001.rsem.bam'
If you want to use a regex:
>>> re.search(r'_([^_]+$)', fn).group(1)
'ICGCDBDE20130916001.rsem.bam'
As a practical matter, you would test to see if there was a match before using group(1):
>>> m=re.search(r'_([^_]+$)', fn)
>>> new_name = m.group(1) if m else fn
For sed you can do:
$ echo "$fn" | sed -E 's/.*_([^_]*)$/\1/'
ICGCDBDE20130916001.rsem.bam
Or in Bash, same regex:
$ [[ $fn =~ _([^_]*)$ ]] && echo "${BASH_REMATCH[1]}"
ICGCDBDE20130916001.rsem.bam
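
To actually rename the files in a directory, the rpartition step can be combined with os.listdir and os.rename. This is a sketch under assumptions: the function name, the dry_run flag and the default extension are illustrative, and it does not guard against two files ending up with the same new name.

```python
import os

def rename_to_last_field(src_dir, ext='.rsem.bam', dry_run=True):
    """Rename files ending in ext to the part after the last underscore.

    Returns the list of (old_name, new_name) pairs. With dry_run=True
    nothing is renamed, so the plan can be inspected first.
    """
    planned = []
    for filename in sorted(os.listdir(src_dir)):
        if not filename.endswith(ext):
            continue
        _, sep, new_name = filename.rpartition('_')
        if not sep:
            continue  # no underscore in the name: nothing to strip
        planned.append((filename, new_name))
        if not dry_run:
            # os.listdir returns bare names, so join them with the directory.
            os.rename(os.path.join(src_dir, filename),
                      os.path.join(src_dir, new_name))
    return planned
```

Calling rename_to_last_field("path/bam/test/") first and checking the returned pairs before passing dry_run=False is a cheap way to avoid surprises.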

You can use a list comprehension:
import re
import os
_src = "path/bam/test/"
filenames = os.listdir(_src)
new_s = [re.search(r"[a-zA-Z0-9]+\.rsem\.bam", filename) for filename in filenames]
for first, second in zip(filenames, new_s):
    if second is not None:
        # os.listdir returns bare names, so join them with the directory.
        os.rename(os.path.join(_src, first), os.path.join(_src, second.group(0)))

Too much work.
newname = oldname.rsplit('_', 1)[1]

import os
fname = 'YOUR_FILENAME.avi'
# Swap the extension; splitext keeps any other dots in the name intact.
fname2 = os.path.splitext(fname)[0] + '.mp4'
os.rename(os.path.join('path to your source file', fname), os.path.join('path to your destination file', fname2))
fname = fname2

How to cut tar.gz extension from filename

I have a problem removing the extension from my filename. I tried
os.path.splitext(checked_delivery)[0]
but it removes only .gz from the filename. I also need to check whether the name refers to a file with an extension or to a directory. I did that with:
os.path.exists(delivery)
Another problem is that I can't simply split on dots because of the date in the name (YYYY.MM.DD). Should I use join(), or is there something more elegant than piles of method calls and ifs?

I propose the following small function:
def strip_extension(fn: str, extensions=(".tar.bz2", ".tar.gz")):
    for ext in extensions:
        if fn.endswith(ext):
            return fn[: -len(ext)]
    raise ValueError(f"Unexpected extension for filename: {fn}")
assert strip_extension("foo.tar.gz") == "foo"

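I propose a generic solution that removes the file extension using the pathlib module. Managing paths with os is not that convenient nowadays, IMO. Note that path.suffixes treats every dot-separated piece after the first as a suffix, so a name containing a dotted date such as 2017.06.07 would lose part of the date as well.
import pathlib
def remove_extension(path: pathlib.Path) -> pathlib.Path:
    suffixes = ''.join(path.suffixes)
    # Cut the combined suffixes off the end of the file name only.
    return path.with_name(path.name[:len(path.name) - len(suffixes)])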

If you know that the extension is always going to be .tar.gz, you can still use split:
In [1]: fname = 'RANDOM_FILE-2017.06.07.tar.gz'
In [2]: '.'.join(fname.split('.')[:-2])
Out[2]: 'RANDOM_FILE-2017.06.07'
From the docstring for os.path.splitext:
"Extension is everything from the last dot to the end, ignoring leading dots. "
In the case of gzipped tarballs this makes sense, as the file 'FILE.tar.gz' is a gzipped version of 'FILE.tar', which is presumably a tarball made from 'FILE'.
This is why you need something other than os.path.splitext if what you want is the original filename without .tar.

remove part of path

I have the following data:
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA
However, I don't want "Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/" in it; I want the path to be:
/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA
With regular expressions I basically want to match anything of the form *.torrent./. How can I accomplish this with a regexp?
Thanks!

You don't even need regular expressions. You can use os.path.dirname and os.path.basename:
os.path.join(os.path.dirname(os.path.dirname(path)),
             os.path.basename(path))
where path is the original path to the file.
Alternatively, you can also use os.path.split as follows:
dirname, filename = os.path.split(path)
os.path.join(os.path.dirname(dirname), filename)
Note: this works under the assumption that what you want to remove is the directory that directly contains the file, as in the example in the question.
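
As a runnable illustration of the os.path.split variant, here is a short sketch. It uses posixpath (the POSIX flavour of os.path) so the forward-slash path from the question behaves the same on any platform; the function name drop_parent_dir is an assumption.

```python
import posixpath

def drop_parent_dir(path):
    """Remove the directory that directly contains the file from path."""
    dirname, filename = posixpath.split(path)
    return posixpath.join(posixpath.dirname(dirname), filename)

path = ('/share/Downloads/Videos/Movies/'
        'Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/'
        'Big.Buck.Bunny.720p.Bluray.x264-BLA')
print(drop_parent_dir(path))
# /share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA
```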

You can do this without using a regexp (it works here because the file name is the same as the directory name minus .torrent):
>>> x = '/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA'
>>> x.rfind('.torrent')
66
>>> x[:x.rfind('.torrent')]
'/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA'

I basically want to match anything that holds *.torrent./, how can I accomplish this in regexp?
You can use:
[^/]*\.torrent/
Assuming the last . was a typo.

Given path='/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA'
You can do it with a regular expression, replacing the matched directory (and the slashes around it) with a single slash:
re.sub(r"/[^/]*\.torrent/", "/", path)
You can also do it without a regex:
'/'.join(x for x in path.split("/") if "torrent" not in x)
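
A minimal sketch of the regex approach, assuming the goal is to drop any path component that ends in .torrent. The replacement is a single slash, so the neighbouring components stay separated. The names TORRENT_DIR and strip_torrent_dirs are illustrative only.

```python
import re

# One path component ending in ".torrent", together with its two slashes.
TORRENT_DIR = re.compile(r'/[^/]*\.torrent/')

def strip_torrent_dirs(path):
    """Replace each '/<name>.torrent/' component with a single '/'."""
    return TORRENT_DIR.sub('/', path)

path = ('/share/Downloads/Videos/Movies/'
        'Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/'
        'Big.Buck.Bunny.720p.Bluray.x264-BLA')
print(strip_torrent_dirs(path))
# /share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA
```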

Your question is a bit vague, but here's one way to strip off what you want:
import re
s = "/share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA.torrent/Big.Buck.Bunny.720p.Bluray.x264-BLA"
c = re.compile("(/.*/).*?torrent/(.*)")
m = c.match(s)
path = m.group(1)
file = m.group(2)
print(path + file)
# prints /share/Downloads/Videos/Movies/Big.Buck.Bunny.720p.Bluray.x264-BLA

Python: getting filename case as stored in Windows?

Though Windows is case insensitive, it does preserve case in filenames. In Python, is there any way to get a filename with case as it is stored on the file system?
E.g., in a Python program I have filename = "texas.txt", but want to know that it's actually stored "TEXAS.txt" on the file system, even if this is inconsequential for various file operations.

Here's the simplest way to do it:
>>> import win32api
>>> win32api.GetLongPathName(win32api.GetShortPathName('texas.txt'))
'TEXAS.txt'

I had problems with special characters with the win32api solution above. For unicode filenames you need to use:
win32api.GetLongPathNameW(win32api.GetShortPathName(path))

This one uses only the standard library and converts all path parts (except the drive letter):
import glob
import re
def casedpath(path):
    # Wrap the last character of each path component in [] so that glob looks the name up on disk and returns it as stored.
    r = glob.glob(re.sub(r'([^:/\\])(?=[/\\]|$)|\[', r'[\g<0>]', path))
    return r[0] if r else path
And this one handles UNC paths in addition:
import os
def casedpath_unc(path):
    # The original used os.path.splitunc, which was removed in Python 3.7; os.path.splitdrive splits off UNC prefixes there.
    unc, p = os.path.splitdrive(path)
    r = glob.glob(unc + re.sub(r'([^:/\\])(?=[/\\]|$)|\[', r'[\g<0>]', p))
    return r[0] if r else path
Note: This is somewhat slower than the Win API GetShortPathName approach, which depends on the file system, but it is independent of platform and file system. It also works when short (8.3) filename generation is switched off on a Windows volume (check with fsutil.exe 8dot3name query C:). Switching it off is recommended, at least for performance-critical file systems, when no 16-bit applications rely on short names any more:
fsutil.exe behavior set disable8dot3 1

>>> import os
>>> os.listdir("./")
['FiLeNaMe.txt']
Does this answer your question?

And if you want to recurse into directories:
import os
path = os.path.join("c:\\", "path")
for r, d, f in os.walk(path):
    for file in f:
        if file.lower() == "texas.txt":
            print("Found:", os.path.join(r, file))

You could use:
import os
a = os.listdir('mydirpath')
b = [f.lower() for f in a]
try:
    i = b.index('texas.txt')
    print(a[i])
except ValueError:
    print('File not found in this directory')
This of course assumes that your search string 'texas.txt' is in lowercase. If it isn't, you'll have to convert it to lowercase first.
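
The same lookup can be packaged as a function that lowercases both sides, so the search string may be in any case. This is a sketch; the name stored_name is hypothetical, and it returns the first match only.

```python
import os

def stored_name(directory, filename):
    """Return filename as it is spelled on disk in directory, or None."""
    wanted = filename.lower()
    for entry in os.listdir(directory):
        if entry.lower() == wanted:
            return entry
    return None
```

For example, stored_name('mydirpath', 'Texas.TXT') would return 'TEXAS.txt' if that is how the file is stored.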
