How to cut tar.gz extension from filename - python

I have a problem with deleting extension from my filename. I tried to use
os.path.splitext(checked_delivery)[0]
, but it delete only .gz from filename. I need to check if file has extension or it's a directory. I did it using this:
os.path.exists(delivery)
But another problem is, that I can't split it cause of data in it (YYYY.MM.DD). Should I use join() or it is something more attractive instead of tons of methods and ifs?

I propose the following small function:
def strip_extension(fn: str, extensions=[".tar.bz2", ".tar.gz"]):
for ext in extensions:
if fn.endswith(ext):
return fn[: -len(ext)]
raise ValueError(f"Unexpected extension for filename: {fn}")
assert strip_extension("foo.tar.gz") == "foo"

I propose a generic solution to remove the file extension from the string using the pathlib module. Using the os to manage the paths is not that convenient nowadays, IMO.
import pathlib
def remove_extention(path: pathlib.PosixPath) -> path.PosixPath:
suffixes = ''.join(path.suffixes)
return pathlib.Path(str(path).replace(suffixes, ''))

If you know that the extension is always going to be .tar.gz, you can still use split:
In [1]: fname = 'RANDOM_FILE-2017.06.07.tar.gz'
In [2]: '.'.join(fname.split('.')[:-2])
Out[2]: 'RANDOM_FILE-2017.06.07'
From the docstring for os.path.splitext:
"Extension is everything from the last dot to the end, ignoring leading dots. "
In the case of gzipped tarballs, this makes sense anyway, as the file 'FILE.tar.gz' is a gzipped version of the 'FILE.tar', which is presumably a tarball made from file 'FILE'
This is why you would need to use something other than os.path.splitext for this, if what you need is the original filename, without .tar

Related

normalize non-existing path using pathlib only

python has recently added the pathlib module (which i like a lot!).
there is just one thing i'm struggling with: is it possible to normalize a path to a file or directory that does not exist? i can do that perfectly well with os.path.normpath. but wouldn't it be absurd to have to use something other than the library that should take care of path related stuff?
the functionality i would like to have is this:
from os.path import normpath
from pathlib import Path
pth = Path('/tmp/some_directory/../i_do_not_exist.txt')
pth = Path(normpath(str(pth)))
# -> /tmp/i_do_not_exist.txt
but without having to resort to os.path and without having to type-cast to str and back to Path. also pth.resolve() does not work for non-existing files.
is there a simple way to do that with just pathlib?
is it possible to normalize a path to a file or directory that does not exist?
Starting from 3.6, it's the default behavior. See https://docs.python.org/3.6/library/pathlib.html#pathlib.Path.resolve
Path.resolve(strict=False)
...
If strict is False, the path is resolved as far as possible and any remainder is appended without checking whether it exists
As of Python 3.5: No, there's not.
PEP 0428 states:
Path resolution
The resolve() method makes a path absolute, resolving
any symlink on the way (like the POSIX realpath() call). It is the
only operation which will remove " .. " path components. On Windows,
this method will also take care to return the canonical path (with the
right casing).
Since resolve() is the only operation to remove the ".." components, and it fails when the file doesn't exist, there won't be a simple means using just pathlib.
Also, the pathlib documentation gives a hint as to why:
Spurious slashes and single dots are collapsed, but double dots ('..')
are not, since this would change the meaning of a path in the face of
symbolic links:
PurePath('foo//bar') produces PurePosixPath('foo/bar')
PurePath('foo/./bar') produces PurePosixPath('foo/bar')
PurePath('foo/../bar') produces PurePosixPath('foo/../bar')
(a naïve approach would make PurePosixPath('foo/../bar') equivalent to PurePosixPath('bar'), which is wrong if foo is a symbolic link to another directory)
All that said, you could create a 0 byte file at the location of your path, and then it'd be possible to resolve the path (thus eliminating the ..). I'm not sure that's any simpler than your normpath approach, though.
If this fits you usecase(e.g. ifle's directory already exists) you might try to resolve path's parent and then re-append file name, e.g.:
from pathlib import Path
p = Path()/'hello.there'
print(p.parent.resolve()/p.name)
Old question, but here is another solution in particular if you want POSIX paths across the board (like nix paths on Windows too).
I found pathlib resolve() to be broken as of Python 3.10, and this method is not currently exposed by PurePosixPath.
What I found worked was to use posixpath.normpath(). Also found PurePosixPath.joinpath() to be broken. I.E. It will not join ".." with "myfile.txt" as expected. It will return just "myfile.txt". But posixpath.join() works perfectly; will return "../myfile.txt".
Note this is in path strings, but easily back to pathlib.Path(my_posix_path) et al for an OOP container.
And easily transpose to Windows platform paths too by just constructing this way, as the module takes care of the platform independence for you.
Might be the solution for others with Python file path woes..

Check if GZIP file exists in Python

I would like to check for the existence of a .gz file on my linux machine while running Python. If I do this for a text file, the following code works:
import os.path
os.path.isfile('bob.asc')
However, if bob.asc is in gzip format (bob.asc.gz), Python does not recognize it. Ideally, I would like to use os.path.isfile or something very concise (without writing new functions). Is it possible to make the file recognizable either in Python or by changing something in my system configuration?
Unfortunately I can't change the data format or the file names as they are being given to me in batch by a corporation.
After fooling around for a bit, the most concise way I could get the job done was
subprocess.call(['ls','bob.asc.gz']) == 0
which returns True if the file exists in the directory. This is the behavior I would expect from
os.path.isfile('bob.asc.gz')
but for some reason Python won't accept files with extension .gz as files when passed to os.path.isfile.
I don't feel like my solution is very elegant, but it is concise. If someone has a more elegant solution, I'd love to see it. Thanks.
Of course it doesn't; they are completely different files. You need to test it separately:
os.path.isfile('bob.asc.gz')
This would return True if that exact file was present in the current working directory.
Although a workaround could be:
from os import listdir, getcwd
from os.path import splitext, basename
any(splitext(basename(f))[0] == 'bob.asc' for f in listdir(getcwd()))
You need to test each file. For example :
if any(map(os.path.isfile, ['bob.asc', 'bob.asc.gz'])):
print 'yay'

Getting file extension using pattern matching in python

I am trying to find the extension of a file, given its name as a string. I know I can use the function os.path.splitext but it does not work as expected in case my file extension is .tar.gz or .tar.bz2 as it gives the extensions as gz and bz2 instead of tar.gz and tar.bz2 respectively.
So I decided to find the extension of files myself using pattern matching.
print re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz')group('ext')
>>> gz # I want this to come as 'tar.gz'
print re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.bz2')group('ext')
>>> bz2 # I want this to come 'tar.bz2'
I am using (?P<ext>...) in my pattern matching as I also want to get the extension.
Please help.
root,ext = os.path.splitext('a.tar.gz')
if ext in ['.gz', '.bz2']:
ext = os.path.splitext(root)[1] + ext
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
>>> print re.compile(r'^.*[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext')
gz
>>> print re.compile(r'^.*?[.](?P<ext>tar\.gz|tar\.bz2|\w+)$').match('a.tar.gz').group('ext')
tar.gz
>>>
The ? operator tries to find the minimal match, so instead of .* eating ".tar" as well, .*? finds the minimal match that allows .tar.gz to be matched.
I have idea which is much easier than breaking your head with regex,sometime it might sound stupid too.
name="filename.tar.gz"
extensions=('.tar.gz','.py')
[x for x in extensions if name.endswith(x)]
Starting from phihags answer:
DOUBLE_EXTENSIONS = ['tar.gz','tar.bz2'] # Add extra extensions where desired.
def guess_extension(filename):
"""
Guess the extension of given filename.
"""
root,ext = os.path.splitext(filename)
if any([filename.endswith(x) for x in DOUBLE_EXTENSIONS]):
root, first_ext = os.path.splitext(root)
ext = first_ext + ext
return root, ext
this is simple and works on both single and multiple extensions
In [1]: '/folder/folder/folder/filename.tar.gz'.split('/')[-1].split('.')[0]
Out[1]: 'filename'
In [2]: '/folder/folder/folder/filename.tar'.split('/')[-1].split('.')[0]
Out[2]: 'filename'
In [3]: 'filename.tar.gz'.split('/')[-1].split('.')[0]
Out[3]: 'filename'
Continuing from phihags answer to generic remove all double or triple extensions such as CropQDS275.jpg.aux.xml use while '.' in:
tempfilename, file_extension = os.path.splitext(filename)
while '.' in tempfilename:
tempfilename, tempfile_extension = os.path.splitext(tempfilename)
file_extension = tempfile_extension + file_extension

Problem reading text files without extensions in python

I have written a piece of a code which is supposed to read the texts inside several files which are located in a directory. These files are basically text files but they do not have any extensions.But my code is not able to read them:
corpus_path = 'Reviews/'
for infile in glob.glob(os.path.join(corpus_path,'*.*')):
review_file = open(infile,'r').read()
print review_file
To test if this code works, I put a dummy text file, dummy.txt. which worked because it has extension. But i don't know what should be done so files without the extensions could be read.
can someone help me? Thanks
Glob patterns don't work the same way as wildcards on the Windows platform. Just use * instead of *.*. i.e. os.path.join(corpus_path,'*'). Note that * will match every file in the directory - if that's not what you want then you can revise the pattern accordingly.
See the glob module documentation for more details.
Just use * instead of *.*.
The latter requires an extension to be present (more precisely, there needs to be a dot in the filename), the former doesn't.
You could search for * instead of *.*, but this will match every file in your directory.
Fundamentally, this means that you will have to handle cases where the file you are opening is not a text file.
it seems that you need
from os import listdir
from filename in ( fn for fn in listdir(corpus_path) if '.' not in fn):
# do something
you could write
from os import listdir
for fn in listdir(corpus_path):
if '.' not in fn:
# do something
but the former with a generator spares one indentation level

How can I find path to given file?

I have a file, for example "something.exe" and I want to find path to this file
How can I do this in python?
Perhaps os.path.abspath() would do it:
import os
print os.path.abspath("something.exe")
If your something.exe is not in the current directory, you can pass any relative path and abspath() will resolve it.
use os.path.abspath to get a normalized absolutized version of the pathname
use os.walk to get it's location
import os
exe = 'something.exe'
#if the exe just in current dir
print os.path.abspath(exe)
# output
# D:\python\note\something.exe
#if we need find it first
for root, dirs, files in os.walk(r'D:\python'):
for name in files:
if name == exe:
print os.path.abspath(os.path.join(root, name))
# output
# D:\python\note\something.exe
if you absolutely do not know where it is, the only way is to find it starting from root c:\
import os
for r,d,f in os.walk("c:\\"):
for files in f:
if files == "something.exe":
print os.path.join(r,files)
else, if you know that there are only few places you store you exe, like your system32, then start finding it from there. you can also make use of os.environ["PATH"] if you always put your .exe in one of those directories in your PATH variable.
for p in os.environ["PATH"].split(";"):
for r,d,f in os.walk(p):
for files in f:
if files == "something.exe":
print os.path.join(r,files)
Just to mention, another option to achieve this task could be the subprocess module, to help us execute a command in terminal, like this:
import subprocess
command = "find"
directory = "/Possible/path/"
flag = "-iname"
file = "something.foo"
args = [command, directory, flag, file]
process = subprocess.run(args, stdout=subprocess.PIPE)
path = process.stdout.decode().strip("\n")
print(path)
With this we emulate passing the following command to the Terminal:
find /Posible/path -iname "something.foo".
After that, given that the attribute stdout is binary string, we need to decode it, and remove the trailing "\n" character.
I tested it with the %timeit magic in spyder, and the performance is 0.3 seconds slower than the os.walk() option.
I noted that you are in Windows, so you may search for a command that behaves similar to find in Unix.
Finally, if you have several files with the same name in different directories, the resulting string will contain all of them. In consequence, you need to deal with that appropriately, maybe using regular expressions.
This is really old thread, but might be useful to someone who stumbles across this. In python 3, there is a module called "glob" which takes "egrep" style search strings and returns system appropriate pathing (i.e. Unix\Linux and Windows).
https://docs.python.org/3/library/glob.html
Example usage would be:
results = glob.glob('./**/FILE_NAME')
Then you get a list of matches in the result variable.
Uh... This question is a bit unclear.
What do you mean "have"? Do you have the name of the file? Have you opened it? Is it a file object? Is it a file descriptor? What???
If it's a name, what do you mean with "find"? Do you want to search for the file in a bunch of directories? Or do you know which directory it's in?
If it is a file object, then you must have opened it, reasonably, and then you know the path already, although you can get the filename from fileob.name too.

Categories

Resources