Check tar archive before extractall


In the Python documentation, it is advised not to extract a tar archive without prior inspection. What is the best way to make sure an archive is safe using the tarfile Python module? Should I just iterate over all the filenames and check whether they contain absolute pathnames?
Would something like the following be sufficient?
import sys
import tarfile
with tarfile.open('sample.tar', 'r') as tarf:
    for n in tarf.getnames():
        if n[0] == '/' or n[0:2] == '..':
            print('sample.tar contains unsafe filenames')
            sys.exit(1)
    tarf.extractall()
Edit
This script is not compatible with Python versions prior to 2.7 (cf. the with statement and tarfile).
I now iterate over the members:
import os
import sys
import tarfile
from contextlib import closing
target_dir = "/target/"
with closing(tarfile.open('sample.tar', mode='r:gz')) as tarf:
    for m in tarf:
        pathn = os.path.abspath(os.path.join(target_dir, m.name))
        if not pathn.startswith(target_dir):
            print('The tar file contains unsafe filenames. Aborting.')
            sys.exit(1)
        tarf.extract(m, path=target_dir)

Almost, although it would still be possible to have a path like foo/../../.
Better would be to use os.path.join and os.path.abspath, which together will correctly handle a leading / and any .. components anywhere in the path:
target_dir = "/target/"  # trailing slash is important
with tarfile.open(…) as tarf:
    for n in tarf.getnames():
        if not os.path.abspath(os.path.join(target_dir, n)).startswith(target_dir):
            print("unsafe filenames!")
            sys.exit(1)
    tarf.extractall(path=target_dir)
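
The prefix check above can be made a little less fragile. Here is a minimal sketch, assuming Python 3; the helper names is_within_directory and unsafe_names are made up for illustration. It resolves each member name against the target directory with os.path.realpath and compares the result using os.path.commonpath, so it does not depend on a trailing slash. It does not look at symlink or hard-link targets stored inside the archive, so treat it as a starting point rather than a complete defence.

```python
import os
import tarfile


def is_within_directory(directory, candidate):
    """Return True if candidate resolves to a path inside directory."""
    directory = os.path.realpath(directory)
    candidate = os.path.realpath(candidate)
    return os.path.commonpath([directory, candidate]) == directory


def unsafe_names(tarf, target_dir):
    """List member names that would be written outside target_dir."""
    return [
        m.name
        for m in tarf.getmembers()
        if not is_within_directory(target_dir, os.path.join(target_dir, m.name))
    ]


def extract_checked(archive_path, target_dir):
    """Extract archive_path into target_dir, refusing unsafe member names."""
    with tarfile.open(archive_path) as tarf:
        bad = unsafe_names(tarf, target_dir)
        if bad:
            raise ValueError("unsafe filenames: %r" % (bad,))
        tarf.extractall(path=target_dir)
```

Something like extract_checked('sample.tar', '/target') would then either extract everything or raise before writing any file.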

Related

How to insert strings and slashes in a path?

I'm trying to extract tar.gz files which are located in different folders named srm01, srm02 and srm03.
The folder name must be given as input (a string) to run my code.
I'm trying to do something like this:
import tarfile
import glob
thirdBloc = 'srm01'  # Then, that must be 'srm02', or 'srm03'
for f in glob.glob('C://Users//asediri//Downloads/srm/'+thirdBloc+'/'+'*.tar.gz'):
    tar = tarfile.open(f)
    tar.extractall('C://Users//asediri//Downloads/srm/'+thirdBloc)
I have this error message:
IOError: CRC check failed 0x182518 != 0x7a1780e1L
First I want to be sure that my code finds the .tar.gz files, so I tried to just print the paths returned by glob:
thirdBloc = 'srm01'  # Then, that must be 'srm02', or 'srm03'
for f in glob.glob('C://Users//asediri//Downloads/srm/'+thirdBloc+'/'+'*.tar.gz'):
    print(f)
That gives:
C://Users//asediri//Downloads/srm/srm01\20160707000001-server.log.1.tar.gz
C://Users//asediri//Downloads/srm/srm01\20160707003501-server.log.1.tar.gz
The os.path.exists method tells me that my files don't exist.
print(os.path.exists('C://Users//asediri//Downloads/srm/srm01\20160707000001-server.log.1.tar.gz'))
That gives: False
Is there a proper way to do this? What's the best way to get the right paths in the first place?

In order to join paths you have to use os.path.join as follows:
import os
import tarfile
import glob
thirdBloc = 'srm01'  # Then, that must be 'srm02', or 'srm03'
for f in glob.glob(os.path.join('C://Users//asediri//Downloads/srm/', thirdBloc, '*.tar.gz')):
    tar = tarfile.open(f)
    tar.extractall(os.path.join('C://Users//asediri//Downloads/srm/', thirdBloc))

os.path.join will create the correct paths for your filesystem
f = os.path.join('C://Users//asediri//Downloads/srm/', thirdBloc, '*.tar.gz')

C://Users//asediri//Downloads/srm/srm01\20160707000001-server.log.1.tar.gz
Never use an unescaped \ in a normal Python string literal for file paths: \201 is read as an octal escape for the character \x81. The result is this:
C://Users//asediri//Downloads/srm/srm01ü60707000001-server.log.1.tar.gz
This is why os.path.exists does not find the file.
Alternatively, use a raw string (r"C:\...").
I would suggest you do this:
import glob
import os
os.chdir("C:/Users/asediri/Downloads/srm/srm01")
for f in glob.glob("*.tar.gz"):
    print(f)
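
To see the backslash problem in isolation, here is a small sketch (Python 3 syntax; the file name is the one from the question and the variable names are made up). It compares a normal string literal, a raw string, and os.path.join:

```python
import os

# In a normal string literal, "\201" is an octal escape for one character (0x81),
# so the path silently changes.
broken = "srm01\20160707000001-server.log.1.tar.gz"

# A raw string keeps the backslash and the digits exactly as typed.
raw = r"srm01\20160707000001-server.log.1.tar.gz"

# os.path.join inserts the separator of the current platform.
joined = os.path.join("srm01", "20160707000001-server.log.1.tar.gz")

print(repr(broken))
print(repr(raw))
print(joined)
```

The raw string is three characters longer than the broken one, because the four characters \201 collapsed into a single character.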

How do I extract a file with the python zipfile library while changing its name

This is motivated by pathfile issues (unfortunately this doesn't seem to be true in my case).
I have a zipfile that I am trying to extract with python. The zipfile appears to have been created on windows. The code I have to extract the files from the zipfile is like this:
import os
import zipfile

def unzip_file(zipfile_path):
    z = zipfile.ZipFile(zipfile_path)
    # get pathname without extension
    directory = os.path.splitext(zipfile_path)[0]
    print(directory)
    if not os.path.exists(directory):
        os.makedirs(directory)
    # this line doesn't work. tries to extract "Foobar\\baz.quux" to directory and complains that the directory doesn't exist
    # z.extractall(directory)
    for name in z.namelist():
        # actual dirname we want is this
        # (dirname, filename) = os.path.split(name)
        # I've tried to be cross-platform (see above), but apparently zipfiles save filenames as
        # Foobar\filename.log so I need this for cygwin
        dir_and_filename = name.split('\\')
        if len(dir_and_filename) > 1:
            dirname = dir_and_filename[0:-1]
            filename = dir_and_filename[-1]
        else:
            dirname = ['']
            filename = dir_and_filename[0]
        out_dir = os.path.join(directory, *dirname)
        print("Decompressing " + name + " on " + out_dir)
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)
        z.extract(name, out_dir)
    return directory
While this seems overly complicated, it is an attempt to work around some bugs I've found. One member of the zipfile is Foobar\\filename.log. On trying to extract it, zipfile complains that the directory doesn't exist. I need a way to use a method like so:
zipfile.extract_to(member_name, directory_name, file_name_to_write)
where member name is the name of the member to be read (in this example Foobar\\filename.log), directory_name is the name of the directory that we want to write to, and file_name_to_write is the name of the file that we want to write (in this case it would be filename.log). This does not seem to be supported. Does anyone have any other ideas on how to get a cross platform implementation of extracting this kind of zip archive that has nested expressions?
According to this answer, the zipfile I have may not meet the zipfile specification, section 4.4.17 of which says: "All slashes MUST be forward slashes '/' as opposed to backwards slashes '\' for compatibility with Amiga and UNIX file systems etc."
How do I solve this problem?

I solved this by simply shelling out to unzip. We need to accept an exit code of 0 or 1, because unzip returns 1 for the malformed zipfile (the message is something like "warning: zipfile appears to contain backslashes as path separators").
#!/bin/bash
unzip "$1" -d "$2"
exit_code=$?
# we accept exit codes < 2 because the zipfiles are malformed
if [ "$exit_code" -lt 2 ]; then
    exit 0
else
    exit "$exit_code"
fi
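
If you would rather stay in Python, here is a rough sketch of the extract_to idea from the question, assuming Python 3. The function names extract_member_as and extract_all_normalised are made up and are not part of the zipfile module. It reads each member with ZipFile.open and copies the bytes to a path built by hand, treating both \ and / as separators:

```python
import os
import shutil
import zipfile


def extract_member_as(zf, member_name, directory_name, file_name_to_write):
    """Copy one archive member to directory_name/file_name_to_write."""
    os.makedirs(directory_name, exist_ok=True)
    out_path = os.path.join(directory_name, file_name_to_write)
    with zf.open(member_name) as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


def extract_all_normalised(zipfile_path, directory):
    """Extract every file member, accepting backslashes as path separators."""
    written = []
    with zipfile.ZipFile(zipfile_path) as zf:
        for name in zf.namelist():
            # Split on either separator and drop empty, "." and ".." components.
            parts = [p for p in name.replace("\\", "/").split("/")
                     if p not in ("", ".", "..")]
            if name.endswith(("/", "\\")) or not parts:
                continue  # directory entry, nothing to copy
            out_dir = os.path.join(directory, *parts[:-1])
            written.append(extract_member_as(zf, name, out_dir, parts[-1]))
    return written
```

Because the output path is assembled from cleaned components, a member called Foobar\filename.log ends up as filename.log inside a Foobar folder on any platform.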

Extracting a tar file with folders starting with /

I am writing a program in Python and using tarfile to extract tarfiles. Some of these tarfiles contain folders whose names start with a / (or, on Windows, a \), which causes problems (files are extracted to the wrong place). How can I get around this issue and make sure that the extraction ends up in the correct place?

The docs for tarfile explicitly warn about such a scenario. Instead you need to iterate over the content of the tar file and extract each file individually:
import os
import tarfile
extract_to = "."
tfile = tarfile.open('so.tar')
members = tfile.getmembers()
for m in members:
    if m.name[0] == os.sep:
        m.name = m.name[1:]
    tfile.extract(m, path=extract_to)
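
The loop above removes a single leading os.sep. As a variation, here is a sketch (assuming Python 3; relative_name and extract_relative are hypothetical helper names) that strips any run of leading / or \ characters regardless of platform. It does not deal with .. components, which need a separate check.

```python
import tarfile


def relative_name(name):
    """Strip any run of leading '/' or backslash characters from name."""
    return name.lstrip("/\\")


def extract_relative(tar_path, extract_to="."):
    """Extract every member of tar_path below extract_to."""
    with tarfile.open(tar_path) as tfile:
        for m in tfile.getmembers():
            m.name = relative_name(m.name)
            if m.name:  # skip an entry that consisted only of separators
                tfile.extract(m, path=extract_to)
```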

Did you try the extractall() method? As I remember, one of this method's arguments specifies where the archive should be extracted.

How to move a file in Python?

How can I do the equivalent of mv in Python?
mv "path/to/current/file.foo" "path/to/new/destination/for/file.foo"

os.rename(), os.replace(), or shutil.move()
All employ the same syntax:
import os
import shutil
os.rename("path/to/current/file.foo", "path/to/new/destination/for/file.foo")
os.replace("path/to/current/file.foo", "path/to/new/destination/for/file.foo")
shutil.move("path/to/current/file.foo", "path/to/new/destination/for/file.foo")
The filename ("file.foo") must be included in both the source and destination arguments. If it differs between the two, the file will be renamed as well as moved.
The directory within which the new file is being created must already exist.
On Windows, a file with that name must not exist or an exception will be raised, but os.replace() will silently replace a file even in that occurrence.
shutil.move simply calls os.rename in most cases. However, if the destination is on a different disk than the source, it will instead copy and then delete the source file.
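
Since the destination directory must already exist, a small wrapper can create it first. This is only a sketch, and move_file is a made-up name:

```python
import os
import shutil


def move_file(src, dst):
    """Move src to dst, creating dst's parent directory if needed."""
    parent = os.path.dirname(dst)
    if parent:
        os.makedirs(parent, exist_ok=True)
    # shutil.move returns the destination path
    return shutil.move(src, dst)
```

Used as move_file("path/to/current/file.foo", "path/to/new/destination/for/file.foo"), it behaves like the shutil.move call above but does not fail when the target folder is missing.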

Although os.rename() and shutil.move() will both rename files, the command that is closest to the Unix mv command is shutil.move(). The difference is that os.rename() doesn't work if the source and destination are on different disks, while shutil.move() is disk agnostic.

From Python 3.4 onward, you can also use pathlib's Path class to move a file.
from pathlib import Path
Path("path/to/current/file.foo").rename("path/to/new/destination/for/file.foo")
https://docs.python.org/3.4/library/pathlib.html#pathlib.Path.rename
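
A slightly fuller sketch of the pathlib approach, assuming Python 3.5 or later for mkdir(exist_ok=True); move_with_pathlib is a hypothetical name. Like os.rename, Path.rename may fail when the source and destination are on different file systems.

```python
from pathlib import Path


def move_with_pathlib(src, dst):
    """Rename src to dst with pathlib, creating the destination folder first."""
    src, dst = Path(src), Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    src.rename(dst)
    return dst
```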

For either the os.rename or shutil.move you will need to import the module.
No * character is necessary to get all the files moved.
We have a folder at /opt/awesome called source with one file named awesome.txt.
in /opt/awesome
○ → ls
source
○ → ls source
awesome.txt
python
>>> source = '/opt/awesome/source'
>>> destination = '/opt/awesome/destination'
>>> import os
>>> os.rename(source, destination)
>>> os.listdir('/opt/awesome')
['destination']
We used os.listdir to see that the folder name in fact changed.
Here's the shutil moving the destination back to source.
>>> import shutil
>>> source = '/opt/awesome/destination'
>>> destination = '/opt/awesome/source'
>>> shutil.move(source, destination)
>>> os.listdir('/opt/awesome/source')
['awesome.txt']
This time I checked inside the source folder to be sure the awesome.txt file I created exists. It is there.
Now we have moved a folder and its files from a source to a destination and back again.

This is what I'm using at the moment:
import os, shutil
path = "/volume1/Users/Transfer/"
moveto = "/volume1/Users/Drive_Transfer/"
files = os.listdir(path)
files.sort()
for f in files:
    src = path+f
    dst = moveto+f
    shutil.move(src,dst)
You can also turn this into a function that accepts a source and a destination directory, creates the destination folder if it doesn't exist, and moves the files. It also allows filtering of the source files: for example, if you only want to move images, use the pattern '*.jpg'. By default, it moves everything in the directory.
import os, shutil, pathlib, fnmatch
def move_dir(src: str, dst: str, pattern: str = '*'):
    if not os.path.isdir(dst):
        pathlib.Path(dst).mkdir(parents=True, exist_ok=True)
    for f in fnmatch.filter(os.listdir(src), pattern):
        shutil.move(os.path.join(src, f), os.path.join(dst, f))

The accepted answer is not the right one, because the question is not about renaming one file to another name, but about moving many files into a directory. shutil.move will do the work, but for this purpose os.rename is useless (as stated in the comments) because the destination must have an explicit file name.

Since you don't care about the return value, you can do
import os
os.system("mv src/* dest/")

This is also possible using the subprocess.run() method.
python:
>>> import subprocess
>>> new = "/path/to/destination"
>>> old = "/path/to/new/destination"
>>> process = "mv ..{} ..{}".format(old,new)
>>> subprocess.run(process, shell=True)  # remember to set shell=True
This will work fine on Linux. Windows will probably raise an error, since it has no mv command.

Based on the answer described here, using subprocess is another option.
Something like this:
subprocess.call("mv %s %s" % (source_files, destination_folder), shell=True)
I am curious to know the pros and cons of this method compared to shutil. Since in my case I am already using subprocess for other reasons and it seems to work, I am inclined to stick with it.
This is dependent on the shell you are running your script in. The mv command is for most Linux shells (bash, sh, etc.), but would also work in a terminal like Git Bash on Windows. For other terminals you would have to change mv to an alternate command.

This is a solution that calls mv without enabling the shell.
from subprocess import Popen, PIPE, STDOUT
source = "path/to/current/file.foo"
destination = "path/to/new/destination/for/file.foo"
p = Popen(["mv", "-v", source, destination], stdout=PIPE, stderr=STDOUT)
output, _ = p.communicate()
output = output.strip().decode("utf-8")
if p.returncode:
    print(f"E: {output}")
else:
    print(output)

import os, shutil
current_path = ""  ## source path
new_path = ""  ## destination path
os.chdir(current_path)
for f in os.listdir():
    # on the same disk, os.rename(f, os.path.join(new_path, f)) would also do
    shutil.move(f, os.path.join(new_path, f))  ## also moves files to a different disk, e.g. C: --> D:

Working with relative paths

When I run the following script:
c:\Program Files\foo\bar\scripy.py
How can I refer to directory 'foo'?
Is there a convenient way of using relative paths?
I've done it before with the string module, but there must be a better way (I couldn't find it in os.path).

The os.path module includes various functions for working with paths like this. The convention in most operating systems is to use .. to go "up one level", so to get the outside directory you could do this:
import os
import os.path
current_dir = os.getcwd()  # find the current directory
print(current_dir)  # c:\Program Files\foo\bar (if the script is run from its own directory)
parent = os.path.join(current_dir, "..")  # construct a path to its parent
print(parent)  # c:\Program Files\foo\bar\..
normal_parent = os.path.normpath(parent)  # "normalize" the path
print(normal_parent)  # c:\Program Files\foo
# or on one line:
print(os.path.normpath(os.path.join(os.getcwd(), "..")))

os.path.dirname(path)
Returns the first half of the split that os.path.split performs on the path parameter (head, the directory, and tail, the file). Put simply, it returns the directory the path is in. You'll need to call it twice, but this is probably the best way.
Python Docs on path functions:
http://docs.python.org/library/os.path#os.path.expanduser
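
Calling dirname twice looks like this. The sketch uses ntpath, the Windows flavour of os.path, only so that the example path from the question behaves the same on any platform; the variable names are made up.

```python
import ntpath  # Windows flavour of os.path, importable on any platform

script = r"c:\Program Files\foo\bar\scripy.py"
bar_dir = ntpath.dirname(script)   # directory that contains the script
foo_dir = ntpath.dirname(bar_dir)  # one level further up

print(bar_dir)
print(foo_dir)
```

In a real script you would more likely write os.path.dirname(os.path.dirname(os.path.abspath(__file__))), which does not depend on the current working directory.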

I have recently started using the unipath library instead of os.path. Its object-oriented representations of paths are much simpler:
from unipath import Path
original = Path(__file__)  # .absolute() # r'c:\Program Files\foo\bar\scripy.py'
target = original.parent.parent
print(target)  # Path(u'c:\\Program Files\\foo')
Path is a subclass of str so you can use it with standard filesystem functions, but it also provides alternatives for many of them:
print(target.isdir())  # True
numbers_dir = target.child('numbers')
print(numbers_dir.exists())  # False
numbers_dir.mkdir()
print(numbers_dir.exists())  # True
for n in range(10):
    file_path = numbers_dir.child('%s.txt' % (n,))
    file_path.write_file("Hello world %s!\n" % (n,), 'wt')

This is a bit tricky. For instance, the following code:
import sys
import os
z = sys.argv[0]
p = os.path.dirname(z)
f = os.path.abspath(p)
print("argv[0]={0} , dirname={1} , abspath={2}\n".format(z, p, f))
gives this output on Windows
argv[0]=../zzz.py , dirname=.. , abspath=C:\Users\michael\Downloads
First of all, notice that argv has the slash which I typed in the command python ../zzz.py, while the absolute path has the normal Windows backslashes. If you need to be cross-platform you should probably refrain from putting regular slashes on Python command lines, and use os.sep to refer to the character that separates pathname components.
So far I have only partly answered your question. There are a couple of ways to use the value of f to get what you want. Brute force is to use something like:
targetpath = f + os.sep + ".." + os.sep + ".."
which would result in something like C:\Users\michael\Downloads\..\.. on Windows and /home/michael/../.. on Unix. Each .. goes back one step and is the equivalent of removing the pathname component.
But you could do better by breaking up the path:
target = f.split(os.sep)
targetpath = os.sep.join(target[:-2])
and rejoining all but the last two bits to get C:\Users on Windows and / on Unix. If you do that it might be a good idea to check that there are enough pathname components to remove.
Note that I ran the program above by typing python ../zzz.py. In other words, I was not in the same working directory as the script, therefore getcwd() would not be useful.
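
The "check that there are enough pathname components" step can be sketched with pathlib's pure paths, assuming Python 3.4 or later. The function name ancestor is made up for illustration.

```python
from pathlib import PurePosixPath, PureWindowsPath


def ancestor(path, levels):
    """Return the directory `levels` steps above path, or raise ValueError."""
    parents = path.parents
    if levels < 1 or levels > len(parents):
        raise ValueError("path has only %d parent(s)" % len(parents))
    return parents[levels - 1]


print(ancestor(PureWindowsPath(r"C:\Users\michael\Downloads"), 2))
print(ancestor(PurePosixPath("/home/michael"), 2))
```

With a real script you would pass Path(sys.argv[0]).resolve() or Path(__file__).resolve() rather than a hard-coded path.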
