Conditional extraction of files from an Archive file - python

I have a large tar.gz archive of nxml files, around 5 GB in total.
My aim is to extract files from it, but not all of them: I only need the files whose numeric name is greater than a threshold value.
For example:
Let us say 1000 is our threshold value. Then
path/to/file/900.nxml will not be extracted but
path/to/file/1100.nxml will be extracted.
So my requirement is conditional extraction of files from the archive.
Thanks

Use tar -tf <archive> to get a list of files in the archive.
Process the list of files to determine those you need to extract. Write the file list to a temporary file <filelist>, one line per file.
Looking at the tags you chose, you can use either Python or bash for this string filtering, whichever you prefer.
Use tar -xf <archive> -T <filelist> to extract the files you need.
The option -T or --files-from reads the filenames to process from the given file.
See also the manpage for tar.
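If you'd rather stay entirely in Python, tarfile can do the filtering and extraction in one pass. A minimal sketch, assuming the archive is named archive.tar.gz and the members are named like path/to/file/1100.nxml as in the question:
import tarfile

THRESHOLD = 1000  # threshold value from the question

with tarfile.open("archive.tar.gz", "r:gz") as tf:
    # iterating over the TarFile streams through the archive once,
    # so there is no separate full scan just to build a member list
    for member in tf:
        # "path/to/file/1100.nxml" -> "1100"
        stem = member.name.rsplit("/", 1)[-1].rsplit(".", 1)[0]
        if stem.isdigit() and int(stem) > THRESHOLD:
            tf.extract(member, path="extracted")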

You can also use --wildcards option of tar.
For example, when your threshold is 1000 you can use tar -xf <archive> --wildcards 'path/to/files/????*.nxml' (quoting the pattern so the shell doesn't expand it). Each ? matches exactly one character and * matches any number of characters, so this pattern matches any file whose name has 4 or more characters before the .nxml extension.
Hope this helps.

Related

Combine two files chunked in this format XXXXX.csv.gz_1_2.tar & XXXXX.csv.gz_2_2.tar (with python or pyspark)

I have two files with this format XXXX.csv.gz_1_2.tar & XXXX.csv.gz_2_2.tar; my goal is to combine those files so I can unzip the complete file and get the csv.
Can you help me, please?
I tried to use the tar or cat commands from the Linux command line with import os, like:
import os
cat="cat C:/Users/AAAA/XXXX.csv.gz_1_2.tar C:/Users/AAAA/XXXX.csv.gz_2_2.tar > C:/Users/AAAA/XXXX.csv.gz.tar "
os.system(cat)
Thank you !
The code below is (almost) completely stolen from Add files from one tar into another tar in python, with the obvious adaptation of using two (or any number of) original tar files.
import tarfile

old_tars = ("….tar", "….tar.gz", "….tar.xz", …)
with tarfile.open("new.tar", "w") as new_tar:
    for old_tar in (tarfile.open(tar_name, "r") for tar_name in old_tars):
        for member in old_tar.getmembers():
            new_tar.addfile(member, old_tar.extractfile(member.name))
        old_tar.close()
(Of course, in a real-world program the names of the tar files wouldn't be hard-coded into the source.)

Fastest way to read an image from huge uncompressed tar file in __getitem__ of PyTorch custom dataset

I have a huge dataset (2 million) of jpg images in one uncompressed TAR file. I also have a txt file in which each line is the name of an image in the TAR file, in order:
img_0000001.jpg
img_0000002.jpg
img_0000003.jpg
...
and the images in the tar file have exactly the same names.
I searched a lot and found that the tarfile module is the best fit, but when I tried to read images from the tar file by name, it took too long. The reason is that every time I call the getmember(name) method, it calls the getmembers() method, which scans the whole tar file, builds a list of all members, and then searches that list for the name.
If it helps, my dataset is a single 20 GB tar file.
I don't know whether it is better to extract everything first and use the extracted folders in my CustomDataset, or to read directly from the archive.
Here is the code I am using to read a single file from tar file:
with tarfile.open('data.tar') as tf:
    tarinfo = tf.getmember('img_000001.jpg')
    image = tf.extractfile(tarinfo)
    image = image.read()
    image = Image.open(io.BytesIO(image))
I use this code in the __getitem__ method of my CustomDataset class, which loops over all the names in filelist.txt.
Thanks for any advice
tarfile does cache the member list: getmember() reuses the result of getmembers().
But if you use the snippet above in __getitem__, then for each item of the dataset the tar file is opened and read in full, one image file is extracted, and then the tar file is closed and the cached member info is lost.
The simplest way to resolve this is probably to open the tar file in your dataset's __init__, like self.tf = tarfile.open('data.tar'), but then you need to remember to close it in the end.
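Putting that together, a minimal sketch of such a dataset (the class name TarImageDataset and the constructor arguments are placeholders, not from the question):
import io
import tarfile

from PIL import Image
from torch.utils.data import Dataset

class TarImageDataset(Dataset):
    def __init__(self, tar_path, list_path):
        # keep the tar file open for the dataset's lifetime, so the
        # member index built on the first lookup is reused afterwards
        self.tf = tarfile.open(tar_path)
        with open(list_path) as f:
            self.names = [line.strip() for line in f]

    def __len__(self):
        return len(self.names)

    def __getitem__(self, index):
        data = self.tf.extractfile(self.names[index]).read()
        return Image.open(io.BytesIO(data))

    def close(self):
        self.tf.close()
One caveat: a single open handle is not safe to share across DataLoader worker processes, so with num_workers > 0 each worker should open its own TarFile (for example lazily, on the first __getitem__ call in that process).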

Zipping files with maximum size limit

I have a few directories that contain a varying number of files, and I want to create zip files for each directory containing all of the files in the directory. This is fine for most of them, but one directory has significantly more files and zipping the entire thing would result in a 20GB+ file. I'd rather limit the maximum size of the zip file and split it into, say, 5GB parts. Is there an easy way to do this with Python? I'm using the zipfile module right now, but I'm not seeing a way to tell it to automatically split into multiple zips at a certain filesize.
If you can use RAR instead of ZIP, you can try the rar-package:
from librar import archive

myRAR = archive.Archive("resultPath", base)
myRAR.add_file("FilePath")
myRAR.set_volume_size("5000000K")  # split the archive into volumes of 5 GB (5000000K)
Update:
That rar-package is outdated and does not work with Python 3, but there is a better option now: the rar command-line tool.
rar a -m5 -v10m myarchive movie.avi
This will compress movie.avi and split the result into 10 MB chunks (-v10m), using the best compression ratio (-m5).
more info:
https://ubuntuincident.wordpress.com/2011/05/27/compress-with-rar-and-split-into-multiple-files/
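From Python, the same CLI can be driven with subprocess; a sketch assuming the rar binary is on PATH, with hypothetical archive and directory names:
import subprocess

# "a" adds to an archive, -m5 is the best compression ratio, and -v5g
# splits the output into 5 GB volumes (assumes a rar version that
# accepts the g suffix; -v5000000k is the equivalent in kilobytes)
subprocess.run(
    ["rar", "a", "-m5", "-v5g", "myarchive", "some_directory/"],
    check=True,
)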

How to walk a tar.gz file that contains zip files without extraction

I have a large tar.gz file to analyze using a Python script. The tar.gz file contains a number of zip files, which might embed other .gz files. Before extracting anything, I would like to walk the directory structure within the compressed files to see whether certain files or directories are present. Looking at the tarfile and zipfile modules, I don't see any existing function that allows me to get a table of contents of a zip file within a tar.gz file.
Appreciate your help,
You can't get at it without extracting the file. However, you don't need to extract it to disk if you don't want to. You can use the tarfile.TarFile.extractfile method to get a file-like object that you can then pass to tarfile.open as the fileobj argument. For example, given these nested tarfiles:
$ cat bar/baz.txt
This is bar/baz.txt.
$ tar cvfz bar.tgz bar
bar/
bar/baz.txt
$ tar cvfz baz.tgz bar.tgz
bar.tgz
You can access files from the inner one like so:
>>> import tarfile
>>> baz = tarfile.open('baz.tgz')
>>> bar = tarfile.open(fileobj=baz.extractfile('bar.tgz'))
>>> bar.extractfile('bar/baz.txt').read()
'This is bar/baz.txt.\n'
and they're only ever extracted to memory.
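Since the question mentions zip files inside the tar.gz, the same fileobj trick works with zipfile too. A sketch with hypothetical names outer.tar.gz and inner.zip; note that ZipFile needs a seekable file object, so the tar must be opened in random-access mode ("r:gz"), not stream mode ("r|gz"):
import tarfile
import zipfile

with tarfile.open("outer.tar.gz", "r:gz") as outer:
    # extractfile returns a file-like object without writing to disk
    with zipfile.ZipFile(outer.extractfile("inner.zip")) as inner:
        print(inner.namelist())  # table of contents of the nested zip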
I suspect that this is not possible and that you'll have to program it manually.
.tar.gz files are first tar'd and then gzipped by what are essentially two different applications, in succession. To access the tar file, you're probably going to have to un-gzip it first.
Also, once you do have access to the tar file after un-gzipping it, it does not do random access well. There is no central index in the tar file that lists the contents.

Regex to match the first file in a rar archive file set in Python

I need to uncompress all the files in a directory, and for this I need to find the first file in the set. I'm currently doing this with a bunch of if statements and loops. Can I do this using a regex?
Here's a list of files that I need to match:
yes.rar
yes.part1.rar
yes.part01.rar
yes.part001.rar
yes.r01
yes.r001
These should NOT be matched:
no.part2.rar
no.part02.rar
no.part002.rar
no.part011.rar
no.r002
no.r02
I found a similar regex on this thread, but it seems that Python doesn't support variable-length lookbehinds. A single-line regex would be complicated, but I'll document it well, so that's not a problem. It's just one of those problems you beat your head against.
Thanks in advance guys.
:)
Don't rely on the names of the files to determine which one is first. You're going to end up finding an edge case where you get the wrong file.
RAR's headers will tell you which file is the first one in the volume set, assuming the archives were created by a somewhat recent version of RAR.
HEAD_FLAGS (2 bytes) - bit flags:
0x0100 - First volume (set only by RAR 3.0 and later)
So open up each file and examine the RAR headers, looking specifically for the flag that indicates which file is the first volume. This will never fail, as long as the archive isn't corrupt.
Update: I've just confirmed this by looking at some spanning archives in a hex editor. The file headers are constructed exactly as the link above indicates. It's just a matter of opening the files and reading the header for that flag. The file with that flag set is the first volume.
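A sketch of that check for the RAR 3.x format (the 7-byte signature and header layout here follow the format documentation referenced above; RAR5 archives use a different, longer signature and layout, so this is not a general solution):
import struct

def is_first_volume(path):
    with open(path, "rb") as f:
        # RAR 3.x signature
        if f.read(7) != b"Rar!\x1a\x07\x00":
            return False
        # main archive header: HEAD_CRC (2 bytes), HEAD_TYPE (1 byte),
        # HEAD_FLAGS (2 bytes, little-endian), HEAD_SIZE (2 bytes)
        header = f.read(7)
        if len(header) < 7:
            return False
        head_type = header[2]
        (flags,) = struct.unpack("<H", header[3:5])
        # 0x73 marks the main header; 0x0100 is the first-volume flag
        return head_type == 0x73 and bool(flags & 0x0100)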
There's no need to use lookbehind assertions for this. Since you start matching from the beginning of the string, you can do everything with lookaheads that you could with lookbehinds. This should work:
^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$
To capture the first part of the filename as you requested, you could do this:
^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$
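A quick test of the first pattern above against the file names from the question:
import re

first_volume = re.compile(r"^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$")

should_match = ["yes.rar", "yes.part1.rar", "yes.part01.rar",
                "yes.part001.rar", "yes.r01", "yes.r001"]
should_not = ["no.part2.rar", "no.part02.rar", "no.part002.rar",
              "no.part011.rar", "no.r002", "no.r02"]

assert all(first_volume.match(name) for name in should_match)
assert not any(first_volume.match(name) for name in should_not)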
Are you sure you want to match these cases?
yes.r01
Those are not first archives: the .rar always is.
The order is bla.rar, bla.r00 and only then bla.r01. You'll probably extract the files twice if you match both .r01 and .rar as the first archive.
yes.r001
.r001 doesn't exist. Do you mean the .001 files that WinRAR supports?
After .r99, it's .s00. If it does exist, then somebody manually renamed the files.
In theory, matching on filename should be as reliable as matching on the 0x0100 flag to find the first archive.
