Extracting files from stream-mode tarfile

I've got a stream that contains the content of a .tar file, so I work with it using tarfile.open(mode='r|').
What I need to do is look at the list of files inside it, read some of them, and then upload the whole tar to another place.
When I call tarfile.extractfile() after tarfile.getnames(), it raises a tarfile.StreamError. But I cannot extract a file whose name I don't know.
How can I get the list of files without breaking the tarfile? I cannot save the whole tar to RAM or disk, because some files inside it can be larger than 10 GB.
>>> tf = tarfile.open(fileobj=open('Downloads/clean-alpine.ova', 'rb'), mode='r|')
>>> tfn = tf.getnames()
>>> tfn
['clean-alpine.ovf', 'clean-alpine.mf', 'clean-alpine-disk1.vmdk']
>>> tf.fileobj
<tarfile._Stream object at 0x7ff878dac7b8>
>>> tf.fileobj.pos
33595392
>>> ovf = tf.extractfile('clean-alpine.ovf')
>>> ovf
<ExFileObject name=''>
>>> d = ovf.read().decode()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3.6/tarfile.py", line 696, in read
self.fileobj.seek(offset + (self.position - start))
File "/usr/lib/python3.6/tarfile.py", line 522, in seek
raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed

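Looking at the source of TarFile.extractall(), the important bit is to use the TarFile as an iterable and to read each member while the iterator is positioned on it, like I did in my use case:
from pathlib import Path
# tf is the TarFile opened with mode='r|', as in the question
for member in tf:
    if not member.isfile():
        continue
    # Unsafe as written: member.name is untrusted (path traversal and similar issues)
    dest = Path.cwd() / member.name
    with tf.extractfile(member) as tfobj:
        dest.write_bytes(tfobj.read())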
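
For the question's actual goal (listing the members and reading a few of them without knowing their names in advance), the same single-pass idea can be sketched as below. This is only a sketch under assumptions: the helper name read_small_members and its wanted_suffixes parameter are made up for illustration, the members you want are assumed to be small enough to hold in memory, and the demo archive is built in memory so the example runs on its own. It does not cover re-uploading the stream afterwards.

```python
import io
import tarfile


def read_small_members(fileobj, wanted_suffixes=(".ovf", ".mf")):
    """Read selected members from a tar stream in a single forward pass.

    Returns (names, contents): every member name seen, plus the bytes of
    the members whose name ends with one of wanted_suffixes.
    (Hypothetical helper, not part of the tarfile module.)
    """
    names = []
    contents = {}
    # mode="r|" treats fileobj as a non-seekable stream of tar blocks.
    with tarfile.open(fileobj=fileobj, mode="r|") as tf:
        for member in tf:
            names.append(member.name)
            if member.isfile() and member.name.endswith(wanted_suffixes):
                # Read now: once the iterator advances, this member's
                # data can no longer be reached in stream mode.
                contents[member.name] = tf.extractfile(member).read()
    return names, contents


def _demo_archive():
    """Build a tiny in-memory tar archive for demonstration purposes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        samples = [("a.ovf", b"<ovf/>"), ("a.mf", b"sha"), ("disk.vmdk", b"\0" * 2048)]
        for name, data in samples:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


names, contents = read_small_members(_demo_archive())
print(names)
print(sorted(contents))
```

The large member (disk.vmdk here) is skipped rather than read, so memory use stays bounded by the small members you choose to keep.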

Related

Iterate over pathlib paths and python-docx: zipfile.BadZipFile

My Python skills are a bit rusty, since I have recently been using mostly R. I ran into the following problem: I want to recursively iterate over all .docx files in a directory and change some of their core attributes with the python-docx package.
For the loop, I first created a list with pathlib and glob:
from docx import Document
from docx.shared import Inches
import pathlib
# Reading the stats dir
root_dir = pathlib.Path(r"C:\some\Björn\PycharmProjects\mre_docx")
# Get all word files in the stats directory
files = [x for x in root_dir.glob("**/*.docx") if x.is_file()]
files
Output of files looks fine.
[WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test1.docx'),
WindowsPath('C:/Users/Björn/PycharmProjects/mre_docx/test2.docx')]
When I now want to read in a document with the list I get a zip error (see full traceback below)
document = Document(files[1])
Traceback (most recent call last):
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\IPython\core\interactiveshell.py", line 3441, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-26-482c5438fa33>", line 1, in <module>
document = Document(files[1])
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\package.py", line 128, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\pkgreader.py", line 32, in from_file
phys_reader = PhysPkgReader(pkg_file)
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\site-packages\docx\opc\phys_pkg.py", line 101, in __init__
self._zipf = ZipFile(pkg_file, 'r')
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1257, in __init__
self._RealGetContents()
File "C:\Users\Björn\AppData\Local\Programs\Python\Python39\lib\zipfile.py", line 1324, in _RealGetContents
raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file
However, running the same line of code with an explicit path instead of a list element works fine (the only difference is the path separator, / versus \, which I assumed should not matter because the list contains pathlib.Path objects).
document = Document(pathlib.Path(r"C:\Users\Björn\PycharmProjects\mre_docx\test1.docx"))
Edit to Comment
I created a total of 4 new Word files for this MRE. I entered text in two of them and left the other two empty. To my surprise, I found out that the empty ones result in the error.
for file in files:
    try:
        document = Document(file)
    except Exception:
        print(f"The file: {file} appears to be corrupted")
Output:
The file: C:\Users\Björn\PycharmProjects\mre_docx\new_file.docx appears to be corrupted
The file: C:\Users\Björn\PycharmProjects\mre_docx\test2.docx appears to be corrupted
Semi-Solution for Future Readers
Add a try/except block around the call to Document("Path/to/file.docx") and print out each file for which the call failed. In my case there were just a few, which I could easily edit manually.

You are not doing anything wrong. You get this error because the documents are empty: a .docx file is a ZIP archive, and a file with no content at all is not a valid one. If you open those files and type something, you will not get any error.
According to https://python-docx.readthedocs.io/en/latest/user/documents.html, you can open Word documents in different ways.
First, create a new blank document and save it to the path:
document = Document()
document.save(files[1])
Second, open an existing document and save it back:
document = Document(files[1])
document.save(files[1])
According to the docs, you can also open them as file objects:
with open(files[1], 'rb') as f:
    document = Document(f)
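
Since the failure comes from files that are not ZIP archives, one way to find them up front is to check the container with the standard library before handing the path to python-docx. The following is a minimal sketch, not part of python-docx: the helper name split_valid_docx is hypothetical, and a file that passes this check is only known to be a ZIP archive, not necessarily a well-formed Word document.

```python
import tempfile
import zipfile
from pathlib import Path


def split_valid_docx(paths):
    """Partition paths into (readable, broken) by checking for a ZIP container."""
    readable, broken = [], []
    for path in paths:
        # A .docx file is a ZIP archive; a zero-byte file is not one.
        (readable if zipfile.is_zipfile(path) else broken).append(path)
    return readable, broken


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    empty = tmp_dir / "empty.docx"
    empty.touch()  # zero bytes, like a never-edited "new" document
    zipped = tmp_dir / "zipped.docx"
    with zipfile.ZipFile(zipped, "w") as zf:
        # Stand-in content only; this is not a complete Word document.
        zf.writestr("word/document.xml", "<w:document/>")
    readable, broken = split_valid_docx([empty, zipped])
    print([p.name for p in readable], [p.name for p in broken])
```

In the question's loop, the broken list would take the place of the files reported by the try/except block.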

Python Configparser. Whitespace causes AttributeError

I receive some files together with an .ini file. I have to read the file names from its [FILES] section.
Sometimes there is extra whitespace in another section of the .ini file, which raises an exception in the ConfigParser module.
An example of a "bad" .ini file:
[LETTER]
SUBJECT=some text
some text
 and text with whitespace in the beggining
[FILES]
0=file1.txt
1=file2.doc
My code (Python 3.7):
import configparser
def get_files_from_ini_file(info_file):
    ini = configparser.ConfigParser(allow_no_value=True)
    ini.read(info_file) # ERROR is here
    if ini.has_section("FILES"):
        pocket_files = [ini.get("FILES", i) for i in ini.options("FILES")]
    return pocket_files
print(get_files_from_ini_file("D:\\bad.ini"))
Traceback (most recent call last):
File "D:/test.py", line 10, in <module>
print(get_files_from_ini_file("D:\\bad.ini"))
File "D:/test.py", line 5, in get_files_from_ini_file
ini.read(info_file) # ERROR
File "C:\Users\ap\AppData\Local\Programs\Python\Python37-32\lib\configparser.py", line 696, in read
self._read(fp, filename)
File "C:\Users\ap\AppData\Local\Programs\Python\Python37-32\lib\configparser.py", line 1054, in _read
cursect[optname].append(value)
AttributeError: 'NoneType' object has no attribute 'append'
I have no control over the files I receive, so is there any way to ignore this error? In fact, I only need to parse the [FILES] section.
I have tried empty_lines_in_values=False, with no result.
Maybe this is an invalid .ini file and I should write my own parser?

If you only need the "FILES" part, a simple way is to:
open the file and read into a string
get the part after "[FILES]" using .split() method
add "[FILES]" before the string
use the configparser read_string method on the string
This is a hacky solution but it should work:
import configparser
def get_files_from_ini_file(info_file):
    with open(info_file, 'r') as file:
        ini_string = file.read()
    # Keep only the text from the last "[FILES]" header onwards
    useful_part = "[FILES]" + ini_string.split("[FILES]")[-1]
    ini = configparser.ConfigParser(allow_no_value=True)
    ini.read_string(useful_part)
    pocket_files = []
    if ini.has_section("FILES"):
        pocket_files = [ini.get("FILES", i) for i in ini.options("FILES")]
    return pocket_files
print(get_files_from_ini_file("D:\\bad.ini"))
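
The split-based approach keeps everything after "[FILES]", including any later sections. If those could also be malformed, a line-based variant that keeps only the one section may be safer. This is a sketch under assumptions: the function name read_section_only and the [OTHER] section in the sample text are invented for illustration, and a line counts as a section header only when, after stripping, it starts with "[" and ends with "]".

```python
import configparser


def read_section_only(text, section="FILES"):
    """Parse just one section of INI text, ignoring every other section."""
    header = f"[{section}]"
    kept, inside = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Any section header either starts or ends the region we keep.
            inside = stripped == header
        if inside:
            kept.append(line)
    parser = configparser.ConfigParser(allow_no_value=True)
    parser.read_string("\n".join(kept))
    return dict(parser[section]) if parser.has_section(section) else {}


# Sample text modelled on the question, plus a hypothetical broken trailing section.
bad_ini = (
    "[LETTER]\n"
    "SUBJECT=some text\n"
    "some text\n"
    " and text with whitespace in the beginning\n"
    "[FILES]\n"
    "0=file1.txt\n"
    "1=file2.doc\n"
    "[OTHER]\n"
    "stray\n"
    " continuation\n"
)
files = read_section_only(bad_ini)
print(files)
```

Only the lines belonging to [FILES] ever reach configparser, so problems in other sections cannot raise.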

How to iterate through and delete certain files from Python fcache?

In my PyQt5 app, I've been using fcache (https://pypi.org/project/fcache/) to cache lots of small files to the user's temp folder for speed. It's working well for caching, but now I need to be able to iterate through the cached files and selectively delete files that are no longer needed.
However when I try to iterate through the FileCache object, I'm getting an error.
thisCache is the name of my cache, and if I print(thisCache) I get the repr of a FileCache object, which is fine.
Then if I do print(thisCache.keys()) I get KeysView(<fcache.cache.FileCache object at 0x000001F7BA0F2848>), which seems correct (I think?). Similarly, printing .values() gives me a ValuesView.
Then if I do print(len(thisCache.keys())) I get 1903, showing that there are 1903 files in there, which is probably correct. But here's where I get stuck.
If I try to iterate through the KeysView in any way, I get an error. Each of the following attempts:
for f in thisCache.values():
for f in thisCache.keys():
always throws an error:
Process finished with exit code -1073740791 (0xC0000409)
I'm fairly new to Python, so am I just misunderstanding how I'm supposed to iterate through this list? Or is there a bug or gotcha here that I need to work around?
Thanks
::::::::: EDIT ::::::::
After a bit of a delay, here's a reproducible (but not especially minimal or polished) bit of example code.
import random
import string
from fcache.cache import FileCache
from shutil import copyfile
def random_string(stringLength=10):
    letters = string.ascii_lowercase
    return ''.join(random.choice(letters) for i in range(stringLength))
cacheName = "TestCache"
cache = FileCache(cacheName)
sourceFile = "C:\\TestFile.mov"
targetCount = 50
# copy the file 50 times:
for w in range(1, targetCount+1):
    fileName = random_string(50) + ".mov"
    targetPath = cache.cache_dir + "\\" + fileName
    print("Copying file ", w)
    copyfile(sourceFile, targetPath)
    cache[str(w)] = targetPath
print("Cached", targetCount, "items.")
print("Syncing cache...")
cache.sync()
# iterate through the cache:
print("Item keys:", cache.keys())
for key in cache.keys():
    v = cache[key]
    print(key, v)
print("Cache read.")
There is one dependency, which is having a file called "C:\TestFile.mov" on your system, but the path isn't important so this can be pointed to any file. I've tested with other file formats, with the same result.
The error that is thrown is:
Traceback (most recent call last):
File "C:\Users\stuart.bruce\AppData\Local\Programs\Python\Python37\lib\encodings\hex_codec.py", line 19, in hex_decode
return (binascii.a2b_hex(input), len(input))
binascii.Error: Non-hexadecimal digit found
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\stuart.bruce\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\stuart.bruce\AppData\Local\Programs\Python\Python37\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\stuart.bruce\PycharmProjects\testproject\test_code.py", line 32, in <module>
for key in cache.keys():
File "C:\Users\stuart.bruce\AppData\Local\Programs\Python\Python37\lib\_collections_abc.py", line 720, in __iter__
yield from self._mapping
File "C:\Users\stuart.bruce\AppData\Local\Programs\Python\Python37\lib\site-packages\fcache\cache.py", line 297, in __iter__
yield self._decode_key(key)
File "C:\Users\stuart.bruce\AppData\Local\Programs\Python\Python37\lib\site-packages\fcache\cache.py", line 211, in _decode_key
bkey = codecs.decode(key.encode(self._keyencoding), 'hex_codec')
binascii.Error: decoding with 'hex_codec' codec failed (Error: Non-hexadecimal digit found)
Line 32 of test_code.py (as mentioned in the error) is the line for key in cache.keys():, so this is where a non-hexadecimal character seems to be found. But firstly I'm not sure why, and secondly I don't know how to get around it.
(PS. Please note that if you run this code, you'll end up with 50 copies of your chosen file in your temp folder, and nothing will tidy it up automatically!)

After reading the sources of fcache, it seems that the cache_dir should only be used by fcache itself, as it reads all its files to find previously created cache data.
The program (or, better, the module) crashes because you created the other files in that directory, and it cannot deal with them.
The solution is to use another directory to store those files.
import os
# ...
# Keep the data files in a sibling directory, outside fcache's own cache_dir
data_dir = os.path.join(os.path.dirname(cache.cache_dir), 'data')
os.makedirs(data_dir, exist_ok=True)
for w in range(1, targetCount+1):
    fileName = random_string(50) + ".mov"
    targetPath = os.path.join(data_dir, fileName)
    copyfile(sourceFile, targetPath)
    cache[str(w)] = targetPath

How to turn a comma separated value TXT into a CSV for machine learning

How do I turn this format of TXT file into a CSV file?
Date,Open,high,low,close
1/1/2017,1,2,1,2
1/2/2017,2,3,2,3
1/3/2017,3,4,3,4
As you can see, it already contains comma-separated values.
I tried using numpy.
>>> import numpy as np
>>> table = np.genfromtxt("171028 A.txt", comments="%")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\npyio.py", line 1551, in genfromtxt
fhd = iter(np.lib._datasource.open(fname, 'rb'))
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\_datasource.py", line 151, in open
return ds.open(path, mode)
File "C:\Users\Smith\AppData\Local\Continuum\anaconda3\lib\site-packages\numpy\lib\_datasource.py", line 501, in open
raise IOError("%s not found." % path)
OSError: 171028 A.txt not found.
I have (S&P) 500 txt files to do this with.

You can use the csv module from the standard library.
import csv
txt_file = 'mytext.txt'
csv_file = 'mycsv.csv'
# newline='' is what the csv module expects for files it reads and writes
with open(txt_file, newline='') as src, open(csv_file, 'w', newline='') as dst:
    in_txt = csv.reader(src, delimiter=',')
    out_csv = csv.writer(dst)
    out_csv.writerows(in_txt)
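
Because the question mentions 500 such files, the same conversion can be looped over a directory. The sketch below is only an illustration under assumptions: the function name convert_txt_to_csv is made up, all .txt files are assumed to sit in one directory, and the demo writes a sample file into a temporary directory instead of touching real data.

```python
import csv
import tempfile
from pathlib import Path


def convert_txt_to_csv(directory):
    """Write a .csv copy next to every .txt file in directory."""
    written = []
    for txt_path in sorted(Path(directory).glob("*.txt")):
        csv_path = txt_path.with_suffix(".csv")
        # newline="" lets the csv module control line endings itself.
        with open(txt_path, newline="") as src, open(csv_path, "w", newline="") as dst:
            csv.writer(dst).writerows(csv.reader(src))
        written.append(csv_path)
    return written


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "171028 A.txt"
    sample.write_text("Date,Open,high,low,close\n1/1/2017,1,2,1,2\n")
    outputs = convert_txt_to_csv(tmp)
    output_names = [p.name for p in outputs]
    with open(outputs[0], newline="") as fh:
        rows = list(csv.reader(fh))
    print(output_names, rows)
```

Passing the directory explicitly also sidesteps the "file not found" problem, since nothing depends on the current working directory.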

Per @dclarke's comment, check the directory from which you run the code. As you coded the call, the file must be in that directory. When I have it there, the code runs without error (although the resulting table is a single line with four nan values). When I move the file elsewhere, I reproduce your error quite nicely.
Either move the file to be local, add a local link to the file, or change the file name in your program to use the proper path to the file (either relative or absolute).

configparser loading config files from zip

I am creating a program that loads and runs Python scripts from a compressed file. Along with those scripts, I have a config file that I previously loaded with configparser in an uncompressed version of the program.
Is it possible to read config files inside zip files directly with configparser, or do I have to unzip the file into a temp folder and load it from there?
I have tried directly giving the path:
>>> sysconf = configparser.ConfigParser()
>>> sysconf.read_file("compressed.zip/config_data.conf")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/configparser.py", line 691, in read_file
self._read(f, source)
File "/usr/local/lib/python3.4/configparser.py", line 1058, in _read
raise MissingSectionHeaderError(fpname, lineno, line)
configparser.MissingSectionHeaderError: File contains no section headers.
file: '<???>', line: 1
That didn't work, no surprises there.
Then I tried using zipfile:
>>> zf = zipfile.ZipFile("compressed.zip")
>>> data = zf.read("config_data.conf")
>>> sysconf = configparser.ConfigParser()
>>> sysconf.read_file(data)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/configparser.py", line 691, in read_file
self._read(f, source)
File "/usr/local/lib/python3.4/configparser.py", line 1009, in _read
if line.strip().startswith(prefix):
AttributeError: 'int' object has no attribute 'strip'
and found that it didn't work either.
So I've resorted to creating a temp folder, uncompressing to it, and reading the conf file there. I would really like to avoid this if possible, as the conf files are the only limiting factor. I can load (and am loading) the Python modules from the zip file just fine at this point.
I can get the raw text of the file if there's a way to pass that directly to configparser, but searching the docs I came up empty-handed.
Update:
I tried using io.StringIO as a file object, and it seems to work somewhat.
configparser doesn't reject it, but it doesn't like it either.
>>> zf = zipfile.ZipFile("compressed.zip")
>>> data = zf.read("config_data.conf")
>>> confdata = io.StringIO(str(data))
>>> sysconf = configparser.ConfigParser()
>>> sysconf.readfp(confdata)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/configparser.py", line 736, in readfp
self.read_file(fp, source=filename)
File "/usr/local/lib/python3.4/configparser.py", line 691, in read_file
self._read(f, source)
File "/usr/local/lib/python3.4/configparser.py", line 1058, in _read
raise MissingSectionHeaderError(fpname, lineno, line)
configparser.MissingSectionHeaderError: File contains no section headers.
file: '<???>', line: 1
(continues to spit out the entire contents of the file)
If I use read_file instead, it doesn't error out, but doesn't load anything either.
>>> zf = zipfile.ZipFile("compressed.zip")
>>> data = zf.read("config_data.conf")
>>> confdata = io.StringIO(str(data))
>>> sysconf = configparser.ConfigParser()
>>> sysconf.read_file(confdata)
>>> sysconf.items("General") #(this is the main section in the file)
Traceback (most recent call last):
File "/usr/local/lib/python3.4/configparser.py", line 824, in items
d.update(self._sections[section])
KeyError: 'General'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/configparser.py", line 827, in items
raise NoSectionError(section)
configparser.NoSectionError: No section: 'General'

"I can get the raw text of the file if there's a way to pass that directly to configparser"
Try configparser.ConfigParser.read_string.
When coupled with an appropriate ZIP file, this code works for me:
import zipfile
import configparser
zf = zipfile.ZipFile("compressed.zip")
zf_config = zf.open("config_data.conf")  # binary mode; the "rU" mode was removed in Python 3.6
zf_config_data = zf_config.read().decode('ascii')
config = configparser.ConfigParser()
config.read_string(zf_config_data)
assert config['today']['lunch']=='cheeseburger'
Upon reflection, the following might be more appropriate:
import zipfile
import configparser
import io
zf = zipfile.ZipFile("compressed.zip")
zf_config = zf.open("config_data.conf")  # binary mode; the "rU" mode was removed in Python 3.6
zf_config = io.TextIOWrapper(zf_config)
config = configparser.ConfigParser()
config.read_file(zf_config)
assert config['today']['lunch']=='cheeseburger'
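
The TextIOWrapper approach can be tried without any file on disk by building the archive in memory first. This is a minimal sketch under assumptions: the helper name load_config_from_zip is hypothetical, UTF-8 is assumed as the encoding of the config file, and the [today] section simply mirrors the assertion used above.

```python
import configparser
import io
import zipfile


def load_config_from_zip(zip_source, member, encoding="utf-8"):
    """Parse an INI-style archive member without extracting it to disk."""
    parser = configparser.ConfigParser()
    with zipfile.ZipFile(zip_source) as zf:
        # ZipFile.open yields a binary stream; the wrapper decodes it to text.
        with io.TextIOWrapper(zf.open(member), encoding=encoding) as text:
            parser.read_file(text, source=member)
    return parser


# Build a small archive in memory so the sketch runs on its own.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("config_data.conf", "[today]\nlunch = cheeseburger\n")
archive.seek(0)

config = load_config_from_zip(archive, "config_data.conf")
print(config["today"]["lunch"])
```

Passing source=member only affects the file name shown in configparser error messages.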

As noted in the comments, @matthewatabet's answer won't work with Python 3.4 (and newer versions). In Python 3, ZipFile.open returns a binary file object that yields bytes, whereas configparser expects lines of text. You can use:
codecs.getreader("utf-8")(config_file)
to wrap the binary config_file object in a reader that decodes it as UTF-8 text. The code is now:
import zipfile, configparser, codecs
# Python >= 3.4
with zipfile.ZipFile("compressed.zip") as zf:
    config_file = zf.open("config_data.conf") # binary mode
    sysconfig = configparser.ConfigParser()
    sysconfig.read_file(codecs.getreader("utf-8")(config_file))
That seems more satisfactory than creating a string, but I don't know if it's more efficient...
EDIT Since Python 3.9, the zipfile module provides a zipfile.Path.open method that can handle text and binary modes. Default is text mode. The following code works fine:
# Python >= 3.9
with zipfile.ZipFile("compressed.zip") as zf:
    zip_path = zipfile.Path(zf)
    config_path = zip_path / "config_data.conf"
    config_file = config_path.open() # text mode
    sysconfig = configparser.ConfigParser()
    sysconfig.read_file(config_file)

ZipFile not only supports read but also open, which returns a file-like object. So, you could do something like this:
zf = zipfile.ZipFile("compressed.zip")
config_file = zf.open("config_data.conf")
sysconfig = configparser.ConfigParser()
sysconfig.readfp(config_file)
