How can I process a tarfile with a Python multiprocessing pool?

I'm trying to process the contents of a tarfile using multiprocessing.Pool. I'm able to successfully use the ThreadPool implementation within the multiprocessing module, but would like to be able to use processes instead of threads as it would possibly be faster and eliminate some changes made for Matplotlib to handle the multithreaded environment. I'm getting an error that I suspect is related to processes not sharing address space, but I'm not sure how to fix it:
Traceback (most recent call last):
File "test_tarfile.py", line 32, in <module>
test_multiproc()
File "test_tarfile.py", line 24, in test_multiproc
pool.map(read_file, files)
File "/ldata/whitcomb/epd-7.1-2-rh5-x86_64/lib/python2.7/multiprocessing/pool.py", line 225, in map
return self.map_async(func, iterable, chunksize).get()
File "/ldata/whitcomb/epd-7.1-2-rh5-x86_64/lib/python2.7/multiprocessing/pool.py", line 522, in get
raise self._value
ValueError: I/O operation on closed file
The actual program is more complicated, but this is an example of what I'm doing that reproduces the error:
from multiprocessing.pool import ThreadPool, Pool
import StringIO
import tarfile

def write_tar():
    tar = tarfile.open('test.tar', 'w')
    contents = 'line1'
    info = tarfile.TarInfo('file1.txt')
    info.size = len(contents)
    tar.addfile(info, StringIO.StringIO(contents))
    tar.close()

def test_multithread():
    tar = tarfile.open('test.tar')
    files = [tar.extractfile(member) for member in tar.getmembers()]
    pool = ThreadPool(processes=1)
    pool.map(read_file, files)
    tar.close()

def test_multiproc():
    tar = tarfile.open('test.tar')
    files = [tar.extractfile(member) for member in tar.getmembers()]
    pool = Pool(processes=1)
    pool.map(read_file, files)
    tar.close()

def read_file(f):
    print f.read()

write_tar()
test_multithread()
test_multiproc()
I suspect that something goes wrong when the TarInfo object is passed into the other process while the parent TarFile is not, but I'm not sure how to fix it in the multiprocessing case. Can I do this without having to extract files from the tarball and write them to disk?

You're not passing a TarInfo object into the other process; you're passing the result of tar.extractfile(member) into the other process, where member is a TarInfo object. The extractfile(...) method returns a file-like object which has, among other things, a read() method that operates on the original tar file you opened with tar = tarfile.open('test.tar').
However, you can't use a file opened in one process from another process; you have to re-open the file. I replaced your test_multiproc() with this:
def test_multiproc():
    tar = tarfile.open('test.tar')
    files = [name for name in tar.getnames()]
    pool = Pool(processes=1)
    result = pool.map(read_file2, files)
    tar.close()
And added this:
def read_file2(name):
    t2 = tarfile.open('test.tar')
    print t2.extractfile(name).read()
    t2.close()
and was able to get your code working.
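If re-opening the archive for every member feels wasteful, a variation on the same idea is to open the tar once per worker process using the Pool initializer. A rough sketch building on the example above (the helper names are only illustrative, not part of the original code):

import tarfile
from multiprocessing import Pool

_tar = None  # each worker process gets its own TarFile handle

def init_worker(tar_path):
    # Runs once in each worker process: open a private handle to the archive.
    global _tar
    _tar = tarfile.open(tar_path)

def read_member(name):
    # Read a member through this process's own TarFile handle.
    return _tar.extractfile(name).read()

def test_multiproc_pooled():
    tar = tarfile.open('test.tar')
    names = tar.getnames()
    tar.close()
    pool = Pool(processes=1, initializer=init_worker, initargs=('test.tar',))
    print pool.map(read_member, names)

Each worker re-opens test.tar exactly once instead of once per file.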

Related

Pytsk - Sending files to a server from a disk image

I am trying to send each file from a disk image to a remote server using paramiko.
class Server:
    def __init__(self):
        self.ssh = paramiko.SSHClient()
        self.ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        self.ssh.connect('xxx', username='xxx', password='xxx')

    def send_file(self, i_node, name):
        sftp = self.ssh.open_sftp()
        serverpath = '/home/paul/Testing/'
        try:
            sftp.chdir(serverpath)
        except IOError:
            sftp.mkdir(serverpath)
            sftp.chdir(serverpath)
        serverpath = '/home/Testing/' + name
        sftp.putfo(fs.open_meta(inode = i_node), serverpath)
However, when I run this I get an error saying "pytsk.File has no attribute read".
Is there any other way of sending this file to the server?
After a quick investigation I think I found your problem. Paramiko's sftp.putfo expects a Python file object as its first parameter. A Pytsk3 file object is a completely different thing: your sftp object tries to call "read" on it, but the Pytsk3 file object has no "read" method, hence the error.
You could in theory try extending the Pytsk3.File class and adding this method, but I would not hold my breath that it actually works.
I would just read the file into a temporary one and send that. Something like this (you would need to make the temp file name handling cleverer and delete the file afterwards, but you get the idea):
serverpath = '/home/Testing/' + name
tmp_path = "/tmp/xyzzy"
file_obj = fs.open_meta(inode = i_node)
# Add here tests to confirm this is actually a file, not a directory
tha = open(tmp_path, "wb")
tha.write(file_obj.read_random(0, file_obj.info.meta.size))
tha.close()
rha = open(tmp_path, "rb")
sftp.putfo(rha, serverpath)
rha.close()
# Delete temp file here
Hope this helps. Note that this reads the whole file from the fs image into memory before writing it to the temp file, so if the file is massive you could run out of memory.
To work around that, read the file in chunks, looping over it with read_random in suitably sized pieces (the parameters are the start offset and the amount of data to read), so you can build up the temp file in chunks of, say, a couple of megabytes.
This is just a simple example to illustrate your problem.
Hannu
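For completeness, a rough sketch of that chunked variant described above, reusing the read_random(offset, length) call from the example (the helper name and chunk size are only illustrative):

def copy_to_temp_in_chunks(file_obj, tmp_path, chunk_size=2 * 1024 * 1024):
    # Copy a pytsk3 file to a local temp file a couple of megabytes at a time.
    size = file_obj.info.meta.size
    with open(tmp_path, "wb") as out:
        offset = 0
        while offset < size:
            chunk = file_obj.read_random(offset, min(chunk_size, size - offset))
            if not chunk:
                break
            out.write(chunk)
            offset += len(chunk)

You would then sftp.putfo an open handle to tmp_path as before and delete the temp file afterwards.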

Picking file path and file name with async file download

I am currently using this code (python 3.5.2):
from multiprocessing.dummy import Pool
from urllib.request import urlretrieve
urls = ["link"]
result = Pool(4).map(urlretrieve, urls)
print(result[0][0])
It works, but the file gets saved to a temp location with some weird name. Is there a way to pick a file path and possibly a file name, as well as add a file extension? It gets saved without one.
Thanks!
You simply need to supply a location to urlretrieve. However, pool.map doesn't appear to support multiple arguments for the mapped function (Python multiprocessing pool.map for multiple arguments). So you can either refactor, as described there, or use a different multiprocessing primitive, e.g. Process:
from multiprocessing import Process
from urllib.request import urlretrieve

urls = ["link", "otherlink"]
filenames = ["{}.html".format(i) for i in urls]
args = zip(urls, filenames)
for arg in args:
    p = Process(target=urlretrieve, args=arg)
    p.start()
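If you'd rather keep a pool, Python 3.3+ (the question mentions 3.5.2) also provides Pool.starmap, which unpacks argument tuples for you. A minimal sketch of that alternative, with placeholder URLs and file names:

from multiprocessing.dummy import Pool  # thread pool, as in the original snippet
from urllib.request import urlretrieve

urls = ["link", "otherlink"]                     # placeholder URLs
filenames = ["{}.html".format(u) for u in urls]  # choose your own paths and extensions
results = Pool(4).starmap(urlretrieve, zip(urls, filenames))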
In the comments you say you only need to download 1 url. In that case it is very easy:
from urllib.request import urlretrieve
urlretrieve("https://yahoo.com", "where_to_save.html")
Then the file will be saved in where_to_save.html. You can of course provide a full path there, e.g. /where/exactly/to/save.html.

Reading appengine backup_info file gives EOFError

I'm trying to inspect my appengine backup files to work out when a data corruption occured. I used gsutil to locate and download the file:
gsutil ls -l gs://my_backup/ > my_backup.txt
gsutil cp gs://my_backup/LongAlphaString.Mymodel.backup_info file://1.backup_info
I then created a small python program, attempting to read the file and parse it using the appengine libraries.
#!/usr/bin/python

APPENGINE_PATH = '/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/'
ADDITIONAL_LIBS = [
    'lib/yaml/lib'
]

import sys
sys.path.append(APPENGINE_PATH)
for l in ADDITIONAL_LIBS:
    sys.path.append(APPENGINE_PATH + l)

import logging
from google.appengine.api.files import records
import cStringIO

def parse_backup_info_file(content):
    """Returns entities iterator from a backup_info file content."""
    reader = records.RecordsReader(cStringIO.StringIO(content))
    version = reader.read()
    if version != '1':
        raise IOError('Unsupported version')
    return (datastore.Entity.FromPb(record) for record in reader)

INPUT_FILE_NAME = '1.backup_info'
f = open(INPUT_FILE_NAME, 'rb')
f.seek(0)
content = f.read()
records = parse_backup_info_file(content)
for r in records:
    logging.info(r)
f.close()
The code for parse_backup_info_file was copied from
backup_handler.py
When I run the program, I get the following output:
./view_record.py
Traceback (most recent call last):
File "./view_record.py", line 30, in <module>
records = parse_backup_info_file(content)
File "./view_record.py", line 19, in parse_backup_info_file
version = reader.read()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/files/records.py", line 335, in read
(chunk, record_type) = self.__try_read_record()
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/api/files/records.py", line 307, in __try_read_record
(length, len(data)))
EOFError: Not enough data read. Expected: 24898 but got 2112
I've tried with a half a dozen different backup_info files, and they all show the same error (with different numbers.)
I initially noticed that they all seemed to have the same expected length, but that was only because I was reviewing different versions of the same model when I made that observation; it's not true when I view the backup files of other modules.
EOFError: Not enough data read. Expected: 24932 but got 911
EOFError: Not enough data read. Expected: 25409 but got 2220
Is there anything obviously wrong with my approach?
I guess the other option is that the appengine backup utility is not creating valid backup files.
Anything else you can suggest would be very welcome.
Thanks in Advance
There are multiple metadata files created when an AppEngine Datastore backup is run:
LongAlphaString.backup_info is created once. This contains metadata about all of the entity types and backup files that were created in datastore backup.
LongAlphaString.[EntityType].backup_info is created once per entity type. This contains metadata about the specific backup files created for [EntityType], along with schema information for the [EntityType].
Your code works for interrogating the file contents of LongAlphaString.backup_info, however it seems that you are trying to interrogate the file contents of LongAlphaString.[EntityType].backup_info. Here's a script that will print the contents in a human-readable format for each file type:
import cStringIO
import os
import sys

sys.path.append('/usr/local/google_appengine')

from google.appengine.api import datastore
from google.appengine.api.files import records
from google.appengine.ext.datastore_admin import backup_pb2

ALL_BACKUP_INFO = 'long_string.backup_info'
ENTITY_KINDS = ['long_string.entity_kind.backup_info']

def parse_backup_info_file(content):
    """Returns entities iterator from a backup_info file content."""
    reader = records.RecordsReader(cStringIO.StringIO(content))
    version = reader.read()
    if version != '1':
        raise IOError('Unsupported version')
    return (datastore.Entity.FromPb(record) for record in reader)

print "*****" + ALL_BACKUP_INFO + "*****"
with open(ALL_BACKUP_INFO, 'r') as myfile:
    parsed = parse_backup_info_file(myfile.read())
    for record in parsed:
        print record

for entity_kind in ENTITY_KINDS:
    print os.linesep + "*****" + entity_kind + "*****"
    with open(entity_kind, 'r') as myfile:
        backup = backup_pb2.Backup()
        backup.ParseFromString(myfile.read())
        print backup

How to print the content of zipped gzip'd files

Ok, so I have a zip file that contains gz files (unix gzip).
Here's what I do --
def parseSTS(file):
    import zipfile, re, io, gzip
    with zipfile.ZipFile(file, 'r') as zfile:
        for name in zfile.namelist():
            if re.search(r'\.gz$', name) != None:
                zfiledata = zfile.open(name)
                print("start for file ", name)
                with gzip.open(zfiledata, 'r') as gzfile:
                    print("done opening")
                    filecontent = gzfile.read()
                    print("done reading")
                    print(filecontent)
This gives the following result --
>>>
start for file XXXXXX.gz
done opening
done reading
Then it stays like that forever until it crashes ...
What can I do with filecontent?
Edit: this is not a duplicate, since my gzipped files are inside a zip file and I'm trying to avoid extracting that zip file to disk. It works with zip files inside a zip file, as per How to read from a zip file within zip file in Python?.
I created a zip file containing a gzip'ed PDF file I grabbed from the web.
I ran this code (with two small changes):
1) Fixed indenting of everything under the def statement (which I also corrected in your Question because I'm sure that it's right on your end or it wouldn't get to the problem you have).
2) I changed:
zfiledata = zfile.open(name)
print("start for file ", name)
with gzip.open(zfiledata, 'r') as gzfile:
    print("done opening")
    filecontent = gzfile.read()
    print("done reading")
    print(filecontent)
to:
print("start for file ", name)
with gzip.open(name,'rb') as gzfile:
print("done opening")
filecontent = gzfile.read()
print("done reading")
print(filecontent)
Because you were passing a file object to gzip.open instead of a string. I have no idea how your code is executing without that change, but it was crashing for me until I fixed it.
EDIT: Adding link to GZIP docs from James R's answer --
Also, see here for further documentation:
http://docs.python.org/2/library/gzip.html#examples-of-usage
END EDIT
Now, since my gzip'ed file is small, the behavior I observe is that it pauses for about 3 seconds after printing done reading, then outputs what is in filecontent.
I would suggest adding the following debugging line after your print "done reading" -- print len(filecontent). If this number is very, very large, consider not printing the entire file contents in one shot.
I would also suggest reading this for more insight into what I expect is your problem: Why is printing to stdout so slow? Can it be sped up?
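For example, a minimal sketch of that suggestion (the 1000-byte slice is arbitrary):

print("length:", len(filecontent))  # the debugging line suggested above
print(filecontent[:1000])           # print only a slice instead of the whole payload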
EDIT 2 - an alternative if your system does not handle file io on zip files, causing no such file errors in the above:
def parseSTS(afile):
    import zipfile
    import zlib
    import gzip
    import io
    with zipfile.ZipFile(afile, 'r') as archive:
        for name in archive.namelist():
            if name.endswith('.gz'):
                bfn = archive.read(name)
                bfi = io.BytesIO(bfn)
                g = gzip.GzipFile(fileobj=bfi, mode='rb')
                qqq = g.read()
                print qqq

parseSTS('t.zip')
Most likely your problem lies here:
if name.endswith(".gz"): #as goncalopp said in the comments, use endswith
#zfiledata = zfile.open(name) #don't do this
#print("start for file ", name)
with gzip.open(name,'rb') as gzfile: #gz compressed files should be read in binary and gzip opens the files directly
#print("done opening") #trust in your program, luke
filecontent = gzfile.read()
#print("done reading")
print(filecontent)
See here for further documentation:
http://docs.python.org/2/library/gzip.html#examples-of-usage

"'NoneType' object is not iterable" error

Just wrote my first Python program! I get zip files as email attachments, which are saved in a local folder. The program checks if there is any new file, and if there is one, it extracts the zip file and, based on the filename, extracts its contents to different folders. When I run my code I get the following error:
Traceback (most recent call last):
File "C:/Zip/zipauto.py", line 28, in <module>
for file in new_files:
TypeError: 'NoneType' object is not iterable
Can anyone please tell me where I am going wrong?
Thanks a lot for your time,
Navin
Here is my code:
import zipfile
import os

ROOT_DIR = 'C://Zip//Zipped//'
destinationPath1 = "C://Zip//Extracted1//"
destinationPath2 = "C://Zip//Extracted2//"

def check_for_new_files(path=ROOT_DIR):
    new_files = []
    for file in os.listdir(path):
        print "New file found ... ", file

def process_file(file):
    sourceZip = zipfile.ZipFile(file, 'r')
    for filename in sourceZip.namelist():
        if filename.startswith("xx") and filename.endswith(".csv"):
            sourceZip.extract(filename, destinationPath1)
        elif filename.startswith("yy") and filename.endswith(".csv"):
            sourceZip.extract(filename, destinationPath2)
    sourceZip.close()

if __name__ == "__main__":
    while True:
        new_files = check_for_new_files(ROOT_DIR)
        for file in new_files:  # fails here
            print "Unzipping files ... ", file
            process_file(ROOT_DIR + "/" + file)
check_for_new_files has no return statement, so it implicitly returns None. Therefore,
new_files=check_for_new_files(ROOT_DIR)
sets new_files to None, and you cannot iterate over None.
Return the read files in check_for_new_files:
def check_for_new_files(path=ROOT_DIR):
    new_files = os.listdir(path)
    for file in new_files:
        print "New file found ... ", file
    return new_files
Here is the answer to your NEXT 2 questions:
(1) while True:: your code will loop forever.
(2) your function check_for_new_files doesn't check for new files, it checks for any files. You need to either move each incoming file to an archive directory after it's been processed, or use some kind of timestamp mechanism.
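A rough sketch of the archive-directory approach from suggestion (2) (the directory name and helper are hypothetical, not part of the original code):

import os
import shutil

PROCESSED_DIR = 'C://Zip//Processed//'  # hypothetical archive directory

def archive_file(file):
    # Move a processed zip out of ROOT_DIR so it isn't picked up again.
    if not os.path.isdir(PROCESSED_DIR):
        os.makedirs(PROCESSED_DIR)
    shutil.move(os.path.join(ROOT_DIR, file), os.path.join(PROCESSED_DIR, file))

You would call archive_file(file) right after process_file(...) in the main loop.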
For example, with student_grade = dict(zip(names, grades)), make sure names and grades are lists and that both have at least one item to iterate over. This has helped me.
