Python hashing folder contents results in memory error

Python hashing folder contents results in memory error - python

Context
In an automated test case I download a folder containing
txt files, and
encrypted files with various 3 digit extensions(i.e. .001, .004)
and hash every one of these files in the following manner:
def hash_directory(path):
hashed_files = {}
if os.path.isfile(path):
md5 = hashlib.md5()
with open(path, 'rb') as f:
while True:
buf = f.read(65536)
if not buf : break
md5.update(buf)
return {os.path.basename(path):md5.hexdigest()}
else:
for dir_name, dir_names, file_names in os.walk(path):
for filename in file_names:
md5 = hashlib.md5()
file_path = os.path.join(dir_name.replace(path, '').strip('\\'),filename)
with open(os.path.join(dir_name, filename), 'rb') as f:
part = 0
while True:
buf = f.read(65536)
if not buf: break
md5.update(buf)
#if the md5 object gets larger than a quarter MB, then digest a part of the file.
if sys.getsizeof(md5)>262144:
hashed_files[file_path+'part'+str(part)] = md5.hexdigest()
part+=1
hashed_files[file_path] = md5.hexdigest()
return hashed_files
I do the same for a folder being held on a server on my network(this is the "expected"), and compare the hashed_files to determine if there are missing files in the downloaded folder, or if any files are corrupted.
Problem
I get this:
Traceback (most recent call last):
File "c:\python27\lib\runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "c:\python27\lib\runpy.py", line 72, in _run_code
exec code in run_globals
File "c:\python27\lib\site-packages\robot\run.py", line 550, in <module>
run_cli(sys.argv[1:])
File "c:\python27\lib\site-packages\robot\run.py", line 489, in run_cli
return RobotFramework().execute_cli(arguments, exit=exit)
File "c:\python27\lib\site-packages\robot\utils\application.py", line 46, in execute_cli
rc = self._execute(arguments, options)
File "c:\python27\lib\site-packages\robot\utils\application.py", line 90, in _execute
error, details = get_error_details(exclude_robot_traces=False)
File "c:\python27\lib\site-packages\robot\utils\error.py", line 47, in get_error_details
details = ErrorDetails(exclude_robot_traces=exclude_robot_traces)
File "c:\python27\lib\site-packages\robot\utils\error.py", line 60, in ErrorDetails
raise exc_value
MemoryError
We use robot framework for writing the test cases, python27, and this occurs both on my win10 16gb machine, and on remote servers running the code.
Would anyone have a sniff of what this could be ?
Does anyone know of a python tool to monitor memory as code is run ?

Related

IOError : [Error no: 21] is a directory : './w2v-model/wordmodel3'

def generate_w2vModel(decTokenFlawPath, w2vModelPath):
print("training...")
model = Word2Vec(sentences= DirofCorpus(decTokenFlawPath), size=30, alpha=0.01, window=5, min_count=0, max_vocab_size=None, sample=0.001, seed=1, workers=1, min_alpha=0.0001, sg=1, hs=0, negative=10, iter=5)
model.save(w2vModelPath)
def evaluate_w2vModel(w2vModelPath):
print("\nevaluating...")
model = Word2Vec.load(w2vModelPath)
for sign in ['(', '+', '-', '*', 'main']:
print(sign, ":")
print(model.most_similar_cosmul(positive=[sign], topn=10))
def main():
dec_tokenFlaw_path = ['./data/cdg_ddg/corpus/']
w2v_model_path = "./w2v_model/wordmodel3"
generate_w2vModel(dec_tokenFlaw_path, w2v_model_path)
evaluate_w2vModel(w2v_model_path)
print("success!")
this the python that i was running. This file is used to train word2vec model. The inputs are corpus files, and the output is the word2vec model. i got the following error :
Traceback (most recent call last):
File "create_w2vmodel.py", line 67, in <module>
main()
File "create_w2vmodel.py", line 62, in main
generate_w2vModel(dec_tokenFlaw_path, w2v_model_path)
File "create_w2vmodel.py", line 50, in generate_w2vModel
model.save(w2vModelPath)
File "/usr/local/lib/python2.7/dist-packages/gensim-3.4.0-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 930, in save
super(Word2Vec, self).save(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/gensim-3.4.0-py2.7-linux-x86_64.egg/gensim/models/base_any2vec.py", line 281, in save
super(BaseAny2VecModel, self).save(fname_or_handle, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/gensim-3.4.0-py2.7-linux-x86_64.egg/gensim/utils.py", line 691, in save
self._smart_save(fname_or_handle, separately, sep_limit, ignore, pickle_protocol=pickle_protocol)
File "/usr/local/lib/python2.7/dist-packages/gensim-3.4.0-py2.7-linux-x86_64.egg/gensim/utils.py", line 550, in _smart_save
pickle(self, fname, protocol=pickle_protocol)
File "/usr/local/lib/python2.7/dist-packages/gensim-3.4.0-py2.7-linux-x86_64.egg/gensim/utils.py", line 1311, in pickle
with smart_open(fname, 'wb') as fout: # 'b' for binary, needed on Windows
File "build/bdist.linux-x86_64/egg/smart_open/smart_open_lib.py", line 89, in smart_open
File "build/bdist.linux-x86_64/egg/smart_open/smart_open_lib.py", line 301, in file_smart_open
IOError: [Errno 21] Is a directory: './w2v_model/wordmodel3'
please help me to change this particular error. i think there is no folder like this, but i had already created the w2v_model/wordmodel3 in my folder. I had tried it in many ways. I will provide smart_open_lib.py program file below :
def file_smart_open(fname, mode='rb'):
"""
Stream from/to local filesystem, transparently (de)compressing gzip and bz2
files if necessary.
"""
_, ext = os.path.splitext(fname)
if ext == '.bz2':
PY2 = sys.version_info[0] == 2
if PY2:
from bz2file import BZ2File
else:
from bz2 import BZ2File
return make_closing(BZ2File)(fname, mode)
if ext == '.gz':
from gzip import GzipFile
return make_closing(GzipFile)(fname, mode)
return open(fname, mode)
this is the trace back code which they tell. kindly request to help me fr change this error!!!

The Word2Vec .save() method needs a new path, to the filename (not directory!) to which you want to save the model's main file. (For most models of significant size, there will be extra files alongside that main file, with other extensions, that are also part of the model-save.)
If you're providing a path to an existing directory, you'll get an error like this.
Change your w2v_model_path to be a path to a desired file name, not an already-existing directory. (Perhaps that path will be inside that directory. But it shouldn't be the directory itself!)

Reading password protected Word Documents with zipfile

I am trying to read a password protected word document on Python using zipfile.
The following code works with a non-password protected document, but gives an error when used with a password protected file.
try:
from xml.etree.cElementTree import XML
except ImportError:
from xml.etree.ElementTree import XML
import zipfile
psw = "1234"
WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'
def get_docx_text(path):
document = zipfile.ZipFile(path, "r")
document.setpassword(psw)
document.extractall()
xml_content = document.read('word/document.xml')
document.close()
tree = XML(xml_content)
paragraphs = []
for paragraph in tree.getiterator(PARA):
texts = [node.text
for node in paragraph.getiterator(TEXT)
if node.text]
if texts:
paragraphs.append(''.join(texts))
return '\n\n'.join(paragraphs)
When running get_docx_text() with a password protected file, I received the following error:
Traceback (most recent call last):
File "<ipython-input-15-d2783899bfe5>", line 1, in <module>
runfile('/Users/username/Workspace/Python/docx2txt.py', wdir='/Users/username/Workspace/Python')
File "/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/spyderlib/widgets/externalshell/sitecustomize.py", line 680, in runfile
execfile(filename, namespace)
File "/Applications/Spyder-Py2.app/Contents/Resources/lib/python2.7/spyderlib/widgets/externalshell/sitecustomize.py", line 78, in execfile
builtins.execfile(filename, *where)
File "/Users/username/Workspace/Python/docx2txt.py", line 41, in <module>
x = get_docx_text("/Users/username/Desktop/file.docx")
File "/Users/username/Workspace/Python/docx2txt.py", line 23, in get_docx_text
document = zipfile.ZipFile(path, "r")
File "zipfile.pyc", line 770, in __init__
File "zipfile.pyc", line 811, in _RealGetContents
BadZipfile: File is not a zip file
Does anyone have any advice to get this code to work?

I don't think this is an encryption problem, for two reasons:
Decryption is not attempted when the ZipFile object is created. Methods like ZipFile.extractall, extract, and open, and read take an optional pwd parameter containing the password, but the object constructor / initializer does not.
Your stack trace indicates that the BadZipFile is being raised when you create the ZipFile object, before you call setpassword:
document = zipfile.ZipFile(path, "r")
I'd look carefully for other differences between the two files you're testing: ownership, permissions, security context (if you have that on your OS), ... even filename differences can cause a framework to "not see" the file you're working on.
Also --- the obvious one --- try opening the encrypted zip file with your zip-compatible command of choice. See if it really is a zip file.
I tested this by opening an encrypted zip file in Python 3.1, while "forgetting" to provide a password. I could create the ZipFile object (the variable zfile below) without any error, but got a RuntimeError --- not a BadZipFile exception --- when I tried to read a file without providing a password:
Traceback (most recent call last):
File "./zf.py", line 35, in <module>
main()
File "./zf.py", line 29, in main
print_checksums(zipfile_name)
File "./zf.py", line 22, in print_checksums
for checksum in checksum_contents(zipfile_name):
File "./zf.py", line 13, in checksum_contents
inner_file = zfile.open(inner_filename, "r")
File "/usr/lib64/python3.1/zipfile.py", line 903, in open
"password required for extraction" % name)
RuntimeError: File apache.log is encrypted, password required for extraction
I was also able to raise a BadZipfile exception, once by trying to open an empty file and once by trying to open some random logfile text that I'd renamed to a ".zip" extension. The two test files produced identical stack traces, down to the line numbers.
Traceback (most recent call last):
File "./zf.py", line 35, in <module>
main()
File "./zf.py", line 29, in main
print_checksums(zipfile_name)
File "./zf.py", line 22, in print_checksums
for checksum in checksum_contents(zipfile_name):
File "./zf.py", line 10, in checksum_contents
zfile = zipfile.ZipFile(zipfile_name, "r")
File "/usr/lib64/python3.1/zipfile.py", line 706, in __init__
self._GetContents()
File "/usr/lib64/python3.1/zipfile.py", line 726, in _GetContents
self._RealGetContents()
File "/usr/lib64/python3.1/zipfile.py", line 738, in _RealGetContents
raise BadZipfile("File is not a zip file")
zipfile.BadZipfile: File is not a zip file
While this stack trace isn't exactly the same as yours --- mine has a call to _GetContents, and the pre-3.2 "small f" spelling of BadZipfile --- but they're close enough that I think this is the kind of problem you're dealing with.

Django: How to allow a Suspicious File Operation / copy a file

I want to do a SuspiciousFileOperation which django disallows by default.
I am writing a command (to run via manage.py importfiles) to import a given directory structure on the real file system in my self written filestorage in Django.
I think, this is my relevant code:
def _handle_directory(self, directory_path, directory):
for root, subFolders, files in os.walk(directory_path):
for filename in files:
self.cnt_files += 1
new_file = File(directory=directory, filename=filename, file=os.path.join(root, filename),
uploader=self.uploader)
new_file.save()
The backtrace is:
Traceback (most recent call last):
File ".\manage.py", line 10, in <module>
execute_from_command_line(sys.argv)
File "C:\Python27\lib\site-packages\django\core\management\__init__.py", line 399, in execute_from_command_line
utility.execute()
File "C:\Python27\lib\site-packages\django\core\management\__init__.py", line 392, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "C:\Python27\lib\site-packages\django\core\management\base.py", line 242, in run_from_argv
self.execute(*args, **options.__dict__)
File "C:\Python27\lib\site-packages\django\core\management\base.py", line 285, in execute
output = self.handle(*args, **options)
File "D:\Development\github\Palco\engine\filestorage\management\commands\importfiles.py", line 53, in handle
self._handle_directory(args[0], root)
File "D:\Development\github\Palco\engine\filestorage\management\commands\importfiles.py", line 63, in _handle_directory
new_file.save()
File "D:\Development\github\Palco\engine\filestorage\models.py", line 157, in save
self.sha512 = hashlib.sha512(self.file.read()).hexdigest()
File "C:\Python27\lib\site-packages\django\core\files\utils.py", line 16, in <lambda>
read = property(lambda self: self.file.read)
File "C:\Python27\lib\site-packages\django\db\models\fields\files.py", line 46, in _get_file
self._file = self.storage.open(self.name, 'rb')
File "C:\Python27\lib\site-packages\django\core\files\storage.py", line 33, in open
return self._open(name, mode)
File "C:\Python27\lib\site-packages\django\core\files\storage.py", line 160, in _open
return File(open(self.path(name), mode))
File "C:\Python27\lib\site-packages\django\core\files\storage.py", line 261, in path
raise SuspiciousFileOperation("Attempted access to '%s' denied." % name)
django.core.exceptions.SuspiciousFileOperation: Attempted access to 'D:\Temp\importme\readme.html' denied.
The full model can be found at GitHub. The full command is currently on gist.github.com available.
If you do not want to check the model: the attribute file of my File class is a FileField.
I assume, this problem happens, because I am just "linking" to the file found. But I need to copy it, huh? How can I copy the file into the file?

In Django, SuspiciousFileOperation can be avoid by read the file from external dir and make a tmp file within the project media then save in the appropriate file filed as below
import tempfile
file_name="file_name.pdf"
EXT_FILE_PATH = "/home/somepath/"
file_path = EXT_FILE_PATH + file_name
if exists(file_path):
#create a named temporary file within the project base , here in media
lf = tempfile.NamedTemporaryFile(dir='media')
f = open(file_path, 'rb')
lf.write(f.read())
#doc object with file FileField.
doc.file.save(file_name, File(lf), save=True)
lf.close()

I haven't faced similar problem but related issue. I have recently upgraded Django 1.8 to 1.11.
Now I am getting the following error if try to save a file in a model having FileField field:
SuspiciousFileOperation at /api/send_report/
The joined path (/vagrant/tmp/test_file.pdf) is located outside of the base path component (/vagrant/media)
My model where I want to save the file:
class Report(BaseModel):
file = models.FileField(max_length=200, upload_to=os.path.join(settings.REPORTS_URL, '%Y/week_%W/'))
type = models.CharField(max_length=20, verbose_name='Type', blank=False, default='', db_index=True)
I am trying following codes to save the file from tmp folder which is not located in MEDIA_ROOT:
from django.core.files import File
filepath = "/vagrant/tmp/test_file.pdf"
file = File(open(filepath, "rb"))
report_type = "My_report_type"
report = Report.objects.create(
file=file,
type=report_type,
)
What I have done to solve the issue:
import os
from django.core.files import File
filepath = "/vagrant/tmp/test_file.pdf"
file = File(open(filepath, "rb"))
file_name = os.path.basename(file.name)
report_type = "My_report_type"
report = Report.objects.create(
type=report_type,
)
report.file.save(file_name, file, save=True)
Hope it will help someone.

Analyzing this part of stacktrace:
File "C:\Python27\lib\site-packages\django\core\files\storage.py", line 261, in path
raise SuspiciousFileOperation("Attempted access to '%s' denied." % name)
leads to the standard Django FileSystemStorage. It expects files to be within your MEDIA_ROOT. Your files can be anywhere in the file system, therefore this problem occurs.
You should pass file-like object instead of a path to your File model. The easiest way to achieve that would be to use Django File class, which is a wrapper around python file-like objects. See File object documentation for more details.
Update:
Ok, I am suggesting here a route taken from the docs:
from django.core.files import File as FileWrapper
def _handle_directory(self, directory_path, directory):
for root, subFolders, files in os.walk(directory_path):
for filename in files:
self.cnt_files += 1
new_file = File(
directory=directory, filename=filename,
file=os.path.join(root, filename),
uploader=self.uploader)
with open(os.path.join(root, filename), 'r') as f:
file_wrapper = FileWrapper(f)
new_file = File(
directory=directory, filename=filename,
file=file_wrapper,
uploader=self.uploader)
new_file.save()
If it works it should copy the file to the location provided by your secure_storage callable.

Check if there is slash before your filepath
file_item = models.FileField(upload_to=content_file_name)
def content_file_name(username, filename):
return '/'.join(['content', username, filename])
Note here "content" not "/content". That was the problem for me.

opening corrupted .tgz in python

I should just unzip RESULT folder inside of output.tgz archive
#!/usr/bin/python
import os, sys, tarfile, gzip
def unZip(self,file, filename,dire):
tar = tarfile.open(filename,"r:gz")
for file in tar.getmembers():
if file.name.startswith("RESULT"):
tar.extract(file,dire)
print 'Done.'
def main():
utilities = ZipUtil()
filename = 'output.tgz'
directory = str(sys.argv[1]);
path = os.path.join(directory, filename)
utilities.unZip(directory,path,directory)
main()
and I'm getting this error
File "/usr/lib64/python2.7/tarfile.py", line 1805, in getmembers
self._load() # all members, we first have to
File "/usr/lib64/python2.7/tarfile.py", line 2380, in _load
tarinfo = self.next()
File "/usr/lib64/python2.7/tarfile.py", line 2315, in next
self.fileobj.seek(self.offset)
File "/usr/lib64/python2.7/gzip.py", line 429, in seek
self.read(1024)
File "/usr/lib64/python2.7/gzip.py", line 256, in read
self._read(readsize)
File "/usr/lib64/python2.7/gzip.py", line 303, in _read
self._read_eof()
File "/usr/lib64/python2.7/gzip.py", line 342, in _read_eof
hex(self.crc)))
IOError: CRC check failed 0xccf8af55 != 0x43cb4d43L
There are many questions like this but still now answer for this proble. The file output.tgz is downloaded and it might be corrupted this is why I guess its not working. But is there anyway around to tell tar to open it. THnak a lot!

why won't recursive ftp work in this directory?

Python newb...beware!
I am trying to recursively ftp some files. In the script below, I get an error:
Traceback (most recent call last):
File "dump.py", line 7, in <module>
for root,dirs,files in recursive:
File "/usr/local/lib/python2.6/site-packages/ftputil/ftputil.py", line 880, in walk
if self.path.isdir(self.path.join(top, name)):
File "/usr/local/lib/python2.6/site-packages/ftputil/ftp_path.py", line 133, in isdir
path, _exception_for_missing_path=False)
File "/usr/local/lib/python2.6/site-packages/ftputil/ftputil.py", line 860, in stat
return self._stat._stat(path, _exception_for_missing_path)
File "/usr/local/lib/python2.6/site-packages/ftputil/ftp_stat.py", line 624, in _stat
_exception_for_missing_path)
File "/usr/local/lib/python2.6/site-packages/ftputil/ftp_stat.py", line 578, in __call_with_parser_retry
result = method(*args, **kwargs)
File "/usr/local/lib/python2.6/site-packages/ftputil/ftp_stat.py", line 543, in _real_stat
lstat_result = self._real_lstat(path, _exception_for_missing_path)
File "/usr/local/lib/python2.6/site-packages/ftputil/ftp_stat.py", line 502, in _real_lstat
for stat_result in self._stat_results_from_dir(dirname):
File "/usr/local/lib/python2.6/site-packages/ftputil/ftp_stat.py", line 419, in _stat_results_from_dir
lines = self._host_dir(path)
File "/usr/local/lib/python2.6/site-packages/ftputil/ftp_stat.py", line 411, in _host_dir
return self._host._dir(path)
File "/usr/local/lib/python2.6/site-packages/ftputil/ftputil.py", line 811, in _dir
descend_deeply=True)
File "/usr/local/lib/python2.6/site-packages/ftputil/ftputil.py", line 578, in _robust_ftp_command
self.chdir(path)
File "/usr/local/lib/python2.6/site-packages/ftputil/ftputil.py", line 603, in chdir
ftp_error._try_with_oserror(self._session.cwd, path)
File "/usr/local/lib/python2.6/site-packages/ftputil/ftp_error.py", line 146, in _try_with_oserror
raise PermanentError(*exc.args)
ftputil.ftp_error.PermanentError: 550 /usr/local/web: No such file or directory
Debugging info: ftputil 2.6, Python 2.6.5 (linux2)
Exited: 256
I have no idea what that means. However, when I change the directory to "/path/dir2" it works and prints out "file.txt".
Here's my directory structures:
/path/dir1/another/file.txt # I get the above error with this path
/path/dir2/another/file.txt # this works fine, even though the directories have the same structure
My script:
import ftputil
ftp = ftputil.FTPHost('ftp.site.com','user','pass')
recursive = ftp.walk("/path/dir1",topdown=True,onerror=None)
for root,dirs,files in recursive:
for name in files:
print name
ftp.close

Is there a symblolic link in /path/dir1/ that points back to /usr/local/web?
If you want to step over this error you could put a try catch block in and log...
ftp = ftputil.FTPHost('ftp.site.com','user','pass')
try:
recursive = ftp.walk("/path/dir1",topdown=True,onerror=None)
for root,dirs,files in recursive:
for name in files:
print name
except Error e:
print "Error: %s occurred" % (e)
ftp.close
the above will catch all errors. you can also import the specific error your getting and do this...
import ftputil.ftp_error.PermanentError as PermanentError
ftp = ftputil.FTPHost('ftp.site.com','user','pass')
try:
recursive = ftp.walk("/path/dir1",topdown=True,onerror=None)
for root,dirs,files in recursive:
for name in files:
print name
except PermanentError e:
print "Permanent Error: %s occurred" % (e)
ftp.close

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python hashing folder contents results in memory error - python

Related

IOError : [Error no: 21] is a directory : './w2v-model/wordmodel3'

Reading password protected Word Documents with zipfile

Django: How to allow a Suspicious File Operation / copy a file

opening corrupted .tgz in python

why won't recursive ftp work in this directory?

Categories

Resources