python 2.5 error on unzip large dbf file - python

So, I have a directory of rather large, zipped shapefiles. I currently have code in Python 2.5 that will unzip most of the files (i.e. all of the shapefile component parts: .shp, .prj, .dbf, ...), but I run into occasional problems unzipping some .dbf files.
The files are generally quite large when I have a problem with them (e.g. 30 MB), but file size does not seem to be the underlying problem with the unzipping process, as sometimes a smaller file will not work either.
I have looked at possible special characters in the file path (it contains "-" and "/"), but this does not seem to be an issue for other .dbf files. I have also looked at the length of the file path; also not an issue, as other long file paths do not present a problem.
7-Zip will unzip the .dbf files that Python fails on, so the files are not corrupt.
I know a simple solution would be to unzip all of the files prior to running my additional processing in Python, but as they come in a zipped archive it would be most convenient not to have to do this.
Thoughts appreciated.

Two possible candidate problems: the file to extract is either empty, or larger than 2 GB. Both of these issues were fixed in Python 2.6 or 2.7.
If neither of these is the case, putting one of the culprit zip archives somewhere public would help us track down the issue.
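If upgrading is an option, a minimal sketch of extracting one member with the standard zipfile module, streaming it in chunks rather than loading the whole .dbf into memory (ZipFile.open requires Python 2.6+; the function name and chunk size here are illustrative):

```python
import zipfile

def extract_member(zip_path, member, dest_path, chunk_size=1 << 20):
    # Stream the member out in fixed-size chunks instead of reading the
    # whole (possibly very large) .dbf into memory at once.
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as src, open(dest_path, "wb") as dst:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                dst.write(chunk)
```

This also sidesteps memory pressure on 30 MB+ members, since only one chunk is resident at a time.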

Should I add Python's pyc files to .dockerignore?

I've seen several examples of .dockerignore files for Python projects where *.pyc files and/or __pycache__ folders are ignored:
**/__pycache__
*.pyc
Since these files/folders are going to be recreated in the container anyway, I wonder if it's a good practice to do so.
Yes, it's a recommended practice. There are several reasons:
Reduce the size of the resulting image
In .dockerignore you specify files that won't go into the resulting image, which may be crucial when you're building the smallest possible image. Roughly speaking, the size of the bytecode files is about equal to the size of the actual source files. Bytecode files aren't intended for distribution, which is why we usually put them into .gitignore as well.
Cache related problems
In earlier versions of Python 3.x there were several cache-related issues:
Python’s scheme for caching bytecode in .pyc files did not work well
in environments with multiple Python interpreters. If one interpreter
encountered a cached file created by another interpreter, it would
recompile the source and overwrite the cached file, thus losing the
benefits of caching.
Since Python 3.2, all cached files are tagged with the interpreter version, as in mymodule.cpython-32.pyc, and stored under a __pycache__ directory. By the way, starting with Python 3.8 you can even control the directory where the cache will be stored. That may be useful when write access to the source directory is restricted but you still want the benefits of caching.
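The versioned naming scheme can be inspected directly with the standard library: importlib.util.cache_from_source maps a source path to the cache path the running interpreter would use (the exact version tag in the result depends on your interpreter):

```python
import importlib.util

# Map a source file to its bytecode cache path; the version tag in the
# result (e.g. cpython-312) depends on the running interpreter.
cached = importlib.util.cache_from_source("pkg/mymodule.py")
print(cached)  # e.g. pkg/__pycache__/mymodule.cpython-312.pyc
```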
Usually, the cache system works perfectly, but someday something may go wrong. It's worth noting that a cached .pyc file (living in the same directory) will be used instead of the .py file if the .py file is missing. In practice that's not a common occurrence, but if stale behaviour keeps turning up, removing the cache files is a good first step. It may matter when you're experimenting with the cache system in Python or executing scripts in different environments.
Security reasons
Most likely you don't even need to think about it, but cache files can contain some sort of sensitive information. With the current implementation, .pyc files embed the absolute path to the actual source files. There are situations where you don't want to share such information.
Interacting with bytecode files appears to be a fairly frequent necessity; for example, django-extensions provides the corresponding commands compile_pyc and clean_pyc.

Zip/Unzip files on HDFS

I have my code deployed on HDFS and have two basic tasks that I am having trouble figuring out:
fetching a zip file from an ObjectStore to HDFS, unzipping it on HDFS, reading its contents, then deleting the zip and contents;
creating some content on HDFS, zipping it on HDFS, posting it to the ObjectStore, then deleting the zip.
Regular Python libraries for zipping/unzipping, such as shutil and zipfile, do not work on HDFS URLs when referring to resources. I tried looking up Python libraries that would allow it, but found none.
The closest solution I got to was this, but it came with a fair warning that it does not work when multiple files are zipped together. Can someone help with a pointer to a solution for the two tasks above?
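One common workaround, sketched below under the assumption that the `hdfs dfs` CLI is available on the node: stage the archive to local disk, unzip it there with the standard zipfile module, and push the members back. The function name is hypothetical, and the command runner is injectable so the HDFS calls can be stubbed out:

```python
import os
import subprocess
import tempfile
import zipfile

def unzip_on_hdfs(hdfs_zip, hdfs_dest_dir, run=subprocess.check_call):
    # zipfile cannot open hdfs:// URLs directly, so shell out to the
    # `hdfs dfs` CLI (assumed to be on PATH) to stage the archive
    # locally, extract it, and upload each member back to HDFS.
    with tempfile.TemporaryDirectory() as tmp:
        local_zip = os.path.join(tmp, "archive.zip")
        run(["hdfs", "dfs", "-get", hdfs_zip, local_zip])
        out_dir = os.path.join(tmp, "out")
        with zipfile.ZipFile(local_zip) as zf:
            zf.extractall(out_dir)
        for name in sorted(os.listdir(out_dir)):
            run(["hdfs", "dfs", "-put", os.path.join(out_dir, name), hdfs_dest_dir])
        run(["hdfs", "dfs", "-rm", hdfs_zip])
```

The reverse task (zip locally, `-put` the archive, delete it) follows the same staging pattern. This trades HDFS-native processing for local disk space, which is usually acceptable for archives of modest size.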

Python wheel: same source code but different md5sum

We need to check the md5sum of our self-made Python packages, taking it from the resulting *.whl file. The problem is that the md5sum changes on every build, even if there are no changes in the source code. We have also tested this with third-party packages, e.g. django-celery, and get the same behavior.
So the questions are:
What differs if we don't change the source code?
Is it possible to get the same md5sum for the same python builds?
Update:
To illustrate the issue, here are two reports made from two django-celery builds.
The content checksums of the builds are exactly the same (4th column), but the checksums of the *.whl files themselves differ.
Links to the reports:
https://www.dropbox.com/s/0kkbhwd2fgopg67/django_celery-3.1.17-py2-none-any2.htm?dl=0
https://www.dropbox.com/s/vecrq587jjrjh2r/django_celery-3.1.17-py2-none-any1.htm?dl=0
Quoting the relevant PEP:
A wheel is a ZIP-format archive with a specially formatted file name and the .whl extension.
ZIP archives preserve the modification time of each file.
Wheel archives do not contain just source code, but also other files and directories that are generated on the fly when the archive is created. Therefore, even if you don't touch your Python source code, the wheel will still contain files with different modification times, and the archive bytes (and hence the md5sum) will differ.
One way to work around this problem is to unzip the wheel and compute the checksums of the contents.
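A minimal sketch of that workaround: treat the wheel as the ZIP it is, and hash each member's bytes rather than the archive file, so timestamps stored in the ZIP metadata no longer affect the result (the function name is illustrative):

```python
import hashlib
import zipfile

def content_checksums(wheel_path):
    # Hash each member's decompressed bytes; ZIP metadata such as
    # per-file modification times is ignored, so two wheels built from
    # identical sources produce identical checksum maps.
    sums = {}
    with zipfile.ZipFile(wheel_path) as zf:
        for name in sorted(zf.namelist()):
            sums[name] = hashlib.md5(zf.read(name)).hexdigest()
    return sums
```

Comparing the dicts returned for two builds answers "did the contents change?" independently of when each wheel was built.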

How to extract multiple files from 7zip compressed file using pylzma

I have a 7zip-compressed file with a .bup extension. After extracting this file using the 7zip utility, it creates a folder which contains two files. I would like to do the same thing with PyLZMA. Can all the files be extracted into a folder using PyLZMA (decompression)? Could you let me know how that can be done? I'm new to this, so any detailed help would be really helpful.
Python 3.3's standard library has lzma support built in, and the module documentation comes with examples.
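For reference, a minimal sketch using that built-in lzma module (Python 3.3+). Note the caveat in code: this handles a single compressed stream, whereas a .7z (or .bup) container holding multiple files needs a dedicated library; the function name is illustrative:

```python
import lzma

def decompress_xz(src_path, dest_path, chunk_size=1 << 20):
    # Stream-decompress a single .xz/.lzma stream in chunks. A .7z (or
    # .bup) container can hold multiple files; extracting those needs a
    # dedicated library (e.g. py7zr), since lzma alone only handles the
    # raw compressed stream.
    with lzma.open(src_path, "rb") as src, open(dest_path, "wb") as dst:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            dst.write(chunk)
```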

Creating fewer files when freezing a Python application

I'm using cxFreeze to freeze my Python application. All seems to be working as expected but peering into the build directory got me thinking...
Is there a way I could have fewer files in the build directory?
Currently, there's a bunch of PYD files and the necessary DLL files lying around. Then I have some configuration files (custom) and the rest of the stuff is thrown into a library.zip file. Is there a way I could bundle pretty much everything into the library.zip file so I could have fewer files in there?
(This is more of a nice-clean-directory wish than a real "issue", but nonetheless, sometimes you've just got to satisfy the curiosity.)
Thanks a ton guys (in advance).
PyInstaller is cross-platform too, and has more features than cx_Freeze, but doesn't support Python 3. See also py2exe - generate single executable file.
