Zip/Unzip files on HDFS - python

I have my code deployed on HDFS and have two basic tasks I am having trouble figuring out:
1. Fetching a zip file from an ObjectStore to HDFS, unzipping it on HDFS, reading its contents, then deleting the zip and its contents.
2. Creating some content on HDFS, zipping it on HDFS, posting it to the ObjectStore, and then deleting the zip.
Regular libraries for zipping/unzipping in a Python script, like zipfile and shutil, do not work on HDFS URLs when referring to resources. I tried looking for Python libraries that would allow it, but found none.
The closest solution I got to was this, but it came with a fair warning that it does not work when multiple files are zipped together. Can someone help with a pointer to a solution for the two tasks listed above?
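For what it's worth, one workaround is to avoid zipping on HDFS entirely: stream the archive bytes through local memory with a WebHDFS client such as the third-party hdfs package, and do the actual (un)zipping with the standard zipfile module. A minimal sketch of both tasks, assuming a namenode at namenode:9870 and hypothetical paths (note that it buffers whole files in memory):

    import io
    import zipfile
    from hdfs import InsecureClient  # third-party WebHDFS client (pip install hdfs)

    client = InsecureClient('http://namenode:9870')  # hypothetical namenode URL

    # Task 1: read a zip that already sits on HDFS, extract its members back to HDFS,
    # then delete the archive.
    with client.read('/data/archive.zip') as reader:  # hypothetical path
        archive = zipfile.ZipFile(io.BytesIO(reader.read()))
    for name in archive.namelist():
        client.write('/data/extracted/' + name, data=archive.read(name), overwrite=True)
    client.delete('/data/archive.zip')

    # Task 2: zip HDFS content in memory and write the archive back to HDFS
    # (from there it can be posted to the object store and deleted).
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, 'w', zipfile.ZIP_DEFLATED) as zf:
        for path in client.list('/data/extracted'):
            with client.read('/data/extracted/' + path) as reader:
                zf.writestr(path, reader.read())
    client.write('/data/out.zip', data=buf.getvalue(), overwrite=True)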

Related

Parallel upload of files from a directory in Google Cloud Storage

I'm trying to write a Python utility function that uploads files from a directory to a destination folder. However, the documentation only provides a snippet with blob.upload_from_filename() for a single file and then jumps to a gsutil -m solution for multiple files.
So far, I have found only two suboptimal solutions:
Simple iteration of the single-file operation
Using subprocess and hardcoding the gsutil command
Is there an elegant solution for this issue, or should I use Python's multiprocessing?
If anyone else is looking for this: it has been included in the latest version of google-cloud-storage via transfer_manager.
Here is an example:
https://github.com/googleapis/python-storage/blob/9998a5e1c9e9e8920c4d40e13e39095585de657a/samples/snippets/storage_transfer_manager.py
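A minimal sketch of the transfer_manager approach, with a hypothetical bucket name and source directory:

    from pathlib import Path
    from google.cloud.storage import Client, transfer_manager

    bucket = Client().bucket("my-bucket")  # hypothetical bucket name
    source_dir = Path("local_dir")         # hypothetical source directory

    # Collect paths relative to the source directory; they become the blob names.
    filenames = [str(p.relative_to(source_dir))
                 for p in source_dir.rglob("*") if p.is_file()]

    # Upload the files concurrently in worker processes.
    results = transfer_manager.upload_many_from_filenames(
        bucket, filenames, source_directory=str(source_dir), max_workers=8
    )

    for name, result in zip(filenames, results):
        # Each result is None on success or the raised exception on failure.
        if isinstance(result, Exception):
            print("Failed to upload {}: {}".format(name, result))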

How to share a Collaboratory project that comes with EXTRA csv files

Apologies if this has been asked before; I've been reading about this issue for a while and all solutions seem to boil down to the same three available options.
For example, this thread, Load local data files to Colaboratory, explains how to manually upload a file. That could work. But what if we had to share a file with 100 users? I believe that with that type of solution, they would all have to copy the Colaboratory project as usual, and every one of them would have to upload the same file before they could use the code.
Am I missing something? Is there any way to share both the project and a number of files, even if that implies a longer load time since the file has to be downloaded for every person? I would like to know if there is an automatic way to do this.
I am particularly interested in solutions that do not involve auth tokens, since the people who would get the shared project are not technologically savvy enough for that.
If the data is not secret, git may be the best solution. Upload the CSV files to GitHub, then use git clone in your Colab notebook.
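A sketch of what that looks like in a Colab notebook cell, with a hypothetical repository URL and file name:

    # Clone the repo (shell escape in a Colab cell); the CSVs land in ./my-project
    !git clone https://github.com/your-user/my-project.git

    import pandas as pd
    df = pd.read_csv('my-project/data.csv')  # hypothetical CSV path inside the repo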

Python built-in module for handling Excel files

I know similar questions have been popular in the past, but none of them addresses my problem. I'm looking for a way to read data from an Excel file in Python, but I'm strongly against using non-built-in modules.
The reason is that in my case Python is a component of another piece of software, so incorporating an additional module would require every user to know how to use pip, which Python installation on their PC the module should be installed into, and so on. The solution must not require any additional actions from the user.
I can read CSV files easily with Python's built-ins, so that could work, but how can I convert Excel to CSV in the first place? Or is there a way to read Excel directly?
Edit: The software uses Python 2.
Edit 2:
Does anyone mind explaining the downvote? I think this isn't a question about a ready-made solution or module, but rather about a method, and it is well detailed. It is not always possible to use external modules, so this is a real problem. If it is not possible at all, then I would simply expect an answer instead of a -1.
Not the prettiest solution, but you could download the complete code repository of one of the Excel-handling packages for Python (openpyxl, for example) and put those files in the same directory as the Python script you're going to run. You can then import the package from those local files in your script.
Note: if the Excel-handling package has dependencies on other packages, you'll need to download those as well.
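A sketch of that local import, assuming the package files are copied into a hypothetical vendor directory next to the script (the approach works on Python 2 as well, with a Python 2-compatible openpyxl release):

    import os
    import sys

    # Make the vendored copy of the package importable without pip.
    HERE = os.path.dirname(os.path.abspath(__file__))
    sys.path.insert(0, os.path.join(HERE, 'vendor'))  # hypothetical directory holding openpyxl/

    import openpyxl  # resolved from ./vendor, not from site-packages

    wb = openpyxl.load_workbook(os.path.join(HERE, 'data.xlsx'))  # hypothetical file
    ws = wb.active
    for row in ws.rows:
        print([cell.value for cell in row])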

python 2.5 error on unzip large dbf file

So, I have a directory of rather large, zipped shapefiles. I currently have code in Python 2.5 that will unzip most of the files (i.e., all of the shapefile component parts: .shp, .prj, .dbf, ...), but I run into occasional problems unzipping some .dbf files.
The files are generally quite large when I have a problem with them (e.g., 30 MB), but file size does not seem to be the overarching problem with the unzipping process, as sometimes a smaller file will not work either.
I have looked at possible special characters in the file path (it contains "-" and "/"), but this does not seem to be an issue with other .dbf files. I have also looked at the length of the file path; also not an issue, as other long file paths do not present a problem.
7-Zip will unzip the .dbf files that Python fails on, so the files are not corrupt.
I know a simple solution would be to unzip all of the files before running my additional processing in Python, but since they arrive as zipped archives it would be most convenient not to have to do this.
Thoughts appreciated.
Two possible candidate problems: the file to extract is either empty or is larger than 2 GB. Both of these issues were fixed in Python 2.6 or 2.7.
If neither of these is the case, putting one of the culprit zip archives somewhere public would help us track down the issue.
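A quick diagnostic to check whether any members of a failing archive hit either limit; the print calls work on Python 2.5 as well, and the archive path is hypothetical:

    import zipfile

    zf = zipfile.ZipFile('shapes/archive.zip')  # hypothetical archive path
    for info in zf.infolist():
        # Flag the two cases known to break extraction on Python 2.5:
        # zero-byte members and members larger than 2 GB.
        if info.file_size == 0:
            print('%s is empty' % info.filename)
        elif info.file_size > 2 ** 31 - 1:
            print('%s is larger than 2 GB (%d bytes)' % (info.filename, info.file_size))
    zf.close()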

Is there a python ftp library for uploading whole directories (including subdirectories)?

So I know about ftplib, but it's a bit too low-level for me, as it still requires me to handle uploading files one at a time, determining whether there are subdirectories, creating the equivalent subdirectories on the server, cd'ing into those subdirectories, and finally uploading the correct files into them. It's an annoying task that I'd rather avoid if I can, what with writing tests, setting up test FTP servers, etc.
Do any of you know of a library (or maybe some code scrawled on a bathroom wall...) that takes care of this for me, or should I just accept my fate and roll my own?
Thanks
The ftputil Python library is a high-level interface to the ftplib module. It looks like it could help; see the ftputil website.
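A minimal sketch of a recursive upload with ftputil, walking the local tree with os.walk; the host, credentials, and paths are hypothetical:

    import os
    import ftputil

    local_root = 'site'       # hypothetical local directory to upload
    remote_root = '/htdocs'   # hypothetical target directory on the server

    with ftputil.FTPHost('ftp.example.com', 'user', 'password') as host:
        for dirpath, _dirnames, filenames in os.walk(local_root):
            # Mirror the local directory structure on the server.
            rel = os.path.relpath(dirpath, local_root)
            remote_dir = remote_root if rel == '.' else host.path.join(remote_root, rel)
            if not host.path.isdir(remote_dir):
                host.makedirs(remote_dir)
            for name in filenames:
                host.upload(os.path.join(dirpath, name),
                            host.path.join(remote_dir, name))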
If wget is installed on your system, you could have your script call it to do the ftp'ing for you. It supports recursive transfers, site mirroring, and many other features.
