I have written a number of pandas DataFrames to an h5 file using the PyTables integration in pandas. Since then I've deleted some of the groups in the file, and I want to repack it to reclaim the space. From what I've found, I need to use the PyTables ptrepack tool, but I can't get it to work. Can someone tell me whether I'm doing something wrong in my script, or whether I've actually hit a bug in PyTables? If it's my mistake, could you give me an example of importing and calling ptrepack to simply repack an h5 file in order to save space?
Here's my script and the errors I get:
When I looked at the ptrepack.py script in the pytables folder in Anaconda, I also saw that I should be able to pass a help flag to it, but that isn't working either. Here's the error I get when I try to get the help flag to work:
Currently I'm working on a Windows 10 machine with the following package versions:
Python: 3.5.1
PyTables: 3.2.2
pandas: 0.18.0
Thanks!
OK, firstly: to get the help dialog to show in the command prompt, you have to run either `ptrepack -h` or `ptrepack --help`.
I didn't manage to get the script working from within Python, as it seems to have been written specifically for the command line. I did, however, find a very helpful notebook on the subject ([Reclaiming HDF5 Space][1]), which has the following solution:
from subprocess import call

filename = 'in.h5'       # path to the existing HDF5 file (adjust to your file)
outfilename = 'out.h5'   # repacked output file
command = ["ptrepack", "-o", "--chunkshape=auto", "--propindexes", filename, outfilename]
call(command)
Note that this simply starts a subprocess that invokes ptrepack on the file.
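If you want failures surfaced instead of a silently ignored non-zero return code, a small stdlib-only wrapper can check for the tool first and raise on errors. This is my own sketch (the helper name is not from the notebook):

```python
import shutil
from subprocess import run

def repack_hdf5(src, dst, executable="ptrepack"):
    """Repack src into dst by shelling out to ptrepack.

    Raises FileNotFoundError if ptrepack is not on PATH, and
    subprocess.CalledProcessError if the repack itself fails.
    """
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    cmd = [executable, "-o", "--chunkshape=auto", "--propindexes", src, dst]
    run(cmd, check=True)  # check=True raises on a non-zero exit code
    return cmd
```

`call(command)` returns the exit code but does nothing with it; `run(..., check=True)` turns a failed repack into an exception you can actually see.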
[1]: https://github.com/jackdotwa/python-concepts/blob/master/hdf5/reclaiming_space.ipynb "Reclaiming HDF5 space"
I ran into the following issue, and it gave me a lot of trouble. I managed to solve it after two and a half hours, and to spare some poor soul the same waste of time, I want to show how I resolved it.
Running a Python file from a .bat file usually worked quite well. However, I ran into problems when the script tried to import pandas.
The code could look like this:
import pandas as pd
print ("hello")
and the result in the cmd prompt would be:
ImportError: Unable to import required dependencies:
numpy:
IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!
Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.
We have compiled some common reasons and troubleshooting tips at:
https://numpy.org/devdocs/user/troubleshooting-importerror.html
My .bat file looked like this:
@echo off
"C:\Users\myUserName\Anaconda3\python.exe" "C:\path to .py file\MyPythonFile.py"
pause
To solve this I tried a variety of things, such as playing around with paths within Windows, without success.
After I opened python.exe within the Anaconda3 folder, I received
Warning:
This Python interpreter is in a conda environment, but the environment has
not been activated. Libraries may fail to load. To activate this environment
please see https://conda.io/activation
I couldn't resolve this within the command prompt, but I finally understood the core problem: with Anaconda3 not activated, the script would never import pandas as intended, while other imports worked fine.
The solution that ended up working was calling the activate.bat file inside the Anaconda3 folder. So the final .bat file looks like:
@echo off
call "C:\Users\myUserName\Anaconda3\Scripts\activate.bat"
"C:\Users\myUserName\Anaconda3\python.exe" "C:\path to my Python Script\MyPythonFile.py"
pause
In an ideal scenario I would be able to keep the environment activated, but calling activate.bat from within the .bat file is an adequate solution for me, and it may be enough for you too.
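If you want the Python script itself to warn about this state, conda's activate step exports the CONDA_PREFIX environment variable, so its absence is a reasonable (best-effort) signal that the interpreter was started without activation. The helper below is my own sketch, not a conda API:

```python
import os
import sys

def conda_env_activated(environ=os.environ):
    """Best-effort check: activate.bat exports CONDA_PREFIX, so its
    absence suggests the interpreter was launched without activation."""
    return bool(environ.get("CONDA_PREFIX"))

if not conda_env_activated():
    sys.stderr.write("Warning: conda environment does not appear to be "
                     "activated; imports such as pandas may fail.\n")
```

Putting this at the top of the script makes the failure mode explicit instead of a cryptic NumPy C-extension import error.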
I have a couple of questions about importing modules, plus an unrelated syntax question.
I'm using a module called netmiko to automate networking scripts; this isn't a networking question but a Python one. I create the scripts in PyCharm and then run them, but when I tried one last night, for the first time using netmiko, it raised an import error. This confuses me, because I installed the module with "pip install netmiko" and saw it install, AND if I run "import netmiko" from the command line on Windows, it works with no exceptions. So I've been building the scripts in PyCharm but having to copy/paste them into the CLI, which isn't great. Does anybody know what the issue may be here?
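One quick diagnostic for this: print which interpreter is executing the script and whether it can see the module. If PyCharm prints a different interpreter path than your command line, the two are using different Python installations, which is a common cause of exactly this mismatch. This is a generic check, not netmiko-specific:

```python
import importlib.util
import sys

# Which interpreter is actually executing this script?
print(sys.executable)

# None means the module is not installed for *this* interpreter.
print(importlib.util.find_spec("netmiko"))
```

Run it once inside PyCharm and once from the command line and compare the two outputs.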
The second question is just a general syntax question. I've seen "+=" used when reusing the same variable name in Python (mainly in netmiko scripts, but I assume it's used in other Python scripts too), such as:
output = net_connect.send_command(cmd, expect_string=r'Destination filename')
output += net_connect.send_command('\n', expect_string=r'#')
....the rest of the script isn't important, but I'm wondering what the "+=" is actually doing here, because to me it seems no different from doing:
output = net_connect.send_command(cmd, expect_string=r'Destination filename')
output = net_connect.send_command('\n', expect_string=r'#')
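For reference, the difference is easy to see with plain strings: send_command returns a string, and += appends the new result to what is already in output, while plain = discards the previous value. (The command strings below are just illustrative stand-ins for send_command output.)

```python
# With +=, the second result is appended to the first.
output = "copy run start\n"
output += "Destination filename [startup-config]?\n"
assert output == "copy run start\nDestination filename [startup-config]?\n"

# With plain =, the first result is thrown away.
output = "copy run start\n"
output = "Destination filename [startup-config]?\n"
assert output == "Destination filename [startup-config]?\n"
```

So the two-line snippet with "=" would keep only the second command's output, while "+=" accumulates the transcript of both commands.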
Can anyone shed some light on this as well?
Thanks as usual everyone!!
I am working on Windows 10. I installed Spark, and the goal is to use PySpark. I have taken the following steps:
I installed Python 3.7 with Anaconda -- Python was added to C:\Python37
I downloaded winutils from this link -- winutils was added to C:\winutils\bin
I downloaded Spark -- it was extracted to C:\spark-3.0.0-preview2-bin-hadoop2.7
I downloaded Java 8 from AdoptOpenJDK
under system variables, I set following variables:
HADOOP_HOME : C:\winutils
SPARK_HOME: C:\spark-3.0.0-preview2-bin-hadoop2.7
JAVA_HOME: C:\PROGRA~1\AdoptOpenJDK\jdk-8.0.242.08-hotspot
And finally, under system path, I added:
%JAVA_HOME%\bin
%SPARK_HOME%\bin
%HADOOP_HOME%\bin
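As a sanity check before launching PySpark, a small stdlib snippet (my own helper, not part of Spark) can confirm that all three variables are actually visible to Python:

```python
import os

REQUIRED = ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME")

def missing_env(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED if not environ.get(name)]

print(missing_env())  # an empty list means all three are set
```

If any name appears in the list when run from the same terminal you launch Spark from, the environment variable never made it into that session (e.g. the terminal was opened before the variables were set).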
In the terminal:
So I would like to know why I am getting this warning:
unable to load native-hadoop library... and why Spark couldn't bind on port 4040...
Finally, inside a Jupyter Notebook, I am getting the following error when trying to write a Parquet file. This image shows a working example, and the following one shows the code with errors:
And here is DataMaster__3.csv on my disk:
And the DaterMaster_par2222.parquet:
Any help is much appreciated!!
If you are writing the file in CSV format, I have found that the best way to do that is the following approach:
LCL_POS.toPandas().to_csv(<path>)
There is another way to save it directly, without converting to pandas, but the issue is that the output ends up split into multiple files (with weird names, so I tend to avoid those). If you are happy to have the file split up, it's much better to write a Parquet file, in my opinion:
LCL_POS.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(<path>)
Hope that answers your question.
I am using IPython Notebook through Anaconda on RHEL 6.7. The machine is set up with NFS storage; that is, `df -P -T /home/USERNAME | tail -n +2 | awk '{print $2}'` prints `nfs`.
So I want to save matplotlib figures created in IPython notebooks. Calling the savefig function, however, gives me this error (I have suppressed most of it):
RuntimeError: dvipng was not able to process the following file:
/home/USERNAME/.cache/matplotlib/tex.cache/3007d273a0b2642aa3abce6d3d640283.dvi
Here is the full report generated by dvipng:
No dvipng error report available.
My suspicion is that this has to do with NFS (since it has given me other problems in the past) but otherwise I don't really know where to go from here. Any help greatly appreciated, and please let me know if I can provide more information.
The same problem occurs here in an up-to-date openSUSE Leap VM with an equally up-to-date Anaconda stack. To my dismay it is not deterministic: batch-generating plots fails at very different datasets. It helped to insert a
time.sleep(5)
and now the problem occurs much less frequently. Still a PITA though.
This may not strictly be a problem of NFS.
Looking at the matplotlib source in this line and this line (note that there is a mistake there: the message should say "dvips failed", not "dvipng"), it seems that the external dvipng or dvips command failed.
So there are many possibilities. First you need to figure out which external program was being run via system(). Then check whether that command can be found via the PATH environment variable at all, or whether the input file itself is crashing it. Try running dvipng (or dvips) manually on that file and see if you can get an error report.
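To capture the tool's full report yourself, a small subprocess wrapper (generic, not matplotlib code) works for dvipng and dvips alike:

```python
from subprocess import run, PIPE

def probe_command(cmd, *args):
    """Run an external tool and capture its exit code and full report."""
    result = run([cmd, *args], stdout=PIPE, stderr=PIPE, text=True)
    return result.returncode, result.stdout, result.stderr

# Example (path taken from the error message above):
# probe_command("dvipng",
#               "/home/USERNAME/.cache/matplotlib/tex.cache/"
#               "3007d273a0b2642aa3abce6d3d640283.dvi")
```

A non-zero return code together with the captured stderr usually pinpoints whether the binary is missing, crashing, or choking on that particular .dvi file.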
BTW, from the linked source, if I read it correctly, I don't think matplotlib is doing the right thing to catch an error report, because it does not capture the stdout of the dvipng/dvips command. And we know os.system() is evil...
Without properly maintained LaTeX packages, matplotlib (at least when running through a Jupyter notebook) is apparently unable to save LaTeX-formatted plots, even if they display correctly.
I'm very new to coding, and I'm having an issue importing openpyxl into my Python program. I imagine the issue is due to where I have it saved on my computer.
I've downloaded other libraries (xlrd, xlwt, xlutils) before and just saved them in my: C:\Python27\ArcGIS10.1\Lib, or C:\Python27\ArcGIS10.1\Lib\site-packages, or C:\Python27\ArcGISx6410.1\Lib, or C:\Python27\ArcGISx6410.1\Lib\site-packages directories and python has been able to "see" them when i import them into a script.
I've done some trawling on the web, and it looks like I may be performing the "installation" of openpyxl incorrectly. I downloaded setuptools-5.7 in order to run the setup.py script contained in the openpyxl library, and so far I haven't gotten that to work.
Since I'm so new to Python, I don't really understand some of the other things I've found about how to correctly install the library, like "pip install" etc.
If anyone has any ideas about how I can install or save or locate the openpyxl library in the easiest fashion (without using other programs that I don't already have), that would be great!
Your import is probably incorrect. It needs to be:
from openpyxl import Workbook