How do I truncate an EARRAY in an HDF5 file using pytables?

I have an HDF5 file containing a very large EARRAY that I would like to truncate in order to save disk space and process it more quickly. I am using the truncate method on the node containing the EARRAY. PyTables reports that the array has been truncated, but the file still takes up the same amount of space on disk.
Directory listing before truncation:
$ ll
total 3694208
-rw-rw-r-- 1 chris 189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3782858816 Aug 27 13:00 original.hdf5
The script I am using to truncate (main.py):
import tables
filename = 'original.hdf5'
h5file = tables.open_file(filename, 'a')
print h5file
node = h5file.get_node('/recordings/0/data')
node.truncate(30000)
print h5file
h5file.close()
Output of the script. As expected, the EARRAY goes from very large to much smaller.
original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree:
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(43893300, 43)) ''
/recordings/0/application_data (Group) ''
original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree:
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(30000, 43)) ''
/recordings/0/application_data (Group) ''
Yet the file takes up almost exactly the same amount of space on disk:
$ ll
total 3693196
-rw-rw-r-- 1 chris 189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3781824064 Aug 27 13:03 original.hdf5
What am I doing wrong? How can I reclaim this disk space?
If there were a way to directly modify the contents of the earray, instead of using the truncate method, this would be even more useful for me. Something like node = node[idx1:idx2, :], so that I could select which chunk of data I want to keep. But when I use this syntax, the variable node simply becomes a numpy array and the hdf5 file is not modified.

As discussed in this question, you can't really deallocate disk space from an existing HDF5 file: truncating a dataset frees that space for reuse inside the file, but the file itself does not shrink. That is simply how HDF5 is designed, and therefore it is not something PyTables can change. You can either load the data from the file and rewrite it all as a new file (potentially with the same name), or use the command-line utility h5repack to do that for you. (Slicing a node, as in node = node[idx1:idx2, :], just reads that chunk into an in-memory NumPy array; it never modifies the file.)
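A minimal sketch of the rewrite approach, reusing the node path from the question (PyTables' File.copy_file writes a fresh, compacted copy; the temporary filename here is an arbitrary choice):
import os
import tables

filename = 'original.hdf5'
packed = 'packed.hdf5'  # temporary name; renamed over the original below

with tables.open_file(filename, 'a') as h5file:
    h5file.get_node('/recordings/0/data').truncate(30000)
    # write a fresh file containing only the live data
    h5file.copy_file(packed, overwrite=True)

os.rename(packed, filename)
From the shell, h5repack original.hdf5 packed.hdf5 (or PyTables' own ptrepack with the same arguments) achieves the same compaction, followed by the same rename.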

Related

Get filename and file path, get the line where the search string is found, and extract only the part following the search string on that line

Maybe I will explain directly with an example. I am writing my code in Python, and for the grep part I am also using bash commands.
I have a few files where I need to grep for some pattern, let's say "INFO".
Those files can be present in two different directory structures, type1 and type2:
/home/user1/logs/MAIN_JOB/121/patching/a.log (type1)
/home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log (type2)
/home/user1/logs/MAIN_JOB/SUB_JOB1/142/DB:2/patching/c.log (type2)
Contents of the files:
a.log :
[Thu Jan 20 21:05:00 UTC 2022]: database1: INFO: Subject1: This is subject 1.
b.log :
[Thu Jan 22 18:01:00 UTC 2022]: database1: INFO: Subject2: This is subject 2.
c.log :
[Thu Jan 22 18:01:00 UTC 2022]: database1: ERR: Subject3: This is subject 3.
So I need to know in which of these files the "INFO" string is present, and if it is present I need to get the following:
filename : a.log / b.log
filepath : /home/user1/logs/MAIN_JOB/121/patching or /home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching
immediate string after the search string : Subject1 / Subject2
So I tried using the grep command with -r to find all the files in which "INFO" occurs:
$ grep -r 'INFO' /home/user1/logs/MAIN_JOB
/home/user1/logs/MAIN_JOB/121/patching/a.log:[Thu Jan 20 21:05:00 UTC 2022]: database1: INFO: Subject1: This is subject 1.
/home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log:[Thu Jan 22 18:01:00 UTC 2022]: database1: INFO: Subject2: This is subject 2.
$
I store the grep output in a Python variable and need to extract the above items from it.
I initially tried splitting the grep output on "\n", so I get two separate rows:
/home/user1/logs/MAIN_JOB/121/patching/a.log:[Thu Jan 20 21:05:00 UTC 2022]: database1: INFO: Subject1: This is subject 1.
/home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log:[Thu Jan 22 18:01:00 UTC 2022]: database1: INFO: Subject2: This is subject 2.
and by taking each row, I can split on ":".
First row: I am able to split properly, as ":" appears in the right places.
file_with_path : /home/user1/logs/MAIN_JOB/121/patching/a.log (I can get the file name separately with os.path.basename(file_with_path))
immediate string after the search word : "Subject1"
Second row: this is where I need help. The path contains "DB:1", whose ":" breaks a proper split. If I split, I get:
file_with_path : /home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB (not correct)
It should actually be /home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log.
I am unable to write a split that works properly for both cases.
Can you please help me with this? Any command that can do this work in bash or Python would be very helpful.
Thank you in advance. Also let me know if more info is needed from me.
Code given below:
# main dir
patch_log_home = '/home/user1/logs/MAIN_JOB'
cmd = "grep -r 'INFO' {0}"
patch_bug_inc = self._core.exec_os_cmd(cmd.format(patch_log_home))
# if no occurrence reported, continue
if len(patch_bug_inc) == 0:
    return
if patch_bug_inc:
    patch_bug_inc = patch_bug_inc.split("\n")
    for inc in patch_bug_inc:
        print("_________________________________________________")
        inc = inc.split(":")
        # to get subject part
        patch_bug_str_index = [i for i, s in enumerate(inc) if 'INFO' in s][0]
        inc_name = inc[patch_bug_str_index + 1]
        # file name
        log_file_name = os.path.basename(inc[0])
        # get file path
        log_path = os.path.split(inc[0])
        print("log_path :", log_path)
        full_path = log_path[0]
        print("FULL PATH: ", full_path)
Here's one way you could achieve this without calling out to grep, which, as I said in my comment, may not be portable:
import os
import sys

for root, _, files in os.walk('/home/user1/logs/MAIN_JOB'):
    for file in files:
        if file.endswith('.log'):
            path = os.path.join(root, file)
            try:
                with open(path) as infile:
                    for line in infile:
                        if 'INFO:' in line:
                            print(path)
                            break
            except Exception:
                print(f"Unable to process {path}", file=sys.stderr)

Writing to CSV in Python - Rows inserted out of place

I have a function that writes a set of information as rows to a CSV file in Python. The function is supposed to append the new row to the file; however, I am finding that sometimes it misbehaves and places the new row in a separate part of the CSV (please see the picture as an example).
For context: whenever I reformat the data manually, I delete all of the empty cells again.
Hoping someone can help, thanks!
def Logger():
    fileName = myDict[Sub]
    with open(fileName, 'a+', newline="") as file:
        writer = csv.writer(file)
        if file.tell() == 0:
            writer.writerow(["Date", "Asset", "Fear", "Anger", "Anticipation", "Trust", "Surprise",
                             "Sadness", "Disgust", "Joy", "Positivity", "Negativity"])
        writer.writerow([date, Sub, fear, anger, anticip, trust, surprise, sadness,
                         disgust, joy, positivity, negativity])
At first I thought it was a simple matter of there not being a trailing newline, and the new row being appended on the same line, right after the last row, but I can see what looks like a row's worth of empty columns between them.
This whole appending thing looks tricky. If you don't have to use Python, and can use a command-line tool instead, I recommend GoCSV.
Here's a sample file based on your screenshot I mocked up:
base.csv
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32,,,,,,,
I'm calling it base because it's the file that will be growing, and you can see it's got a problem on the last line: all those extra commas (I don't know how they got there 🤷🏻‍♂️).
The first step will be to clean it, and trim those pesky extra commas:
% gocsv clean base.csv > tmp
% mv tmp base.csv
and now base.csv looks like:
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32
Here's another set of data to append, sample2.csv:
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 4,6040,0.69,0.89,0.72,0.44,0.21,0.15,0.03,0.63,0.78,0.42
Nov 5,7726,0.72,0.12,0.95,0.6,0.88,0.1,0.43,1.0,1.0,0.68
Nov 6,9028,0.87,0.34,0.46,0.57,0.15,0.3,0.8,0.32,0.17,0.42
Nov 7,3544,0.16,0.9,0.37,0.8,0.67,0.0,0.11,0.72,0.93,0.35
GoCSV's stack command will do this job:
% gocsv stack base.csv sample2.csv > tmp
% mv tmp base.csv
and now base.csv looks like:
Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32
Nov 4,6040,0.69,0.89,0.72,0.44,0.21,0.15,0.03,0.63,0.78,0.42
Nov 5,7726,0.72,0.12,0.95,0.6,0.88,0.1,0.43,1.0,1.0,0.68
Nov 6,9028,0.87,0.34,0.46,0.57,0.15,0.3,0.8,0.32,0.17,0.42
Nov 7,3544,0.16,0.9,0.37,0.8,0.67,0.0,0.11,0.72,0.93,0.35
This can be scripted and simplified like this:
% gocsv clean base.csv > base
% gocsv clean sample2.csv > sample
% gocsv stack base sample > base.csv
% rm base sample
Try this instead...
import csv

def Logger(col_one, col_two):
    fileName = 'data.csv'
    # newline="" lets the csv module manage line endings itself
    with open(fileName, 'a+', newline="") as file:
        writer = csv.writer(file)
        file.seek(0)
        if file.read().strip() == '':
            writer.writerow(["Date", "Asset"])
        writer.writerow([col_one, col_two])
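One more note on the original symptom: if the file is opened without newline="" on Windows, the \r\n row terminator that csv.writer emits gets translated to \r\r\n, which spreadsheet apps display as a blank row after every record; the csv docs recommend always passing newline="" for exactly this reason. A quick check for that artifact (assuming the data.csv from the snippet above):
# inspect raw bytes: healthy rows end in b'\r\n' (or b'\n'), not b'\r\r\n'
with open('data.csv', 'rb') as f:
    print(f.read()[-40:])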

Python Create hex file with unique data

I am trying to create a hex file with unique data such that if the file size is 1 MB, the file will hold the hex equivalents of the values from 0 to (1<<20)-1, but the output file's size is 6.9 MB. I tried replacing 'wb' with 'w', but it didn't affect the file size.
f = open('/tmp/hex_data', 'wb')
for i in xrange(1 << 20):
    f.write(hex(i))
f.close()
$ ls -lrth hex_data
-rw-r--r-- 1 aameershaikh wheel 6.9M Apr 5 16:43 hex_data
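The size is expected: hex(i) returns a variable-length ASCII string such as '0xf4240', not raw bytes, so the file is text no matter the open mode ('wb' vs 'w' only affects newline translation). Summing the string lengths for 0 <= i < 1<<20 gives roughly 7.27 million characters, which matches the 6.9M listing. If the goal is exactly 1 MB of unique binary data, one option is to pack fixed-width integers; a sketch, where the 4-byte little-endian layout is an arbitrary choice:
import struct

with open('/tmp/hex_data', 'wb') as f:
    # 2**18 values * 4 bytes each = exactly 1 MiB of unique binary data
    for i in range(1 << 18):
        f.write(struct.pack('<I', i))  # 4-byte little-endian unsigned int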

How do I get the "biggest" path?

I need to write some Python code to get the latest version of Android from a path. For example:
$ ls -l android_tools/sdk/platforms/
total 8
drwxrwxr-x 5 deqing deqing 4096 Mar 21 11:42 android-18
drwxrwxr-x 5 deqing deqing 4096 Mar 21 11:42 android-19
$
In this case I'd like to have android_tools/sdk/platforms/android-19.
The max function can take a key=myfunc parameter to specify a function that will return a comparison value. So you could do something like:
import os, re

dirname = 'android_tools/sdk/platforms'
files = os.listdir(dirname)

def mykeyfunc(fname):
    # compare by the trailing number, so 'android-19' beats 'android-18'
    digits = re.search(r'\d+$', fname).group()
    return int(digits)

print os.path.join(dirname, max(files, key=mykeyfunc))
Adjust that regular expression as needed for the actual files you're dealing with, and that should get you started.

Using Python's ftplib to get a directory listing, portably

You can use ftplib for full FTP support in Python. However, the preferred way of getting a directory listing is:
# File: ftplib-example-1.py

import ftplib

ftp = ftplib.FTP("www.python.org")
ftp.login("anonymous", "ftplib-example-1")

data = []
ftp.dir(data.append)
ftp.quit()

for line in data:
    print "-", line
Which yields:
$ python ftplib-example-1.py
- total 34
- drwxrwxr-x 11 root 4127 512 Sep 14 14:18 .
- drwxrwxr-x 11 root 4127 512 Sep 14 14:18 ..
- drwxrwxr-x 2 root 4127 512 Sep 13 15:18 RCS
- lrwxrwxrwx 1 root bin 11 Jun 29 14:34 README -> welcome.msg
- drwxr-xr-x 3 root wheel 512 May 19 1998 bin
- drwxr-sr-x 3 root 1400 512 Jun 9 1997 dev
- drwxrwxr-- 2 root 4127 512 Feb 8 1998 dup
- drwxr-xr-x 3 root wheel 512 May 19 1998 etc
...
I guess the idea is to parse the results to get the directory listing. However, this listing depends directly on the FTP server's way of formatting the list, and it would be very messy to write code that has to anticipate all the different ways FTP servers might format it.
Is there a portable way to get an array filled with the directory listing?
(The array should only have the folder names.)
Try using ftp.nlst(dir).
However, note that if the folder is empty, it might throw an error:
files = []

try:
    files = ftp.nlst()
except ftplib.error_perm as resp:
    if str(resp) == "550 No files found":
        print "No files in this directory"
    else:
        raise

for f in files:
    print f
The reliable/standardized way to parse FTP directory listing is by using MLSD command, which by now should be supported by all recent/decent FTP servers.
import ftplib

f = ftplib.FTP()
f.connect("localhost")
f.login()

ls = []
f.retrlines('MLSD', ls.append)

for entry in ls:
    print entry
The code above will print:
modify=20110723201710;perm=el;size=4096;type=dir;unique=807g4e5a5; tests
modify=20111206092323;perm=el;size=4096;type=dir;unique=807g1008e0; .xchat2
modify=20111022125631;perm=el;size=4096;type=dir;unique=807g10001a; .gconfd
modify=20110808185618;perm=el;size=4096;type=dir;unique=807g160f9a; .skychart
...
Starting from Python 3.3, ftplib provides a specific method for this, FTP.mlsd():
http://bugs.python.org/issue11072
http://hg.python.org/cpython/file/67053b135ed9/Lib/ftplib.py#l535
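A minimal sketch of that 3.3+ API (the host and anonymous login are placeholders): FTP.mlsd() yields (name, facts) pairs with the facts already parsed into a dict, so keeping only folder names is a one-liner.
import ftplib

ftp = ftplib.FTP('localhost')
ftp.login()
# keep only entries whose parsed 'type' fact marks them as directories
names = [name for name, facts in ftp.mlsd() if facts.get('type') == 'dir']
print(names)
ftp.quit()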
I found my way here while trying to get filenames, last-modified stamps, file sizes etc. and wanted to add my code. It only took a few minutes to write a loop to parse the output of ftp.dir(dir_list.append), making use of Python standard library stuff like strip() (to clean up the line of text) and split() (to create an array).
ftp = FTP('sick.domain.bro')
ftp.login()
ftp.cwd('path/to/data')

dir_list = []
ftp.dir(dir_list.append)

# main thing is identifying which char marks the start of the good stuff
# '-rw-r--r-- 1 ppsrt ppsrt 545498 Jul 23 12:07 FILENAME.FOO'
#                              ^ (that is line[29])
for line in dir_list:
    print line[29:].strip().split(' ')  # got yerself an array there bud!
    # EX ['545498', 'Jul', '23', '12:07', 'FILENAME.FOO']
There's no standard for the layout of the LIST response. You'd have to write code to handle the most popular layouts. I'd start with Linux ls and Windows Server DIR formats. There's a lot of variety out there, though.
Fall back to the nlst method (returning the result of the NLST command) if you can't parse the longer list. For bonus points, cheat: perhaps the longest number in the line containing a known file name is its length.
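A sketch of that cheat (the function name is mine; it scans LIST output for the line naming a known file and takes the largest number on it as the likely size in bytes):
import re

def guess_size(list_lines, filename):
    # the "cheat": in the LIST line naming the file, the longest number
    # is usually its size in bytes
    for line in list_lines:
        if filename in line:
            numbers = [int(n) for n in re.findall(r'\d+', line)]
            if numbers:
                return max(numbers)
    return None

print(guess_size(['-rw-r--r-- 1 ppsrt ppsrt 545498 Jul 23 12:07 FILENAME.FOO'],
                 'FILENAME.FOO'))  # 545498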
I happen to be stuck with an FTP server (Rackspace Cloud Sites virtual server) that doesn't seem to support MLSD. Yet I need several fields of file information, such as size and timestamp, not just the filename, so I have to use the DIR command. On this server, the output of DIR looks very much like the OP's. In case it helps anyone, here's a little Python class that parses a line of such output to obtain the filename, size and timestamp.
import datetime

class FtpDir:
    def parse_dir_line(self, line):
        words = line.split()
        self.filename = words[8]
        self.size = int(words[4])
        t = words[7].split(':')
        ts = words[5] + '-' + words[6] + '-' + datetime.datetime.now().strftime('%Y') + ' ' + t[0] + ':' + t[1]
        self.timestamp = datetime.datetime.strptime(ts, '%b-%d-%Y %H:%M')
Not very portable, I know, but easy to extend or modify to deal with various different FTP servers.
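A quick usage sketch for the class above, fed the sample DIR line from the earlier answer (the year in the parsed timestamp is filled in from datetime.now(), since the listing omits it):
entry = FtpDir()
entry.parse_dir_line('-rw-r--r-- 1 ppsrt ppsrt 545498 Jul 23 12:07 FILENAME.FOO')
print entry.filename, entry.size, entry.timestamp
# FILENAME.FOO 545498 <current year>-07-23 12:07:00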
This is from the Python docs:
>>> from ftplib import FTP_TLS
>>> ftps = FTP_TLS('ftp.python.org')
>>> ftps.login()            # login anonymously before securing control channel
>>> ftps.prot_p()           # switch to secure data connection
>>> ftps.retrlines('LIST')  # list directory content securely
total 9
drwxr-xr-x 8 root wheel 1024 Jan 3 1994 .
drwxr-xr-x 8 root wheel 1024 Jan 3 1994 ..
drwxr-xr-x 2 root wheel 1024 Jan 3 1994 bin
drwxr-xr-x 2 root wheel 1024 Jan 3 1994 etc
d-wxrwxr-x 2 ftp wheel 1024 Sep 5 13:43 incoming
drwxr-xr-x 2 root wheel 1024 Nov 17 1993 lib
drwxr-xr-x 6 1094 wheel 1024 Sep 13 19:07 pub
drwxr-xr-x 3 root wheel 1024 Jan 3 1994 usr
-rw-r--r-- 1 root root 312 Aug 1 1994 welcome.msg
That helped me with my code when I tried filtering for only one type of file and showing those lines on screen, by adding a condition that tests each line.
Like this:
elif command == 'ls':
    print("directory of ", ftp.pwd())
    data = []
    ftp.dir(data.append)
    for line in data:
        x = line.split(".")
        formats = ["gz", "zip", "rar", "tar", "bz2", "xz"]
        if x[-1] in formats:
            print("-", line)
