How do I truncate an EArray in an HDF5 file using PyTables?
I have an HDF5 file containing a very large EArray that I would like to truncate in order to save disk space and process it more quickly. I am using the truncate method on the node containing the EArray. PyTables reports that the array has been truncated, but the file still takes up the same amount of space on disk.
Directory listing before truncation:
$ ll
total 3694208
-rw-rw-r-- 1 chris 189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3782858816 Aug 27 13:00 original.hdf5
The script I am using to truncate (main.py):
import tables
filename = 'original.hdf5'
h5file = tables.open_file(filename, 'a')
print h5file
node = h5file.get_node('/recordings/0/data')
node.truncate(30000)
print h5file
h5file.close()
Output of the script. As expected, the EARRAY goes from very large to much smaller.
original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree:
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(43893300, 43)) ''
/recordings/0/application_data (Group) ''
original.hdf5 (File) ''
Last modif.: 'Thu Aug 27 13:00:12 2015'
Object Tree:
/ (RootGroup) ''
/recordings (Group) ''
/recordings/0 (Group) ''
/recordings/0/data (EArray(30000, 43)) ''
/recordings/0/application_data (Group) ''
Yet the file takes up almost exactly the same amount of space on disk:
$ ll
total 3693196
-rw-rw-r-- 1 chris 189 Aug 27 13:03 main.py
-rw-rw-r-- 1 chris 3781824064 Aug 27 13:03 original.hdf5
What am I doing wrong? How can I reclaim this disk space?
If there were a way to directly modify the contents of the earray, instead of using the truncate method, this would be even more useful for me. Something like node = node[idx1:idx2, :], so that I could select which chunk of data I want to keep. But when I use this syntax, the variable node simply becomes a numpy array and the hdf5 file is not modified.
As discussed in this question, you can't really deallocate disk space from an existing HDF5 file: truncating or removing a node frees space inside the file, but that space is not returned to the filesystem. It's just not part of how HDF5 is designed, and therefore it's not really part of PyTables either. You can either load the data from the file and rewrite it all as a new file (potentially with the same name), or you can use the command-line utility h5repack to do that for you.
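Here is a minimal sketch of the rewrite-as-a-new-file route in PyTables. The node path, the column count and the 30000-row cutoff come from the question; the output filename and the group layout are assumptions:

import tables

# copy the slice you want to keep into a fresh file; this is effectively
# the node[idx1:idx2, :] selection asked about in the question
with tables.open_file('original.hdf5', 'r') as src, \
     tables.open_file('rewritten.hdf5', 'w') as dst:
    data = src.get_node('/recordings/0/data')
    group = dst.create_group('/recordings', '0', createparents=True)
    out = dst.create_earray(group, 'data',
                            atom=data.atom,
                            shape=(0, data.shape[1]))
    out.append(data[:30000, :])

After verifying the new file you can swap it into place (e.g. with os.replace). Alternatively, run truncate as above and then repack the file from the shell, which copies only the live data:

$ h5repack original.hdf5 packed.hdf5

PyTables also ships its own repacking utility, ptrepack, which does the same job.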
Related
Get the filename and file path when a search string is found in a line, and extract only the part of that line immediately following the search string
Maybe I will explain directly with an example: I am writing my code in Python, and for the grep part I am also using bash commands. I have a few files in which I need to grep for some pattern, let's say "INFO". The files can be present in two different directory structures, type1 and type2:

/home/user1/logs/MAIN_JOB/121/patching/a.log (type1)
/home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log (type2)
/home/user1/logs/MAIN_JOB/SUB_JOB1/142/DB:2/patching/c.log (type2)

Contents of the files:

a.log: [Thu Jan 20 21:05:00 UTC 2022]: database1: INFO: Subject1: This is subject 1.
b.log: [Thu Jan 22 18:01:00 UTC 2022]: database1: INFO: Subject2: This is subject 2.
c.log: [Thu Jan 22 18:01:00 UTC 2022]: database1: ERR: Subject3: This is subject 3.

So I need to know which files contain the string "INFO", and for each match I need the following:

filename: a.log / b.log
filepath: /home/user1/logs/MAIN_JOB/121/patching or /home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching
immediate string after the search string: Subject1 / Subject2

I tried using the grep command with -r to find which files contain "INFO":

$ grep -r 'INFO' /home/user1/logs/MAIN_JOB
/home/user1/logs/MAIN_JOB/121/patching/a.log:[Thu Jan 20 21:05:00 UTC 2022]: database1: INFO: Subject1: This is subject 1.
/home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log:[Thu Jan 22 18:01:00 UTC 2022]: database1: INFO: Subject2: This is subject 2.
$

I store the grep output in a Python variable and need to extract the fields above from it. I first split the output on "\n", which gives me two separate rows:

/home/user1/logs/MAIN_JOB/121/patching/a.log:[Thu Jan 20 21:05:00 UTC 2022]: database1: INFO: Subject1: This is subject 1.
/home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log:[Thu Jan 22 18:01:00 UTC 2022]: database1: INFO: Subject2: This is subject 2.

Then, taking each row, I split on ":".

First row: I am able to split properly, as ":" occurs only in the right places.
file_with_path: /home/user1/logs/MAIN_JOB/121/patching/a.log (I can get the file name separately with os.path.basename(file_with_path))
immediate string after the search word: "Subject1"

Second row: this is where I need help. The path contains "DB:1", whose ":" breaks my split. I get

file_with_path: /home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB (not correct; it should actually be /home/user1/logs/MAIN_JOB/SUB_JOB1/121/DB:1/patching/b.log)

I am unable to write one split that works correctly for both cases. Can you please help me with this? Any command that can do this work in bash or Python would be very helpful. Thank you in advance, and let me know if you need more information from me. Giving my code below:

# main dir
patch_log_home = '/home/user1/logs/MAIN_JOB'
cmd = "grep -r 'INFO' {0}"
patch_bug_inc = self._core.exec_os_cmd(cmd.format(patch_log_home))
# if no occurrence is reported, stop
if len(patch_bug_inc) == 0:
    return
if patch_bug_inc:
    patch_bug_inc = patch_bug_inc.split("\n")
    for inc in patch_bug_inc:
        print("_________________________________________________")
        inc = inc.split(":")
        # to get the subject part
        patch_bug_str_index = [i for i, s in enumerate(inc) if 'INFO' in s][0]
        inc_name = inc[patch_bug_str_index + 1]
        # file name
        log_file_name = os.path.basename(inc[0])
        # get file path
        log_path = os.path.split(inc[0])
        print("log_path :", log_path)
        full_path = log_path[0]
        print("FULL PATH: ", full_path)
Here's one way you could achieve this without calling out to grep which, as I said in my comment, may not be portable:

import os
import sys

for root, _, files in os.walk('/home/user1/logs/MAIN_JOB'):
    for file in files:
        if file.endswith('.log'):
            path = os.path.join(root, file)
            try:
                with open(path) as infile:
                    for line in infile:
                        if 'INFO:' in line:
                            print(path)
                            break
            except Exception:
                print(f"Unable to process {path}", file=sys.stderr)
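If you also need the token right after the search string (Subject1 / Subject2), you can split the matched line itself instead of the grep output; the path then never passes through a split, so the ":" in DB:1 can't break anything. A rough sketch under the log format shown in the question (find_matches is a hypothetical helper, not part of any library):

import os

def find_matches(top, needle='INFO'):
    # paths come straight from os.walk, so colons in directory
    # names like DB:1 are never split apart
    for root, _, files in os.walk(top):
        for name in files:
            if not name.endswith('.log'):
                continue
            path = os.path.join(root, name)
            with open(path) as fh:
                for line in fh:
                    if needle + ':' in line:
                        # '...: database1: INFO: Subject1: This is subject 1.'
                        after = line.split(needle + ':', 1)[1]
                        yield name, root, after.split(':', 1)[0].strip()

for name, folder, subject in find_matches('/home/user1/logs/MAIN_JOB'):
    print(name, folder, subject)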
Writing to CSV in Python - Rows inserted out of place
I have a function that writes a set of information as rows to a CSV file in Python. The function is supposed to append the new row to the file, but I am finding that it sometimes misbehaves and places the new row in a separate spot in the CSV (please see the picture for an example). Whenever I reformat the data manually I delete all of the empty cells again, for what that's worth. Hoping someone can help, thanks!

def Logger():
    fileName = myDict[Sub]
    with open(fileName, 'a+', newline="") as file:
        writer = csv.writer(file)
        if file.tell() == 0:
            writer.writerow(["Date", "Asset", "Fear", "Anger", "Anticipation", "Trust", "Surprise", "Sadness", "Disgust", "Joy", "Positivity", "Negativity"])
        writer.writerow([date, Sub, fear, anger, anticip, trust, surprise, sadness, disgust, joy, positivity, negativity])
At first I thought it was a simple matter of there not being a trailing newline, and the new row being appended on the same line, right after the last row, but I can see what looks like a row's worth of empty columns between them.

This whole appending thing looks tricky. If you don't have to use Python, and can use a command-line tool instead, I recommend GoCSV. Here's a sample file I mocked up based on your screenshot, base.csv:

Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32,,,,,,,

I'm calling it base because it's the file that will be growing, and you can see it's got a problem on the last line: all those extra commas (I don't know how they got there 🤷🏻♂️). The first step will be to clean it, and trim those pesky extra commas:

% gocsv clean base.csv > tmp
% mv tmp base.csv

and now base.csv looks like:

Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32

Here's another set of data to append, sample2.csv:

Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 4,6040,0.69,0.89,0.72,0.44,0.21,0.15,0.03,0.63,0.78,0.42
Nov 5,7726,0.72,0.12,0.95,0.6,0.88,0.1,0.43,1.0,1.0,0.68
Nov 6,9028,0.87,0.34,0.46,0.57,0.15,0.3,0.8,0.32,0.17,0.42
Nov 7,3544,0.16,0.9,0.37,0.8,0.67,0.0,0.11,0.72,0.93,0.35

GoCSV's stack command will do this job:

% gocsv stack base.csv sample2.csv > tmp
% mv tmp base.csv

and now base.csv looks like:

Date,Asset,Fear,Anger,Anticipation,Trust,Surprise,Sadness,Disgust,Joy,Positivity,Negativity
Nov 1,5088,0.84,0.58,0.73,1.0,0.26,0.89,0.22,0.5,0.69,0.59
Nov 2,4580,0.0,0.88,0.7,0.71,0.57,0.78,0.2,0.22,0.21,0.17
Nov 3,2469,0.72,0.4,0.66,0.53,0.65,0.64,0.67,0.78,0.54,0.32
Nov 4,6040,0.69,0.89,0.72,0.44,0.21,0.15,0.03,0.63,0.78,0.42
Nov 5,7726,0.72,0.12,0.95,0.6,0.88,0.1,0.43,1.0,1.0,0.68
Nov 6,9028,0.87,0.34,0.46,0.57,0.15,0.3,0.8,0.32,0.17,0.42
Nov 7,3544,0.16,0.9,0.37,0.8,0.67,0.0,0.11,0.72,0.93,0.35

This can be scripted and simplified like this:

% gocsv clean base.csv > base
% gocsv clean sample.csv > sample
% gocsv stack base sample > base.csv
% rm base sample
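If you'd rather stay in Python, a rough equivalent of the clean step using only the standard csv module might look like this (the base.csv name follows the example above; trimming trailing empty cells is my assumption about what "clean" needs to do here):

import csv

def trim_row(row):
    # drop trailing empty cells like the ones on the 'Nov 3' line
    while row and row[-1] == '':
        row.pop()
    return row

# read everything, drop blank rows, trim stray empty cells, rewrite in place
with open('base.csv', newline='') as fh:
    rows = [trim_row(row) for row in csv.reader(fh) if any(row)]

with open('base.csv', 'w', newline='') as fh:
    csv.writer(fh).writerows(rows)

Appending sample2.csv afterwards is then just a matter of writing its data rows (minus the header) in 'a' mode.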
Try this instead...

def Logger(col_one, col_two):
    fileName = 'data.csv'
    with open(fileName, 'a+', newline="") as file:
        writer = csv.writer(file)
        file.seek(0)
        if file.read().strip() == '':
            writer.writerow(["Date", "Asset"])
        writer.writerow([col_one, col_two])
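One caveat with that emptiness check, for what it's worth: seek(0) followed by read() rereads the whole file on every call, which gets slower as the CSV grows. A cheaper sketch using os.path.getsize (the data.csv name carries over from the answer above):

import csv
import os

def logger(fileName, row, header):
    # write the header only when the file is missing or empty
    needs_header = not os.path.exists(fileName) or os.path.getsize(fileName) == 0
    with open(fileName, 'a', newline='') as fh:
        writer = csv.writer(fh)
        if needs_header:
            writer.writerow(header)
        writer.writerow(row)

logger('data.csv', ['Nov 8', 4321], ['Date', 'Asset'])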
Python Create hex file with unique data
I am trying to create a hex file with unique data, such that if the file size is 1 MB the file will hold the hex equivalent of the values from 1 to 1<<20, but the output file's size is 6.9 MB. I tried replacing 'wb' with 'w' but it didn't affect the file size.

f = open('/tmp/hex_data', 'wb')
for i in xrange(1 << 20):
    f.write(hex(i))
f.close()

$ ls -lrth hex_data
-rw-r--r-- 1 aameershaikh wheel 6.9M Apr 5 16:43 hex_data
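For what it's worth, the 6.9 MB follows directly from what hex() returns: ASCII text such as '0x1f4', not raw bytes. Most values below 1<<20 render as seven characters ('0x' plus five hex digits), so the file lands near 7 * 2**20 ≈ 6.9 MB regardless of 'w' versus 'wb'. A minimal sketch that writes each value as fixed-width binary instead (four bytes per value, giving a 4 MiB file) could use struct.pack:

import struct

with open('/tmp/hex_data', 'wb') as f:
    for i in range(1 << 20):
        # '<I' packs a little-endian unsigned 32-bit int: exactly 4 bytes
        f.write(struct.pack('<I', i))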
How do I get the "biggest" path?
I need to write some Python code to get the latest version of Android from a path. For example:

$ ls -l android_tools/sdk/platforms/
total 8
drwxrwxr-x 5 deqing deqing 4096 Mar 21 11:42 android-18
drwxrwxr-x 5 deqing deqing 4096 Mar 21 11:42 android-19
$

In this case I'd like to have android_tools/sdk/platforms/android-19.
The max function can take a key=myfunc parameter to specify a function that will return a comparison value. So you could do something like:

import os, re

dirname = 'android_tools/sdk/platforms'
files = os.listdir(dirname)

def mykeyfunc(fname):
    digits = re.search(r'\d+$', fname).group()
    return int(digits)

print max(files, key=mykeyfunc)

Adjust that regular expression as needed for the actual files you're dealing with, and that should get you started.
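Since the question asks for the full path rather than the bare directory name, joining the winner back onto the parent directory gives exactly that; a small usage sketch:

import os, re

dirname = 'android_tools/sdk/platforms'

def mykeyfunc(fname):
    # compare by the trailing number, so android-19 beats android-18
    return int(re.search(r'\d+$', fname).group())

newest = max(os.listdir(dirname), key=mykeyfunc)
print os.path.join(dirname, newest)  # android_tools/sdk/platforms/android-19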
Using Python's ftplib to get a directory listing, portably
You can use ftplib for full FTP support in Python. However the preferred way of getting a directory listing is:

# File: ftplib-example-1.py

import ftplib

ftp = ftplib.FTP("www.python.org")
ftp.login("anonymous", "ftplib-example-1")

data = []
ftp.dir(data.append)
ftp.quit()

for line in data:
    print "-", line

Which yields:

$ python ftplib-example-1.py
- total 34
- drwxrwxr-x  11 root   4127   512 Sep 14 14:18 .
- drwxrwxr-x  11 root   4127   512 Sep 14 14:18 ..
- drwxrwxr-x   2 root   4127   512 Sep 13 15:18 RCS
- lrwxrwxrwx   1 root   bin     11 Jun 29 14:34 README -> welcome.msg
- drwxr-xr-x   3 root   wheel  512 May 19  1998 bin
- drwxr-sr-x   3 root   1400   512 Jun  9  1997 dev
- drwxrwxr--   2 root   4127   512 Feb  8  1998 dup
- drwxr-xr-x   3 root   wheel  512 May 19  1998 etc
...

I guess the idea is to parse the results to get the directory listing. However this listing is directly dependent on the FTP server's way of formatting the list. It would be very messy to write code for this, having to anticipate all the different ways FTP servers might format this list. Is there a portable way to get an array filled with the directory listing? (The array should only have the folder names.)
Try using ftp.nlst(dir). However, note that if the folder is empty, it might throw an error:

files = []

try:
    files = ftp.nlst()
except ftplib.error_perm as resp:
    if str(resp) == "550 No files found":
        print "No files in this directory"
    else:
        raise

for f in files:
    print f
The reliable/standardized way to parse an FTP directory listing is by using the MLSD command, which by now should be supported by all recent/decent FTP servers.

import ftplib

f = ftplib.FTP()
f.connect("localhost")
f.login()
ls = []
f.retrlines('MLSD', ls.append)
for entry in ls:
    print entry

The code above will print:

modify=20110723201710;perm=el;size=4096;type=dir;unique=807g4e5a5; tests
modify=20111206092323;perm=el;size=4096;type=dir;unique=807g1008e0; .xchat2
modify=20111022125631;perm=el;size=4096;type=dir;unique=807g10001a; .gconfd
modify=20110808185618;perm=el;size=4096;type=dir;unique=807g160f9a; .skychart
...

Starting from Python 3.3, ftplib will provide a specific method to do this:

http://bugs.python.org/issue11072
http://hg.python.org/cpython/file/67053b135ed9/Lib/ftplib.py#l535
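That method landed as FTP.mlsd() in Python 3.3; a short sketch of using it (the host name is a placeholder):

import ftplib

ftp = ftplib.FTP('ftp.example.com')  # placeholder host
ftp.login()
# mlsd() yields (name, facts) pairs already parsed from the MLSD reply
for name, facts in ftp.mlsd():
    print(name, facts.get('type'), facts.get('size'))
ftp.quit()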
I found my way here while trying to get filenames, last modified stamps, file sizes etc and wanted to add my code. It only took a few minutes to write a loop to parse the ftp.dir(dir_list.append), making use of python std lib stuff like strip() (to clean up the line of text) and split() to create an array.

ftp = FTP('sick.domain.bro')
ftp.login()
ftp.cwd('path/to/data')

dir_list = []
ftp.dir(dir_list.append)

# main thing is identifying which char marks start of good stuff,
# e.g. the interesting data starts at column 29 (line[29]):
# '-rw-r--r-- 1 ppsrt ppsrt 545498 Jul 23 12:07 FILENAME.FOO

for line in dir_list:
    print line[29:].strip().split(' ')
    # got yerself an array there bud!
    # EX ['545498', 'Jul', '23', '12:07', 'FILENAME.FOO']
There's no standard for the layout of the LIST response. You'd have to write code to handle the most popular layouts. I'd start with Linux ls and Windows Server DIR formats. There's a lot of variety out there, though. Fall back to the nlst method (returning the result of the NLST command) if you can't parse the longer list. For bonus points, cheat: perhaps the longest number in the line containing a known file name is its length.
I happen to be stuck with an FTP server (Rackspace Cloud Sites virtual server) that doesn't seem to support MLSD. Yet I need several fields of file information, such as size and timestamp, not just the filename, so I have to use the DIR command. On this server, the output of DIR looks very much like the OP's. In case it helps anyone, here's a little Python class that parses a line of such output to obtain the filename, size and timestamp.

import datetime

class FtpDir:
    def parse_dir_line(self, line):
        words = line.split()
        self.filename = words[8]
        self.size = int(words[4])
        t = words[7].split(':')
        ts = words[5] + '-' + words[6] + '-' + datetime.datetime.now().strftime('%Y') + ' ' + t[0] + ':' + t[1]
        self.timestamp = datetime.datetime.strptime(ts, '%b-%d-%Y %H:%M')

Not very portable, I know, but easy to extend or modify to deal with various different FTP servers.
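A quick usage sketch, feeding it the sample line quoted in an earlier answer (note the class stamps the current year, since Unix-style listings omit the year for recent files):

d = FtpDir()
d.parse_dir_line('-rw-r--r-- 1 ppsrt ppsrt 545498 Jul 23 12:07 FILENAME.FOO')
print d.filename, d.size, d.timestamp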
This is from the Python docs:

>>> from ftplib import FTP_TLS
>>> ftps = FTP_TLS('ftp.python.org')
>>> ftps.login()            # login anonymously before securing control channel
>>> ftps.prot_p()           # switch to secure data connection
>>> ftps.retrlines('LIST')  # list directory content securely
total 9
drwxr-xr-x   8 root     wheel        1024 Jan  3  1994 .
drwxr-xr-x   8 root     wheel        1024 Jan  3  1994 ..
drwxr-xr-x   2 root     wheel        1024 Jan  3  1994 bin
drwxr-xr-x   2 root     wheel        1024 Jan  3  1994 etc
d-wxrwxr-x   2 ftp      wheel        1024 Sep  5 13:43 incoming
drwxr-xr-x   2 root     wheel        1024 Nov 17  1993 lib
drwxr-xr-x   6 1094     wheel        1024 Sep 13 19:07 pub
drwxr-xr-x   3 root     wheel        1024 Jan  3  1994 usr
-rw-r--r--   1 root     root          312 Aug  1  1994 welcome.msg
That helped me with my code. When I wanted to filter for only one type of file and show the matches on screen, I added a condition that tests each line. Like this:

elif command == 'ls':
    print("directory of ", ftp.pwd())
    data = []
    ftp.dir(data.append)
    for line in data:
        x = line.split(".")
        formats = ["gz", "zip", "rar", "tar", "bz2", "xz"]
        if x[-1] in formats:
            print("-", line)