Python format print with a list - python

What is the most Pythonic way to produce my output? Let me illustrate the behavior I'm trying to achieve.
For a project of mine, I'm building a function that takes different parameters and prints the output in columns.
Example of the list it receives:
[('Field', 'Integer', 'Hex'),
('Machine;', 332, '0x14c'),
('NumberOfSections;', 9, '0x9'),
('Time Date Stamp;', 4, '0x4'),
('PointerToSymbolTable;', 126976, '0x1f000')]
Note: the number of items per tuple can differ (only 3 items per tuple here, but another list may have 4 or any other number).
The output should be something like this:
Field                 Integer Hex
-------------------------------------------------------------------------------
Machine;              332     0x14c
NumberOfSections;     9       0x9
Time Date Stamp;      4       0x4
PointerToSymbolTable; 126976  0x1f000
For working purposes I created a list which only contains the header fields.
This isn't strictly necessary, but it made it a little easier to try things out.
The header field list is ['Field', 'Integer', 'Hex'].
The first tuple in the list declares the so-called "header fields", as shown in the list example.
In this case there are only 3 items, but this can differ from time to time, so I calculate the number of items with:
length_container_header = len(container[0])
This variable can be used to build up the output correctly.
To print the header, I would build something like this:
print("{:21} {:7} {:7}".format(header_field[0], header_field[1], header_field[2]))
This is a manual version of how it should look. As you may have noticed, the header field "Field" is shorter than
PointerToSymbolTable in the list. I wrote this loop to determine the longest item for each position in the list:
container_lenght_list = []
local_l = 0
for field in range(0, length_container_header):
    # find the longest value (as a string) in this column
    for item in container[1:]:
        if len(str(item[field])) > local_l:
            local_l = len(str(item[field]))
    container_lenght_list.append(local_l)
    local_l = 0
This produces a list along the lines of [21, 7, 7] in this case.
Creating the format string is then pretty simple:
formatstring = ""
for line in container_lenght_list:
    formatstring += "{:" + str(line) + "}"
Which produces the string:
{:21}{:7}{:7}
This is the part where I run into trouble: how can I produce the last part, the arguments passed to format()?
I tried a nested for loop in the format() call, but I ended up with all sorts of errors. I think it can be done with a
for loop; I just can't figure out how. If someone could push me in the right direction for the header print, I would be very grateful. Once I have figured out how to print the header, I can pretty much figure out the rest. I hope I explained it well enough.
With Kind Regards,

You can use * to unpack an argument list:
container = [
('Field', 'Integer', 'Hex'),
('Machine;', 332, '0x14c'),
('NumberOfSections;', 9, '0x9'),
('Time Date Stamp;', 4, '0x4'),
('PointerToSymbolTable;', 126976, '0x1f000')
]
lengths = [
    max(len(str(row[i])) for row in container) for i in range(len(container[0]))
]  # => [21, 7, 7]
# OR lengths = [max(map(len, map(str, x))) for x in zip(*container)]
fmt = ' '.join('{:<%d}' % l for l in lengths)
# => '{:<21} {:<7} {:<7}'  # < for left-align
print(fmt.format(*container[0]))  # header
print('-' * (sum(lengths) + len(lengths) - 1))  # separator
for row in container[1:]:
    print(fmt.format(*row))  # <------- unpacking argument list
    # similar to print(fmt.format(row[0], row[1], row[2]))
output:
Field                 Integer Hex
-------------------------------------
Machine;              332     0x14c
NumberOfSections;     9       0x9
Time Date Stamp;      4       0x4
PointerToSymbolTable; 126976  0x1f000
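The same idea can be wrapped in a small helper that works for any number of columns. This is only a sketch: the function name format_table and its sep parameter are made up for illustration and are not part of the answer above.

```python
def format_table(rows, sep=' '):
    """Return the lines of a left-aligned text table; rows[0] is the header."""
    # Column width = longest str() of any cell in that column.
    widths = [max(len(str(cell)) for cell in col) for col in zip(*rows)]
    fmt = sep.join('{:<%d}' % w for w in widths)
    lines = [fmt.format(*rows[0])]
    # Separator spans all columns plus the gaps between them.
    lines.append('-' * (sum(widths) + len(sep) * (len(widths) - 1)))
    lines.extend(fmt.format(*row) for row in rows[1:])
    return lines


container = [
    ('Field', 'Integer', 'Hex'),
    ('Machine;', 332, '0x14c'),
    ('PointerToSymbolTable;', 126976, '0x1f000'),
]
for line in format_table(container):
    print(line)
```

Because the widths come from zip(*rows), a list with four or more items per tuple needs no changes.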

Formatting data in tabular form requires four important steps:
1. Determine the field layout, i.e. whether the data is represented row-wise or column-wise. Based on that decision you might need to transpose the data using zip.
2. Determine the field sizes. Unless you want to hard-code the field sizes (not recommended), you should determine the maximum field size from the data, allowing customized padding between fields. Generally this requires reading the data and determining the maximum length of each field: [len(max(map(str, field), key=len)) + pad for field in zip(*data)]
3. Extract the header row. This is easy, as it only requires indexing the 0th row, i.e. data[0].
4. Format the data. This requires some understanding of Python format strings.
Implementation
class FormatTable(object):
    def __init__(self, data, pad=2):
        self.data = data
        self.pad = pad
        self.header = data[0]
        # width of each column: longest cell (as a string) plus padding
        self.field_size = [len(max(map(str, field), key=len)) + pad
                           for field in zip(*data)]
        self.format = ''.join('{{:<{}}}'.format(s) for s in self.field_size)

    def __iter__(self):
        yield self.format.format(*self.header)
        yield '-' * (sum(self.field_size) + self.pad * len(self.header))
        for row in self.data[1:]:
            yield self.format.format(*row)
Demo
for row in FormatTable(data):
    print(row)
Field                  Integer  Hex
-----------------------------------------------
Machine;               332      0x14c
NumberOfSections;      9        0x9
Time Date Stamp;       4        0x4
PointerToSymbolTable;  126976   0x1f000

I don't know if it is "Pythonic", but you can use pandas to format your output.
import pandas as pd
data = [('Field', 'Integer', 'Hex'),
        ('Machine;', 332, '0x14c'),
        ('NumberOfSections;', 9, '0x9'),
        ('Time Date Stamp;', 4, '0x4'),
        ('PointerToSymbolTable;', 126976, '0x1f000')]
s = pd.DataFrame(data[1:], columns=data[0])
print(s.to_string(index=False))
Result:
Field Integer Hex
Machine; 332 0x14c
NumberOfSections; 9 0x9
Time Date Stamp; 4 0x4
PointerToSymbolTable; 126976 0x1f000

Related

Python ValueError: dictionary update sequence element #4 has length 3; 2 is required

I need to transform a tuple into a dictionary with its respective key->value pairs. The issue is that for one specific tuple I get the following error:
ValueError: dictionary update sequence element #4 has length 3; 2 is required.
For other tuples with the same format, the transformation works without problems. Could someone point me to the reason for the error?
In the attached code the tupla1 value works fine, but the tupla value gives the above error.
tupla = ['.1.3.6.1.4.1.35873.5.1.2.1.1.1.1="314"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.2="10943"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.3="RTU : otu-8000e-Comtec (172.17.74.133)..Alarm type: OPTICAL..Timestamp: Jan 15 2022 - 08:31..Severity: CLEAR..Link name: PROV-21-82-83-84 (PRI) RUTA 7 (PROV) - Port 2..Probable cause:"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.5="1"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.4="port=2"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.6="1"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.7="1"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.8="0x07e6010f081f1400"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.9="otu-8000e-Comtec (172.17.74.133)"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.1="PROV-21-82-83-84 (PRI) RUTA 7 (PROV)"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.2="0"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.3="0.18"', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.4=""', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.5=""', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.6=""', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.1=""', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.2=""', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.3=""', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.4=""', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.5=""', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.6=""', '.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.7=""']
tupla1 = ['.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.1.3701361="3701361"', '.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.2.3701361="CRITICAL"', '.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.3.3701361="CRITICAL"', '.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.4.3701361="VALE-078-001"', '.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.5.3701361="Microreflection Threshold 1 Violation"', '.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.6.3701361="2021-09-02T19:14:04.834Z"', '.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.7.3701361="0"', '.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.8.3701361="1333972"', '.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.9.3701361="http://SRVXPTPRODSTG01.vtr.cl/pathtrak/analysis/view.html#/node/1333972"', '.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.10.3701361="7"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.2.3701361.0="28400000"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.3.3701361.0="0"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.4.3701361.0="HOLA"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.2.3701361.0="30800000"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.3.3701361.0="7"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.4.3701361.0="CRITICAL"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.2.3701361.0="40700000"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.3.3701361.0="0"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.4.3701361.0="NONE"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.2.3701361.0="35600000"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.3.3701361.0="2"', '.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.4.3701361.0="CRITICAL"']
miDiccionarioTupla= dict([(tupla[x].split('"')[0]+tupla[x].split('"')[1]).split('=') for x in range(len(tupla))])
print(miDiccionarioTupla)
#miDiccionarioTupla1= dict([(tupla1[x].split('"')[0]+tupla1[x].split('"')[1]).split('=') for x in range(len(tupla1))])
#print(miDiccionarioTupla1)
The problem is the fifth item in tupla:
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.4="port=2"'
That line contains two equal signs, so the final .split('=') produces too many values.
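A short check makes the failure visible. The snippet below is only an illustration that reuses two of the strings from the question; the helper name to_pair is invented here.

```python
def to_pair(item):
    # Same expression as in the question: drop the quotes, then split on every '='.
    parts = item.split('"')
    return (parts[0] + parts[1]).split('=')


good = '.1.3.6.1.4.1.35873.5.1.2.1.1.1.5="1"'
bad = '.1.3.6.1.4.1.35873.5.1.2.1.1.1.4="port=2"'

print(to_pair(good))  # two elements: key and value
print(to_pair(bad))   # three elements, because the value itself contains '='

try:
    dict([to_pair(good), to_pair(bad)])
except ValueError as exc:
    print(exc)
```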
As noted by John Gordon, the data has an extraneous "=" in one of the rows.
I am not one hundred percent sure what you are hoping to achieve with your code, but I have a potential solution that might help to deal with the extraneous equals sign. The code may also be a bit easier to read:
tupla = ['.1.3.6.1.4.1.35873.5.1.2.1.1.1.1="314"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.2="10943"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.3="RTU : otu-8000e-Comtec (172.17.74.133)..Alarm type: OPTICAL..Timestamp: Jan 15 2022 - 08:31..Severity: CLEAR..Link name: PROV-21-82-83-84 (PRI) RUTA 7 (PROV) - Port 2..Probable cause:"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.5="1"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.4="port=2"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.6="1"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.7="1"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.8="0x07e6010f081f1400"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.9="otu-8000e-Comtec (172.17.74.133)"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.1="PROV-21-82-83-84 (PRI) RUTA 7 (PROV)"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.2="0"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.3="0.18"',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.4=""',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.5=""',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.6=""',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.1=""',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.2=""',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.3=""',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.4=""',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.5=""',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.6=""',
'.1.3.6.1.4.1.35873.5.1.2.1.1.1.11.7=""']
tupla1 = ['.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.1.3701361="3701361"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.2.3701361="CRITICAL"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.3.3701361="CRITICAL"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.4.3701361="VALE-078-001"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.5.3701361="Microreflection Threshold 1 Violation"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.6.3701361="2021-09-02T19:14:04.834Z"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.7.3701361="0"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.8.3701361="1333972"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.9.3701361="http://SRVXPTPRODSTG01.vtr.cl/pathtrak/analysis/view.html#/node/1333972"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.101.1.10.3701361="7"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.2.3701361.0="28400000"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.3.3701361.0="0"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.4.3701361.0="HOLA"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.2.3701361.0="30800000"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.3.3701361.0="7"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.4.3701361.0="CRITICAL"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.2.3701361.0="40700000"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.3.3701361.0="0"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.4.3701361.0="NONE"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.2.3701361.0="35600000"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.3.3701361.0="2"',
'.1.3.6.1.4.1.4100.2.2.1.2.1.102.1.4.3701361.0="CRITICAL"']
With Python, it is not necessary to use the length of the object and reference the index of the object (using [x]). We can simply parse the object directly with a for loop (for item in tupla):
miDiccionarioTupla = dict()
for item in tupla:
We can split strings on a character (such as =) AND with this data, we can choose how many times to check for that using the maxsplit parameter.
    key, value = item.split('=', maxsplit=1)
I am presuming you want to eliminate any extra quotes in the values to the right side of your items, so I added a .replace() method call on the value: (i.e. "CRITICAL" becomes CRITICAL). This replaces any examples of " with an empty string, essentially removing all the double quotes.
    value = value.replace('"', '')
    miDiccionarioTupla.update({key: value})
print(miDiccionarioTupla)
miDiccionarioTupla1 = dict()
for item in tupla1:
    key, value = item.split('=', 1)
    value = value.replace('"', '')
    miDiccionarioTupla1.update({key: value})
print(miDiccionarioTupla1)
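For comparison, the same result can be written with str.partition, which always splits at the first = only. This is a sketch rather than the answerer's code: the function name parse_varbinds is hypothetical, and it strips only the surrounding quotes instead of every quote in the value.

```python
def parse_varbinds(items):
    """Build {oid: value} from strings shaped like 'oid="value"'."""
    result = {}
    for item in items:
        # partition() splits once, at the first '=', so '=' inside the value is kept.
        key, _, value = item.partition('=')
        result[key] = value.strip('"')
    return result


sample = [
    '.1.3.6.1.4.1.35873.5.1.2.1.1.1.5="1"',
    '.1.3.6.1.4.1.35873.5.1.2.1.1.1.4="port=2"',
    '.1.3.6.1.4.1.35873.5.1.2.1.1.1.10.4=""',
]
print(parse_varbinds(sample))
```

Note that tupla1 contains repeated keys, so with any dictionary-based approach the later entries overwrite the earlier ones.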

Push data to google sheet from dataframe

I'm trying to push data into my Google Sheet with the following code. How can I change the code so that it writes to the 2nd row, in the correct column based on the header that I've created?
First code:
class Header:
    def __init__(self):
        self.No_DOB_Y = 1
        self.No_DOB_M = 2
        self.No_DOB_D = 3
        self.Paid_too_much_little = 4
        self.No_number_of_ins = 5
        self.No_gender = 6
        self.No_first_login = 7
        self.No_last_login = 8
        self.Too_young_old = 9

    def __repr__(self):
        return str(self.__dict__)

    def add_col(self, name):
        # new columns get the next free column number
        setattr(self, name, max(self.__dict__.values()) + 1)

anomali_header = Header()
2nd part of code (NEW):
# No_gender
a = list(df.loc[df['gender'].isnull()]['id'])
#print(a)
cells=sh3.range(1,1,len(a),1)
for i,cell in enumerate(cells):
    cell.value=a[i]
sh3.update_cells(cells)
At the moment it writes starting at cell A1.
As you can see, the code writes the results starting at the first available cell, which is A1. What I essentially want is for them to appear below my anomali_header column "No_gender", but I'm not sure how to link the first part of the code to the second part.
Thanks to v25, the code below works, but rather than repeating the code for each check one by one, I wanted to create a loop that goes through all of them.
I'm trying to run the code below, but I get an error when I use the loop.
Error:
TypeError: 'list' object cannot be interpreted as an integer
Code:
# No_DOB_Y
a = list(df.loc[df['Year'].isnull()]['id'])
# No number of ins
b = list(df.loc[df['number of ins'].isnull()]['id'])
# No_gender
c = list(df.loc[df['gender'].isnull()]['id'])
# Updating anomalies to sheet
condition = [a,b,c]
column = [1,2,3]
for j in range(column,condition):
    cells=sh3.range(2,column,len(condition)+1,column)
    for i,cell in enumerate(cells):
        cell.value=condition[i]
    print('end of check')
    sh3.update_cells(cells)
You need to change the range() parameters:
first_row (int) – Row number
first_col (int) – Column number
last_row (int) – Row number
last_col (int) – Column number
So something like:
cells=sh3.range(2, 6, len(a)+1, 6)
Or you could issue the range as a string:
cells=sh3.range('F2:F' + str(len(a)+1))
These numbers may not be perfect, but this should change the positioning. You might need to tweak the digits slightly ;)
UPDATE:
I've encountered an error using a loop, updated my original post
TypeError: 'list' object cannot be interpreted as an integer
This is happening because the built-in range function, which you use in the for loop (not to be confused with sh3.range, which is a different function altogether), expects integers, but you're passing it lists.
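The error can be reproduced without any spreadsheet at all. The sketch below uses placeholder lists in place of the real a, b and c, and shows zip as one way to walk two lists in parallel.

```python
condition = [[101, 102], [103], [104, 105, 106]]  # placeholder id lists
column = [1, 2, 3]

try:
    for j in range(column, condition):  # range() needs integers, not lists
        pass
except TypeError as exc:
    print(exc)

# zip pairs each target column with its list of ids.
pairs = list(zip(column, condition))
for col, ids in pairs:
    print(col, ids)
```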
However, a simpler way to implement this would be to create a list of tuples which map the strings to column integers, then loop based on this. Something like:
col_map = [('Year', 1),
           ('number of ins', 5),
           ('gender', 6)]
for col_tup in col_map:
    df_list = list(df.loc[df[col_tup[0]].isnull()]['id'])
    cells = sh3.range(2, col_tup[1], len(df_list)+1, col_tup[1])
    for i, cell in enumerate(cells):
        cell.value = df_list[i]
    sh3.update_cells(cells)

Reading binary data in python

Firstly, before this question gets marked as a duplicate: I'm aware others have asked similar questions, but there doesn't seem to be a clear explanation. I'm trying to read a binary file into a 2D array (the format is documented well here: http://nsidc.org/data/docs/daac/nsidc0051_gsfc_seaice.gd.html).
The header is a 300 byte array.
So far, I have:
import struct
with open("nt_197912_n07_v1.1_n.bin",mode='rb') as file:
    filecontent = file.read()
x = struct.unpack("iiii",filecontent[:300])
This throws an error about the string argument length.
Reading the Data (Short Answer)
After you have determined the size of the grid (n_rows x n_cols = 448 x 304) from your header (see below), you can simply read the data using numpy.frombuffer.
import numpy as np
#...
#Get data from Numpy buffer
dt = np.dtype(('>u1', (n_rows, n_cols)))
x = np.frombuffer(filecontent[300:], dt) #we know the data starts from idx 300 onwards
#Remove unnecessary dimension that numpy gave us
x = x[0,:,:]
The '>u1' specifies the format of the data, in this case unsigned integers of size 1-byte, that are big-endian format.
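An equivalent way to write this, shown on synthetic data so it can be run without the NSIDC file, is to pass offset=300 to np.frombuffer and reshape afterwards. The 4 x 3 grid is a made-up stand-in for the real one, and np.uint8 is used because byte order does not matter for single-byte values.

```python
import numpy as np

n_rows, n_cols = 4, 3  # tiny stand-in for the real 448 x 304 grid
# Synthetic file: 300 header bytes followed by one byte per grid cell.
filecontent = bytes(300) + bytes(range(n_rows * n_cols))

# Skip the header with offset=, then give the flat array its 2D shape.
grid = np.frombuffer(filecontent, dtype=np.uint8, offset=300)
grid = grid.reshape(n_rows, n_cols)

print(grid.shape)
print(grid[1])
```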
Plotting this with matplotlib.pyplot
import matplotlib.pyplot as plt
#...
plt.imshow(x, extent=[0,3,-3,3], aspect="auto")
plt.show()
The extent= option simply specifies the axis values, you can change these to lat/lon for example (parsed from your header)
Explanation of Error from .unpack()
From the docs for struct.unpack(fmt, string):
The string must contain exactly the amount of data required by the format (len(string) must equal calcsize(fmt))
You can determine the size specified in the format string (fmt) by looking at the Format Characters section.
Your fmt in struct.unpack("iiii",filecontent[:300]) specifies 4 int types (you can also write 4i instead of iiii for simplicity), each of which has size 4, so it requires a string of length 16.
Your string (filecontent[:300]) has length 300, whilst your fmt asks for a string of length 16, hence the error.
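This size check can be done directly with struct.calcsize. A small sketch, using a dummy 300-byte buffer as a stand-in for the real header:

```python
import struct

header = bytes(300)  # stand-in for filecontent[:300]

print(struct.calcsize("iiii"))  # bytes the format expects
print(len(header))              # bytes actually supplied

try:
    struct.unpack("iiii", header)
except struct.error as exc:
    print("unpack failed:", exc)

# Passing exactly calcsize(fmt) bytes works.
values = struct.unpack("iiii", header[:struct.calcsize("iiii")])
print(values)
```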
Example Usage of .unpack()
As an example, reading your supplied document I extracted the first 21*6 bytes, which has format:
a 21-element array of 6-byte character strings that contain information such as polar stereographic grid characteristics
With:
x = struct.unpack("6s"*21, filecontent[:126])
This returns a tuple of 21 elements. Note the whitespace padding in some elements to meet the 6-byte requirement.
>>> print(x)
# ('00255\x00', ' 304\x00', ' 448\x00', '1.799\x00', '39.43\x00', '45.00\x00', '558.4\x00',
#  '154.0\x00', '234.0\x00', ' SMMR\x00', '07 cn\x00', ' 336\x00', ' 0000\x00', ' 0034\x00',
#  ' 364\x00', ' 0000\x00', ' 0046\x00', ' 1979\x00', ' 336\x00', ' 000\x00', '00250\x00')
Notes:
The first argument fmt, "6s"*21 is a string with 6s repeated 21
times. Each format-character 6s represents one string of 6-bytes
(see below), this will match the required format specified in your
document.
The number 126 in filecontent[:126] is calculated as 6*21 = 126.
Note that for the s (string) specifier, the preceding number does
not mean to repeat the format character 6 times (as it would
normally for other format characters). Instead, it specifies the size
of the string. s represents a 1-byte string, whilst 6s represents
a 6-byte string.
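The difference between a repeat count and a string length can be seen on a tiny example (the byte string here is made up):

```python
import struct

raw = b"  304\x00"  # one 6-byte field, padded like the header fields

as_string = struct.unpack("6s", raw)  # one item: a single 6-byte string
as_chars = struct.unpack("6c", raw)   # six items: one byte each

print(as_string)
print(len(as_chars))
```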
More Extensive Solution for Header Reading (Long)
Because the binary data must be manually specified, this may be tedious to do in source code. You can consider using some configuration file (like .ini file)
This function will read the header and store it in a dictionary, where the structure is given from a .ini file
# use configparser for Python 3.x
import ConfigParser
import struct

def read_header(data, config_file):
    """
    Read binary data specified by an INI file which specifies the structure
    """
    with open(config_file) as fd:
        # Init the config class
        conf = ConfigParser.ConfigParser()
        conf.readfp(fd)
        # preallocate dictionary to store data
        header = {}
        # Iterate over the key-value pairs under the
        # 'structure' section
        for key in conf.options('structure'):
            # determine the string properties
            start_idx, end_idx = [int(x) for x in conf.get('structure', key).split(',')]
            start_idx -= 1  # remember python is zero indexed!
            strLength = end_idx - start_idx
            # Get the data
            header[key] = struct.unpack("%is" % strLength, data[start_idx:end_idx])
            # Format the data
            header[key] = [x.strip() for x in header[key]]
            header[key] = [x.replace('\x00', '') for x in header[key]]
        # Unmap from list-type
        # use .items() for Python 3.x
        header = {k: v[0] for k, v in header.iteritems()}
    return header
An example .ini file is shown below. The key is the name to use when storing the data, and the value is a comma-separated pair of numbers, the first being the starting index and the second being the ending index. These values were taken from Table 1 in your document.
[structure]
missing_data: 1, 6
n_cols: 7, 12
n_rows: 13, 18
latitude_enclosed: 25, 30
This function can be used as follows:
header = read_header(filecontent, 'headerStructure.ini')
n_cols = int(header['n_cols'])
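For Python 3, a variant along the following lines might work. It is a sketch under a few assumptions: the INI text is passed in as a string instead of a file name, plain slicing replaces struct.unpack, the fields are taken to be ASCII, and the header bytes are made up to resemble the first three fields shown earlier.

```python
import configparser

INI_TEXT = """
[structure]
missing_data: 1, 6
n_cols: 7, 12
n_rows: 13, 18
"""


def read_header(data, ini_text):
    """Slice 1-based, inclusive byte ranges out of data, as listed in ini_text."""
    conf = configparser.ConfigParser()
    conf.read_string(ini_text)
    header = {}
    for key, span in conf.items('structure'):
        start, end = (int(x) for x in span.split(','))
        field = data[start - 1:end]  # convert the 1-based start to a 0-based slice
        # Decode the bytes, then drop NUL padding and surrounding spaces.
        header[key] = field.decode('ascii').replace('\x00', '').strip()
    return header


# Made-up header bytes: three 6-byte fields.
fake_header = b'00255\x00  304\x00  448\x00'
header = read_header(fake_header, INI_TEXT)
print(header)
```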

Writing and reading a row array (nx1) to a binary file in Python with struct pack

I'm having a lot of trouble writing to and reading from a binary file when working with a nx1 row vector that has been written to a binary file using struct.pack. The file structure looks like this (given an argument data that is of type numpy.array) :
test.file
--------
[format_code = 3] : 4 bytes (the code 3 means a vector) - fid.write(struct.pack('i',3))
[rows] : 4 bytes (fid.write(struct.pack('i',sz[0])) where sz = data.shape
[cols] : 4 bytes (fid.write(struct.pack('i',sz[1]))
[data] : type double = 8 bytes * (rows * cols)
Unfortunately, since these files are mostly written in MATLAB, where I have a working class that reads and writes these fields, I can't only write the amount of rows (I need columns as well even if a column does only = 1).
I've tried a few ways to pack data, none of which have worked when trying to unpack it (assume I've opened my file denoted by fid in 'rb'/'wb' and have done some error checking):
# write data
sz = data.shape
datalen=8*sz[0]*sz[1]
fid.write(struct.pack('i',3)) # format code
fid.write(struct.pack('i',sz[0])) # rows
fid.write(struct.pack('i',sz[1])) # columns
### write attempt ###
for i in xrange(sz[0]):
    for j in xrange(sz[1]):
        fid.write(struct.pack('d',float(data[i][j]))) # write in 'c' convention, so we transpose
### read attempt ###
format_code = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
rows = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
cols = struct.unpack('i',fid.read(struct.calcsize('i')))[0]
out_datalen = 8 * rows * cols # size of structure
output_data=numpy.array(struct.unpack('%dd' % out_datalen,fid.read(datalen)),dtype=float)
So far, when reading, my output has just seemingly been multiplied by random things. I don't know what's happening.
I found another similar question, and so I wrote my data as such:
fid.write(struct.pack('%sd' % len(data), *data))
However, when reading it back using:
numpy.array(struct.unpack('%sd' % out_datalen,fid.read(datalen)),dtype=float)
I get nothing in my array.
Similarly, just doing:
fid.write(struct.pack('%dd' % datalen, *data))
and reading it back with:
numpy.array(struct.unpack('%dd' % out_datalen,fid.read(datalen)),dtype=float)
also gives me an empty array. How can I fix this?
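One likely culprit in the attempts above is the repeat count in the format string: in '%dd' % n, n is the number of doubles, not the number of bytes, so it should be rows * cols rather than 8 * rows * cols. The sketch below round-trips a small vector under that assumption, using an in-memory io.BytesIO buffer in place of the real file and nested lists in place of the numpy array; write_matrix and read_matrix are hypothetical names.

```python
import io
import struct


def write_matrix(fid, data):
    """Write format code 3, rows, cols, then the values as doubles, row by row."""
    rows, cols = len(data), len(data[0])
    fid.write(struct.pack('3i', 3, rows, cols))
    flat = [float(v) for row in data for v in row]
    fid.write(struct.pack('%dd' % (rows * cols), *flat))


def read_matrix(fid):
    """Read back what write_matrix wrote and return a list of rows."""
    format_code, rows, cols = struct.unpack('3i', fid.read(struct.calcsize('3i')))
    count = rows * cols                    # number of doubles
    nbytes = count * struct.calcsize('d')  # number of bytes to read
    flat = struct.unpack('%dd' % count, fid.read(nbytes))
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


buf = io.BytesIO()
write_matrix(buf, [[1.5], [2.5], [3.5]])  # a 3 x 1 vector
buf.seek(0)
result = read_matrix(buf)
print(result)
```

Whether the MATLAB side expects row-major or column-major element order is not stated in the question, so the row-by-row order used here is an assumption.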

Accessing data range with h5py

I have an h5 file that contains 62 different attributes. I would like to access the data range of each one of them.
To explain more, here is what I'm doing:
import h5py
the_file = h5py.File("myfile.h5","r")
data = the_file["data"]
att = data.keys()
The previous code gives me a list of attributes: "U", "T", "H", etc.
Let's say I want to know the minimum and maximum value of "U". How can I do that?
This is the output of running "h5dump -H":
HDF5 "myfile.h5" {
GROUP "/" {
   GROUP "data" {
      ATTRIBUTE "datafield_names" {
         DATATYPE H5T_STRING {
            STRSIZE 8;
            STRPAD H5T_STR_SPACEPAD;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE SIMPLE { ( 62 ) / ( 62 ) }
      }
      ATTRIBUTE "dimensions" {
         DATATYPE H5T_STD_I32BE
         DATASPACE SIMPLE { ( 4 ) / ( 4 ) }
      }
      ATTRIBUTE "time_variables" {
         DATATYPE H5T_IEEE_F64BE
         DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
      }
      DATASET "Temperature" {
         DATATYPE H5T_IEEE_F64BE
         DATASPACE SIMPLE { ( 256, 512, 1024 ) / ( 256, 512, 1024 ) }
      }
It might be a difference in terminology, but hdf5 attributes are accessed via the attrs attribute of a Dataset object. I call what you have variables or datasets. Anyway...
I'm guessing from your description that the attributes are just arrays, so you should be able to do the following to get the data for each attribute and then calculate the min and max like for any numpy array:
attr_data = data["U"][:] # gets a copy of the array
min = attr_data.min()
max = attr_data.max()
So if you wanted the min/max of each attribute you can just do a for loop over the attribute names or you could use
for attr_name,attr_value in data.items():
    min = attr_value[:].min()
Edit to answer your first comment:
h5py's objects can be used like python dictionaries. So when you use 'keys()' you are not actually getting data, you are getting the name (or key) of that data. For example, if you run the_file.keys() you will get a list of every hdf5 dataset in the root path of that hdf5 file. If you continue along a path you will end up with the dataset that holds the actual binary data. So for example, you might start with (in an interpreter at first):
the_file = h5py.File("myfile.h5","r")
print(the_file.keys())
# this will result in a list of keys maybe ["raw_data","meta_data"] or something
print(the_file["raw_data"].keys())
# this will result in another list of keys maybe ["temperature","humidity"]
# eventually you'll get to the dataset that actually has the data or attributes you are looking for
# think of this process as going through a directory structure or a path to get to a file (or a dataset/variable in this case)
the_data_var = the_file["raw_data"]["temperature"]
the_data_array = the_data_var[:]
print(the_data_var.attrs.keys())
# this will result in a list of attribute names/keys
an_attr_of_the_data = the_data_var.attrs["measurement_time"][:]
# So now you have "the_data_array" which is a numpy array and "an_attr_of_the_data" which is whatever it happened to be
# you can get the min/max of the data by doing like before
print(the_data_array.min())
print(the_data_array.max())
Edit 2 - Why do people format their hdf files this way? It defeats the purpose.
I think you may have to talk to the person who made this file if possible. If you made it, then you'll be able to answer my questions for yourself. First, are you sure that in your original example data.keys() returned "U","T",etc.? Unless h5py is doing something magical or if you didn't provide all of the output of the h5dump, that could not have been your output. I'll explain what the h5dump is telling me, but please try to understand what I am doing and not just copy and paste into your terminal.
# Get a handle to the "data" Group
data = the_file["data"]
# As you can see from the dump this data group has 3 attributes and 1 dataset
# The name of the attributes are "datafield_names","dimensions","time_variables"
# This should result in a list of those names:
print(data.attrs.keys())
# The name of the dataset is "Temperature" and should be the only item in the list returned by:
print(data.keys())
As you can see from the h5dump, there are 62 datafield_names (strings), 4 dimensions (32-bit integers, I think), and 2 time_variables (64-bit floats). It also tells me that Temperature is a 3-dimensional array, 256 x 512 x 1024 (64-bit floats). Do you see where I'm getting this information? Now comes the hard part, you will need to determine how the datafield_names match up with the Temperature array. This was done by the person who made the file, so you'll have to figure out what each row/column in the Temperature array means. My first guess would be that each row in the Temperature array is one of the datafield_names, maybe 2 more for each time? But this doesn't work since there are too many rows in the array. Maybe the dimensions fit in there some how? Lastly here is how you get each of those pieces of information (continuing from before):
# Get the temperature array (I can't remember if the 3 sets of colons is required, but try it and if not just use one)
temp_array = data["Temperature"][:,:,:]
# Get all of the datafield_names (list of strings of length 62)
datafields = data.attrs["datafield_names"][:]
# Get all of the dimensions (list of integers of length 4)
dims = data.attrs["dimensions"][:]
# Get all of the time variables (list of floats of length 2)
time_variables = data.attrs["time_variables"]
# If you want the min/max of the entire temperature array this should work:
print(temp_array.min())
print(temp_array.max())
# If you knew that row 0 of the array had the temperatures you wanted to analyze
# then this would work, but it all depends on how the creator organized the data/file:
print(temp_array[0].min())
print(temp_array[0].max())
I'm sorry I can't be of more help, but without actually having the file and knowing what each field means this is about all I can do. Try to understand how I used h5py to read the information. Try to understand how I translated the header information (h5dump output) into information that I could actually use with h5py. If you know how the data is organized in the array you should be able to do what you want. Good luck, I'll help more if I can.
Since h5py arrays are closely related to numpy arrays, you can use the numpy.min and numpy.max functions to do this:
maxItem = numpy.max(data['U'][:]) # Find the max of item 'U'
minItem = numpy.min(data['H'][:]) # Find the min of item 'H'
Note the ':', it is needed to convert the data to a numpy array.
You can call min and max (row-wise) on the DataFrame:
In [1]: df = pd.DataFrame([[1, 6], [5, 2], [4, 3]], columns=list('UT'))
In [2]: df
Out[2]:
   U  T
0  1  6
1  5  2
2  4  3
In [3]: df.min(0)
Out[3]:
U    1
T    2
In [4]: df.max(0)
Out[4]:
U    5
T    6
Did you mean data.attrs rather than data itself? If so,
import h5py
with h5py.File("myfile.h5", "w") as the_file:
    dset = the_file.create_dataset('MyDataset', (100, 100), 'i')
    dset.attrs['U'] = (0,1,2,3)
    dset.attrs['T'] = (2,3,4,5)
with h5py.File("myfile.h5", "r") as the_file:
    data = the_file["MyDataset"]
    print({key:(min(value), max(value)) for key, value in data.attrs.items()})
yields
{u'U': (0, 3), u'T': (2, 5)}
