Let's say I need to save a matrix(each line corresponds one row) that could be loaded from fortran later. What method should I prefer? Is converting everything to string is the only one approach?
You can save them in binary format as well. Please see the documentation on the struct standard module, it has a pack function for converting Python object into binary data.
For example:
import struct
value = 3.141592654
data = struct.pack('d', value)
open('file.ext', 'wb').write(data)
You can convert each element of your matrix and write to a file. Fortran should be able to load that binary data. You can speed up the process by converting a row as a whole, like this:
row_data = struct.pack('d' * len(matrix_row), *matrix_row)
Please note, that 'd' * len(matrix_row) is a constant for your matrix size, so you need to calculate that format string only once.
I don't know fortran, so it's hard to tell what is easy for you to perform on that side for parsing.
It sounds like your options are either saving the doubles in plaintext (meaning, 'converting' them to string), or in binary (using struct and the likes). The decision for which one is better depends.
I would go with the plaintext solution, as it means the files will be easily readable, and you won't have to mess with different kinds of details (endianity, default double sizes).
But, there are cases where binary is better (for example, if you have a really big list of doubles and space is of importance, or if it is easier for you to parse it and you need the optimization) - but this is likely not your case.
You can use JSON
import json
matrix = [[2.3452452435, 3.34134], [4.5, 7.9]]
data = json.dumps(matrix)
open('file.ext', 'wb').write(data)
File content will look like:
[[2.3452452435, 3.3413400000000002], [4.5, 7.9000000000000004]]
If legibility and ease of access is important (and file size is reasonable), Fortran can easily parse a simple array of numbers, at least if it knows the size of the matrix beforehand (with something like READ(FILE_ID, '2(F)'), I think):
1.234 5.6789e4
3.1415 9.265358978
42 ...
Two nested for loops in your Python code can easily write your matrix in this form.
Related
I'm new to Python, and to programming in general, so please don't take it too hard on me
I am currently trying to figure out how to write a new wav file using a string (which was derived from another wave file's data)
I performed a fourier transform on that file's data, so now I'm trying to get the values from the Fourier transform written into a new wav file.
I can only use numpy and the included Python library, not scipy
According to the documentation, I have to use wave_write(), but I have no idea what the code is supposed to look like for this function.
I think I'm supposed to do something pertaining to
wave_write.writeframesraw(data)
Then again, not totally sure of what to do.
Any help is greatly appreciated!
Two functions in NumPy can help you with this: astype and tostring.
If you have an array of sound samples, say X then you can convert it to the right format using astype. This will depend on what data type is used in the wav file, and the library you are using to save it. But let us for this example say you want to store it as 16 bit integer. You'll need to scale X according to the data type selected - so in this case the range will be -32768 to 32767 for a signed 16 bit int. If you sample goes from -1.0 to 1.0 then you can simply multiply with 32767.
The next part is simply to convert it to a string using tostring, it could look something the following:
scaled = X * 32767
scaled.astype('<i2').tostring()
You can find the documentation for the functions here:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.astype.html
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.tostring.html
I have a 60GB FITS file containing a binary table. I would like to read (and process) this table one row/entry/line/block* at a time.
(*I'm unsure of the correct nomenclature)
I am using pyfits and what I would like to do boils down to simply:
import pyfits
hdulist = = pyfits.open("file.fits")
# the binary table has to be in the 2nd extension
# hence it is in hdulist[1]
n_entries = hdulist[1].header['NAXIS2']
for i in xrange(n_entries):
entry = hdulist[1].data[i] # I am confused what happens at this step
# now do stuff with the values in entry
# .....
The variable entry is of type <class 'pyfits.fitsrec.FITS_record'> and has a length equal to the number of columns in the binary table. However what appears to happen is the whole of the binary table is read into memory at this line: entry = hdulist[1].data[i].
I have looked through the pyfits documentation but I can't find any methods that seem to read data from a binary table extension on a table entry by table entry basis (or small sets of entries at a time). I don't want to select certain entries from the table, just simply scan through them in order.
I guess my questions are:
0) What is happening at the hdulist[1].data[i] step? Why is everything being read into memory? (is there some way around this?)
1) Have I missed something and can pyfits actually do what I want?
2) Is there another python library out there that will?
(ie using a binary table in a FITS extension)
3) If not, can I re-write the data in a different binary (or other compressed/not ascii) format (that is not FITS) and find some other python library or module to do what I want?
pyfits currently lacks a row iterator for tables. If the data columns are such that they require no conversion from the on disk storage format to their "physical" values then reading tables is fast. But otherwise it currently blows up if you try to read such columns. I wouldn't fight it too much as the table interface is being rewritten, but in the meantime you might want to try the fitsio library which is a Python wrapper around CFITSIO and provides efficient row-based iteration of tables.
I have to read a binary file in python. This is first written by a Fortran 90 program in this way:
open(unit=10,file=filename,form='unformatted')
write(10)table%n1,table%n2
write(10)table%nH
write(10)table%T2
write(10)table%cool
write(10)table%heat
write(10)table%cool_com
write(10)table%heat_com
write(10)table%metal
write(10)table%cool_prime
write(10)table%heat_prime
write(10)table%cool_com_prime
write(10)table%heat_com_prime
write(10)table%metal_prime
write(10)table%mu
if (if_species_abundances) write(10)table%n_spec
close(10)
I can easily read this binary file with the following IDL code:
n1=161L
n2=101L
openr,1,file,/f77_unformatted
readu,1,n1,n2
print,n1,n2
spec=dblarr(n1,n2,6)
metal=dblarr(n1,n2)
cool=dblarr(n1,n2)
heat=dblarr(n1,n2)
metal_prime=dblarr(n1,n2)
cool_prime=dblarr(n1,n2)
heat_prime=dblarr(n1,n2)
mu =dblarr(n1,n2)
n =dblarr(n1)
T =dblarr(n2)
Teq =dblarr(n1)
readu,1,n
readu,1,T
readu,1,Teq
readu,1,cool
readu,1,heat
readu,1,metal
readu,1,cool_prime
readu,1,heat_prime
readu,1,metal_prime
readu,1,mu
readu,1,spec
print,spec
close,1
What I want to do is reading this binary file with Python. But there are some problems.
First of all, here is my attempt to read the file:
import numpy
from numpy import *
import struct
file='name_of_my_file'
with open(file,mode='rb') as lines:
c=lines.read()
I try to read the first two variables:
dummy, n1, n2, dummy = struct.unpack('iiii',c[:16])
But as you can see I had to add to dummy variables because, somehow, the fortran programs add the integer 8 in those positions.
The problem is now when trying to read the other bytes. I don't get the same result of the IDL program.
Here is my attempt to read the array n
double = 8
end = 16+n1*double
nH = struct.unpack('d'*n1,c[16:end])
However, when I print this array I get non sense value. I mean, I can read the file with the above IDL code, so I know what to expect. So my question is: how can I read this file when I don't know exactly the structure? Why with IDL it is so simple to read it? I need to read this data set with Python.
What you're looking for is the struct module.
This module allows you to unpack data from strings, treating it like binary data.
You supply a format string, and your file string, and it will consume the data returning you binary objects.
For example, using your variables:
import struct
content = f.read() #I'm not sure why in a binary file you were using "readlines",
#but if this is too much data, you can supply a size to read()
n, T, Teq, cool = struct.unpack("dddd",content[:32])
This will make n, T, Teq, and cool hold the first four doubles in your binary file. Of course, this is just a demonstration. Your example looks like it wants lists of doubles - conveniently struct.unpack returns a tuple, which I take for your case will still work fine (if not, you can listify them). Keep in mind that struct.unpack needs to consume the whole string passed into it - otherwise you'll get a struct.error. So, either slice your input string, or only read the number of characters you'll use, like I said above in my comment.
For example,
n_content = f.read(8*number_of_ns) #8, because doubles are 8 bytes
n = struct.unpack("d"*number_of_ns,n_content)
Did you give scipy.io.readsav a try?
Simply read you file like this:
mydict = scipy.io.readsav('name_of_file')
It looks like you are trying to read the cooling_0000x.out file generated by RAMSES.
Note that the first two integers (n1, n2) provide the dimensions of the two dimentional tables (arrays) that follow in the body of the file... So you need to first process those two integers before you know how much real*8 data is in the rest of the file.
scipy should be of help -- it lets you read arbitrary dimensioned binary data:
http://wiki.scipy.org/Cookbook/InputOutput#head-e35c7736718209eea00ebf37a7e1dfb91df696e1
If you already have this python code, please let me know as I was going to write it today (17Sep2014).
Rick
I know there have been some questions regarding file reading, binary data handling and integer conversion using struct before, so I come here to ask about a piece of code I have that I think is taking too much time to run. The file being read is a multichannel datasample recording (short integers), with intercalated intervals of data (hence the nested for statements). The code is as follows:
# channel_content is a dictionary, channel_content[channel]['nsamples'] is a string
for rec in xrange(number_of_intervals)):
for channel in channel_names:
channel_content[channel]['recording'].extend(
[struct.unpack( "h", f.read(2))[0]
for iteration in xrange(int(channel_content[channel]['nsamples']))])
With this code, I get 2.2 seconds per megabyte read with a dual-core with 2Mb RAM, and my files typically have 20+ Mb, which gives some very annoying delay (specially considering another benchmark shareware program I am trying to mirror loads the file WAY faster).
What I would like to know:
If there is some violation of "good practice": bad-arranged loops, repetitive operations that take longer than necessary, use of inefficient container types (dictionaries?), etc.
If this reading speed is normal, or normal to Python, and if reading speed
If creating a C++ compiled extension would be likely to improve performance, and if it would be a recommended approach.
(of course) If anyone suggests some modification to this code, preferrably based on previous experience with similar operations.
Thanks for reading
(I have already posted a few questions about this job of mine, I hope they are all conceptually unrelated, and I also hope not being too repetitive.)
Edit: channel_names is a list, so I made the correction suggested by #eumiro (remove typoed brackets)
Edit: I am currently going with Sebastian's suggestion of using array with fromfile() method, and will soon put the final code here. Besides, every contibution has been very useful to me, and I very gladly thank everyone who kindly answered.
Final Form after going with array.fromfile() once, and then alternately extending one array for each channel via slicing the big array:
fullsamples = array('h')
fullsamples.fromfile(f, os.path.getsize(f.filename)/fullsamples.itemsize - f.tell())
position = 0
for rec in xrange(int(self.header['nrecs'])):
for channel in self.channel_labels:
samples = int(self.channel_content[channel]['nsamples'])
self.channel_content[channel]['recording'].extend(
fullsamples[position:position+samples])
position += samples
The speed improvement was very impressive over reading the file a bit at a time, or using struct in any form.
You could use array to read your data:
import array
import os
fn = 'data.bin'
a = array.array('h')
a.fromfile(open(fn, 'rb'), os.path.getsize(fn) // a.itemsize)
It is 40x times faster than struct.unpack from #samplebias's answer.
If the files are only 20-30M, why not read the entire file, decode the nums in a single call to unpack and then distribute them among your channels by iterating over the array:
data = open('data.bin', 'rb').read()
values = struct.unpack('%dh' % len(data)/2, data)
del data
# iterate over channels, and assign from values using indices/slices
A quick test showed this resulted in a 10x speedup over struct.unpack('h', f.read(2)) on a 20M file.
A single array fromfile call is definitively fastest, but wont work if the dataseries is interleaved with other value types.
In such cases, another big speedincrease that can be combined with the previous struct answers, is that instead of calling the unpack function multiple times, precompile a struct.Struct object with the format for each chunk. From the docs:
Creating a Struct object once and calling its methods is more
efficient than calling the struct functions with the same format since
the format string only needs to be compiled once.
So for instance, if you wanted to unpack 1000 interleaved shorts and floats at a time, you could write:
chunksize = 1000
structobj = struct.Struct("hf" * chunksize)
while True:
chunkdata = structobj.unpack(fileobj.read(structobj.size))
(Note that the example is only partial and needs to account for changing the chunksize at the end of the file and breaking the while loop.)
extend() acepts iterables, that is to say instead of .extend([...]) , you can write .extend(...) . It is likely to speed up the program because extend() will process on a generator , no more on a built list
There is an incoherence in your code: you define first channel_content = {} , and after that you perform channel_content[channel]['recording'].extend(...) that needs the preliminary existence of a key channel and a subkey 'recording' with a list as a value to be able to extend to something
What is the nature of self.channel_content[channel]['nsamples'] so that it can be submitted to int() function ?
Where do number_of_intervals come from ? What is the nature of the intervals ?
In the rec in xrange(number_of_intervals)): loop , I don't see anymore rec . So it seems to me that you are repeating the same loop process for channel in channel_names: as many times as the number expressed by number_of_intervals . Are there number_of_intervals * int(self.channel_content[channel]['nsamples']) * 2 values to read in f ?
I read in the doc:
class struct.Struct(format)
Return a
new Struct object which writes and
reads binary data according to the
format string format. Creating a
Struct object once and calling its
methods is more efficient than calling
the struct functions with the same
format since the format string only
needs to be compiled once.
This expresses the same idea as samplebias.
If your aim is to create a dictionary, there is also the possibility to use dict() with a generator as argument
.
EDIT
I propose
channel_content = {}
for rec in xrange(number_of_intervals)):
for channel in channel_names:
N = int(self.channel_content[channel]['nsamples'])
upk = str(N)+"h", f.read(2*N)
channel_content[channel]['recording'].extend(struct.unpack(x) for i,x in enumerate(upk) if not i%2)
I don't know how to take account of the J.F. Sebastian's suggestion to use array
Not sure if it would be faster, but I would try to decode chunks of words instead of one word a time. For example, you could read 100 bytes of data a time like:
s = f.read(100)
struct.unpack(str(len(s)/2)+"h", s)
I'm reading in a binary file into a list and parsing the binary data. I'm using unpack() to extract certain parts of the data as primitive data types, and I want to edit that data and insert it back into the original list of bytes. Using pack_into() would make it easy, except that I'm using Python 2.4, and pack_into() wasn't introduced until 2.5
Does anyone know of a good way to go about serializing the data this way so that I can accomplish essentially the same functionality as pack_into()?
Have you looked at the bitstring module? It's designed to make the construction, parsing and modification of binary data easier than using the struct and array modules directly.
It's especially made for working at the bit level, but will work with bytes just as well. It will also work with Python 2.4.
from bitstring import BitString
s = BitString(filename='somefile')
# replace byte range with new values
# The step of '8' signifies byte rather than bit indicies.
s[10:15:8] = '0x001122'
# Search and replace byte value with two bytes
s.replace('0xcc', '0xddee', bytealigned=True)
# Different interpretations of the data are available through properties
if s[5:7:8].int > 1000:
s[5:7:8] = 1000
# Use the bytes property to get back to a Python string
open('newfile', 'wb').write(s.bytes)
The underlying data stored in the BitString is just an array object, but with a comprehensive set of functions and special methods to make it simple to modify and interpret.
Do you mean editing data in a buffer object? Documentation on manipulating those at all from Python directly is fairly scarce.
If you just want to edit bytes in a string, it's simple enough, though; struct.pack_into is new to 2.5, but struct.pack isn't:
import struct
s = open("file").read()
ofs = 1024
fmt = "Ih"
size = struct.calcsize(fmt)
before, data, after = s[0:ofs], s[ofs:ofs+size], s[ofs+size:]
values = list(struct.unpack(fmt, data))
values[0] += 5
values[1] /= 2
data = struct.pack(fmt, *values)
s = "".join([before, data, after])