Adding Header to Numpy array - python

I have an array I would like to add a header to.
This is what I have now:
0.0,1.630000e+01,1.990000e+01,1.840000e+01
1.0,1.630000e+01,1.990000e+01,1.840000e+01
2.0,1.630000e+01,1.990000e+01,1.840000e+01
This is what I want:
SP,1,2,3
0.0,1.630000e+01,1.990000e+01,1.840000e+01
1.0,1.630000e+01,1.990000e+01,1.840000e+01
2.0,1.630000e+01,1.990000e+01,1.840000e+01
Notes:
"SP" will always be first, followed by the numbering of the columns, which may vary.
Here is my existing code:
fmt = ",".join(["%s"] + ["%10.6e"] * (my_array.shape[1]-1))
np.savetxt('final.csv', my_array, fmt=fmt,delimiter=",")

Since NumPy 1.7.0, numpy.savetxt has had three parameters for exactly this purpose: header, footer and comments. So the code to do what you want can easily be written as:
import numpy
a = numpy.array([[0.0, 1.630000e+01, 1.990000e+01, 1.840000e+01],
                 [1.0, 1.630000e+01, 1.990000e+01, 1.840000e+01],
                 [2.0, 1.630000e+01, 1.990000e+01, 1.840000e+01]])
fmt = ",".join(["%s"] + ["%10.6e"] * (a.shape[1] - 1))
numpy.savetxt("temp", a, fmt=fmt, header="SP,1,2,3", comments='')

Note: this answer was written for an older version of numpy, relevant when the question was written. With modern numpy, makhlaghi's answer provides a more elegant solution.
Since numpy.savetxt can also write to file objects, you can open the file yourself and write your header before the data:
import numpy
a = numpy.array([[0.0, 1.630000e+01, 1.990000e+01, 1.840000e+01],
                 [1.0, 1.630000e+01, 1.990000e+01, 1.840000e+01],
                 [2.0, 1.630000e+01, 1.990000e+01, 1.840000e+01]])
fmt = ",".join(["%s"] + ["%10.6e"] * (a.shape[1] - 1))
# numpy.savetxt, at least as of numpy 1.6.2, writes bytes
# to file, which doesn't work with a file open in text mode. To
# work around this deficiency, open the file in binary mode, and
# write out the header as bytes.
with open('final.csv', 'wb') as f:
    f.write(b'SP,1,2,3\n')
    #f.write(bytes("SP,"+lists+"\n","UTF-8"))
    #Used this line for a variable list of numbers
    numpy.savetxt(f, a, fmt=fmt, delimiter=",")
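To expand on the commented-out lines about a variable list of numbers, here is a minimal sketch of writing a variable-width header as bytes (column_labels is a hypothetical list standing in for whatever labels you actually have):

column_labels = [1, 2, 3]  # hypothetical; replace with your actual (varying) labels
header = "SP," + ",".join(str(c) for c in column_labels) + "\n"
f.write(header.encode("UTF-8"))  # must be bytes, since the file is opened in 'wb' mode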

It is also possible to save things other than numpy arrays to a file using the savez or savez_compressed functions. Using the load function you can then retrieve all the data and access it like a dict.
import numpy as np
np.savez("filename.npz", array_to_save=np.array([0.0, 0.0]), header="Some header")
data = np.load("filename.npz")
array = data["array_to_save"]
header = str(data["header"])
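Note that everything stored this way comes back as a numpy array; the header string is returned as a 0-dimensional array, which is why str() is used above. A short sketch of inspecting the archive:

print(data.files)             # lists the stored arrays, e.g. ['array_to_save', 'header']
print(data["header"].item())  # plain Python string, equivalent to str() above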

Related

How to conserve header when saving an edited .fits file with Astropy?

I'm editing a .fits file I have in Python, but I want the header to stay exactly the same. This is the code:
import numpy as np
from astropy.io import fits
import matplotlib.pyplot as plt
# read in the fits file
im = fits.getdata('myfile.fits')
header = fits.getheader('myfile.fits')
ID = 1234
newim = np.copy(im)
newim[newim == ID] = 0
newim[newim == 0] = -99
newim[newim > -99] = 0
newim[newim == -99] = 1
plt.imshow(newim,cmap='gray', origin='lower')
plt.colorbar()
hdu = fits.PrimaryHDU(newim)
hdu.writeto('mynewfile.fits')
All of this is fine and does exactly what I want, except that it does not conserve the header when it saves the new file. Is there any way to fix this so that the original header is not lost?
First of all don't do this:
im = fits.getdata('myfile.fits')
header = fits.getheader('myfile.fits')
As explained in the warning here, this kind of usage is discouraged (newer versions of the library have a caching mechanism that makes this less inefficient than it used to be, but it's still a problem). This is because the first one returns just the data array from the file, and the latter returns just the header from a file. At that point there's no longer any association between them; it's just a plain Numpy ndarray and a plain Header and their associations with a specific file are not tracked.
You can return the full HDUList data structure which represents the HDUs in a file, and for each HDU there's an HDU object associating headers with their arrays.
In your example you can just open the file, modify the data array in-place, and then use the .writeto method on it to write it to a new file, or if you open it with mode='update' you can modify the existing file in-place. E.g.
hdul = fits.open('old.fits')
# modify the data in the primary HDU; this is just an in-memory operation
# and will not change the data on disk
hdul[0].data += 1
hdul.writeto('new.fits')
There's also no clear reason for doing this in your code:
newim = np.copy(im)
Unless you have a specific reason to keep an unmodified copy of the original array in memory, you can just directly modify the original array in-place.
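Applied to the original code, a minimal sketch that keeps the header, assuming the image lives in the primary HDU of myfile.fits and has a signed dtype:

from astropy.io import fits

ID = 1234
with fits.open('myfile.fits') as hdul:
    im = hdul[0].data
    # same masking logic as in the question, applied in place
    im[im == ID] = 0
    im[im == 0] = -99
    im[im > -99] = 0
    im[im == -99] = 1
    # the HDU still carries its original header, so it is written out unchanged
    hdul.writeto('mynewfile.fits')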

How do I read a text file of numbers into an array of arrays

In python, using the OpenCV library, I need to create some polylines. The example code for the polylines method shows:
cv2.polylines(img,[pts],True,(0,255,255))
I have all the 'pts' laid out in a text file in the format:
x1,y1,x2,y2,x3,y3,x4,y4
x1,y1,x2,y2,x3,y3,x4,y4
x1,y1,x2,y2,x3,y3,x4,y4
How can I read this file and provide the data to the [pts] variable in the method call?
I've tried the np.array(csv.reader(...)) method as well as a few others I've found examples of. I can successfully read the file, but it's not in the format the polylines method wants. (I am a newbie when it comes to Python; if this were C++ or Java, it wouldn't be a problem.)
I would try to use numpy to read the csv as an array.
from numpy import genfromtxt
p = genfromtxt('myfile.csv', delimiter=',')
cv2.polylines(img,p,True,(0,255,255))
You may have to pass a dtype argument to genfromtxt if you need to coerce the data to a specific format.
https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html
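For example, since cv2.polylines expects integer point coordinates, you could plausibly read the values as 32-bit integers directly (a sketch, assuming the file contains whole numbers):

import numpy as np
from numpy import genfromtxt

p = genfromtxt('myfile.csv', delimiter=',', dtype=np.int32)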
In case you know it is a fixed number of items in each row:
import csv
with open('myfile.csv') as csvfile:
    rows = csv.reader(csvfile)
    res = list(zip(*rows))
print(res)
I know it's not pretty and there is probably a MUCH BETTER way to do this, but it works. That being said, if someone could show me a better way, it would be much appreciated.
pointlist = []
f = open(args["slots"])
data = f.read().split()
for row in data:
    tmp = []
    col = row.split(";")
    for points in col:
        xy = points.split(",")
        tmp += [[int(pt) for pt in xy]]
    pointlist += [tmp]
slots = np.asarray(pointlist)
You might need to draw each polyline individually (to expand on #Chris's answer):
from numpy import genfromtxt
lines = genfromtxt('myfile.csv', delimiter=',')
for line in lines:
    cv2.polylines(img, line.reshape((-1, 2)), True, (0, 255, 255))
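One caveat: cv2.polylines expects a list of integer point arrays, so the reshaped rows probably need to be converted to int32 and wrapped in a list. A sketch along those lines (img is the image from the question):

import cv2
import numpy as np
from numpy import genfromtxt

lines = genfromtxt('myfile.csv', delimiter=',')
for line in lines:
    pts = line.reshape((-1, 2)).astype(np.int32)    # polylines wants integer coordinates
    cv2.polylines(img, [pts], True, (0, 255, 255))  # and a list of point arrays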

python sparse matrix creation paralellize to speed up

I am creating a sparse matrix file by extracting the features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and each feature's score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; the value to the left of each colon is a feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs vary from one film to another.
Here is how I construct my sparse matrix.
import sys
import os
import os.path
import time
import json
import numpy as np
import tables as tb
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets, and it takes too long to calculate.
How can I improve and accelerate it? Maybe using MapReduce? What is wrong with this function that makes it so slow?
IO + conversions (from str, to str, even twice to str of the same variable, etc.) + splits + explicit loops. By the way, there is the csv Python module which may be used to parse your input file; you can experiment with it (I suppose you use space as the delimiter). Also, I see you convert element[0] to int/str, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (arrays?). Also, you can try to implement it in another style, with map or list comprehensions, but experiments are needed...
The general idea of Python code optimization is to avoid explicit Python byte-code execution and to prefer native/C Python functions (for anything). And of course try to eliminate the many conversions. Also, if the input file is yours, you can format it with fixed-length fields; this lets you avoid splitting/parsing entirely (only string indexing).
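As one illustration of the "parse in bulk, then build the matrix in one call" idea, here is a hedged sketch that collects (row, col, value) triplets and constructs a scipy coo_matrix; the film/feature bookkeeping from the original code is omitted, and the file name and shape are assumptions:

import numpy as np
from scipy.sparse import coo_matrix

rows, cols, vals = [], [], []
with open('input_file.txt') as fh:               # assumed input file name
    for index_film, line in enumerate(fh):
        tokens = line.split()
        # tokens[0] is the film id; the rest are "feature:score" pairs
        for pair in tokens[1:]:
            feature, score = pair.split(':')
            rows.append(index_film)
            cols.append(int(feature))
            vals.append(float(score))

# shape is an assumption: number of films x (max feature id + 1)
matrix = coo_matrix((vals, (rows, cols)),
                    shape=(index_film + 1, max(cols) + 1),
                    dtype=np.float32).tocsr()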

How to pipe binary data into numpy arrays without tmp storage?

There are several similar questions but none of them answers this simple question directly:
How can I capture a command's output and stream that content into numpy arrays without creating a temporary string object to read from?
So, what I would like to do is this:
import subprocess
import numpy
import StringIO

def parse_header(fileobject):
    # this function moves the filepointer and returns a dictionary
    d = do_some_parsing(fileobject)
    return d

sio = StringIO.StringIO(subprocess.check_output(cmd))
d = parse_header(sio)
# now the file pointer is at the start of data, parse_header takes care of that.
# ALL of the data is now available in the next line of sio
dt = numpy.dtype([(key, 'f8') for key in d.keys()])
# I don't know how to make this work:
data = numpy.fromxxxx(sio, dt)
# if I would do this, I create another copy besides the StringIO object, don't I?
# so this works, but isn't this 'bad'?
datastring = sio.read()
data = numpy.fromstring(datastring, dtype=dt)
I tried it with StringIO and cStringIO but both are not accepted by numpy.frombuffer and numpy.fromfile.
Using a StringIO object I first have to read the stream into a string and then use numpy.fromstring, but I would like to avoid creating the intermediate object (several gigabytes).
An alternative for me would be if I can stream sys.stdin into numpy arrays, but that does not work with numpy.fromfile either (seek needs to be implemented).
Are there any work-arounds for this? I can't be the first one trying this (unless this is a PEBKAC case?)
Solution:
This is the current solution; it's a mix of unutbu's instructions on how to use Popen with PIPE and eryksun's hint to use bytearray, so I don't know who to accept!? :S
import subprocess as sp
import numpy as np

proc = sp.Popen(cmd, stdout=sp.PIPE, shell=True)
d = parse_des_header(proc.stdout)
rec_dtype = np.dtype([(key, 'f8') for key in d.keys()])
data = bytearray(proc.stdout.read())
ndata = np.frombuffer(data, dtype=rec_dtype)
I didn't check whether the data really avoids another copy; I don't know how. But what I noticed is that this works much faster than everything I tried before, so many thanks to both answers' authors!
Update 2022:
I just tried the above solution without the bytearray() step and it works just fine. Thanks to Python 3, I guess?
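For reference, a minimal sketch of that variant, reusing cmd and parse_des_header from the solution above; note that np.frombuffer over the raw bytes gives a read-only array, so copy() it if you need to modify the data:

import subprocess as sp
import numpy as np

proc = sp.Popen(cmd, stdout=sp.PIPE, shell=True)
d = parse_des_header(proc.stdout)                            # same header parser as above
rec_dtype = np.dtype([(key, 'f8') for key in d.keys()])
ndata = np.frombuffer(proc.stdout.read(), dtype=rec_dtype)   # read-only view over the bytes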
You can use Popen with stdout=subprocess.PIPE. Read in the header, then load the rest into a bytearray to use with np.frombuffer.
Additional comments based on your edit:
If you're going to call proc.stdout.read(), it's equivalent to using check_output(). Both create a temporary string. If you preallocate data, you could use proc.stdout.readinto(data). Then if the number of bytes read into data is less than len(data), free the excess memory, else extend data by whatever is left to be read.
data = bytearray(2**32) # 4 GiB
n = proc.stdout.readinto(data)
if n < len(data):
    data[n:] = ''
else:
    data += proc.stdout.read()
You could also come at this starting with a pre-allocated ndarray ndata and use buf = np.getbuffer(ndata). Then readinto(buf) as above.
Here's an example to show that the memory is shared between the bytearray and the np.ndarray:
>>> data = bytearray('\x01')
>>> ndata = np.frombuffer(data, np.int8)
>>> ndata
array([1], dtype=int8)
>>> ndata[0] = 2
>>> data
bytearray(b'\x02')
Since your data can easily fit in RAM, I think the easiest way to load the data into a numpy array is to use a ramfs.
On Linux,
sudo mkdir /mnt/ramfs
sudo mount -t ramfs -o size=5G ramfs /mnt/ramfs
sudo chmod 777 /mnt/ramfs
Then, for example, if this is the producer of the binary data:
writer.py:
from __future__ import print_function
import random
import struct

N = random.randrange(100)
print('a b')
for i in range(2*N):
    print(struct.pack('<d', random.random()), end='')
Then you could load it into a numpy array like this:
reader.py:
import subprocess
import numpy

def parse_header(f):
    # this function moves the filepointer and returns a dictionary
    header = f.readline()
    d = dict.fromkeys(header.split())
    return d

filename = '/mnt/ramfs/data.out'
with open(filename, 'w') as f:
    cmd = 'writer.py'
    proc = subprocess.Popen([cmd], stdout=f)
    proc.communicate()

with open(filename, 'r') as f:
    header = parse_header(f)
    dt = numpy.dtype([(key, 'f8') for key in header.keys()])
    data = numpy.fromfile(f, dt)

python program to export numpy/lists in svmlight format

Any way to export a python array into SVM light format?
There is one in scikit-learn:
http://scikit-learn.org/stable/modules/generated/sklearn.datasets.dump_svmlight_file.html
It's basic but it works both for numpy arrays and scipy.sparse matrices.
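A short usage sketch (the feature matrix, labels and output file name here are made-up placeholders):

import numpy as np
from sklearn.datasets import dump_svmlight_file

X = np.array([[0.0, 1.5, 0.0],
              [2.0, 0.0, 3.0]])   # feature matrix (dense or scipy.sparse)
y = np.array([1, -1])             # labels
dump_svmlight_file(X, y, 'data.svmlight', zero_based=False)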
I wrote this totally un-optimized script a while ago, maybe it can help! Data and labels must be in two separate numpy arrays.
def save_svmlight_data(data, labels, data_filename, data_folder=''):
    with open(data_folder + data_filename, 'w') as file:
        for i, x in enumerate(data):
            indexes = x.nonzero()[0]
            values = x[indexes]
            label = '%i' % (labels[i])
            # use a separate loop variable so the outer index i is not shadowed
            pairs = ['%i:%f' % (indexes[j] + 1, values[j]) for j in range(len(indexes))]
            sep_line = [label]
            sep_line.extend(pairs)
            sep_line.append('\n')
            line = ' '.join(sep_line)
            file.write(line)
The svmlight-loader module can load an svmlight file into a numpy array. I don't think anything exists for the other direction, but the module is probably a good starting point for extending its functionality.
