Write a Scalar to CSV (Numpy) - python

I'm generating a number of test files iteratively. The process derives a 0-, 1- or 2-dimensional numpy array and then writes that array to CSV; at least, that's the intent.
Does anyone have a good solution for this?
My code (expectedly) fails when the output is zero-dimensional (scalar):
for key in testfiles:
    tname = key + ".csv"
    np.savetxt(tname, testfiles[key], delimiter=",", newline=';', fmt='%0.15f')

There are a couple of ways to ensure that your input is not a scalar in numpy. For example, you could use np.array:
arr = np.array(testfiles[key], ndmin=1, copy=False)
Another option is np.atleast_1d:
arr = np.atleast_1d(testfiles[key])
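Both options try to avoid copying the data. (In NumPy 2.0 and later, copy=False raises an error when a copy cannot be avoided, which includes plain Python scalars; copy=None gives the older copy-only-if-needed behaviour.) In both cases, pass arr to np.savetxt instead of testfiles[key].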
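A minimal sketch of the loop with np.atleast_1d applied. The helper name dump_csv, the sample inputs, and the in-memory buffers standing in for real files are illustrative assumptions, not part of the original code:

```python
import io
import numpy as np

def dump_csv(value, target):
    """Write a 0-, 1- or 2-D value with np.savetxt (hypothetical helper)."""
    arr = np.atleast_1d(value)  # a scalar becomes shape (1,); 1-D and 2-D pass through
    np.savetxt(target, arr, delimiter=",", newline=";", fmt="%0.15f")

# Illustrative inputs; in-memory buffers stand in for the real .csv files
testfiles = {"scalar": 3.5, "vector": np.array([1.0, 2.0]), "matrix": np.eye(2)}
outputs = {}
for key, value in testfiles.items():
    buf = io.StringIO()
    dump_csv(value, buf)
    outputs[key] = buf.getvalue()
```

The scalar case then produces a single record instead of raising an error.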

Related

Numpy broadcasting - using a variable value

EDIT:
As my question was badly formulated, I decided to rewrite it.
Does numpy provide a way to create an array from a function, without using Python's standard list comprehension?
With list comprehension I could have:
array = np.array([f(i) for i in range(100)])
with f a given function.
But if the constructed array is really big, using Python's list would be slow and would eat a lot of memory.
If no such way exists, I suppose I could first create an array of the desired size
array = np.arange(100)
And then map a function over it.
array = f(array)
According to results from another post, it seems that it would be a reasonable solution.
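A short sketch of the options discussed in this question, assuming f can be written with array operations (the f below is a made-up stand-in). np.fromiter consumes a generator without building an intermediate list, while calling f on np.arange avoids the Python-level loop entirely:

```python
import numpy as np

def f(x):
    # Hypothetical function; works on scalars and on whole arrays alike
    return x * x + 1

n = 100
from_list = np.array([f(i) for i in range(n)])   # builds a temporary list first
from_iter = np.fromiter((f(i) for i in range(n)), dtype=np.int64, count=n)
vectorized = f(np.arange(n))                     # a single array expression
```

All three hold the same values; only the last one avoids calling f once per element.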
Let's say I want to use the add function with a simple int value, it will be as follows:
array = np.array([i for i in range(5)])
array + 5
But what if I want the value (here 5) to vary with the index of the array element? For example, the operation:
array + [i for i in range(5)]
What object can I use to define special rules for a variable value within a vectorized operation?
You can add two arrays together like this:
Simple adding two arrays using numpy in python?
This assumes your "variable by index" is just another array.
For your specific example, a jury-rigged solution would be to use numpy.arange() as in:
In [4]: array + np.arange(5)
Out[4]: array([0, 2, 4, 6, 8])
In general, you can either find a numpy ufunc that does the job of your custom function, or compose several of them in a Python function that returns an ndarray, something like:
def custom_func():
    # code for your tasks
    return arr
You can then simply add the returned result to your already defined array:
array + custom_func()
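As a concrete sketch, assuming the per-index value can itself be computed as an array (index_offsets is a hypothetical stand-in for the custom function):

```python
import numpy as np

def index_offsets(n):
    # Hypothetical rule: the value added at position i is 10 * i
    return 10 * np.arange(n)

array = np.arange(5)
result = array + index_offsets(len(array))  # element-wise sum of two length-5 arrays
# result is array([ 0, 11, 22, 33, 44])
```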

Chain datasets from multiple HDF5 files/datasets

The benefits and simple mapping that h5py provides (through HDF5) for persisting datasets on disk are exceptional. I run some analysis on a set of files and store the result into a dataset, one for each file. At the end of this step, I have a set of h5py.Dataset objects which contain 2D arrays. The arrays all have the same number of columns, but different numbers of rows, i.e., (A,N), (B,N), (C,N), etc.
I would now like to access these multiple 2D arrays as a single 2D array. That is, I would like to read them on-demand as an array of shape (A+B+C, N).
For this purpose, h5py.Link classes do not help, as they work at the level of HDF5 nodes.
Here is some pseudocode:
import numpy as np
import h5py
a = h5py.Dataset('a',data=np.random.random((100, 50)))
b = h5py.Dataset('b',data=np.random.random((300, 50)))
c = h5py.Dataset('c',data=np.random.random((253, 50)))
# I want to view these arrays as a single array
combined = magic_array_linker([a,b,c], axis=0)
assert combined.shape == (100+300+253, 50)
For my purposes, suggestions of copying the arrays into a new file do not work. I'm also open to solving this on the numpy level, but I don't find any suitable options with numpy.view or numpy.concatenate that would work without copying out the data.
Does anybody know of a way to view multiple arrays as a stacked set of arrays, without copying and from h5py.Dataset?
First up, I don't think there is a way to do this without copying the data in order to return a single array. As far as I can tell, it's not possible to concatenate numpy views into one array - unless, of course, you create your own wrapper.
Here I demonstrate a proof of concept using Object/Region references. The basic premise is that we make a new dataset in the file which is an array of references to the constituent subarrays. By storing references like this, the subarrays can change size dynamically and indexing the wrapper will always index the correct subarrays.
As this is just a proof of concept, I haven't implemented proper slicing, just very simple indexing. There's also no attempt at error checking - this will almost definitely break in production.
class MagicArray(object):
    """Magically index an array of references
    """
    def __init__(self, file, references, axis=0):
        self.file = file
        self.references = references
        self.axis = axis
    def __getitem__(self, items):
        # We need to modify the indices, so make sure items is a list
        items = list(items)
        for item in items:
            if hasattr(item, 'start'):
                # items is a slice object
                raise ValueError('Slices not implemented')
        for ref in self.references:
            size = self.file[ref].shape[self.axis]
            # Check if the requested index is in this subarray
            # If not, subtract the subarray size and move on
            if items[self.axis] < size:
                item_ref = ref
                break
            else:
                items[self.axis] = items[self.axis] - size
        return self.file[item_ref][tuple(items)]
Here's how you use it:
with h5py.File("/tmp/so_hdf5/test.h5", 'w') as f:
    a = f.create_dataset('a',data=np.random.random((100, 50)))
    b = f.create_dataset('b',data=np.random.random((300, 50)))
    c = f.create_dataset('c',data=np.random.random((253, 50)))
    ref_dtype = h5py.special_dtype(ref=h5py.Reference)
    ref_dataset = f.create_dataset("refs", (3,), dtype=ref_dtype)
    for i, key in enumerate([a, b, c]):
        ref_dataset[i] = key.ref
with h5py.File("/tmp/so_hdf5/test.h5", 'r') as f:
    foo = MagicArray(f, f['refs'], axis=0)
    print(foo[104, 4])
    print(f['b'][4,4])
This should be fairly trivial to extend to fancier indexing (i.e. being able to handle slices), but I can't see how to do so without copying data.
You might be able to subclass from numpy.ndarray and get all the usual methods as well.
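The index translation can also be done with a precomputed table of row offsets instead of a loop. The following is a sketch only: plain NumPy arrays stand in for the datasets (h5py.Dataset objects offer the same shape attribute and integer indexing), and the class name ChainedRows is made up for illustration:

```python
import numpy as np

class ChainedRows:
    """Read-only view over several 2-D blocks stacked along axis 0 (sketch)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        # ends[k] is the global row index one past the last row of block k
        self.ends = np.cumsum([b.shape[0] for b in self.blocks])
        self.shape = (int(self.ends[-1]), self.blocks[0].shape[1])

    def __getitem__(self, index):
        row, col = index  # only (int, int) indexing is handled here
        if not 0 <= row < self.shape[0]:
            raise IndexError("row index out of range")
        # First block whose end lies strictly beyond the requested row
        k = int(np.searchsorted(self.ends, row, side="right"))
        start = 0 if k == 0 else int(self.ends[k - 1])
        return self.blocks[k][row - start, col]

a = np.random.random((100, 50))
b = np.random.random((300, 50))
c = np.random.random((253, 50))
combined = ChainedRows([a, b, c])
```

Here combined[104, 4] reads b[4, 4], and no block is copied until an element is actually requested.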

How do I fill two (or more) numpy arrays from a single iterable of tuples?

The actual problem I have is that I want to store a long sorted list of (float, str) tuples in RAM. A plain list doesn't fit in my 4 GB of RAM, so I thought I could use two numpy.ndarrays.
The source of the data is an iterable of 2-tuples. numpy has a fromiter function, but how can I use it? The number of items in the iterable is unknown. I can't consume it to a list first due to memory limitations. I thought of itertools.tee, but it seems to add a lot of memory overhead here.
What I guess I could do is consume the iterator in chunks and add those to the arrays. Then my question is, how to do that efficiently? Should I maybe make 2 2D arrays and add rows to them? (Then later I'd need to convert them to 1D).
Or maybe there's a better approach? Everything I really need is to search through an array of strings by the value of the corresponding number in logarithmic time (that's why I want to sort by the value of float) and to keep it as compact as possible.
P.S. The iterable is not sorted.
Perhaps build a single, structured array using np.fromiter:
import numpy as np
def gendata():
    # You, of course, have a different gendata...
    for i in range(N):
        yield (np.random.random(), str(i))
N = 100
arr = np.fromiter(gendata(), dtype='<f8,|S20')
Sorting it by the first column, using the second for tie-breakers will take O(N log N) time:
arr.sort(order=['f0','f1'])
Finding the row by the value in the first column can be done with searchsorted in O(log N) time:
# Some pseudo-random value in arr['f0']
val = arr['f0'][10]
print(arr[10])
# (0.049875262239617246, '46')
idx = arr['f0'].searchsorted(val)
print(arr[idx])
# (0.049875262239617246, '46')
You've asked many important questions in the comments; let me attempt to answer them here:
The basic dtypes are explained in the numpybook. There may be one or
two extra dtypes (like float16) which have been added since that
book was written, but the basics are all explained there.
Perhaps a more thorough discussion is in the online documentation, which is a good supplement to the examples you mentioned here.
Dtypes can be used to define structured arrays with column names, or
with default column names. 'f0', 'f1', etc. are default column
names. Since I defined the dtype as '<f8,|S20' I failed to provide
column names, so NumPy named the first column 'f0', and the second
'f1'. If we had used
dtype=[('fval','<f8'), ('text','|S20')]
then the structured array arr would have column names 'fval' and
'text'.
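For illustration, a small sketch of the named variant. The field names fval and text are the ones suggested above; the sample rows are invented:

```python
import numpy as np

named = np.dtype([('fval', '<f8'), ('text', 'S20')])
rows = [(0.5, b'b'), (0.25, b'a'), (0.75, b'c')]
arr = np.array(rows, dtype=named)

arr.sort(order=['fval', 'text'])     # sort by the float, ties broken by the string
idx = arr['fval'].searchsorted(0.5)  # binary search on the sorted float column
# arr[idx]['text'] is b'b'
```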
Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You
could conceivably iterate through gendata once to discover the
maximum length of the strings, build your dtype and then call
np.fromiter (and iterate through gendata a second time), but
that's rather burdensome. It is of course better if you know in
advance the maximum size of the strings. (|S20 defines the string
field as having a fixed length of 20 bytes.)
NumPy arrays place data of a pre-defined size in arrays of a fixed size. Think of the array (even multidimensional ones) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but will help your imagination for the following.) NumPy derives much of its speed by taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, then it would be hard for NumPy to find the right offsets. By hard, I mean NumPy would need an index or somehow be redesigned. NumPy is simply not built this way.
NumPy does have an object dtype which allows you to place a pointer
(4 bytes on a 32-bit build, 8 bytes on a 64-bit build) to any Python
object you desire. This way, you can have NumPy arrays with arbitrary
Python data. Unfortunately, at the time of writing, the np.fromiter
function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction. (NumPy 1.23 and later accept dtype=object in np.fromiter.)
Note that np.fromiter has better performance when the count is
specified. By knowing the count (the number of rows) and the
dtype (and thus the size of each row) NumPy can pre-allocate
exactly enough memory for the resultant array. If you do not specify
the count, then NumPy will make a guess for the initial size of the
array, and if too small, it will try to resize the array. If the
original block of memory can be extended you are in luck. But if
NumPy has to allocate an entirely new hunk of memory then all the old
data will have to be copied to the new location, which will slow down
the performance significantly.
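A minimal sketch of passing count, assuming the number of items is known in advance (the generator of squares is only a stand-in for real data):

```python
import numpy as np

n = 1000
squares = (i * i for i in range(n))
# count tells NumPy the final length up front, so the buffer is allocated once
arr = np.fromiter(squares, dtype=np.int64, count=n)
```

When the length is unknown, as in the question, count has to be left out and the resizing cost described above applies.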
Here is a way to build N separate arrays out of a generator of N-tuples:
import numpy as np
import itertools as IT
def gendata():
    # You, of course, have a different gendata...
    N = 100
    for i in range(N):
        yield (np.random.random(), str(i))
def fromiter(iterable, dtype, chunksize=7):
    # Read the first chunk and split it into one array per field
    chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
    result = [chunk[name].copy() for name in chunk.dtype.names]
    size = len(chunk)
    while True:
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        N = len(chunk)
        if N == 0:
            break
        newsize = size + N
        for arr, name in zip(result, chunk.dtype.names):
            col = chunk[name]
            # Grow each column array in place, then append the new chunk
            arr.resize(newsize, refcheck=0)
            arr[size:] = col
        size = newsize
    return result
x, y = fromiter(gendata(), '<f8,|S20')
order = np.argsort(x)
x = x[order]
y = y[order]
# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# (0.049875262239617246, '46')
idx = x.searchsorted(val)
print(x[idx], y[idx])
# (0.049875262239617246, '46')
The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resultant arrays as necessary.
I used a small default chunksize since I was testing this code on small data. You, of course, will want to either change the default chunksize or pass a chunksize parameter with a larger value.

numpy.savetxt Problems with 1D array writing

I'm trying to use numpy's savetxt function to generate a bunch of files as inputs for another piece of software.
I'm trying to write an array of the form:
a=np.array([1,2,3,4,...])
a.shape=>(n,)
to a text file with the formatting
1,2,3,4,...
when I enter the command
np.savetxt('test.csv',a,fmt='%d',delimiter=',')
I get a file that looks like:
1
2
3
4
...
savetxt works as I would expect for a 2D array, but I can't get all of the values for a 1D array onto a single line
Any suggestions?
Thanks
EDIT:
I solved the problem. Using np.atleast_2d(a) as the input to savetxt forces savetxt to write the array as a row, not a column
There are different ways to fix this. The one closest to your current approach is:
np.savetxt('test.csv', a[None], fmt='%d', delimiter=',')
i.e. index your array with [None] to make it two-dimensional with a single row.
If you only want to save a 1D array, it can be faster to build the line yourself:
>>> x = numpy.array([0,1,2,3,4,5])
>>> ','.join(map(str, x.tolist()))
'0,1,2,3,4,5'
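The difference between the two shapes can be checked without touching the disk. This is a small sketch that uses in-memory buffers in place of test.csv and a four-element array in place of the real data:

```python
import io
import numpy as np

a = np.array([1, 2, 3, 4])

as_column = io.StringIO()
np.savetxt(as_column, a, fmt='%d', delimiter=',')     # shape (4,): one value per line

as_row = io.StringIO()
np.savetxt(as_row, a[None], fmt='%d', delimiter=',')  # shape (1, 4): a single line
```

np.atleast_2d(a) gives the same (1, 4) shape as a[None] for a 1D input.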

numpy: efficient execution of a complex reshape of an array

I am reading a vendor-provided large binary array into a 2D numpy array tempfid(M, N)
# load data
data=numpy.fromfile(file=dirname+'/fid', dtype=numpy.dtype('i4'))
# convert to complex data
fid=data[::2]+1j*data[1::2]
tempfid=fid.reshape(I*J*K, N)
and then I need to reshape it into a 4D array useful4d(N,I,J,K) using non-trivial mappings for the indices. I do this with a for loop along the following lines:
for idx in range(M):
    i=f1(idx) # f1, f2, and f3 are functions involving / and % as well as some lookups
    j=f2(idx)
    k=f3(idx)
    newfid[:,i,j,k] = tempfid[idx,:] #SLOW! CAN WE IMPROVE THIS?
Converting to complex takes 33% of the time, while copying these M slices takes the remaining 66%. Calculating the indices is fast, irrespective of whether I do it one by one in a loop as shown or by applying numpy.vectorize to the functions and calling them on an arange(M).
Is there a way to speed this up? Any help on more efficient slicing, copying (or not) etc appreciated.
EDIT:
As learned in the answer to question "What's the fastest way to convert an interleaved NumPy integer array to complex64?" the conversion to complex can be sped up by a factor of 6 if a view is used instead:
fid = data.astype(numpy.float32).view(numpy.complex64)
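A tiny sketch of what the view does, using a made-up four-element input in place of the vendor file. Note that float32 represents integers exactly only up to 2**24, so larger int32 samples would be rounded:

```python
import numpy as np

data = np.array([1, 2, 3, 4], dtype='i4')          # interleaved re, im, re, im
slow = data[::2] + 1j * data[1::2]                 # complex128, built from temporaries
fast = data.astype(np.float32).view(np.complex64)  # reinterpret each pair as one complex64
```

Both give the values 1+2j and 3+4j; the view version only makes the single float32 conversion pass.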
idx = numpy.arange(M)
i = numpy.vectorize(f1)(idx)
j = numpy.vectorize(f2)(idx)
k = numpy.vectorize(f3)(idx)
# you can index arrays with other arrays
# that lets you specify this operation in one line.
newfid[:, i,j,k] = tempfid.T
I've never used numpy's vectorize. Vectorize just means that numpy will call your Python function once per element, so it gives no speedup by itself. In order to get speed, you need to use array operations like the one I showed here, and like the one you used to get complex numbers.
EDIT
The problem is that the dimension of size 128 was first in newfid, but last in tempfid. This is easily fixed by using .T, which takes the transpose.
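A small self-contained check that the one-line assignment matches the loop from the question. The sizes and the index mapping are invented for the test: a fixed permutation stands in for f1, f2 and f3:

```python
import numpy as np

N, I, J, K = 6, 2, 3, 4
M = I * J * K
tempfid = np.arange(M * N).reshape(M, N)

# Hypothetical index mapping: a fixed permutation of the flat (i, j, k) positions
perm = np.random.default_rng(0).permutation(M)
i, j, k = np.unravel_index(perm, (I, J, K))

# Loop version from the question
looped = np.empty((N, I, J, K), dtype=tempfid.dtype)
for idx in range(M):
    looped[:, i[idx], j[idx], k[idx]] = tempfid[idx, :]

# Single fancy-indexed assignment: the indexed target has shape (N, M)
newfid = np.empty((N, I, J, K), dtype=tempfid.dtype)
newfid[:, i, j, k] = tempfid.T
```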
How about this: set up your indices using vectorized versions of f1, f2, f3 (not necessarily using np.vectorize, but perhaps just writing a function that takes an array and returns an array), then use np.ix_:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.ix_.html
to get the index arrays. Then reshape tempfid to the same shape as newfid and then use the results of np.ix_ to set the values. For example:
tempfid = np.arange(10)
i = f1(idx) # i = [4,3,2,1,0]
j = f2(idx) # j = [1,0]
ii = np.ix_(i,j)
newfid = tempfid.reshape((5,2))[ii]
This maps the elements of tempfid onto a new shape with a different ordering.
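The same example in runnable form, with the index lists written out literally (they are the example values from the comments above, not the real f1 and f2):

```python
import numpy as np

tempfid = np.arange(10)
i = [4, 3, 2, 1, 0]   # new row order
j = [1, 0]            # new column order
ii = np.ix_(i, j)     # open mesh: index arrays of shapes (5, 1) and (1, 2)
newfid = tempfid.reshape((5, 2))[ii]
# newfid[0] is array([9, 8]) and newfid[4] is array([1, 0])
```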
