Column-stacking arbitrary amounts of data in Python

I have a list of "spectral" objects that have "ydata" attributes and I need to column stack all the ydata together.
I can iterate through all the objects but I have to somehow create an array of the same length so that I can stack.
Here is a barebones version of what I have:
import numpy as np
class Spectrum(object):
    def __init__(self, ydata):
        self.ydata = ydata

spec = {}
spec[1] = Spectrum([1, 2, 3])
spec[2] = Spectrum([4, 5, 6])

array = np.empty(len(spec[1].ydata))
for i in range(1, len(spec) + 1):
    array = np.column_stack((array, spec[i].ydata))
print(array)
So the above works, but the first column of array always contains the uninitialized (random) values left over from np.empty.
I know there has to be an easy way to do this but I am just missing it.
One option that I thought of is to start with:
array = spec[1].ydata
then move into the for-loop but that doesn't seem right since that assumes there is a spec[1].
The desired output would be:
>>> array
[[1 4]
 [2 5]
 [3 6]]

Assuming that all your instances of Spectrum have the same ydata-length, I would go with a simple list comprehension:
array = np.array([spec[i+1].ydata for i in range(len(spec))])
print(array)
output:
[[1 2 3]
 [4 5 6]]
EDIT:
I took another look at the desired output; in that case it would be
array = np.array([spec[i+1].ydata for i in range(len(spec))]).T
and
[[1 4]
 [2 5]
 [3 6]]
EDIT:
I wrote a small test program to compare the performance of np.array().T and np.column_stack() against each other:
import numpy as np
from timeit import Timer
class Spectrum(object):
    def __init__(self, ydata):
        self.ydata = ydata

def create_by_array():
    return np.array([spec[i+1].ydata for i in range(len(spec))]).T

def create_by_column_stack():
    return np.column_stack([spec[i+1].ydata for i in range(len(spec))])

I = 1000
spec = {i: Spectrum([j for j in range(3*i, 3*(i+1))]) for i in range(1, I+1)}

t1 = Timer(
    """create_by_array()""",
    setup="""from __main__ import create_by_array"""
)
res1 = t1.repeat(10, 1000)

t2 = Timer(
    """create_by_column_stack()""",
    setup="""from __main__ import create_by_column_stack"""
)
res2 = t2.repeat(10, 1000)

print(
    'Results of the two tests: ',
    '{:5}, {:5}, {:5}'.format('min', 'mean', 'max')
)
print(
    'With np.array and transpose:',
    '{:5.3}, {:5.3}, {:5.3}'.format(np.min(res1), np.mean(res1), np.max(res1))
)
print(
    'With np.column_stack(): ',
    '{:5.4}, {:5.4}, {:5.4}'.format(np.min(res2), np.mean(res2), np.max(res2))
)
The program first produces a dict of 1000 Spectrum instances and then times the two methods 10 x 1000 times. Here are the results:
('Results of the two tests: ', 'min , mean , max ')
('With np.array and transpose:', '0.687, 0.709, 0.742')
('With np.column_stack(): ', '3.982, 4.367, 5.263')
As you can see, the np.array().T method is about five times faster than np.column_stack(). I'm not entirely sure why that is, but according to the numpy column_stack documentation page,
Take a sequence of 1-D arrays and stack them as columns to make a
single 2-D array. 2-D arrays are stacked as-is, just like with hstack.
1-D arrays are turned into 2-D columns first.
This sounds a lot like every individual sub-list is first turned into an ndarray, while np.array() only creates the final array. The transposing of the matrix is very fast, as it does not do any re-arranging in memory. See for instance here. I hope this clears it up.
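As a quick illustration of that last point (this is my own check, not part of the original answer): .T returns a view that shares the same memory buffer and only swaps the strides, so no data is moved.

import numpy as np

a = np.arange(6).reshape(2, 3)
b = a.T  # the transpose is a view, not a copy

print(np.shares_memory(a, b))   # True: both arrays use the same buffer
print(a.strides, b.strides)     # the strides are simply swapped, no data was moved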

Related

How can I optimize this element by element vector addition and multiplication

I am trying to optimize the inner loop of my Python code. I've been reading about map and reduce, but I struggle to apply these concepts to the following code, since it also contains a multiplication. My data structure looks like this:
f.m: [NDArray[float]]
f.l: [NDArray[float]]
f.h: [NDArray[float]]
I have several of these in a list and I would like to calculate the sum for each array element (i.e., m, l, h) in the list. Right now, I use a loop to iterate through the list of arrays. This scenario could be handled with map etc. However, each array also carries a sign (+1 vs -1). Is there a way to optimize this while keeping the sign separate?
f1 = type('test', (object, ), {})()
f2 = type('test', (object, ), {})()
f1.n = "f1"
f1.m = f1.l = [1, 2]
f2.n = "f2"
f2.m = f2.l = [2, 4]
flux_list = [f1, f2]
dirs = {"f1": -1, "f2": 1}
new = [0, 0]
i = 0 # set in outer loop
for f in flux_list:
    direction = dirs[f.n]
    new[0] += f.m[i] * direction
    new[1] += f.l[i] * direction
print(new)
One way to avoid the above loop is to create the data structure outside of the object instances, and then use ndarray.view() to create individual views on the data.
Unlike np.r_, a view does not create a new copy of the data, so anything we do to the data is visible in all other views (similar to shared memory).
I.e., we can create a view where f1.m and f2.m appear to be adjacent to each other, and we can use np.sum on the view rather than a loop over each f1 to fn. While this involves a small overhead in creating the data structure, all of this can be done before we start iterating over the data.
Last but not least, the multiplication step can be moved outside the loop by a suitable grouping of f.m (i.e., collect all f.m with a positive sign into one group, and do the same for those with a negative sign); a sketch of that grouping follows the example below.
The code below illustrates the principle:
import numpy as np
# create the data structure first
rand = np.random.RandomState(42)
A = rand.randint(10,size=25).reshape(5,5)
row = np.arange(5)
# create some empty object instances
f1 = type('test', (object, ), {})()
f2 = type('test', (object, ), {})()
# Assign views into A to each object
f1.m = A.view()[row[:, np.newaxis], 0]
f1.l = A.view()[row[:, np.newaxis], 3]
f2.m = A.view()[row[:, np.newaxis], 1]
f2.l = A.view()[row[:, np.newaxis], 2]
# Create view in such a way that we can directly sum the data
# in this case we want to add f1.m and f2.m
col = np.array([0,1])
C=A.view()[row[:, np.newaxis], col]
# do the sum over the view which will then also be
# reflected in A
np.sum(C,axis=1)
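To illustrate the last point about moving the multiplication out of the loop (a sketch of my own, not part of the original answer; which columns carry which sign is just an assumption for the example): collect the positive-sign columns and the negative-sign columns, sum each group, and subtract once at the end.

import numpy as np

rand = np.random.RandomState(42)
A = rand.randint(10, size=25).reshape(5, 5)

# Assume columns 1 and 2 carry a positive sign and columns 0 and 3 a negative one.
pos = A[:, 1:3]      # contiguous columns: basic slicing, so this is a view
neg = A[:, [0, 3]]   # non-contiguous columns: fancy indexing, this makes a copy

# One vectorized expression replaces the per-object multiplication by +/-1.
new = pos.sum(axis=1) - neg.sum(axis=1)
print(new)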

Nesting loops to arbitrary depth by passing method objects

I am trying to scan over iterable properties of n objects. I am looking for a pythonic way to perform functions in nested loops of arbitrary depth by passing functions to method calls of the loop one level up. I haven't been able to get more than the innermost loop to run when the depth is 3. Here is non-working Python pseudocode where I am querying a different value at each point in the loops. The other difficulty is that I am trying to capture the output and pass it to the next outer loop:
class Parent(object):
    def __init__(self):
        self.iterable = [None] * 2
        self.result = self.iterable[:]
    def loop(self, query_func):
        def innerloop():
            for i, x in enumerate(self.iterable):
                self.result[i] = query_func(x)
            return self.result[:]
        return innerloop

class ChildA(Parent):
    def __init___(self, A, object_to_queryA):
        self.iterableA = [valueA for valueA in range(A)]
        self.resultA = self.iterableA[:]
        self.object_to_query = object_to_queryA
    def query_valueA(self, x):
        return self.object_to_query.some_query_function(x)

class ChildB(Parent):
    def __init___(self, B, object_to_queryB):
        self.iterableB = [valueB for valueB in range(B)]
        self.resultB = self.iterableB[:]
        self.object_to_query = object_to_queryB
    def query_valueB(self, x):
        return self.object_to_query.some_other_query_function(x)

class ChildC(Parent):
    def __init___(self, C, object_to_queryC):
        self.iterableC = [valueC for valueC in range(C)]
        self.resultC = self.iterableC[:]
        self.object_to_query = object_to_queryC
    def query_valueC(self, x):
        return self.object_to_query.yet_another_query_function(x)
I want to be able to call these loops as follows:
import numpy
query_objA, query_objB, query_objC = (SomeObjA(), SomeObjB(), SomeObjC())
A, B, C = (len(query_objA.data), len(query_objB.data), len(query_objC.data))
instA = ChildA(A, query_objA)
instB = ChildB(B, query_objB)
instC = ChildC(C, query_objC)
my_scanning_func = ChildA.loop(ChildB.loop(ChildC.loop))
my_queries = numpy.array(my_scanning_func()).reshape(A,B,C)
# Equally valid call example below:
my_scanning_func2 = ChildB.loop(ChildC.loop(ChildA.loop))
my_queries2 = numpy.array(my_scanning_func2()).reshape(B,C,A)
The ultimate functionality I'm looking for would be similar to the code below, but for arbitrary depth and order:
for i, x in enumerate(query_objA.data):
    response[i] = instA.some_query_function(x)
    for j, y in enumerate(query_objB.data):
        response[i][j] = instB.some_other_query_function(y)
        for k, z in enumerate(query_objC.data):
            response[i][j][k] = instC.yet_another_query_function(z)
Bonus points if this can be done via an inherited recursive function, rather than defining separate looping methods for each child, as I tried to do above. Last Note: I am trying to write Python 2.7 compatible code. Thanks in advance!
After much discussion with the OP I have a better idea of how you could generalize the construction of these arrays. First, it seems that your objects would be designed to either iterate over predefined states or query the present state (possibly with only one of these being valid), so the interface for an object would be abstracted to something like this:
class Apparatus_interface:
    def __init__(self, *needed_stuff):
        # I have no idea how you are actually interacting with the device
        self._device = SET_UP_OBJECT(needed_stuff)
        # when iterating over this object we need to know how many states there are
        # so we can predefine the shape (dimensions) of our arrays
        self.num_of_states = 5
        # it would make sense for each object to define
        # the type of value that .query() returns (following spec of numpy's dtype)
        self.query_type = [('f1', float), ('f2', float)]

    def __iter__(self):
        """iterates over the physical positions/states of the apparatus
        the state of the device is only active in between iterations
        * calling list(device) doesn't give you any useful information, just a lot of mechanical work
        """
        for position in range(self.num_of_states):
            # ^ not sure what this should be either, you will have a better idea
            self._device.move_to(position)  # represents a physical change in the device
            yield position  # should it generate different information?

    def query(self):
        return self._device.query()
With this interface you would generate your array by iterating (in a nested loop) over a number of devices, and at each combination of states between them you query the state of another device (and record that value into an array).
Normally you'd be able to use itertools.product to generate the combinations of states of the devices; however, due to optimizations, itertools.product would run the iteration code that affects the physical device before it is used in iteration, so you will need an implementation that does not apply this kind of optimization:
# values is a list that contains the current elements generated
# the loop "for values[depth] in iterables[depth]" basically sets the depth-th element to each value in that level of iterable
def _product(iterables, depth, values):
    if len(iterables) - depth == 1:
        for values[depth] in iterables[depth]:
            yield tuple(values)
    else:
        for values[depth] in iterables[depth]:
            # yield from _product(iterables, depth+1, values)
            for tup in _product(iterables, depth+1, values):
                yield tup

def product(*iterables):
    """
    version of itertools.product to activate side-effects of iteration
    only works with iterables, not iterators.
    """
    values = [None] * len(iterables)
    return _product(iterables, 0, values)
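As a quick sanity check (my addition, not from the original answer), this lazy product visits the same combinations as itertools.product, only without consuming the iterables ahead of time; it assumes the product function defined above is in scope.

import itertools

print(list(product(range(2), "ab")))
print(list(itertools.product(range(2), "ab")))
# both print [(0, 'a'), (0, 'b'), (1, 'a'), (1, 'b')]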
Now for actually generating the array. First, a process that iterates through the product of all states and makes a query at each one; note that the states variable is unused, as I'm going to assume the placement in the numpy array will be determined by the order in which the states get iterated, not by the values produced:
def traverse_states(variable_devices, queried_device):
    """queries a device at every combination of variable devices' states"""
    for states in product(*variable_devices):
        yield queried_device.query()
Then the function to put the array together is quite straightforward:
def array_from_apparatus(variable_devices, queried_object, dtype=None):
    # the # of states in each device <==> # of elements in each dimension
    arr_shape = [device.num_of_states for device in variable_devices]
    iterator = traverse_states(variable_devices, queried_object)
    if dtype is None:
        dtype = queried_object.query_type
    array = numpy.fromiter(iterator, dtype=dtype)
    array.shape = arr_shape  # this will fail if .num_of_states doesn't match the actual number of iterations
    return array
I'm not sure how I could make a decent test of this but I believe it would work or at least be close.
I'm not sure if this answers your question, but I think it is at least relevant. If you want to generate a numpy array such that array[tup] = func(tup), where tup is a tuple of integer indices, you can use itertools.product in combination with numpy.fromiter like this:
import itertools
#from itertools import imap as map #for python 2
import numpy
def array_from_func(dimensions, func, dtype=float):
    ranges = (range(i) for i in dimensions)   # ranges of indices for all dimensions
    all_indices = itertools.product(*ranges)  # will iterate over all locations regardless of # of dimensions
    value_gen = map(func, all_indices)        # produces each value for each location
    array = numpy.fromiter(value_gen, dtype=dtype)
    array.shape = dimensions  # modify the shape in place, .reshape would work but makes a copy
    return array
This is useful to me to see how indices relate to the actual array output; here are three demos to show basic functionality (the second one I figured out recently).
>>> from operator import itemgetter
>>> array_from_func((2,3,4), itemgetter(1), int)  # second index
array([[[0, 0, 0, 0],
        [1, 1, 1, 1],
        [2, 2, 2, 2]],

       [[0, 0, 0, 0],
        [1, 1, 1, 1],
        [2, 2, 2, 2]]])
>>> def str_join(it):
...     return ",".join(map(str, it))
#the '<U5' in next line specifies strings of length 5, this only works when the string will actually be length 5
#changing to '<U%d'%len(str_join(dims)) would be more generalized but harder to understand
>>> print(array_from_func((3,2,7), str_join, '<U5'))
[[['0,0,0' '0,0,1' '0,0,2' '0,0,3' '0,0,4' '0,0,5' '0,0,6']
['0,1,0' '0,1,1' '0,1,2' '0,1,3' '0,1,4' '0,1,5' '0,1,6']]
[['1,0,0' '1,0,1' '1,0,2' '1,0,3' '1,0,4' '1,0,5' '1,0,6']
['1,1,0' '1,1,1' '1,1,2' '1,1,3' '1,1,4' '1,1,5' '1,1,6']]
[['2,0,0' '2,0,1' '2,0,2' '2,0,3' '2,0,4' '2,0,5' '2,0,6']
['2,1,0' '2,1,1' '2,1,2' '2,1,3' '2,1,4' '2,1,5' '2,1,6']]]
>>> array_from_func((3,4), sum)  # the sum of the indices, not as useful but another good demo
array([[ 0.,  1.,  2.,  3.],
       [ 1.,  2.,  3.,  4.],
       [ 2.,  3.,  4.,  5.]])
I think this is along the lines of what you are trying to accomplish but I'm not quite sure... please give me feedback if I can be more specific about what you need.

Rounding a list of values to the nearest value from another list in python

Suppose I have the following two arrays:
>>> a = np.random.normal(size=(5,))
>>> a
array([ 1.42185826, 1.85726088, -0.18968258, 0.55150255, -1.04356681])
>>> b = np.random.normal(size=(10,10))
>>> b
array([[ 0.64207828, -1.08930317, 0.22795289, 0.13990505, -0.9936441 ,
1.07150754, 0.1701072 , 0.83970818, -0.63938211, -0.76914925],
[ 0.07776129, -0.37606964, -0.54082077, 0.33910246, 0.79950839,
0.33353221, 0.00967273, 0.62224009, -0.2007335 , -0.3458876 ],
[ 2.08751603, -0.52128218, 1.54390634, 0.96715102, 0.799938 ,
0.03702108, 0.36095493, -0.13004965, -1.12163463, 0.32031951],
[-2.34856521, 0.11583369, -0.0056261 , 0.80155082, 0.33421475,
-1.23644508, -1.49667424, -1.01799365, -0.58232326, 0.404464 ],
[-0.6289335 , 0.63654201, -1.28064055, -1.01977467, 0.86871352,
0.84909353, 0.33036771, 0.2604609 , -0.21102014, 0.78748329],
[ 1.44763687, 0.84205291, 0.76841512, 1.05214051, 2.11847126,
-0.7389102 , 0.74964783, -1.78074088, -0.57582084, -0.67956203],
[-1.00599479, -0.93125754, 1.43709533, 1.39308038, 1.62793589,
-0.2744919 , -0.52720952, -0.40644809, 0.14809867, -1.49267633],
[-1.8240385 , -0.5416585 , 1.10750423, 0.56598464, 0.73927224,
-0.54362927, 0.84243497, -0.56753587, 0.70591902, -0.26271302],
[-1.19179547, -1.38993415, -1.99469983, -1.09749452, 1.28697997,
-0.74650318, 1.76384156, 0.33938808, 0.61647274, -0.42166111],
[-0.14147554, -0.96192206, 0.14434349, 1.28437894, -0.38865447,
-1.42540195, 0.93105528, 0.28993325, -1.16119916, -0.58244758]])
I have to find a way to round all values from b to the nearest value found in a.
Does anyone know of a good way to do this with python? I am at a total loss myself.
Here is something you can try
import numpy as np

def rounder(values):
    def f(x):
        idx = np.argmin(np.abs(values - x))
        return values[idx]
    return np.frompyfunc(f, 1, 1)

a = np.random.normal(size=(5,))
b = np.random.normal(size=(10,10))
rounded = rounder(a)(b)
print(rounded)
The rounder function takes the values we want to round to. It creates a function that takes a scalar and returns the closest element from the values array. We then turn this function into a broadcastable one using numpy.frompyfunc. This way you are not limited to using it on 2-D arrays; numpy does the broadcasting for you without any explicit loops.
If you sort a, you can use bisect to find the index in array a where each element from the sub-arrays of array b would land:
import numpy as np
from bisect import bisect
a = np.random.normal(size=(5,))
b = np.random.normal(size=(10, 10))
a.sort()
size = a.size
for sub in b:
    for ind2, ele in enumerate(sub):
        i = bisect(a, ele, hi=size-1)
        i1, i2 = a[i], a[i-1]
        sub[ind2] = i1 if abs(i1 - ele) < abs(i2 - ele) else i2
This solution assumes a will always be one-dimensional, while b can have any number of dimensions.
Create two temporary arrays by tiling a and b into the dimensions of the other (here both will now have a shape of (5, 10, 10)).
at = np.tile(np.reshape(a, (-1, *list(np.ones(len(b.shape)).astype(int)))), (1, *b.shape))
bt = np.tile(b, (a.size, *list(np.ones(len(b.shape)).astype(int))))
For the nearest operation, you can take the absolute value of the difference between the two. The minimum value of that operation in the first dimension (dimension 0) gives the index in the a array.
idx = np.argmin(np.abs(at-bt),axis=0)
All that is left is to select the values from array a using the index, which will return an array in the shape of b with the nearest values from a.
ans = a[idx]
This method can also be used (modifying how the index is calculated) to do other operations, such as a floor, ceil, etc.
Note that this solution can be memory intensive, which is not much of an issue with small arrays. A looping solution could be less memory intensive at the cost of speed.
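For comparison (my addition, not part of the original answer), the same nearest-value lookup can be written with broadcasting instead of np.tile; it still materializes the (5, 10, 10) difference array, so the memory caveat above applies just as much.

import numpy as np

a = np.random.normal(size=(5,))
b = np.random.normal(size=(10, 10))

# Reshape a to (5, 1, 1) so it broadcasts against b's (10, 10) into a (5, 10, 10) difference array.
idx = np.abs(a.reshape(-1, *([1] * b.ndim)) - b).argmin(axis=0)
ans = a[idx]   # nearest value from a for every element of b, same shape as b
print(ans.shape)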
I don't know Numpy, but I don't think knowledge of Numpy is needed to be able to answer this question. Assuming that an array can be iterated and modified in the same way as a list, the following code solves your problem by using a nested loop to find the closest value.
for i in range(len(b)):
    for k in range(len(b[i])):
        closest = a[0]
        for j in range(1, len(a)):
            if abs(a[j] - b[i][k]) < abs(closest - b[i][k]):
                closest = a[j]
        b[i][k] = closest
Disclaimer: a more pythonic approach may exist.

Inverse of random.shuffle()?

I have a function; for simplicity I'll call it shuffler. It takes a list, seeds random with 17, and then prints that list shuffled.
def shuffler(n):
    import random
    random.seed(17)
    print(random.shuffle(n))
How would I create another function called unshuffler that "unshuffles" that list that is returned by shuffler(), bringing it back to the list I inputted into shuffler() assuming that I know the seed?
Just wanted to contribute an answer that's more compatible with functional patterns commonly used with numpy. Ultimately this solution should perform the fastest as it will take advantage of numpy's internal optimizations, which themselves can be further optimized via the use of projects like numba. It ought to be much faster than using conventional loop structures in python.
import numpy as np
original_data = np.array([23, 44, 55, 19, 500, 201]) # Some random numbers to represent the original data to be shuffled
data_length = original_data.shape[0]
# Here we create an array of shuffled indices
shuf_order = np.arange(data_length)
np.random.shuffle(shuf_order)
shuffled_data = original_data[shuf_order] # Shuffle the original data
# Create an inverse of the shuffled index array (to reverse the shuffling operation, or to "unshuffle")
unshuf_order = np.zeros_like(shuf_order)
unshuf_order[shuf_order] = np.arange(data_length)
unshuffled_data = shuffled_data[unshuf_order] # Unshuffle the shuffled data
print(f"original_data: {original_data}")
print(f"shuffled_data: {shuffled_data}")
print(f"unshuffled_data: {unshuffled_data}")
assert np.all(np.equal(unshuffled_data, original_data))
Here are two functions that do what you need:
import random
import numpy as np
def shuffle_forward(l):
    order = range(len(l)); random.shuffle(order)
    return list(np.array(l)[order]), order

def shuffle_backward(l, order):
    l_out = [0] * len(l)
    for i, j in enumerate(order):
        l_out[j] = l[i]
    return l_out
Example
l = range(10000); random.shuffle(l)
l_shuf, order = shuffle_forward(l)
l_unshuffled = shuffle_backward(l_shuf, order)
print l == l_unshuffled
#True
Reseed the random generator with the seed in question and then shuffle the list 1, 2, ..., n. This tells you exactly what ended up where in the shuffle.
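A minimal sketch of that idea (my own, not code from the answer; it uses a shuffler variant that returns the list instead of printing it, and assumes the shuffle is the first random call after seeding):

import random

def shuffler(n):
    random.seed(17)
    random.shuffle(n)
    return n

def unshuffler(shuffled, seed=17):
    # Re-create the permutation shuffler applied by shuffling the index list
    # [0, 1, ..., n-1] with the same seed; shuffle depends only on the length.
    perm = list(range(len(shuffled)))
    random.seed(seed)
    random.shuffle(perm)
    # shuffled[i] came from original position perm[i], so put it back there.
    original = [None] * len(shuffled)
    for i, j in enumerate(perm):
        original[j] = shuffled[i]
    return original

data = [10, 20, 30, 40, 50]
mixed = shuffler(list(data))
print(unshuffler(mixed) == data)   # True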
In Python3:
import random
import numpy as np

def shuffle_forward(l):
    order = list(range(len(l))); random.shuffle(order)
    return list(np.array(l)[order]), order

def shuffle_backward(l, order):
    l_out = [0] * len(l)
    for i, j in enumerate(order):
        l_out[j] = l[i]
    return l_out
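For completeness, here is the earlier usage example translated to Python 3 syntax (my adaptation, using the two functions just defined):

l = list(range(10000))
random.shuffle(l)
l_shuf, order = shuffle_forward(l)
l_unshuffled = shuffle_backward(l_shuf, order)
print(l == l_unshuffled)
# True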

NumPy: 1D interpolation of a 3D array

I'm rather new to NumPy. Anyone have an idea for making this code, especially the nested loops, more compact/efficient? BTW, dist and data are three-dimensional numpy arrays.
def interpolate_to_distance(self, distance):
    interpolated_data = np.ndarray(self.dist.shape[1:])
    for j in range(interpolated_data.shape[1]):
        for i in range(interpolated_data.shape[0]):
            interpolated_data[i, j] = np.interp(
                distance, self.dist[:, i, j], self.data[:, i, j])
    return interpolated_data
Thanks!
Alright, I'll take a swag with this:
def interpolate_to_distance(self, distance):
    dshape = self.dist.shape
    dist = self.dist.T.reshape(-1, dshape[-1])
    data = self.data.T.reshape(-1, dshape[-1])
    intdata = np.array([np.interp(distance, di, da)
                        for di, da in zip(dist, data)])
    return intdata.reshape(dshape[0:2]).T
It at least removes one loop (and those nested indices), but it's not much faster than the original, ~20% faster according to %timeit in IPython. On the other hand, there's a lot of (probably unnecessary, ultimately) transposing and reshaping going on.
For the record, I wrapped it up in a dummy class and filled some 3 x 3 x 3 arrays with random numbers to test:
import numpy as np
class TestClass(object):
    def interpolate_to_distance(self, distance):
        dshape = self.dist.shape
        dist = self.dist.T.reshape(-1, dshape[-1])
        data = self.data.T.reshape(-1, dshape[-1])
        intdata = np.array([np.interp(distance, di, da)
                            for di, da in zip(dist, data)])
        return intdata.reshape(dshape[0:2]).T

    def interpolate_to_distance_old(self, distance):
        interpolated_data = np.ndarray(self.dist.shape[1:])
        for j in range(interpolated_data.shape[1]):
            for i in range(interpolated_data.shape[0]):
                interpolated_data[i, j] = np.interp(
                    distance, self.dist[:, i, j], self.data[:, i, j])
        return interpolated_data

if __name__ == '__main__':
    testobj = TestClass()
    testobj.dist = np.random.randn(3, 3, 3)
    testobj.data = np.random.randn(3, 3, 3)
    distance = 0
    print 'Old:\n', testobj.interpolate_to_distance_old(distance)
    print 'New:\n', testobj.interpolate_to_distance(distance)
Which prints (for my particular set of randoms):
Old:
[[-0.59557042 -0.42706077  0.94629049]
 [ 0.55509032 -0.67808257 -0.74214045]
 [ 1.03779189 -1.17605275  0.00317679]]
New:
[[-0.59557042 -0.42706077  0.94629049]
 [ 0.55509032 -0.67808257 -0.74214045]
 [ 1.03779189 -1.17605275  0.00317679]]
I also tried np.vectorize(np.interp) but couldn't get that to work. I suspect that would be much faster if it did work.
I couldn't get np.fromfunction to work either, as it passed (2) 3 x 3 (in this case) arrays of indices to np.interp, the same arrays you get from np.mgrid.
One other note: according to the docs for np.interp,
np.interp does not check that the x-coordinate sequence xp is increasing. If xp is not increasing, the results are nonsense. A simple check for increasingness is:
np.all(np.diff(xp) > 0)
Obviously, my random numbers violate the 'always increasing' rule, but you'll have to be more careful.
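If your dist values are not guaranteed to be increasing along the first axis, one way to be careful (my suggestion, not part of the answer; the function name is just for illustration) is to argsort each column before interpolating:

import numpy as np

def interpolate_to_distance_sorted(dist, data, distance):
    # Sort each (i, j) column along axis 0 so np.interp always sees an increasing xp.
    order = np.argsort(dist, axis=0)
    dist_sorted = np.take_along_axis(dist, order, axis=0)
    data_sorted = np.take_along_axis(data, order, axis=0)
    out = np.empty(dist.shape[1:])
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.interp(distance, dist_sorted[:, i, j], data_sorted[:, i, j])
    return out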
