Let's say I define a record array
>>> y=np.zeros(4,dtype=('a4,int32,float64'))
and then I proceed to fill up the 4 records available. Now I get more data, something like
>>> c=('a',7,'24.5')
and I want to add this record to y. I can't figure out a clean way to do it. The best I have seen is np.concatenate(), but that would require turning c into a record array in and of itself. Is there any simple way to tack my tuple c onto y? This seems like it should be really straightforward and widely documented. Apologies if it is; I haven't been able to find it.
You can use numpy.append(), but you need to convert the new data into a record array as well:
import numpy as np
y = np.zeros(4,dtype=('a4,int32,float64'))
y = np.append(y, np.array([("0",7,24.5)], dtype=y.dtype))
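If you prefer the np.concatenate() route mentioned in the question, it works the same way once the new tuple is wrapped in a one-element array of the same dtype (a minimal sketch, reusing y from above):
c = ("a", 7, 24.5)
# build a one-element structured array with the same dtype, then concatenate;
# this copies both inputs into a brand-new array
y = np.concatenate([y, np.array([c], dtype=y.dtype)])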
Since an ndarray can't dynamically change its size, all of the data has to be copied whenever you append new data. You can create a class that reduces the resize frequency:
import numpy as np

class DynamicRecArray(object):
    def __init__(self, dtype):
        self.dtype = np.dtype(dtype)
        self.length = 0
        self.size = 10
        self._data = np.empty(self.size, dtype=self.dtype)

    def __len__(self):
        return self.length

    def append(self, rec):
        if self.length == self.size:
            # grow by 50% to keep the number of reallocations low
            self.size = int(1.5 * self.size)
            self._data = np.resize(self._data, self.size)
        self._data[self.length] = rec
        self.length += 1

    def extend(self, recs):
        for rec in recs:
            self.append(rec)

    @property
    def data(self):
        return self._data[:self.length]

y = DynamicRecArray('a4,int32,float64')
y.extend([("xyz", 12, 3.2), ("abc", 100, 0.2)])
y.append(("123", 1000, 0))
print(y.data)
for i in range(100):
    y.append((str(i), i, i + 0.1))
Concatenating numpy arrays is typically avoided because it requires reallocating a contiguous block of memory. Size your array with room to spare, and then fill it in place (or concatenate in large chunks) as needed. This post may be of some help.
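As a minimal sketch of that preallocation idea (the capacity of 1000 is just an assumed upper bound on how many records will arrive):
import numpy as np

capacity = 1000                                    # assumed upper bound
buf = np.zeros(capacity, dtype='a4,int32,float64')
n = 0                                              # rows filled so far

for rec in [("abc", 7, 24.5), ("xyz", 12, 3.2)]:   # incoming records
    buf[n] = rec
    n += 1

data = buf[:n]   # a view of just the filled part, no copy involved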
I have created a "C Array" in Python and implemented append as below:
import ctypes

class Array:
    def append(self, val):
        array_new_dtype = ctypes.py_object * (self.size + 1)
        new_array = array_new_dtype()
        for i in range(self.size):
            new_array[i] = self.array[i]
        new_array[self.size] = val
        self.size += 1
        del self.array
        self.array = new_array
I am trying to calculate complexity for append and want to confirm if I am doing it the right way.
def append(self, val):
    # O(1) because one object is created
    array_new_dtype = ctypes.py_object * (self.size+1)
    # O(n) because creating new_array initializes each individual py_object
    new_array = array_new_dtype()
    # O(n) because the for loop starts at index 0 and runs until size
    for i in range(self.size):
        new_array[i] = self.array[i]
    # O(1) because the last element of the array gets its value
    new_array[self.size] = val
    # O(1) because the size attribute is set
    self.size += 1
    # O(1) because an object is deleted
    del self.array
    # O(1) because self.array now references the object new_array
    self.array = new_array
Please let me know if the time analysis for all steps is correct. If not, what would be the correct time analysis for the steps in the append function?
Also, what is the time complexity of the code below:
for i in range(10):
    array.append(i)
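One way to check the expected behaviour of that loop empirically is to time it for a few growing sizes; this is just a sketch, and it assumes a minimal __init__ (starting from an empty ctypes array) that the question does not show:
import ctypes
import time

class Array:
    def __init__(self):
        self.size = 0
        self.array = (ctypes.py_object * 0)()

    def append(self, val):
        array_new_dtype = ctypes.py_object * (self.size + 1)
        new_array = array_new_dtype()
        for i in range(self.size):
            new_array[i] = self.array[i]
        new_array[self.size] = val
        self.size += 1
        del self.array
        self.array = new_array

for n in (500, 1000, 2000):
    a = Array()
    t0 = time.perf_counter()
    for i in range(n):
        a.append(i)
    # if a single append is O(n), doubling n should roughly quadruple the total time
    print(n, round(time.perf_counter() - t0, 4))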
Short description
I want to walk along a numpy 2D array starting from different points in specified directions (either 1 or -1) until a column changes (see below)
Current code
First let's generate a dataset:
# Generate a big random dataset
# first column is an id, the second is a number (metadata), the third a direction (1 or -1)
import numpy as np

np.random.seed(123)
c1 = np.random.randint(0,100,size = 1000000)
c2 = np.random.randint(0,20,size = 1000000)
c3 = np.random.choice([1,-1],1000000 )
m = np.vstack((c1, c2, c3)).T
m = m[m[:,0].argsort()]
Then I wrote the following code that starts at specific rows in the matrix (start_points) then keeps extending in the specified direction (direction_array) until the metadata changes:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = start_mat[:,1]
    direction_array = start_mat[:,2]
    walk_array = start_array
    while True:
        walk_array = np.add(walk_array, direction_array)
        try:
            walk_mat = mat[walk_array]
            walk_metadata = walk_mat[:,1]
            if sorted(metadata) != sorted(walk_metadata):
                raise IndexError
        except IndexError:
            return start_mat, mat[walk_array + (direction_array * -1)]
import time

s = time.time()
for i in range(100000):
    start_points = np.random.randint(0,1000000,size = 3)
    res = walk(m, start_points)
Question
While the above code works fine, I think there must be an easier/more elegant way to walk along a numpy 2D array from different start points until the value of another column changes. For example, this requires me to slice the whole input array for every step of the while loop, which seems quite inefficient (especially when I have to run walk millions of times).
You don't have to slice the whole input array in the while loop. You can just use the column whose values you want to check.
I also refactored your code a bit so there is no while True statement, and no if that raises an error for no particular reason.
Code:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = sorted(start_mat[:,1])
    direction_array = start_mat[:,2]
    data = mat[:,1]
    walk_array = np.add(start_array, direction_array)
    try:
        while metadata == sorted(data[walk_array]):
            walk_array = np.add(walk_array, direction_array)
    except IndexError:
        pass
    return start_mat, mat[walk_array - direction_array]
In this particular case, if len(start_array) is a big number (thousands of elements), you could use collections.Counter instead of sorted, as it will be much faster.
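For reference, a sketch of that Counter variant (only the equality check changes; the function name walk_counter is just mine to keep it distinct):
from collections import Counter
import numpy as np

def walk_counter(mat, start_array):
    start_mat = mat[start_array]
    metadata = Counter(start_mat[:,1])
    direction_array = start_mat[:,2]
    data = mat[:,1]
    walk_array = np.add(start_array, direction_array)
    try:
        # Counter equality is a linear-time multiset comparison, no sorting needed
        while metadata == Counter(data[walk_array]):
            walk_array = np.add(walk_array, direction_array)
    except IndexError:
        pass
    return start_mat, mat[walk_array - direction_array]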
I was also thinking of another approach: build slices of the array that already run in the correct direction and walk them together.
This approach seems rather dirty, but I will post it anyway in case you find it useful.
Code:
def walk(mat, start_array):
    start_mat = mat[start_array]
    metadata = sorted(start_mat[:,1])
    direction_array = start_mat[:,2]
    data = mat[:,1]
    walk_slices = zip(*[
        data[start_array[i] + direction_array[i]::direction_array[i]]
        for i in range(len(start_array))
    ])
    for step, walk_metadata in enumerate(walk_slices):
        if metadata != sorted(walk_metadata):
            break
    return start_mat, mat[start_array + (direction_array * step)]
To perform the operation starting from a single row, define the following class:
class Walker:
    def __init__(self, tbl, row):
        self.tbl = tbl
        self.row = row
        self.dir = self.tbl[self.row, 2]

    # How many rows can I move from "row" in the indicated direction
    # while metadata doesn't change
    def numEq(self):
        # Metadata from "row" in the required direction
        md = self.tbl[self.row::self.dir, 1]
        return ((md != md[0]).cumsum() == 0).sum() - 1

    # Get row "n" positions from "row" in the indicated direction
    def getRow(self, n):
        return self.tbl[self.row + n * self.dir]
Then, to get the result, run:
def walk_2(m, start_points):
    # Create walkers for each starting point
    wlk = [Walker(m, n) for n in start_points]
    # How many rows can I move
    dist = min([w.numEq() for w in wlk])
    # Return rows from changed positions
    return np.vstack([w.getRow(dist) for w in wlk])
The execution time of my code is roughly the same as yours,
but in my opinion my code is more readable and concise.
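For completeness, a usage sketch (reusing m, np and walk_2 from above; start_points is just three random rows, as in the question):
start_points = np.random.randint(0, 1000000, size=3)
res = walk_2(m, start_points)   # one stacked row per walker, taken where the metadata last matched
print(res)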
I need a 2D dict-like structure that allows fast deletion operations.
E.g.
x['a']['b'] = 1
x['a']['c'] = 1
x['a']['d'] = 1
x['b']['a'] = 1
x['b']['f'] = 1
x['e']['b'] = 1
x['f']['c'] = 1
...
i.e. the keys a, b, c, e, f, ... can be used in both dimensions.
It is fast to delete by the first dimension, i.e.
del x['a']
but if you want to delete by the second dimension you have to enumerate the elements first, which is slow.
You can also imagine this structure as a 2D table where columns and rows have names AND where you can delete a whole row or column fast. At the same time, additions happen one cell at a time.
What would be your solution?
PS: One possibility would be to keep lists mapping each second-dimension key to its first-dimension keys, but this would take up too much memory and would keep data around for rows/cols that won't ever be deleted!
Would using a Pandas DataFrame with columns key1, key2, data be faster?
I think this will be impossible without using additional memory. As for the memory concern, you can in fact store the keys for both dimensions and delete them when needed. A simple solution using two dictionaries:
from collections import defaultdict

class Table2D:
    def __init__(self, *args, **kwargs):
        self._row_to_cols = defaultdict(dict)
        self._col_to_rows = defaultdict(dict)

    def add(self, c, r):
        self._col_to_rows[c][r] = 1
        self._row_to_cols[r][c] = 1

    def get_row(self, r):
        return self._row_to_cols[r]

    def get_col(self, c):
        return self._col_to_rows[c]

    def rem_row(self, r):
        for c in self._row_to_cols[r]:
            del self._col_to_rows[c][r]
        del self._row_to_cols[r]

    def rem_col(self, c):
        for r in self._col_to_rows[c]:
            del self._row_to_cols[r][c]
        del self._col_to_rows[c]

t2d = Table2D()
t2d.add('a', 'c')
t2d.rem_col('a')
print(t2d.get_col('a'))
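A slightly fuller usage sketch (with hypothetical keys) showing that removing a row in one pass also cleans up every column that referenced it:
t2d = Table2D()
t2d.add('a', 'b')   # column 'a', row 'b'
t2d.add('a', 'c')   # column 'a', row 'c'
t2d.add('b', 'c')   # column 'b', row 'c'
print(t2d.get_row('c'))   # {'a': 1, 'b': 1}
t2d.rem_row('c')          # deletes row 'c' and its entries in columns 'a' and 'b'
print(t2d.get_col('a'))   # {'b': 1}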
I would like to sample a 26 dimensional space with say 10 points in every direction. This means that there are in total 10**26 samples, but I'll discard more than 99.9999... %. Using python, this immediately leads to memory errors.
A first naive approach is to use nested loops:
p = list(range(10))
for p1 in p:
    for p2 in p:
        ...
However, Python has a built-in maximum on the number of nested loops: 20.
A better approach would be to use the numpy.indices command:
import numpy as np
dimensions = (10,)*26
indices = np.indices(dimensions)
This fails with an "array too big" message because Numpy can't fit all 10**26 indices in memory. Understandable.
My final approach was to use an iterator, hoping this didn't need more memory:
import numpy as np
dimensions = (10,)*26
for index in np.ndindex(*dimensions):
    pass  # do something with index
However, this ALSO fails with an "array too big" message, since under the hood Numpy still tries to create a dense array.
Does anybody else have a better approach?
Thanks!
Tom
EDIT: The "array too big" message is probably because 10**26 is larger than the maximum value an Int64 can store. If you could tell Numpy to store the size as an Int128, that might circumvent the ValueError at least. It'll still require almost 20GB to store all the indices as Int64 though ...
So far, this is the solution that I've found:
class IndicesGenerator:
    def __init__(self, nbDimensions, nbSamplesPerDimension):
        self.nbDimensions = nbDimensions
        self.nbSamplesPerDimension = nbSamplesPerDimension

    def getNbDimensions(self):
        return self.nbDimensions

    def getNbSamplesPerDimension(self):
        return self.nbSamplesPerDimension

    def getIndices(self):
        d = self.getNbDimensions()
        N = self.getNbSamplesPerDimension()
        # create indices
        indices = []
        prevIndex = None
        for i in range(d):
            newIndex = Index(maxValue=N-1, prev=prevIndex)
            indices.append(newIndex)
            prevIndex = newIndex
        lastIndex = indices[-1]
        while True:
            try:
                yield list(map(lambda index: index.getValue(), indices))
                lastIndex.increment()
            except RuntimeError:
                break

class Index:
    def __init__(self, maxValue, prev=None):
        assert prev is None or isinstance(prev, Index)
        assert isinstance(maxValue, int)
        self.prev = prev
        self.value = 0
        self.maxValue = maxValue

    def getPrevious(self):
        return self.prev

    def getValue(self):
        return self.value

    def setValue(self, value):
        assert isinstance(value, int)
        self.value = value

    def getMaximumValue(self):
        return self.maxValue

    def increment(self):
        if self.getValue() == self.getMaximumValue():
            # increment previous and set the current one to zero
            if self.getPrevious() is None:
                # the end is reached, so raise an error
                raise RuntimeError
            else:
                self.setValue(0)
                self.getPrevious().increment()
        else:
            self.setValue(self.getValue() + 1)

if __name__ == '__main__':
    import time

    nbIndices = 0
    d = 3
    N = 5
    start = time.time()
    for indices in IndicesGenerator(nbDimensions=d, nbSamplesPerDimension=N).getIndices():
        # print(indices)
        nbIndices += 1
    assert nbIndices == N**d
    end = time.time()
    print("Nb indices generated: ", nbIndices)
    print("Computation time: ", round(end-start, 2), "s.")
It's not fast for large dimensions but at least it works without memory errors.
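For comparison, the standard library's itertools.product does the same lazy, one-index-at-a-time enumeration without building anything large in memory; a minimal sketch (small d and N so the assert finishes quickly):
import itertools

d, N = 3, 5
count = 0
for index in itertools.product(range(N), repeat=d):
    count += 1          # replace with the per-sample work on the d-tuple `index`
assert count == N**d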
I'm rather new to NumPy. Anyone have an idea for making this code, especially the nested loops, more compact/efficient? BTW, dist and data are three-dimensional numpy arrays.
def interpolate_to_distance(self, distance):
    interpolated_data = np.ndarray(self.dist.shape[1:])
    for j in range(interpolated_data.shape[1]):
        for i in range(interpolated_data.shape[0]):
            interpolated_data[i,j] = np.interp(
                distance, self.dist[:,i,j], self.data[:,i,j])
    return interpolated_data
Thanks!
Alright, I'll take a swag with this:
def interpolate_to_distance(self, distance):
    dshape = self.dist.shape
    dist = self.dist.T.reshape(-1, dshape[-1])
    data = self.data.T.reshape(-1, dshape[-1])
    intdata = np.array([np.interp(distance, di, da)
                        for di, da in zip(dist, data)])
    return intdata.reshape(dshape[0:2]).T
It at least removes one loop (and those nested indices), but it's not much faster than the original, ~20% faster according to %timeit in IPython. On the other hand, there's a lot of (probably unnecessary, ultimately) transposing and reshaping going on.
For the record, I wrapped it up in a dummy class and filled some 3 x 3 x 3 arrays with random numbers to test:
import numpy as np

class TestClass(object):
    def interpolate_to_distance(self, distance):
        dshape = self.dist.shape
        dist = self.dist.T.reshape(-1, dshape[-1])
        data = self.data.T.reshape(-1, dshape[-1])
        intdata = np.array([np.interp(distance, di, da)
                            for di, da in zip(dist, data)])
        return intdata.reshape(dshape[0:2]).T

    def interpolate_to_distance_old(self, distance):
        interpolated_data = np.ndarray(self.dist.shape[1:])
        for j in range(interpolated_data.shape[1]):
            for i in range(interpolated_data.shape[0]):
                interpolated_data[i,j] = np.interp(
                    distance, self.dist[:,i,j], self.data[:,i,j])
        return interpolated_data

if __name__ == '__main__':
    testobj = TestClass()
    testobj.dist = np.random.randn(3, 3, 3)
    testobj.data = np.random.randn(3, 3, 3)
    distance = 0
    print 'Old:\n', testobj.interpolate_to_distance_old(distance)
    print 'New:\n', testobj.interpolate_to_distance(distance)
Which prints (for my particular set of randoms):
Old:
[[-0.59557042 -0.42706077 0.94629049]
[ 0.55509032 -0.67808257 -0.74214045]
[ 1.03779189 -1.17605275 0.00317679]]
New:
[[-0.59557042 -0.42706077 0.94629049]
[ 0.55509032 -0.67808257 -0.74214045]
[ 1.03779189 -1.17605275 0.00317679]]
I also tried np.vectorize(np.interp) but couldn't get that to work. I suspect that would be much faster if it did work.
I couldn't get np.fromfunction to work either, as it passes two 3 x 3 (in this case) arrays of indices to np.interp, the same arrays you get from np.mgrid.
One other note: according to the docs for np.interp,
np.interp does not check that the x-coordinate sequence xp is increasing. If xp is not increasing, the results are nonsense. A simple check for increasingness is:
np.all(np.diff(xp) > 0)
Obviously, my random numbers violate the 'always increasing' rule, but you'll have to be more careful.
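If dist can arrive unsorted, one way to guard against this (a sketch, assuming the 3D arrays from the test above, with the interpolation axis first) is to check or enforce monotonicity along axis 0 before interpolating:
import numpy as np

# check: every (i, j) column of dist should be strictly increasing along axis 0
ok = np.all(np.diff(testobj.dist, axis=0) > 0)

# or enforce it: sort dist along axis 0 and reorder data with the same permutation
order = np.argsort(testobj.dist, axis=0)
dist_sorted = np.take_along_axis(testobj.dist, order, axis=0)
data_sorted = np.take_along_axis(testobj.data, order, axis=0)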