Replace data of an array by two values of a second array - python

I have two numpy arrays, "Elements" and "nodes". My aim is to gather some data from these arrays:
I need to replace the last two columns of "Elements" with the two coordinates contained
in the "nodes" array. Both arrays are very large, and I have to automate this.
This post refers to an older one: Replace data of an array by 2 values of a second array
with the difference that the arrays here are very large (Elements: (3342558,5) and nodes: (581589,4)), so the previous solution does not work.
An example:
import numpy as np
Elements = np.array([[1.,11.,14.],[2.,12.,13.]])
nodes = np.array([[11.,0.,0.],[12.,1.,1.],[13.,2.,2.],[14.,3.,3.]])
results = np.array([[1., 0., 0., 3., 3.],
                    [2., 1., 1., 2., 2.]])
The previous solution proposed by hpaulj:
e = Elements[:,1:].ravel().astype(int)
n = nodes[:,0].astype(int)
I, J = np.where(e==n[:,None])
results = np.zeros((e.shape[0],2),nodes.dtype)
results[J] = nodes[I,1:]
results = results.reshape(2,4)
But with huge arrays, this script does not work; it fails with:
DeprecationWarning: elementwise comparison failed; this will raise an error in the future...

Most of the work is in figuring out the indices at which each entry of Elements matches in nodes.
Approach #1
Since it seems you are open to converting to integers, let's assume the IDs can be taken as integers. With that, we can use an array-assignment + mapping based method, as shown below:
ar = Elements.astype(int)
a = ar[:,1:].ravel()
nd = nodes[:,0].astype(int)
n = a.max()+1
# for the generalized case of negative ints in a, or nodes having non-matching values:
# n = max(a.max()-min(0,a.min()), nd.max()-min(0,nd.min()))+1
lookup = np.empty(n, dtype=int)
lookup[nd] = np.arange(len(nd))
indices = lookup[a]
nc = (Elements.shape[1]-1)*(nodes.shape[1]-1) # 4 for given setup
out = np.concatenate((ar[:,0,None], nodes[indices,1:].reshape(-1,nc)),axis=1)
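As a quick check with the sample arrays from the question (a sketch, assuming the lines above have run):
print(out)
# [[1. 0. 0. 3. 3.]
#  [2. 1. 1. 2. 2.]]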
Approach #2
We could also use np.searchsorted to get those indices.
If nodes has its rows sorted by the first column and every id has a match, we can simply use:
indices = np.searchsorted(nd, a)
If nodes is not necessarily sorted, but every id still has a match:
sidx = nd.argsort()
idx = np.searchsorted(nd, a, sorter=sidx)
indices = sidx[idx]
For the non-matching case, detect the invalid positions with a boolean mask:
invalid = idx==len(nd)
idx[invalid] = 0
indices = sidx[idx]
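With the sample arrays from the question, nd = [11 12 13 14] happens to be sorted already, so the direct form applies (a sketch reusing a, ar and nc from Approach #1):
indices = np.searchsorted(nd, a)   # array([0, 3, 1, 2])
out = np.concatenate((ar[:,0,None], nodes[indices,1:].reshape(-1,nc)), axis=1)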
Approach #3
Another approach, with concatenation + sorting:
b = np.concatenate((nd,a))
sidx = b.argsort(kind='stable')
n = len(nd)
v = sidx<n
counts = np.diff(np.flatnonzero(np.r_[v,True]))
r = np.repeat(sidx[v], counts)
indices = np.empty(len(a), dtype=int)
indices[sidx[~v]-n] = r[sidx>=n]
To detect the non-matching ones, use:
nd[indices] != a
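With the sample arrays this yields the same indices as the other approaches (a quick check, reusing a and nd from Approach #1):
print(indices)            # [0 3 1 2]
print(nd[indices] != a)   # [False False False False], i.e. every node id was matched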
Porting the idea here to numba:
from numba import njit

def numba1(Elements, nodes):
    a = Elements[:,1:].ravel()
    nd = nodes[:,0]
    b = np.concatenate((nd,a))
    sidx = b.argsort(kind='stable')
    n = len(nodes)
    ncols = Elements.shape[1]-1
    size = nodes.shape[1]-1
    dt = np.result_type(Elements.dtype, nodes.dtype)
    nc = ncols*size
    out = np.empty((len(Elements),1+nc), dtype=dt)
    out[:,0] = Elements[:,0]
    return numba1_func(out, sidx, nodes, n, ncols, size)

@njit
def numba1_func(out, sidx, nodes, n, ncols, size):
    N = len(sidx)
    for i in range(N):
        if sidx[i] < n:
            # a nodes row: remember which node we are currently on
            cur_id = sidx[i]
        else:
            # an Elements entry: it matches the nodes row last seen
            idx = sidx[i]-n
            row = idx//ncols
            col = idx-row*ncols
            cc = col*size+1
            for ii in range(size):
                out[row, cc+ii] = nodes[cur_id,ii+1]
    return out
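A sketch of calling it on the sample arrays from the question (assuming both functions above are defined):
out = numba1(Elements, nodes)
print(out)
# [[1. 0. 0. 3. 3.]
#  [2. 1. 1. 2. 2.]]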

Would you consider using pandas?
import pandas as pd
Elements = np.array([[1.,11.,14.],[2.,12.,13.]])
nodes = np.array([[11.,0.,0.],[12.,1.,1.],[13.,2.,2.],[14.,3.,3.]])
df_elements = pd.DataFrame(Elements,columns = ['idx','node1','node2'])
df_nodes = pd.DataFrame(nodes, columns = ['node_id','x','y'])
# Double merge to get the coordinates from df_nodes
results = (df_elements
           .merge(df_nodes, left_on='node1', right_on='node_id', how='left')
           .merge(df_nodes, left_on='node2', right_on='node_id', how='left')
           [['idx','x_x','y_x','x_y','y_y']]
           .values)
Output
array([[1., 0., 0., 3., 3.],
       [2., 1., 1., 2., 2.]])
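If the double merge turns out to be slow at full size, an equivalent lookup (a sketch, not benchmarked) indexes df_nodes by node_id and uses .loc, which returns rows in the order of the labels passed:
coords = df_nodes.set_index('node_id')
left = coords.loc[df_elements['node1']].to_numpy()
right = coords.loc[df_elements['node2']].to_numpy()
results = np.column_stack([df_elements['idx'].to_numpy(), left, right])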

First, let's estimate the sizes of the arrays to see if we will run into a memory error:
from sys import getsizeof
Element_size = getsizeof(np.random.randint(0,100,(3342558,5))) / (1024**3)
nodes_size = getsizeof(np.random.randint(0,100,(581589,4))) / (1024**3)
result_size = getsizeof(np.random.randint(0,100,(3342558,13))) / (1024**3)
total_size = Element_size + nodes_size + result_size
Running this script (the result has 13 = (5-1)*(4-1)+1 columns), total_size is about 0.46 GB. This means we don't need to worry too much about a memory error, but we should still do our best to avoid making unnecessary copies of the arrays.
We first create arrays to work with
elements = np.random.randint(0,100,(100,5))
elements[:,0] = np.arange(100)
nodes = np.random.randint(0,100,(300,4))
# create an empty result array with the right dtype up front (astype would make an extra copy)
results = np.empty((100,13), dtype=elements.dtype)
results[:,:5] = elements
As you can see, we create the array results first; there are two benefits to creating this array at the beginning:
Most operations can be in-place operations performed on results.
If the memory space is not sufficient, you will know this when you create results.
With these arrays, you can solve your problem with
aux_inds = np.arange(4)
def argmax_with_exception(row):
mask = row[1:5][:,None] == nodes[:,0]
indices = np.argmax(mask,axis=1)
node_slices = nodes[indices][:,1:]
# if a node in Element is not found in the array nodes
not_found = aux_inds[~np.any(mask,axis=1)]
node_slices[not_found] = np.ones(3) * -999
row[1:] = node_slices.flatten()
np.apply_along_axis(argmax_with_exception,1,results)
Here, if a node referenced in elements is not found in nodes, its coordinates are set to (-999,-999,-999).
In this approach, np.apply_along_axis(argmax_with_exception, 1, results) performs an in-place operation on the array results, so you are unlikely to run into a memory error as long as the arrays can be created in the first place. If, however, the machine you are working with has very little RAM, you can save the array Elements to disk first and then load it into results with results[:,:5] = np.load('Elements.npy').
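As a quick sanity check (a sketch using the elements, nodes and results arrays defined above), one entry can be verified by hand:
nid = elements[0, 1]                # first node id referenced by element 0
hit = nodes[nodes[:, 0] == nid]     # matching rows in nodes
if len(hit):
    # argmax picks the first match, so compare against hit[0]
    assert np.array_equal(results[0, 1:4], hit[0, 1:])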

To understand the pythonic solution, first look at the solution provided by sgnfis on the old post:
Old solution
import numpy as np
# I used numpy 1.10.1 here
Elements = np.array([[1.,11.,14.],[2.,12.,13.]])
nodes = np.array([[11.,0.,0.],[12.,1.,1.],[13.,2.,2.],[14.,3.,3.]])
# Create an array with enough rows and five columns
res = np.zeros((np.shape(Elements)[0],5))
for i in range(np.shape(Elements)[0]):
    res[i,0] = Elements[i,0] # The first column stays the same
    # Find the value of the 2nd column of Elements in the first column of nodes.
    nodesindex = np.where(nodes[:,0]==Elements[i,1])[0]
    # Replace the second and third columns of the result with the entries from nodes.
    res[i,1:3] = nodes[nodesindex,1:3]
    # Do the same for the 3rd column of Elements
    nodesindex = np.where(nodes[:,0]==Elements[i,2])[0]
    res[i,3:5] = nodes[nodesindex,1:3]
print(res)
The above solution can now be turned into a more pythonic one, as given below:
New Solution:
import numpy as np
Elements = np.array([[1.,11.,14.],[2.,12.,13.]])
nodes = np.array([[11.,0.,0.],[12.,1.,1.],[13.,2.,2.],[14.,3.,3.]])
# Create an array with enough rows and five columns
res = np.zeros((np.shape(Elements)[0],5))
res[:,0] = Elements[:,0] # The first column stays the same
res[:,1:3] = [nodes[np.where(nodes[:,0]==Elements[i,1])[0][0],1:3] for i in range(np.shape(Elements)[0])]
res[:,3:5] = [nodes[np.where(nodes[:,0]==Elements[i,2])[0][0],1:3] for i in range(np.shape(Elements)[0])]
print(res)

Related

Extracting elements from a 1D array. Getting error invalid index to scalar variable

I have to extract elements from a 1D array and write a file.
import numpy as np
k0 = 78
tpts = 10
x_mirror = [1,2,3,4,5,6,7,8,9,10]
alpha = -3
x_start = 3
u0 = []
for p in range(0,tpts): # initial condition at t = 0
    ts = np.exp(alpha*((x_mirror[p]-x_start)**2)) * np.cos(k0*(x_mirror[p]-x_start))
    u0.append(ts)
u0_array = np.array(u0)[np.newaxis]
u0_array_transpose = np.transpose(u0_array)
matrix_A = np.zeros((tpts,tpts))
matrix_A[tpts-1][tpts-1] = 56
matrixC = matrix_A @ u0_array_transpose
matrixC2 = matrix_A @ u0
u_pre = np.array(np.zeros(tpts))
print(u0_array)
From this I want to extract individual elements of u0_array. I get u0_array as [[ 2.89793185e-06 -4.27075012e-02 1.00000000e+00 -4.27075012e-02 2.89793185e-06 9.14080657e-14 -7.91091805e-22 2.42062877e-33 -1.24204313e-47 1.15841796e-64]]; this is just an example. How can I get at the individual elements of u0_array? Using u0[][] I get an error. Any help is highly appreciated.
u0_array is a 2-D array of shape (1, tpts): an array that contains one array of floats. To index an individual element, use u0_array[0][index] (or, equivalently, u0_array[0, index]). You can also use .flatten(), as the comment states.
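For example (a sketch using the array above):
third = u0_array[0][2]       # the third element; equivalently u0_array[0, 2]
flat = u0_array.flatten()    # 1-D copy, so flat[2] works directly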

Realign indexes to a changed python collection

I have a collection of data and a variable containing indexes to some of them.
A filtering operation is applied on the data that eliminates a subset of the data.
I want to shift the indexes so that they refer to the updated collection of data (eliminating indexes to deleted instances).
I'm using the implementation in the function below. I'm also posting the code I used to validate that it works.
Is there a quick and efficient way to do the index realignment via the core libraries, or a better approach in general?
import random
def align_index(wanted_idx, mask):
    """
    Align a set of indexes to a collection after deletions,
    indicated with a mask.
    Arguments:
        wanted_idx: List of desired integer indexes prior to deletion
        mask: Binary mask, where 1's indicate elements that survive deletion
    Returns:
        List of integer indexes to (surviving) desired elements, post-deletion
    """
    # rebuild indexes: remove dangling
    new_idx = [idx for idx in wanted_idx if mask[idx]]
    # mark deleted
    not_mask = [int(not m) for m in mask]
    # cumsum deleted regions
    realigned_idx = [k - sum(not_mask[:k+1]) for k in new_idx]
    return realigned_idx
# data
data = [random.randint(0,500) for _ in range(1000)]
rng = list(range(len(data)))
for _ in range(1000):
    # random data deletion / request
    wanted_idx = random.sample(rng, random.randint(5,100))
    del_index = random.sample(rng, random.randint(5,100))
    # apply deletion
    mask = [int(i not in del_index) for i in range(len(data))]
    filtered_data = [data[i] for (i, m) in enumerate(mask) if m]
    realigned_index = align_index(wanted_idx, mask)
    # verify
    new_idx = [idx for idx in wanted_idx if mask[idx]]
    l1 = [data[k] for k in new_idx]
    l2 = [filtered_data[k] for k in realigned_index]
    assert l1 == l2
If you use numpy it's quite trivial:
import numpy as np
mask = np.array(mask, dtype=bool)
new_idx = np.cumsum(mask, dtype=np.int64) - 1  # new position of each surviving element
new_idx[~mask] = -1                            # deleted elements map to -1
You shouldn't need to recompute new_idx unless more elements get deleted.
Then you can get the remapped index for an old index i just by looking up new_idx[i], or remap a whole array at once:
wanted_idx = np.array(wanted_idx, dtype=np.int64)
remapped_idx = new_idx[wanted_idx]
Note that deleted indices get assigned value -1. You can filter these out if you want:
remapped_idx = remapped_idx[remapped_idx >= 0]
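Putting it together on hypothetical data (a minimal sketch):
import numpy as np
mask = np.array([1, 0, 1, 1, 0, 1], dtype=bool)  # 1 = element survives deletion
new_idx = np.cumsum(mask, dtype=np.int64) - 1    # new position of each kept element
new_idx[~mask] = -1                              # deleted elements map to -1
wanted_idx = np.array([0, 1, 3, 5])
remapped_idx = new_idx[wanted_idx]               # array([ 0, -1,  2,  3])
remapped_idx = remapped_idx[remapped_idx >= 0]   # array([0, 2, 3])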

Different results for linalg.norm in numpy

I am trying to create a feature matrix based on certain features, and then find the distance between items.
For testing purposes I am using only 2 points right now.
data: list of items I have
specs: feature dict of the items (I am using the values of its keys as the features of an item)
features: list of features
This is my code using a numpy zeros matrix:
import numpy as np
matrix = np.zeros((len(data),len(features)), dtype=bool)
for dataindex,item in enumerate(data):
    if dataindex > 5:
        break
    specs = item['specs']
    values = [value.lower() for value in specs.values()]
    for idx,feature in enumerate(features):
        if feature in values:
            matrix[dataindex,idx] = 1
            print(dataindex, idx)

v1 = matrix[0]
v2 = matrix[1]
# print(v1.shape)
diff = v2 - v1
dist = np.linalg.norm(diff)
print(dist)
The value for dist I am getting is 1.0
This is my code by using python lists :
matrix = []
for dataindex,item in enumerate(data):
if dataindex > 5:
f = open("Matrix.txt",'w')
f.write(str(matrix))
f.close()
break
print "Item" + str(dataindex)
row = []
specs = item['specs']
values = [value.lower() for value in specs.values()]
for idx,feature in enumerate(features):
if(feature in values):
print dataindex,idx
row.append(1)
else:
row.append(0)
matrix.append(row)
v1 = np.array(matrix[0]);
v2 = np.array(matrix[1]);
diff = v2 - v1
print diff
dist = np.linalg.norm(diff)
print dist
The value of dist in this case is 4.35889894354.
I have checked many times that the value 1 is being set at the same positions in both cases, but the answers are different.
Maybe I am not using numpy properly, or there is an issue with my logic.
I am using the numpy zeros matrix with dtype=bool because of its memory efficiency.
What is the issue?
It's a type issue:
In [9]: norm(ones(3).astype(bool))
Out[9]: 1.0
In [10]: norm(ones(3).astype(float))
Out[10]: 1.7320508075688772
You must decide which norm is right for your problem, and cast your data with astype if needed.
norm(M) is sqrt(dot(M.ravel(), M.ravel())), so for a boolean matrix M, norm(M) is 0.0 if M is all False and 1.0 otherwise. Use the ord parameter of norm to tune the function.
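Applied to the question, casting the boolean rows to float before subtracting gives the expected Euclidean distance (a sketch with made-up vectors; note that modern numpy refuses to subtract boolean arrays outright):
v1 = np.array([1, 0, 1, 1], dtype=bool)
v2 = np.array([0, 1, 1, 0], dtype=bool)
diff = v2.astype(float) - v1.astype(float)
print(np.linalg.norm(diff))   # 1.7320..., i.e. sqrt(3)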

filling numpy array by index

I have a function which gives me the index for a given value, e.g.,
def F(value):
    index = do_something(value)
    return index
I want to use this index to fill a huge numpy array with 1s. Let's call the array features:
l = [1,4,2,3,7,5,3,6,.....]
# NOTE: features.shape[0] == len(l)
for i in range(features.shape[0]):
    idx = F(l[i])
    features[i, idx] = 1
Is there a pythonic way to perform this (as the loop takes a lot of time if the array is huge)?
If you can vectorize F(value) you could write something like:
indices = np.arange(features.shape[0])
feature_indices = F(l)
features[indices, feature_indices] = 1   # paired row/column fancy indexing
Try this:
i = np.arange(features.shape[0]) # rows
j = np.vectorize(F)(np.array(l)) # columns
features[i,j] = 1
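For instance, with a hypothetical stand-in for do_something (a sketch):
import numpy as np
def F(value):
    return value % 10                  # hypothetical index function
l = [1, 4, 2, 3, 7, 5, 3, 6]
features = np.zeros((len(l), 10), dtype=int)
i = np.arange(features.shape[0])       # rows
j = np.vectorize(F)(np.array(l))       # columns
features[i, j] = 1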

Row, column assignment without for-loop

I wrote a small script to assign values to a numpy array by knowing their row and column coordinates:
gridarray = np.zeros([3,3])
gridarray_counts = np.zeros([3,3])
# np.random.random_integers is deprecated; randint's upper bound is exclusive
cols = np.random.randint(0,3,15)
rows = np.random.randint(0,3,15)
data = np.random.randint(0,10,15)
for nn in np.arange(len(data)):
    gridarray[rows[nn],cols[nn]] += data[nn]
    gridarray_counts[rows[nn],cols[nn]] += 1
This way I know how many values are stored in the same grid cell and what their sum is. However, performing this on arrays of length 100000+ gets quite slow. Is there another way without using a for-loop?
Is an approach similar to this possible? I know this is not working yet.
gridarray[rows,cols] += data
gridarray_counts[rows,cols] += 1
I would use bincount for this, but for now bincount only takes 1D arrays, so you'll need to write your own ndbincount, something like:
def ndbincount(x, weights=None, shape=None):
    if shape is None:
        shape = x.max(1) + 1
    x = np.ravel_multi_index(x, shape)
    out = np.bincount(x, weights, minlength=np.prod(shape))
    out.shape = shape
    return out
Then you can do:
gridarray = np.zeros([3,3])
cols = np.random.randint(0,3,15)
rows = np.random.randint(0,3,15)
data = np.random.randint(0,10,15)
x = np.vstack([rows, cols])
temp = ndbincount(x, data, gridarray.shape)
gridarray = gridarray + temp
gridarray_counts = ndbincount(x, shape=gridarray.shape)
Note that fancy-indexed in-place addition does not accumulate over repeated indices, so gridarray[(rows,cols)] += data would silently drop duplicate cells. The unbuffered np.add.at does this directly:
np.add.at(gridarray, (rows, cols), data)
np.add.at(gridarray_counts, (rows, cols), 1)
