How can I use the permutation array returned by SciPy's reverse Cuthill-McKee (RCM) routine to reorder the original sparse matrix and reduce its bandwidth?
B = mmread('G22.mtx')
graph = csr_matrix(B)
aux2 = reverse_cuthill_mckee(graph,symmetric_mode=True)
Here 'graph' is an undirected graph (a symmetric matrix).
I found the answer, if anyone needs it in the future:
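from scipy.io import mmread
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee

B = mmread('G22.mtx')
graph = csr_matrix(B)
aux2 = reverse_cuthill_mckee(graph, symmetric_mode=True)
# Permute the rows and then the columns with the same ordering
graph = graph[aux2, :][:, aux2]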
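To check that a reordering actually helped, one can compare the bandwidth before and after. The sketch below is illustrative only: the 6-node path graph and the bandwidth helper are made up for this example (they are not part of the question), and it assumes the same reverse_cuthill_mckee call as above.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import reverse_cuthill_mckee


def bandwidth(m):
    """Largest |row - col| over the stored nonzeros (hypothetical helper)."""
    coo = m.tocoo()
    return int(np.abs(coo.row - coo.col).max()) if coo.nnz else 0


# Toy symmetric matrix: a path graph whose node labels are scrambled
order = np.array([3, 0, 5, 1, 4, 2])
dense = np.zeros((6, 6), dtype=int)
for a, b in zip(order[:-1], order[1:]):
    dense[a, b] = dense[b, a] = 1
graph = csr_matrix(dense)

perm = reverse_cuthill_mckee(graph, symmetric_mode=True)
# Apply the permutation to rows and columns alike
reordered = graph[perm, :][:, perm]

before, after = bandwidth(graph), bandwidth(reordered)
print("bandwidth before:", before, "after:", after)
```

Applying the permutation to both rows and columns keeps the matrix symmetric, which is why the same array is used twice.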
As seen in the picture, I have an outlier that I would like to remove (not the red point, but the green one above it, which is not aligned with the other points). I am trying to find each point's minimum distances to its neighbours and use them to eliminate the outlier, but with a dataset this large the code takes an eternity to run. My code is below. I'd appreciate any solution that helps, thanks!
import math
#list of 11600 points
dataset = [[2478, 3534], [4217, 953],......,11600 points]
copy_dataset = dataset
Indices =[]
Min_Dists =[]
Distance = []
Copy_Dist=[]
for p1 in range(len(dataset)):
    p1_x= dataset[p1][0]
    p1_y= dataset[p1][1]
    for p2 in range(len(copy_dataset)):
        p2_x= copy_dataset[p2][0]
        p2_y= copy_dataset[p2][1]
        dist = math.sqrt((p1_x - p2_x) ** 2 + (p1_y - p2_y) ** 2)
        Distance.append(dist)
        Copy_Dist.append(dist)
    min_dist_1= min(Distance)
    Distance.remove(min_dist_1)
    if(min_dist_1 !=0):
        Min_Dists.append(min_dist_1)
        ind_1 = Copy_Dist.index(min_dist_1)
        Indices.append(ind_1)
    min_dist_2=min(Distance)
    Distance.remove(min_dist_2)
    if(min_dist_2 !=0):
        Min_Dists.append(min_dist_2)
        ind_2 = Copy_Dist.index(min_dist_2)
        Indices.append(ind_2)
    To_Remove = copy_dataset.index([p1_x, p1_y])
    copy_dataset.remove(copy_dataset[To_Remove])
I'm not sure how to solve this problem in general, but it is probably a lot faster to compute the distances in a vectorized fashion with NumPy:
import numpy as np
dataset = np.asarray(dataset)
# Broadcasting (n, 2) against (n, 1, 2) gives all pairwise differences, shape (n, n, 2)
dataset_copy = dataset[:, np.newaxis]
distance = np.sqrt(np.sum(np.square(dataset - dataset_copy), axis=-1))
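Thank you for the answers! I tried the approach below to solve the issue, and it ran pretty quickly.
from statistics import mean
import numpy as np
from scipy.spatial import distance

# Full pairwise distance matrix; after sorting, column 0 of each row is the point itself
D = distance.squareform(distance.pdist(dataset))
closest = np.argsort(D, axis=1)
# Distance from every point to its nearest neighbour
nearest_dists = []
for i in range(len(dataset)):
    nearest_dists.append(D[i][closest[i][1]])
avg_dist = int(mean(nearest_dists))
to_remove = set()
for i in range(len(dataset)):
    d1 = D[i][closest[i][1]]
    d2 = D[i][closest[i][2]]
    # Flag points whose two nearest neighbours both deviate from the average spacing
    if abs(avg_dist - d1) > 2 and abs(avg_dist - d2) > 2:
        print(dataset[i])
        to_remove.add(i)
# Remove flagged points after the loop so the indices into D stay valid
dataset = [p for i, p in enumerate(dataset) if i not in to_remove]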
If you need all distances at once:
distances = scipy.spatial.distance_matrix(dataset, dataset)
If you need distances of one point to all others:
for pt in dataset:
    distances = scipy.spatial.distance_matrix([pt], dataset)[0]
    # distances.min() will be 0 because the point has 0 distance to itself
    # the nearest neighbor will be the second element in sorted order
    indices = np.argpartition(distances, 1) # or use argsort for a complete sort
    nearest_neighbor = indices[1]
Documentation: distance_matrix, argpartition
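For a dataset of this size, a spatial index avoids building the full n-by-n distance matrix. The following is only a sketch under assumptions: it swaps in scipy.spatial.cKDTree (not used in the answers above), and the sample points (a row of evenly spaced points plus one stray point) are invented for illustration.

```python
import numpy as np
from scipy.spatial import cKDTree

# Invented data: 50 points on a horizontal line, spacing 1, plus one stray point
xs = np.arange(50, dtype=float)
points = np.column_stack([xs, np.zeros_like(xs)])
points = np.vstack([points, [[25.0, 30.0]]])  # index 50 lies far off the line

tree = cKDTree(points)
# k=2 because the closest hit for every query point is the point itself (distance 0)
dists, idx = tree.query(points, k=2)
nn_dist = dists[:, 1]

# The point whose nearest neighbour is farthest away is the outlier candidate
outlier_index = int(np.argmax(nn_dist))
print(outlier_index, nn_dist[outlier_index])
```

Whether "largest nearest-neighbour distance" is the right outlier criterion depends on the data; the threshold logic from the answers above could be applied to nn_dist instead.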
I'm having trouble vectorizing a function so that it applies efficiently to a NumPy array.
My program's inputs:
A pos_part 2D array with Nb_particles rows and 3 columns (basically x, y, z coordinates; only z is relevant for the part that bothers me). Nb_particles can be up to several hundred thousand.
A prop_part 1D array with Nb_particles values. This part I have covered; it is created with some nice NumPy functions. Here I just use a basic distribution that resembles the real values.
A z_distances 1D array, a simple np.arange between z=0 and z=z_max.
Then comes the calculation that takes time, because I can't find a way to do it properly using only NumPy array operations. What I want to do is:
For every distance z_i in z_distances, sum all values of prop_part whose corresponding particle coordinate satisfies z_particle < z_i. This returns a 1D array of the same length as z_distances.
My ideas so far:
Version 0: a for loop with enumerate and np.where to retrieve the indices of the values I need to sum. Obviously quite slow.
Version 1: a mask on a new array (a combination of z coordinates and particle properties), then a sum over the masked array. Seems better than v0.
Version 2: another mask and np.vectorize, but I understand it's not efficient, as vectorize is basically a for loop. Still seems better than v0.
Version 3: I'm trying to use a mask in a function that I can apply directly to z_distances, but it's not working so far.
So, here I am. There may be something to do with a sort and a cumulative sum, but I don't know how to do this, so any help would be greatly appreciated. The code below should make things clearer.
Thanks in advance.
import numpy as np
import time
import matplotlib.pyplot as plt
# Creation of particles' positions
Nb_part = 150_000
pos_part = 10*np.random.rand(Nb_part,3)
pos_part[:,0] = pos_part[:,1] = 0
# useful property creation
beta = 1/1.5
prop_part = (1/beta)*np.exp(-pos_part[:,2]/beta)
z_distances = np.arange(0,10,0.1)
#my version 0
t0=time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = np.where(pos_part[:,2]<val_dist)[0]
    result[index_dist] = sum(prop_part[i] for i in positions)
print("v0 :",time.time()-t0)
#A graph to help understand
plt.figure()
plt.plot(z_distances,result, c="red")
plt.ylabel("Sum of particles' useful property for particles with z-pos<d")
plt.xlabel("d")
#version 1 ??
t1=time.time()
combi = np.column_stack((pos_part[:,2],prop_part))
result2 = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    mask = (combi[:,0]<val_dist)
    result2[index_dist]=sum(combi[:,1][mask])
print("v1 :",time.time()-t1)
plt.plot(z_distances,result2, c="blue")
#version 2
t2=time.time()
def themask(a):
    mask = (combi[:,0]<a)
    return sum(combi[:,1][mask])
thefunc = np.vectorize(themask)
result3 = thefunc(z_distances)
print("v2 :",time.time()-t2)
plt.plot(z_distances,result3, c="green")
### This does not work so far
# version 3
# =============================
# t3=time.time()
# def thesum(a):
# mask = combi[combi[:,0]<a]
# return sum(mask[:,1])
# result4 = thesum(z_distances)
# print("v3 :",time.time()-t3)
# =============================
You can get a lot more performance by writing your first version entirely in NumPy. Replace Python's sum with np.sum, and instead of the "for i in positions" generator expression, simply index with the positions mask you are creating anyway.
Indeed, the np.where is not necessary and my best version looks like:
#my version 0
t0=time.time()
result = np.empty(len(z_distances))
for index_dist, val_dist in enumerate(z_distances):
    positions = pos_part[:, 2] < val_dist
    result[index_dist] = np.sum(prop_part[positions])
print("v0 :",time.time()-t0)
# out: v0 : 0.06322097778320312
You can get a bit faster if z_distances is very long by using Numba.
The first call to calc includes JIT compilation overhead, which we can keep out of the timing by first running the function on a small subset of z_distances.
The code below achieves roughly a factor-of-two speedup over pure NumPy on my laptop.
import numba as nb
@nb.njit(parallel=True)
def calc(result, z_distances):
    n = z_distances.shape[0]
    for ii in nb.prange(n):
        pos = pos_part[:, 2] < z_distances[ii]
        result[ii] = np.sum(prop_part[pos])
    return result
result4 = np.zeros_like(result)
# _t = time.time()
# calc(result4, z_distances[:10])
# print(time.time()-_t)
t3 = time.time()
result4 = calc(result4, z_distances)
print("v3 :", time.time()-t3)
plt.plot(z_distances, result4)
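The question hints at a sort plus a cumulative sum, and that idea removes the loop over z_distances entirely. Below is a minimal sketch of it, assuming the same kind of data as in the question (the variable names z, prop, csum and counts are introduced here for illustration). It sorts the particles by z once, builds a running total of the property, and uses np.searchsorted to find how many particles lie strictly below each threshold.

```python
import numpy as np

rng = np.random.default_rng(0)
z = 10 * rng.random(150_000)            # stands in for pos_part[:, 2]
beta = 1 / 1.5
prop = (1 / beta) * np.exp(-z / beta)   # stands in for prop_part
z_distances = np.arange(0, 10, 0.1)

# Sort particles by z once and accumulate the property in that order
order = np.argsort(z)
z_sorted = z[order]
csum = np.concatenate(([0.0], np.cumsum(prop[order])))

# side="left" counts the particles with z strictly less than each threshold
counts = np.searchsorted(z_sorted, z_distances, side="left")
result = csum[counts]

# Cross-check against the masked loop from the answer above
reference = np.array([prop[z < d].sum() for d in z_distances])
print(np.allclose(result, reference))
```

The cost is one sort, O(N log N), plus a binary search per threshold, instead of a full pass over all particles for every value in z_distances.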
I have an adjacency matrix stored as a pandas.DataFrame:
node_names = ['A', 'B', 'C']
a = pd.DataFrame([[1,2,3],[3,1,1],[4,0,2]],
index=node_names, columns=node_names)
a_numpy = a.as_matrix()
I'd like to create an igraph.Graph from either the pandas or the numpy adjacency matrices. In an ideal world the nodes would be named as expected.
Is this possible? The tutorial seems to be silent on the issue.
In igraph you can use igraph.Graph.Adjacency to create a graph from an adjacency matrix without having to use zip. There are some things to be aware of when a weighted adjacency matrix is used and stored in a np.array or pd.DataFrame.
igraph.Graph.Adjacency can't take an np.array as an argument, but that is easily solved using tolist.
Integers in the adjacency matrix are interpreted as the number of edges between nodes rather than as weights; this is solved by passing the adjacency matrix as booleans.
An example of how to do it:
import igraph
import pandas as pd
node_names = ['A', 'B', 'C']
a = pd.DataFrame([[1,2,3],[3,1,1],[4,0,2]], index=node_names, columns=node_names)
# Get the values as np.array, it's more convenient.
A = a.values
# Create graph, A.astype(bool).tolist() or (A / A).tolist() can also be used.
g = igraph.Graph.Adjacency((A > 0).tolist())
# Add edge weights and node labels.
g.es['weight'] = A[A.nonzero()]
g.vs['label'] = node_names # or a.index/a.columns
You can reconstruct your adjacency dataframe using get_adjacency by:
df_from_g = pd.DataFrame(g.get_adjacency(attribute='weight').data,
columns=g.vs['label'], index=g.vs['label'])
(df_from_g == a).all().all() # --> True
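Assigning A[A.nonzero()] as the edge weights only works if the edges were created in the same order in which NumPy lists the nonzero entries. The NumPy-only sketch below shows that order (row by row); that igraph's Adjacency constructor walks the matrix in this same order is an assumption taken from the answer above, not something this snippet verifies. The edges list is a name introduced for illustration.

```python
import numpy as np

node_names = ['A', 'B', 'C']
A = np.array([[1, 2, 3],
              [3, 1, 1],
              [4, 0, 2]])

# nonzero() returns the indices in row-major order: row 0 first, then row 1, ...
rows, cols = A.nonzero()
weights = A[rows, cols]

# (source, target, weight) triples in the order the weights would be assigned
edges = [(node_names[i], node_names[j], int(w))
         for i, j, w in zip(rows, cols, weights)]
for edge in edges:
    print(edge)
```

The zero entry at row C, column B produces no edge, so the list has eight entries rather than nine.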
Strictly speaking, an adjacency matrix is boolean, with 1 indicating the presence of a connection and 0 indicating the absence. Since many of the values in your a_numpy matrix are > 1, I will assume that they correspond to edge weights in your graph.
import igraph
import numpy as np
# get the row, col indices of the non-zero elements in your adjacency matrix
conn_indices = np.where(a_numpy)
# get the weights corresponding to these indices
weights = a_numpy[conn_indices]
# a sequence of (i, j) tuples, each corresponding to an edge from i -> j
# (list() makes it a concrete list, since zip is lazy in Python 3)
edges = list(zip(*conn_indices))
# initialize the graph from the edge sequence
G = igraph.Graph(edges=edges, directed=True)
# assign node names and weights to be attributes of the vertices and edges
# respectively
G.vs['label'] = node_names
G.es['weight'] = weights
# I will also assign the weights to the 'width' attribute of the edges. this
# means that igraph.plot will set the line thicknesses according to the edge
# weights
G.es['width'] = weights
# plot the graph, just for fun
igraph.plot(G, layout="rt", labels=True, margin=80)
This is possible with igraph.Graph.Weighted_Adjacency as
g = igraph.Graph.Weighted_Adjacency(a.to_numpy().tolist())
pandas.DataFrame.as_matrix has been deprecated,
so pandas.DataFrame.to_numpy should be used instead.
Additionally the numpy.ndarray given by a.to_numpy() must be converted to a list with tolist() before being passed to Weighted_Adjacency.
The node names can be stored as another attribute with
g.vs['name'] = node_names
I'm fairly new to Python and PyMC, and making rapid progress, but I'm confused about how to set deterministic values in a 2D matrix. I cannot get the model below to parse correctly. The problem relates to setting the value of theta in the model.
import numpy as np
import pymc
Define known variables:
N = 2
T = 10
tau = 1
Define the model, which I cannot get to parse correctly. It's the allocation of theta that I'm having trouble with. The aim is to get samples of D and x. Theta is just an intermediate variable, but I need to keep it as it's used in more complex variations of the model.
def NAFCgenerator():
    D = np.empty(T, dtype=object)
    theta = np.empty([N,T], dtype=object)
    x = np.empty([N,T], dtype=object)
    # true location of signal
    for t in range(T):
        D[t] = pymc.DiscreteUniform('D_%i' % t, lower=0, upper=N-1)
    for t in range(T):
        for n in range(N):
            @pymc.deterministic(plot=False)
            def temp_theta(dt=D[t], n=n):
                return dt==n
            theta[n,t] = temp_theta
            x[n,t] = pymc.Normal('x_%i,%i' % (n,t),
                                 mu=theta[n,t], tau=tau)
    return locals()
** EDIT **
Explicit indexing is useful for me as I'm learning both PyMC and Python. But it seems that extracting MCMC samples is a bit clunky, e.g.
D0values = pymc_generator.trace('D_0')[:]
But I am probably missing something. I did manage to get a vectorised version working:
# Approach 1b - actually quite promising
def NAFCgenerator():
    # NOTE TO SELF. It's important to declare these as objects
    D = np.empty(T, dtype=object)
    theta = np.empty([N,T], dtype=object)
    x = np.empty([N,T], dtype=object)
    # true location of signal
    D = pymc.Categorical('D', spatial_prior, size=T)
    # displayed stimuli
    @pymc.deterministic(plot=False)
    def theta(D=D):
        theta = np.zeros([N,T])
        theta[0,D==0]=1
        theta[1,D==1]=1
        return theta
    #for n in range(N):
    x = pymc.Normal('x', mu=theta, tau=tau)
    return locals()
This seems to make it easier to get at the MCMC samples, for example:
Dvalues = pymc_generator.trace('D')[:]
In PyMC2, when creating deterministic nodes with decorators, the default is to take the node name from the function name. The solution is simple: specify the node name as a parameter for the decorator.
@pymc.deterministic(name='temp_theta_%d_%d'%(t,n), plot=False)
def temp_theta(dt=D[t], n=n):
    return dt==n
theta[n,t] = temp_theta
Here is a notebook that puts this in context.
I'd like to make a set of comparable empirical CDFs for a few numpy arrays (each of different length) and store these in a pandas dataframe:
a = scipy.randn(100)
b = scipy.randn(500)
# ECDF from statsmodels
cdf_a = ECDF(a)
cdf_b = ECDF(b)
The problem is that cdf_a.x and cdf_a.y have different lengths from cdf_b.x and cdf_b.y. I would like them to be the same length, i.e. computed on the same set of bins, so that they can be plotted on the same scale from a pandas DataFrame. As it stands, this is not possible:
df = pandas.DataFrame({"cdf_a": cdf_a.y, "cdf_b": cdf_b.y})
because the CDFs are not of the same length. How can I bin a and b using the same bins when computing their CDFs, so that I get comparable same-length vectors back?
Is this the best solution?
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
The way we do it in some goodness-of-fit tests is to stack the arrays, so that both CDFs are evaluated at all points from both arrays.
Then use np.searchsorted to get the ranking, i.e. the number of points in dataset 1 at or below x and the number of points in dataset 2 at or below x.
If I remember correctly, scipy.stats.ks_2samp does this; have a look at its source.
data1 = np.sort(data1)
data2 = np.sort(data2)
n1, n2 = len(data1), len(data2)
data_all = np.concatenate([data1,data2])
cdf1 = np.searchsorted(data1,data_all,side='right')/(1.0*n1)
cdf2 = (np.searchsorted(data2,data_all,side='right'))/(1.0*n2)
It appears that this is a good solution:
bins = np.linspace(0, 1, 10)
v1 = cdf_a(bins)
v2 = cdf_b(bins)
Then len(v1) == len(v2) and these can be plotted as CDFs of a, b on the same scale.
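The same idea can be sketched without statsmodels by evaluating each empirical CDF on one shared grid. This is only an illustration under assumptions: the helper ecdf_on_grid is a name made up here, and the grid spans the combined range of both samples rather than the fixed [0, 1] interval used above, since standard normal samples extend well outside it.

```python
import numpy as np


def ecdf_on_grid(sample, grid):
    """Fraction of sample values <= each grid point (hypothetical helper)."""
    s = np.sort(sample)
    return np.searchsorted(s, grid, side="right") / s.size


rng = np.random.default_rng(0)
a = rng.standard_normal(100)
b = rng.standard_normal(500)

# One grid covering both samples, so the two curves are directly comparable
grid = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), 50)
v1 = ecdf_on_grid(a, grid)
v2 = ecdf_on_grid(b, grid)

print(len(v1) == len(v2), v1[-1], v2[-1])
```

Both vectors have the length of the grid, so they fit into a single DataFrame indexed by the grid values.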