I'm trying to figure out how to conditionally minimize a 4D matrix.
Let's start by creating some toy data (which is close to my real-world problem):
import numpy as np
t = np.arange(1960,1981,1)
N = np.arange(0,3,1)
k = np.arange(0,5,0.1)
k_matrix = ( np.tile(k,(len(N),1)).T * (N+1)/(N+2) ).T
p = np.arange(0.1,2.01,0.1)
theory = np.random.normal(10,1,[len(N),len(t),len(p)])
res2 = np.zeros([len(N),len(t),len(k),len(p)])
def calc_res2(N, t, k_matrix, p, theory):
    for N_ind, N_val in enumerate(N):
        for t_ind, t_val in enumerate(t):
            for k_ind, k_val in enumerate(k_matrix[N_ind]):
                for p_ind, p_val in enumerate(p):
                    res2[N_ind, t_ind, k_ind, p_ind] = (N_val*t_val - k_val*theory[N_ind, t_ind, p_ind])**2
    return res2
test = calc_res2(N,t,k_matrix,p,theory)
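Incidentally, the same res2 array can be built without the explicit loops via broadcasting. This is just a sketch that should reproduce calc_res2 above (the p argument is kept only to mirror the original signature):
def calc_res2_vec(N, t, k_matrix, p, theory):
    # shapes: N (nN,), t (nt,), k_matrix (nN, nk), theory (nN, nt, nP)
    # result shape: (nN, nt, nk, nP)
    return (N[:, None, None, None] * t[None, :, None, None]
            - k_matrix[:, None, :, None] * theory[:, :, None, :]) ** 2

# should match the loop version above
assert np.allclose(calc_res2_vec(N, t, k_matrix, p, theory), test)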
I want to find the indices/values of k_matrix (one per N) and p (one per t) such that the sum of test over t and N is minimal.
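In other words, given hypothetical index arrays k_idx (one k index per N value) and p_idx (one p index per t value), names I am introducing only for illustration, the quantity to minimise can be written with fancy indexing as:
k_idx = np.zeros(len(N), dtype=int)   # hypothetical choice: one k index per N
p_idx = np.zeros(len(t), dtype=int)   # hypothetical choice: one p index per t

total = test[np.arange(len(N))[:, None], np.arange(len(t))[None, :],
             k_idx[:, None], p_idx[None, :]].sum()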
I can see that this problem can be solved with nested for loops:
def k_multi_N(test, k_matrix, p):
    SUM_best = 1e99
    k0i_b, k1i_b, k2i_b = 0, 0, 0
    p_min_ind = None
    for k0_ind, k0 in enumerate(k_matrix[0]):
        # copy, so the in-place additions below do not modify test itself
        temp = test[0, :, k0_ind, :].copy()
        for k1_ind, k1 in enumerate(k_matrix[1]):
            temp += test[1, :, k1_ind, :]
            for k2_ind, k2 in enumerate(k_matrix[2]):
                temp += test[2, :, k2_ind, :]
                SUM = temp.min(axis=1).sum()
                if SUM < SUM_best:
                    SUM_best = SUM
                    p_min_ind = np.argmin(temp, axis=1)
                    k0i_b, k1i_b, k2i_b = k0_ind, k1_ind, k2_ind
                temp -= test[2, :, k2_ind, :]
            temp -= test[1, :, k1_ind, :]
    return p_min_ind, (k0i_b, k1i_b, k2i_b)

k_multi_N(test, k_matrix, p)
So the expected output is:
(array([12, 16, 14, 8, 14, 18, 1, 18, 9, 9, 15, 18, 9, 13, 9, 3, 3,
18, 13, 6, 19]),
(0, 49, 49))
but the computational efficiency will be very poor for large N and k vectors (my real-world case is 16*200 for N*k and 800*200 for t*k, so a brute-force search needs on the order of 200^16 iterations, each over an 800*200 matrix) :(
Of course, I considered a Numba solution, but it does not give me a significant speed-up (it still takes a very long time!).
I'm wondering about alternative, more computationally efficient ways to solve the problem.
Thanks!
EDIT: The question was significantly changed to clarify the problem. I appreciate the people who helped me to do it!
I am studying image processing with NumPy and am facing a problem with convolution filtering.
I would like to convolve a gray-scale image. (convolve a 2d Array with a smaller 2d Array)
Does anyone have an idea to refine my method?
I know that SciPy supports convolve2d but I want to make a convolve2d only by using NumPy.
What I have done
First, I made a 2D array of submatrices.
a = np.arange(25).reshape(5,5) # original matrix

submatrices = np.array([
     [a[:-2,:-2], a[:-2,1:-1], a[:-2,2:]],
     [a[1:-1,:-2], a[1:-1,1:-1], a[1:-1,2:]],
     [a[2:,:-2], a[2:,1:-1], a[2:,2:]]])
The submatrices look complicated, but what I am doing is shown in the following drawing.
Next, I multiplied each submatrix by the corresponding filter element.
conv_filter = np.array([[0,-1,0],[-1,5,-1],[0,-1,0]])
multiplied_subs = np.einsum('ij,ijkl->ijkl',conv_filter,submatrices)
and summed them.
np.sum(np.sum(multiplied_subs, axis = -3), axis = -3)
#array([[ 6, 7, 8],
# [11, 12, 13],
# [16, 17, 18]])
Thus this procedure can be called my convolve2d.
def my_convolve2d(a, conv_filter):
    submatrices = np.array([
         [a[:-2,:-2], a[:-2,1:-1], a[:-2,2:]],
         [a[1:-1,:-2], a[1:-1,1:-1], a[1:-1,2:]],
         [a[2:,:-2], a[2:,1:-1], a[2:,2:]]])
    multiplied_subs = np.einsum('ij,ijkl->ijkl', conv_filter, submatrices)
    return np.sum(np.sum(multiplied_subs, axis=-3), axis=-3)
However, I find this my_convolve2d troublesome for three reasons:
1. Generation of the submatrices is too awkward, difficult to read, and only works when the filter is 3*3.
2. The submatrices variable seems too big, since it is approximately nine times larger than the original matrix.
3. The summation seems a little non-intuitive. Simply put, ugly.
Thank you for reading this far.
An update, of sorts: I wrote a conv3d for myself. I will leave it here as public domain.
def convolve3d(img, kernel):
    # calc the size of the array of submatrices
    sub_shape = tuple(np.subtract(img.shape, kernel.shape) + 1)
    # alias for the function
    strd = np.lib.stride_tricks.as_strided
    # make an array of submatrices
    submatrices = strd(img, kernel.shape + sub_shape, img.strides * 2)
    # multiply the submatrices by the kernel and sum over the kernel axes
    convolved_matrix = np.einsum('hij,hijklm->klm', kernel, submatrices)
    return convolved_matrix
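For reference, a quick usage sketch (the arrays here are my own toy example, not from the post):
import numpy as np

img = np.random.rand(10, 10, 10)      # small 3D "image"
kernel = np.ones((3, 3, 3)) / 27.0    # simple box-blur kernel
out = convolve3d(img, kernel)
print(out.shape)                      # (8, 8, 8): one value per valid kernel position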
You could generate the subarrays using as_strided:
import numpy as np
a = np.array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
sub_shape = (3,3)
view_shape = tuple(np.subtract(a.shape, sub_shape) + 1) + sub_shape
strides = a.strides + a.strides
sub_matrices = np.lib.stride_tricks.as_strided(a,view_shape,strides)
To get rid of your second "ugly" sum, alter your einsum so that the output array only has the indices k and l; the summation over i and j then happens inside the einsum.
conv_filter = np.array([[0,-1,0],[-1,5,-1],[0,-1,0]])
m = np.einsum('ij,ijkl->kl',conv_filter,sub_matrices)
# [[ 6 7 8]
# [11 12 13]
# [16 17 18]]
Cleaned up using as_strided and @Crispin's einsum trick from above. It enforces the filter size onto the expanded shape and should even allow non-square inputs if the indices are compatible.
def conv2d(a, f):
    s = f.shape + tuple(np.subtract(a.shape, f.shape) + 1)
    strd = np.lib.stride_tricks.as_strided
    subM = strd(a, shape=s, strides=a.strides * 2)
    return np.einsum('ij,ijkl->kl', f, subM)
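As a quick sanity check with the arrays from the question (the filter is symmetric, so correlation and convolution coincide here):
a = np.arange(25).reshape(5, 5)
f = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
print(conv2d(a, f))
# [[ 6  7  8]
#  [11 12 13]
#  [16 17 18]]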
You can also use FFT (one of the faster methods to perform convolutions):
from numpy.fft import fft2, ifft2
import numpy as np

def fft_convolve2d(x, y):
    """2D convolution, using FFT"""
    fr = fft2(x)
    fr2 = fft2(np.flipud(np.fliplr(y)))
    m, n = fr.shape
    cc = np.real(ifft2(fr * fr2))
    cc = np.roll(cc, -m // 2 + 1, axis=0)   # integer shifts (Python 3)
    cc = np.roll(cc, -n // 2 + 1, axis=1)
    return cc
https://gist.github.com/thearn/5424195
Note that you must pad the filter to the same size as the image (place it in the middle of a zeros_like array) before calling this.
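Here is a small sketch of that padding step (the image and kernel are my own toy example; note that the FFT method treats the image boundaries as periodic):
import numpy as np

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)

# embed the small kernel in the middle of a zero array the same size as the image
padded = np.zeros_like(image)
r0 = (image.shape[0] - kernel.shape[0]) // 2
c0 = (image.shape[1] - kernel.shape[1]) // 2
padded[r0:r0 + kernel.shape[0], c0:c0 + kernel.shape[1]] = kernel

result = fft_convolve2d(image, padded)   # uses the function defined above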
cheers,
Dan
Check out all the convolution methods and their respective performance here: https://laurentperrinet.github.io/sciblog/posts/2017-09-20-the-fastest-2d-convolution-in-the-world.html
Also, I found the code snippet below to be simpler.
import numpy as np
from numpy.fft import fft2, ifft2

def np_fftconvolve(A, B):
    return np.real(ifft2(fft2(A) * fft2(B, s=A.shape)))
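A rough usage sketch (my own example arrays; remember that the FFT approach wraps around at the boundaries):
a = np.arange(25, dtype=float).reshape(5, 5)
f = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float)
print(np_fftconvolve(a, f))   # circular convolution of a with f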
I have a segmentation map (a numpy.ndarray) that contains objects labeled with unique numbers. I want to combine objects across multiple slices by labeling them with the same number. Specifically, I want to renumber objects based on a DataFrame containing centroid positions and the desired label value.
First, I created some mock labels and a DataFrame:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "slice": [0, 0, 0, 0, 1, 1, 1, 2, 2, 2],
    "number": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3],
    "x": [10, 20, 30, 40, 11, 21, 31, 12, 22, 32],
    "y": [10, 20, 30, 40, 11, 21, 31, 12, 22, 32]
})

def make_segmap(df):
    x, y = np.indices((50, 50))
    maps = []
    # Iterate over slices and coordinates
    for n_slice in df["slice"].unique():
        masks = []
        for row in df[df["slice"] == n_slice].iterrows():
            # Create circle
            mask_circle = (x - row[1]["x"])**2 + (y - row[1]["y"])**2 < 5**2
            # Random index number (here just a multiple)
            masks.append(mask_circle * row[1]["number"]*3)
        maps.append(np.max(masks, axis=0))
    return np.stack(maps, axis=0)
segmap = make_segmap(df)
For renumbering, this is what I came up with so far:
new_maps = []
# Iterate over slices
for n_slice in df["slice"].unique():
    new_labels = []
    for row in df[df["slice"] == n_slice].iterrows():
        # Find current value at position
        original_label = segmap[n_slice, row[1]["y"], row[1]["x"]]
        # Replace all label occurrences with the desired label from the DataFrame
        replaced_label = np.where(segmap[n_slice] == original_label, row[1]["number"], 0)
        new_labels.append(replaced_label)
    new_maps.append(np.max(new_labels, axis=0))
new_segmap = np.stack(new_maps, axis=0)
This works reasonably well but doesn't scale to larger datasets. The real dataset has thousands of objects across hundreds of slices and this approach takes very long to run (an hour or so). Are there any suggestions on how to replace multiple values at once to improve performance?
Thanks in advance.
You can use groupby to replace the current quadratic search algorithm with a (quasi-)linear one. Moreover, you can take advantage of NumPy's vectorization and broadcasting to remove the inner loop and make the computation faster.
Here is a faster implementation:
def make_segmap_fast(df):
    x, y = np.indices((50, 50))
    maps = []
    # Iterate over slices and coordinates
    for n_slice, subDf in df.groupby("slice"):
        subDf_x = subDf["x"].to_numpy()[:, None, None]
        subDf_y = subDf["y"].to_numpy()[:, None, None]
        subDf_number = subDf["number"].to_numpy()[:, None, None]
        # Create circles (one 50x50 mask per row, via broadcasting)
        mask_circle = (x - subDf_x)**2 + (y - subDf_y)**2 < 5**2
        # Random index number (here just a multiple)
        masks = mask_circle * subDf_number
        maps.append(np.max(masks, axis=0)*3)
    return np.stack(maps, axis=0)
On my machine, this is 2 times faster on the very small example (much more on bigger dataframes).
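As a quick sanity and timing check (my own snippet, not part of the original answer), the two versions can be compared like this; exact timings will of course vary by machine:
import timeit

# the two implementations should produce identical label maps
assert np.array_equal(make_segmap(df), make_segmap_fast(df))
print(timeit.timeit(lambda: make_segmap(df), number=100))
print(timeit.timeit(lambda: make_segmap_fast(df), number=100))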
So I am taking my first proper course in Python and I have stumbled upon some issues trying to calculate the correlation coefficient for a data set. I know that I can just use np.corrcoef, but I would like to be able to do it "by hand" as well. I have tried various combinations of the following code, but I keep getting an answer that is somehow higher than the one np.corrcoef gives me (approximately 0.62 compared to 0.57).
I was hoping that someone on here could maybe help me identify the problem in my code?
Best regards,
import numpy as np

k_m = np.array([22, 48, 76, 10, 22, 4, 68, 44, 10, 76, 14, 56])
km = np.array([63, 39, 61, 30, 51, 44, 74, 78, 55, 58, 41, 69])

gns_k_m = 0
gns_km = 0
cov = 0
sum_k_m = 0
sum_km = 0

# means
for k in range(len(k_m)):
    gns_k_m += k_m[k]/len(k_m)
for k in range(len(km)):
    gns_km += km[k]/len(km)
print(gns_k_m, gns_km)

# covariance (normalised by n - 1)
for k in range(len(k_m)):
    cov += (k_m[k]-gns_k_m)*(km[k]-gns_km)/(len(k_m)-1)
print(cov)

# standard deviations (normalised by n)
for k in range(len(k_m)):
    sum_k_m += (k_m[k]-gns_k_m)**2
sa_k_m = np.sqrt(sum_k_m/len(k_m))

for k in range(len(km)):
    sum_km += (km[k]-gns_km)**2
sa_km = np.sqrt(sum_km/len(km))

cor = cov/(sa_k_m*sa_km)
print(cor)
print(np.corrcoef(k_m, km))
The problem is a normalisation mismatch: you divide the covariance by len(k_m) - 1 but the standard deviations by len(k_m), whereas np.corrcoef effectively uses the same normalisation for both, so the factor cancels.
Hence, if you re-scale the result given by np.corrcoef by len(k_m) / (len(k_m) - 1), you obtain the same value as your manual implementation.
In [26]: new_res = (np.corrcoef(k_m, km) * len(k_m) / (len(k_m) - 1))[0, 1]
In [28]: np.allclose(new_res, cor)
Out[28]: True
In [29]: new_res
Out[29]: 0.6165427925911239
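Conversely, if you use the same normalisation for the covariance and the standard deviations, the manual version agrees with np.corrcoef directly (a minimal sketch of the idea above):
n = len(k_m)
cov_n = np.sum((k_m - k_m.mean()) * (km - km.mean())) / n   # divide by n here ...
sd_k_m = np.sqrt(np.sum((k_m - k_m.mean())**2) / n)         # ... and by n here
sd_km = np.sqrt(np.sum((km - km.mean())**2) / n)
print(cov_n / (sd_k_m * sd_km))   # ~0.57, same as np.corrcoef(k_m, km)[0, 1]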
I am trying to optimize a portfolio weight allocation which maximizes my return function while limiting risk, using the cvxopt module. My code is as below:
from cvxopt import matrix, solvers, spmatrix, sparse
from cvxopt.blas import dot
import numpy
import pandas as pd
import numpy as np
from datetime import datetime
solvers.options['show_progress'] = False
# solves the QP, where x is the allocation of the portfolio:
# minimize x'Px + q'x
# subject to Gx <= h
# Ax == b
#
# Input: n - # of assets
# avg_ret - nx1 matrix of average returns
# covs - nxn matrix of return covariance
# r_min - the minimum expected return that you'd
# like to achieve
# Output: sol - cvxopt solution object
dates = pd.date_range('2000-01-01', periods=6)
industry = ['industry', 'industry', 'utility', 'utility', 'consumer']
symbols = ['A', 'B', 'C', 'D', 'E']
zipped = list(zip(industry, symbols))
index = pd.MultiIndex.from_tuples(zipped)
noa = len(symbols)
data = np.array([[10, 11, 12, 13, 14, 10],
[10, 11, 10, 13, 14, 9],
[10, 10, 12, 13, 9, 11],
[10, 11, 12, 13, 14, 8],
[10, 9, 12, 13, 14, 9]])
market_to_market_price = pd.DataFrame(data.T, index=dates, columns=index)
rets = market_to_market_price / market_to_market_price.shift(1) - 1.0
rets = rets.dropna(axis=0, how='all')
# covariance of asset returns
P = matrix(rets.cov().values)
n = len(symbols)
q = matrix(np.zeros((n, 1)), tc='d')
G = matrix(-np.eye(n), tc='d')
h = matrix(-np.zeros((n, 1)), tc='d')
A = matrix(1.0, (1, n))
b = matrix(1.0)
sol = solvers.qp(P, q, G, h, A, b)
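In case it is useful, the optimal weights can be read off the solution object like this (a small usage note of mine, not from the original question):
weights = np.array(sol['x']).ravel()
print(sol['status'], weights, weights.sum())   # weights should be non-negative and sum to 1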
Should I use a Monte Carlo simulation to get the target risk while maximizing return? What's the best method for solving this problem? Thank you.
It is not as straightforward as one may think. The typical portfolio optimization problem is to minimize risk subject to a target return, which is a linearly constrained problem with a quadratic objective, i.e. a quadratic program (QP):
minimize x^T.P.x
subject to sum(x_i) = 1
avg_ret^T.x >= r_min
x >= 0 (long-only)
What you want here, to maximize return subject to a target risk, is a quadratically constrained quadratic program (QCQP), looking like:
maximize avg_ret^T.x
subject to sum(x_i) = 1
x^T.P.x <= risk_max
x >= 0 (long-only)
With a convex quadratic constraint, optimization happens over a more complicated cone containing a second-order cone factor. If you want to proceed with cvxopt, you have to convert the QCQP to a second-order cone program (SOCP), as cvxopt does not have an explicit solver for QCQPs. SOCP with cvxopt has a different matrix syntax from the typical QP, as you can see from the documentation. This blog post has a very nice walk-through of how to do this for this specific problem.
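For what it's worth, a rough sketch of how that conversion could be handed to cvxopt's SOCP solver (not a tested drop-in solution; avg_ret and risk_max are assumed inputs, and P must be positive definite for the Cholesky factorisation):
import numpy as np
from cvxopt import matrix, solvers

def max_return_socp(avg_ret, P, risk_max):
    """Maximize avg_ret'x s.t. sum(x) == 1, x >= 0, x'Px <= risk_max."""
    n = len(avg_ret)
    c = matrix(-np.asarray(avg_ret, dtype=float))             # maximize return = minimize -return
    # Factor P = L L^T, so that x'Px <= risk_max  <=>  ||L^T x|| <= sqrt(risk_max)
    L = np.linalg.cholesky(np.asarray(P, dtype=float))
    Gq = [matrix(np.vstack([np.zeros((1, n)), -L.T]))]         # second-order cone block
    hq = [matrix(np.concatenate([[np.sqrt(risk_max)], np.zeros(n)]))]
    Gl = matrix(-np.eye(n))                                    # long-only: x >= 0
    hl = matrix(np.zeros(n))
    A = matrix(np.ones((1, n)))                                # fully invested: sum(x) == 1
    b = matrix(1.0)
    return solvers.socp(c, Gl=Gl, hl=hl, Gq=Gq, hq=hq, A=A, b=b)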
I think you are trying to calculate the Sharpe portfolio. I believe it can be shown that this problem is equivalent to minimizing the risk (w'Pw) with an equality constraint on the return (w'*rets = 1). That will be easier to specify with the quadratic programmer qp.
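If it helps, here is a minimal sketch of that formulation (the function name is mine; it assumes the average returns make the constraint avg_ret'y == 1 attainable with y >= 0):
import numpy as np
from cvxopt import matrix, solvers

def sharpe_weights(avg_ret, P):
    """Minimize y'Py s.t. avg_ret'y == 1 and y >= 0, then rescale y to portfolio weights."""
    n = len(avg_ret)
    P = matrix(np.asarray(P, dtype=float))
    q = matrix(np.zeros(n))
    G = matrix(-np.eye(n))                                 # long-only: y >= 0
    h = matrix(np.zeros(n))
    A = matrix(np.asarray(avg_ret, dtype=float).reshape(1, n))
    b = matrix(1.0)
    y = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    return y / y.sum()                                     # normalise so the weights sum to 1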
Suppose I have two arrays indicating the x and y coordinates of a calibration curve.
X = [1,2,3,4,5,6,7,8,9,10,12,14,16,18,20,30,40,50]
Y = [2,4,6,8,10,12,14,16,18,20,24,28,32,36,40,60,80,100]
My example arrays above contain 18 points. You'll notice that the x values are not linearly spaced; there are more points at lower values of x.
Let's suppose I need to reduce the number of points in my calibration curve to 13 points. Obviously, I could just remove the first five or the last five points, but that would shorten my overall range of x values. To maintain range and minimise the space between x values I would preferentially remove values x= 2,4,6,8,10. Removing these x points and their respective y values would leave 13 points in the curve as required.
How could I do this point selection and removal automatically in Python? I.e. Is there an algorithm to pick the best x points from a list, where "best" is defined as keeping the points as close as possible while keeping the overall range and adhering to the new number of points.
Please note that the points remaining must be in the original lists, so I can't interpolate the 18 points on to a 13 point grid.
This maximizes the sum of the square roots of the distances between the chosen points, which in some sense spreads the points out as evenly as possible while preserving the range.
import itertools
list(max(itertools.combinations(sorted(X), 13),
         key=lambda l: sum((b - a) ** 0.5 for a, b in zip(l, l[1:]))))
Note that this is only feasible for small problems. The time complexity for selecting k points is O(k * (len(X) choose k)), so basically O(exp(len(X))). Don't even think about using this for, e.g., len(X) == 100 and k == 10.
X = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 14, 16, 18, 20, 30, 40, 50]
Y = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 24, 28, 32, 36, 40, 60, 80, 100]

assert len(X) == len(set(X)), "Duplicate X values found"

points = list(zip(X, Y))
points.sort()  # sorts by X

while len(points) > 13:
    # Find the index whose neighbouring X values are closest together
    i = min(range(1, len(points) - 1), key=lambda p: points[p + 1][0] - points[p - 1][0])
    points.pop(i)

print(points)
Output:
[(1, 2), (3, 6), (5, 10), (7, 14), (10, 20), (12, 24), (14, 28), (16, 32), (18, 36), (20, 40), (30, 60), (40, 80), (50, 100)]
If you want the original series again:
X, Y = zip(*points)
An algorithm that would achieve that:
1. Convert each number into the sum of the absolute differences to the numbers to its left and right. If a neighbour is missing (the first and last cases), use MAX_INT. For example, 1 would become MAX_INT; 2 would become 2; 10 would become 3.
2. Remove the first number with the lowest sum.
3. If you need to remove more numbers, go to step 1.
This would remove 2,4,6,8,10,3,...
Here is a recursive approach that repeatedly removes the point which will be the least missed:
def mostRedundantPoint(x):
    # returns the index, i, in the range 0 < i < len(x) - 1
    # that minimizes x[i+1] - x[i-1]
    # assumes len(x) > 2 and that x
    # is sorted in ascending order
    gaps = [x[i+1] - x[i-1] for i in range(1, len(x)-1)]
    i = gaps.index(min(gaps))
    return i+1

def reduceList(x, k):
    if len(x) <= k:
        return x
    else:
        i = mostRedundantPoint(x)
        return reduceList(x[:i] + x[i+1:], k)

X = [1,2,3,4,5,6,7,8,9,10,12,14,16,18,20,30,40,50]
print(reduceList(X,13))
# prints [1, 3, 5, 7, 10, 12, 14, 16, 18, 20, 30, 40, 50]
This list essentially agrees with your intended output since 7 vs. 8 have the same net effect. It is reasonably quick in the sense that it is almost instantaneous in reducing sorted([random.randint(1,10**6) for i in range(1000)]) from 1000 elements to 100 elements. The fact that it is recursive implies that it will blow the stack if you try to remove many more points than that, but with what seems to be your intended problem size that shouldn't be an issue. If need be, you could of course replace the recursion by a loop.