I am new to programming in general. For a project, I am trying to randomly choose outcomes, according to the probability of each outcome, for lotteries that I have generated, and I would like to use a loop to draw random numbers each time.
This is my code:
import numpy as np

p = np.arange(0.01, 1, 0.001, dtype=float)

alpha = 0.5
alpha = float(alpha)
alpha = np.zeros((1, len(p))) + alpha

def w(alpha, p):
    return np.exp(-(-np.log(p))**alpha)

w = w(alpha, p)

def P(w):
    return np.exp(np.log2(w))

prob_win = P(w)
prob_lose = 1 - prob_win

E = 10
E = float(E)
E = np.zeros((1, len(p))) + E

b = 0
b = float(b)
b = np.zeros((1, len(p))) + b

def A(E, b, prob_win):
    return (E - b * (1 - prob_win)) / prob_win

a = A(E, b, prob_win)
a = a.squeeze()

prob_array = (prob_win, prob_lose)
prob_matrix = np.vstack(prob_array).T.squeeze()
outcomes_array = (a, b)
outcomes_matrix = np.vstack(outcomes_array).T
outcome_pairs = np.vsplit(outcomes_matrix, len(p))
outcome_pairs = np.array(outcome_pairs).astype(float)  # np.float is deprecated; use the builtin float
prob_pairs = np.vsplit(prob_matrix, len(p))
prob_pairs = np.array(prob_pairs)
nominalized_prob_pairs = [pair / np.sum(pair)
                          for pair in np.vsplit(prob_pairs, len(p))]
The code works fine, but for the next line I would like to use a loop or something similar, because I want to get 5 realisations for each row/pair of probabilities. When I use size=5 I just get one long list, and I no longer know which values belong to which pair, as I do when size=1:
realisations = np.concatenate([np.random.choice(outcome_pairs[i].ravel(),
                                                size=1, p=nominalized_prob_pairs[i].ravel())
                               for i in range(len(outcome_pairs))])
Or, if I use size=5 as below, how can I match the realisations to the initial probabilities? Do I need to cut the array after every 5th element and then store the values in a matrix with 5 columns and a new row for every 5th element of the initial array? If yes, how could I do this?
realisations = np.concatenate([np.random.choice(outcome_pairs[i].ravel(),
                                                size=5, p=nominalized_prob_pairs[i].ravel())
                               for i in range(len(outcome_pairs))])
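For illustration, a minimal sketch of one way to keep the five draws for each pair on their own row (this assumes the outcome_pairs and nominalized_prob_pairs lists built above):

# Sketch: one row of realisations per probability pair
realisations = np.vstack([
    np.random.choice(outcome_pairs[i].ravel(),
                     size=5,
                     p=nominalized_prob_pairs[i].ravel())
    for i in range(len(outcome_pairs))
])
# realisations now has shape (len(p), 5): row i holds the 5 draws for pair i.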
What are you trying to produce exactly? Be more concise.
Here is some clean starter code with which you can produce linear data.
import numpy as np

def generate_data(n_samples, variance):
    # generate 2D data
    X = np.random.random((n_samples, 1))
    # add a column of ones to ease the calculus
    X = np.concatenate((np.ones((n_samples, 1)), X), axis=1)
    # generate two random coefficients
    W = np.random.random((2, 1))
    # construct targets with our data and weights
    y = X @ W
    # add some noise to our data
    y += np.random.normal(0, variance, (n_samples, 1))
    return X, y, W

if __name__ == "__main__":
    X, Y, W = generate_data(10, 0.5)
    # check a random value of x, for example
    for x in X:
        print(x, end=' --> ')
        if x[1] <= 0.4:
            print('prob <= 0.4')
        else:
            print('prob > 0.4')
I am trying to make my own CFD solver and one of the most computationally expensive parts is solving for the pressure term. One way to solve Poisson differential equations faster is by using a multigrid method. The basic recursive algorithm for this is:
function phi = V_Cycle(phi, f, h)
    % Recursive V-Cycle multigrid for solving the Poisson equation
    % (\nabla^2 phi = f) on a uniform grid of spacing h

    % Pre-smoothing
    phi = smoothing(phi, f, h);

    % Compute residual errors
    r = residual(phi, f, h);

    % Restriction
    rhs = restriction(r);
    eps = zeros(size(rhs));

    % Stop recursion at the smallest grid size, otherwise continue recursion
    if smallest_grid_size_is_achieved
        eps = smoothing(eps, rhs, 2*h);
    else
        eps = V_Cycle(eps, rhs, 2*h);
    end

    % Prolongation and correction
    phi = phi + prolongation(eps);

    % Post-smoothing
    phi = smoothing(phi, f, h);
end
I've attempted to implement this algorithm myself (code at the end of this question), but it is very slow and doesn't give good results, so evidently it is doing something wrong. I've spent too long trying to find out why, and I think it's worthwhile seeing if anyone can help.
If I use a grid of 2^5 by 2^5 points, it can solve the equation and gives reasonable results. However, as soon as I go above this, it takes exponentially longer to solve and basically gets stuck at some level of inaccuracy, no matter how many V-cycles are performed. At 2^7 by 2^7 points, the code takes far too long to be useful.
I think my main issue is that my implementation of the Jacobi iteration uses dense linear algebra to calculate the update at each step. This should, in general, be fast; however, the update matrix A has n*m rows and columns, so for a 2^7 by 2^7 grid computing dot products with it is expensive. Since most of the entries are zeros, should I calculate the result using a different method?
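For what it's worth, a minimal sketch of a stencil-based Jacobi sweep that avoids forming the dense matrix at all; this is an assumed alternative formulation (with the usual h^2 factor), not the exact update used in the code below:

import numpy as np

def jacobi_sweep(phi, f, h, nsteps=10):
    """A few Jacobi sweeps for nabla^2(phi) = f using array slicing,
    so no (n*m) x (n*m) matrix is ever built."""
    for _ in range(nsteps):
        new = phi.copy()
        new[1:-1, 1:-1] = 0.25 * (phi[2:, 1:-1] + phi[:-2, 1:-1] +
                                  phi[1:-1, 2:] + phi[1:-1, :-2] -
                                  h**2 * f[1:-1, 1:-1])
        phi = new
    return phi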
If anyone has experience with multigrid methods, I would appreciate any advice!
Thanks.
My code:
# -*- coding: utf-8 -*-
"""
Created on Tue Dec 29 16:24:16 2020

@author: mclea
"""
import numpy as np
import matplotlib.pyplot as plt
from scipy.signal import convolve2d
from mpl_toolkits.mplot3d import Axes3D
from scipy.interpolate import griddata
from matplotlib import cm


def restrict(A):
    """
    Creates a new grid of points which is half the size of the original
    grid in each dimension.
    """
    n = A.shape[0]
    m = A.shape[1]
    new_n = int((n-2)/2 + 2)
    new_m = int((m-2)/2 + 2)
    new_array = np.zeros((new_n, new_m))
    for i in range(1, new_n-1):
        for j in range(1, new_m-1):
            ii = int((i-1)*2) + 1
            jj = int((j-1)*2) + 1
            # print(i, j, ii, jj)
            new_array[i, j] = np.average(A[ii:ii+2, jj:jj+2])
    new_array = set_BC(new_array)
    return new_array


def interpolate_array(A):
    """
    Creates a grid of points which is double the size of the original
    grid in each dimension. Uses linear interpolation between grid points.
    """
    n = A.shape[0]
    m = A.shape[1]
    new_n = int((n-2)*2 + 2)
    new_m = int((m-2)*2 + 2)
    new_array = np.zeros((new_n, new_m))
    i = (np.indices(A.shape)[0]/(A.shape[0]-1)).flatten()
    j = (np.indices(A.shape)[1]/(A.shape[1]-1)).flatten()
    A = A.flatten()
    new_i = np.linspace(0, 1, new_n)
    new_j = np.linspace(0, 1, new_m)
    new_ii, new_jj = np.meshgrid(new_i, new_j)
    new_array = griddata((i, j), A, (new_jj, new_ii), method="linear")
    return new_array


def adjacency_matrix(rows, cols):
    """
    Creates the adjacency matrix for an n by m shaped grid
    """
    n = rows*cols
    M = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r*cols + c
            # Two inner diagonals
            if c > 0: M[i-1, i] = M[i, i-1] = 1
            # Two outer diagonals
            if r > 0: M[i-cols, i] = M[i, i-cols] = 1
    return M


def create_differences_matrix(rows, cols):
    """
    Creates the central differences matrix A for an n by m shaped grid
    """
    n = rows*cols
    M = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r*cols + c
            # Two inner diagonals
            if c > 0: M[i-1, i] = M[i, i-1] = -1
            # Two outer diagonals
            if r > 0: M[i-cols, i] = M[i, i-cols] = -1
    np.fill_diagonal(M, 4)
    return M


def set_BC(A):
    """
    Sets the boundary conditions of the field
    """
    A[:, 0] = A[:, 1]
    A[:, -1] = A[:, -2]
    A[0, :] = A[1, :]
    A[-1, :] = A[-2, :]
    return A


def create_A(n, m):
    """
    Creates all the components required for the Jacobi update function
    for an n by m shaped grid
    """
    LaddU = adjacency_matrix(n, m)
    A = create_differences_matrix(n, m)
    invD = np.zeros((n*m, n*m))
    np.fill_diagonal(invD, 1/4)
    return A, LaddU, invD


def calc_RJ(rows, cols):
    """
    Calculates the Jacobi update matrix Rj for an n by m shaped grid
    """
    n = int(rows*cols)
    M = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r*cols + c
            # Two inner diagonals
            if c > 0: M[i-1, i] = M[i, i-1] = 0.25
            # Two outer diagonals
            if r > 0: M[i-cols, i] = M[i, i-cols] = 0.25
    return M


def jacobi_update(v, f, nsteps=1, max_err=1e-3):
    """
    Uses a Jacobi update matrix to solve nabla(v) = f
    """
    f_inner = f[1:-1, 1:-1].flatten()
    n = v.shape[0]
    m = v.shape[1]
    A, LaddU, invD = create_A(n-2, m-2)
    Rj = calc_RJ(n-2, m-2)
    update = True
    step = 0
    while update:
        v_old = v.copy()
        step += 1
        vt = v_old[1:-1, 1:-1].flatten()
        vt = np.dot(Rj, vt) + np.dot(invD, f_inner)
        v[1:-1, 1:-1] = vt.reshape((n-2), (m-2))
        err = v - v_old
        if step == nsteps or np.abs(err).max() < max_err:
            update = False
    return v, (step, np.abs(err).max())


def MGV(f, v):
    """
    Solves for nabla(v) = f using a multigrid method
    """
    # global A, r
    n = v.shape[0]
    m = v.shape[1]
    # If on the smallest grid size, compute the exact solution
    if n <= 6 or m <= 6:
        v, info = jacobi_update(v, f, nsteps=1000)
        return v
    else:
        # smoothing
        v, info = jacobi_update(v, f, nsteps=10, max_err=1e-1)
        A = create_A(n, m)[0]
        # calculate the residual
        r = np.dot(A, v.flatten()) - f.flatten()
        r = r.reshape(n, m)
        # downsample the residual error
        r = restrict(r)
        zero_array = np.zeros(r.shape)
        # interpolate the correction computed on a coarser grid
        d = interpolate_array(MGV(r, zero_array))
        # add the prolongated coarser-grid solution onto the finer grid
        v = v - d
        v, info = jacobi_update(v, f, nsteps=10, max_err=1e-6)
        return v


sigma = 0

# Setting up the grid
k = 6
n = 2**k + 2
m = 2**k + 2
hx = 1/n
hy = 1/m
L = 1
H = 1
x = np.linspace(0, L, n)
y = np.linspace(0, H, m)
XX, YY = np.meshgrid(x, y)

# Setting up the initial conditions
f = np.ones((n, m))
v = np.zeros((n, m))

# How many V cycles to perform
err = 1
n_cycles = 10
loop = True
cycle = 0

# Perform V cycles until converged or the maximum
# number of cycles is reached
while loop:
    cycle += 1
    v_new = MGV(f, v)
    if np.abs(v - v_new).max() < err:
        loop = False
    if cycle == n_cycles:
        loop = False
    v = v_new

print("Number of cycles " + str(cycle))
plt.contourf(v)
I realize that I'm not answering your question directly, but I do note that you have quite a few loops that will contribute some overhead cost. When optimizing code, I have found the following thread useful, particularly the line-profiler answer. That way you can focus on the "high time cost" lines and then start to ask more specific questions about opportunities to optimize:
How do I get time of a Python program's execution?
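As a concrete starting point, a minimal profiling sketch using only the standard library (it assumes the MGV, f and v names from the code above are in scope):

import cProfile
import pstats

# Profile a single V-cycle to see which functions dominate the runtime.
cProfile.run("MGV(f, v)", "mgv_profile")
stats = pstats.Stats("mgv_profile")
stats.sort_stats("cumulative").print_stats(10)  # show the 10 most expensive calls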
Overall, I want to calculate a Fourier transform of a given data set and filter out some of the frequencies with the biggest absolute values. So:
1) given a data array D with accompanying times t, 2) find the k biggest Fourier coefficients, and 3) remove those coefficients from the data, in order to filter certain signals out of the original data.
Something goes wrong at the end when plotting the filtered data set over the given times. I'm not exactly sure where the error is. The final 'filtered data' plot doesn't look even slightly smoothed and is somehow shifted compared with the original data. Is my code completely bad?
Part 1):
n = 1000
limit_low = 0
limit_high = 0.48
D = np.random.normal(0, 0.5, n) + np.abs(np.random.normal(0, 2, n) * np.sin(np.linspace(0, 3*np.pi, n))) + np.sin(np.linspace(0, 5*np.pi, n))**2 + np.sin(np.linspace(1, 6*np.pi, n))**2
scaling = (limit_high - limit_low) / (max(D) - min(D))
D = D * scaling
D = D + (limit_low - min(D)) # given data
t = linspace(0,D.size-1,D.size) # times
Part 2):
from numpy import linspace
import numpy as np
from scipy import fft, ifft
D_f = fft.fft(D) # fft of D-dataset
#---extract the k biggest coefficients out of D_f ---
k = 15
I, bigvals = [], []
for i in np.argsort(-D_f):
    if D_f[i] not in bigvals:
        bigvals.append(D_f[i])
        I.append(i)
    if len(I) == k:
        break
bigcofs = np.zeros(len(D_f))
bigcofs[I] = D_f[I] # array with only zeros in it except for the k maximal coefficients
Part 3):
D_filter = fft.ifft(bigcofs)
D_new = D - D_filter
p1=plt.plot(t,D,'r')
p2=plt.plot(t,D_new,'b');
plt.legend((p1[0], p2[0]), ('original data', 'filtered data'))
I appreciate your help, thanks in advance.
There were two issues I noticed:
you likely want the components with largest absolute value, so np.argsort(-np.abs(D_f)) instead of np.argsort(-D_f).
More subtly: bigcofs = np.zeros(len(D_f)) is of type float64 and was discarding the imaginary part at the line bigcofs[I] = D_f[I]. You can fix this with bigcofs = np.zeros(len(D_f), dtype=complex)
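A tiny illustration of the second point (a sketch of typical NumPy behaviour; the exact warning text may vary between versions):

import numpy as np

D_f = np.array([1 + 2j, 3 + 4j])
real_buf = np.zeros(2)                    # float64: assignment discards the imaginary part
real_buf[:] = D_f                         # emits ComplexWarning; real_buf == [1., 3.]
complex_buf = np.zeros(2, dtype=complex)
complex_buf[:] = D_f                      # keeps the full complex values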
I've improved your code slightly below to get the desired results:
import numpy as np
from scipy import fft, ifft
import matplotlib.pyplot as plt
n = 1000
limit_low = 0
limit_high = 0.48
N_THRESH = 10
D = 0.5*np.random.normal(0, 0.5, n) + 0.5*np.abs(np.random.normal(0, 2, n) * np.sin(np.linspace(0, 3*np.pi, n))) + np.sin(np.linspace(0, 5*np.pi, n))**2 + np.sin(np.linspace(1, 6*np.pi, n))**2
scaling = (limit_high - limit_low) / (max(D) - min(D))
D = D * scaling
D = D + (limit_low - min(D)) # given data
t = np.linspace(0,D.size-1,D.size) # times
# transformed data
D_fft = fft.fft(D)
# Create boolean mask for N largest indices
idx_sorted = np.argsort(-np.abs(D_fft))
idx = idx_sorted[0:N_THRESH]
mask = np.zeros(D_fft.shape).astype(bool)
mask[idx] = True
# Split fft above, below N_THRESH points:
D_below = D_fft.copy()
D_below[mask] = 0
D_above = D_fft.copy()
D_above[~mask] = 0
#inverse separated functions
D_above = fft.ifft(D_above)
D_below = fft.ifft(D_below)
# plot
plt.ion()
f, (ax1, ax2, ax3) = plt.subplots(3,1)
l1, = ax1.plot(t, D, c="r", label="original")
l2, = ax2.plot(t, D_above, c="g", label="top {} coeff. signal".format(N_THRESH))
l3, = ax3.plot(t, D_below, c="b", label="remaining signal")
f.legend(handles=[l1,l2,l3])
plt.show()
I'm trying to use logistic regression on the popularity of hit songs on Spotify from 2010-2019, based on their durations and durability, with the data collected from a .csv file. Basically, since the popularity value of each song is numerical, I have converted each of them to a binary value, "0" or "1". If the popularity value of a hit song is 70 or less, I replace it with 0; if its value is more than 70, I replace it with 1. The rest of my code is pretty standard for creating a sigmoid function, yet the end result is a straight line instead of a sigmoid curve.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('top10s [SubtitleTools.com] (2).csv')

BPM = df.bpm
BPM = np.array(BPM)
Energy = df.nrgy
Energy = np.array(Energy)
Dance = df.dnce
Dance = np.array(Dance)
dB = df.dB
dB = np.array(dB)
Live = df.live
Live = np.array(Live)
Valence = df.val
Valence = np.array(Valence)
Acous = df.acous
Acous = np.array(Acous)
Speech = df.spch
Speech = np.array(Speech)

df.loc[df['popu'] <= 70, 'popu'] = 0
df.loc[df['popu'] > 70, 'popu'] = 1

def Logistic_Regression(X, y, iterations, alpha):
    ones = np.ones((X.shape[0], ))
    X = np.vstack((ones, X))
    X = X.T
    b = np.zeros(X.shape[1])
    for i in range(iterations):
        z = np.dot(X, b)
        p_hat = sigmoid(z)
        gradient = np.dot(X.T, (y - p_hat))
        b = b + alpha * gradient
        if (i % 1000 == 0):
            print('LL, i ', log_likelihood(X, y, b), i)
    return b

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_likelihood(X, y, b):
    z = np.dot(X, b)
    LL = np.sum(y*z - np.log(1 + np.exp(z)))
    return LL

def LR1():
    Dur = df.dur
    Dur = np.array(Dur)
    Pop = df.popu
    Pop = [int(i) for i in Pop]; Pop = np.array(Pop)
    plt.figure(figsize=(10, 8))
    colormap = np.array(['r', 'b'])
    plt.scatter(Dur, Pop, c=colormap[Pop], alpha=.4)
    b = Logistic_Regression(Dur, Pop, iterations=8000, alpha=0.00005)
    print('Done')
    p_hat = sigmoid(np.dot(Dur, b[1]) + b[0])
    idxDur = np.argsort(Dur)
    plt.plot(Dur[idxDur], p_hat[idxDur])
    plt.show()

LR1()
df
Your logistic-regression parameters aren't coming out correctly, so something is wrong in your gradient descent.
If I do
from sklearn.linear_model import LogisticRegression
import pandas as pd

df = pd.DataFrame({'popu': [0, 1, 0, 1, 1, 0, 0, 1, 0, 0],
                   'dur': [217, 283, 200, 295, 221, 176, 206, 260, 217, 213]})
Dur = df['dur'].values
Pop = df['popu'].values

logreg = LogisticRegression()
logreg.fit(Dur.reshape([10, 1]), Pop)
print(logreg.coef_)
print(logreg.intercept_)
I get [0.86473507, -189.79655798]
whereas your params (b) come out [0.012136874150412973 -0.2430389407767768] for this data.
Plot of your vs scikit logregs here
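A minimal sketch of how such a comparison plot could be produced on the toy data above (this assumes Dur and Pop are the toy df columns as arrays, b comes from running the question's Logistic_Regression on them, and logreg is the fitted scikit-learn model):

import numpy as np
import matplotlib.pyplot as plt

xs = np.linspace(Dur.min(), Dur.max(), 200)
plt.scatter(Dur, Pop, alpha=0.4, label='data')
# b[0] is the intercept and b[1] the slope, matching the question's usage
plt.plot(xs, 1 / (1 + np.exp(-(b[0] + b[1] * xs))), label='your gradient descent')
plt.plot(xs, 1 / (1 + np.exp(-(logreg.intercept_[0] + logreg.coef_[0][0] * xs))), label='scikit-learn')
plt.xlabel('duration')
plt.ylabel('P(popu = 1)')
plt.legend()
plt.show()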
Suppose I have a function prob(x, mu, sig) that takes three inputs,
with sizes:
x = 1 x 3
mu = 1 x 3
sig = 3 x 3
Now, I have a dataset X, a mean matrix M and a std. deviation matrix sigma.
Their sizes are:
X : m x 3
mean : k x 3
sigma : k x 3 x 3
For each of the m rows of X, I want to pass all k components to the function prob to calculate my responsibility values.
I can pass the values one by one using for loops.
What would be a better way of doing this in numpy?
The related code, for reference:
responsibility = np.zeros((X.shape[0], k))
s = np.zeros(k)
for i in np.arange(X.shape[0]):
    for j in np.arange(k):
        s[j] = prob(X[i], MU[j], SIGMA[j])
    s = s/s.sum()
    responsibility[i] = s
responsibility = np.transpose(responsibility)
If using a single for loop is acceptable, then you can probably use the following:
import itertools

sigma = sigma.reshape(k, 9)                        # flatten each 3x3 so it can be zipped
zipped_array = list(zip(mean, sigma))              # k pairs of (mean_j, flattened sigma_j)
all_possible_combo = list(itertools.product(X, zipped_array))
list_len = len(all_possible_combo)                 # = m * k

s = np.zeros(list_len)
for i in range(list_len):
    X_arow = all_possible_combo[i][0]
    mean_single = all_possible_combo[i][1][0]
    sigma_single = all_possible_combo[i][1][1].reshape((3, 3))
    s[i] = prob(X_arow, mean_single, sigma_single)

responsibility = s.reshape((X.shape[0], k))
responsibility = responsibility / responsibility.sum(axis=1, keepdims=True)  # normalise each row
responsibility = np.transpose(responsibility)
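If prob is in fact a multivariate normal density (an assumption, since its body isn't shown), a sketch that loops only over the k components rather than over all m*k combinations:

import numpy as np
from scipy.stats import multivariate_normal

# Assumed shapes: X is (m, 3), MU is (k, 3), SIGMA is (k, 3, 3).
def responsibilities(X, MU, SIGMA):
    k = MU.shape[0]
    dens = np.empty((X.shape[0], k))
    for j in range(k):
        # evaluate component j's density at all m points at once
        dens[:, j] = multivariate_normal(mean=MU[j], cov=SIGMA[j]).pdf(X)
    dens /= dens.sum(axis=1, keepdims=True)  # normalise each row over the k components
    return dens.T                            # matches the transposed layout used above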
I've written some beginner code to calculate the coefficients of a simple linear model using the normal equation.
# Modules
import numpy as np

# Loading data set
X, y = np.loadtxt('ex1data3.txt', delimiter=',', unpack=True)
data = np.genfromtxt('ex1data3.txt', delimiter=',')

def normalEquation(X, y):
    m = int(np.size(data[:, 1]))

    # This is the feature / parameter (2x2) vector that will
    # contain my minimized values
    theta = []

    # I create a bias_vector to add to my newly created X vector
    bias_vector = np.ones((m, 1))

    # I need to reshape my original X(m,) vector so that I can
    # manipulate it with my bias_vector; they need to share the same
    # dimensions.
    X = np.reshape(X, (m, 1))

    # I combine these two vectors together to get a (m, 2) matrix
    X = np.append(bias_vector, X, axis=1)

    # Normal Equation:
    # theta = inv(X^T * X) * X^T * y

    # For convenience I create a new, transposed X matrix
    X_transpose = np.transpose(X)

    # Calculating theta
    theta = np.linalg.inv(X_transpose.dot(X))
    theta = theta.dot(X_transpose)
    theta = theta.dot(y)

    return theta

p = normalEquation(X, y)
print(p)
Using the small data set found here:
http://www.lauradhamilton.com/tutorial-linear-regression-with-octave
I get the coefficients [-0.34390603; 0.2124426] from the above code instead of [24.9660; 3.3058]. Could anyone help clarify where I am going wrong?
You can implement the normal equation as below:
import numpy as np
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X] # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)
X_new = np.array([[0], [2]])
X_new_b = np.c_[np.ones((2, 1)), X_new] # add x0 = 1 to each instance
y_predict = X_new_b.dot(theta_best)
y_predict
This assumes X is an m by (n+1)-dimensional matrix where x_0 is always 1, and y is an m-dimensional vector.
import numpy as np
step1 = np.dot(X.T, X)
step2 = np.linalg.pinv(step1)
step3 = np.dot(step2, X.T)
theta = np.dot(step3, y) # if y is m x 1. If 1xm, then use y.T
Your implementation is correct. You've only swapped X and y (look closely at how they define x and y); that's why you get a different result.
The call normalEquation(y, X) gives [ 24.96601443 3.30576144] as it should.
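In other words, using the arrays loaded in the question:

# Swapping the arguments reproduces the expected coefficients from the tutorial page
p = normalEquation(y, X)
print(p)  # [24.96601443  3.30576144]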
Here is the normal equation in one line:
theta = np.dot(np.linalg.inv(np.dot(X.T,X)),np.dot(X.T,Y))
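For reference, a sketch of the same fit without forming an explicit inverse, which is generally preferable numerically (assuming X already contains the bias column and Y is the target vector):

import numpy as np

# Solve (X^T X) theta = X^T Y directly instead of inverting X^T X
theta = np.linalg.solve(X.T.dot(X), X.T.dot(Y))

# Or use least squares, which also handles rank-deficient X
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)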