I'm new to Python and looking to solve an optimization problem with a number of constraints, where those constraints are functions of the outputs.
First two constraints are straightforward:
1) output1 + output2 + output3 = 1
2) output1, output2, and output3 must all be >= 0
Last constraint needs functions of the outputs to be EQUAL:
3) f(output1) == f(output2) == f(output3)
In this case the function array is produced by a matrix multiplication involving the outputs:
F = cov.dot(array([output1,output2,output3]))*array([output1,output2,output3])
f(output1) = F[0], f(output2) = F[1], f(output3) = F[2]
Hopefully I've described the problem clearly... Eventually I want to extend this to more outputs than 3.
What I have below gives me output values that don't appear to follow the constraints at all (gives me a negative value). I assume I'm entering the constraints wrong... or perhaps there is an easier way to do this with np.linalg.solve?
import numpy as np
from scipy.optimize import fsolve

cov = np.array([0.04, 0.0015, 0.03,
                0.0015, 0.0025, 0.000625,
                0.03, 0.000625, 0.0625]).reshape(3, 3)
weights = np.array([0.3, 0.2, 0.5])

def RC(w):
    return cov.dot(w)*w

riskcont = RC(weights)

def PV(riskcont):
    return np.sqrt(riskcont.sum())

portvol = PV(riskcont)

def ERC(z):
    w1 = z[0]
    w2 = z[1]
    w3 = z[2]
    # 1) weights sum to 100%
    out = [w1 + w2 + w3 - 1]
    # 2) weights above zero
    out.append((w1*w2*w3) > 0)
    # 3) riskcont must all be equal
    out.append(riskcont[0] == riskcont[1] == riskcont[2])  # == riskcont(w4)
    return out

z = fsolve(ERC, [1/3, 1/3, 1/3])
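For reference, a sketch of how this might be posed with scipy.optimize.minimize instead: fsolve solves a system of residual equations returning numbers, so boolean comparisons like the ones above cannot express constraints 2 and 3. One common workaround is to minimize the spread of the risk contributions subject to the sum constraint and nonnegativity bounds; the objective is zero exactly when all contributions are equal.

import numpy as np
from scipy.optimize import minimize

cov = np.array([0.04, 0.0015, 0.03,
                0.0015, 0.0025, 0.000625,
                0.03, 0.000625, 0.0625]).reshape(3, 3)

def risk_contributions(w):
    return cov.dot(w) * w  # same quantity as RC(w) above

def spread(w):
    rc = risk_contributions(w)
    return ((rc - rc.mean()) ** 2).sum()  # zero iff all contributions are equal

n = 3  # extends to more outputs by growing n and cov
res = minimize(spread,
               x0=np.full(n, 1.0 / n),                      # start at equal weights
               method='SLSQP',
               bounds=[(0, 1)] * n,                         # constraint 2
               constraints={'type': 'eq',
                            'fun': lambda w: w.sum() - 1})  # constraint 1
print(res.x, risk_contributions(res.x))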
I have two matrices mat1 and mat2 that are sparse (most entries are zero) and I'm not interested in the zero-valued entries: I look at the matrices from a graph-theoretical perspective where a zero means that there is no edge between the nodes.
How can I efficiently get the element-wise minimum over the non-zero entries only, using scipy.sparse matrices?
I.e. an equivalent of mat1.minimum(mat2) that would ignore implicit zeros.
Using dense matrices, it is fairly easy to do:
import numpy as np
nnz = np.where(np.multiply(mat1, mat2))
m = mat1 + mat2
m[nnz] = np.minimum(mat1[nnz], mat2[nnz])
But this would be very inefficient with sparse matrices.
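For concreteness, here is that dense recipe on a tiny example (the matrices are made up for illustration):

import numpy as np

mat1 = np.array([[0, 2, 4], [3, 0, 0]])
mat2 = np.array([[1, 3, 0], [0, 0, 5]])

nnz = np.where(np.multiply(mat1, mat2))    # positions where both are nonzero
m = mat1 + mat2                            # entries present in only one matrix pass through
m[nnz] = np.minimum(mat1[nnz], mat2[nnz])  # both present: take the min
print(m)
# [[1 2 4]
#  [3 0 5]]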
NB: a similar question has been asked before but did not get any relevant answer and there is a related PR on the scipy repo that proposes an implementation of this for (arg)min/max but not for minimum.
EDIT: to be a bit more specific, the desired behavior would be commutative: this nonzero-minimum would take all values present in only one of the two matrices, and the min of the entries that are present in both matrices.
Just in case someone also looks for this, my current implementation is below.
However, I'd appreciate any proposal that would either speed this up or reduce the memory footprint.
s = mat1.multiply(mat2)   # nonzero exactly where both matrices have entries
s.data[:] = 1.            # indicator of the intersection
a1 = mat1.copy()
a1.data[:] = 1.
a1 = (a1 - s).maximum(0)  # indicator of entries present only in mat1
a2 = mat2.copy()
a2.data[:] = 1.
a2 = (a2 - s).maximum(0)  # indicator of entries present only in mat2
res = mat1.multiply(a1) + mat2.multiply(a2) + \
      mat1.multiply(s).minimum(mat2.multiply(s))
If the sparse nonzeros are positive, an alternate way to get the correct UNION behavior out of maximum is to reverse the ordering of the values while keeping everything positive. Following your lead of manipulating .data explicitly, I found:
def sp_min_nz_positive(asp, bsp):  # asp and bsp are scipy sparse matrices
    amax = asp.max()
    bmax = bsp.max()
    abmaxplus = max(amax, bmax)  # + 1.0 : surprise! not needed.
    # invert the direction, while remaining positive
    arev = asp.copy()
    arev.data[:] = abmaxplus - asp.data[:]
    brev = bsp.copy()
    brev.data[:] = abmaxplus - bsp.data[:]
    out = arev.maximum(brev)
    # revert the direction of these positives
    out.data[:] = abmaxplus - out.data[:]
    return out
There may be some inexactness due to roundoff.
There was also a suggestion to use sparse internals. A rather generic function is sp.find, which returns the nonzero elements of anything. So you could also try a minimum that handles negative values too, with something like:
import scipy.sparse as sp

def sp_min_union(a, b):
    assert a.shape == b.shape
    assert sp.issparse(a) and sp.issparse(b)
    (ra, ca, _) = sp.find(a)  # over nonzeros only
    (rb, cb, _) = sp.find(b)  # over nonzeros only
    setab = set(zip(ra, ca)).union(zip(rb, cb))  # row-column union-of-nonzero
    r = []
    c = []
    v = []
    for (rr, cc) in setab:
        r.append(rr)
        c.append(cc)
        anz = a[rr, cc]
        bnz = b[rr, cc]
        assert anz != 0 or bnz != 0  # they came from *some* sp.find
        if anz == 0:
            anz = bnz            # present only in b
        elif bnz != 0:
            anz = min(anz, bnz)  # present in both: take the min
        v.append(anz)
    # choose whatever sparse output format you want; many are constructible as:
    return sp.csr_matrix((v, (r, c)), shape=a.shape)
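As a quick sanity check (a sketch, reusing the hypothetical matrices from the dense example above):

import numpy as np
import scipy.sparse as sp

mat1 = sp.csr_matrix(np.array([[0, 2, 4], [3, 0, 0]]))
mat2 = sp.csr_matrix(np.array([[1, 3, 0], [0, 0, 5]]))

print(sp_min_union(mat1, mat2).toarray())
# [[1 2 4]
#  [3 0 5]]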
I have a piece of code that computes the partitions of a set of (potentially duplicated) integers, but I am interested in the set of possible partitions and their multiplicities.
You can, for example, run the following code:
import numpy as np
from collections import Counter
import pandas as pd

def _B(i):
    # for a given multiindex i, we define _B(i) as the set of integers
    # containing i_j times the number j:
    if len(i) != 1:
        B = []
        for j in range(len(i)):
            B.extend(i[j]*[j])
    else:
        B = i*[0]
    return B

def _partition(collection):
    # from here: https://stackoverflow.com/a/62532969/8425270
    if len(collection) == 1:
        yield (collection,)
        return
    first = collection[0]
    for smaller in _partition(collection[1:]):
        # insert `first` in each of the subpartition's subsets
        for n, subset in enumerate(smaller):
            yield smaller[:n] + ((first,) + subset,) + smaller[n + 1:]
        # put `first` in its own subset
        yield ((first,),) + smaller

def to_list(tpl):
    # the final hierarchy is a list of lists
    return list(list(i) if isinstance(i, tuple) else i for i in tpl)

def _Pi(inst_B):
    # inst_B must be a tuple
    if type(inst_B) != tuple:
        inst_B = tuple(inst_B)
    pp = [tuple(sorted(p)) for p in _partition(inst_B)]
    c = Counter(pp)
    Pi = c.keys()
    N = list()
    for pi in Pi:
        N.append(c[pi])
    Pi = [to_list(pi) for pi in Pi]
    return Pi, N

if __name__ == "__main__":
    import cProfile
    pr = cProfile.Profile()
    pr.enable()
    sh = (3, 3, 3)
    rez = list()
    rez_sorted = list()
    rez_ref = list()
    for idx in np.ndindex(sh):
        if sum(idx) > 0:
            print(idx)
            Pi, N = _Pi(_B(idx))
            print(pd.DataFrame({'Pi': Pi, 'N': N * np.array([np.math.factorial(len(pi) - 1) for pi in Pi])}))
    pr.disable()
    # after your program ends
    pr.print_stats(sort="tottime")
This code computes, for several example tuples of integers (generated by np.ndindex), the partitions and counts I need. Everything happens in the _partition and _Pi functions; that is where you should look.
If you look closely at how these two functions work, you'll see that they compute every potential partition and THEN count how many times each appeared. For small problems this is fine, but as the problem size increases it starts to take a very long time. Try setting sh = (5, 5, 5) and you'll see what I mean.
So the problem is the following:
Is there a way to compute the partitions and their numbers of occurrences directly instead?
Edit: I cross-posted on MathOverflow there, and they propose a solution in this article, in Corollary 2.10 (page 10 of the PDF). The problem could be solved by implementing the sets p(v, r) from that corollary.
I was hoping, as in the univariate case, that those sets would have a nice recursive expression, but I could not find one yet.
Further edit: this problem is equivalent to finding all (multiset-)partitions of a multiset. If the solution for finding (set-)partitions of a set is given by partial Bell polynomials, here we need a multivariate version of these polynomials.
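In case it helps while searching for a closed form: sympy already ships Knuth's algorithm for enumerating the distinct partitions of a multiset directly, without generating every set partition and deduplicating afterwards. It does not give the multiplicities (those would still need a formula such as the one in the linked article), but it avoids the duplicate explosion. A minimal sketch:

from sympy.utilities.iterables import multiset_partitions

# each distinct partition of the multiset {0, 0, 1, 2} is yielded exactly once
for p in multiset_partitions([0, 0, 1, 2]):
    print(p)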
I have an application written in Python for calculating the minimal return value of a function. I am using scipy.optimize.minimize with SLSQP as the optimization method.
It runs in a loop, and to save time and keep it from just finding local minima I need it to use the x0 that I provide.
The problem seems to be that it does not care what x0 I give it; it just starts optimizing at random values. What am I doing wrong?
I have written a small test application to test x0 on the minimizer:
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import minimize

global log
log = []
counter = 0

def callback(x):
    global counter
    counter += 1
    log.append(x)
    print('u_guessx', x)
    return True

def objectivefunction(x, *arg):
    SUM = 2*x[0]**3 + 3*(3-x[0])**2 - 5*x[2]**1 + 50
    return SUM

# Defining Initial Conditions
u_guess = np.array([0 for u in range(3)])
#u_guess = np.zeros(4)
print("u shape: ", u_guess.shape)
print("u_init: ", u_guess)

# Simulation loop:
bounds_u = [(0, 20) for i in u_guess]

# Run Optimizer
solution_guess = minimize(objectivefunction,
                          u_guess,
                          method='SLSQP',
                          callback=callback,
                          bounds=bounds_u,
                          options={'ftol': 1e-9, 'disp': True},
                          )
u_guess = solution_guess.x
u_opt = u_guess.item(0)
print("type(solution_guess.x): ", type(solution_guess.x))
print("u_opt: ", u_opt)
print("solution_guess.x: ", solution_guess.x)
#print("log: ", log)
print("counter: ", counter)
First of all, isn't your objectivefunction() wrong? I suspect it should be:
def objectivefunction(x, *arg):
    SUM = 2*x[0]**3 + 3*(3-x[1])**2 - 5*x[2]**1 + 50
    return SUM
In the original function, x[1] is not used, so the algorithm is insensitive to x[1].
Secondly, the function callback() is called after each iteration, so the first output is not the initial condition but the first guess for the minimum based on your initial conditions. If I run the corrected program and change the initial conditions, it outputs different guesses; but for different runs with the same initial conditions, it always outputs the same guesses. There is no randomness (assuming I use the corrected version of objectivefunction).
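If you want to convince yourself that SLSQP really does start from x0, one option (a sketch) is to instrument the objective itself, since the callback only fires after a complete iteration:

import numpy as np
from scipy.optimize import minimize

evals = []  # every point at which the solver evaluates the objective

def instrumented(x):
    evals.append(x.copy())
    return 2*x[0]**3 + 3*(3 - x[1])**2 - 5*x[2] + 50

u_guess = np.zeros(3)
minimize(instrumented, u_guess, method='SLSQP', bounds=[(0, 20)] * 3)
print("first point evaluated:", evals[0])  # equals u_guess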
I have a weird problem with iterators which I can't figure out. I have a complicated numerical routine returning a generator object (or, after some changes to the code, an islice). Afterwards I check the results, as I know that they must have a negative imaginary part:
import numpy as np
threshold = 1e-8 # just check up to some numerical accuracy
results = result_generator(**inputs)
is_valid = [np.all(_result.imag < threshold) for _result in results]
print("Number of valid results: ", is_valid.count(True))
(Sorry for not giving executable code, but I can't come up with a simple example at the moment.)
The problem is that this returns one valid solution. If I change the code to
import numpy as np
threshold = 1e-8 # just check up to some numerical accuracy
results = list(result_generator(**inputs))
is_valid = [np.all(_result.imag < threshold) for _result in results]
print("Number of valid results: ", is_valid.count(True))
using a list instead of a generator, I get zero valid solutions. However, I cannot wrap my head around what is different and thus have no idea how to debug the problem.
If I step through in the debugger and print the result at the corresponding index, the results even differ: the one from the generator is correct, the one from the list is wrong.
Here is the numerical function:
from itertools import islice

import numpy as np

def result_generator(z, iw, coeff, n_min, n_max):
    assert n_min >= 1
    assert n_min < n_max
    if n_min % 2:
        # index must be even
        n_min += 1
    id1 = np.ones_like(z, dtype=complex)
    A0, A1 = 0.*id1, coeff[0]*id1
    A2 = coeff[0] * id1
    B2 = 1. * id1
    multiplier = np.subtract.outer(z, iw[:-1])*coeff[1:]
    multiplier = np.moveaxis(multiplier, -1, 0).copy()

    def _iteration(multiplier_im):
        multiplier_im = multiplier_im/B2
        A2[:] = A1 + multiplier_im*A0
        B2[:] = 1. + multiplier_im
        A0[:] = A1
        A1[:] = A2 / B2
        return A1

    complete_iterations = (_iteration(multiplier_im) for multiplier_im in multiplier)
    return islice(complete_iterations, n_min, n_max, 2)
You're yielding the same array over and over instead of making new arrays. When you call list, you get a list of references to the same array, and that array is in its final state. When you don't call list, you examine the array in the state the generator yields it, each time it's yielded.
Stop reusing the same array over and over.
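A tiny standalone demonstration of the aliasing, plus a one-line fix for the generator above (a sketch):

import numpy as np

a = np.zeros(1)

def gen():
    for i in range(3):
        a[0] = i
        yield a  # yields the *same* array every time

print([x[0] for x in gen()])        # [0.0, 1.0, 2.0]: each element inspected at yield time
print([x[0] for x in list(gen())])  # [2.0, 2.0, 2.0]: three references to the final state

The fix in result_generator is to yield an independent snapshot per iteration, e.g.

complete_iterations = (_iteration(m).copy() for m in multiplier)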
I have a set of data where each record has an ID, a timestamp, and identifiers. I have to go through it, calculate the entropy, and save some other links for the data. At each step more identifiers are added to the identifiers dictionary, and I have to re-compute the entropy and append it. I have a really large amount of data, and the program gets stuck due to the growing number of identifiers and their entropy calculation after each step. I read the following solution, but it is about data consisting of numbers:
Incremental entropy computation
I have copied two functions from this page, but the incremental calculation of entropy gives different values than the classical full entropy calculation at every step.
Here is the code I have:
from math import log

# ---------------------------------------------------------------------#
# Functions copied from https://stackoverflow.com/questions/17104673/incremental-entropy-computation

# maps x to -x*log2(x) for x > 0, and to 0 otherwise
h = lambda p: -p*log(p, 2) if p > 0 else 0

# entropy of the union of two samples with entropies H1 and H2
def update(H1, S1, H2, S2):
    S = S1 + S2
    return 1.0*H1*S1/S + h(1.0*S1/S) + 1.0*H2*S2/S + h(1.0*S2/S)

# compute entropy using the classic equation
def entropy(L):
    n = 1.0*sum(L)
    return sum([h(x/n) for x in L])
# ---------------------------------------------------------------------#

# Below is the input data (actually I read it from a csv file)
input_data = [["1", "2008-01-06T02:13:38Z", "foo,bar"], ["2", "2008-01-06T02:12:13Z", "bar,blup"],
              ["3", "2008-01-06T02:13:55Z", "foo,bar"], ["4", "2008-01-06T02:12:28Z", "foo,xy"],
              ["5", "2008-01-06T02:12:44Z", "foo,bar"], ["6", "2008-01-06T02:13:00Z", "foo,bar"],
              ["7", "2008-01-06T02:13:00Z", "x,y"]]

total_identifiers = {}   # occurrences of identifiers; values are the counts
all_entropies = []       # classical way of calculating entropy at every step
updated_entropies = []   # incremental way of calculating entropy at every step

for item in input_data:
    temp = item[2].split(",")
    identifiers_sum = sum(total_identifiers.values())  # sum of all identifiers so far
    old_entropy = 0 if all_entropies[-1:] == [] else all_entropies[-1]  # previous entropy
    for identifier in temp:
        S_new = len(temp)  # size of the new sample
        temp_dictionaty = {a: 1 for a in temp}  # current identifiers and their occurrence
        if identifier not in total_identifiers:
            total_identifiers[identifier] = 1
        else:
            total_identifiers[identifier] += 1
    current_entropy = entropy(total_identifiers.values())  # entropy for the current set of identifiers
    updated_entropy = update(old_entropy, identifiers_sum, current_entropy, S_new)
    updated_entropies.append(updated_entropy)
    entropy_value = entropy(total_identifiers.values())  # classical entropy calculation for comparison;
                                                         # this step becomes too expensive with big data
    all_entropies.append(entropy_value)

print(total_identifiers)
print('Sum of Total Identifiers: ', identifiers_sum)  # gives 12 while the sum is 14 ???
print("All Classical Entropies: ", all_entropies)     # print for comparison
print("All Updated Entropies: ", updated_entropies)
The other issue is that when I print "Sum of Total Identifiers", it gives 12 instead of 14! (Due to the very large amount of data, I read the actual file line by line and write the results directly to disk; apart from the dictionary of identifiers, nothing is kept in memory.)
The code above uses Theorem 4; it seems to me that you want to use Theorem 5 instead (from the paper in the next paragraph). (As for 12 vs. 14: identifiers_sum is computed at the top of each loop iteration, before that item's identifiers are counted, so after the loop it is missing the last item's two identifiers.)
Note, however, that if the number of identifiers is really the problem, then the incremental approach below isn't going to work either: at some point the dictionaries are going to get too large.
Below you can find a proof-of-concept Python implementation that follows the description from Updating Formulas and Algorithms for Computing Entropy and Gini Index from Time-Changing Data Streams.
import collections
import math
import random

def log2(p):
    return math.log(p, 2) if p > 0 else 0

CountChange = collections.namedtuple('CountChange', ('label', 'change'))

class EntropyHolder:
    def __init__(self):
        self.counts_ = collections.defaultdict(int)
        self.entropy_ = 0
        self.sum_ = 0

    def update(self, count_changes):
        r = sum([change for _, change in count_changes])
        residual = self._compute_residual(count_changes)
        self.entropy_ = self.sum_ * (self.entropy_ - log2(self.sum_ / (self.sum_ + r))) / (self.sum_ + r) - residual
        self._update_counts(count_changes)
        return self.entropy_

    def _compute_residual(self, count_changes):
        r = sum([change for _, change in count_changes])
        residual = 0
        for label, change in count_changes:
            p_new = (self.counts_[label] + change) / (self.sum_ + r)
            p_old = self.counts_[label] / (self.sum_ + r)
            residual += p_new * log2(p_new) - p_old * log2(p_old)
        return residual

    def _update_counts(self, count_changes):
        for label, change in count_changes:
            self.sum_ += change
            self.counts_[label] += change

    def entropy(self):
        return self.entropy_

def naive_entropy(counts):
    s = sum(counts)
    return sum([-(r/s) * log2(r/s) for r in counts])

if __name__ == '__main__':
    print(naive_entropy([1, 1]))
    print(naive_entropy([1, 1, 1, 1]))

    entropy = EntropyHolder()
    freq = collections.defaultdict(int)
    for _ in range(100):
        index = random.randint(0, 5)
        entropy.update([CountChange(index, 1)])
        freq[index] += 1

    print(naive_entropy(freq.values()))
    print(entropy.entropy())
Thanks @blazs for providing the EntropyHolder class; that solves the problem. The idea is to import entropy_holder.py from https://gist.github.com/blazs/4fc78807a96976cc455f49fc0fb28738 and use it to store the previous entropy, updating it at every step as new identifiers come in.
The minimum working code looks like this:
import entropy_holder

input_data = [["1", "2008-01-06T02:13:38Z", "foo,bar"], ["2", "2008-01-06T02:12:13Z", "bar,blup"],
              ["3", "2008-01-06T02:13:55Z", "foo,bar"], ["4", "2008-01-06T02:12:28Z", "foo,xy"],
              ["5", "2008-01-06T02:12:44Z", "foo,bar"], ["6", "2008-01-06T02:13:00Z", "foo,bar"],
              ["7", "2008-01-06T02:13:00Z", "x,y"]]

# This class holds the current entropy and the counts of identifiers
entropy = entropy_holder.EntropyHolder()
for item in input_data:
    for identifier in item[2].split(","):
        entropy.update([entropy_holder.CountChange(identifier, 1)])
print(entropy.entropy())
The entropy computed with blazs's incremental formulas is very close to the entropy calculated the classical way, and it saves iterating over all the data again and again.