Consider this sample of User objects:
import numpy as np
from random import random

class User:
    def __init__(self, name, rating, actual_rating):
        self.name: str = name
        self.rating: int = rating
        # Actual (hidden) rating
        self.actual_rating: int = actual_rating

users = []
for actual_rating in np.random.binomial(10000, 0.157, 1000):
    users.append(
        User(str(random()), 1500, actual_rating)
    )

# Sorting Users Randomly
sorted_users = []
How do I sort this users list so that the higher a user's actual_rating, the more likely that user is to end up at a lower index in sorted_users? For instance, a random User("0.5465454", 1500, 1678) should have a higher likelihood of being sorted to index 0 of sorted_users than, say, User("0.7689989", 1500, 1400).
If possible, is there a neat and readable way to do this?
Generating a random value for each user, then sorting according to this value
How about doing a first pass where you generate, for each user, a random number from a Gaussian distribution with mean actual_rating? Then you sort according to this random number instead of sorting according to actual_rating directly.
stddev = 1.0  # the larger this number, the more shuffled the list - the smaller, the more sorted the list
# reverse=True puts higher (noisy) ratings toward index 0, as the question asks
sorted_users = sorted(users, key=lambda u: np.random.normal(u.actual_rating, stddev), reverse=True)
Note the parameter stddev which you can adjust to suit your needs. The higher this parameter, the more shuffled the list in the end.
Sorting the list, then shuffling it lightly
Inspired by How to lightly shuffle a list in python?
Sort the list according to actual_rating, then shuffle it lightly.
from random import random

sorted_users = sorted(users, key=lambda u: u.actual_rating, reverse=True)  # best first
nb_passes = 3
proba_swap = 0.25
for k in range(nb_passes):
    # alternate between even and odd adjacent pairs on each pass
    for i in range(k % 2, len(sorted_users) - 1, 2):
        if random() < proba_swap:
            sorted_users[i], sorted_users[i + 1] = sorted_users[i + 1], sorted_users[i]
Note the two parameters nb_passes (a positive integer) and proba_swap (between 0.0 and 1.0) which you can adjust to better suit your needs.
Instead of using a fixed parameter proba_swap, you could make the probability of swapping depend on how close the actual_ratings of the two users are, for instance def proba_swap(r1, r2): return math.exp(-a*(r1-r2)**2)/2.0 for some positive parameter a, as sketched below.
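A minimal sketch of the same odd-even pass with such a rating-dependent swap probability (the value of a here is a made-up tuning knob, not something from the question):
import math
from random import random

def proba_swap(r1, r2, a=1e-3):
    # close ratings -> swap probability near 0.5; distant ratings -> near 0
    return math.exp(-a * (r1 - r2) ** 2) / 2.0

for k in range(nb_passes):
    for i in range(k % 2, len(sorted_users) - 1, 2):
        u, v = sorted_users[i], sorted_users[i + 1]
        if random() < proba_swap(u.actual_rating, v.actual_rating):
            sorted_users[i], sorted_users[i + 1] = sorted_users[i + 1], sorted_users[i]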
Or alternatively:
import random

sorted_users = sorted(users, key=lambda u: u.actual_rating, reverse=True)
nb_swaps = int(1.5 * len(sorted_users))  # parameter to experiment with
for i in random.choices(range(len(sorted_users) - 1), k=nb_swaps):
    sorted_users[i], sorted_users[i + 1] = sorted_users[i + 1], sorted_users[i]
See also
After searching a little bit, I found this similar question:
Randomly sort a list with bias
I am using a genetic algorithm implemented with the DEAP library for Python. In order to avoid premature convergence, and to force exploration of the feature space, I would like the mutation probability to be high during the first generations. But to prevent drifting away from extrema once they are identified, I would like the mutation probability to be lower in the last generations. How do I get the mutation probability to decrease over the generations? Is there any built-in function in DEAP to get this done?
When I register a mutation function, for instance
toolbox.register('mutate', tools.mutPolynomialBounded, eta=.6, low=[0,0], up=[1,1], indpb=0.1)
the indpb parameter is a float. How can I make it a function of something else?
Sounds like a job for CallbackProxy, which evaluates its function argument each time it is accessed. I added a simple example where I modified the official DEAP n-queens example
such that the mutation rate is set to 2/N_GENS (an arbitrary choice, just to make the point).
Notice that CallbackProxy receives a lambda, so you have to pass the mutation rate argument as a function (either a fully blown function or just a lambda). The result is that each time the indpb parameter is evaluated, this lambda is called, and since the lambda references a global generation counter, you get what you want.
# This file is part of DEAP.
#
# DEAP is free software: you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as
# published by the Free Software Foundation, either version 3 of
# the License, or (at your option) any later version.
#
# DEAP is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public
# License along with DEAP. If not, see <http://www.gnu.org/licenses/>.
import random
from objproxies import CallbackProxy
import numpy
from deap import algorithms
from deap import base
from deap import creator
from deap import tools
# Problem parameter
NB_QUEENS = 20
N_EVALS = 0
N_GENS = 1
def evalNQueens(individual):
    """Evaluation function for the n-queens problem.
    The problem is to determine a configuration of n queens
    on a nxn chessboard such that no queen can be taken by
    one another. In this version, each queen is assigned
    to one column, and only one queen can be on each line.
    The evaluation function therefore only counts the number
    of conflicts along the diagonals.
    """
    global N_EVALS, N_GENS
    size = len(individual)
    # Count the number of conflicts with other queens.
    # The conflicts can only be diagonal, count on each diagonal line
    left_diagonal = [0] * (2 * size - 1)
    right_diagonal = [0] * (2 * size - 1)
    # Sum the number of queens on each diagonal:
    for i in range(size):
        left_diagonal[i + individual[i]] += 1
        right_diagonal[size - 1 - i + individual[i]] += 1
    # Count the number of conflicts on each diagonal
    sum_ = 0
    for i in range(2 * size - 1):
        if left_diagonal[i] > 1:
            sum_ += left_diagonal[i] - 1
        if right_diagonal[i] > 1:
            sum_ += right_diagonal[i] - 1
    # Crude generation counter: the population size below is 300, so one
    # generation corresponds to roughly 300 evaluations.
    N_EVALS += 1
    if N_EVALS % 300 == 0:
        N_GENS += 1
    return sum_,
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", list, fitness=creator.FitnessMin)
# Since there is only one queen per line,
# individuals are represented by a permutation
toolbox = base.Toolbox()
toolbox.register("permutation", random.sample, range(NB_QUEENS), NB_QUEENS)
# Structure initializers
# An individual is a list that represents the position of each queen.
# Only the line is stored, the column is the index of the number in the list.
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.permutation)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", evalNQueens)
toolbox.register("mate", tools.cxPartialyMatched)
toolbox.register("mutate", tools.mutShuffleIndexes, indpb=CallbackProxy(lambda: 2.0 / N_GENS))
toolbox.register("select", tools.selTournament, tournsize=3)
def main(seed=0):
    random.seed(seed)
    pop = toolbox.population(n=300)
    hof = tools.HallOfFame(1)
    stats = tools.Statistics(lambda ind: ind.fitness.values)
    stats.register("Avg", numpy.mean)
    stats.register("Std", numpy.std)
    stats.register("Min", numpy.min)
    stats.register("Max", numpy.max)
    algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=1, ngen=100, stats=stats,
                        halloffame=hof, verbose=True)
    return pop, stats, hof

if __name__ == "__main__":
    main()
The following is the objective function:
The idea is that a mean-variance optimization has already been done on a universe of securities. This gives us the weights for a target portfolio. Now suppose the investor already is holding a portfolio and does not want to change their entire portfolio to the target one.
Let w_0 = [w_0(1),w_0(2),...,w_0(N)] be the initial portfolio, where w_0(i) is the fraction of the portfolio invested in
stock i = 1,...,N. Let w_t = [w_t(1), w_t(2),...,w_t(N)] be the target portfolio, i.e., the portfolio
that it is desirable to own after rebalancing. This target portfolio may be constructed using quadratic optimization techniques such as variance minimization.
The objective is to decide the final portfolio w_f = [w_f (1), w_f (2),..., w_f(N)] that satisfies the
following characteristics:
(1) The final portfolio is close to our target portfolio
(2) The number of transactions from our initial portfolio is sufficiently small
(3) The return of the final portfolio is high
(4) The final portfolio does not hold many more securities than our initial portfolio
An objective function which is to be minimized is created by summing together the characteristic terms 1 through 4.
The first term is captured by summing the absolute differences in weights between the final and the target portfolio.
The second term is captured by the sum of an indicator function multiplied by a user-specified penalty. The indicator function is y_transactions(i), which is 1 if the weight of security i differs between the initial and the final portfolio, and 0 otherwise.
The third term is captured by the total final portfolio return multiplied by a negative user-specified penalty, since the objective is minimization.
The final term is the count of assets in the final portfolio (i.e., the sum of an indicator function counting the number of positive weights in the final portfolio), multiplied by a user-specified penalty.
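Putting the four terms together (writing p_trans, p_ret, and p_count for the three user-specified penalties, r(i) for the return of security i, and y_count(i) for the positive-weight indicator), the objective to minimize is:
sum_i |w_f(i) - w_t(i)| + p_trans * sum_i y_transactions(i) - p_ret * sum_i r(i) * w_f(i) + p_count * sum_i y_count(i)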
Assuming that we already have the target weights as target_w, how do I set up this optimization problem in the docplex Python library? Or, if anyone is familiar with mixed-integer programming in NAG, it would be helpful to know how to set up such a problem there as well.
final_w = [0.]*n
final_w = np.array(final_w)

obj1 = np.sum(np.absolute(final_w - target_w))

pen_trans = 1.2
def ind_trans(final, initial):
    list_trans = []
    for i in range(len(final)):
        if abs(final[i] - initial[i]) == 0:
            list_trans.append(0)
        else:
            list_trans.append(1)
    return list_trans
obj2 = pen_trans*sum(ind_trans(final_w, initial_w))

pen_returns = 0.6
returns_np = np.array(df_secs['Return'])
obj3 = (-1)*pen_returns*np.dot(returns_np, final_w)

pen_count = 1.
def ind_count(final):
    list_count = []
    for i in range(len(final)):
        if final[i] == 0:
            list_count.append(0)
        else:
            list_count.append(1)
    return list_count
obj4 = pen_count*sum(ind_count(final_w))

objective = obj1 + obj2 + obj3 + obj4
The main issue in your code is that final_w is not an array of variables but an array of data, so there is nothing to optimize. To create an array of variables in docplex you have to do something like this:
from docplex.mp.model import Model

with Model() as m:
    final = m.continuous_var_list(n, 0.0, 1.0)
That creates n variables that can take values between 0 and 1. With that in hand you can start things. For example:
obj1 = m.sum(m.abs(final[i] - target_w[i]) for i in range(n))
For the next objective things become harder, since you need indicator constraints. To simplify the definition of these constraints, first define helper variables delta that give the absolute difference per stock:
delta = m.continuous_var_list(n, 0.0, 1.0)
m.add_constraints(delta[i] == m.abs(initial[i] - final[i]) for i in range(n))
Next you need an indicator variable that is 1 if a transaction is required to adjust stock i:
needtrans = m.binary_var_list(n)
for i in range(n):
    # If needtrans[i] is 0 then delta[i] must be 0.
    # Since needtrans[i] is penalized in the objective, the solver will
    # try hard to set it to 0. It will only set it to 1 if delta[i] != 0.
    # That is exactly what we want.
    m.add_indicator(needtrans[i], delta[i] == 0, 0)
With that you can define the second objective:
obj2 = pen_trans * m.sum(needtrans)
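The third and fourth terms can be sketched along the same lines (a sketch, assuming returns_np and the pen_* penalties from the question's code are in scope; the "is held" indicator reuses the add_indicator pattern from above):
obj3 = -pen_returns * m.sum(returns_np[i] * final[i] for i in range(n))

held = m.binary_var_list(n)
for i in range(n):
    # if held[i] is 0 then final[i] must be 0; since held[i] is penalized,
    # the solver sets it to 1 only when security i keeps a positive weight
    m.add_indicator(held[i], final[i] == 0, 0)
obj4 = pen_count * m.sum(held)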
Once all objectives have been defined, you can add their sum to the model:
m.minimize(obj1 + obj2 + obj3 + obj4)
and then solve the model and display its solution:
m.solve()
print(m.solution.get_values(final))
If any of the above is not (yet) clear to you then I suggest you take a look at the many examples that ship with docplex and also at the (reference) documentation.
I am working on a code to solve for the optimum combination of diameter size of number of pipelines. The objective function is to find the least sum of pressure drops in six pipelines.
As I have 15 choices of discrete diameter sizes, [2,4,6,8,12,16,20,24,30,36,40,42,50,60,80], which can be used for any of the six pipelines in the system, the list of possible solutions is 15^6, which is equal to 11,390,625.
To solve the problem, I am using mixed-integer linear programming with the PuLP package. I am able to find the solution when all pipelines share the same diameter (e.g. [2,2,2,2,2,2] or [4,4,4,4,4,4]), but what I need is to go through all combinations (e.g. [2,4,2,2,4,2] or [4,2,4,2,4,2]) to find the minimum. I attempted to do this, but the process takes a very long time to go through all combinations. Is there a faster way to do this?
Note that I cannot calculate the pressure drop for each pipeline in isolation, as the choice of diameter affects the total pressure drop in the system. Therefore, at any time, I need to calculate the pressure drop of each combination in the system.
I also need to constrain the problem such that rate / pipeline cross-section area > 2.
Your help is much appreciated.
The first attempt for my code is the following:
from pulp import *
import random
import itertools
import numpy

rate = 5000
numberOfPipelines = 15

def pressure(diameter):
    diameterList = numpy.tile(diameter, numberOfPipelines)
    pressure = 0.0
    for pipeline in range(numberOfPipelines):
        pressure += rate / diameterList[pipeline]
    return pressure

diameterList = [2,4,6,8,12,16,20,24,30,36,40,42,50,60,80]
pipelineIds = range(0, numberOfPipelines)
pipelinePressures = {}
for diameter in diameterList:
    pressures = []
    for pipeline in range(numberOfPipelines):
        pressures.append(pressure(diameter))
    pressureList = dict(zip(pipelineIds, pressures))
    pipelinePressures[diameter] = pressureList
print('pipepressure', pipelinePressures)
prob = LpProblem("Warehouse Allocation", LpMinimize)
use_diameter = LpVariable.dicts("UseDiameter", diameterList, cat=LpBinary)
use_pipeline = LpVariable.dicts("UsePipeline", [(i,j) for i in pipelineIds for j in diameterList], cat=LpBinary)

## Objective Function:
prob += lpSum(pipelinePressures[j][i] * use_pipeline[(i,j)] for i in pipelineIds for j in diameterList)

## Each pipeline must be connected to exactly one diameter:
for i in pipelineIds:
    prob += lpSum(use_pipeline[(i,j)] for j in diameterList) == 1

## The diameter is activated if at least one pipeline is assigned to it:
for j in diameterList:
    for i in pipelineIds:
        prob += use_diameter[j] >= lpSum(use_pipeline[(i,j)])

## run the solution
prob.solve()
print("Status:", LpStatus[prob.status])
for i in diameterList:
    if use_diameter[i].varValue > 0:  # diameter selected by the solver
        print("Diameter Size", i)
for v in prob.variables():
    print(v.name, "=", v.varValue)
This is what I did for the combination part, which took a really long time:
xList = numpy.array(list(itertools.product(diameterList, repeat=numberOfPipelines)))
print(len(xList))
for combination in xList:
    pressures = []
    for pipeline in range(numberOfPipelines):
        pressures.append(pressure(combination))
    pressureList = dict(zip(pipelineIds, pressures))
    pipelinePressures[tuple(combination)] = pressureList
print('pipelinePressures', pipelinePressures)
I would iterate through the combinations one at a time; I think you would otherwise run into memory problems trying to model ALL combinations in a single MIP.
If you iterate through the candidates, perhaps using the multiprocessing library to use all cores, it shouldn't take long. Just remember to hold information only on the best combination so far, and not to generate all combinations at once and then evaluate them; a sketch is shown below.
If the problem gets bigger you should consider dynamic programming algorithms or PuLP with column generation.
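A minimal sketch of that brute-force loop, assuming a hypothetical system_pressure_drop() that computes the system-wide pressure drop for one combination (and returns None when the rate/area constraint is violated):
import itertools
from multiprocessing import Pool

diameterList = [2, 4, 6, 8, 12, 16, 20, 24, 30, 36, 40, 42, 50, 60, 80]

def evaluate(combination):
    # system_pressure_drop is a stand-in for your own model of the system
    total = system_pressure_drop(combination)
    return (total, combination) if total is not None else None

if __name__ == "__main__":
    best = None
    with Pool() as pool:
        # itertools.product is lazy and imap_unordered streams results,
        # so the 15**6 combinations are never all held in memory at once
        for result in pool.imap_unordered(evaluate,
                                          itertools.product(diameterList, repeat=6),
                                          chunksize=10000):
            if result is not None and (best is None or result < best):
                best = result
    print("best (pressure, combination):", best)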
Thanks for the answers. I have not used StackOverflow before, so I was surprised by the number of answers and the speed of them - it's fantastic.
I have not been through the answers properly yet, but thought I should add some information to the problem specification. See the image below.
I can't post an image in this because I don't have enough points, but you can see an image
at http://journal.acquitane.com/2010-01-20/image003.jpg
This image may describe more closely what I'm trying to achieve. The horizontal lines across the page are price points on the chart. Where you get a clustering of lines within 0.5% of each other, this is considered to be a good thing, and that is why I want to identify those clusters automatically. You can see on the chart that there are clusters at S2 & MR1 and at R2 & WPP1.
So every day I produce these price points, and I can then manually identify those that are within 0.5% - but the purpose of this question is how to do it with a Python routine.
I have reproduced the list again (see below) with labels. Just be aware that the list price points don't match the price points in the image because they are from two different days.
[YR3,175.24,8]
[SR3,147.85,6]
[YR2,144.13,8]
[SR2,130.44,6]
[YR1,127.79,8]
[QR3,127.42,5]
[SR1,120.94,6]
[QR2,120.22,5]
[MR3,118.10,3]
[WR3,116.73,2]
[DR3,116.23,1]
[WR2,115.93,2]
[QR1,115.83,5]
[MR2,115.56,3]
[DR2,115.53,1]
[WR1,114.79,2]
[DR1,114.59,1]
[WPP,113.99,2]
[DPP,113.89,1]
[MR1,113.50,3]
[DS1,112.95,1]
[WS1,112.85,2]
[DS2,112.25,1]
[WS2,112.05,2]
[DS3,111.31,1]
[MPP,110.97,3]
[WS3,110.91,2]
[50MA,110.87,4]
[MS1,108.91,3]
[QPP,108.64,5]
[MS2,106.37,3]
[MS3,104.31,3]
[QS1,104.25,5]
[SPP,103.53,6]
[200MA,99.42,7]
[QS2,97.05,5]
[YPP,96.68,8]
[SS1,94.03,6]
[QS3,92.66,5]
[YS1,80.34,8]
[SS2,76.62,6]
[SS3,67.12,6]
[YS2,49.23,8]
[YS3,32.89,8]
I did make a mistake with the original list in that Group C is wrong and should not be included. Thanks for pointing that out.
Also, the 0.5% is not fixed; this value will change from day to day, but I have just used 0.5% as an example for spec'ing the problem.
Thanks Again.
Mark
PS. I will get cracking on checking the answers now.
Hi:
I need to do some manipulation of stock prices. I have just started using Python (but I think I would have trouble implementing this in any language). I'm looking for some ideas on how to implement this nicely in Python.
Thanks
Mark
Problem:
I have a list of lists (FloorLevels, see below) where each sublist has two items (stockprice, weight). I want to put the stock prices into groups when they are within 0.5% of each other. A group's strength is determined by its total weight. For example:
Group-A
115.93,2
115.83,5
115.56,3
115.53,1
-------------
TotalWeight:12
-------------
Group-B
113.50,3
112.95,1
112.85,2
-------------
TotalWeight:6
-------------
FloorLevels = [
[175.24, 8],
[147.85, 6],
[144.13, 8],
[130.44, 6],
[127.79, 8],
[127.42, 5],
[120.94, 6],
[120.22, 5],
[118.10, 3],
[116.73, 2],
[116.23, 1],
[115.93, 2],
[115.83, 5],
[115.56, 3],
[115.53, 1],
[114.79, 2],
[114.59, 1],
[113.99, 2],
[113.89, 1],
[113.50, 3],
[112.95, 1],
[112.85, 2],
[112.25, 1],
[112.05, 2],
[111.31, 1],
[110.97, 3],
[110.91, 2],
[110.87, 4],
[108.91, 3],
[108.64, 5],
[106.37, 3],
[104.31, 3],
[104.25, 5],
[103.53, 6],
[99.42, 7],
[97.05, 5],
[96.68, 8],
[94.03, 6],
[92.66, 5],
[80.34, 8],
[76.62, 6],
[67.12, 6],
[49.23, 8],
[32.89, 8],
]
I suggest a repeated use of k-means clustering -- let's call it KMC for short. KMC is a simple and powerful clustering algorithm... but it needs to "be told" how many clusters, k, you're aiming for. You don't know that in advance (if I understand you correctly) -- you just want the smallest k such that no two items "clustered together" are more than X% apart from each other. So, start with k equal 1 -- everything bunched together, no clustering pass needed;-) -- and check the diameter of the cluster (a cluster's "diameter", from the use of the term in geometry, is the largest distance between any two members of a cluster).
If the diameter is > X%, set k += 1, perform KMC with k as the number of clusters, and repeat the check, iteratively.
In pseudo-code:
def markCluster(items, threshold):
    k = 1
    clusters = [items]
    maxdist = diameter(items)
    while maxdist > threshold:
        k += 1
        clusters = Kmc(items, k)
        maxdist = max(diameter(c) for c in clusters)
    return clusters
assuming of course we have suitable diameter and Kmc Python functions.
Does this sound like the kind of thing you want? If so, then we can move on to show you how to write diameter and Kmc (in pure Python if you have a relatively limited number of items to deal with, otherwise maybe by exploiting powerful third-party add-on frameworks such as numpy) -- but it's not worthwhile to go to such trouble if you actually want something pretty different, whence this check!-)
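For what it's worth, a minimal sketch of those two helpers for 1-D prices (a plain Lloyd's-style k-means; note it measures absolute price distance, so for the 0.5% criterion you would divide the diameter by the cluster mean):
def diameter(cluster):
    # largest distance between any two members: for 1-D data, max minus min
    return max(cluster) - min(cluster) if cluster else 0.0

def Kmc(items, k, iterations=20):
    data = sorted(items)
    # seed centroids evenly across the sorted values
    if k > 1:
        centroids = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    else:
        centroids = [data[0]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # recompute centroids, keeping the old one if a cluster is empty
        centroids = [sum(c) / len(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return [c for c in clusters if c]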
A stock s belongs in a group G if for each stock t in G, s * 1.05 >= t and s / 1.05 <= t, right?
How do we add the stocks to each group? If we have the stocks 95, 100, 101, and 105, and we start a group with 100, then add 101, we will end up with {100, 101, 105}. If we did 95 after 100, we'd end up with {100, 95}.
Do we just need to consider all possible permutations? If so, your algorithm is going to be inefficient.
You need to specify your problem in more detail. Just what does "put the stockprices into groups when they are within 0.5% of each other" mean?
Possibilities:
(1) each member of the group is within 0.5% of every other member of the group
(2) sort the list and split it where the gap is more than 0.5%
Note that 116.23 is within 0.5% of 115.93 -- abs((116.23 / 115.93 - 1) * 100) < 0.5 -- but you have put one number in Group A and one in Group C.
Simple example: a, b, c = (0.996, 1, 1.004) ... Note that a and b fit, b and c fit, but a and c don't fit. How do you want them grouped, and why? Is the order in the input list relevant?
Possibility (1) produces ab,c or a,bc ... tie-breaking rule, please
Possibility (2) produces abc (no big gaps, so only one group)
You won't be able to classify them into hard "groups". If you have prices (1.0,1.05, 1.1) then the first and second should be in the same group, and the second and third should be in the same group, but not the first and third.
A quick, dirty way to do something that you might find useful:
def make_group_function(tolerance=0.05):
    from math import log10, floor
    # I forget why this works.
    tolerance_factor = -1.0 / (-log10(1.0 + tolerance))
    # well ... since you might ask:
    # we want log(x)*tf - log(x*(1+t))*tf = -1,
    # so every 5% change lands in a different group. The minus is just so
    # groups are ascending .. it looks a bit nicer.
    #
    # tf = -1/(log(x) - log(x*(1+t)))
    # tf = -1/log(x/(x*(1+t)))
    # tf = -1/log(1/(1+t))    # solved .. but let's just be more clever
    # tf = -1/(0 - log(1+t))
    # tf = -1/(-log(1+t))
    def group_function(value):
        # don't just use int - it rounds up below zero, and down above zero
        return int(floor(log10(value) * tolerance_factor))
    return group_function
Usage:
group_function = make_group_function()
import random
groups = {}
for i in range(50):
    v = random.random() * 500 + 1000
    group = group_function(v)
    if group in groups:
        groups[group].append(v)
    else:
        groups[group] = [v]
for group in sorted(groups):
    print('Group', group)
    for v in sorted(groups[group]):
        print(v)
    print()
For a given set of stock prices, there is probably more than one way to group stocks that are within 0.5% of each other. Without some additional rules for grouping the prices, there's no way to be sure an answer will do what you really want.
Apart from the proper way to pick which values fit together, this is a problem where a little object orientation dropped in can make it a lot easier to deal with.
I made two classes here, with a minimum of desirable behaviors, but which can make the classification a lot easier -- you get a single point to play with it on the Group class.
I can see the code below is incorrect, in the sense that the limits for group inclusion vary as new members are added -- even if the separation criteria remain the same, you have to rewrite the get_groups method to use a multi-pass approach. It should not be hard -- but the code would be too long to be helpful here, and I think this snippet is enough to get you going:
from copy import copy

class Group(object):
    def __init__(self, data=None, name=""):
        if data:
            self.data = data
        else:
            self.data = []
        self.name = name

    def get_mean_stock(self):
        return sum(item[0] for item in self.data) / len(self.data)

    def fits(self, item):
        if 0.995 < abs(item[0]) / self.get_mean_stock() < 1.005:
            return True
        return False

    def get_weight(self):
        return sum(item[1] for item in self.data)

    def __repr__(self):
        return "Group-%s\n%s\n---\nTotalWeight: %d\n\n" % (
            self.name,
            "\n".join("%.02f, %d" % tuple(item) for item in self.data),
            self.get_weight())

class StockGrouper(object):
    def __init__(self, data=None):
        if data:
            self.floor_levels = data
        else:
            self.floor_levels = []

    def get_groups(self):
        groups = []
        floor_levels = copy(self.floor_levels)
        name_ord = ord("A") - 1
        while floor_levels:
            seed = floor_levels.pop(0)
            name_ord += 1
            group = Group([seed], chr(name_ord))
            groups.append(group)
            to_remove = []
            for i, item in enumerate(floor_levels):
                if group.fits(item):
                    group.data.append(item)
                    to_remove.append(i)
            for i in reversed(to_remove):
                floor_levels.pop(i)
        return groups
testing:
floor_levels = [ [stock, weight], ... ]  # <paste the data above>
s = StockGrouper(floor_levels)
s.get_groups()
For the grouping element, could you use itertools.groupby()? As the data is sorted, a lot of the work of grouping it is already done; you could then test whether the current value in the iteration differs from the last by less than 0.5%, and have itertools.groupby() start a new group every time your key function changes value.
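A minimal sketch of that idea, assuming the (price, weight) pairs are already sorted in descending price order; a stateful key function bumps a counter whenever the gap to the previous price exceeds the tolerance, and groupby then splits on that counter:
from itertools import groupby

def split_on_gaps(price_weight_pairs, tolerance=0.005):
    group_id = 0
    prev = None
    def key(item):
        nonlocal group_id, prev
        price = item[0]
        # start a new group when the relative gap to the previous price is too big
        if prev is not None and abs(price / prev - 1) > tolerance:
            group_id += 1
        prev = price
        return group_id
    return [list(g) for _, g in groupby(price_weight_pairs, key=key)]

for g in split_on_gaps(FloorLevels):
    print(g, 'TotalWeight:', sum(w for _, w in g))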