Sample RDD element(s) according to weighted probability [Spark]

In PySpark I have an RDD composed of (key, value) pairs where keys are sequential integers and values are floating point numbers.
I'd like to sample exactly one element from this RDD with probability proportional to value.
In a naive manner, this task can be accomplished as follows:
import numpy
pairs = myRDD.collect() # now pairs is a list of (key, value) tuples
K, V = zip(*pairs) # separate keys and values
V = numpy.array(V)/sum(V) # normalise probabilities
extractedK = numpy.random.choice(K,size=1,replace=True, p=V)
What concerns me is the collect() operation which, as you might know, loads the entire list of tuples in memory, which can be quite expensive. I'm aware of takeSample(), which is great when elements should be extracted uniformly, but what happens if elements should be extracted according to weighted probabilities?
Thanks!

Here is an algorithm I worked out to do this:
EXAMPLE PROBLEM
Assume we want to sample 10 items from an RDD on 3 partitions like this:
P1: ("A", 0.10), ("B", 0.10), ("C", 0.20)
P2: ("D", 0.25), ("E", 0.25)
P3: ("F", 0.10)
Here is the high-level algorithm:
INPUT: number of samples and a RDD of items (with weights)
OUTPUT: dataset sample on driver
1. For each partition, calculate the total probability of sampling from the partition, and aggregate those values to the driver.
   This would give the probability distribution: Prob(P1) = 0.40, Prob(P2) = 0.50, Prob(P3) = 0.10
2. Generate a sample of the partitions (to determine the number of elements to select from each partition).
   A sample may look like this: [P1, P1, P1, P1, P2, P2, P2, P2, P2, P3]
   This would give us 4 items from P1, 5 items from P2, and 1 item from P3.
3. On each separate partition, we locally generate a sample of the needed size using only the elements on that partition:
   On P1, we would sample 4 items with the (re-normalized) probability distribution: Prob(A) = 0.25, Prob(B) = 0.25, Prob(C) = 0.50. This could yield a sample such as [A, B, C, C].
   On P2, we would sample 5 items with probability distribution: Prob(D) = 0.5, Prob(E) = 0.5. This could yield a sample such as [D,D,E,E,E].
   On P3, we would sample 1 item with probability distribution: Prob(F) = 1.0. This would generate the sample [F].
4. Collect the samples to the driver to yield your dataset sample [A,B,C,C,D,D,E,E,E,F].
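The two-phase idea can be tried out without a cluster. The following is a minimal sketch in plain Python with NumPy, assuming a list of lists stands in for the RDD partitions; the function name `sample_partitioned` and its `rng` parameter are made up for this illustration and are not part of Spark.

```python
import numpy as np

def sample_partitioned(partitions, num_samples, rng=None):
    """Two-phase weighted sampling over a list of 'partitions'.

    `partitions` is a plain list of lists of (item, weight) pairs that
    stands in for the partitions of an RDD (an assumption of this sketch).
    """
    if rng is None:
        rng = np.random.default_rng()
    # Phase 1: total weight per partition, normalised to probabilities
    totals = np.array([sum(w for _, w in part) for part in partitions], dtype=float)
    part_probs = totals / totals.sum()
    # Decide how many samples each partition has to contribute
    counts = rng.multinomial(num_samples, part_probs)
    # Phase 2: sample locally inside each partition with re-normalised weights
    result = []
    for part, count in zip(partitions, counts):
        if count == 0:
            continue
        items = [item for item, _ in part]
        weights = np.array([w for _, w in part], dtype=float)
        picks = rng.choice(len(items), size=count, replace=True,
                           p=weights / weights.sum())
        result.extend(items[i] for i in picks)
    return result

partitions = [
    [("A", 0.10), ("B", 0.10), ("C", 0.20)],
    [("D", 0.25), ("E", 0.25)],
    [("F", 0.10)],
]
sample = sample_partitioned(partitions, 10, rng=np.random.default_rng(0))
print(sample)
```

Drawing the per-partition counts from a multinomial is equivalent to sampling partition labels one by one and counting them, as in the walk-through above.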
Here is an implementation in Scala:
import scala.reflect.ClassTag
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
case class Sample[T](weight: Double, obj: T)
/*
* Obtain a sample of size `numSamples` from an RDD `ar` using a two-phase distributed sampling approach.
*/
def sampleWeightedRDD[T:ClassTag](ar: RDD[Sample[T]], numSamples: Int)(implicit sc: SparkContext): Array[T] = {
// 1. Get total weight on each partition
var partitionWeights = ar.mapPartitionsWithIndex{case(partitionIndex, iter) => Array((partitionIndex, iter.map(_.weight).sum)).toIterator }.collect().toArray
//Normalize to 1.0
val Z = partitionWeights.map(_._2).sum
partitionWeights = partitionWeights.map{case(partitionIndex, weight) => (partitionIndex, weight/Z)}
// 2. Sample from partitions indexes to determine number of samples from each partition
val samplesPerIndex = sc.broadcast(sample[Int](partitionWeights, numSamples).groupBy(x => x).mapValues(_.size).toMap).value
// 3. On each partition, sample the number of elements needed for that partition
ar.mapPartitionsWithIndex{case(partitionIndex, iter) =>
val numSamplesForPartition = samplesPerIndex.getOrElse(partitionIndex, 0)
var ar = iter.map(x => (x.obj, x.weight)).toArray
//Normalize to 1.0
val Z = ar.map(x => x._2).sum
ar = ar.map{case(obj, weight) => (obj, weight/Z)}
sample(ar, numSamplesForPartition).toIterator
}.collect()
}
This code uses a simple weighted sampling function, sample:
// a very simple weighted sampling function
def sample[T:ClassTag](dist: Array[(T, Double)], numSamples: Int): Array[T] = {
val probs = dist.zipWithIndex.map{case((elem,prob),idx) => (elem,prob,idx+1)}.sortBy(-_._2)
val cumulativeDist = probs.map(_._2).scanLeft(0.0)(_+_).drop(1)
(1 to numSamples).toArray.map(x => scala.util.Random.nextDouble).map{case(p) =>
def findElem(p: Double, cumulativeDist: Array[Double]): Int = {
for(i <- (0 until cumulativeDist.size-1))
if (p <= cumulativeDist(i)) return i
return cumulativeDist.size-1
}
probs(findElem(p, cumulativeDist))._1
}
}

This is basically doable, but you should really consider whether it makes sense to use Spark for this. If you need to draw random values, then you presumably need to do so many times over in a loop. Each iteration will require scanning through all the data (maybe more than once).
So fitting the data you need into memory and then drawing values from it randomly is almost certainly the right way to go. If your data is really too big to fit into memory, consider (a) only collecting the columns you need for this purpose and (b) whether your data can be binned in a way that makes sense.
Having said that, it is doable within Spark. Below is PySpark code to demonstrate the idea.
import random
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# read some sample data (shown below)
df = spark.read.csv("prb.csv",sep='\t',inferSchema=True,header=True)
# find the sum of the value column
ss = df.groupBy().agg( F.sum("vl").alias("sum") ).collect()
# add a column to store the normalized values
q = df.withColumn("nrm_vl", (df["vl"] / ss[0].sum) )
w = Window.partitionBy().orderBy("nrm_vl")\
.rowsBetween(Window.unboundedPreceding, Window.currentRow)
q = q.select("*", F.sum("nrm_vl").over(w).alias("cum_vl"))
q.show()
+---+---+-------------------+-------------------+
| ky| vl| nrm_vl| cum_vl|
+---+---+-------------------+-------------------+
| 2|0.8|0.07079646017699115|0.07079646017699115|
| 3|1.1|0.09734513274336283|0.16814159292035397|
| 4|1.7|0.15044247787610618| 0.3185840707964601|
| 0|3.2| 0.2831858407079646| 0.6017699115044247|
| 1|4.5| 0.3982300884955752| 0.9999999999999999|
+---+---+-------------------+-------------------+
def getRandVl(q):
    # choose a random number and find the first row whose
    # cumulative value is greater than the random number
    # (analogous to `std::upper_bound` in C++)
    chvl = q.where( q["cum_vl"] > random.random() ).groupBy().agg(
        F.min(q["cum_vl"]).alias("cum_vl") )
    return q.join(chvl, on="cum_vl", how="inner")
# get 30 random samples.. this is already slow
# on a single machine.
cdf = None
for i in range(0,30):
    x = getRandVl(q)
    # add this row. there's no reason to do this (it's slow)
    # except that it's convenient to count how often each
    # key was chosen, to check if this method works
    cdf = x if cdf is None else cdf.union(x.select(cdf.columns))
# count how often we picked each key
cdf.groupBy("ky","vl").agg( F.count("*").alias("count") ).show()
+---+---+-----+
| ky| vl|count|
+---+---+-----+
| 4|1.7| 4|
| 2|0.8| 1|
| 3|1.1| 3|
| 0|3.2| 11|
| 1|4.5| 12|
+---+---+-----+
I think these counts are reasonable given the values. I'd rather test it with many more samples, but it's too slow.
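For comparison, here is a minimal sketch of the same cumulative-sum lookup done in memory with NumPy, which is the route recommended at the start of this answer. The keys and values are the five rows from the table above; the number of draws and the seed are arbitrary choices for the illustration.

```python
import numpy as np

# the (ky, vl) rows from the sample data above
keys = np.array([0, 1, 2, 3, 4])
values = np.array([3.2, 4.5, 0.8, 1.1, 1.7])

cum = np.cumsum(values)              # running total of the weights
rng = np.random.default_rng(0)
u = rng.random(10_000) * cum[-1]     # uniform draws on [0, total weight)
# first position whose cumulative weight exceeds the draw
idx = np.searchsorted(cum, u, side="right")
idx = np.minimum(idx, len(keys) - 1)  # guard against rounding at the top end
draws = keys[idx]

counts = np.bincount(draws, minlength=len(keys))
print(counts / counts.sum())         # empirical frequencies
print(values / values.sum())         # target probabilities
```

Ten thousand draws take a fraction of a second this way, so the frequencies can be checked against the weights far more thoroughly than with the 30 Spark jobs above.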

Related

Randomly select timeframes as tuples from a list of time points

I have several time points taken from a video with some max time length (T). These points are stored in a list of lists as follows:
time_pt_nested_list =
[[0.0, 6.131, 32.892, 43.424, 46.969, 108.493, 142.69, 197.025, 205.793, 244.582, 248.913, 251.518, 258.798, 264.021, 330.02, 428.965],
[11.066, 35.73, 64.784, 151.31, 289.03, 306.285, 328.7, 408.274, 413.64],
[48.447, 229.74, 293.19, 333.343, 404.194, 418.575],
[66.37, 242.16, 356.96, 424.967],
[78.711, 358.789, 403.346],
[84.454, 373.593, 422.384],
[102.734, 394.58],
[158.534],
[210.112],
[247.61],
[340.02],
[365.146],
[372.153]]
Each list above is associated with some probability; I'd like to randomly select points from each list according to its probability to form n tuples of contiguous time spans, such as the following:
[(0,t1),(t1,t2),(t2,t3),...,(tn,T)]
where n is specified by the user. All the returned tuples should only contain the floating point numbers inside the nested list above. I want to assign the first list the highest probability of being sampled and appearing in the returned tuples, the second list a slightly lower probability, etc. The exact details of these probabilities are not important, but it would be nice if the user could input a parameter that controls how fast the probability decays as the list index increases.
The returned tuples are timeframes that should exactly cover the entire video and should not overlap. 0 and T may not necessarily appear in time_pt_nested_list (but they may). Are there nice ways to implement this? I would be grateful for any insightful suggestions.
For example if the user inputs 6 as the number of subclips, then this will be an example output:
[(0.0, 32.892), (32.892, 64.784), (64.784, 229.74), (229.74, 306.285), (306.285, 418.575), (418.575, 437.47)]
All numbers appearing in the tuples appeared in time_pt_nested_list, except 0.0 and 437.47. (Well 0.0 does appear here but may not in other cases) Here 437.47 is the length of video which is also given and may not appear in the list.

This is simpler than it may look. You really just need to sample n points from your sublists, each with row-dependent sample probability. Whatever samples are obtained can be time-ordered to construct your tuples.
import numpy as np
# user params
n = 6
prob_falloff_param = 0.2
lin_list = sorted([(idx, el) for idx, row in enumerate(time_pt_nested_list) for
                   el in row], key=lambda x: x[1])
# endpoints required, excluded from random selection process
t0 = lin_list.pop(0)[1]
T = lin_list.pop(-1)[1]
arr = np.array(lin_list)
# define row weights, prob_falloff_param controls how fast they decay
weights = np.exp(-prob_falloff_param*arr[:,0]**2)
norm_weights = weights/np.sum(weights)
# choose (weighted) random points, create tuple list:
random_points = sorted(np.random.choice(arr[:,1], size=(n-1), replace=False, p=norm_weights))
time_arr = [t0, *random_points, T]
output = list(zip(time_arr, time_arr[1:]))
example outputs:
# n = 6
[(0.0, 78.711),
(78.711, 84.454),
(84.454, 158.534),
(158.534, 210.112),
(210.112, 372.153),
(372.153, 428.965)]
# n = 12
[(0.0, 6.131),
(6.131, 43.424),
(43.424, 64.784),
(64.784, 84.454),
(84.454, 102.734),
(102.734, 210.112),
(210.112, 229.74),
(229.74, 244.582),
(244.582, 264.021),
(264.021, 372.153),
(372.153, 424.967),
(424.967, 428.965)]
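The question states that the video length T is given separately and need not appear in the list, whereas the snippet above takes the smallest and largest time points as endpoints. A possible variant, sketched under the assumption that the spans should always run from 0 to a user-supplied T, is shown here; the function name `sample_timeframes` and its parameters are made up for the illustration, and the data is a shortened copy of the first three rows.

```python
import numpy as np

def sample_timeframes(nested, n, T, falloff=0.2, rng=None):
    """Return n contiguous (start, end) spans that cover [0, T]."""
    if rng is None:
        rng = np.random.default_rng()
    # flatten to (row index, time), skipping points that coincide with the endpoints
    pairs = [(idx, t) for idx, row in enumerate(nested) for t in row if 0.0 < t < T]
    rows = np.array([p[0] for p in pairs], dtype=float)
    times = np.array([p[1] for p in pairs], dtype=float)
    # weights decay with the row index; falloff controls how fast
    weights = np.exp(-falloff * rows ** 2)
    weights /= weights.sum()
    # n spans need n - 1 interior cut points, drawn without replacement
    cuts = np.sort(rng.choice(times, size=n - 1, replace=False, p=weights))
    edges = [0.0, *cuts.tolist(), float(T)]
    return list(zip(edges, edges[1:]))

time_pts = [
    [0.0, 6.131, 32.892, 43.424, 46.969, 108.493],
    [11.066, 35.73, 64.784, 151.31],
    [48.447, 229.74, 293.19],
]
spans = sample_timeframes(time_pts, n=6, T=437.47, rng=np.random.default_rng(0))
print(spans)
```

If fewer than n - 1 candidate points are available, `choice` raises a `ValueError`, so a caller may want to check n first. A time value that occurs in more than one row could be drawn twice and produce a zero-length span; the sample data has no such duplicates.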

How do I sort a list probabilistically based on properties of their elements?

Consider this sample of User objects:
import numpy as np
from random import random
class User:
    def __init__(self, name, rating, actual_rating):
        self.name: str = name
        self.rating: int = rating
        # Actual States
        self.actual_rating: int = actual_rating
users = []
for actual_rating in np.random.binomial(10000, 0.157, 1000):
    users.append(
        User(str(random()), 1500, actual_rating)
    )
# Sorting Users Randomly
sorted_users = []
How do I sort this users list such that the likelihood of an object getting a lower index in sorted_users increases with its actual_rating? For instance, a random User("0.5465454", 1500, 1678) should have a higher likelihood of being sorted to index 0 of the sorted_users list than, say, User("0.7689989", 1500, 1400).
If possible is there a neat and readable way to do this?

Generating a random value for each user, then sorting according to this value
How about doing a first pass where you generate, for each user, a random number from a Gaussian distribution with mean actual_rating? Then you sort according to this random number instead of sorting according to actual_rating directly.
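stddev = 1.0 # the larger this number, the more shuffled the list - the smaller, the more sorted the list
# reverse=True so that users with a higher actual_rating tend to land at lower indices
sorted_users = sorted(users, key=lambda u:np.random.normal(u.actual_rating, stddev), reverse=True)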
Note the parameter stddev which you can adjust to suit your needs. The higher this parameter, the more shuffled the list in the end.
Sorting the list, then shuffling it lightly
Inspired by How to lightly shuffle a list in python?
Sort the list according to actual_rating, then shuffle it lightly.
from random import random
# highest actual_rating first, as asked in the question
sorted_users = sorted(users, key=lambda u:u.actual_rating, reverse=True)
nb_passes = 3
proba_swap = 0.25
for k in range(nb_passes):
    for i in range(k%2, len(sorted_users) - 1, 2):
        if random() < proba_swap:
            sorted_users[i], sorted_users[i+1] = sorted_users[i+1], sorted_users[i]
Note the two parameters nb_passes (positive integer) and proba_swap (between 0.0 and 1.0) which you can adjust to better suit your needs.
Instead of using a fixed parameter proba_swap, you could make up a formula for a probability of swapping that depends on how close the actual_ratings of the two users are, for instance def proba_swap(r1,r2): return math.exp(-a*(r1-r2)**2)/2.0 for some positive parameter a.
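A minimal sketch of that distance-dependent variant follows. To stay self-contained it works on a plain list of ratings rather than User objects; the function name `light_shuffle`, the default value of `a` and the `rng` parameter are assumptions of the sketch.

```python
import math
import random

def proba_swap(r1, r2, a=0.001):
    """Swap probability: 0.5 for equal ratings, decaying as they move apart."""
    return math.exp(-a * (r1 - r2) ** 2) / 2.0

def light_shuffle(ratings, nb_passes=3, a=0.001, rng=random):
    """Sort descending, then swap neighbours with a rating-dependent probability."""
    out = sorted(ratings, reverse=True)
    for k in range(nb_passes):
        # alternate between even and odd neighbour pairs on each pass
        for i in range(k % 2, len(out) - 1, 2):
            if rng.random() < proba_swap(out[i], out[i + 1], a):
                out[i], out[i + 1] = out[i + 1], out[i]
    return out

ratings = [1400, 1500, 1678, 1550, 1601, 1570, 1565]
print(light_shuffle(ratings, rng=random.Random(0)))
```

With this formula two users whose ratings differ by a few points swap almost half the time, while users that are far apart practically never do. For User objects the same loop applies with `u.actual_rating` passed to `proba_swap`.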
Or alternatively:
from random import choices
# highest actual_rating first, as asked in the question
sorted_users = sorted(users, key=lambda u:u.actual_rating, reverse=True)
nb_swaps = int(1.5 * len(sorted_users)) # parameter to experiment with
for i in choices(range(len(sorted_users)-1), k=nb_swaps):
    sorted_users[i], sorted_users[i+1] = sorted_users[i+1], sorted_users[i]
See also
After searching a little bit, I found this similar question:
Randomly sort a list with bias

Python dictionaries not copied properly causing repetitions, how to get this right?

I'm writing a function which is supposed to compare lists (significant genes for a test) and list out common elements (genes) for all possible combinations of the selection of lists.
These results are to be used for a venn diagram thingy...
The number of tests and genes is flexible.
The input JSON file looks something like this:
| test | genes |
|----------------- |--------------------------------------------------- |
| p-7trt_1/0con_1 | [ENSMUSG00000000031, ENSMUSG00000000049, ENSMU... |
| p-7trt_2/0con_1 | [ENSMUSG00000000031, ENSMUSG00000000037, ENSMU... |
| p-7trt_1/0con_2 | [ENSMUSG00000000037, ENSMUSG00000000049, ENSMU... |
| p-7trt_2/0con_2 | [ENSMUSG00000000028, ENSMUSG00000000031, ENSMU... |
| p-7trt_1/0con_3 | [ENSMUSG00000000088, ENSMUSG00000000094, ENSMU... |
| p-7trt_2/0con_3 | [ENSMUSG00000000028, ENSMUSG00000000031, ENSMU... |
So the function is as follows:
import pandas as pd
def get_venn_compiled_data(dir_loc):
    """return json of compiled data for the venn thing
    """
    data_frame = pd.read_json(dir_loc + "/venn.json", orient="records")
    number_of_tests = data_frame.shape[0]
    venn_data = []
    venn_data_point = {"tests": [], "genes": []} # list of genes which are common across listed tests
    binary = lambda x: bin(x)[2:] # to directly get the binary number
    for dec_number in range(1, 2 ** number_of_tests):
        # resetting
        venn_data_point["tests"] = []
        venn_data_point["genes"] = []
        # using a binary number to get all the cases
        for index, state in enumerate(binary(dec_number)):
            if state == "0":
                continue
            # putting in all the genes from the first test
            if venn_data_point["tests"] == []:
                venn_data_point["genes"] = data_frame["data"][index].copy()
            # removing the ones which are not common in current genes state and this.tests
            else:
                for gene_index, gene in enumerate(venn_data_point["genes"]):
                    if gene not in data_frame["data"][index]:
                        venn_data_point["genes"].pop(gene_index)
            # putting the test in the tests list
            venn_data_point["tests"].append(data_frame["name"][index])
        venn_data.append(venn_data_point.copy())
    return venn_data
I'm basically abusing the fact that binary numbers generate all possible combinations of 1's and 0's: every place of the binary number corresponds to a test, and for every binary number, if a 0 is present at a place, the list corresponding to that test is not taken for list comparison.
I tried my best to explain, please ask in the comments if I was not clear.
After running the function I am getting an output in which there are random places where test sets are repeated.
This is the test input file, and this is what came out as the output.
Any help is highly appreciated. Thank you.

I realized what error I was making.
I assumed that the binary function would magically always generate the string with the number of places I needed, which it doesn't.
After updating the binary function to add those zeros, things are working fine.
import pandas as pd
def get_venn_compiled_data(dir_loc):
    """return json of compiled data for the venn thing
    """
    # internal variables
    data_frame = pd.read_json(dir_loc + "/venn.json", orient="records")
    number_of_tests = data_frame.shape[0]
    venn_data = []
    # defining internal function
    def binary(dec_no, length=number_of_tests):
        """Just to convert decimal number to binary of specified length
        """
        bin_number = bin(dec_no)[2:]
        if len(bin_number) < length:
            bin_number = "0" * (length - len(bin_number)) + bin_number
        return bin_number
    # list of genes which are common across listed tests
    venn_data_point = {
        "tests": [],
        "genes": [],
    }
    for dec_number in range(1, 2 ** number_of_tests):
        # resetting
        venn_data_point["tests"] = []
        venn_data_point["genes"] = []
        # using a binary number to get all the cases
        for index, state in enumerate(binary(dec_number)):
            if state == "0":
                continue
            # putting in all the genes from the first test
            if venn_data_point["tests"] == []:
                venn_data_point["genes"] = data_frame["data"][index].copy()
            # keeping only the genes that are also present in this test
            # (popping from the list while iterating over it would skip elements)
            else:
                venn_data_point["genes"] = [
                    gene for gene in venn_data_point["genes"]
                    if gene in data_frame["data"][index]
                ]
            # putting the test in the tests list
            venn_data_point["tests"].append(data_frame["name"][index])
        venn_data.append(venn_data_point.copy())
    return venn_data
If anyone else has a more optimized algorithm for this, help is appreciated.
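One possible simplification, sketched here rather than offered as a drop-in replacement, is to let itertools.combinations enumerate the subsets of tests and to use set intersection for the common genes. The function name `common_genes` and the dict input are assumptions made for the illustration; the real data would first have to be read from venn.json.

```python
from itertools import combinations

def common_genes(tests):
    """tests: dict mapping a test name to an iterable of gene ids.

    Returns one entry per non-empty combination of tests, holding the
    genes shared by every test in that combination.
    """
    gene_sets = {name: set(genes) for name, genes in tests.items()}
    names = list(gene_sets)
    result = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            shared = set.intersection(*(gene_sets[name] for name in combo))
            result.append({"tests": list(combo), "genes": sorted(shared)})
    return result

# made-up toy input, only to show the shape of the result
toy = {
    "t1": ["g1", "g2", "g3"],
    "t2": ["g2", "g3", "g4"],
    "t3": ["g3", "g5"],
}
for entry in common_genes(toy):
    print(entry)
```

Set membership tests are constant time on average, so this avoids the repeated `in` scans over lists, although the number of combinations still grows as 2 to the power of the number of tests.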

How to run Standard Normal Homogeneity Test for a time series data

I have time series data (sample data) for a wind variable, covering nearly 40 stations and 36 years (details in sample screenshot).
I need to run the Standard Normal Homogeneity Test and Pettitt's Test on this data, as per recommendations. Are they available in Python?
I couldn't find any code for these tests in the Python documentation or packages.
I need some help to find out whether any package provides these tests.
There are codes in R as follows:
snht(data, period, robust = F, time = NULL, scaled = TRUE, rmSeasonalPeriod = Inf, ...)
However, no results so far... only errors.

Regarding the Pettitt test, I found this python implementation.
I believe there is a small typo: the t+1 on line 19 should actually just be t.
I have also developed a faster, vectorised implementation:
import numpy as np
def pettitt_test(X):
    """
    Pettitt test calculated following Pettitt (1979): https://www.jstor.org/stable/2346729?seq=4#metadata_info_tab_contents
    """
    T = len(X)
    U = []
    for t in range(T): # t is used to split X into two subseries
        # float dtype so that non-integer data is not truncated
        X_stack = np.zeros((t, len(X[t:]) + 1), dtype=float)
        X_stack[:,0] = X[:t] # first column is each element of the first subseries
        X_stack[:,1:] = X[t:] # all columns after the first hold the second subseries
        U.append(np.sign(X_stack[:,0] - X_stack[:,1:].transpose()).sum()) # sign test between each element of the first subseries and all elements of the second subseries, summed.
    tau = np.argmax(np.abs(U)) # location of change (first data point of second sub-series)
    K = np.max(np.abs(U))
    p = 2 * np.exp(-6 * K**2 / (T**3 + T**2))
    return (tau, p)
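The question also asks about the Standard Normal Homogeneity Test, which the code above does not cover. Below is a minimal sketch of the single-shift SNHT statistic, T(k) = k * mean(z[:k])**2 + (n - k) * mean(z[k:])**2 on the standardised series z. Standardising with the sample standard deviation (ddof=1) is an assumption of the sketch, the function name `snht_statistic` is made up, and no critical values are included, so the result still has to be compared against published tables for the series length.

```python
import numpy as np

def snht_statistic(x):
    """Return (k, T0): the split size with the largest SNHT statistic and its value.

    k is the number of points in the first segment, i.e. the index of the
    first point of the second segment.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = (x - x.mean()) / x.std(ddof=1)      # standardised series
    k = np.arange(1, n)                     # candidate split sizes 1..n-1
    csum = np.cumsum(z)
    mean_before = csum[:-1] / k             # mean of z[:k]
    mean_after = (csum[-1] - csum[:-1]) / (n - k)   # mean of z[k:]
    t = k * mean_before ** 2 + (n - k) * mean_after ** 2
    best = int(np.argmax(t))
    return int(k[best]), float(t[best])

# synthetic series with an obvious shift after the tenth value
series = [0.0] * 10 + [10.0] * 10
print(snht_statistic(series))
```

For multi-station data like the one in the question, the function would be applied to one station's series at a time.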

put stockprices into groups when they are within 0.5% of each other

Thanks for the answers. I have not used StackOverflow before, so I was surprised by the number of answers and their speed. It's fantastic.
I have not been through the answers properly yet, but thought I should add some information to the problem specification. See the image below.
I can't post an image in this because I don't have enough points, but you can see an image
at http://journal.acquitane.com/2010-01-20/image003.jpg
This image may describe more closely what I'm trying to achieve. The horizontal lines across the page are price points on the chart. Where you get a clustering of lines within 0.5% of each other, this is considered to be a good thing, which is why I want to identify those clusters automatically. You can see on the chart that there is a cluster at S2 & MR1, R2 & WPP1.
So every day I produce these price points and then I can manually identify those that are within 0.5%, but the purpose of this question is how to do it with a Python routine.
I have reproduced the list again (see below) with labels. Just be aware that the list price points don't match the price points in the image because they are from two different days.
[YR3,175.24,8]
[SR3,147.85,6]
[YR2,144.13,8]
[SR2,130.44,6]
[YR1,127.79,8]
[QR3,127.42,5]
[SR1,120.94,6]
[QR2,120.22,5]
[MR3,118.10,3]
[WR3,116.73,2]
[DR3,116.23,1]
[WR2,115.93,2]
[QR1,115.83,5]
[MR2,115.56,3]
[DR2,115.53,1]
[WR1,114.79,2]
[DR1,114.59,1]
[WPP,113.99,2]
[DPP,113.89,1]
[MR1,113.50,3]
[DS1,112.95,1]
[WS1,112.85,2]
[DS2,112.25,1]
[WS2,112.05,2]
[DS3,111.31,1]
[MPP,110.97,3]
[WS3,110.91,2]
[50MA,110.87,4]
[MS1,108.91,3]
[QPP,108.64,5]
[MS2,106.37,3]
[MS3,104.31,3]
[QS1,104.25,5]
[SPP,103.53,6]
[200MA,99.42,7]
[QS2,97.05,5]
[YPP,96.68,8]
[SS1,94.03,6]
[QS3,92.66,5]
[YS1,80.34,8]
[SS2,76.62,6]
[SS3,67.12,6]
[YS2,49.23,8]
[YS3,32.89,8]
I did make a mistake with the original list in that Group C is wrong and should not be included. Thanks for pointing that out.
Also, the 0.5% is not fixed; this value will change from day to day, but I have just used 0.5% as an example for spec'ing the problem.
Thanks again.
Mark
PS. I will get cracking on checking the answers now.
Hi:
I need to do some manipulation of stock prices. I have just started using Python, (but I think I would have trouble implementing this in any language). I'm looking for some ideas on how to implement this nicely in python.
Thanks
Mark
Problem:
I have a list of lists (FloorLevels (see below)) where each sublist has two items (stockprice, weight). I want to put the stockprices into groups when they are within 0.5% of each other. A group's strength will be determined by its total weight. For example:
Group-A
115.93,2
115.83,5
115.56,3
115.53,1
-------------
TotalWeight:12
-------------
Group-B
113.50,3
112.95,1
112.85,2
-------------
TotalWeight:6
-------------
FloorLevels[
[175.24,8]
[147.85,6]
[144.13,8]
[130.44,6]
[127.79,8]
[127.42,5]
[120.94,6]
[120.22,5]
[118.10,3]
[116.73,2]
[116.23,1]
[115.93,2]
[115.83,5]
[115.56,3]
[115.53,1]
[114.79,2]
[114.59,1]
[113.99,2]
[113.89,1]
[113.50,3]
[112.95,1]
[112.85,2]
[112.25,1]
[112.05,2]
[111.31,1]
[110.97,3]
[110.91,2]
[110.87,4]
[108.91,3]
[108.64,5]
[106.37,3]
[104.31,3]
[104.25,5]
[103.53,6]
[99.42,7]
[97.05,5]
[96.68,8]
[94.03,6]
[92.66,5]
[80.34,8]
[76.62,6]
[67.12,6]
[49.23,8]
[32.89,8]
]

I suggest a repeated use of k-means clustering -- let's call it KMC for short. KMC is a simple and powerful clustering algorithm... but it needs to "be told" how many clusters, k, you're aiming for. You don't know that in advance (if I understand you correctly) -- you just want the smallest k such that no two items "clustered together" are more than X% apart from each other. So, start with k equal 1 -- everything bunched together, no clustering pass needed;-) -- and check the diameter of the cluster (a cluster's "diameter", from the use of the term in geometry, is the largest distance between any two members of a cluster).
If the diameter is > X%, set k += 1, perform KMC with k as the number of clusters, and repeat the check, iteratively.
In pseudo-code:
def markCluster(items, threshold):
    k = 1
    clusters = [items]
    maxdist = diameter(items)
    while maxdist > threshold:
        k += 1
        clusters = Kmc(items, k)
        maxdist = max(diameter(c) for c in clusters)
    return clusters
assuming of course we have suitable diameter and Kmc Python functions.
Does this sound like the kind of thing you want? If so, then we can move on to show you how to write diameter and Kmc (in pure Python if you have a relatively limited number of items to deal with, otherwise maybe by exploiting powerful third-party add-on frameworks such as numpy) -- but it's not worthwhile to go to such trouble if you actually want something pretty different, whence this check!-)
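To make the idea concrete, here is one possible pure-Python sketch of the two helpers for one-dimensional prices. It assumes the diameter is measured relatively, as (max - min) / min, so that a threshold of 0.005 means 0.5%, and it uses plain Lloyd iterations with centres initially spread over the sorted data. The names `diameter`, `kmc` and `mark_cluster` and these choices are assumptions of the sketch, not the only way to do it.

```python
def diameter(cluster):
    """Relative spread of a cluster of prices: (max - min) / min."""
    return (max(cluster) - min(cluster)) / min(cluster)

def kmc(items, k, iterations=50):
    """Plain 1-D k-means (Lloyd's algorithm); returns the non-empty clusters."""
    ordered = sorted(items)
    n = len(ordered)
    # spread the initial centres evenly over the sorted data
    centres = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    clusters = []
    for _ in range(iterations):
        buckets = [[] for _ in centres]
        for x in ordered:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            buckets[nearest].append(x)
        clusters = [b for b in buckets if b]      # drop empty clusters
        new_centres = [sum(b) / len(b) for b in clusters]
        if new_centres == centres:                # converged
            break
        centres = new_centres
    return clusters

def mark_cluster(items, threshold):
    """Increase k until no cluster is wider than the threshold."""
    k = 1
    clusters = [list(items)]
    while max(diameter(c) for c in clusters) > threshold and k < len(items):
        k += 1
        clusters = kmc(items, k)
    return clusters

prices = [115.93, 115.83, 115.56, 115.53, 113.50, 112.95, 112.85]
for cluster in mark_cluster(prices, 0.005):
    print(cluster)
```

The loop is capped at k equal to the number of items, where every price forms its own cluster, so it always terminates.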

A stock s belongs in a group G if for each stock t in G, s * 1.05 >= t and s / 1.05 <= t, right?
How do we add the stocks to each group? If we have the stocks 95, 100, 101, and 105, and we start a group with 100, then add 101, we will end up with {100, 101, 105}. If we did 95 after 100, we'd end up with {100, 95}.
Do we just need to consider all possible permutations? If so, your algorithm is going to be inefficient.

You need to specify your problem in more detail. Just what does "put the stockprices into groups when they are within 0.5% of each other" mean?
Possibilities:
(1) each member of the group is within 0.5% of every other member of the group
(2) sort the list and split it where the gap is more than 0.5%
Note that 116.23 is within 0.5% of 115.93 -- abs((116.23 / 115.93 - 1) * 100) < 0.5 -- but you have put one number in Group A and one in Group C.
Simple example: a, b, c = (0.996, 1, 1.004) ... Note that a and b fit, b and c fit, but a and c don't fit. How do you want them grouped, and why? Is the order in the input list relevant?
Possibility (1) produces ab,c or a,bc ... tie-breaking rule, please
Possibility (2) produces abc (no big gaps, so only one group)
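The difference between the two readings can be sketched in a few lines. The function names are made up for the illustration, and for possibility (1) the tie-break is assumed to be "scan the prices in ascending order and start a new group as soon as a price no longer fits the smallest member of the current group", which yields ab,c on the example.

```python
def within(a, b, tol=0.005):
    """True if a is within tol (0.5% by default) of b."""
    return abs(a / b - 1) < tol

def group_all_pairs(prices, tol=0.005):
    """Possibility (1): every member is within tol of every other member."""
    groups = []
    for p in sorted(prices):
        # comparing with the smallest member is enough for sorted input
        if groups and within(p, groups[-1][0], tol):
            groups[-1].append(p)
        else:
            groups.append([p])
    return groups

def group_by_gaps(prices, tol=0.005):
    """Possibility (2): split only where the gap to the previous price exceeds tol."""
    groups = []
    for p in sorted(prices):
        if groups and within(p, groups[-1][-1], tol):
            groups[-1].append(p)
        else:
            groups.append([p])
    return groups

a, b, c = 0.996, 1, 1.004
print(group_all_pairs([a, b, c]))   # [[0.996, 1], [1.004]]
print(group_by_gaps([a, b, c]))     # [[0.996, 1, 1.004]]
```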

You won't be able to classify them into hard "groups". If you have prices (1.0,1.05, 1.1) then the first and second should be in the same group, and the second and third should be in the same group, but not the first and third.
A quick, dirty way to do something that you might find useful:
def make_group_function(tolerance = 0.05):
    from math import log10, floor
    # I forget why this works.
    tolerance_factor = -1.0/(-log10(1.0 + tolerance))
    # well ... since you might ask
    # we want: log(x)*tf - log(x*(1+t))*tf = -1,
    # so every 5% change has a different group. The minus is just so groups
    # are ascending .. it looks a bit nicer.
    #
    # tf = -1/(log(x)-log(x*(1+t)))
    # tf = -1/(log(x/(x*(1+t))))
    # tf = -1/(log(1/(1*(1+t)))) # solved .. but let's just be more clever
    # tf = -1/(0-log(1*(1+t)))
    # tf = -1/(-log(1+t))
    def group_function(value):
        # don't just use int - it rounds up below zero, and down above zero
        return int(floor(log10(value)*tolerance_factor))
    return group_function
Usage:
group_function = make_group_function()
import random
groups = {}
for i in range(50):
    v = random.random()*500+1000
    group = group_function(v)
    if group in groups:
        groups[group].append(v)
    else:
        groups[group] = [v]
for group in sorted(groups):
    print('Group', group)
    for v in sorted(groups[group]):
        print(v)
    print()

For a given set of stock prices, there is probably more than one way to group stocks that are within 0.5% of each other. Without some additional rules for grouping the prices, there's no way to be sure an answer will do what you really want.

Apart from the proper way to pick which values fit together, this is a problem where a little object orientation can make it a lot easier to deal with.
I made two classes here, with a minimum of desirable behaviors, which can make the classification a lot easier: you get a single point to play with it on the Group class.
I can see the code below is incorrect, in the sense that the limits for group inclusion vary as new members are added. Even if the separation criterion remains the same, you have to rewrite the get_groups method to use a multi-pass approach. It should not be hard, but the code would be too long to be helpful here, and I think this snippet is enough to get you going:
from copy import copy
class Group(object):
    def __init__(self,data=None, name=""):
        if data:
            self.data = data
        else:
            self.data = []
        self.name = name
    def get_mean_stock(self):
        return sum(item[0] for item in self.data) / len(self.data)
    def fits(self, item):
        if 0.995 < abs(item[0]) / self.get_mean_stock() < 1.005:
            return True
        return False
    def get_weight(self):
        return sum(item[1] for item in self.data)
    def __repr__(self):
        return "Group-%s\n%s\n---\nTotalWeight: %d\n\n" % (
            self.name,
            "\n".join("%.02f, %d" % tuple(item) for item in self.data ),
            self.get_weight())
class StockGrouper(object):
    def __init__(self, data=None):
        if data:
            self.floor_levels = data
        else:
            self.floor_levels = []
    def get_groups(self):
        groups = []
        floor_levels = copy(self.floor_levels)
        name_ord = ord("A") - 1
        while floor_levels:
            seed = floor_levels.pop(0)
            name_ord += 1
            group = Group([seed], chr(name_ord))
            groups.append(group)
            to_remove = []
            for i, item in enumerate(floor_levels):
                if group.fits(item):
                    group.data.append(item)
                    to_remove.append(i)
            for i in reversed(to_remove):
                floor_levels.pop(i)
        return groups
testing:
floor_levels = [ [stock, weight] ,... <paste the data above> ]
s = StockGrouper(floor_levels)
s.get_groups()

For the grouping element, could you use itertools.groupby()? As the data is sorted, a lot of the work of grouping it is already done, and then you could test if the current value in the iteration was different to the last by <0.5%, and have itertools.groupby() break into a new group every time your function returned false.
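A minimal sketch of that suggestion follows. Since itertools.groupby groups consecutive items that share a key, the key here is a small stateful callable that bumps a group counter whenever the relative gap to the previous price reaches the tolerance; the class name `GapKey` is made up for the illustration, and the prices are a slice of the sorted list from the question.

```python
from itertools import groupby

class GapKey:
    """Stateful key for groupby: the group id increases whenever the
    relative gap to the previous price is tol or more."""
    def __init__(self, tol=0.005):
        self.tol = tol
        self.prev = None
        self.group_id = 0

    def __call__(self, price):
        if self.prev is not None and abs(price / self.prev - 1) >= self.tol:
            self.group_id += 1
        self.prev = price
        return self.group_id

# a slice of the (already sorted) price list from the question
prices = [115.93, 115.83, 115.56, 115.53, 114.79, 114.59, 113.99, 113.89]
groups = [list(g) for _, g in groupby(prices, key=GapKey())]
for g in groups:
    print(g)
```

This compares each price only with its immediate neighbour, so a long chain of small gaps ends up in a single group even if its two ends are more than 0.5% apart.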
