Background
I have a function called get_player_path that takes a list of strings, player_file_list, and an int value, total_players. For the sake of example I have reduced the list of strings and set the int value to a very small number.
Each string in player_file_list has one of two forms: year-date/player_id/some_random_file.file_extension or
year-date/player_id/IDATs/some_random_number/some_random_file.file_extension
Issue
What I am essentially trying to achieve here is to go through this list and store every unique year-date/player_id path in a set until its length reaches the value of total_players.
My current approach does not seem the most efficient to me, and I am wondering if I can speed up my function get_player_path in any way.
Code
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_file = player_file.split("/")
        file_path = f"{player_file[0]}/{player_file[1]}/"
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
player_file_list = [
"2020-10-27/31001804320549/31001804320549.json",
"2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
"2020-10-28/31001804320548/31001804320549.json",
"2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
"2020-10-29/31001804320547/31001804320549.json",
"2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
"2020-10-30/31001804320546/31001804320549.json",
"2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
"2020-10-31/31001804320545/31001804320549.json",
"2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
]
print(get_player_path(player_file_list, 2))
Output
['2020-10-27/31001804320549/', '2020-10-28/31001804320548/']
Let's analyze your function first:
your loop should take linear time (O(n)) in the length of the input list, assuming the path lengths are bounded by a relatively "small" number;
the sorting takes O(n log(n)) comparisons.
Thus the sorting has the dominant cost when the list becomes big. You can micro-optimize your loop as much as you want, but as long as you keep that sorting at the end, your effort won't make much of a difference with big lists.
Your approach is fine if you're just writing a Python script. If you really needed performance with huge lists, you would probably be using some other language. Nonetheless, if you really care about performance (or just want to learn new stuff), you could try one of the following approaches:
replace the generic sorting algorithm with something specific for strings; see here for example
use a trie, removing the need for sorting; this could be theoretically better but probably worse in practice.
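For illustration, a minimal sketch of the trie idea from the last bullet (the helper names are mine): insert each unique "date/player_id" prefix character by character, then a depth-first walk that visits children in character order yields the stored prefixes already sorted, so no separate sort pass is needed. Each node still sorts its own handful of children, which is part of why this tends to be worse in practice:
def trie_insert(root, prefix):
    # root is a dict-of-dicts; None marks the end of a stored prefix.
    node = root
    for ch in prefix:
        node = node.setdefault(ch, {})
    node[None] = True

def iter_sorted(node, acc=""):
    # Depth-first, children visited in character order -> prefixes come out sorted.
    if None in node:
        yield acc
    for ch in sorted(k for k in node if k is not None):
        yield from iter_sorted(node[ch], acc + ch)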
Just for completeness, as a micro-optimization, assuming the date has a fixed length of 10 characters:
def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        end = player_file.find('/', 12)  # <--- len(date) + len('/') + 1
        file_path = player_file[:end]    # <---
        player_files_to_process.add(file_path)
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)
If the IDs have fixed length too, as in your example list, then you don't need any split or find, just:
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id
...
for player_file in player_file_list:
    file_path = player_file[:LENGTH]
    ...
EDIT: fixed the LENGTH initialization, I had forgotten to add 1
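Putting the fixed-length idea together, the whole function might look like the sketch below. DATE_LENGTH and ID_LENGTH are read off the example data, so this only works as long as those lengths really are fixed:
DATE_LENGTH = 10  # e.g. "2020-10-27"
ID_LENGTH = 14    # e.g. "31001804320549", from the example list
LENGTH = DATE_LENGTH + ID_LENGTH + 1  # 1 is for the slash between date and id

def get_player_path(player_file_list, total_players):
    player_files_to_process = set()
    for player_file in player_file_list:
        player_files_to_process.add(player_file[:LENGTH])
        if len(player_files_to_process) == total_players:
            break
    return sorted(player_files_to_process)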
I'll leave this solution here; it can be further improved. Hope it helps.
player_file_list = (
"2020-10-27/31001804320549/31001804320549.json",
"2020-10-27/31001804320549/IDATs/204825150047/foo_bar_Red.idat",
"2020-10-28/31001804320548/31001804320549.json",
"2020-10-28/31001804320548/IDATs/204825150123/foo_bar_Red.idat",
"2020-10-29/31001804320547/31001804320549.json",
"2020-10-29/31001804320547/IDATs/204825150227/foo_bar_Red.idat",
"2020-10-30/31001804320546/31001804320549.json",
"2020-10-30/31001804320546/IDATs/123455150047/foo_bar_Red.idat",
"2020-10-31/31001804320545/31001804320549.json",
"2020-10-31/31001804320545/IDATs/597625150047/foo_bar_Red.idat",
)
def get_player_path(l, n):
    pfl = set()
    for i in l:
        i = "/".join(i.split("/")[0:2])
        if i not in pfl:
            pfl.add(i)
            if len(pfl) == n:
                return pfl
    if n > len(pfl):
        print("not enough matches")
        return
print(get_player_path(player_file_list, 2))
# {'2020-10-27/31001804320549', '2020-10-28/31001804320548'}
Use a dict so that you don't have to sort: your list is already sorted, and dict keys preserve insertion order. If you still need to sort you can always use sorted in the return statement. Add import re and replace your function as follows:
def get_player_path(player_file_list, total_players):
    dct = {re.search(r'^\w+-\w+-\w+/\w+', pf).group(): 1 for pf in player_file_list}
    return [k for i, k in enumerate(dct.keys()) if i < total_players]
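For example, with the player_file_list from the question (and import re in place), the call below should give the first two unique date/id prefixes; note there is no trailing slash, unlike the original output:
print(get_player_path(player_file_list, 2))
# ['2020-10-27/31001804320549', '2020-10-28/31001804320548']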
So we can generate a unique id with str(uuid.uuid4()), which is 36 characters long.
Is there another method to generate a unique ID which is shorter in terms of characters?
EDIT:
If ID is usable as primary key then even better
Granularity should be better than 1ms
This code could be distributed, so we can't assume time independence.
If this is for use as a primary key field in a db, consider just using an auto-incrementing integer instead.
str(uuid.uuid4()) is 36 chars but it has four useless dashes (-) in it, and it's limited to 0-9 a-f.
Better uuid4 in 32 chars:
>>> uuid.uuid4().hex
'b327fc1b6a2343e48af311343fc3f5a8'
Or just b64 encode and slice some urandom bytes (up to you to guarantee uniqueness):
>>> base64.b64encode(os.urandom(32))[:8]
b'iR4hZqs9'
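For completeness, a self-contained version of the urandom idea, as a sketch (the URL-safe alphabet is my choice, not part of the snippet above, and uniqueness is still only probabilistic):
import base64
import os

def short_random_id(length=8):
    # 8 base64 characters carry 48 bits of randomness, so collisions become
    # likely after roughly 2**24 (~16 million) IDs -- check uniqueness on insert.
    return base64.urlsafe_b64encode(os.urandom(length)).decode('ascii')[:length]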
TLDR
Most of the time it's better to work with numbers internally and encode them to short IDs externally. So here's a function for Python 3, PowerShell & VBA that will convert an int32 to an alphanumeric ID. Use it like this:
int32_to_id(225204568)
'F2AXP8'
For distributed code use ULIDs: https://github.com/mdipierro/ulid
They are much longer but unique across different machines.
How short are the IDs?
It will encode about half a billion IDs in 6 characters so it's as compact as possible while still using only non-ambiguous digits and letters.
How can I get even shorter IDs?
If you want even more compact IDs/codes/Serial Numbers, you can easily expand the character set by just changing the chars="..." definition. For example if you allow all lower and upper case letters you can have 56 billion IDs within the same 6 characters. Adding a few symbols (like ~!##$%^&*()_+-=) gives you 208 billion IDs.
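For anyone who wants to verify those counts, they are just (alphabet size) ** 6; the alphabet sizes below are my reading of the prose (28 characters in the set used by the code, 62 alphanumerics, 62 plus roughly 15 symbols):
>>> 28**6   # digits + non-ambiguous upper-case letters
481890304
>>> 62**6   # 0-9, a-z, A-Z
56800235584
>>> 77**6   # alphanumerics plus ~15 symbols
208422380089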
So why didn't you go for the shortest possible IDs?
The character set I'm using in my code has an advantage: It generates IDs that are easy to copy-paste (no symbols so double clicking selects the whole ID), easy to read without mistakes (no look-alike characters like 2 and Z) and rather easy to communicate verbally (only upper case letters). Sticking to numeric digits only is your best option for verbal communication but they are not compact.
I'm convinced: show me the code
Python 3
def int32_to_id(n):
    if n == 0:
        return "0"
    chars = "0123456789ACEFHJKLMNPRTUVWXY"
    length = len(chars)
    result = ""
    remain = n
    while remain > 0:
        pos = remain % length
        remain = remain // length
        result = chars[pos] + result
    return result
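The decoder is only shown in VBA further down; a Python equivalent (my own addition, using the same alphabet) might look like this:
def id_to_int32(id_str):
    chars = "0123456789ACEFHJKLMNPRTUVWXY"
    length = len(chars)
    result = 0
    for ch in id_str:
        result = result * length + chars.index(ch)
    return result

# round-trip check: id_to_int32(int32_to_id(225204568)) == 225204568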
PowerShell
function int32_to_id($n){
    $chars = "0123456789ACEFHJKLMNPRTUVWXY"
    $length = $chars.length
    $result = ""; $remain = [int]$n
    do {
        $pos = $remain % $length
        $remain = [int][Math]::Floor($remain / $length)
        $result = $chars[$pos] + $result
    } while ($remain -gt 0)
    $result
}
VBA
Function int32_to_id(n)
    Dim chars$, length, result$, remain, pos
    If n = 0 Then int32_to_id = "0": Exit Function
    chars$ = "0123456789ACEFHJKLMNPRTUVWXY"
    length = Len(chars$)
    result$ = ""
    remain = n
    Do While (remain > 0)
        pos = remain Mod length
        remain = Int(remain / length)
        result$ = Mid(chars$, pos + 1, 1) + result$
    Loop
    int32_to_id = result
End Function

Function id_to_int32(id$)
    Dim chars$, length, result, remain, pos, value, power
    chars$ = "0123456789ACEFHJKLMNPRTUVWXY"
    length = Len(chars$)
    result = 0
    power = 1
    For pos = Len(id$) To 1 Step -1
        result = result + (InStr(chars$, Mid(id$, pos, 1)) - 1) * power
        power = power * length
    Next
    id_to_int32 = result
End Function

Public Sub test_id_to_int32()
    Dim i
    For i = 0 To 28 ^ 3
        If id_to_int32(int32_to_id(i)) <> i Then Debug.Print "Error, i=", i, "int32_to_id(i)", int32_to_id(i), "id_to_int32('" & int32_to_id(i) & "')", id_to_int32(int32_to_id(i))
    Next
    Debug.Print "Done testing"
End Sub
Yes. Just use the current UTC millis. This number never repeats.
const uniqueID = new Date().getTime();
EDIT
If you have the rather seldom requirement to produce more than one ID within the same millisecond, this method is of no use as this number's granularity is 1 ms.
I have looked on this website for something similar, and attempted to debug using previous answers, and failed.
I'm testing (I did not write this module) a module that changes the grade value of a course's grades from, say, a B- to a B, but never going across base grade levels (e.g., B+ to an A-).
The original module is called transcript.py
I'm testing it in my own testtranscript.py
I'm testing that module by importing it: 'import transcript' and 'import cornelltest'
I have ensured that all files are in the same folder/directory.
There is the function raise_grade present in transcript.py (there are multiple definitions in this module, but raise_grade is the only one giving me any trouble).
ti is in the form ('class name', 'gradvalue')
There's already another definition converting floats to strings and back (i.e. 3.0 --> B).
def raise_grade(ti):
    """Raise gradeval of transcript line ti by a non-noticeable amount."""
    # value of the base letter grade, e.g., 4 (or 4.0) for a 4.3
    bval = int(ti.gradeval)
    print 'bval is:"' + str(bval) + '"'
    # part after decimal point in raised grade, e.g., 3 (or 3.0) for a 4.3
    newdec = min(int((ti.gradeval + .3)*10) % 10, 3)
    print 'newdec is:"' + str(newdec) + '"'
    # get result by adding the two values together, after shifting newdec one
    # decimal place
    newval = bval + round(newdec/10.0, 1)
    ti.gradeval = newval
    print 'newval is:"' + str(newval) + '"'
I will probably get rid of the print later.
When I run testtranscript, which imports transcript:
def test_raise():
    """test raise_grade"""
    testobj = transcript.Titem('CS1110','B-')
    transcript.raise_grade('CS1110','B-')
    cornelltest.assert_floats_equal(3.0, transcript.lettergrade_to_val("B-"))
I get this from the cmd shell:
TypeError: raise_grade takes exactly 1 argument (2 given)
Edit1: So now I see that I am giving it two parameters when raise_grade(ti) takes just one, but perhaps it would shed more light if I just put up the rest of the code. I'm still stuck as to why I get an error saying 'str' object has no attribute 'gradeval'.
LETTER_LIST = ['B', 'A']
# List of valid modifiers to base letter grades.
MODIFIER_LIST = ['-', '+']

def lettergrade_to_val(lg):
    """Returns: numerical value of letter grade lg.
    The usual numerical scheme is assumed: A+ -> 4.3, A -> 4.0, A- -> 3.7, etc.
    Precondition: lg is a 1 or 2-character string consisting of a "base" letter
    in LETTER_LIST optionally followed by a modifier in MODIFIER_LIST."""
    # if LETTER_LIST or MODIFIER_LIST change, the implementation of
    # this function must change.
    # get value of base letter. Trick: index in LETTER_LIST is shifted from value
    bv = LETTER_LIST.index(lg[0]) + 3
    # Trick with indexing in MODIFIER_LIST to get the modifier value
    return bv + ((MODIFIER_LIST.index(lg[1]) - .5)*.3/.5 if (len(lg) == 2) else 0)
class Titem(object):
    """A Titem is an 'item' on a transcript, like "CS1110 A+"
    Instance variables:
        course [string]: course name. Always at least 1 character long.
        gradeval [float]: the numerical equivalent of the letter grade.
            Valid letter grades are 1 or 2 chars long, and consist
            of a "base" letter in LETTER_LIST optionally followed
            by a modifier in MODIFIER_LIST.
            We store values instead of letter grades to facilitate
            calculations of GPA later.
            (In "real" life, one would write a function that,
            when displaying a Titem, would display the letter
            grade even though the underlying representation is
            numerical, but we're keeping things simple for this
            lab.)
    """

    def __init__(self, n, lg):
        """Initializer: A new transcript line with course (name) n, gradeval
        the numerical equivalent of letter grade lg.
        Preconditions: n is a non-empty string.
            lg is a string consisting of a "base" letter in LETTER_LIST
            optionally followed by modifier in MODIFIER_LIST.
        """
        # assert statements that cause an error when preconditions are violated
        assert type(n) == str and type(lg) == str, 'argument type error'
        assert (len(n) >= 1 and 0 < len(lg) <= 2 and lg[0] in LETTER_LIST and
                (len(lg) == 1 or lg[1] in MODIFIER_LIST)), 'argument value error'
        self.course = n
        self.gradeval = lettergrade_to_val(lg)
Edit2: I understand the original problem now, but it seems that the original writer got the code wrong, since raise_grade doesn't work properly for grade values going from 3.7 to 4.0: bval takes the original float and truncates it to an int, which doesn't give the right base value in this case (raising 3.7 should give a base of 4, not 3).
You are calling the function incorrectly, you should be passing the testobj:
def test_raise():
    """test raise_grade"""
    testobj = transcript.Titem('CS1110','B-')
    transcript.raise_grade(testobj)
    ...
The raise_grade function is expecting a single argument ti which has a gradeval attribute, i.e. a Titem instance.
I'm currently transitioning from Java to Python and have taken on the task of trying to create a calculator that can carry out symbolic operations on infix-notated mathematical expressions (without using third-party modules like SymPy). Currently, it's built to accept space-delimited strings and can only carry out the (, ), +, -, *, and / operators. Unfortunately, I can't figure out the basic algorithm for simplifying symbolic expressions.
For example, given the string '2 * ( ( 9 / 6 ) + 6 * x )', my program should carry out the following steps:
2 * ( 1.5 + 6 * x )
3 + 12 * x
But I can't get the program to ignore the x when distributing the 2. In addition, how can I handle 'x * 6 / x' so it returns '6' after simplification?
EDIT: To clarify, by "symbolic" I meant that it will leave letters like "A" and "f" in the output while carrying out the remaining calculations.
EDIT 2: I (mostly) finished the code. I'm posting it here if anyone stumbles on this post in the future, or if any of you were curious.
def reduceExpr(useArray):
    # Use Python's native eval() to compute if no letters are detected.
    if (not hasLetters(useArray)):
        return [calculate(useArray)]  # Different from eval() because it returns string version of result
    # Base case. Returns useArray if the list size is 1 (i.e., it contains one string).
    if (len(useArray) == 1):
        return useArray
    # Base case. Returns the space-joined elements of useArray as a list with one string.
    if (len(useArray) == 3):
        return [' '.join(useArray)]
    # Checks to see if parentheses are present in the expression & sets.
    # Counts number of parentheses & keeps track of first ( found.
    parentheses = 0
    leftIdx = -1
    # This try/except block is essentially an if/else block. Since useArray.index('(') raises a ValueError
    # if it can't find '(' in useArray, the next line is not carried out, and parentheses is not incremented.
    try:
        leftIdx = useArray.index('(')
        parentheses += 1
    except Exception:
        pass
    # If a ValueError was raised, leftIdx = -1 and rightIdx = parentheses = 0.
    rightIdx = leftIdx + 1
    while (parentheses > 0):
        if (useArray[rightIdx] == '('):
            parentheses += 1
        elif (useArray[rightIdx] == ')'):
            parentheses -= 1
        rightIdx += 1
    # Provided parentheses pair isn't empty, runs contents through again; else, removes the parentheses
    if (leftIdx > -1 and rightIdx - leftIdx > 2):
        return reduceExpr(useArray[:leftIdx] + [' '.join(['(', reduceExpr(useArray[leftIdx+1:rightIdx-1])[0], ')'])] + useArray[rightIdx:])
    elif (leftIdx > -1):
        return reduceExpr(useArray[:leftIdx] + useArray[rightIdx:])
    # If operator is + or -, hold the first two elements and process the rest of the list first
    if isAddSub(useArray[1]):
        return reduceExpr(useArray[:2] + reduceExpr(useArray[2:]))
    # Else, if operator is * or /, process the first 3 elements first, then the rest of the list
    elif isMultDiv(useArray[1]):
        return reduceExpr(reduceExpr(useArray[:3]) + useArray[3:])
    # Just placed this so the function has an explicit return (since this was called by yet another function).
    return None
You need much more processing before you go into operations on symbols. The form you want to get to is a tree of operations with values in the leaf nodes. First you need to do a lexer run on the string to get elements - although if you always have space-separated elements it might be enough to just split the string. Then you need to parse that array of tokens using some grammar you require.
If you need theoretical information about grammars and parsing text, start here: http://en.wikipedia.org/wiki/Parsing If you need something more practical, go to https://github.com/pyparsing/pyparsing (you don't have to use the pyparsing module itself, but their documentation has a lot of interesting info) or http://www.nltk.org/book
From 2 * ( ( 9 / 6 ) + 6 * x ), you need to get to a tree like this:
        *
   2         +
         /       *
        9 6     6 x
Then you can visit each node and decide if you want to simplify it. Constant operations will be the simplest ones to eliminate - just compute the result and exchange the "/" node with 1.5 because all children are constants.
There are many strategies to continue, but essentially you need to find a way to go through the tree and modify it until there's nothing left to change.
If you want to print the result then, just walk the tree again and produce an expression which describes it.
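To make that concrete, here is one possible minimal representation (nested tuples standing in for real node objects; the names and layout are mine, not a prescribed format) together with a constant-folding walk. Parsing the token list into such a tree is the grammar step described above and is left out here:
import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def fold(node):
    # Leaves are numbers or symbols like 'x'; internal nodes are (op, left, right).
    if not isinstance(node, tuple):
        return node
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)   # all children constant: compute the result
    return (op, left, right)

# 2 * ((9 / 6) + 6 * x) as a tree:
tree = ('*', 2, ('+', ('/', 9, 6), ('*', 6, 'x')))
print(fold(tree))   # ('*', 2, ('+', 1.5, ('*', 6, 'x')))
Rewrites such as distributing the 2, or cancelling x * 6 / x, would be further passes over the same tree.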
If you are parsing expressions in Python, you might consider Python syntax for the expressions and parse them using the ast module (AST = abstract syntax tree).
The advantages of using Python syntax: you don't have to make a separate language for the purpose, the parser is built in, and so is the evaluator. Disadvantages: there's quite a lot of extra complexity in the parse tree that you don't need (you can avoid some of it by using the built-in NodeVisitor and NodeTransformer classes to do your work).
>>> import ast
>>> a = ast.parse('x**2 + x', mode='eval')
>>> ast.dump(a)
"Expression(body=BinOp(left=BinOp(left=Name(id='x', ctx=Load()), op=Pow(),
right=Num(n=2)), op=Add(), right=Name(id='x', ctx=Load())))"
Here's an example class that walks a Python parse tree and does recursive constant folding (for binary operations), to show you the kind of thing you can do fairly easily.
from ast import *

class FoldConstants(NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)
        if isinstance(node.left, Num) and isinstance(node.right, Num):
            expr = copy_location(Expression(node), node)
            value = eval(compile(expr, '<string>', 'eval'))
            return copy_location(Num(value), node)
        else:
            return node
>>> ast.dump(FoldConstants().visit(ast.parse('3**2 - 5 + x', mode='eval')))
"Expression(body=BinOp(left=Num(n=4), op=Add(), right=Name(id='x', ctx=Load())))"
Thanks for the answers. I have not used Stack Overflow before, so I was surprised by the number of answers and the speed of them -- it's fantastic.
I have not been through the answers properly yet, but thought I should add some information to the problem specification. See the image below.
I can't post an image in this because I don't have enough points, but you can see an image at http://journal.acquitane.com/2010-01-20/image003.jpg
This image may describe more closely what I'm trying to achieve. The horizontal lines across the page are price points on the chart. Now where you get a clustering of lines within 0.5% of each other, this is considered to be a good thing, and that is why I want to identify those clusters automatically. You can see on the chart that there is a cluster at S2 & MR1, R2 & WPP1.
So every day I produce these price points and then I can identify manually those that are within 0.5% -- but the purpose of this question is how to do it with a Python routine.
I have reproduced the list again (see below) with labels. Just be aware that the list's price points don't match the price points in the image because they are from two different days.
[YR3,175.24,8]
[SR3,147.85,6]
[YR2,144.13,8]
[SR2,130.44,6]
[YR1,127.79,8]
[QR3,127.42,5]
[SR1,120.94,6]
[QR2,120.22,5]
[MR3,118.10,3]
[WR3,116.73,2]
[DR3,116.23,1]
[WR2,115.93,2]
[QR1,115.83,5]
[MR2,115.56,3]
[DR2,115.53,1]
[WR1,114.79,2]
[DR1,114.59,1]
[WPP,113.99,2]
[DPP,113.89,1]
[MR1,113.50,3]
[DS1,112.95,1]
[WS1,112.85,2]
[DS2,112.25,1]
[WS2,112.05,2]
[DS3,111.31,1]
[MPP,110.97,3]
[WS3,110.91,2]
[50MA,110.87,4]
[MS1,108.91,3]
[QPP,108.64,5]
[MS2,106.37,3]
[MS3,104.31,3]
[QS1,104.25,5]
[SPP,103.53,6]
[200MA,99.42,7]
[QS2,97.05,5]
[YPP,96.68,8]
[SS1,94.03,6]
[QS3,92.66,5]
[YS1,80.34,8]
[SS2,76.62,6]
[SS3,67.12,6]
[YS2,49.23,8]
[YS3,32.89,8]
I did make a mistake with the original list in that Group C is wrong and should not be included. Thanks for pointing that out.
Also, the 0.5% is not fixed; this value will change from day to day, but I have just used 0.5% as an example for spec'ing the problem.
Thanks Again.
Mark
PS. I will get cracking on checking the answers now.
Hi:
I need to do some manipulation of stock prices. I have just started using Python (but I think I would have trouble implementing this in any language). I'm looking for some ideas on how to implement this nicely in Python.
Thanks
Mark
Problem:
I have a list of lists (FloorLevels, see below) where each sublist has two items (stockprice, weight). I want to put the stock prices into groups when they are within 0.5% of each other. A group's strength will be determined by its total weight. For example:
Group-A
115.93,2
115.83,5
115.56,3
115.53,1
-------------
TotalWeight:12
-------------
Group-B
113.50,3
112.95,1
112.85,2
-------------
TotalWeight:6
-------------
FloorLevels[
[175.24,8]
[147.85,6]
[144.13,8]
[130.44,6]
[127.79,8]
[127.42,5]
[120.94,6]
[120.22,5]
[118.10,3]
[116.73,2]
[116.23,1]
[115.93,2]
[115.83,5]
[115.56,3]
[115.53,1]
[114.79,2]
[114.59,1]
[113.99,2]
[113.89,1]
[113.50,3]
[112.95,1]
[112.85,2]
[112.25,1]
[112.05,2]
[111.31,1]
[110.97,3]
[110.91,2]
[110.87,4]
[108.91,3]
[108.64,5]
[106.37,3]
[104.31,3]
[104.25,5]
[103.53,6]
[99.42,7]
[97.05,5]
[96.68,8]
[94.03,6]
[92.66,5]
[80.34,8]
[76.62,6]
[67.12,6]
[49.23,8]
[32.89,8]
]
I suggest a repeated use of k-means clustering -- let's call it KMC for short. KMC is a simple and powerful clustering algorithm... but it needs to "be told" how many clusters, k, you're aiming for. You don't know that in advance (if I understand you correctly) -- you just want the smallest k such that no two items "clustered together" are more than X% apart from each other. So, start with k equal 1 -- everything bunched together, no clustering pass needed;-) -- and check the diameter of the cluster (a cluster's "diameter", from the use of the term in geometry, is the largest distance between any two members of a cluster).
If the diameter is > X%, set k += 1, perform KMC with k as the number of clusters, and repeat the check, iteratively.
In pseudo-code:
def markCluster(items, threshold):
    k = 1
    clusters = [items]
    maxdist = diameter(items)
    while maxdist > threshold:
        k += 1
        clusters = Kmc(items, k)
        maxdist = max(diameter(c) for c in clusters)
    return clusters
assuming of course we have suitable diameter and Kmc Python functions.
Does this sound like the kind of thing you want? If so, then we can move on to show you how to write diameter and Kmc (in pure Python if you have a relatively limited number of items to deal with, otherwise maybe by exploiting powerful third-party add-on frameworks such as numpy) -- but it's not worthwhile to go to such trouble if you actually want something pretty different, whence this check!-)
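For reference, diameter on a plain list of price levels can be tiny; here is a sketch under my own assumptions (each item is a [price, weight] pair as in the question, and the spread is reported as a percentage so it can be compared directly with the 0.5% threshold). Kmc itself is the piece worth taking from a library or writing as a standard Lloyd's-algorithm loop:
def diameter(cluster):
    # Largest relative spread within a cluster, in percent.
    prices = [price for price, weight in cluster]
    lo, hi = min(prices), max(prices)
    return (hi - lo) / lo * 100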
A stock s belongs in a group G if for each stock t in G, s * 1.05 >= t and s / 1.05 <= t, right?
How do we add the stocks to each group? If we have the stocks 95, 100, 101, and 105, and we start a group with 100, then add 101, we will end up with {100, 101, 105}. If we did 95 after 100, we'd end up with {100, 95}.
Do we just need to consider all possible permutations? If so, your algorithm is going to be inefficient.
You need to specify your problem in more detail. Just what does "put the stockprices into groups when they are within 0.5% of each other" mean?
Possibilities:
(1) each member of the group is within 0.5% of every other member of the group
(2) sort the list and split it where the gap is more than 0.5%
Note that 116.23 is within 0.5% of 115.93 -- abs((116.23 / 115.93 - 1) * 100) < 0.5 -- but you have put one number in Group A and one in Group C.
Simple example: a, b, c = (0.996, 1, 1.004) ... Note that a and b fit, b and c fit, but a and c don't fit. How do you want them grouped, and why? Is the order in the input list relevant?
Possibility (1) produces ab,c or a,bc ... tie-breaking rule, please
Possibility (2) produces abc (no big gaps, so only one group)
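A sketch of possibility (2), with my own naming: take the already-sorted prices (descending, as in the question) and start a new group whenever the relative gap to the previous price exceeds the threshold:
def split_on_gaps(prices, threshold_pct=0.5):
    groups = []
    for price in prices:
        if groups and abs(price / groups[-1][-1] - 1) * 100 <= threshold_pct:
            groups[-1].append(price)   # close enough to the previous price
        else:
            groups.append([price])     # gap too big: start a new group
    return groups
Note that groups built this way can chain: with the question's data, 116.73 down to 115.53 all end up in one group because each successive gap is under 0.5%, even though the two ends are more than 0.5% apart -- which is exactly the ambiguity between possibilities (1) and (2).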
You won't be able to classify them into hard "groups". If you have prices (1.0,1.05, 1.1) then the first and second should be in the same group, and the second and third should be in the same group, but not the first and third.
A quick, dirty way to do something that you might find useful:
def make_group_function(tolerance=0.05):
    from math import log10, floor
    # I forget why this works.
    tolerance_factor = -1.0/(-log10(1.0 + tolerance))
    # well ... since you might ask
    # we want: log(x)*tf - log(x*(1+t))*tf = -1,
    # so every 5% change has a different group. The minus is just so groups
    # are ascending .. it looks a bit nicer.
    #
    # tf = -1/(log(x)-log(x*(1+t)))
    # tf = -1/(log(x/(x*(1+t))))
    # tf = -1/(log(1/(1*(1+t)))) # solved .. but let's just be more clever
    # tf = -1/(0-log(1*(1+t)))
    # tf = -1/(-log((1+t))
    def group_function(value):
        # don't just use int - it rounds up below zero, and down above zero
        return int(floor(log10(value)*tolerance_factor))
    return group_function
Usage:
group_function = make_group_function()

import random

groups = {}
for i in range(50):
    v = random.random()*500 + 1000
    group = group_function(v)
    if group in groups:
        groups[group].append(v)
    else:
        groups[group] = [v]

for group in sorted(groups):
    print 'Group', group
    for v in sorted(groups[group]):
        print v
    print
For a given set of stock prices, there is probably more than one way to group stocks that are within 0.5% of each other. Without some additional rules for grouping the prices, there's no way to be sure an answer will do what you really want.
Apart from the proper way to pick which values fit together, this is a problem where a little Object Orientation dropped in can make it a lot easier to deal with.
I made two classes here, with a minimum of desirable behaviors, but which can make the classification a lot easier -- you get a single point to play with it on the Group class.
I can see the code below is incorrect, in the sense that the limits for group inclusion vary as new members are added -- even if the separation criteria remain the same, you'd have to rewrite the get_groups method to use a multi-pass approach. It shouldn't be hard -- but the code would be too long to be helpful here, and I think this snippet is enough to get you going:
from copy import copy

class Group(object):
    def __init__(self, data=None, name=""):
        if data:
            self.data = data
        else:
            self.data = []
        self.name = name

    def get_mean_stock(self):
        return sum(item[0] for item in self.data) / len(self.data)

    def fits(self, item):
        if 0.995 < abs(item[0]) / self.get_mean_stock() < 1.005:
            return True
        return False

    def get_weight(self):
        return sum(item[1] for item in self.data)

    def __repr__(self):
        return "Group-%s\n%s\n---\nTotalWeight: %d\n\n" % (
            self.name,
            "\n".join("%.02f, %d" % tuple(item) for item in self.data),
            self.get_weight())
class StockGrouper(object):
    def __init__(self, data=None):
        if data:
            self.floor_levels = data
        else:
            self.floor_levels = []

    def get_groups(self):
        groups = []
        floor_levels = copy(self.floor_levels)
        name_ord = ord("A") - 1
        while floor_levels:
            seed = floor_levels.pop(0)
            name_ord += 1
            group = Group([seed], chr(name_ord))
            groups.append(group)
            to_remove = []
            for i, item in enumerate(floor_levels):
                if group.fits(item):
                    group.data.append(item)
                    to_remove.append(i)
            for i in reversed(to_remove):
                floor_levels.pop(i)
        return groups
testing:
floor_levels = [[stock, weight], ...]  # <paste the data above>
s = StockGrouper(floor_levels)
s.get_groups()
For the grouping element, could you use itertools.groupby()? As the data is sorted, a lot of the work of grouping it is already done, and then you could test if the current value in the iteration was different to the last by <0.5%, and have itertools.groupby() break into a new group every time your function returned false.
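A rough sketch of that idea (all the names here are mine). itertools.groupby() groups consecutive items with equal keys, so the key object below bumps a counter whenever the gap to the previous price exceeds the threshold, which makes groupby start a new group at that point:
from itertools import groupby

class GapKey:
    # Stateful key for groupby: the key stays the same while consecutive
    # (already sorted) prices are within threshold_pct of the previous one.
    def __init__(self, threshold_pct=0.5):
        self.threshold = threshold_pct
        self.prev = None
        self.group_id = 0

    def __call__(self, item):
        price = item[0]
        if self.prev is not None and abs(price / self.prev - 1) * 100 > self.threshold:
            self.group_id += 1
        self.prev = price
        return self.group_id

floor_levels = [[115.93, 2], [115.83, 5], [115.56, 3], [114.79, 2]]
for _, members in groupby(floor_levels, key=GapKey()):
    print(list(members))
# [[115.93, 2], [115.83, 5], [115.56, 3]]
# [[114.79, 2]]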