I have a lot of data, usually in a file. I want to compute some quantities, so I have functions like this:
def mean(iterator):
    n = 0
    sum = 0.
    for i in iterator:
        sum += i
        n += 1
    return sum / float(n)
I also have many other similar functions (var, size, ...).

Now I have an iterator over the data, iter_data, and I can compute all the quantities I want: m = mean(iter_data); v = var(iter_data), and so on. The problem is that I am iterating over the data many times, and this is expensive in my case; the I/O is actually the most expensive part.

So the question is: can I compute my quantities m, v, ... while iterating over iter_data only once, keeping the functions mean, var, ... separate so that it is easy to add new ones?
What I need is something similar to boost::accumulators
For example, use objects and callbacks like:

class Counter():
    def __init__(self):
        self.n = 0
    def __call__(self, i):
        self.n += 1

class Summer():
    def __init__(self):
        self.sum = 0
    def __call__(self, i):
        self.sum += i

def process(iterator, callbacks):
    for i in iterator:
        for f in callbacks: f(i)

counter = Counter()
summer = Summer()
callbacks = [counter, summer]
iterator = xrange(10)  # test data
process(iterator, callbacks)

# process results from callbacks
n = counter.n
sum = summer.sum
This is easily extendible and iterates the data only once.
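For instance, a variance accumulator fits the same callback protocol. This is a sketch of my own (not part of the original answer), using Welford's online algorithm:

class Variance():
    """One-pass variance via Welford's online algorithm."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations from the mean
    def __call__(self, i):
        self.n += 1
        delta = i - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (i - self.mean)
    def result(self):
        return self.m2 / self.n  # population variance

callbacks = [Counter(), Summer(), Variance()]
process(xrange(10), callbacks)
print callbacks[2].result()  # 8.25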
You can use itertools.tee and generator magic (I say magic because it's not exactly nice and readable):
import itertools

def mean(iterator):
    n = 0
    sum = 0.
    for i in iterator:
        sum += i
        n += 1
        yield  # one dummy yield per item keeps all generators in lockstep
    yield sum / float(n)

def multi_iterate(funcs, iter_data):
    iterators = itertools.tee(iter_data, len(funcs))
    result_iterators = [func(values) for func, values in zip(funcs, iterators)]
    # izip advances all generators together, so tee doesn't buffer the data
    for results in itertools.izip(*result_iterators):
        pass
    return results

mean_result, var_result = multi_iterate([mean, var], iter([10, 20, 30]))
print(mean_result)  # 20.0
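Note that var is used above but never defined in the snippet; a one-pass version written in the same yield-per-item style might look like this (my sketch, computing the population variance from running sums):

def var(iterator):
    n = 0
    total = 0.
    total_sq = 0.
    for i in iterator:
        total += i
        total_sq += i * i
        n += 1
        yield
    m = total / n
    yield total_sq / n - m * m  # E[X^2] - E[X]^2

With [10, 20, 30] this yields 66.67 (the population variance), in lockstep with mean.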
By the way, you can write mean in a simpler way:

def mean(iterator):
    total = 0.
    for n, item in enumerate(iterator, 1):
        total += item
        yield
    yield total / n

You shouldn't name variables sum, because that shadows the built-in function of the same name.
Without classes, you could adapt the following:
def my_mean():
    total = 0.
    length = 0
    while True:
        val = (yield)
        if val is not None:
            total += val
            length += 1
        else:
            yield total / length

def my_len():
    length = 0
    while True:
        val = (yield)
        if val is not None:
            length += 1
        else:
            yield length

def my_sum():
    total = 0.
    while True:
        val = (yield)
        if val is not None:
            total += val
        else:
            yield total

def process(iterable, **funcs):
    fns = {name: func() for name, func in funcs.iteritems()}
    for fn in fns.itervalues():
        fn.send(None)  # prime each coroutine
    for item in iterable:
        for fn in fns.itervalues():
            fn.send(item)
    return {name: next(func) for name, func in fns.iteritems()}

data = [1, 2, 3]
print process(data, items=my_len, some_other_value=my_mean, Total=my_sum)
# {'items': 3, 'some_other_value': 2.0, 'Total': 6.0}
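Adding a new statistic is then just another coroutine of the same shape; for example, a hypothetical my_max (my addition, not part of the original answer):

def my_max():
    best = None
    while True:
        val = (yield)
        if val is not None:
            if best is None or val > best:
                best = val
        else:
            yield best

print process(data, biggest=my_max)
# {'biggest': 3}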
What you want is a main Calc class that iterates over the data, applies the different calculations for mean, var, etc., and then returns those values through an interface. You could make it more generic by letting calculations register themselves with this class before the main pass, and then exposing their results through new accessors on the interface.
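A minimal sketch of that registration idea (all names here are my own invention, not an established API; it reuses the callback objects from the first answer, assuming they each grow a result() method):

class Calc(object):
    def __init__(self):
        self._accumulators = {}

    def register(self, name, accumulator):
        # accumulators are registered before the single pass over the data
        self._accumulators[name] = accumulator

    def run(self, iter_data):
        # the one and only pass over the data
        for item in iter_data:
            for acc in self._accumulators.values():
                acc(item)

    def result(self, name):
        return self._accumulators[name].result()

calc = Calc()
calc.register('count', Counter())  # hypothetical: Counter/Summer extended
calc.register('sum', Summer())     # with a result() method
calc.run(iter_data)
print calc.result('sum')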
Say I have X sets of different sizes, say 3 sets of sizes 3×4×5.

I want to generate items from the Cartesian product of these sets with a probability distribution that is guaranteed to eventually cover every item.
Based on my other answer, I can generate the Nth element of the Cartesian product with:

items = []
N = self.N
for item in sets:
    N, r = divmod(N, len(item))
    items.append(item[r])
self.N = self.N + 1
Now I want to bias the distribution of the set to be unfair.
For example, if I have three sets a, b and c, I want to pick an item from a more often than from b, and an item from b more often than from c, but I want all items to eventually occur.

One inefficient idea I have is to generate all the indexes of every Nth position and order them based on a distribution, but I am not sure how to do that.
Why am I trying to do this? I am trying to schedule things.
I think I found a solution; it's an incremental and small change. We introduce a frequency array giving the priority order: 0 means that column shall grow the least, 1 means that column shall grow the most, and 2 means that column shall grow more than 0 but less than 1. I'm not sure why.
letters = ["A", "B", "C", "D"]
numbers = ["1", "2", "3", "4"]
symbols = ["÷", "×", "(", "&"]
freq = [2, 0, 1]
class Concurrent:
def __init__(self, sets):
self.sets = sets
self.N = 0
def size(self):
total = len(self.sets[0])
for item in self.sets[1:]:
total = total * len(item)
return total
def reset(self):
self.N = 0
def fairtick(self):
combo = []
N = self.N
for index, item in enumerate(self.sets):
N, r = divmod(N, len(item))
combo.append(r)
self.N = self.N + 1
results = ""
for index, item in enumerate(combo):
results += self.sets[index][item]
return results
def biasedtick(self):
combo = [0] * len(self.sets)
N = self.N
for index, item in enumerate(self.sets):
N, r = divmod(N, len(item))
combo[freq[index]] = r
self.N = self.N + 1
results = ""
for index, item in enumerate(combo):
results += self.sets[index][item]
return results
print("FAIR")
cl = Concurrent([letters, numbers, symbols])
s1 = set()
for index in range(cl.size()):
value = cl.fairtick()
print(value)
s1.add(value)
print("")
print("BIASED")
print("")
cl2 = Concurrent([letters, numbers, symbols])
s2 = set()
for index in range(cl2.size()):
value = cl2.biasedtick()
print(value)
s2.add(value)
assert s1 == s2
Here's the output:
FAIR
A1÷
B1÷
C1÷
D1÷
A2÷
B2÷
C2÷
D2÷
A3÷
B3÷
C3÷
D3÷
A4÷
B4÷
C4÷
D4÷
A1×
B1×
C1×
D1×
A2×
B2×
C2×
D2×
A3×
B3×
C3×
D3×
A4×
B4×
C4×
D4×
A1(
B1(
C1(
D1(
A2(
B2(
C2(
D2(
A3(
B3(
C3(
D3(
A4(
B4(
C4(
D4(
A1&
B1&
C1&
D1&
A2&
B2&
C2&
D2&
A3&
B3&
C3&
D3&
A4&
B4&
C4&
D4&
BIASED
A1÷
A1×
A1(
A1&
B1÷
B1×
B1(
B1&
C1÷
C1×
C1(
C1&
D1÷
D1×
D1(
D1&
A2÷
A2×
A2(
A2&
B2÷
B2×
B2(
B2&
C2÷
C2×
C2(
C2&
D2÷
D2×
D2(
D2&
A3÷
A3×
A3(
A3&
B3÷
B3×
B3(
B3&
C3÷
C3×
C3(
C3&
D3÷
D3×
D3(
D3&
A4÷
A4×
A4(
A4&
B4÷
B4×
B4(
B4&
C4÷
C4×
C4(
C4&
D4÷
D4×
D4(
D4&
I'm developing an A/B test framework using Django. I want to assign a variant number based on the bucket_id from the request's cookies.

bucket_id is set by the front end as an integer in the range 0-99.

So far, I have created the function get_bucket_range:
def get_bucket_range(data):
    range_bucket = []
    first_val = 0
    next_val = 0
    for i, v in enumerate(data.split(",")):
        v = int(v)
        if i == 0:
            first_val = v
            range_bucket.append([0, first_val])
        elif i == 1:
            range_bucket.append([first_val, first_val + v])
            next_val = first_val + v
        else:
            range_bucket.append([next_val, next_val + v])
            next_val = next_val + v
    return range_bucket
The input for get_bucket_range is a comma-delimited string of weights; e.g. data = "25,25,50" means we have 3 variants, with the first variant's weight being 25, etc.

I then created a function to assign the variant:
def assign_variant(range_bucket, num):
    for i in range(len(range_bucket)):
        if num in range(range_bucket[i][0], range_bucket[i][1]):
            return i
This function takes 2 parameters: range_bucket (from the get_bucket_range function) and num (the bucket_id from the cookies).

With it I can return the variant id that a bucket_id belongs to.

For example, with bucket_id 25 and data = "25,25,50", the bucket_id should belong to variant id 1. Or, with bucket_id 25 and data = "10,10,10,70", the bucket_id should belong to variant id 2.

However, it feels like neither of my functions is Pythonic or optimised. Does anyone have suggestions as to how I could improve my code?
Your functions could look like this for example:
def get_bucket_range(data):
    last = 0
    range_bucket = []
    for v in map(int, data.split(',')):
        range_bucket.append([last, last + v])
        last += v
    return range_bucket

def assign_variant(range_bucket, num):
    for i, (low, high) in enumerate(range_bucket):
        if low <= num < high:
            return i
You can greatly reduce the lengths of your functions with the itertools.accumulate and bisect.bisect functions. The first function accumulates all the weights into sums (10,10,10,70 becomes 10,20,30,100), and the second function gives you the index of where that element would belong, which in your case is equivalent to the index of the group it belongs to.
from itertools import accumulate
from bisect import bisect

def get_bucket_range(data):
    return list(accumulate(map(int, data.split(','))))

def assign_variant(range_bucket, num):
    return bisect(range_bucket, num)
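A quick check against the example from the question (my own demo):

buckets = get_bucket_range("10,10,10,70")
print(buckets)                      # [10, 20, 30, 100]
print(assign_variant(buckets, 25))  # 2, i.e. variant id 2

bisect returns the insertion point of num in the cumulative list, which is exactly the index of the variant it falls into.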
I have a startnumber and an endnumber.
From these numbers I need to pick a sequence of numbers.
The sequence is not always the same.
Example:
startnumber = 1
endnumber = 32
I need to create a list of numbers with a certain sequence
e.g. 3 numbers yes, 2 numbers no, 3 numbers yes, 2 numbers no, etc.
Expected output:
[[1-3],[6-8],[11-13],[16-18],[21-23],[26-28],[31-32]]
(at the end there are only 2 numbers remaining (31 and 32))
Is there a simple way in Python to select sequences like this from a range of numbers?
numbers = range(1, 33)
take = 3
skip = 2
seq = [list(numbers[idx:idx + take]) for idx in range(0, len(numbers), take + skip)]
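For the question's example this produces (my own check):

print(seq)
# [[1, 2, 3], [6, 7, 8], [11, 12, 13], [16, 17, 18],
#  [21, 22, 23], [26, 27, 28], [31, 32]]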
Extrapolating this out:
def get_data(data, filterfunc=None):
    if filterfunc is None:
        filterfunc = lambda: True  # take every line
    result = []
    sub_ = []
    for line in data:
        if filterfunc():
            sub_.append(line)
        else:
            if sub_:
                result.append(sub_)
                sub_ = []
    if sub_:  # flush the last group so a trailing run isn't lost
        result.append(sub_)
    return result
# Example filterfunc
def example_filter(take=1, leave=1):
    """example_filter is a less-fancy version of itertools.cycle"""
    while True:
        for _ in range(take):
            yield True
        for _ in range(leave):
            yield False

# Your example; .__next__ turns the generator's iteration into the
# zero-argument call that get_data expects
final = get_data(range(1, 33), example_filter(take=3, leave=2).__next__)
As alluded to in the docstring of example_filter, the filterfunc for get_data is really just expecting a True or False based on a call. You could change this easily to be of the signature:
def filterfunc(some_data: object) -> bool:
So that you can determine whether to take or leave based on the value (or even the index), but it currently takes no arguments and just functions as a less magic itertools.cycle (since it should return its value on call, not on iteration).
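For instance, here is a minimal sketch of that variant (my own code, under the hypothetical name get_data_by_value), where the take/leave decision is made from the value itself:

def get_data_by_value(data, filterfunc):
    """Group consecutive lines for which filterfunc(line) is True."""
    result = []
    sub_ = []
    for line in data:
        if filterfunc(line):
            sub_.append(line)
        else:
            if sub_:
                result.append(sub_)
                sub_ = []
    if sub_:
        result.append(sub_)
    return result

# keep runs of even numbers, dropping the odd ones
print(get_data_by_value(range(10), lambda x: x % 2 == 0))
# [[0], [2], [4], [6], [8]]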
from itertools import islice

def grouper(iterable, n, min_chunk=1):
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if len(chunk) < min_chunk:
            return
        yield chunk

def pick_skip_seq(seq, pick, skip, skip_first=False):
    if skip_first:
        ret = [x[skip:] for x in grouper(seq, pick + skip, skip + 1)]
    else:
        ret = [x[:pick] for x in grouper(seq, pick + skip)]
    return ret
pick_skip_seq(range(1,33), 3, 2) gives the required list.

In pick_skip_seq(seq, pick, skip, skip_first=False), seq is the sequence to pick/skip from, pick and skip are the numbers of elements to pick and skip, and skip_first should be set to True if skipping should happen first.

grouper returns chunks of n elements; it ignores the last group if it has fewer than min_chunk elements. It is derived from the recipe at https://stackoverflow.com/a/8991553/1921546.
Demo:
# pick 3, skip 2
for i in range(30, 35):
    print(pick_skip_seq(range(1, i), 3, 2))

# skip 2, pick 3
for i in range(30, 35):
    print(pick_skip_seq(range(1, i), 3, 2, True))
An alternative implementation of pick_skip_seq:
from itertools import chain, cycle, repeat, compress

def pick_skip_seq(seq, pick, skip, skip_first=False):
    if skip_first:
        c = cycle(chain(repeat(0, skip), repeat(1, pick)))
    else:
        c = cycle(chain(repeat(1, pick), repeat(0, skip)))
    return list(grouper(compress(seq, c), pick))
All things used are documented here: https://docs.python.org/3/library/itertools.html#itertools.compress
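To see what compress contributes here, a small standalone check (my own example):

from itertools import compress, cycle

# selectors cycle through 1,1,1,0,0: take three, drop two
print(list(compress(range(1, 11), cycle([1, 1, 1, 0, 0]))))
# [1, 2, 3, 6, 7, 8]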
I have some calculations on biological data. Each function calculates the total, average, min and max values for one list of objects.
The idea is that I have a lot of different lists each one is for a different object type.
I don't want to repeat my code for every function, with only the "for" line and the object method call changing!
For example:
Volume function:
def calculate_volume(self):
    total = 0
    min = sys.maxint
    max = -1
    compartments_counter = 0
    for n in self.nodes:
        compartments_counter += 1
        current = n.get_compartment_volume()
        if min > current:
            min = current
        if max < current:
            max = current
        total += current
    avg = float(total) / compartments_counter
    return total, avg, min, max
Contraction function:
def get_contraction(self):
    total = 0
    min = sys.maxint
    max = -1
    branches_count = self.branches.__len__()
    for branch in self.branches:
        current = branch.get_contraction()
        if min > current:
            min = current
        if max < current:
            max = current
        total += current
    avg = float(total) / branches_count
    return total, avg, min, max
Both functions look almost the same, with just a little modification!

I know I can use the built-in sum, min, max, etc., but when I apply them to my values they take more time than doing everything in the loop, because they can't all be computed in a single pass.

I just want to know: is writing a separate function for every calculation the right (i.e. professional) way? Or could I write one function and pass it the list, the object type and the method to call?
It's hard to say without seeing the rest of the code, but from the limited view given I'd reckon you shouldn't have these functions as methods at all. I also don't really understand your reasoning for not using the builtins ("they can't be called at once"?). If you're implying that computing the 4 statistics in a single pass in Python is faster than 4 passes with the builtins (in C), then I'm afraid you have a very wrong assumption.
That said, here's my take on the problem:
def get_stats(l):
    s = sum(l)
    return (
        s,
        float(s) / len(l),
        min(l),
        max(l))

# then create numeric lists from your data and send 'em through:
node_volumes = [n.get_compartment_volume() for n in self.nodes]
branches = [b.get_contraction() for b in self.branches]
# ...
total_1, avg_1, min_1, max_1 = get_stats(node_volumes)
total_2, avg_2, min_2, max_2 = get_stats(branches)
EDIT
Some benchmarks to show that the builtins win:
MINE.py
import sys

def get_stats(l):
    s = sum(l)
    return (
        s,
        float(s) / len(l),
        min(l),
        max(l)
    )

branches = [i for i in xrange(10000000)]
print get_stats(branches)
Versus YOURS.py
import sys

branches = [i for i in xrange(10000000)]

total = 0
min = sys.maxint
max = -1
branches_count = branches.__len__()
for current in branches:
    if min > current:
        min = current
    if max < current:
        max = current
    total += current
avg = float(total) / branches_count
print total, avg, min, max
And finally with some timers:
smassey@hacklabs:/tmp $ time python mine.py
(49999995000000, 4999999.5, 0, 9999999)

real    0m1.225s
user    0m0.996s
sys     0m0.228s

smassey@hacklabs:/tmp $ time python yours.py
49999995000000 4999999.5 0 9999999

real    0m2.369s
user    0m2.180s
sys     0m0.180s
Cheers
First, notice that while it is probably more efficient to call len(self.branches) (don't call __len__ directly), it is more general to increment a counter in the loop, as you do in calculate_volume. With that change, you can refactor as follows:
def _stats(self, iterable, get_current):
    total = 0.0
    min_value = None  # slightly better than a sentinel like sys.maxint
    max_value = -1
    counter = 0
    for n in iterable:
        counter += 1
        current = get_current(n)
        if min_value is None or min_value > current:
            min_value = current
        if max_value < current:
            max_value = current
        total += current
    avg = total / counter
    return total, avg, min_value, max_value
Now, each of the two can be implemented in terms of _stats:
import operator

def calculate_volume(self):
    return self._stats(self.nodes, operator.methodcaller('get_compartment_volume'))

def get_contraction(self):
    return self._stats(self.branches, operator.methodcaller('get_contraction'))
methodcaller('method_name') provides a function f such that f(x) is equivalent to x.method_name(), which allows you to factor out the method call.
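A quick illustration of methodcaller (standard library behaviour):

import operator

f = operator.methodcaller('upper')
print(f('hello'))  # 'HELLO', same as 'hello'.upper()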
You can use getattr(instance, methodname) to write a function that processes lists of arbitrary objects.
def averager(things, methodname):
    count, total, min, max = 0, 0, sys.maxint, -1
    for thing in things:
        current = getattr(thing, methodname)()
        count += 1
        if min > current:
            min = current
        if max < current:
            max = current
        total += current
    avg = float(total) / count
    return total, avg, min, max
Then inside your class definitions you just need
def calculate_volume(self):
    return averager(self.nodes, 'get_compartment_volume')

def get_contraction(self):
    return averager(self.branches, 'get_contraction')
Writing a function that takes another function that knows how to extract values from the list is very common. In fact, min and max both take a key argument for exactly this purpose.
eg.
items = [1, 0, -2]
print(max(items, key=abs))  # prints -2
So it's perfectly acceptable to write your own function that does the same. Normally, I would just create a new list of all the values you want to examine and then work with that (eg. [branch.get_contraction() for branch in branches]). But perhaps space is an issue for you, so here is an example using a generator.
def sum_avg_min_max(iterable, key=None):
    if key is not None:
        iter_ = (key(item) for item in iterable)
    else:
        # if there is no key, just use the iterable itself
        iter_ = iter(iterable)
    try:
        # We don't know sensible starting values for total, min or max,
        # so use the first value.
        total = min_ = max_ = next(iter_)
    except StopIteration:
        # can't have a min or max if we have no items in the iterable...
        raise ValueError("empty iterable") from None
    count = 1
    for item in iter_:
        total += item
        min_ = min(min_, item)
        max_ = max(max_, item)
        count += 1
    return total, float(total) / count, min_, max_
Then you might use it like this:
class MyClass(int):
    def square(self):
        return self ** 2

items = [MyClass(i) for i in range(10)]
print(sum_avg_min_max(items, key=MyClass.square))  # prints (285, 28.5, 0, 81)
This works because fetching an instance method from the class gives you the underlying function itself (with no self bound), so we can use it as the key. eg.
str.upper("hello world") == "hello world".upper()
With a more concrete example (assuming items in branches are instances of Branch):
def get_contraction(self):
    result = sum_avg_min_max(self.branches, key=Branch.get_contraction)
    return result
Or maybe I can write one function and pass the list, object type and the method to call.
Although you can definitely pass a function to a function, and it's actually a very common way to avoid repeating yourself, in this case you can't, because each object in the list has its own method. So instead, I'm passing the method's name as a string and then using getattr to get the actual callable method from the object. Also note that I'm using len() instead of explicitly calling __len__().
def handle_list(items_list, func_to_call):
    total = 0
    min = sys.maxint
    max = -1
    count = len(items_list)
    for item in items_list:
        current = getattr(item, func_to_call)()
        if min > current:
            min = current
        if max < current:
            max = current
        total += current
    avg = float(total) / count
    return total, avg, min, max
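Called from the original methods, that would look something like this (my own illustration; the attribute and method names come from the question):

def calculate_volume(self):
    return handle_list(self.nodes, 'get_compartment_volume')

def get_contraction(self):
    return handle_list(self.branches, 'get_contraction')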
I am attempting to implement heapsort using the pseudocode from the book Introduction to Algorithms. The following is what I have:
def parent(i):
    return i // 2

def left(i):
    return 2 * i

def right(i):
    return 2 * i + 1

def max_heapify(seq, i, n):
    # restore the max-heap property for the subtree rooted at index i
    l = left(i)
    r = right(i)
    if l <= n and seq[l] > seq[i]:
        largest = l
    else:
        largest = i
    if r <= n and seq[r] > seq[largest]:
        largest = r
    if largest != i:
        seq[i], seq[largest] = seq[largest], seq[i]
        max_heapify(seq, largest, n)

def heap_length(seq):
    # the heap is 1-based, so index 0 is unused
    return len(seq) - 1

def build_heap(seq):
    n = heap_length(seq)
    for i in range(n // 2, 0, -1):
        max_heapify(seq, i, n)

def sort(seq):
    build_heap(seq)
    heap_size = heap_length(seq)
    for i in range(heap_size, 1, -1):
        # move the current maximum to the end, then shrink the heap
        seq[1], seq[i] = seq[i], seq[1]
        heap_size = heap_size - 1
        max_heapify(seq, 1, heap_size)
    return seq
I am having issues understanding passing by value versus passing by reference in Python. I have looked at the following question, and it seems that I am passing the list by value. My question is: how do I return the correctly sorted list, either by reference or by value?
Lists are always passed by reference. If you want to pass by value, use a slice to send a copy:

my_func(my_array[:])  # send a copy
my_func(my_array)     # the list is modified inside and the changes are reflected in the original
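A quick demonstration with the sort above (my own example; note this 1-based heapsort never touches index 0):

data = [None, 5, 3, 1, 2, 4]  # index 0 is an unused placeholder
sort(data)                    # sorts in place and also returns seq
print(data)                   # [None, 1, 2, 3, 4, 5]

original = [None, 5, 3, 1, 2, 4]
result = sort(original[:])    # the slice keeps the original untouched
print(original)               # [None, 5, 3, 1, 2, 4]
print(result)                 # [None, 1, 2, 3, 4, 5]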