Calculate mean on values in a Python collections.Counter

I'm profiling some numeric time measurements that cluster extremely closely. I would like to obtain mean, standard deviation, etc. Some inputs are large, so I thought I could avoid creating lists of millions of numbers and instead
use Python collections.Counter objects as a compact representation.
Example: one of my small inputs yields a collections.Counter like [(48, 4082), (49, 1146)], which means 4,082 occurrences of the value 48 and 1,146 occurrences of the value 49. For this data set I manually calculate the mean to be 48.2192042846.
Of course if I had a simple list of 4,082 + 1,146 = 5,228 integers I would just feed it to numpy.mean().
My question: how can I calculate descriptive statistics from the values in a collections.Counter object just as if I had a list of numbers? Do I have to create the full list or is there a shortcut?

collections.Counter() is a subclass of dict. Just use Counter().values() to get the counts, and you can use the standard-library statistics.mean() function:
from collections import Counter
import statistics

counts = Counter(some_iterable_to_be_counted)
mean = statistics.mean(counts.values())
Note that I did not call Counter.most_common() here, which would produce the list of (key, count) tuples you posted in your question.
If you must use the output of Counter.most_common() you can filter out just the counts with a generator expression:
mean = statistics.mean(count for key, count in most_common_list)
If you meant to calculate the mean key value as weighted by their counts, you'd do your own calculations directly from the counter values:
mean = sum(key * count for key, count in counter.items()) / counter.total()
Note: I used Counter.total() there, which is new in Python 3.10. In older versions, use sum(counter.values()).
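As a quick sanity check against the data from the question (this snippet is mine, not part of the original answer):
from collections import Counter

counter = Counter({48: 4082, 49: 1146})
mean = sum(key * count for key, count in counter.items()) / sum(counter.values())
print(mean)  # 48.2192042846..., matching the manually calculated value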
For the median, use statistics.median():
import statistics
counts = Counter(some_iterable_to_be_counted)
median = statistics.median(counts.values())
or, for key * value:
median = statistics.median(key * count for key, count in counts.items())

While you can offload everything to numpy after making a list of values, this will be slower than needed. Instead, you can use the actual definitions of what you need.
The mean is just the sum of all numbers divided by their count, so that's very simple:
sum_of_numbers = sum(number*count for number, count in counter.items())
count = sum(count for n, count in counter.items())
mean = sum_of_numbers / count
Standard deviation is a bit more complex. It's the square root of the variance, and the variance in turn is defined as the "mean of squares minus the square of the mean" of your collection. So:
import math

total_squares = sum(number * number * count for number, count in counter.items())
mean_of_squares = total_squares / count
variance = mean_of_squares - mean * mean
std_dev = math.sqrt(variance)
A little bit more manual work, but should also be much faster if the number sets have a lot of repetition.
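Putting the above together as one function (a minimal sketch; the function name is mine, and it computes the population standard deviation):
import math
from collections import Counter

def counter_mean_std(counter):
    # Mean and population standard deviation of the keys, weighted by their counts.
    n = sum(counter.values())
    mean = sum(value * count for value, count in counter.items()) / n
    mean_of_squares = sum(value * value * count for value, count in counter.items()) / n
    return mean, math.sqrt(mean_of_squares - mean * mean)

c = Counter({48: 4082, 49: 1146})
print(counter_mean_std(c))  # (48.2192042846..., 0.4137...)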

Unless you want to write your own statistics functions there is no prêt-à-porter solution (as far as I know).
So in the end you need to create lists, and the fastest way is to use numpy. One way to do it is:
import numpy as np
# One memory allocation will be considerably faster
# if you have multiple discrete values.
elements = np.ones(4082 + 1146)
elements[0:4082] *= 48
elements[4082:] *= 49
# Then you can use numpy statistical functions to calculate
np.mean(elements)
np.std(elements)
# ...
UPDATE: Create elements from an existing collections.Counter() object
import collections

c = collections.Counter({48: 4082, 49: 1146})
elements = np.ones(sum(c.values()))
idx = 0
for value, occurrences in c.items():
    elements[idx:idx + occurrences] *= value
    idx += occurrences
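For what it's worth, numpy can also expand a Counter in a single call (my suggestion, not part of the original answer):
import numpy as np
from collections import Counter

c = Counter({48: 4082, 49: 1146})
elements = np.repeat(list(c.keys()), list(c.values()))  # 4082 copies of 48, then 1146 copies of 49
print(elements.mean(), elements.std())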


enumerate in dictionary loop takes a long time; how to improve the speed

I am using python-3.x and I would like to speed up my code. In every loop I create new values and check whether they already exist in the dictionary; if a value exists, I keep the index where it was found. I am using enumerate for this check, but it takes a long time. Is there another way to speed up my code, or is enumerate the only way for my case? I am not sure whether numpy would be better here.
Here is my code:
# import numpy
import numpy as np

# my first array
my_array_1 = np.random.choice(np.linspace(-1000, 1000, 2 ** 8), size=(100, 3), replace=True)
my_array_1 = np.array(my_array_1)

# here I want to find the unique values in my_array_1
indx = np.unique(my_array_1, return_index=True, return_counts=True, axis=0)

# then save the result to a dictionary
dic_t = {"my_array_uniq": indx[0],  # unique values in my_array_1
         "counts": indx[2]}         # how many times each unique element appears in my_array_1

# here I want to create a random array 100 times
for i in range(100):
    print(i)
    # my 2nd array
    my_array_2 = np.random.choice(np.linspace(-1000, 1000, 2 ** 8), size=(100, 3), replace=True)
    my_array_2 = np.array(my_array_2)
    # I would like to check whether each value in my_array_2 exists in the dictionary ("my_array_uniq": indx[0]).
    # If it exists, I want to hold the index of that value in the dictionary and
    # add 1 to dic_t["counts"], meaning this value appeared again; count how many times.
    # If it does not exist, add the value to the dict ("my_array_uniq": indx[0])
    # and also add 1 to dic_t["counts"].
    for i, a in enumerate(my_array_2):
        ix = [k for k, j in enumerate(dic_t["my_array_uniq"]) if (a == j).all()]
        if ix:
            print(50 * "*", i, "Yes", "at", ix[0])
            dic_t["counts"][ix[0]] += 1
        else:
            # print(50 * "*", i, "No")
            dic_t["counts"] = np.hstack((dic_t["counts"], 1))
            dic_t["my_array_uniq"] = np.vstack((dic_t["my_array_uniq"], my_array_2[i]))
Explanation:
1. I create an initial array.
2. Then I find the unique values, indices and counts of the initial array using np.unique.
3. I save the result to the dictionary (dic_t).
4. Then I start a loop that creates random values 100 times.
5. I check whether the random values in my_array_2 exist in the dictionary ("my_array_uniq": indx[0]).
6. If one of them exists, I want to hold the index of that value in the dictionary.
7. I add 1 to dic_t["counts"], meaning this value appeared again, counting how many times.
8. If it does not exist, I add the value to the dict as a new unique value ("my_array_uniq": indx[0]).
9. I also add 1 to dic_t["counts"].
So from what I can see, you are:
Creating a grid of 256 evenly spaced numbers between -1000 and 1000 and drawing from it at random
Generating 100 triplets from those (it could be fewer than 100 due to unique, but with overwhelming probability it will be exactly 100)
Then doing pretty much the same thing 100 times, each time checking for each of the triplets in the new list whether it exists in the old list
You're then trying to get a count of how often each element occurs.
I'm wondering why you're trying to do this, because it doesn't make much sense to me, but I'll give a few pointers:
There's no reason to make a dictionary dic_t if you're only going to hold two objects in it; just use two variables, my_array_uniq and counts.
You're dealing with triplets of floating point numbers. In the given range, that should give you about 10^48 different possible triplets (I may be wrong on the exact number but it's an absurdly large number either way). The way you're generating them does reduce the total phase-space a fair bit, but nowhere near enough. The probability of finding identical ones is very very low.
If you have a set of objects (in this case number triplets) and you want to determine whether you have seen a given one before, you want to use sets. Sets can only contain hashable objects, so you want to turn your triplets into tuples. Determining whether a given triplet is already contained in your set is then an O(1) operation.
For counting the number of occurrences of something, collections.Counter is the natural data structure to use; see the sketch below.
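A minimal sketch of that approach (the helper name and loop are mine, mirroring the question's setup):
import numpy as np
from collections import Counter

counts = Counter()

def record(rows):
    # One O(1) update per triplet; tuples are hashable, numpy rows are not.
    for row in rows:
        counts[tuple(row)] += 1

values = np.linspace(-1000, 1000, 2 ** 8)
record(np.random.choice(values, size=(100, 3), replace=True))  # the initial array
for _ in range(100):
    record(np.random.choice(values, size=(100, 3), replace=True))

print(len(counts), "unique triplets seen")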

Index of element in random permutation for very large range

I am working with a very large range of values (0 to approx. 10^6128) and I need a way in Python to perform two-way lookups with a random permutation of the range.
Example with a smaller dataset:
import random

values = list(range(10))  # the actual range is too large to do this
random.shuffle(values)

def map_value(n):
    return values[n]

def unmap_value(n):
    return values.index(n)
I need a way to implement the map_value and unmap_value methods with values in the very large range above.
Creating a fixed permutation of 10**6128 values is costly, memory-wise.
You can create values from your range on the fly and store them in one / two dictionaries.
If you only draw comparatively few values, one dict might be enough; if you have lots of values you might need two for faster lookup.
Essentially you:
look up a value; if not present, generate an index, store it and return it
look up an index; if not present, generate a value, store it and return it
Using a fixed random seed should lead to the same sequences:
import random

class big_range():
    random.seed(42)
    pos_value = {}
    value_pos = {}

    def map_value(self, n):
        p = big_range.value_pos.get(n)
        while p is None:
            p = random.randrange(10**6128)  # works; can't use random.choice(range(10**6128))
            if p in big_range.pos_value:
                p = None  # position already taken, draw again
            else:
                big_range.pos_value[p] = n
                big_range.value_pos[n] = p
        return p

    def unmap_value(self, n):
        p = big_range.pos_value.get(n)
        while p is None:
            p = random.randrange(10**6128)  # works; can't use random.choice(range(10**6128))
            if p in big_range.value_pos:
                p = None  # value already taken, draw again
            else:
                big_range.pos_value[n] = p
                big_range.value_pos[p] = n
        return p

br = big_range()
for i in range(10):
    print(br.map_value(i))

print(big_range.pos_value)
print(big_range.value_pos)
Output:
A gibberish humongous number ... but it works.
You store each number twice (once as pos:value, once as value:pos) for lookup reasons. You might want to check how many numbers you can generate before your memory runs out.
You can use one dict only, but then looking up the value for a given index is not O(1) but O(n), because you need to traverse dict.items() to find the value and return the index.
The repeatability breaks if you do other random things in between, because you alter the state of random. You might need to do some more encapsulating and state-keeping inside your class, using random.getstate() / random.setstate() to store the last state after each generation of a new random number; see the sketch below.
If you have already looked up most of your values, it will take longer and longer to produce a "not present" one if you simply keep drawing random indexes from 0 to 10**6128...
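Here is a minimal sketch of that state-keeping idea (my illustration, not part of the class above):
import random

_state = None  # last saved generator state

def draw():
    # Restore our private state before drawing and save it afterwards,
    # so unrelated random calls elsewhere don't break repeatability.
    global _state
    if _state is not None:
        random.setstate(_state)
    value = random.randrange(10**6128)
    _state = random.getstate()
    return value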
See: random.getstate(), random.setstate(), random.randrange()
This is kind of fragile and more of a thought experiment; I have no clue what one would need a 10**6128 range of numbers for...

How can I get the size of a Bloom filter set when using the union or intersection functions?

I'm trying to get the size of a Bloom filter set while using the filter's union & intersection functions with the Python package https://github.com/jaybaird/python-bloomfilter.git
I thought that after calling 'union' or 'intersection' I could get the result with the len() function, but it just printed 0.
from pybloom import BloomFilter
bf1 = BloomFilter(1000)
bf2 = BloomFilter(1000)
# After adding some elements to bf1 and bf2
print(len(bf1.union(bf2)))
# expected max(len(bf1), len(bf2)) but the result was 0
After finding the documentation page, I realized that len() effectively becomes disabled after 'union': its result is simply 0.
Instead, I want to approximate the size of the Bloom filter set somehow.
Do you have any idea how to calculate it?
The implementation only copies the BloomFilter's bit array, i.e. self.bitarray. The element count (self.count) of the previous filters is not carried over.
So it doesn't union the elements; it only does a bitwise OR of the bit arrays.
Update:
In most cases you don't need to approximate the count. The filter keeps a precise count of elements as you call add, so you can just call len(bf3). Unfortunately the newly created bf3 has never had add called on it, so len(bf3) == 0.
The formula to approximate the number of elements, given m bits of which n are set and k hash functions, is:
-(m / k) * ln(1 - n / m)
You have:
from math import log as ln

m = bf3.bitarray.length()  # total number of bits
n = bf3.bitarray.count()   # number of bits set to 1
k = bf3.num_slices         # number of hash functions
estimate = -m / k * ln(1 - n / m)
# given m=20, n=8, this approximates the element count as 5.89
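Wrapped up as a helper (a sketch on my part; it relies on the pybloom internals shown above):
from math import log

def approx_count(bf):
    # Approximate the number of elements from the filter's bit array.
    m = bf.bitarray.length()
    n = bf.bitarray.count()
    k = bf.num_slices
    return -m / k * log(1 - n / m)

bf3 = bf1.union(bf2)
print(approx_count(bf3))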

Python: random numbers and their frequency

The function randint from the random module can be used to produce random numbers. A call on random.randint(1, 6), for example, will produce the values 1 to 6 with equal probability. Write a program that loops 1000 times. On each iteration it makes two calls on randint to simulate rolling a pair of dice. Compute the sum of the two dice, and record the number of times each value appears.
The output should be two columns. One displays all the sums (i.e. from 2 to 12) and the other displays the sums' respective frequencies in 1000 times.
My code is shown below:
import random

freq = [0] * 13
for i in range(1000):
    Sum = random.randint(1, 6) + random.randint(1, 6)
    # compute the sum of two random numbers
    freq[sum] += 1
    # add on the frequency of a particular sum

for Sum in xrange(2, 13):
    print Sum, freq[Sum]
    # Print a column of sums and a column of their frequencies
However, I didn't manage to get any results.
You shouldn't use Sum, because ordinary variable names should not be capitalized.
You shouldn't use sum, because that would shadow the built-in sum().
Use a different, non-capitalized variable name. I suggest diceSum; that also says something about the context and the idea behind your program, so a reader understands it faster.
You don't want to make readers of your code happy? Think again. You asked for help here ;-)
Try this:
import random

freq = [0] * 13
for i in range(1000):
    Sum = random.randint(1, 6) + random.randint(1, 6)
    # compute the sum of two random numbers
    freq[Sum] += 1
    # add on the frequency of a particular sum

for Sum in xrange(2, 13):
    print Sum, freq[Sum]
    # Print a column of sums and a column of their frequencies
There's a letter-case error on sum.
The default seeding of Python's random generator should suffice for your task.
Looks like a typo: the Sum variable is wrongly typed as sum.
Below is the modified code in Python 3.x:
#!/usr/bin/env python3
import random

freq = [0] * 13
for i in range(1000):
    # compute the sum of two random numbers
    Sum = random.randint(1, 6) + random.randint(1, 6)
    # add on the frequency of a particular sum
    freq[Sum] += 1

for Sum in range(2, 13):
    # Print a column of sums and a column of their frequencies
    print(Sum, freq[Sum])
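For what it's worth, collections.Counter can do the tallying in one pass (my variant, not from the answers above):
import random
from collections import Counter

freq = Counter(random.randint(1, 6) + random.randint(1, 6) for _ in range(1000))
for dice_sum in range(2, 13):
    print(dice_sum, freq[dice_sum])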

Python: How would I get an average of a set of tuples?

I have a problem I am attempting to solve.
I have a function that produces tuples. I attempted to store them in an array like this:
while (loops til exhausted)
    count = 0
    set_of_tuples[count] = function(n, n, n)
    count = count + 1
Apparently Python doesn't store variables this way. How can I go about storing a set of tuples in a variable and then averaging them out?
You can store them in a couple of ways. Here is one:
set_of_tuples = []
while <loop-condition>:
    set_of_tuples.append(function(n, n, n))
If you want to average the results element-wise, you can:
average = tuple(sum(x[i] for x in set_of_tuples) / len(set_of_tuples)
                for i in range(len(set_of_tuples[0])))
If this is numerical data, you probably want to use Numpy instead. If you were using a Numpy array, you would just:
average = numpy.average(arr, axis=0)
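For instance (a quick illustration with made-up data):
import numpy as np

arr = np.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0), (7.0, 8.0, 9.0)])
print(np.average(arr, axis=0))  # [4. 5. 6.] -- the element-wise mean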
Hmmm, your pseudo-code is not Python at all. You might want to look at something more like:
## count = 0
set_of_tuples = list()
while not exhausted():
    set_of_tuples.append(function(n, n, n))
    ## count += 1

count = len(set_of_tuples)
However, the count here is superfluous, since we can just call len(set_of_tuples) after the loop if we want. Also, the name "set_of_tuples" is a pretty poor choice, especially given that it's not a set.
