Python: How would I get an average of a set of tuples?

I have a problem I am attempting to solve. I have a function that produces tuples, and I attempted to store them in an array like this:
count = 0
while (loops til exhausted):
    set_of_tuples[count] = function(n, n, n)
    count = count + 1
Apparently Python doesn't store variables this way. How can I go about storing a set of tuples in a variable and then averaging them out?

You can store them in a couple of ways. Here is one:
set_of_tuples = []
while `<loop-condition>`:
    set_of_tuples.append(function(n, n, n))
If you want to average the results element-wise, you can:
average = tuple(sum(x[i] for x in set_of_tuples) / len(set_of_tuples)
                for i in range(len(set_of_tuples[0])))
If this is numerical data, you probably want to use NumPy instead. If you were using a NumPy array, you would just:
average = numpy.average(arr, axis=0)
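For example, a minimal sketch (the 3-tuples here are made-up stand-ins for the function's output):
import numpy

set_of_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
arr = numpy.array(set_of_tuples)      # shape (3, 3): one row per tuple
average = numpy.average(arr, axis=0)  # element-wise mean across rows
print(average)                        # [4. 5. 6.]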

Hmmm, your pseudo-code is not Python at all. You might want to look at something more like:
## count = 0
set_of_tuples = list()
while not exhausted():
    set_of_tuples.append(function(n, n, n))
    ## count += 1
count = len(set_of_tuples)
However, the count here is superfluous, since we can just call *len(set_of_tuples)* after the loop if we want it. Also, the name "set_of_tuples" is a pretty poor choice, especially given that it's not a set.

Related

iteration over every element of 1d array in python

From an nc file I've read variables which come in the form of arrays. I've performed a calculation with the first element of each of these variables and created a new variable. I want to repeat the same set of calculations for each element of the initial arrays, without changing the calculation code, which I wrote with a single point in mind.
I've tried zip and nditer, but in both cases the if statement on the variable a has to be changed to .any() or .all(). I can't do either, because I want the if statement to consider only a single point, not the entire array.
import math

T = AD06_ALL_OMNI.variables['A_TEMP'][:][0]
REL_HUM = AD06_ALL_OMNI.variables['HUMIDITY'][:][0]
AIR_PRES = AD06_ALL_OMNI.variables['A_PRES'][:][0]
a = T - 29.65
# masking of values so that division by 0 is avoided
if a != 0.0:
    exponent1 = math.exp(17.67*T - 0.16/a)
    q = REL_HUM*exponent1/(26.3*AIR_PRES)
    deltaq = 0.98*qs - q
    print(deltaq)
I need a to be computed for each point so that deltaq is found for the same point, taking the values of T, REL_HUM, and AIR_PRES from the corresponding points. All variables are 1-D arrays of the same size. Please help!
import math

# read the full 1-D arrays rather than single elements
T = AD06_ALL_OMNI.variables['A_TEMP'][:]
REL_HUM = AD06_ALL_OMNI.variables['HUMIDITY'][:]
AIR_PRES = AD06_ALL_OMNI.variables['A_PRES'][:]
a = T - 29.65
# masking of values so that division by 0 is avoided
count = 0
for element in a:
    if element != 0.0:
        exponent1 = math.exp(17.67*T[count] - 0.16/element)
        q = REL_HUM[count]*exponent1/(26.3*AIR_PRES[count])
        deltaq = 0.98*qs - q
        print(deltaq)
    count = count + 1
Assuming that all the arrays are of the same length (it doesn't make sense to have air pressure, air temperature, and humidity arrays of different lengths), you can use a single loop to iterate over all the values of a, check each value for 0, and calculate and print deltaq for each point. I hope this helps.
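If the variables are numpy arrays (as netCDF readers usually return), a vectorized sketch with boolean masking avoids the Python-level loop entirely; qs is assumed to be defined as in the question:
import numpy as np

a = T - 29.65
mask = a != 0.0  # keep only points where division by 0 is avoided
exponent1 = np.exp(17.67*T[mask] - 0.16/a[mask])
q = REL_HUM[mask]*exponent1/(26.3*AIR_PRES[mask])
deltaq = 0.98*qs - q
print(deltaq)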

keeping only elements in a list at a certain distance at least - changing iterator while looping - Python

As the title suggests, I have written a function that, given a list ordered in ascending fashion, keeps only the elements that are at least k periods apart, but it does so by changing the iterable while looping over it. I have been told this is to be avoided like the plague, and though I am not fully convinced as to why it is such a bad idea, I trust those I have been leaning on for training, so I am asking for advice on how to avoid this practice. The code is the following:
import pandas as pd

a = pd.Series(range(0, 25, 1), index=pd.date_range('2011-1-1', periods=25))
store_before_cleanse = a.index

def funz(x, k):
    i = 0
    while i < len(x) - 1:
        if (x[i+1] - x[i]).days < k:
            x = x[:i+1] + x[i+2:]
            i = i - 1
        i = i + 1
    return x

print(funz(store_before_cleanse, 10))
What do you think can be done to avoid it?
p.s.: do not worry about solutions in which the list is not ordered. The list that is given will always be ordered in an ascending fashion.
The biggest flaw of your function is its quadratic complexity, since x = x[:i+1] + x[i+2:] copies the whole of x each time.
The simplest and most efficient way to do what you want is probably:
a.resample('10D').first().index
If you prefer a loop, you can just do:
def funz1(dates, k):
    result = [dates[0]]
    for date in dates:
        if (date - result[-1]).days >= k:
            result.append(date)
    return result
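For example, applied to the index from the question, this keeps only dates at least 10 days apart:
kept = funz1(store_before_cleanse, 10)
print(kept)  # the Timestamps 2011-01-01, 2011-01-11 and 2011-01-21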

How do I add everything in my array together

In my code I am attempting to generate 8 random numbers using a for loop, appending each to the end of an array named 'numbers'. Now I would like to add the numbers in this array together, but I can't figure out a way to do this.
Below you will see my code.
import random

def get_key():
    numbers = []
    for i in range(8):
        i = random.randrange(33, 126)
        numbers.append(i)

get_key()
You want to use sum:
a = [1, 2, 3, 4, 5]
sum(a)  # outputs 15
Add as in sum? Simply do sum(numbers).
As others have noted, you can use sum to iterate and accumulate over the list (the default accumulator for sum is int(), i.e. 0). Also, if this is the only purpose for the list, you can save memory by using a generator.
import random
get_key = lambda: sum(random.randrange(33, 126) for _ in range(8))
print(get_key())  # e.g. 612
The real question is why you are trying to do this. There may be a more direct method using a higher-level distribution: for example, the sum of n i.i.d. variables approaches a normal distribution (by the central limit theorem).
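As a hypothetical sketch of that idea (the mean and variance formulas below are those of the discrete uniform distribution that randrange(33, 126) draws from):
import random

n = 8
mu = (33 + 125) / 2             # mean of one draw: 79.0
var = ((126 - 33)**2 - 1) / 12  # variance of one draw: ~720.67
approx_key = round(random.gauss(n * mu, (n * var) ** 0.5))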

Calculate mean on values in python collections.Counter

I'm profiling some numeric time measurements that cluster extremely closely. I would like to obtain mean, standard deviation, etc. Some inputs are large, so I thought I could avoid creating lists of millions of numbers and instead
use Python collections.Counter objects as a compact representation.
Example: one of my small inputs yields a collections.Counter like [(48, 4082), (49, 1146)], which means 4,082 occurrences of the value 48 and 1,146 occurrences of the value 49. For this data set I manually calculate the mean to be something like 48.2192042846.
Of course if I had a simple list of 4,082 + 1,146 = 5,228 integers I would just feed it to numpy.mean().
My question: how can I calculate descriptive statistics from the values in a collections.Counter object just as if I had a list of numbers? Do I have to create the full list or is there a shortcut?
collections.Counter() is a subclass of dict. Just use Counter().values() to get a list of the counts, and you can use the standard-library statistics.mean() function:
import statistics
from collections import Counter

counts = Counter(some_iterable_to_be_counted)
mean = statistics.mean(counts.values())
Note that I did not call Counter.most_common() here, which would produce the list of (key, count) tuples you posted in your question.
If you must use the output of Counter.most_common() you can filter out just the counts with a generator expression:
mean = statistics.mean(count for key, count in most_common_list)
If you meant to calculate the mean key value as weighted by their counts, you'd do your own calculations directly from the counter values:
mean = sum(key * count for key, count in counter.items()) / counter.total()
Note: I used Counter.total() there, which is new in Python 3.10. In older versions, use sum(counter.values()).
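With the counts from the question, this reproduces the hand-calculated mean:
from collections import Counter

counter = Counter({48: 4082, 49: 1146})
mean = sum(key * count for key, count in counter.items()) / sum(counter.values())
print(mean)  # 48.2192042846...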
For the median, use statistics.median():
import statistics
counts = Counter(some_iterable_to_be_counted)
median = statistics.median(counts.values())
or, for key * value:
median = statistics.median(key * count for key, count in counts.items())
While you can offload everything to numpy after making a list of values, this will be slower than needed. Instead, you can use the actual definitions of what you need.
The mean is just the sum of all numbers divided by their count, so that's very simple:
sum_of_numbers = sum(number*count for number, count in counter.items())
count = sum(count for n, count in counter.items())
mean = sum_of_numbers / count
Standard deviation is a bit more complex. It's the square root of variance, and variance in turn is defined as "mean of squares minus the square of the mean" for your collection. Soooo...
import math

total_squares = sum(number*number * count for number, count in counter.items())
mean_of_squares = total_squares / count
variance = mean_of_squares - mean * mean
std_dev = math.sqrt(variance)
A little bit more manual work, but should also be much faster if the number sets have a lot of repetition.
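Putting the two pieces together with the sample counts from the question, as a small self-contained check:
import math
from collections import Counter

counter = Counter({48: 4082, 49: 1146})
count = sum(counter.values())
mean = sum(number * c for number, c in counter.items()) / count
total_squares = sum(number * number * c for number, c in counter.items())
variance = total_squares / count - mean * mean
print(mean)                 # 48.2192042846...
print(math.sqrt(variance))  # ~0.4137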
Unless you want to write your own statistics functions, there is no prêt-à-porter solution (as far as I know).
So in the end you need to create the full list, and the fastest way is to use numpy. One way to do it is:
import numpy as np

# One memory allocation will be considerably faster
# if you have multiple discrete values.
elements = np.ones(4082 + 1146)
elements[:4082] *= 48   # 4,082 occurrences of the value 48
elements[4082:] *= 49   # 1,146 occurrences of the value 49

# Then you can use numpy's statistical functions to calculate
np.mean(elements)
np.std(elements)
# ...
UPDATE: Create elements from an existing collections.Counter() object
import collections
import numpy as np

c = collections.Counter({48: 4082, 49: 1146})
elements = np.ones(sum(c.values()))
idx = 0
for value, occurrences in c.items():
    elements[idx:idx + occurrences] *= value
    idx += occurrences
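The resulting array can then go through the same numpy routines as above; for the counts from the question this reproduces the hand-calculated mean:
print(np.mean(elements))  # 48.2192042846...
print(np.std(elements))   # ~0.4137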

Python noob: manipulating arrays

I have already asked a few questions on here about this same topic, but I'm really trying not to disappoint the professor I'm doing research with. This is my first time using Python and I may have gotten in a little over my head.
Anyways, I was sent a file to read and was able to using this command:
SNdata = numpy.genfromtxt('...', dtype=None,
                          usecols=(0,6,7,8,9,19,24,29,31,33,34,37,39,40,41,42,43,44),
                          names=['sn','off1','dir1','off2','dir2','type','gal','dist',
                                 'htype','d1','d2','pa','ai','b','berr','b0','k','kerr'])
sn is just an array of supernova names; type is an array of the supernova types (Ia or II), etc.
One of the first things I need to do is simply calculate the probabilities of certain properties given the SN type (Ia or II).
For instance, the column htype is the morphology of the host galaxy (given as an integer, 1 = elliptical to 8 = irregular). I need to calculate the probability of an elliptical given a Type Ia and of an elliptical given a Type II, and likewise for all of the integers up to 8.
For ellipticals, I know that I just need the number of elements that have htype = 1 and type = Ia divided by the total number of elements of type = Ia. And then the number of elements that have htype = 1 and type = II divided by the total number of elements that have type = II.
I just have no idea how to write code for this. I was planning on finding the total number of each type first and then running a for loop to find the number of elements that have a certain htype given their type (Ia or II).
Could anyone help me get started with this? If any clarification is needed, let me know.
Thanks a lot.
Numpy supports boolean array operations, which will make your code fairly straightforward to write. For instance, you could do:
Ia_mask = SNdata['type'] == 'Ia'
II_mask = SNdata['type'] == 'II'
htype_sums = {}
for htype_number in range(1, 9):
    htype_mask = SNdata['htype'] == htype_number
    Ia_sum = (htype_mask & Ia_mask).sum() / Ia_mask.sum()
    II_sum = (htype_mask & II_mask).sum() / II_mask.sum()
    htype_sums[htype_number] = (Ia_sum, II_sum)
Each of the _mask variables is a boolean array, so when you sum one you count the number of elements that are True. Since the two type masks don't depend on htype_number, they only need to be computed once, outside the loop.
You can use collections.Counter to count needed observations.
For example,
from collections import Counter
types_counter = Counter(row['type'] for row in data)
will give you the desired counts of SN types, and
htypes_types_counter = Counter((row['type'], row['htype']) for row in data)
will give you the counts for each (type, morphology) pair. Then, to get your estimate for ellipticals, just divide:
1.0*htypes_types_counter['Ia', 1]/types_counter['Ia']
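To cover all eight morphologies at once, a small sketch reusing the same two Counters (under Python 3, where / is true division, the 1.0* factor is unnecessary):
for htype in range(1, 9):
    p_Ia = htypes_types_counter['Ia', htype] / types_counter['Ia']
    p_II = htypes_types_counter['II', htype] / types_counter['II']
    print(htype, p_Ia, p_II)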
