2D numpy array where spacing between items is defined by a function - python

I need a list or 2-d array of integers between a minimum value and a maximum value, where the interval between the integers varies inversely with a distribution function. In other words, the density of values should be highest at the peak of the distribution. In my case something like a Weibull probability density function with shape parameter k = 1.5 would be nice.
Output would look something like this:
>>> min = 1
>>> max = 500
>>> peak = 100
>>> n = 18
>>> myfunc(min, max, peak, n)
[1, 50, 75, 88, 94, 97, 98, 99, 100, 102, 106, 112, 135, 176, 230, 290, 360, 500]
I already tried one method using the np.random.weibull() function to populate a numpy array but this doesn't work out nicely enough; the randomization when producing a list of 20 items means that the spacing is not satisfactory. It would be much better to avoid generating random numbers from a distribution and instead do what I described above, controlling the spacing directly.
Thank you.
Edit: I mention a Weibull distribution because it is asymmetric, but of course any distribution function that gives similar results is also OK and may be more suitable.
Edit2: So I want a numpy non-linear space!
Edit3: As I answered in one comment, I want to avoid random number generation so that the function output is identical each time it is run with the same input parameters.

If I'm understanding your question right, this function should do what you're asking:
import numpy as np

def weibullspaced(min, max, k, arrsize):
    wb = np.random.weibull(k, arrsize - 1)
    spaced = np.zeros((arrsize,))
    spaced[1:] = np.cumsum(wb)
    diff = max - min
    spaced *= diff / spaced[-1]
    return min + np.rint(spaced)
You can of course substitute in any distribution you want, but you said you wanted Weibull. Is that the function you're looking for?
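If you want to avoid random numbers entirely (as mentioned in the edits), one deterministic sketch is to invert the cumulative sum of a Weibull-shaped pdf at evenly spaced quantiles, so the point density follows the pdf and the output is identical on every run. The density_spaced name, the grid resolution, and the shifting of the mode to peak are my own assumptions, not from the thread:
import numpy as np

def density_spaced(vmin, vmax, peak, n, k=1.5):
    # Weibull-shaped pdf (shape k), scaled so its mode sits at `peak`
    x = np.linspace(vmin, vmax, 10000)
    lam = (peak - vmin) / ((k - 1.0) / k) ** (1.0 / k)
    t = (x - vmin) / lam
    pdf = (k / lam) * t ** (k - 1.0) * np.exp(-t ** k)
    # Normalised cumulative distribution over [vmin, vmax]
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    # Invert the cdf at n evenly spaced quantiles: dense where the pdf is high
    idx = np.clip(np.searchsorted(cdf, np.linspace(0.0, 1.0, n)), 0, len(x) - 1)
    return np.rint(x[idx]).astype(int)

print(density_spaced(1, 500, 100, 18))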

Here is a rather inelegant but simple solution to my own question. I simplified things by using a triangular distribution function. This is good because it is easy to specify a minimum and maximum value. A function named "spacing()" provides a spacing amount for a given x value according to a specified mathematical function. After incrementing through a while loop I add the Max value to the list so that the full range is present. I then convert to integers while converting to a numpy array.
The downside of this method is that I must manually specify a minimum and maximum step size. I would rather specify the length of the returned array! (A sketch addressing this follows the output below.)
import numpy as np

Min = 1.0
Max = 500.0
peak = 100.0
minstep = 1.0
maxstep = 50.0

def spacing(x):
    # Triangle distribution:
    if x < peak:
        # Since we are calculating gradients I keep everything as floats for now.
        grad = (minstep - maxstep) / (peak - Min)
        return grad*x + maxstep
    elif x == peak:
        return minstep
    else:
        grad = (maxstep - minstep) / (Max - peak)
        return grad*x + minstep

def myfunc(Min, Max, peak, minstep, maxstep):
    x = 1.0
    chosen = []
    while x < Max:
        space = spacing(x)
        chosen.append(x)
        x += space
    chosen.append(Max)
    # I cheat with the integers by casting the list to ints right at the end:
    chosen = np.array(chosen, dtype='int')
    return chosen

print(myfunc(1.0, 500.0, 100.0, 1.0, 50.0))
Output:
[ 1 50 75 88 94 97 99 100 113 128 145 163 184 208 235 264 298 335 378 425 478 500]
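The "specify the length of the returned array" wish can be handled with a small variant (my own sketch, not part of the original answer): evaluate the same spacing() on an even grid, accumulate the steps, and rescale the cumulative positions so exactly n integers span [Min, Max].
def myfunc_n(Min, Max, n):
    # Reuses spacing() and the module-level peak/minstep/maxstep from above
    xs = np.linspace(Min, Max, n - 1)
    steps = np.array([spacing(x) for x in xs])
    pos = np.zeros(n)
    pos[1:] = np.cumsum(steps)
    pos *= (Max - Min) / pos[-1]       # rescale so the last point lands on Max
    return np.rint(Min + pos).astype(int)

print(myfunc_n(1.0, 500.0, 18))        # 18 integers from 1 to 500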

Related

Algorithm for generating 'balls in bins' outcomes with limits on bin size?

Suppose we have N balls and M bins to put them into. The number of outcomes can be found using the 'stars and bars' method (Integer Equations - Stars and Bars). We can also compute all the actual outcomes using the itertools package and get unique outcomes by using sympy.utilities.iterables.multiset_permutations.
If we restrict bin size however to some arbitrary number on [0, N], how could you efficiently compute all the possible outcomes? Is there a better way than computing all outcomes without bin size and then eliminating invalid outcomes afterwards?
For N = 2 and M = 3, the outcomes are {200, 110, 020, 011, 101, 002}.
For example, with N = 3 and M = 3, the outcomes are {300, 210, 120, 021, 012, 030, 201, 102, 003, 111}. But what if we wanted some array b=[3,3,0] specifying max bin size so that the only valid outcomes were {300, 210, 120, 030}?
You could use a recursive approach, placing 0 up to the maximum number of balls in the first bin and recursing with the remaining balls over the rest of the bins:
def ballPlacing(N, B):
    if N <= 0:
        return int(N == 0)
    if len(B) == 1:
        return int(N <= B[0])
    total = 0
    for c in range(min(N, B[0]) + 1):
        total += ballPlacing(N - c, B[1:])
    return total

print(ballPlacing(3, [3, 3, 3]))  # 10
print(ballPlacing(3, [3, 3, 0]))  # 4
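If you need the outcomes themselves rather than just their count, the same recursion can be turned into a generator (a sketch; ball_placings is my own name, not from the answer):
def ball_placings(N, B):
    # Yield tuples of bin contents instead of counting them
    if len(B) == 1:
        if N <= B[0]:
            yield (N,)
        return
    for c in range(min(N, B[0]) + 1):
        for rest in ball_placings(N - c, B[1:]):
            yield (c,) + rest

print(list(ball_placings(3, [3, 3, 0])))  # [(0, 3, 0), (1, 2, 0), (2, 1, 0), (3, 0, 0)]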

Normal distributed sub-sampling from a numpy array in python

I have a numpy array whose values are distributed in the following manner
From this array I need to get a random sub-sample which is normally distributed.
I need to get rid of the values from the array which are above the red line in the picture, i.e. I need to remove some occurrences of certain values from the array so that the distribution is smoothed once the abrupt peaks are removed.
And my array's distribution should become like this:
Can this be achieved in Python without manually looking for the entries corresponding to the peaks and removing some occurrences of them? Can this be done in a simpler way?
The following kind of works; it is rather aggressive, though:
It works by ordering the samples, transforming to uniform and then trying to select a regular, grid-like subsample. If you feel it is too aggressive you could increase ns, which is essentially the number of samples kept.
Also, please note that it requires knowledge of the true distribution. In the case of a normal distribution you should be fine using the sample mean and the unbiased variance estimate (the one with n-1).
Code (without plotting):
import scipy.stats as ss
import numpy as np

a = ss.norm.rvs(size=1000)
b = ss.uniform.rvs(size=1000) < 0.4
a[b] += 0.1*np.sin(10*a[b])

def smooth(a, gran=25):
    o = np.argsort(a)
    s = ss.norm.cdf(a[o])
    ns = int(gran / np.max(s[gran:] - s[:-gran]))
    grid, dp = np.linspace(0, 1, ns, endpoint=False, retstep=True)
    grid += dp/2
    idx = np.searchsorted(s, grid)
    c = np.flatnonzero(idx[1:] <= idx[:-1])
    while c.size > 0:
        idx[c+1] = idx[c] + 1
        c = np.flatnonzero(idx[1:] <= idx[:-1])
    idx = idx[:np.searchsorted(idx, len(a))]
    return o[idx]

ap = a[smooth(a)]
c, b = np.histogram(a, 40)
cp, _ = np.histogram(ap, b)
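As an optional sanity check (my addition, not part of the answer above), you could compare the thinned sample against the fitted normal with a Kolmogorov-Smirnov test:
# Compare the subsample `ap` to a normal distribution with its own fitted mean/std
print(ss.kstest(ap, 'norm', args=(ap.mean(), ap.std(ddof=1))))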

python, plot planck curves looping through arrays

I'm trying to get familiar with programming in Python; I have just started and am struggling with the following problem. Maybe someone can give me a hint on how to proceed or where I can look for a nice solution.
I'd like to plot Planck curves for 132 wavelengths at 6 different temperatures via a loop within a loop. The function planckwavel receives two parameters, wavelength and temperature, which I separated into two loops.
So far I managed to use lists, which worked, though probably not in an elegant way:
plancks = []
temp = [280, 300, 320, 340, 360, 380]
temp_len = len(temp)

### via fun planckwavel
for i in range(temp_len):
    t_list = []  # clear the list after each j loop
    for j in range(wl_centers_ar.shape[0]):
        t = planckwavel(wl_centers_ar[j], temp[i])
        t_list.append(t)
    plancks.append(t_list)

### PLOT Planck curves
plancks = np.array(plancks).T  # convert list to array and transpose
view_7 = plt.figure(figsize=(8.5, 4.5))
plt.plot(wl_centers_ar, plancks)
plt.xticks(rotation='vertical')
But I would like to use arrays instead of lists, as I want to move on to much larger multi-dimensional images afterwards. So I tried the same with arrays but unfortunately failed with this code:
plancks_ar = zeros([132, 6], dtype=float)  # create array and fill with zeros
temp_ar = array([273, 300, 310, 320, 350, 373])
for i in range(temp_ar.shape[0]):
    t_ar = np.zeros(plancks_ar.shape[0])
    for j in range(plancks_ar.shape[0]):
        t = planck(wl_centers_ar[j]*1e-6, temp[1])/10**6
        np.append(t_ar, t)
    np.append(plancks_ar, t_ar)
plt.plot(wl_centers_ar, plancks)
I would be very thankful if someone could give me some advice.
Thanks,
best regards,
peter
I think you're asking about how to use NumPy's broadcasting and vectorization. Here's a way to remove the explicit Python loops:
import numpy as np

# Some physical constants we'll need
h, kB, c = 6.626e-34, 1.381e-23, 2.998e8

def planck(lam, T):
    # The Planck function, using NumPy vectorization
    return 2*h*c**2/lam**5 / (np.exp(h*c/lam/kB/T) - 1)

# wavelength array, 3 - 75 um
lam = np.linspace(3, 75, 132)
# temperature array
T = np.array([280, 300, 320, 340, 360, 380])

# Remember to convert wavelength from um to m
pfuncs = planck(lam * 1.e-6, T[:, None])

import pylab
for pfunc in pfuncs:
    pylab.plot(lam, pfunc)
pylab.show()
We want to calculate planck for each wavelength and for each T, so we need to broadcast the calculation over the two arrays. Following NumPy's broadcasting rules, we can do that by adding a new axis to the temperature array (with T[:, None]):
lam:          132
T:        6 x   1
-----------------
          6 x 132
The final dimension of T[:, None] is 1, so the 132 values of lam can be broadcast across it to produce a 6 x 132 array: 6 rows (one for each T) of 132 values (the wavelengths).
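A quick shape check (my addition), reusing lam and T from the snippet above:
print(T[:, None].shape, lam.shape)           # (6, 1) (132,)
print(np.broadcast(T[:, None], lam).shape)   # (6, 132)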
I tried to double-check the Planck equation above using the inverse (brightness temperature). Based on your code I defined the following function and expected to get 300 Kelvin (at 10 microns, for 10 W/m^2/sr/micron):
def planckInv(lam, rad):
    rad = rad*1.e6   # convert to W/m^2/m/sr
    lam = lam*1.e-6  # convert wavelength to m
    return (h*c/kB*lam)*(1/np.log((2*h*c**2/lam**5)/rad + 1))
but received a strange result:
planckInv(10, 10)  ->  3.0039933569668916e-08
Any suggestions what's wrong with my brightness temperature function?
thanks,
peter
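One thing worth checking (my own observation, not part of the original thread) is operator precedence: in Python, h*c/kB*lam evaluates as (h*c/kB)*lam, not h*c/(kB*lam). A corrected sketch with explicit parentheses, which gives roughly 300 K for the example above:
import numpy as np
h, kB, c = 6.626e-34, 1.381e-23, 2.998e8

def planckInv(lam, rad):
    # lam in microns, rad in W/m^2/sr/micron
    rad = rad*1.e6   # convert to W/m^2/m/sr
    lam = lam*1.e-6  # convert wavelength to m
    # note the parentheses around (kB*lam)
    return h*c/(kB*lam) / np.log(2*h*c**2/lam**5/rad + 1)

print(planckInv(10, 10))   # approx. 300 K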

How to qcut with non unique bin edges?

My question is the same as this previous one:
Binning with zero values in pandas
however, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0 and the rest between 1 and 100, how would I categorize all the 0 values in fractile 1 and the non-zero values in fractile labels 2 to 10 (assuming I want 10 fractiles)? Could I convert the 0s to NaN, qcut the remaining non-NaN data into 9 fractiles (1 to 9), then add 1 to each label (now 2 to 10) and label all the 0 values as fractile 1 manually? Even this is tricky, because in addition to the 600 values my data set has another couple hundred values which may already be NaN before I convert the 0s to NaN.
Update 1/26/14:
I came up with the following interim solution. The problem with this code, though, is that if the high-frequency value is not at the edge of the distribution, it inserts an extra bin in the middle of the existing set of bins and throws everything a little (or a lot) off.
import bisect
import pandas as pd

def fractile_cut(ser, num_fractiles):
    num_valid = ser.valid().shape[0]
    remain_fractiles = num_fractiles
    vcounts = ser.value_counts()
    high_freq = []
    i = 0
    while vcounts.iloc[i] > num_valid / float(remain_fractiles):
        curr_val = vcounts.index[i]
        high_freq.append(curr_val)
        remain_fractiles -= 1
        num_valid = num_valid - vcounts.iloc[i]
        i += 1
    curr_ser = ser.copy()
    curr_ser = curr_ser[~curr_ser.isin(high_freq)]
    qcut = pd.qcut(curr_ser, remain_fractiles, retbins=True)
    qcut_bins = qcut[1]
    all_bins = list(qcut_bins)
    for val in high_freq:
        bisect.insort(all_bins, val)
    cut = pd.cut(ser, bins=all_bins)
    ser_fractiles = pd.Series(cut.labels + 1, index=ser.index)
    return ser_fractiles
The problem is that pandas.qcut chooses the bins/quantiles so that each one has the same number of records, but all records with the same value must stay in the same bin/quantile (this behaviour is in accordance with the statistical definition of quantile).
The solutions are:
1 - Use pandas >= 0.20.0, which includes a fix for this. It adds an option duplicates='raise'|'drop' to control whether to raise on duplicated edges or to drop them, which results in fewer bins than specified, some of them larger (with more elements) than others (see the example after this list).
2 - Decrease the number of quantiles. Fewer quantiles means more elements per quantile.
3 - Rank your data with DataFrame.rank(method='first'). The ranking assigns a unique value to each element in the dataframe (the rank) while keeping the order of the elements (except for identical values, which are ranked in the order they appear in the array; see method='first').
Example:
pd.qcut(df, nbins) <-- this generates "ValueError: Bin edges must be unique"
Then use this instead:
pd.qcut(df.rank(method='first'), nbins)
4 - Specify a custom quantile range, e.g. [0, .50, .75, 1.], to get an unequal number of items per quantile.
5 - Use pandas.cut, which chooses bins that are evenly spaced according to the values themselves, while pandas.qcut chooses bins so that you have the same number of records in each bin.
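To illustrate option 1, a minimal sketch (my own toy data, assuming pandas >= 0.20): the duplicated edges are collapsed, so you get fewer, larger bins instead of a ValueError.
import pandas as pd
# Half the values are 0, so several decile edges coincide at 0
s = pd.Series([0]*50 + list(range(1, 51)))
print(pd.qcut(s, 10, duplicates='drop').value_counts())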
Another way to do this is to introduce a minimal amount of noise, which will artificially create unique bin edges. Here's an example:
import numpy as np
import pandas as pd

a = pd.Series(list(range(100)) + [0]*20)

def jitter(a_series, noise_reduction=1000000):
    return (np.random.random(len(a_series)) * a_series.std() / noise_reduction) - (a_series.std() / (2 * noise_reduction))

# and now this works by adding a little noise
a_deciles = pd.qcut(a + jitter(a), 10, labels=False)
We can recreate the original error with something like this:
a_deciles = pd.qcut(a, 10, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 173, in qcut
precision=precision, include_lowest=True)
File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 192, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 0. , 0. , 0. , 3.8 ,
11.73333333, 19.66666667, 27.6 , 35.53333333,
43.46666667, 51.4 , 59.33333333, 67.26666667,
75.2 , 83.13333333, 91.06666667, 99. ])
You ask about binning with non-unique bin edges, for which I have a fairly simple answer. In the case of your example, your intent and the behavior of qcut diverge in the pandas.tools.tile.qcut function, where the bins are defined:
bins = algos.quantile(x, quantiles)
Because your data is 50% 0s, this causes bins to be returned with multiple bin edges at the value 0 for any value of quantiles greater than 2. I see two possible resolutions. In the first, the fractile space is divided evenly, binning all 0s, but not only 0s, in the first bin. In the second, the fractile space is divided evenly for values greater than 0, binning all 0s, and only 0s, in the first bin.
import numpy as np
import pandas as pd
import pandas.core.algorithms as algos
from pandas import Series
In both cases, I'll create some random sample data fitting your description of 50% zeroes and the remaining values between 1 and 100
zs = np.zeros(300)
rs = np.random.randint(1, 100, size=300)
arr=np.concatenate((zs, rs))
ser = Series(arr)
Solution 1: bin 1 contains both 0s and low values
bins = algos.quantile(np.unique(ser), np.linspace(0, 1, 11))
result = pd.tools.tile._bins_to_cuts(ser, bins, include_lowest=True)
The result is
In[61]: result.value_counts()
Out[61]:
[0, 9.3] 323
(27.9, 38.2] 37
(9.3, 18.6] 37
(88.7, 99] 35
(57.8, 68.1] 32
(68.1, 78.4] 31
(78.4, 88.7] 30
(38.2, 48.5] 27
(48.5, 57.8] 26
(18.6, 27.9] 22
dtype: int64
Solution 2: bin1 contains only 0s
mx = np.ma.masked_equal(arr, 0, copy=True)
bins = algos.quantile(arr[~mx.mask], np.linspace(0, 1, 11))
bins = np.insert(bins, 0, 0)
bins[1] = bins[1]-(bins[1]/2)
result = pd.tools.tile._bins_to_cuts(arr, bins, include_lowest=True)
The result is:
In[133]: result.value_counts()
Out[133]:
[0, 0.5] 300
(0.5, 11] 32
(11, 18.8] 28
(18.8, 29.7] 30
(29.7, 39] 35
(39, 50] 26
(50, 59] 31
(59, 71] 31
(71, 79.2] 27
(79.2, 90.2] 30
(90.2, 99] 30
dtype: int64
There is work that could be done to Solution 2 to make it a little prettier I think, but you can see that the masked array is a useful tool to approach your goals.
If you want to enforce equal-size bins, even in the presence of duplicate values, you can use the following two-step process:
Rank your values, using method='first' to have pandas assign a unique rank to every record. If there is a duplicate value (i.e. a tie in the rank), ties are ranked in the order the records are encountered.
df['rank'] = df['value'].rank(method='first')
Use qcut on the rank to determine equal sized quantiles. Below example creates deciles (bins=10).
df['decile'] = pd.qcut(df['rank'].values, 10).codes
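Putting the two steps together on a small, made-up frame (my own example, not from the answer):
import pandas as pd
df = pd.DataFrame({'value': [0]*5 + list(range(1, 16))})
df['rank'] = df['value'].rank(method='first')
df['decile'] = pd.qcut(df['rank'], 10, labels=False)
print(df['decile'].value_counts().sort_index())  # two records per decile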
I've had a lot of problems with qcut as well, so I used the Series.rank function combined with creating my own bins using those results. My code is on Github:
https://gist.github.com/ashishsingal1/e1828ffd1a449513b8f8
I had this problem as well, so I wrote a small function which only bins the non-zero values and then inserts the labels where the original value was not 0.
import numpy as np
import pandas as pd

def qcut2(x, n=10):
    x = np.array(x)
    x_index_not0 = [i for i in range(len(x)) if x[i] > 0]
    x_cut_not0 = pd.qcut(x[x > 0], n - 1, labels=False) + 1
    y = np.zeros(len(x))
    y[x_index_not0] = x_cut_not0
    return y
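A quick usage check (my own example, using the function just defined): the zeros all land in label 0, and the remaining values get labels 1 to n-1.
print(qcut2([0, 0, 0, 0, 5, 3, 8, 9, 1, 2], n=4))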

statistics for histogram of periodic data

For a series of angle values in the (-pi, pi) range, I make a histogram. Is there an effective way to calculate the mean and modal (most probable) value? Consider the following example:
import numpy as N, cmath
deg = N.pi/180.
d = N.array([-175., 170, 175, 179, -179])*deg
i = N.sum(N.exp(1j*d))
ave = cmath.phase(i)
i /= float(d.size)
stdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
print(ave/deg, stdev/deg)
Now, let's have a histogram:
counts, bins = N.histogram(d, N.linspace(-N.pi, N.pi, 360))
Is it possible to calculate mean, mode having counts and bins? For non-periodic data, calculation of a mean is straightforward:
ave = sum(counts*bins[:-1])
Calculations of a modal value requires more effort. Actually, I'm not sure my code below is correct: firstly, I identify bins which occur most frequently and then I calculate an arithmetic mean:
cmax = bins[N.argmax(counts)]
mode = N.mean(N.take(bins, N.nonzero(counts == cmax)[0]))
I have no idea how to calculate the standard deviation from such data, though. One obvious solution to all my problems (at least those described above) is to convert the histogram data back to a data series and then use it in the calculations. This is neither elegant nor efficient, however.
Any hints will be very appreciated.
This is the partial solution I wrote.
import numpy as N, cmath
import scipy.stats as ST

d = [-175, 170.2, 175.57, 179, -179, 170.2, 175.57, 170.2]
deg = N.pi/180.
data = N.array(d)*deg

i = N.sum(N.exp(1j*data))
ave = cmath.phase(i)          # correct and exact mean for periodic data
wrong_ave = N.mean(d)
i /= float(data.size)
stdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))
wrong_stdev = N.std(d)

bins = N.linspace(-N.pi, N.pi, 360)
counts, bins = N.histogram(data, bins, normed=False)

# consider it weighted vector addition
nz = N.nonzero(counts)[0]
weight = counts[nz]
i = N.sum(weight * N.exp(1j*bins[nz])/len(nz))
pave = cmath.phase(i)         # correct and approximated mean for periodic data
i /= sum(weight)/float(len(nz))
pstdev = -2. * N.log(N.sqrt(i.real**2 + i.imag**2))

print()
print('scipy: %12.3f (mean) %12.3f (stdev)' %
      (ST.circmean(data)/deg, ST.circstd(data)/deg))
When run, it gives the following results:
mean: 175.840 85.843 175.360
stdev: 0.472 151.785 0.430
scipy: 175.840 (mean) 3.673 (stdev)
A few comments now: the first column gives the calculated mean/stdev. As can be seen, the mean agrees well with scipy.stats.circmean (thanks JoeKington for pointing it out). Unfortunately the stdev differs; I will look at it later. The second column gives completely wrong results (the non-periodic mean/std from numpy obviously does not work here). The third column gives what I wanted to obtain from the histogram data (@JoeKington: my raw data won't fit in my computer's memory; @dmytro: thanks for your input: of course the bin size will influence the result, but in my application I don't have much choice, i.e. I have to reduce the data somehow). As can be seen, the mean (third column) is properly calculated; the stdev needs further attention :)
Have a look at scipy.stats.circmean and scipy.stats.circstd.
Or do you only have the histogram counts, and not the "raw" data? If so, you could fit a Von Mises distribution to your histogram counts and approximate the mean and stddev in that way.
Here's how to get an approximation.
Since Var(x) = <x^2> - <x>^2, we have:
meanX = N.sum(counts * bins[:-1]) / N.sum(counts)
meanX2 = N.sum(counts * bins[:-1]**2) / N.sum(counts)
std = N.sqrt(meanX2 - meanX**2)
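For the periodic case, the same counts-as-weights idea can be applied to complex unit vectors at the bin centres. A self-contained sketch (my own, reusing the sample angles from above); as far as I can tell, scipy.stats.circstd corresponds to sqrt(-2*ln R), i.e. the square root of the quantity computed in the question, which may explain the stdev mismatch noted there:
import numpy as N, cmath

deg = N.pi/180.
data = N.array([-175, 170.2, 175.57, 179, -179, 170.2, 175.57, 170.2])*deg
counts, bins = N.histogram(data, N.linspace(-N.pi, N.pi, 360))

# weight the complex unit vectors at the bin centres by the counts
centres = (bins[:-1] + bins[1:]) / 2.0
z = N.sum(counts * N.exp(1j*centres)) / N.sum(counts)
print(cmath.phase(z)/deg)                # circular mean from the histogram
print(N.sqrt(-2.*N.log(N.abs(z)))/deg)   # circular stdev, scipy's convention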
