What exactly does this random.uniform line in Python do?

I'm following a tutorial here from Andrew Cross on using random forests in Python. I got the code to run fine, and for the most part I understand the output. However, I am unsure on exactly what this line does:
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
I know that it "creates a (random) uniform distribution between 0 and 1 and assigns 3/4ths of the data to be in the training subset." However, the training subset is not always exactly 3/4 of the data; sometimes it is smaller and sometimes it is larger. So is a randomly sized subset chosen that is approximately 75%? Why not make it exactly 75%?

np.random.uniform(0, 1, len(df)) creates an array of len(df) random numbers.
<= .75 then creates another array containing True where the numbers satisfy that condition, and False elsewhere.
The code then uses the rows at the indexes where True was found. Since the random draw is... well, random, you won't get exactly 75% of the values.
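As a quick illustration (a minimal sketch with a made-up toy DataFrame, not the tutorial's data), the boolean column is then typically used like this to split the rows:
import numpy as np
import pandas as pd

df = pd.DataFrame({'feature': range(8)})                    # toy data, for illustration only
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75    # True with probability 0.75 per row

train, test = df[df['is_train']], df[~df['is_train']]       # boolean indexing splits the rows
print(len(train), len(test))                                # roughly a 75/25 split, rarely exact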

It does not assign 3/4ths of the data to be in the training subset.
It assigns the probability that data is in the training subset to be 3/4:
Example:
>>> import numpy as np
>>> sum(np.random.uniform(0, 1, 10) < .75)
8
>>> sum(np.random.uniform(0, 1, 10) < .75)
10
>>> sum(np.random.uniform(0, 1, 10) < .75)
7
80% of the data is in the training subset in the 1st example
100% -- in the 2nd one
70% -- in the 3rd.
On average, it should be 75%.
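As a quick sanity check (a small sketch, not part of the original answer), the fraction converges to 0.75 as the sample grows:
import numpy as np

n = 1_000_000
fraction = np.sum(np.random.uniform(0, 1, n) < .75) / n
print(fraction)  # very close to 0.75, varying slightly from run to run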

If you want to be stricter and randomly select a training set whose size is always very close to 75%, you can use code like this:
d = np.random.uniform(0, 1, 1000)
p = np.percentile(d, 75)
print(np.sum(d <= p)) # 750
print(np.sum(d <= .75)) # 745
In your example:
d = np.random.uniform(0, 1, len(df))
p = np.percentile(d, 75)
df['is_train'] = d <= p
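If you need the split to be exactly 75% of the rows (a different technique from the tutorial's, shown here only as a rough sketch), you can shuffle the row positions and mark the first three quarters as training:
import numpy as np

n = len(df)
idx = np.random.permutation(n)        # random ordering of the row positions
n_train = int(round(0.75 * n))        # exactly 75% of the rows, up to rounding
mask = np.zeros(n, dtype=bool)
mask[idx[:n_train]] = True
df['is_train'] = mask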

Related

Finding the mean time to reach a given threshold for random walk in python

I have code for a random walk of 10000 steps. I then had to repeat the simulation 12 times and save each run in a separate text file, which I have done. I have now been given a task: to find, on average, how many steps it takes for the random walk to reach x = 10 when it starts from (0,0). In other words, if you imagine a north-south line at x = 10, I need to calculate, over my 12 walks, the mean number of steps it takes to reach x = 10 from the starting position (0,0). I think it would involve an if statement, but I'm not sure how to use that in my code, how to get my code to tell me how many steps it took to get there for each run, and then how to calculate the mean over all the runs.
My code for the random walk and saving different runs in separate text files is as follows:
import numpy as np
import matplotlib.pyplot as plt
import random as rd
import math

a = np.zeros((10000, 2), dtype=float)

def randwalk(x, y):
    theta = 2*math.pi*rd.random()
    x += math.cos(theta)
    y += math.sin(theta)
    return (x, y)

x, y = 0., 0.
for i in range(10000):
    x, y = randwalk(x, y)
    a[i, :] = x, y

plt.figure()
plt.title("10000 steps 2D Random Walk")
plt.plot(a[:, 0], a[:, 1], color='y')
plt.show()

N = 12
for j in range(N):
    rd.seed(j)
    x, y = 0., 0.
    for i in range(10000):
        x, y = randwalk(x, y)
        a[i, :] = x, y
    filename = 'Walk' + str(j)
    np.savetxt(filename, a)
The short answer:
np.argmax(np.abs(a[:, 0]) > 10)
What this does:
np.abs(a[:, 0]) is the absolute value of the x values
* > 10 is an array that's True whenever the above is greater than 10
np.argmax(*) gives the index of the maximum value in *. Since the array is boolean, the max is True (1), and argmax always returns the first occurrence of the max, so it's the first True.
Thus this gives the index of the first step beyond 10, which is equivalent to how long it took to reach |x| > 10.
Now, the reason I used the absolute value is that a walk is not guaranteed to ever reach 10 (or -10, though reaching neither is much less likely), so you'll have some corner cases to cover if you only want x = 10.
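Putting it together with the files saved by the question's code, a rough sketch (assuming the 'Walk0' ... 'Walk11' files exist, and skipping any walk that never crosses |x| > 10) for the mean number of steps could look like this:
import numpy as np

steps = []
for j in range(12):
    walk = np.loadtxt('Walk' + str(j))     # columns are x, y
    crossed = np.abs(walk[:, 0]) > 10      # drop np.abs if you want x > 10 only
    if crossed.any():                      # guard the corner case: threshold never reached
        steps.append(np.argmax(crossed))   # index of the first step beyond the threshold
print("mean steps to reach |x| > 10:", np.mean(steps))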

How is scikit-learn's FastICA deconvoluting two signals that are each composed of Gaussian noise?

Not sure if best to ask this here or on StackExchange, but since it's a programming question as well as potentially a math question, here goes.
The question is about FastICA.
Given input time series ("observations" below), where each time series is a linear mix of n_components signals, ICA returns the signals and a mixing matrix. From http://www.cs.jhu.edu/~ayuille/courses/Stat161-261-Spring14/HyvO00-icatut.pdf, section 3, I understand that at most one signal may be Gaussian noise. But below I seem to demonstrate that FastICA recovers two signals even when both are noise (here as a function of length of time series from just a single time step to 10000 time steps, for 16 time series):
# Snippet below adapted from http://scikit-learn.org/stable/auto_examples/decomposition/plot_ica_blind_source_separation.html
import numpy as np
import matplotlib.pyplot as plt
from scipy import signal
from sklearn.decomposition import FastICA, PCA

for i in [1, 2, 3, 4, 5, 10, 20, 100, 1000, 10000]:  # number of timepoints
    # Generate sample data
    np.random.seed(0)
    n_samples = i
    time = np.linspace(0, 8, n_samples)

    s1 = np.array([np.random.normal() for q in range(i)])
    s2 = np.array([np.random.normal() for q in range(i)])

    S = np.c_[s1, s2]
    S += 0.2 * np.random.normal(size=S.shape)  # Add extra noise, just to muddy the signals

    S /= S.std(axis=0)  # Standardize data
    # Mix data
    A = np.array([[np.random.normal(), np.random.normal()] for j in range(16)])  # Mixing matrix
    X = np.dot(S, A.T)  # Generate observations

    # Compute ICA
    ica = FastICA(n_components=2)
    print(i, "\t", end="")
    try:
        S_ = ica.fit_transform(X)  # Reconstruct signals
    except ValueError:
        print("ValueError: ICA does not run")
        continue
    A_ = ica.mixing_  # Get estimated mixing matrix

    # We can `prove` that the ICA model applies by reverting the unmixing.
    print(np.allclose(X, np.dot(S_, A_.T) + ica.mean_))  # X - AS ~ 0
Output:
1 ValueError: ICA does not run
2 False
3 True
4 True
5 True
10 True
20 True
100 True
1000 True
10000 True
Why does this work? I.e., why is X - AS ~ 0 (the allclose() condition above)? Note it still works if the number of datasets we generate is much larger than the 16 I used here (e.g. 1,000 time series still works).

How to properly sample truncated distributions?

I am trying to learn how to sample truncated distributions. To begin with, I decided to try a simple example I found here.
I didn't really understand the division by the CDF, so I decided to tweak the algorithm a bit. The distribution being sampled is an exponential distribution for values x > 0. Here is example Python code:
# Sample exponential distribution for the case x>0
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def pdf(x):
    return x*np.exp(-x)

xvec = np.zeros(1000000)
x = 1.
for i in range(1000000):
    a = x + np.random.normal()
    xs = x
    if a > 0.:
        xs = a
    A = pdf(xs)/pdf(x)
    if np.random.uniform() < A:
        x = xs
    xvec[i] = x

x = np.linspace(0, 15, 1000)
plt.plot(x, pdf(x))
plt.hist([x for x in xvec if x != 0], bins=150, density=True)
plt.show()
And the output is a plot of the sampled histogram against the analytic pdf.
The code above seems to work fine only when using the condition if a > 0., i.e. positive x; choosing another condition (e.g. if a > 0.5) produces wrong results.
Since my final goal was to sample a 2D Gaussian pdf on a truncated interval, I tried extending the simple example using the exponential distribution (see the code below). Unfortunately, since the simple case didn't work, I assume that the code given below also yields wrong results.
I assume that all this can be done using Python's more advanced tools. However, since my primary goal was to understand the underlying principle, I would greatly appreciate your help in understanding my mistake.
Thank you for your help.
EDIT:
# code updated according to the answer of CrazyIvan
import numpy as np
from scipy.stats import multivariate_normal

RANGE = 100000
a = 2.06072E-02
b = 1.10011E+00
a_range = [0.001, 0.5]
b_range = [0.01, 2.5]
cov = [[3.1313994E-05, 1.8013737E-03], [1.8013737E-03, 1.0421529E-01]]
x = a
y = b
j = 0
for i in range(RANGE):
    a_t, b_t = np.random.multivariate_normal([a, b], cov)
    # accept if within bounds - all that is needed to truncate
    if a_range[0] < a_t and a_t < a_range[1] and b_range[0] < b_t and b_t < b_range[1]:
        print(a_t, b_t)  # print the accepted sample
EDIT:
I changed the code by normalizing the analytic pdf according to this scheme, and according to the answers given by @CrazyIvan and @Leandro Caniglia, for the case where the bottom of the pdf is removed: that is, dividing by (1 - CDF(0.5)), since my accept condition is x > 0.5. This again seems to show some discrepancies, and the mystery prevails.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def pdf(x):
    return x*np.exp(-x)

# included the corresponding cdf
def cdf(x):
    return 1. - np.exp(-x) - x*np.exp(-x)

xvec = np.zeros(1000000)
x = 1.
for i in range(1000000):
    a = x + np.random.normal()
    xs = x
    if a > 0.5:
        xs = a
    A = pdf(xs)/pdf(x)
    if np.random.uniform() < A:
        x = xs
    xvec[i] = x

x = np.linspace(0, 15, 1000)
# new part: norm the analytic pdf to fix the area
plt.plot(x, pdf(x)/(1. - cdf(0.5)))
plt.hist([x for x in xvec if x != 0], bins=200, density=True)
plt.savefig("test_exp.png")
plt.show()
It seems that this can be cured by choosing a larger shift size:
shift = 15.
a = x + np.random.normal()*shift
which is in general an issue with the Metropolis-Hastings algorithm. See the graph below:
I also checked shift = 150.
The bottom line is that changing the shift size definitely improves the convergence. The mystery is why, since the Gaussian is unbounded.
You say you want to learn the basic idea of sampling a truncated distribution, but your source is a blog post about
the Metropolis-Hastings algorithm? Do you actually need this "method for obtaining a sequence of random samples from a probability distribution for which direct sampling is difficult"? Taking this as your starting point is like learning English by reading Shakespeare.
Truncated normal
For a truncated normal, basic rejection sampling is all you need: generate samples from the original distribution and reject those outside the bounds. As Leandro Caniglia noted, you should not expect the truncated distribution to have the same PDF except on a shorter interval; this is plainly impossible because the area under the graph of a PDF is always 1. If you cut off stuff from the sides, there has to be more in the middle; the PDF gets rescaled.
It's quite inefficient to gather samples one by one, when you need 100000. I would grab 100000 normal samples at once, accept only those that fit; then repeat until I have enough. Example of sampling truncated normal between amin and amax:
import numpy as np

n_samples = 100000
amin, amax = -1, 2

samples = np.zeros((0,))  # empty for now
while samples.shape[0] < n_samples:
    s = np.random.normal(0, 1, size=(n_samples,))
    accepted = s[(s >= amin) & (s <= amax)]
    samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples]  # we probably got more than needed, so discard extra ones
And here is the comparison with the PDF curve, rescaled by division by cdf(amax) - cdf(amin) as explained above.
from scipy.stats import norm
import matplotlib.pyplot as plt

_ = plt.hist(samples, bins=50, density=True)
t = np.linspace(-2, 3, 500)
plt.plot(t, norm.pdf(t)/(norm.cdf(amax) - norm.cdf(amin)), 'r')
plt.show()
Truncated multivariate normal
Now we want to keep the first coordinate between amin and amax, and the second between bmin and bmax. Same story, except there will be a 2-column array and the comparison with bounds is done in a relatively sneaky way:
(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)
This means: subtract amin, bmin from each row and keep only the rows where both results are nonnegative (meaning we had a >= amin and b >= bmin). Also do a similar thing with amax, bmax. Accept only the rows that meet both criteria.
n_samples = 10
amin, amax = -1, 2
bmin, bmax = 0.2, 2.4
mean = [0.3, 0.5]
cov = [[2, 1.1], [1.1, 2]]

samples = np.zeros((0, 2))  # 2 columns now
while samples.shape[0] < n_samples:
    s = np.random.multivariate_normal(mean, cov, size=(n_samples,))
    accepted = s[(np.min(s - [amin, bmin], axis=1) >= 0) & (np.max(s - [amax, bmax], axis=1) <= 0)]
    samples = np.concatenate((samples, accepted), axis=0)
samples = samples[:n_samples, :]
Not going to plot, but here are some values: naturally, within bounds.
array([[ 0.43150033, 1.55775629],
[ 0.62339265, 1.63506963],
[-0.6723598 , 1.58053835],
[-0.53347361, 0.53513105],
[ 1.70524439, 2.08226558],
[ 0.37474842, 0.2512812 ],
[-0.40986396, 0.58783193],
[ 0.65967087, 0.59755193],
[ 0.33383214, 2.37651975],
[ 1.7513789 , 1.24469918]])
To compute the truncated density function pdf_t from the entire density function pdf, do the following:
Let [a, b] be the truncation interval; (x axis)
Let A := cdf(a) and B := cdf(b); (cdf = non-truncated cumulative distribution function)
Then pdf_t(x) := pdf(x) / (B - A) if x in [a, b] and 0 elsewhere.
In cases where a = -infinity (resp. b = +infinity), take A := 0 (resp. B := 1).
As for the "mystery" you see: please note that your blue curve is wrong. It is not the pdf of your truncated distribution; it is just the pdf of the non-truncated one, scaled by the correct amount (division by 1 - cdf(0.5)). The actual truncated pdf curve starts with a vertical line at x = 0.5, which goes up until it reaches your current blue curve. In other words, you only scaled the curve but forgot to truncate it, in this case to the left. Such a truncation corresponds to the "0 elsewhere" part of step 3 in the algorithm above.
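To make step 3 concrete with the question's own pdf and cdf (a small sketch, not part of the original answers), the properly truncated curve is zero below the cut point and rescaled above it:
import numpy as np
import matplotlib.pyplot as plt

def pdf(x):
    return x*np.exp(-x)

def cdf(x):
    return 1. - np.exp(-x) - x*np.exp(-x)

a_cut = 0.5                              # truncation point used in the question
t = np.linspace(0, 15, 1000)
pdf_t = np.where(t >= a_cut, pdf(t)/(1. - cdf(a_cut)), 0.)  # rescale on [a_cut, inf), 0 elsewhere
plt.plot(t, pdf_t)                       # note the vertical jump at x = 0.5
plt.show()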

Normal distributed sub-sampling from a numpy array in python

I have a numpy array whose values are distributed in the following manner
From this array I need to get a random sub-sample which is normally distributed.
I need to get rid of the values from the array which are above the red line in the picture, i.e. I need to remove some occurrences of certain values from the array so that the distribution is smoothed out once the abrupt peaks are removed.
And my array's distribution should become like this:
Can this be achieved in Python without manually looking for the entries corresponding to the peaks and removing some occurrences of them? Can this be done in a simpler way?
The following kind of works; it is rather aggressive, though:
It works by ordering the samples, transforming them to uniform (via the normal CDF) and then trying to select a regular, grid-like subsample. If you feel it is too aggressive, you could increase ns, which is essentially the number of samples kept.
Also, please note that it requires knowledge of the true distribution. In the case of a normal distribution you should be fine using the sample mean and the unbiased variance estimate (the one with n-1); a small sketch of that substitution follows the code.
Code (without plotting):
import scipy.stats as ss
import numpy as np

a = ss.norm.rvs(size=1000)
b = ss.uniform.rvs(size=1000) < 0.4
a[b] += 0.1*np.sin(10*a[b])

def smooth(a, gran=25):
    o = np.argsort(a)
    s = ss.norm.cdf(a[o])
    ns = int(gran / np.max(s[gran:] - s[:-gran]))
    grid, dp = np.linspace(0, 1, ns, endpoint=False, retstep=True)
    grid += dp/2
    idx = np.searchsorted(s, grid)
    c = np.flatnonzero(idx[1:] <= idx[:-1])
    while c.size > 0:
        idx[c+1] = idx[c] + 1
        c = np.flatnonzero(idx[1:] <= idx[:-1])
    idx = idx[:np.searchsorted(idx, len(a))]
    return o[idx]

ap = a[smooth(a)]

c, b = np.histogram(a, 40)
cp, _ = np.histogram(ap, b)
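As noted above, the method needs the true distribution; as a hedged sketch (an assumption on my part, not part of the original answer), you could plug in sample estimates of the mean and standard deviation instead of the standard normal CDF:
import numpy as np
import scipy.stats as ss

# Sample estimates of the normal parameters (the "sample mean and unbiased variance" mentioned above);
# inside smooth(), ss.norm.cdf(a[o]) would then become ss.norm.cdf(a[o], loc=mu, scale=sigma).
mu = a.mean()
sigma = a.std(ddof=1)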

How to qcut with non unique bin edges?

My question is the same as this previous one:
Binning with zero values in pandas
However, I still want to include the 0 values in a fractile. Is there a way to do this? In other words, if I have 600 values, 50% of which are 0 and the rest between, say, 1 and 100, how would I put all the 0 values in fractile 1 and the non-zero values in fractiles 2 to 10 (assuming I want 10 fractiles)? Could I convert the 0s to nan, qcut the remaining non-nan data into 9 fractiles (1 to 9), then add 1 to each label (now 2 to 10) and label all the 0 values as fractile 1 manually? Even this is tricky, because in addition to the 600 values, my data set has another couple hundred values which may already be nan before I convert the 0s to nan.
Update 1/26/14:
I came up with the following interim solution. The problem with this code, though, is that if the high-frequency value is not at the edge of the distribution, it inserts an extra bin in the middle of the existing set of bins and throws everything off a little (or a lot).
import bisect
import pandas as pd

def fractile_cut(ser, num_fractiles):
    num_valid = ser.dropna().shape[0]
    remain_fractiles = num_fractiles
    vcounts = ser.value_counts()
    high_freq = []
    i = 0
    while vcounts.iloc[i] > num_valid / float(remain_fractiles):
        curr_val = vcounts.index[i]
        high_freq.append(curr_val)
        remain_fractiles -= 1
        num_valid = num_valid - vcounts.iloc[i]
        i += 1
    curr_ser = ser.copy()
    curr_ser = curr_ser[~curr_ser.isin(high_freq)]
    qcut = pd.qcut(curr_ser, remain_fractiles, retbins=True)
    qcut_bins = qcut[1]
    all_bins = list(qcut_bins)
    for val in high_freq:
        bisect.insort(all_bins, val)
    cut = pd.cut(ser, bins=all_bins)
    ser_fractiles = pd.Series(cut.cat.codes + 1, index=ser.index)
    return ser_fractiles
The problem is that pandas.qcut chooses the bins/quantiles so that each one has the same number of records, but all records with the same value must stay in the same bin/quantile (this behaviour is in accordance with the statistical definition of quantile).
The solutions are:
1 - Use pandas >= 0.20.0, which has a fix for this: a duplicates='raise'|'drop' option was added to control whether to raise on duplicated edges or to drop them, which results in fewer bins than specified, some of them larger (with more elements) than others (see the sketch after this list).
2 - Decrease the number of quantiles. Fewer quantiles means more elements per quantile.
3 - Rank your data with DataFrame.rank(method='first'). The ranking assigns a unique value to each element in the dataframe (the rank) while keeping the order of the elements (except for identical values, which are ranked in the order they appear in the array; see method='first').
Example:
pd.qcut(df, nbins) <-- this generates "ValueError: Bin edges must be unique"
Then use this instead:
pd.qcut(df.rank(method='first'), nbins)
4 - Specify a custom range of quantiles, e.g. [0, .50, .75, 1.], to get an unequal number of items per quantile
5 - Use pandas.cut, which chooses the bins to be evenly spaced according to the values themselves, while pandas.qcut chooses the bins so that you have the same number of records in each bin
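A minimal sketch illustrating options 1 and 3 on data that is half zeros (the toy series here is an assumption, not from the question):
import numpy as np
import pandas as pd

s = pd.Series(np.concatenate([np.zeros(300), np.random.randint(1, 100, 300)]))

# Option 1: let qcut drop the duplicated edges (fewer than 10 bins come back).
dropped = pd.qcut(s, 10, duplicates='drop', labels=False)

# Option 3: rank first so every edge is unique, then cut the ranks into 10 equal bins.
ranked = pd.qcut(s.rank(method='first'), 10, labels=False)

print(dropped.value_counts().sort_index())
print(ranked.value_counts().sort_index())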
Another way to do this is to introduce a minimal amount of noise, which will artificially create unique bin edges. Here's an example:
import numpy as np
import pandas as pd

a = pd.Series(list(range(100)) + [0]*20)

def jitter(a_series, noise_reduction=1000000):
    return (np.random.random(len(a_series))*a_series.std()/noise_reduction) - (a_series.std()/(2*noise_reduction))

# and now this works by adding a little noise
a_deciles = pd.qcut(a + jitter(a), 10, labels=False)
We can recreate the original error using something like this:
a_deciles = pd.qcut(a, 10, labels=False)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 173, in qcut
precision=precision, include_lowest=True)
File "/usr/local/lib/python2.7/site-packages/pandas/tools/tile.py", line 192, in _bins_to_cuts
raise ValueError('Bin edges must be unique: %s' % repr(bins))
ValueError: Bin edges must be unique: array([ 0. , 0. , 0. , 3.8 ,
11.73333333, 19.66666667, 27.6 , 35.53333333,
43.46666667, 51.4 , 59.33333333, 67.26666667,
75.2 , 83.13333333, 91.06666667, 99. ])
You ask about binning with non-unique bin edges, for which I have a fairly simple answer. In the case of your example, your intent and the behavior of qcut diverge in the pandas.tools.tile.qcut function, where the bins are defined:
bins = algos.quantile(x, quantiles)
Which, because your data is 50% 0s, causes bins to be returned with multiple bin edges at the value 0 for any value of quantiles greater than 2. I see two possible resolutions. In the first, the fractile space is divided evenly, binning all 0s, but not only 0s, in the first bin. In the second, the fractile space is divided evenly for values greater than 0, binning all 0s and only 0s in the first bin.
import numpy as np
import pandas as pd
import pandas.core.algorithms as algos
from pandas import Series
In both cases, I'll create some random sample data fitting your description of 50% zeroes and the remaining values between 1 and 100
zs = np.zeros(300)
rs = np.random.randint(1, 100, size=300)
arr=np.concatenate((zs, rs))
ser = Series(arr)
Solution 1: bin 1 contains both 0s and low values
bins = algos.quantile(np.unique(ser), np.linspace(0, 1, 11))
result = pd.tools.tile._bins_to_cuts(ser, bins, include_lowest=True)
The result is
In[61]: result.value_counts()
Out[61]:
[0, 9.3] 323
(27.9, 38.2] 37
(9.3, 18.6] 37
(88.7, 99] 35
(57.8, 68.1] 32
(68.1, 78.4] 31
(78.4, 88.7] 30
(38.2, 48.5] 27
(48.5, 57.8] 26
(18.6, 27.9] 22
dtype: int64
Solution 2: bin1 contains only 0s
mx = np.ma.masked_equal(arr, 0, copy=True)
bins = algos.quantile(arr[~mx.mask], np.linspace(0, 1, 11))
bins = np.insert(bins, 0, 0)
bins[1] = bins[1]-(bins[1]/2)
result = pd.tools.tile._bins_to_cuts(arr, bins, include_lowest=True)
The result is:
In[133]: result.value_counts()
Out[133]:
[0, 0.5] 300
(0.5, 11] 32
(11, 18.8] 28
(18.8, 29.7] 30
(29.7, 39] 35
(39, 50] 26
(50, 59] 31
(59, 71] 31
(71, 79.2] 27
(79.2, 90.2] 30
(90.2, 99] 30
dtype: int64
There is work that could be done to Solution 2 to make it a little prettier I think, but you can see that the masked array is a useful tool to approach your goals.
If you want to enforce equal-sized bins, even in the presence of duplicate values, you can use the following two-step process (a short end-to-end sketch follows the steps):
Rank your values, using method='first' to have Python assign a unique rank to all your records. If there is a duplicate value (i.e. a tie in the rank), this method ranks the records in the order it encounters them.
df['rank'] = df['value'].rank(method='first')
Use qcut on the rank to determine equal-sized quantiles. The example below creates deciles (10 bins).
df['decile'] = pd.qcut(df['rank'].values, 10).codes
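A short end-to-end sketch of those two steps on a hypothetical DataFrame (the df and the 'value' column are assumptions, not from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': np.concatenate([np.zeros(300), np.random.randint(1, 100, 300)])})

df['rank'] = df['value'].rank(method='first')          # step 1: unique rank per record
df['decile'] = pd.qcut(df['rank'].values, 10).codes    # step 2: equal-sized deciles, codes 0-9

print(df['decile'].value_counts().sort_index())        # each decile holds roughly the same number of rows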
I've had a lot of problems with qcut as well, so I used the Series.rank function combined with creating my own bins using those results. My code is on Github:
https://gist.github.com/ashishsingal1/e1828ffd1a449513b8f8
I had this problem as well, so I wrote a small function which only bins the non-zero values and then inserts the labels at the positions where the original was not 0.
import numpy as np
import pandas as pd

def qcut2(x, n=10):
    x = np.array(x)
    x_index_not0 = [i for i in range(len(x)) if x[i] > 0]
    x_cut_not0 = pd.qcut(x[x > 0], n-1, labels=False) + 1
    y = np.zeros(len(x))
    y[x_index_not0] = x_cut_not0
    return y
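Usage might look like this (a small sketch on assumed toy data, relying on the function and imports above; the zeros end up with label 0 and the non-zero values in labels 1 to 9):
vals = np.concatenate([np.zeros(300), np.random.randint(1, 100, 300)])
labels = qcut2(vals, n=10)
print(np.unique(labels, return_counts=True))  # label 0 holds the zeros, labels 1-9 split the rest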
