Using data from pandas DataFrames to sample from normal distributions - python

I'm trying to sample from a normal distribution using means and standard deviations that are stored in pandas DataFrames.
For example:
means= numpy.arange(10)
means=means.reshape(5,2)
produces:
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and:
sts=numpy.arange(10,20)
sts=sts.reshape(5,2)
produces:
0 1
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
How would I produce another pandas DataFrame with the same shape, but with values sampled from normal distributions using the corresponding means and standard deviations?
i.e. position 0,0 of this new dataframe would sample from a normal distribution with mean=0 and standard deviation=10, and so on.
My function so far:
def make_distributions(self):
    num_data_points, num_species = self.means.shape
    samples = []
    for i, j in zip(self.means, self.stds):
        for k, l in zip(self.means[i], self.stds[j]):
            samples.append(numpy.random.normal(k, l, self.n))
This samples from the distributions for me, but I'm having difficulty putting the data back into a DataFrame with the same shape as the mean and standard deviation DataFrames. Does anybody have any suggestions as to how to do this?
Thanks in advance.

You can use numpy.random.normal to draw samples from a normal distribution.
IIUC, then this might be easiest, taking advantage of broadcasting:
import numpy as np
np.random.seed(1) # only for demonstration
np.random.normal(means,sts)
array([[ 16.24345364, -5.72932055],
[ -4.33806103, -10.94859209],
[ 16.11570681, -29.52308045],
[ 33.91698823, -5.94051732],
[ 13.74270373, 4.26196287]])
Check that it works:
np.random.seed(1)
print(np.random.normal(0,10))
print(np.random.normal(1,11))
16.2434536366
-5.72932055015
If you need a pandas DataFrame:
import pandas as pd
pd.DataFrame(np.random.normal(means,sts))
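To keep the row and column labels of the inputs, the broadcasting idea above can be wrapped in a small function. This is only a sketch, assuming means and stds are DataFrames with identical index and columns. The helper name sample_normal_frame and the use of NumPy's Generator API (np.random.default_rng) are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

means = pd.DataFrame(np.arange(10).reshape(5, 2))
stds = pd.DataFrame(np.arange(10, 20).reshape(5, 2))

def sample_normal_frame(means, stds, seed=None):
    """Draw one sample per cell, keeping the labels of `means` (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # loc and scale broadcast element-wise: cell (i, j) uses means[i, j] and stds[i, j]
    values = rng.normal(loc=means.to_numpy(), scale=stds.to_numpy())
    return pd.DataFrame(values, index=means.index, columns=means.columns)

samples = sample_normal_frame(means, stds, seed=1)

# For n draws per cell, request a leading axis of length n: shape (n, 5, 2)
n = 3
rng = np.random.default_rng(1)
many = rng.normal(means.to_numpy(), stds.to_numpy(), size=(n,) + means.shape)
```

Here many[k] holds the k-th draw for every cell, so many[:, i, j] gives all n samples for position (i, j).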

I would use a dictionary to construct this DataFrame. Suppose the index and columns are the same for means and stds:
import numpy
import pandas as pd
means = numpy.arange(10)
means = pd.DataFrame(means.reshape(5, 2))
stds = numpy.arange(10, 20)
stds = pd.DataFrame(stds.reshape(5, 2))
samples = {}
for i in means.columns:
    col = {}
    for j in means.index:
        # draw 2 samples for the cell in row j, column i
        col[j] = numpy.random.normal(means.loc[j, i], stds.loc[j, i], 2)
    samples[i] = col
print(pd.DataFrame(samples))
# 0 1
#0 [0.0760974520154, 3.29439282825] [11.1292510583, 0.318246201796]
#1 [-25.4518020981, 19.2176263823] [17.0826945017, 9.36179435872]
#2 [14.5402484325, 8.33808246538] [6.96459947914, 26.5552235093]
#3 [0.775891790613, -2.09168601369] [2.38723023677, 15.8099942902]
#4 [-0.828518484847, 45.4592922652] [26.8088977308, 16.0818556353]
Or reset the dtype of a DataFrame and reassign values:
import itertools
samples = means * 0
samples = samples.astype(object)  # object dtype lets each cell hold an array
for i, j in itertools.product(means.index, means.columns):
    samples.at[i, j] = numpy.random.normal(means.loc[i, j], stds.loc[i, j], 2)

Related

How to efficiently update pandas row if computation involving lookup another array value

The objective is to update the df rows by combining each row's values with reference values from an external NumPy array.
Currently, I have to use a for loop to update each row, as below.
However, I wonder whether this can be tackled using a pandas built-in.
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
for n in range(len(df)):
    a = df.loc[n]
    drange = list(range(a['start'], a['end'] + 1))
    darr = arr[0, drange]
    r = np.where(darr == np.amax(darr))[0].item()
    df.loc[n, 'pos_peak'] = drange[r]
Expected output
start end o pos_peak
0 1 4 a 3.0
1 7 11 g 8.0
2 13 17 t 16.0
My approach would be to use the pandas apply() function, which applies a function to each row of your DataFrame. To find the index of the maximum element, I used the NumPy function argmax() on the relevant slice of arr and added the start offset. Here is the code:
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
df['pos_peak'] = df.apply(lambda x: x['start'] + np.argmax(arr[0][x['start']:x['end']+1]), axis=1)
df
Output:
start end o pos_peak
0 1 4 a 3
1 7 11 g 8
2 13 17 t 16
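For comparison, the same result can be computed without apply() by iterating over the start/end pairs directly. This is a sketch, assuming arr is kept one-dimensional and that end is inclusive, as in the question:

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 5, 100, 3, 6, 8, 3, 99, 12, 5, 6, 8, 11, 14, 11, 100, 1, 3])
df = pd.DataFrame({'start': [1, 7, 13], 'end': [4, 11, 17], 'o': ['a', 'g', 't']})

# argmax is relative to the slice, so add the slice start to get the position in arr
df['pos_peak'] = [s + int(np.argmax(arr[s:e + 1]))
                  for s, e in zip(df['start'], df['end'])]
print(df)
```

Building the column in one assignment also keeps pos_peak as integers, whereas filling it row by row with df.loc produced floats in the question's output.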

How to quickly sum across columns for every permutation of rows in Python

Suppose I have an n x k matrix X, and I want the sum across the columns for every permutation of the rows, that is, picking one row per column. So if my matrix is [[1,2],[3,4]], my desired output would be [1+2, 1+4, 3+2, 3+4]. Below is a minimal working example with my first attempt at a solution. I'm hoping I can get some help to reduce the computation time.
My actual problem has n=160 and k=4, and it takes quite a while to run (as of writing this, it's still running).
import pandas as pd
import numpy as np
import itertools
n = 4
k = 3
X = np.random.randint(0, 10, (n, k))
df = pd.DataFrame(X)
df
0 1 2
0 2 9 2
1 7 6 4
2 3 7 0
3 5 0 0
ixi = df.index.tolist()
ixc = df.columns.tolist()
psum = np.array([df.lookup(i, ixc).sum() for i in
itertools.product(ixi, repeat=len(ixc))])
You can try functools.reduce:
from functools import reduce
reduce(np.add.outer, df.values.T).ravel()
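As a quick sanity check, the reduce approach can be compared against a brute-force loop on the 2 x 2 example from the question. This is a sketch; the variable names fast and slow are only illustrative.

```python
from functools import reduce
from itertools import product
import numpy as np

X = np.array([[1, 2], [3, 4]])

# np.add.outer adds every element of one column to every element of the next
fast = reduce(np.add.outer, X.T).ravel()

# Brute force: pick one row index per column and sum the selected entries
slow = np.array([sum(X[r, c] for c, r in enumerate(rows))
                 for rows in product(range(X.shape[0]), repeat=X.shape[1])])

print(fast)  # [3 5 5 7]
```

Note that the result has n**k entries, so for n=160 and k=4 it holds about 655 million values; memory may become the limiting factor rather than time.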

How to filter a pandas DataFrame and keep specific elements?

I have a pandas Data Frame which is a 50x50 correlation matrix. In the following picture you can see what I have as an example
What I would like to do, if it's possible of course, is to make a new data frame which has only the elements of the old one that are higher than 0.5 or lower than -0.5, indicating a strong linear relationship, but not 1, to avoid the variance parts.
I don't think what I'm asking is exactly possible, because variable x0 won't have the same strong relationships that x1 has, and so on, so the new data frame won't look very good.
But is there any way to scan fast through this data frame, find the values I mentioned and maybe at least insert them into an array?
Any insight would be helpful. Thanks
You can't really keep the result as a correlation matrix if you want to drop correlation pairs that are too weak. One thing you could do is stack the frame and keep only the relevant correlation pairs.
Given (randomly generated as an example):
0 1 2 3 4
0 0.038142 -0.881054 -0.718265 -0.037968 -0.587288
1 0.587694 -0.135326 -0.529463 -0.508112 -0.160751
2 -0.528640 -0.434885 -0.679416 -0.455866 0.077580
3 0.158409 0.827085 0.018871 -0.478428 0.129545
4 0.825489 -0.000416 0.682744 0.794137 0.694887
you could do:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.uniform(-1, 1, (5, 5)))
df = df.stack()
df = df[((df > 0.5) | (df < -0.5)) & (df != 1)]
0  1   -0.881054
   2   -0.718265
   4   -0.587288
1  0    0.587694
   2   -0.529463
   3   -0.508112
2  0   -0.528640
   2   -0.679416
3  1    0.827085
4  0    0.825489
   2    0.682744
   3    0.794137
   4    0.694887
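A real correlation matrix is symmetric, so each pair appears twice after stacking. One possible refinement, sketched here with a small made-up data set (the columns a, b, c are assumptions for illustration), is to mask everything except the upper triangle before stacking:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10.5],   # almost a multiple of a
    'c': [1, -1, 1, -1, 1],    # essentially unrelated to a and b
})
corr = data.corr()

# True strictly above the diagonal: skips the 1.0 self-correlations and mirrored pairs
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Masked cells are NaN and compare as False, so only strong pairs remain
strong = pairs[pairs.abs() > 0.5]
print(strong)
```

Because the diagonal is excluded by the mask, no separate check against 1 is needed. Here only the (a, b) pair survives the filter.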

Finding the indexes of the N maximum values across an axis in Pandas

I know that there is a method .argmax() that returns the indexes of the maximum values across an axis.
But what if we want to get the indexes of the 10 highest values across an axis?
How could this be accomplished?
E.g.:
data = pd.DataFrame(np.random.random_sample((50, 40)))
You can use argsort:
s = pd.Series(np.random.permutation(30))
# argsort gives the positions that would sort s in ascending order,
# so the last 10 positions point at the 10 largest values (largest first after reversing)
top_10 = np.argsort(s.values)[-10:][::-1]
print(top_10)
IIUC, say, if you want to get the index of the top 10 largest numbers of column col:
data[col].nlargest(10).index
Give this a try. This takes the 10 largest values in each row (the values themselves, not their indexes) and puts them into a DataFrame.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.random_sample((50, 40)))
df2 = pd.DataFrame(np.sort(df.values)[:,-10:])
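If the positions of the largest values are needed rather than the values, np.argsort along an axis is one option. This is a sketch, assuming the top 10 per row are wanted; the names top_pos and top_cols are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((50, 40)))

n = 10
# Sort the negated values so the largest come first, then keep the first n positions
top_pos = np.argsort(-data.to_numpy(), axis=1)[:, :n]

# Map positions back to column labels (here they coincide with 0..39)
top_cols = pd.DataFrame(data.columns.to_numpy()[top_pos], index=data.index)
print(top_cols.head())
```

Each row of top_cols lists the column labels of that row's 10 largest entries, from largest to smallest. Using axis=0 instead would rank down each column.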

Performing math on a Python Pandas Group By DataFrame

I have a Pandas DataFrame with the following structure:
In [1]: df
Out[1]:
location_code month amount
0 1 1 10
1 1 2 11
2 1 3 12
3 1 4 13
4 1 5 14
5 1 6 15
6 2 1 23
7 2 2 25
8 2 3 27
9 2 4 29
10 2 5 31
11 2 6 33
I also have a DataFrame with the following:
In [2]: output_df
Out[2]:
location_code regression_coef
0 1 None
1 2 None
What I would like:
output_df = df.groupby('location_code')['amount'].apply(linear_regression_and_return_coefficient)
I would like to group by the location code and then perform a linear regression on the values of amount and store the coefficient. I have tried the following code:
import pandas as pd
import statsmodels.api as sm
import numpy as np
gb = df.groupby('location_code')['amount']
x = list(range(1, 7))
x = sm.add_constant(x)
for location_code, amount in gb:
    trans = amount.tolist()
    model = sm.OLS(trans, x)
    results = model.fit()
    # match on the location_code column, since output_df's index is 0, 1, ...
    output_df.loc[output_df['location_code'] == location_code, 'regression_coef'] = results.params[1] / np.mean(trans)
This code works, but my data set is somewhat large (about 5 gb) and a bit more complex, and this is taking a REALLY LONG TIME. I am wondering if there is a vectorized operation that can do this more efficiently? I know that using loops on a Pandas DataFrame is bad.
SOLUTION
After some tinkering around, I wrote a function that can be used with the apply method on a groupby.
def get_lin_reg_coef(series):
    x = sm.add_constant(range(1, 7))
    result = sm.OLS(series, x).fit().params[1]
    return result / series.mean()

gb = df.groupby('location_code')['amount']
output_df['lin_reg_coef'] = gb.apply(get_lin_reg_coef)
Benchmarking this versus the iterative solution I had before, with varying input sizes gets:
DataFrame Rows    Iterative Solution (sec)    Vectorized Solution (sec)
370,000           81.42                       84.46
1,850,000         448.36                      365.66
3,700,000         1282.83                     715.89
7,400,000         5034.62                     1407.88
It is slightly slower at the smallest size, but clearly a lot faster as the dataset grows!
Without knowing more about the data, number of records, etc., this code should run faster:
import pandas as pd
import statsmodels.api as sm
import numpy as np
gb = df.groupby('location_code')  # keep the full frame so fit() can select "amount"
x = sm.add_constant(range(1,7))
def fit(stuff):
    return sm.OLS(stuff["amount"], x).fit().params[1] / stuff["amount"].mean()
output = gb.apply(fit)
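Since every group is regressed on the same x values (months 1 to 6), the slope can also be computed for all groups at once with the closed-form formula for simple linear regression, without calling statsmodels per group. This is a sketch under the assumption that each location has exactly one row per month; the names wide, xc, and coef are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'location_code': [1] * 6 + [2] * 6,
    'month': list(range(1, 7)) * 2,
    'amount': [10, 11, 12, 13, 14, 15, 23, 25, 27, 29, 31, 33],
})

# One row per location, one column per month (assumes no missing or duplicate months)
wide = df.pivot(index='location_code', columns='month', values='amount')

x = wide.columns.to_numpy(dtype=float)
xc = x - x.mean()
# OLS slope with an intercept and a single regressor: sum(xc * y) / sum(xc ** 2)
slope = wide.to_numpy() @ xc / (xc @ xc)

# Normalise by the group mean, as in the question
coef = pd.Series(slope / wide.mean(axis=1).to_numpy(),
                 index=wide.index, name='regression_coef')
print(coef)
```

For the sample data the slopes are 1 and 2, giving coefficients of 0.08 and about 0.0714. If groups can have missing months, this shortcut no longer applies as written and a per-group fit is the safer route.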
