Performing math on a Python Pandas Group By DataFrame - python

I have a Pandas DataFrame with the following structure:
In [1]: df
Out[1]:
    location_code  month  amount
0               1      1      10
1               1      2      11
2               1      3      12
3               1      4      13
4               1      5      14
5               1      6      15
6               2      1      23
7               2      2      25
8               2      3      27
9               2      4      29
10              2      5      31
11              2      6      33
I also have a DataFrame with the following:
In [2]: output_df
Out[2]:
   location_code regression_coef
0              1            None
1              2            None
What I would like:
output_df = df.groupby('location_code')['amount'].apply(linear_regression_and_return_coefficient)
I would like to group by the location code and then perform a linear regression on the values of amount and store the coefficient. I have tried the following code:
import pandas as pd
import statsmodels.api as sm
import numpy as np

gb = df.groupby('location_code')['amount']
x = sm.add_constant(list(range(1, 7)))  # months 1..6 plus an intercept column
for location_code, amount in gb:
    trans = amount.tolist()
    model = sm.OLS(trans, x)
    results = model.fit()
    # normalized slope for this location
    output_df.loc[output_df['location_code'] == location_code, 'regression_coef'] = \
        results.params[1] / np.mean(trans)
This code works, but my data set is somewhat large (about 5 GB) and a bit more complex, and this is taking a REALLY LONG TIME. I am wondering if there is a vectorized operation that can do this more efficiently? I know that using loops on a Pandas DataFrame is bad.
SOLUTION
After some tinkering around, I wrote a function that can be used with the apply method on a groupby.
def get_lin_reg_coef(series):
    x = sm.add_constant(range(1, 7))
    result = sm.OLS(series, x).fit().params[1]
    return result / series.mean()

gb = df.groupby('location_code')['amount']
output_df['lin_reg_coef'] = gb.apply(get_lin_reg_coef)
Benchmarking this against the iterative solution I had before, with varying input sizes, gives:
DataFrame Rows    Iterative Solution (sec)    Vectorized Solution (sec)
       370,000                       81.42                        84.46
     1,850,000                      448.36                       365.66
     3,700,000                     1282.83                       715.89
     7,400,000                     5034.62                      1407.88
Clearly a lot faster as the dataset grows in size!

Without knowing more about the data, the number of records, etc., this code should run faster:
import pandas as pd
import statsmodels.api as sm
import numpy as np

gb = df.groupby('location_code')['amount']
x = sm.add_constant(range(1, 7))

def fit(stuff):
    # stuff is the 'amount' Series for one location_code
    return sm.OLS(stuff, x).fit().params[1] / stuff.mean()

output = gb.apply(fit)
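If statsmodels is still a bottleneck at this scale, the simple-regression slope also has a closed form that can be computed with groupby aggregations alone, with no Python-level fit per group. This is a hedged sketch rather than either answer's method, assuming the same df and output_df as above and that month is the regressor within each location:
# Closed-form OLS slope per group: slope = sum((x - x_mean)(y - y_mean)) / sum((x - x_mean)^2),
# then divided by the group mean of y, matching get_lin_reg_coef above.
g = df.groupby('location_code')
x_dev = df['month'] - g['month'].transform('mean')
y_dev = df['amount'] - g['amount'].transform('mean')
num = (x_dev * y_dev).groupby(df['location_code']).sum()
den = (x_dev ** 2).groupby(df['location_code']).sum()
coef = (num / den) / g['amount'].mean()          # Series indexed by location_code
output_df['lin_reg_coef'] = output_df['location_code'].map(coef)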

Related

How to efficiently update pandas rows when the computation involves looking up a value in another array

The objective is to update the df rows by considering each element in the df together with a reference value from an external np array.
Currently, I have to use a for loop to update each row, as below.
However, I wonder whether this can be tackled using any pandas built-in module.
import pandas as pd
import numpy as np

arr = np.array([1, 2, 5, 100, 3, 6, 8, 3, 99, 12, 5, 6, 8, 11, 14, 11, 100, 1, 3])
arr = arr.reshape((1, -1))
df = pd.DataFrame(zip([1, 7, 13], [4, 11, 17], ['a', 'g', 't']), columns=['start', 'end', 'o'])
for n in range(len(df)):
    a = df.loc[n]
    drange = list(range(a['start'], a['end'] + 1))
    darr = arr[0, drange]
    r = np.where(darr == np.amax(darr))[0].item()
    df.loc[n, 'pos_peak'] = drange[r]
Expected output:
   start  end  o  pos_peak
0      1    4  a       3.0
1      7   11  g       8.0
2     13   17  t      16.0
My approach would be to use the pandas apply() function, with which you can apply a function to each row of your dataframe. To find the index of the maximum element, I used the numpy function argmax() on the relevant part of arr. Here is the code:
import pandas as pd
import numpy as np
arr=np.array([1,2,5,100,3,6,8,3,99,12,5,6,8,11,14,11,100,1,3])
arr=arr.reshape((1,-1))
df=pd.DataFrame(zip([1,7,13],[4,11,17],['a','g','t']),columns=['start','end','o'])
df['pos_peak'] = df.apply(lambda x: x['start'] + np.argmax(arr[0][x['start']:x['end']+1]), axis=1)
df
Output:
   start  end  o  pos_peak
0      1    4  a         3
1      7   11  g         8
2     13   17  t        16
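If apply is still too slow on a much larger frame, a plain list comprehension over itertuples usually carries less per-row overhead than a row-wise apply. A hedged alternative, assuming the same arr and df as above:
# Same logic as the apply() version: absolute position of the slice maximum.
df['pos_peak'] = [
    row.start + int(np.argmax(arr[0, row.start:row.end + 1]))
    for row in df.itertuples(index=False)
]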

How to quickly sum across columns for every permutation of rows in Python

Suppose I have an n x k matrix X, and I want the sum across the columns for every permutation of the rows. So if my matrix is [[1,2],[3,4]], my desired output would be [1+2, 1+4, 3+2, 3+4]. Below is an MWE with my first attempt at a solution. I'm hoping I can get some help to reduce the computation time.
My actual problem has n=160 and k=4, and it takes quite a while to run (as of writing this, it's still running).
import pandas as pd
import numpy as np
import itertools
n = 4
k = 3
X = np.random.randint(0, 10, (n, k))
df = pd.DataFrame(X)
df
   0  1  2
0  2  9  2
1  7  6  4
2  3  7  0
3  5  0  0
ixi = df.index.tolist()
ixc = df.columns.tolist()
psum = np.array([df.lookup(i, ixc).sum()
                 for i in itertools.product(ixi, repeat=len(ixc))])
You can try functools.reduce:
from functools import reduce
reduce(np.add.outer, df.values.T).ravel()
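As a quick sanity check against the 2x2 example from the question: each reduce step outer-adds the next column onto the accumulated grid, so k columns of n values end up producing all n**k sums.
import numpy as np
from functools import reduce

X = np.array([[1, 2], [3, 4]])
print(reduce(np.add.outer, X.T).ravel())   # [3 5 5 7] == [1+2, 1+4, 3+2, 3+4]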

How to improve computational efficiency of correlation analysis using scipy on big dataframe

I have a couple of big data frames (12222 X 400000), on which I need to compute the Pearson correlation using scipy. The big data frame is generated by concatenating two other data frames. For example, here is a toy dataset that resembles my big data frame.
import pandas as pd,numpy as np
from scipy.stats import pearsonr
np.random.seed([3,1415])
cols = ['A_p','B_p','C_p','D_p','E_p','F_p','N_p','M_p','O_p','Q_p']
ind2 = ['sap1','luf','tur']
df1 = pd.DataFrame(np.random.randint(10, size=(3, 10)), columns=cols,index=ind2)
cols = ['G_l','I_l','J_l','K_l','L_l','M_l','R_l']
df2 = pd.DataFrame(np.random.randint(20, size=(3, 7)), columns=cols,index=ind2)
df = pd.concat([df1,df2],axis =1)
df
      A_p  B_p  C_p  D_p  E_p  F_p  N_p  M_p  O_p  Q_p  G_l  I_l  J_l  K_l  L_l  M_l  R_l
sap1    0    2    7    3    8    7    0    6    8    6   11    7   12    6    4    7   18
luf     0    2    0    4    9    7    3    2    4    3   14    6    6   14   18    8   10
tur     3    6    7    7    4    5    3    7    5    9    5   15    4   15   13    7    6
The function to perform the correlation test is the following:
def correlation_analysis(lncRNA_PC_T):
    """
    Function for correlation analysis
    """
    correlations = pd.DataFrame()
    for p in [column for column in df.columns if '_p' in column]:
        for l in [column for column in df.columns if '_l' in column]:
            correlations = correlations.append(
                pd.Series(pearsonr(df[p], df[l]), index=['PCC', 'p-value'], name=p + '_' + l))
    return correlations
The function performs well on small data frames. However, on a big data frame of the above-mentioned size, it takes forever to finish computing the correlations.
What I have tried is to split the data frames (df1 and df2) into small chunks and run the function over the resulting lists. For example, like the following:
n = 2
list_df1 = [df1.iloc[:, i:i+n] for i in range(0, df1.shape[1], n)]
list_df2 = [df2.iloc[:, i:i+n] for i in range(0, df2.shape[1], n)]
Then, I passed the above lists into my function as follows:
def correlation(list_df1, list_df2):
    correlations = pd.DataFrame()
    for dfs, dfs2 in zip(list_df1, list_df2):
        DF = pd.concat([dfs, dfs2])
        df = DF.set_index("Gene_id").T
        for p in [column for column in df.columns if '_p' in column]:
            for l in [column for column in df.columns if '_l' in column]:
                correlations = correlations.append(
                    pd.Series(pearsonr(df[p], df[l]), index=['PCC', 'p-value'], name=p + '_' + l))
    return correlations
The output should give pairwise correlations between every column containing _p and every column containing _l. For example, like this:
              PCC   p-value
A_p_G_l -0.944911  0.212296
A_p_I_l  0.994850  0.064639
A_p_J_l -0.693375  0.512246
A_p_K_l  0.585206  0.602027
A_p_L_l  0.162758  0.895922
...           ...       ...
Q_p_J_l -0.240192  0.845579
Q_p_K_l  0.101361  0.935361
Q_p_L_l -0.352381  0.770744
Q_p_M_l -0.866025  0.333333
Q_p_R_l -0.327327  0.787704
The current function is taking more than 12 hours to finish a dataframe of size 1222 X 6222. Hence, an efficient solution would be really helpful. Any suggestions are welcome.
Thank you!
You may have a look at dask; it is designed for the analysis of big data sets, supports pandas, and comes with multi-threading support. I tested it on an extended version of your example (21x18) and it showed a small reduction in computing time.
correlation_analysis w/o dask  0.767 sec
correlation_analysis w dask    0.707 sec
dask also provides its own routine for correlation calculation: https://docs.dask.org/en/latest/array-api.html?#dask.array.corrcoef
If the two variables you want to correlate (_p, _l) are already in two dataframes, do you need to concatenate them? Looping over both dataframes instead of one merged dataframe reduced the computing time as well:
def correlation_analysis2(lncRNA_PC_T):
    """
    Function for correlation analysis
    """
    correlations = pd.DataFrame()
    for p in df1:
        for l in df2:
            correlations = correlations.append(
                pd.Series(pearsonr(df1[p], df2[l]), index=['PCC', 'p-value'], name=p + '_' + l))
    return correlations
correlation_analysis with 2 df 0.723 sec
You may look at https://numpy.org/doc/stable/reference/arrays.nditer.html for ways to optimize your loops further.
A further possibility might be to use a JIT compiler (e.g. pypy or numba), but the effect might only become visible on larger test samples.
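For the full-size problem, the double loop over pearsonr calls is the real bottleneck: all pairwise Pearson coefficients (and the matching two-sided p-values) can be computed with a few matrix operations instead. This is a hedged sketch rather than part of the answer above, assuming the _p and _l columns are numeric and contain no missing values:
import numpy as np
import pandas as pd
from scipy import stats

def pairwise_pearson(A, B):
    # A: (n, p) array of _p columns, B: (n, q) array of _l columns
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    r = (A.T @ B) / np.sqrt((A ** 2).sum(axis=0)[:, None] * (B ** 2).sum(axis=0)[None, :])
    n = A.shape[0]
    # two-sided p-value via the t distribution with n-2 degrees of freedom,
    # which matches what pearsonr reports (up to floating point)
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    p = 2 * stats.t.sf(np.abs(t), df=n - 2)
    return r, p

P = df.filter(like='_p').to_numpy(dtype=float)
L = df.filter(like='_l').to_numpy(dtype=float)
r, p = pairwise_pearson(P, L)
out = pd.DataFrame({'PCC': r.ravel(), 'p-value': p.ravel()},
                   index=[a + '_' + b
                          for a in df.filter(like='_p').columns
                          for b in df.filter(like='_l').columns])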

Using data from Python's pandas dataframes to sample from normal distributions

I'm trying to sample from a normal distribution using means and standard deviations that are stored in pandas DataFrames.
For example:
means= numpy.arange(10)
means=means.reshape(5,2)
produces:
   0  1
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
and:
sts=numpy.arange(10,20)
sts=sts.reshape(5,2)
produces:
    0   1
0  10  11
1  12  13
2  14  15
3  16  17
4  18  19
How would I produce another pandas dataframe of the same shape, but with values sampled from the normal distribution using the corresponding means and standard deviations?
i.e. position (0, 0) of this new dataframe would sample from a normal distribution with mean=0 and standard deviation=10, and so on.
My function so far:
def make_distributions(self):
    num_data_points, num_species = self.means.shape
    samples = []
    for i, j in zip(self.means, self.stds):
        for k, l in zip(self.means[i], self.stds[j]):
            samples.append(numpy.random.normal(k, l, self.n))
will sample from the distributions for me but I'm having difficulty putting the data back into the same shaped dataframe as the mean and standard deviation dfs. Does anybody have any suggestions as to how to do this?
Thanks in advance.
You can use numpy.random.normal to sample from a random normal distribution.
IIUC, then this might be easiest, taking advantage of broadcasting:
import numpy as np
np.random.seed(1) # only for demonstration
np.random.normal(means,sts)
array([[ 16.24345364,  -5.72932055],
       [ -4.33806103, -10.94859209],
       [ 16.11570681, -29.52308045],
       [ 33.91698823,  -5.94051732],
       [ 13.74270373,   4.26196287]])
Check that it works:
np.random.seed(1)
print(np.random.normal(0, 10))
print(np.random.normal(1, 11))
16.2434536366
-5.72932055015
If you need a pandas DataFrame:
import pandas as pd
pd.DataFrame(np.random.normal(means,sts))
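If means and sts are already pandas DataFrames (as the question describes) rather than plain arrays, the row and column labels can be carried over as well; a small hedged sketch:
# assumes means and sts are DataFrames with identical shape, index and columns
samples = pd.DataFrame(np.random.normal(means, sts),
                       index=means.index, columns=means.columns)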
I will use a dictionary to construct this dataframe. Suppose the indices and columns are the same for means and stds:
means = numpy.arange(10)
means = pd.DataFrame(means.reshape(5, 2))
stds = numpy.arange(10, 20)
stds = pd.DataFrame(stds.reshape(5, 2))

samples = {}
for i in means.columns:
    col = {}
    for j in means.index:
        col[j] = numpy.random.normal(means.loc[j, i], stds.loc[j, i], 2)
    samples[i] = col
print(pd.DataFrame(samples))
# 0 1
#0 [0.0760974520154, 3.29439282825] [11.1292510583, 0.318246201796]
#1 [-25.4518020981, 19.2176263823] [17.0826945017, 9.36179435872]
#2 [14.5402484325, 8.33808246538] [6.96459947914, 26.5552235093]
#3 [0.775891790613, -2.09168601369] [2.38723023677, 15.8099942902]
#4 [-0.828518484847, 45.4592922652] [26.8088977308, 16.0818556353]
Or reset the dtype of the DataFrame and reassign the values:
import itertools

samples = means * 0
samples = samples.astype(object)
for i, j in itertools.product(means.index, means.columns):
    samples.at[i, j] = numpy.random.normal(means.loc[i, j], stds.loc[i, j], 2)

How do I convert a row from a pandas DataFrame from a Series back to a DataFrame?

I am iterating through the rows of a pandas DataFrame, expanding each one out into N rows with additional info on each one (for simplicity I've made it a random number here):
from pandas import DataFrame
import pandas as pd
from numpy import random, arange

N = 3
x = DataFrame.from_dict({'farm': ['A', 'B', 'A', 'B'],
                         'fruit': ['apple', 'apple', 'pear', 'pear']})
out = DataFrame()
for i, row in x.iterrows():
    rows = pd.concat([row] * N).reset_index(drop=True)  # requires row to be a DataFrame
    out = out.append(rows.join(DataFrame({'iter': arange(N), 'value': random.uniform(size=N)})))
In this loop, row is a Series object, so the call to pd.concat doesn't work as intended. How do I convert it to a DataFrame? (e.g. the difference between x.ix[0:0] and x.ix[0])
Thanks!
Given what you commented, I would try
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

results = x.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
This should give you a separate results dataframe. I have assumed that every farm-fruit combination is unique... there might be other ways if we knew more about your data.
Update
Running code example
def giveMeSomeRows(group):
    return random.uniform(low=group.low, high=group.high, size=N)

N = 3
df = pd.DataFrame(arange(0, 8).reshape(4, 2), columns=['low', 'high'])
df['farm'] = 'a'
df['fruit'] = arange(0, 4)
results = df.groupby(['farm', 'fruit']).apply(giveMeSomeRows)
df
   low  high farm  fruit
0    0     1    a      0
1    2     3    a      1
2    4     5    a      2
3    6     7    a      3
results
farm  fruit
a     0        [0.176124290969, 0.459726835079, 0.999564934689]
      1        [2.42920143009, 2.37484506501, 2.41474002256]
      2        [4.78918572452, 4.25916442343, 4.77440617104]
      3        [6.53831891152, 6.23242754976, 6.75141668088]
If instead you want a dataframe, you can update the function to
def giveMeSomeRows(group):
    return pd.DataFrame(random.uniform(low=group.low, high=group.high, size=N))
results
                     0
farm fruit
a    0     0  0.281088
           1  0.020348
           2  0.986269
     1     0  2.642676
           1  2.194996
           2  2.650600
     2     0  4.545718
           1  4.486054
           2  4.027336
     3     0  6.550892
           1  6.363941
           2  6.702316
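If the goal from the original question is a flat frame with iter and value columns, the grouped result can be reshaped with reset_index. A hedged follow-up (the exact default column names depend on the pandas version):
flat = results.reset_index().rename(columns={'level_2': 'iter', 0: 'value'})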
