Create a function to perform a t-test in Python

I'm using the following code to compute a t-test in Python:
import researchpy as rp
import scipy.stats as stats
summary, results = rp.ttest(group1=dfEnt['HA'][df['Q8_5'] == 0], group1_name="Nascent",
                            group2=dfEnt['HA'][df['Q8_5'] == 2], group2_name="Established")
How can I create a function that takes as arguments the dataframe and the column on which I want to compute the t-test? For example, I would like to run the t-test on dfEnt['IA'], dfSel['IA'], ...
Thanks for your help.

What version of researchpy are you using? In the newest version this can be done with difference_test(). The code below conducts a t-test on the values in StressReactivity, using the column Exercise, which contains the group categories, as the grouping variable.
import pandas as pd
import researchpy as rp
import numpy as np
np.random.seed(12345678)
df = pd.DataFrame(np.random.randint(10, size=(100, 2)),
                  columns=['No', 'Yes'])
df["id"] = range(1, df.shape[0] + 1)
# Reshape to long format: one row per (id, Exercise) pair
df2 = pd.melt(df, id_vars="id", value_vars=["No", "Yes"],
              var_name="Exercise", value_name="StressReactivity")
rp.difference_test("StressReactivity ~ C(Exercise)",
                   data=df2,
                   equal_variances=True,
                   independent_samples=True)
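To answer the question as asked, the test can also be wrapped in a function that takes the dataframe and the column name. This is only a minimal sketch: it assumes scipy.stats.ttest_ind in place of researchpy, and a grouping Series that shares the dataframe's index. The helper name ttest_by_group and the demo data are hypothetical, not part of either library.

```python
import numpy as np
import pandas as pd
from scipy import stats


def ttest_by_group(data, value_col, group, level_a, level_b, equal_var=True):
    """Independent-samples t-test on data[value_col] for two levels of `group`.

    `group` is assumed to be a Series aligned with data's index
    (hypothetical helper, not a researchpy or scipy function).
    """
    a = data.loc[group == level_a, value_col].dropna()
    b = data.loc[group == level_b, value_col].dropna()
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=equal_var)
    return {"n_a": len(a), "n_b": len(b), "t": float(t_stat), "p": float(p_value)}


# Made-up demo data standing in for dfEnt / dfSel and the Q8_5 column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "HA": rng.normal(size=60),
    "IA": rng.normal(size=60),
    "Q8_5": np.tile([0, 2], 30),
})

result_ha = ttest_by_group(demo, "HA", demo["Q8_5"], 0, 2)
result_ia = ttest_by_group(demo, "IA", demo["Q8_5"], 0, 2)
print(result_ha)
print(result_ia)
```

With the original data, the call would presumably look like ttest_by_group(dfEnt, 'IA', df['Q8_5'], 0, 2), provided dfEnt and df share the same index.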

Related

Difference of two Dataframes is not exact

I am trying to draw random rows from dataframe DF1 and store them in a new variable DF2. I then want to take the difference, i.e. the remaining rows of the original dataframe DF1 that are not in DF2. I need to do this without using the sklearn library.
I tried two ways to get random rows:
Method 1:
DF2 = DF1.sample(n = 1000, random_state = 10)
Method 2:
chosen_idx = np.random.choice(2000, replace = False, size = 1000)
DF2 = DF1.iloc[chosen_idx]
This is how I take their difference to get a dataframe with the remaining rows, say DF3:
DF3 = pd.concat([DF1, DF2]).drop_duplicates(keep=False)
The problem is that len(DF1) - len(DF2) - len(DF3) should be 0, but it is not. I am not sure where I went wrong. Here is my actual code, with different variable names:
def train_validation_test(set_dataframe):
    if isinstance(set_dataframe, pd.DataFrame):
        df_length = len(set_dataframe.index)
        seventy = math.floor(df_length*0.7)
        seventy = seventy if seventy%2==0 else seventy+1
        remaining = int((df_length - seventy)/2)
        # one = set_dataframe.sample(n = seventy, random_state = 10)
        chosen_idx = np.random.choice(df_length, replace = False, size = seventy)
        one = set_dataframe.iloc[chosen_idx]
        return one
    else:
        return print('Argument passed is not dataframe. Please pass dataframe as argument.')
abc = train_validation_test(task01_df)
xyz = pd.concat([task01_df, abc]).drop_duplicates(keep=False)
print(len(task01_df) - len(abc) - len(xyz))
The result is 7, but it depends on random_state. It varies from run to run and is never 0.
You can use train_test_split from sklearn:
# Python env: pip install scikit-learn
# Conda env: conda install scikit-learn
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
DF1 = pd.DataFrame(np.random.randint(1, 100, (2000, 3)), columns=list('ABC'))
DF2, DF3 = train_test_split(DF1, train_size=1000)
Output:
>>> DF1.shape
(2000, 3)
>>> DF2.shape
(1000, 3)
>>> DF3.shape
(1000, 3)
>>> DF2.index.intersection(DF3.index)
Int64Index([], dtype='int64') # no overlaps
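If sklearn is not an option, the complement can also be taken by index label rather than by row values. pd.concat(...).drop_duplicates(keep=False) compares row values, so rows that already occur more than once in DF1 are removed as well, which would explain the non-zero remainder. A minimal sketch, assuming DF1 has a unique index (the demo data below is made up and deliberately full of duplicate rows):

```python
import numpy as np
import pandas as pd

# Made-up data with many repeated rows (only 25 distinct value pairs)
rng = np.random.default_rng(10)
DF1 = pd.DataFrame(rng.integers(0, 5, size=(2000, 2)), columns=["A", "B"])

DF2 = DF1.sample(n=1000, random_state=10)

# Index-based complement: drop the sampled labels from DF1
DF3 = DF1.drop(DF2.index)
leftover = len(DF1) - len(DF2) - len(DF3)

# Value-based complement, as in the question, for comparison
value_based = pd.concat([DF1, DF2]).drop_duplicates(keep=False)

print(leftover, len(DF3), len(value_based))
```

Because DF1.drop works on labels, repeated row values do not affect the result.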

VIF function returns all 'inf' values

I'm dealing with a multicollinearity problem using the variance_inflation_factor() function.
But after running the function, I found that it returned infinite values for all the scores.
Here's my code:
from rdkit import Chem
import pandas as pd
import numpy as np
from numpy import array
data = pd.read_csv('Descriptors_raw.csv')
class_ = pd.read_csv('class_file.csv')
class_tot = pd.read_csv('class_total.csv')
mols_A1 = Chem.SDMolSupplier('finaldata_A1.sdf')
mols_A2 = Chem.SDMolSupplier('finaldata_A2.sdf')
mols_B = Chem.SDMolSupplier('finaldata_B.sdf')
mols_C = Chem.SDMolSupplier('finaldata_C.sdf')
mols = []
mols.extend(mols_A1)
mols.extend(mols_A2)
mols.extend(mols_B)
mols.extend(mols_C)
mols_df = pd.DataFrame(mols)
mols = pd.concat([mols_df, class_tot, data], axis=1)
mols = mols.dropna(axis=0, thresh=1400)
mols.groupby('target_name_quarter').mean()
fill_mean_func = lambda g: g.fillna(g.mean())
mols = mols.groupby('target_name_quarter').apply(fill_mean_func)
molfiles = mols.loc[:, :'target_quarter']
descriptors = mols.loc[:, 'nAcid':'Zagreb']
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
fitted = scaler.fit(descriptors)
descriptors_scaled = scaler.transform(descriptors)
descriptors_scaled = pd.DataFrame(descriptors_scaled, columns=descriptors.columns, index = list(descriptors.index.values))
from sklearn.feature_selection import VarianceThreshold
def variance_threshold_selector(data, threshold):
    selector = VarianceThreshold(threshold)
    selector.fit(data)
    return data[data.columns[selector.get_support(indices=True)]]
descriptors_del_lowvar = variance_threshold_selector(descriptors_scaled, 0.01)
mols = pd.concat([molfiles, descriptors_del_lowvar.loc[:, 'nAcid':'Zagreb']], axis=1)
mols.loc[:, 'nAcid':'Zagreb'].corr()
import seaborn as sns
from statsmodels.stats.outliers_influence import variance_inflation_factor
%matplotlib inline
sns.pairplot(mols[['apol', 'nAtom', 'nHeavyAtom', 'nH', 'nAcid']])
vif = pd.DataFrame()
des = mols.loc[:, 'nAcid':'Zagreb']
vif["VIF factor"] = [variance_inflation_factor(des.values, i) for i in range(des.shape[1])]
vif["features"] = des.columns
print(vif)
I used MinMaxScaler() when eliminating low-variance features so that all the variables are in the same range.
print(vif) returns a dataframe with all infinite values and I cannot figure out why.
Thank you in advance :)
An infinite VIF indicates perfect multicollinearity: the variable is an exact linear combination of other independent variables (in the simplest case, two variables are perfectly correlated). The VIF of a variable is 1/(1 - R2), where R2 comes from regressing that variable on the others. With a perfect fit R2 = 1, so 1/(1 - R2) is infinite. To solve this problem, drop one of the variables that is causing the perfect multicollinearity from the dataset.
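One way to find candidates to drop is to check which columns add no new information in the linear-algebra sense. The sketch below is an assumption-laden illustration, not part of statsmodels: the helper independent_columns is hypothetical, it uses np.linalg.matrix_rank with an absolute tolerance, and it greedily keeps a column only if it raises the rank of the centered data. The demo columns are made up, with c constructed as an exact sum of a and b.

```python
import numpy as np
import pandas as pd


def independent_columns(frame, tol=1e-8):
    """Greedily keep columns that raise the rank of the centered matrix.

    Hypothetical helper: a column that is an exact linear combination of
    already-kept columns (or is constant) does not raise the rank and is skipped.
    """
    centered = frame - frame.mean()
    kept = []
    for col in frame.columns:
        candidate = centered[kept + [col]].to_numpy()
        if np.linalg.matrix_rank(candidate, tol=tol) == len(kept) + 1:
            kept.append(col)
    return kept


# Made-up descriptors: "c" is exactly a + b, so it is perfectly collinear
rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
demo["c"] = demo["a"] + demo["b"]
demo["d"] = rng.normal(size=50)

kept = independent_columns(demo)
dropped = [c for c in demo.columns if c not in kept]
print("kept:", kept)
print("dropped:", dropped)
```

Which column of a dependent set gets dropped depends on column order, so the order may need to reflect which descriptors matter most. The tolerance is also a judgment call and may need adjusting for data on a different scale.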

Plot distribution of differences between two pandas dataframe columns

I have a pandas dataframe with columns A and B.
I want to plot the distribution of the percentage difference between columns A and B.
              A             B
1  1.051990e+10  1.051990e+04
2  1.051990e+10  1.051990e+04
5  4.841800e+10  1.200000e+10
8  2.327700e+10  2.716000e+10
9  1.204900e+10  2.100000e+08
The distribution graph should show how many records have a 10% difference, how many have a 20% difference, and so on.
I tried the following:
def percCal(x,y):
    return (x-y)*100/x
df['perc'] = df.apply(lambda x: percCal(df['A'], df['B']), axis=1)
This is not working. I'm a newbie, so please help.
You don't need the lambda operation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df1 = pd.DataFrame(np.random.randint(1, 10, (20, 2)), columns=['A', 'B'])
def percCal(x,y):
    return (x-y)*100/x
Alternatively, just manipulate the columns directly:
df1['diff'] = (df1['A'] - df1['B']) * 100 / df1['A']
Apply the function and plot:
df1['diff'] = percCal(df1['A'], df1['B'])
df1['diff'].plot(kind='density')
df['perc'] = (df['A'] - df['B']) *100/df['A']
def percCal(x,y):
    return (x-y)*100/x
df['perc'] = df.apply(lambda x: percCal(x['A'], x['B']), axis=1)
In the lambda, replace df with x. With x you pass percCal the values from the current row of the dataframe. With df you pass the whole columns, so the function returns a Series for every row instead of a single value. Also check your data: if x (column A) can be 0, the division is a problem.
I think this is what you are looking for:
# Dummy df
data = [
    [1.051990e+10, 1.051990e+04],
    [1.051990e+10, 1.051990e+04],
    [4.841800e+10, 1.200000e+10],
    [2.327700e+10, 2.716000e+10],
    [1.204900e+10, 2.100000e+08],
]
cols = ['A', 'B']
df2 = pd.DataFrame(data, columns=cols)
# Solution
import seaborn as sns
# Note: this is a fraction; multiply by 100 for a percentage
df2['pct_diff'] = (df2['A'] - df2['B']) / df2['A']
# distplot is deprecated since seaborn 0.11; sns.histplot(df2['pct_diff'], kde=True) is the newer call
sns.distplot(df2['pct_diff']);
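To get the counts per 10% band that the question describes, the percentage difference can be binned with pd.cut. A minimal sketch using the sample rows from the question. Taking the absolute difference and using 10%-wide bins from 0 to 100 are assumptions about what is wanted.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.051990e+10, 1.051990e+10, 4.841800e+10, 2.327700e+10, 1.204900e+10],
    "B": [1.051990e+04, 1.051990e+04, 1.200000e+10, 2.716000e+10, 2.100000e+08],
})

# Absolute percentage difference relative to A (assumes A is never 0)
df["perc"] = (df["A"] - df["B"]).abs() * 100 / df["A"]

# Bin into 0-10, 10-20, ..., 90-100 and count rows per bin
bins = np.arange(0, 110, 10)
counts = pd.cut(df["perc"], bins=bins, include_lowest=True).value_counts().sort_index()
print(counts)
```

counts.plot(kind='bar') should then draw the distribution as a bar chart. Differences above 100% would fall outside these bins and be left uncounted, so the bin edges may need widening for other data.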

python scipy spearman correlations

I am trying to obtain the column names from the dataframe (df) and associate them to the resulting array produced by the spearmanr correlation function. I need to associate both the column names (a-j) back to the correlation value (spearman) and the p-values (spearman_pvalue). Is there an intuitive way to perform this task?
from scipy.stats import pearsonr,spearmanr
import numpy as np
import pandas as pd
df=pd.DataFrame(np.random.randint(0,100,size= (100,10)),columns=list('abcdefghij'))
def binary(row):
    if row>=50:
        return 1
    else:
        return 0
df['target']=df.a.apply(binary)
spearman,spearman_pvalue=spearmanr(df.drop(['target'],axis=1),df.target)
print(spearman)
print(spearman_pvalue)
It seems you need:
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
df=pd.DataFrame(np.random.randint(0,100,size= (100,10)),columns=list('abcdefghij'))
#print (df)
#faster for binary df
df['target'] = (df['a'] >= 50).astype(int)
#print (df)
spearman,spearman_pvalue=spearmanr(df.drop(['target'],axis=1),df.target)
df1 = pd.DataFrame(spearman.reshape(-1, 11), columns=df.columns)
#print (df1)
df2 = pd.DataFrame(spearman_pvalue.reshape(-1, 11), columns=df.columns)
#print (df2)
# The column names can also be assigned as the index of the full matrices:
df2=df2.set_index(df.columns)
df1=df1.set_index(df.columns)
Or:
df1 = pd.DataFrame(spearman.reshape(-1, 11),
                   columns=df.columns,
                   index=df.columns)
df2 = pd.DataFrame(spearman_pvalue.reshape(-1, 11),
                   columns=df.columns,
                   index=df.columns)
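If only the correlation of each feature with target is needed, the last column of both matrices can be pulled into one labelled table. A minimal sketch with made-up random data. The name summary is illustrative, and it assumes target is the last column of df.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(100, 10)), columns=list("abcdefghij"))
df["target"] = (df["a"] >= 50).astype(int)

# With a 2-D input, each column is a variable: both results are 11 x 11 matrices
rho, pval = spearmanr(df.to_numpy())

# Last column = correlation with target; drop the target-vs-target entry
summary = pd.DataFrame(
    {"spearman": rho[:-1, -1], "p_value": pval[:-1, -1]},
    index=df.columns[:-1],
)
print(summary)
```

Each row of summary is then keyed by the column name (a to j), so the correlation and p-value can be looked up with summary.loc['a'].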

Visually separating bar chart clusters in pandas

This is more of a hack that almost works.
#!/usr/bin/env python
from pandas import *
import matplotlib.pyplot as plt
import numpy as np
# Create original dataframe
df = DataFrame(np.random.rand(5,4), index=['art','mcf','mesa','perl','gcc'],
               columns=['pol1','pol2','pol3','pol4'])
# Estimate average
average = df.mean()
average.name = 'average'
# Append dummy row with zeros and then average
row = DataFrame([dict({p:0.0 for p in df.columns}), ])
df = df.append(row)
df = df.append(average)
print(df)
df.plot(kind='bar')
plt.show()
and gives:
             pol1      pol2      pol3      pol4
art      0.247309  0.139797  0.673009  0.265708
mcf      0.951582  0.319486  0.447658  0.259821
mesa     0.888686  0.177007  0.845190  0.946728
perl     0.902977  0.863369  0.194451  0.698102
gcc      0.836407  0.700306  0.739659  0.265613
0        0.000000  0.000000  0.000000  0.000000
average  0.765392  0.439993  0.579993  0.487194
The resulting plot gives the visual separation between the benchmarks and the average.
Is there a way to get rid of the 0 on the x-axis?
It turns out that DataFrame does not allow me to have multiple dummy rows this way.
My solution was to change
row = pd.DataFrame([dict({p:0.0 for p in df.columns}), ])
into
row = pd.Series({p: 0.0 for p in df.columns})
row.name = ""
A Series can be named with an empty string.
Still pretty hacky, but it works:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Create original dataframe
df = pd.DataFrame(np.random.rand(5,4), index=['art','mcf','mesa','perl','gcc'],
                  columns=['pol1','pol2','pol3','pol4'])
# Estimate average
average = df.mean()
average.name = 'average'
# Append dummy row with zeros and then average
row = pd.DataFrame([dict({p:0.0 for p in df.columns}), ])
df = df.append(row)
df = df.reindex(np.where(df.index, df.index, ''))
df = df.append(average)
print(df)
df.plot(kind='bar')
plt.show()
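DataFrame.append was removed in pandas 2.0, so on current versions the same idea needs pd.concat. A minimal sketch of that variant, assuming a zero-filled spacer row labelled with an empty string is an acceptable separator. The names spacer and plot_df are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 4)),
                  index=['art', 'mcf', 'mesa', 'perl', 'gcc'],
                  columns=['pol1', 'pol2', 'pol3', 'pol4'])

# Column means as a one-row frame labelled 'average'
average = df.mean().rename('average').to_frame().T

# Zero-filled spacer row whose label is an empty string
spacer = pd.DataFrame(0.0, index=[''], columns=df.columns)

plot_df = pd.concat([df, spacer, average])
print(plot_df)
```

Calling plot_df.plot(kind='bar') followed by plt.show() should then draw the benchmarks, an unlabelled gap, and the average.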
