T Test on Multiple Columns in Dataframe - python

The dataframe looks something like:
decade  rain  snow
1910    0.2   0.2
1910    0.3   0.4
2000    0.4   0.5
2010    0.1   0.1
I'd love some help with a Python function to run a t-test comparing decade combinations for a given column. The function below works, except that it does not take an input column such as rain or snow.
from itertools import combinations

def ttest_run(c1, c2):
    results = st.ttest_ind(cat1, cat2, nan_policy='omit')
    df = pd.DataFrame({'dec1': c1,
                       'dec2': c2,
                       'tstat': results.statistic,
                       'pvalue': results.pvalue},
                      index=[0])
    return df

df_list = [ttest_run(i, j) for i, j in combinations(data['decade'].unique().tolist(), 2)]
final_df = pd.concat(df_list, ignore_index=True)

I think you want something like this:
import pandas as pd
from itertools import combinations
from scipy import stats as st

d = {'decade': ['1910', '1910', '2000', '2010', '1990', '1990', '1990', '1990'],
     'rain': [0.2, 0.3, 0.3, 0.1, 0.1, 0.2, 0.3, 0.4],
     'snow': [0.2, 0.4, 0.5, 0.1, 0.1, 0.2, 0.3, 0.4]}
df = pd.DataFrame(data=d)

def all_pairwise(df, compare_col='decade'):
    decade_pairs = [(i, j) for i, j in combinations(df[compare_col].unique().tolist(), 2)]
    # or add a list of colnames to the function signature
    cols = list(df.columns)
    cols.remove(compare_col)
    list_of_dfs = []
    for pair in decade_pairs:
        for col in cols:
            c1 = df[df[compare_col] == pair[0]][col]
            c2 = df[df[compare_col] == pair[1]][col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            tmp = pd.DataFrame({'dec1': pair[0],
                                'dec2': pair[1],
                                'tstat': results.statistic,
                                'pvalue': results.pvalue}, index=[col])
            list_of_dfs.append(tmp)
    df_stats = pd.concat(list_of_dfs)
    return df_stats

df_stats = all_pairwise(df)
df_stats
If you execute that code you'll get runtime warnings caused by division-by-zero when computing t-statistics for decades with too few data points; these are what produce the NaNs in the output:
>>> df_stats
dec1 dec2 tstat pvalue
rain 1910 2000 NaN NaN
snow 1910 2000 NaN NaN
rain 1910 2010 NaN NaN
snow 1910 2010 NaN NaN
rain 1910 1990 0.000000 1.000000
snow 1910 1990 0.436436 0.685044
rain 2000 2010 NaN NaN
...
If you don't want all columns but only some specified set, change the function signature/definition line to read:
def all_pairwise(df, cols, compare_col='decade'):
where cols should be an iterable of string column names (a list will work fine). You'll then need to remove the two lines:
cols = list(df.columns)
cols.remove(compare_col)
from the function body, and otherwise it will work as before.
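For reference, here is a sketch of that variant written out in full (the same logic as above, just with cols passed in; it reuses the imports from the example):

def all_pairwise(df, cols, compare_col='decade'):
    decade_pairs = list(combinations(df[compare_col].unique().tolist(), 2))
    list_of_dfs = []
    for pair in decade_pairs:
        for col in cols:
            c1 = df[df[compare_col] == pair[0]][col]
            c2 = df[df[compare_col] == pair[1]][col]
            results = st.ttest_ind(c1, c2, nan_policy='omit')
            tmp = pd.DataFrame({'dec1': pair[0],
                                'dec2': pair[1],
                                'tstat': results.statistic,
                                'pvalue': results.pvalue}, index=[col])
            list_of_dfs.append(tmp)
    return pd.concat(list_of_dfs)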
You'll always get the runtime warnings unless you filter out decades with too few records before passing to the function.
Here is an example call using the version that accepts a list of columns as an argument, showing the runtime warnings:
>>> all_pairwise(df, cols=['rain'])
/usr/local/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3723: RuntimeWarning: Degrees of freedom <= 0 for slice
return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
/usr/local/lib/python3.8/site-packages/numpy/core/_methods.py:254: RuntimeWarning: invalid value encountered in double_scalars
ret = ret.dtype.type(ret / rcount)
dec1 dec2 tstat pvalue
rain 1910 2000 NaN NaN
rain 1910 2010 NaN NaN
rain 1910 1990 0.0 1.0
rain 2000 2010 NaN NaN
rain 2000 1990 NaN NaN
rain 2010 1990 NaN NaN
>>>
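One way to avoid those warnings (my own suggestion, not part of the original answer) is to drop decades with fewer than two rows before calling the function, since a t-test needs at least two observations per group:

counts = df['decade'].value_counts()
filtered = df[df['decade'].isin(counts[counts >= 2].index)]
df_stats = all_pairwise(filtered)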

How to do math operations on a dataframe with an undefined number of columns?

I have a data frame in which there is an indefinite number of columns, to be defined later.
Like this:
index   GDP   2004  2005  ...
brasil  1000  0.10  0.10  ...
china   1000  0.15  0.10  ...
india   1000  0.05  0.10  ...
df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.05],
                   '2005': [0.10, 0.10, 0.10]})
The GDP column is the initial GDP, and the columns from 2004 onwards are floats representing percentage GDP growth in each year.
I want to use these percentages to compute the absolute GDP for each year, compounding from the initial GDP (for example, brasil: 1000 * 1.10 = 1100 in 2004, then 1100 * 1.10 = 1210 in 2005). I need a dataframe like this:
index   GDP   2004  2005
brasil  1000  1100  1210
china   1000  1150  1265
india   1000  1050  1155
I tried to use itertuples, df.columns and for loops, but I'm probably missing something.
Remembering that there are an indefinite number of columns.
Thank you very much in advance!
My answer is a combination of Wardy and user19*.
Starting with...
df = pd.DataFrame(data={'GDP': [1000, 1000, 1000],
                        '2004': [0.10, 0.15, 0.5],
                        '2005': [0.10, 0.10, 0.10],
                        'index': ['brasil', 'china', 'india']})
Find the percentage columns and make sure they are in the right order.
columns_of_interest = sorted(c for c in df.columns if c not in ['GDP', 'index'])
Now we calculate...
running_GDP = df['GDP'].copy()  # starting value (copy so the GDP column itself is not modified)
for column in columns_of_interest:
    running_GDP *= 1.0 + df[column]
    df[column] = running_GDP
This results in
    GDP    2004    2005   index
0  1000  1100.0  1210.0  brasil
1  1000  1150.0  1265.0   china
2  1000  1500.0  1650.0   india
(Note: the india row differs from the question's expected output because this example uses 0.5 rather than 0.05 for india's 2004 growth.)
A simple way is to count the columns and loop over:
num = df.shape[1]
start = 2  # assumes the first two columns are 'index' and 'GDP'
for idx in range(start, num):
    df.iloc[:, idx] = df.iloc[:, idx-1] * (1 + df.iloc[:, idx])
print(df)
which gives
index GDP 2004 2005
0 brasil 1000 1100.0 1210.0
1 china 1000 1150.0 1265.0
2 india 1000 1050.0 1155.0
You can use df.columns to access a list of the dataframes columns.
Then you can do a loop over all of these column names. Here is an example of your data frame where I multiplied every value by 2. If you want to do different operations to different columns you can add conditions into the loop.
df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.5],
                   '2005': [0.10, 0.10, 0.10]})

for colName in df.columns:
    df[colName] *= 2
print(df)
this returns...
index GDP 2004 2005
0 brasilbrasil 2000 0.2 0.2
1 chinachina 2000 0.3 0.2
2 indiaindia 2000 1.0 0.2
Hope this helps!
Add one to the percentages and calculate the cumulative product:
q = (df.iloc[:,2:] + 1).cumprod(axis=1)
Then multiply by the beginning GDP:
q = q.mul(df['GDP'], axis='index')
If you are trying to change the original DataFrame, assign the result:
df.iloc[:,2:] = q
If you want to make a new DataFrame, concatenate the result with the first columns of the original:
new = pd.concat([df.iloc[:,:2], q], axis=1)
You can put those first two lines together if you want.
q = (df.iloc[:,2:] + 1).cumprod(axis=1).mul(df.GDP,axis='index')
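Putting those pieces together with the question's data, a minimal end-to-end sketch (assuming the columns are ordered index, GDP, then the year columns) looks like this:

import pandas as pd

df = pd.DataFrame({'index': ['brasil', 'china', 'india'],
                   'GDP': [1000, 1000, 1000],
                   '2004': [0.10, 0.15, 0.05],
                   '2005': [0.10, 0.10, 0.10]})

# growth factors, compounded across the year columns, scaled by the starting GDP
q = (df.iloc[:, 2:] + 1).cumprod(axis=1).mul(df['GDP'], axis='index')
new = pd.concat([df.iloc[:, :2], q], axis=1)
print(new)
#     index   GDP    2004    2005
# 0  brasil  1000  1100.0  1210.0
# 1   china  1000  1150.0  1265.0
# 2   india  1000  1050.0  1155.0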

Subtracting value from column gives NaN only

I have a multi-column csv file and I want to subtract the columns X31-X27, Y31-Y27, and Z31-Z27 within the same dataframe, but the subtraction gives me only NaN values.
Here are the values of the csv file:
It gives me the result shown in the picture.
Help me figure out this problem.
import pandas as pd
import os
import numpy as np
df27 = pd.read_csv('D:27.txt', names=['No27','X27','Y27','Z27','Date27','Time27'], sep='\s+')
df28 = pd.read_csv('D:28.txt', names=['No28','X28','Y28','Z28','Date28','Time28'], sep='\s+')
df29 = pd.read_csv('D:29.txt', names=['No29','X29','Y29','Z29','Date29','Time29'], sep='\s+')
df30 = pd.read_csv('D:30.txt', names=['No30','X30','Y30','Z30','Date30','Time30'], sep='\s+')
df31 = pd.read_csv('D:31.txt', names=['No31','X31','Y31','Z31','Date31','Time31'], sep='\s+')
total=pd.concat([df27,df28,df29,df30,df31], axis=1)
total.to_csv('merge27-31.csv', index = False)
print(total)
df2731 = pd.read_csv('C:\\Users\\finalmerge27-31.csv')
df2731.reset_index(inplace=True)
print(df2731)
df227 = df2731[['X31', 'Y31', 'Z31']] - df2731[['X27', 'Y27', 'Z27']]
print(df227)
# input data
df = pd.DataFrame({'x27': [-1458.88, 181.78, 1911.84, 3739.3, 5358.19],
                   'y27': [-5885.8, -5878.1, -5786.5, -5735.7, -5545.6],
                   'z27': [1102, 4139, 4616, 4108, 1123],
                   'x31': [-1458, 181, 1911, np.nan, 5358],
                   'y31': [-5885, -5878, -5786, np.nan, -5554],
                   'z31': [1102, 4138, 4616, np.nan, 1123]})
df
x27 y27 z27 x31 y31 z31
0 -1458.88 -5885.8 1102 -1458.0 -5885.0 1102.0
1 181.78 -5878.1 4139 181.0 -5878.0 4138.0
2 1911.84 -5786.5 4616 1911.0 -5786.0 4616.0
3 3739.30 -5735.7 4108 NaN NaN NaN
4 5358.19 -5545.6 1123 5358.0 -5554.0 1123.0
df1 = df[['x27', 'y27', 'z27']]
df2 = df[['x31', 'y31', 'z31']]
pd.DataFrame(df1.values - df2.values).rename(columns={0: 'x27-x31', 1: 'y27-y31', 2: 'z27-z31'})
Out:
   x27-x31  y27-y31  z27-z31
0    -0.88     -0.8      0.0
1     0.78     -0.1      1.0
2     0.84     -0.5      0.0
3      NaN      NaN      NaN
4     0.19      8.4      0.0
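For completeness: the all-NaN result in the question happens because pandas aligns on column labels when subtracting two DataFrames, and X31/Y31/Z31 never match X27/Y27/Z27, so every cell becomes NaN. A sketch of a fix using the question's own column names (assuming the df2731 frame from the question) is to subtract the underlying arrays and label the result explicitly:

diff = pd.DataFrame(
    df2731[['X31', 'Y31', 'Z31']].to_numpy() - df2731[['X27', 'Y27', 'Z27']].to_numpy(),
    columns=['X31-X27', 'Y31-Y27', 'Z31-Z27'],
    index=df2731.index,
)
print(diff)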

Find the closest value upper and lower from reference value in pandas df

I'm trying to find the closest values above and below another value (price).
My code only finds the single closest value, regardless of whether it is above or below.
DF
           SYMBOL     price    gainddS8  gainddS7_5   gainddS7  gainddS6_5
102  1000SHIBUSDT  0.016049        -1.1        -0.1        1.5        2.12
9         ADAUSDT  0.572700  -15.371514        -2.5       -1.0    2.497339
24       ALGOUSDT  0.391300   -1.117796   0.5104497  14.091197   16.077897
Expected result
           SYMBOL     price  closestdown  closestup
102  1000SHIBUSDT  0.016049         -0.1        1.5
9         ADAUSDT  0.572700         -1.0   2.497339
24       ALGOUSDT  0.391300    -1.117796  0.5104497
My code that finds the closest value is:
df1 = df.filter(like='gain')
pos = df1.sub(df['price'], axis=0).abs().to_numpy().argmin(axis=1)
df['closestvalue'] = df1.to_numpy()[np.arange(len(df1)), pos]
The closest upper value will always be in the cell immediately to the right of the closest lower value, if that helps simplify things.
Given your data frame:
import pandas as pd
df = pd.DataFrame(
    {"Symbol": ["1000SHIBUSDT", "ADAUSDT", "ALGOUSDT"],
     "Price": [0.016049, 0.572700, 0.391300],
     "gainddS8": [-1.1, -15.371514, -1.117796],
     "gainddS7_5": [-0.1, -2.5, 0.5104497],
     "gainddS7": [1.5, -1.0, 14.091197],
     "gainddS6_5": [2.12, 2.497339, 16.077897]
    }
)
You could do something like this:
tmp = df.drop(columns=["Symbol"])
# Make sure that your columns have float data types, otherwise convert them to floats
df_new = df[["Symbol", "Price"]].assign(
    closestdown=tmp[tmp.apply(lambda x: x < tmp["Price"])].max(axis=1),
    closestup=tmp[tmp.apply(lambda x: x > tmp["Price"])].min(axis=1)
)
print(df_new)
which results in
Symbol Price closestdown closestup
0 1000SHIBUSDT 0.016049 -0.100000 1.500000
1 ADAUSDT 0.572700 -1.000000 2.497339
2 ALGOUSDT 0.391300 -1.117796 0.510450
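An equivalent sketch using DataFrame.lt/gt with where, which produces the same result and is just a stylistic alternative:

gains = df.filter(like="gain")
below = gains.where(gains.lt(df["Price"], axis=0))  # keep only values below the price
above = gains.where(gains.gt(df["Price"], axis=0))  # keep only values above the price

df_new = df[["Symbol", "Price"]].assign(
    closestdown=below.max(axis=1),
    closestup=above.min(axis=1),
)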

Optimal way to acquire percentiles of DataFrame rows

Problem
I have a pandas DataFrame df:
year val0 val1 val2 ... val98 val99
1983 -42.187 15.213 -32.185 12.887 -33.821
1984 39.213 -142.344 23.221 0.230 1.000
1985 -31.204 0.539 2.000 -1.000 3.442
...
2007 4.239 5.648 -15.483 3.794 -25.459
2008 6.431 0.831 -34.210 0.000 24.527
2009 -0.160 2.639 -2.196 52.628 71.291
My desired output, i.e. new_df, contains the 9 different percentiles including the median, and should have the following format:
year percentile_10 percentile_20 percentile_30 percentile_40 median percentile_60 percentile_70 percentile_80 percentile_90
1983 -40.382 -33.182 -25.483 -21.582 -14.424 -9.852 -3.852 6.247 10.528
...
2009 -3.248 0.412 6.672 10.536 12.428 20.582 46.248 52.837 78.991
Attempt
The following was my initial attempt:
def percentile(n):
    def percentile_(x):
        return np.percentile(x, n)
    percentile_.__name__ = 'percentile_%s' % n
    return percentile_

new_df = df.groupby('year').agg([percentile(10), percentile(20), percentile(30), percentile(40), np.median, percentile(60), percentile(70), percentile(80), percentile(90)]).reset_index()
However, instead of computing the percentiles across all of the val columns together, it calculated them for each val column separately and therefore returned 1000 columns. And since each group holds only a single value per column, every percentile came back as that same value.
I still managed to get the desired result with the following:
list_1 = []
list_2 = []
list_3 = []
list_4 = []
mlist = []
list_6 = []
list_7 = []
list_8 = []
list_9 = []
for i in range(len(df)):
    list_1.append(np.percentile(df.iloc[i, 1:], 10))
    list_2.append(np.percentile(df.iloc[i, 1:], 20))
    list_3.append(np.percentile(df.iloc[i, 1:], 30))
    list_4.append(np.percentile(df.iloc[i, 1:], 40))
    mlist.append(np.median(df.iloc[i, 1:]))
    list_6.append(np.percentile(df.iloc[i, 1:], 60))
    list_7.append(np.percentile(df.iloc[i, 1:], 70))
    list_8.append(np.percentile(df.iloc[i, 1:], 80))
    list_9.append(np.percentile(df.iloc[i, 1:], 90))
df['percentile_10'] = list_1
df['percentile_20'] = list_2
df['percentile_30'] = list_3
df['percentile_40'] = list_4
df['median'] = mlist
df['percentile_60'] = list_6
df['percentile_70'] = list_7
df['percentile_80'] = list_8
df['percentile_90'] = list_9
new_df = df[['year', 'percentile_10', 'percentile_20', 'percentile_30', 'percentile_40', 'median', 'percentile_60', 'percentile_70', 'percentile_80', 'percentile_90']]
But this is clearly a laborious, manual way to achieve the task. What is the optimal way to compute the percentiles of each row across multiple columns?
You can use the .describe() function like this:
# Create DataFrame
df = pd.DataFrame(np.random.randn(5, 3))
# .apply() the .describe() function with axis=1 (row-wise)
df.apply(pd.DataFrame.describe, axis=1)
output:
count mean std min 25% 50% 75% max
0 3.0 0.422915 1.440097 -0.940519 -0.330152 0.280215 1.104632 1.929049
1 3.0 1.615037 0.766079 0.799817 1.262538 1.725259 2.022647 2.320036
2 3.0 0.221560 0.700770 -0.585020 -0.008149 0.568721 0.624849 0.680978
3 3.0 -0.119638 0.182402 -0.274168 -0.220240 -0.166312 -0.042373 0.081565
4 3.0 -0.569942 0.807865 -1.085838 -1.035455 -0.985072 -0.311994 0.361084
If you want percentiles other than the defaults (0.25, 0.5, 0.75), you can pass them explicitly, e.g. .describe(percentiles=[0.1, 0.2, ..., 0.9]).
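For example, a sketch that computes row-wise deciles via Series.describe (applied to each row of the random df above):

pcts = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
deciles = df.apply(lambda row: row.describe(percentiles=pcts), axis=1)
# drop the extra summary columns if you only want the percentiles
deciles = deciles.drop(columns=['count', 'mean', 'std', 'min', 'max'])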
Use DataFrame.quantile after converting year to the index, then transpose and rename the columns with a custom lambda function:
a = np.arange(1, 10) / 10
f = lambda x: f'percentile_{int(x * 100)}' if x != 0.5 else 'median'
new_df = df.set_index('year').quantile(a, axis=1).T.rename(columns=f)
print (new_df)
percentile_10 percentile_20 percentile_30 percentile_40 median \
year
1983 -38.8406 -35.4942 -33.4938 -32.8394 -32.185
1984 -85.3144 -28.2848 0.3840 0.6920 1.000
1985 -19.1224 -7.0408 -0.6922 -0.0766 0.539
2007 -21.4686 -17.4782 -11.6276 -3.9168 3.794
2008 -20.5260 -6.8420 0.1662 0.4986 0.831
2009 -1.3816 -0.5672 0.3998 1.5194 2.639
percentile_60 percentile_70 percentile_80 percentile_90
year
1983 -14.1562 3.8726 13.3522 14.2826
1984 9.8884 18.7768 26.4194 32.8162
1985 1.1234 1.7078 2.2884 2.8652
2007 3.9720 4.1500 4.5208 5.0844
2008 3.0710 5.3110 10.0502 17.2886
2009 22.6346 42.6302 56.3606 63.8258

Bayesian Averaging in a Dataframe

I'm attempting to extract a series of Bayesian averages, based on a dataframe (by row).
For example, say I have a series of (0 to 1) user ratings of candy bars, stored in a dataframe like so:
            User1  User2  User3
Snickers     0.01    NaN    0.7
Mars Bars    0.25    0.4    0.1
Milky Way    0.9     1.0    NaN
Almond Joy    NaN    NaN    NaN
Babe Ruth    0.5     0.1    0.3
I'd like to create a column in a different DF which represents each candy bar's Bayesian Average from the above data.
To calculate the BA, I'm using the equation presented here:
S = score of the candy bar
R = average of user ratings for the candy bar
C = average of user ratings for all candy bars
w = weight assigned to R and computed as v/(v+m), where v is the number of user ratings for that candy bar, and m is the average number of reviews across all candy bars.
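In other words, each score is a weighted blend of the item's own mean rating and the global mean rating:
S = w*R + (1 - w)*C,  where  w = v / (v + m)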
I've translated that into python as such:
def bayesian_average(df):
    """given a dataframe, returns a series of bayesian averages"""
    R = df.mean(axis=1)
    C = df.sum(axis=1).sum() / df.count(axis=1).sum()
    w = df.count(axis=1) / (df.count(axis=1) + (df.count(axis=1).sum() / len(df.dropna(how='all', inplace=False))))
    return (w * R) + ((1 - w) * C)

other_df['bayesian_avg'] = bayesian_average(ratings_df)
However, my calculation seems to be off, in such a way that as the number of User columns in my initial dataframe grows, the final calculated Bayesian average grows as well (into numbers greater than 1).
Is this a problem with the fundamental equation I'm using, or with how I've translated it into python? Or is there an easier way to handle this in general (e.g. a preexisting package/function)?
Thanks!
I began with the dataframe you gave as an example:
d = {
    'Bar': ['Snickers', 'Mars Bars', 'Milky Way', 'Almond Joy', 'Babe Ruth'],
    'User1': [0.01, 0.25, 0.9, np.nan, 0.5],
    'User2': [np.nan, 0.4, 1.0, np.nan, 0.1],
    'User3': [0.7, 0.1, np.nan, np.nan, 0.3]
}
df = pd.DataFrame(data=d)
Which looks like this:
Bar User1 User2 User3
0 Snickers 0.01 NaN 0.7
1 Mars Bars 0.25 0.4 0.1
2 Milky Way 0.90 1.0 NaN
3 Almond Joy NaN NaN NaN
4 Babe Ruth 0.50 0.1 0.3
The first thing I did was create a list of all columns that had user reviews:
user_cols = []
for col in df.columns.values:
    if 'User' in col:
        user_cols.append(col)
Next, I found it most straightforward to create each variable of the Bayesian Average equation either as a column in the dataframe, or as a standalone variable:
Calculate the value of v for each bar:
df['v'] = df[user_cols].count(axis=1)
Calculate the value of m (equals 2.0 in this example):
m = np.mean(df['v'])
Calculate the value of w for each bar:
df['w'] = df['v']/(df['v'] + m)
And calculate the value of R for each bar:
df['R'] = np.mean(df[user_cols], axis=1)
Finally, get the value of C (equals 0.426 in this example):
C = np.nanmean(df[user_cols].values.flatten())
And now we're ready to calculate the Bayesian Average score, S, for each candy bar:
df['S'] = df['w']*df['R'] + (1 - df['w'])*C
This gives us a dataframe that looks like this:
Bar User1 User2 User3 v w R S
0 Snickers 0.01 NaN 0.7 2 0.5 0.355 0.3905
1 Mars Bars 0.25 0.4 0.1 3 0.6 0.250 0.3204
2 Milky Way 0.90 1.0 NaN 2 0.5 0.950 0.6880
3 Almond Joy NaN NaN NaN 0 0.0 NaN NaN
4 Babe Ruth 0.50 0.1 0.3 3 0.6 0.300 0.3504
Where the final column S contains all the S-scores for the candy bars. If you want, you can then delete the temporary v, w, and R columns with df = df.drop(['v', 'w', 'R'], axis=1):
Bar User1 User2 User3 S
0 Snickers 0.01 NaN 0.7 0.3905
1 Mars Bars 0.25 0.4 0.1 0.3204
2 Milky Way 0.90 1.0 NaN 0.6880
3 Almond Joy NaN NaN NaN NaN
4 Babe Ruth 0.50 0.1 0.3 0.3504
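If you prefer the whole calculation as a single function, here is a sketch that mirrors the steps above (the helper name and signature are my own, not from the original post):

def bayesian_average(df, user_cols):
    """Return a Series of Bayesian averages, one per row of df."""
    v = df[user_cols].count(axis=1)            # number of ratings per bar
    m = v.mean()                               # average number of ratings per bar
    R = df[user_cols].mean(axis=1)             # mean rating per bar
    C = np.nanmean(df[user_cols].to_numpy())   # mean rating across all bars
    w = v / (v + m)
    return w * R + (1 - w) * C

df['S'] = bayesian_average(df, user_cols)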
