How to apply scipy.stats.describe to each group? - python

I would appreciate it if you could let me know how to apply scipy.stats.describe to calculate summary statistics by group. My data (TrainSet) looks like this:
Financial Distress x1 x2 x3
0 1.28 0.02 0.87
0 1.27 0.01 0.82
0 1.05 -0.06 0.92
1 1.11 -0.02 0.86
0 1.06 0.11 0.81
0 1.06 0.08 0.88
1 0.87 -0.03 0.79
I want to compute the summary statistics by "Financial Distress". I mean something like this post, but via scipy.stats.describe, because I need skewness and kurtosis for x1, x2, and x3 by group. However, my code below doesn't produce the statistics by group.
desc = dict()
for col in TrainSet.columns:
    if [TrainSet["Financial Distress"] == 0]:   # a non-empty list is always truthy, so this never filters by group
        desc[col] = describe(TrainSet[col])
df = pd.DataFrame.from_dict(desc, orient='index')
df.to_csv("Descriptive Statistics3.csv")
In fact, I need something like this:
                    ------------------------ Group 0 ------------------------    ------------------------ Group 1 ------------------------
                    nobs  minmax       mean  variance  skewness  kurtosis        nobs  minmax       mean  variance  skewness  kurtosis
Financial Distress  2569  (0, 1)       0.0   0.0        4.9       22.1           50    (0, 1)        0.0   0.0       2.9       22.1
x1                  2569  (0.1, 38)    1.4   1.7       16.5      399.9           50    (-3.6, 3.8)   0.3   0.1       0.5       21.8
x2                  2569  (-0.2, 0.7)  0.1   0.0        1.0        1.8           50    (-0.3, 0.7)   0.1   0.0       0.9        1.2
x3                  2569  (0.1, 0.9)   0.6   0.0       -0.5       -0.2           50    (0.1, 0.9)    0.6   0.0      -0.6       -0.3
x4                  2569  (5.3, 6.3)   0.9   0.3        3.2       19.7           50    (-26, 38)    14.0  12.0      15.1       26.5
x5                  2569  (-0.2, 0.8)  0.2   0.0        0.8        1.4           50    (0.3, 0.9)    0.4   0.0       0.5       -0.3
Or
nobs minmax mean variance skewness kurtosis
x1 0 5 (1.05, 1.28) 1.144 0.01433 4.073221e-01 -1.825477
1 2 (0.87, 1.11) 0.990 0.02880 1.380350e-15 -2.000000
x2 0 5 (-0.06, 0.11) 0.032 0.00437 -1.992376e-01 -1.130951
1 2 (-0.03, -0.02) -0.025 0.00005 1.058791e-15 -2.000000
x3 0 5 (0.81, 0.92) 0.860 0.00205 1.084093e-01 -1.368531
1 2 (0.79, 0.86) 0.825 0.00245 4.820432e-15 -2.000000
Thanks in advance.

If you wish to describe 3 series independently by group, it seems you'll need 3 dataframes. You can construct these dataframes and then concatenate them:
import pandas as pd
from scipy.stats import describe

grouper = df.groupby('FinancialDistress')   # use 'Financial Distress' if the column name contains a space
variables = df.columns[1:]                   # the x1, x2, x3 columns

res = pd.concat([pd.DataFrame(describe(g[x]) for _, g in grouper)
                   .reset_index()
                   .assign(cat=x)
                   .set_index(['cat', 'index'])
                 for x in variables],
                axis=0)
print(res)
nobs minmax mean variance skewness kurtosis
cat index
x1 0 5 (1.05, 1.28) 1.144 0.01433 4.073221e-01 -1.825477
1 2 (0.87, 1.11) 0.990 0.02880 1.380350e-15 -2.000000
x2 0 5 (-0.06, 0.11) 0.032 0.00437 -1.992376e-01 -1.130951
1 2 (-0.03, -0.02) -0.025 0.00005 1.058791e-15 -2.000000
x3 0 5 (0.81, 0.92) 0.860 0.00205 1.084093e-01 -1.368531
1 2 (0.79, 0.86) 0.825 0.00245 4.820432e-15 -2.000000
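If scipy is not strictly required, here is a minimal alternative sketch that builds the same per-group summary from pandas' own aggregations (it rebuilds the question's small sample so it runs standalone; pandas' skew/kurt use bias-corrected estimators, so the numbers can differ slightly from scipy.stats.describe on small samples):
import pandas as pd

# Rebuild the small sample from the question
TrainSet = pd.DataFrame({
    'Financial Distress': [0, 0, 0, 1, 0, 0, 1],
    'x1': [1.28, 1.27, 1.05, 1.11, 1.06, 1.06, 0.87],
    'x2': [0.02, 0.01, -0.06, -0.02, 0.11, 0.08, -0.03],
    'x3': [0.87, 0.82, 0.92, 0.86, 0.81, 0.88, 0.79],
})

# One row per group; the columns form a (variable, statistic) MultiIndex
summary = (TrainSet
           .groupby('Financial Distress')[['x1', 'x2', 'x3']]
           .agg(['count', 'min', 'max', 'mean', 'var', 'skew', pd.Series.kurt]))
print(summary)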

Related

Calculate the sum of a pandas column depending on the change of values in another column

I have a dataframe as follows:
df =
col_1 val_1
0 4.0 0.89
1 4.0 0.56
2 49.0 0.7
3 49.0 1.23
4 49.0 0.8
5 52.0 0.5
6 52.0 0.2
I want to calculate the sum of the column val_1 with a penalising factor that depends on changes in the values of col_1.
For example: whenever the value in col_1 changes from one row to the next, we take val_1 from the last row before the change and subtract a penalising factor of 0.4 from it:
sum = 0.89 + (0.56 - 0.4) (because col_1 changes from 4.0 to 49.0) + 0.7 + 1.23 + (0.8 - 0.4) (because col_1 changes from 49.0 to 52.0) + 0.5 + 0.2
sum = 4.08
Is there a way to do this?
Use np.where to assign a new column, flagging the rows where col_1 differs from the next row via .shift(-1):
import numpy as np

df['val_1_adj'] = np.where(df['col_1'].ne(df['col_1'].shift(-1).ffill()),
                           df['val_1'].sub(0.4),
                           df['val_1'])
print(df)
col_1 val_1 val_1_adj
0 4.0 0.89 0.89
1 4.0 0.56 0.16
2 49.0 0.70 0.70
3 49.0 1.23 1.23
4 49.0 0.80 0.40
5 52.0 0.50 0.50
6 52.0 0.20 0.20
df['val_1_adj'].sum()
4.08
A slight variation on @UmarH's answer:
df['penalties'] = np.where(~df.col_1.diff(-1).isin([0, np.nan]), 0.4, 0)
my_sum = (df['val_1'] - df['penalties']).sum()
print(my_sum)
Output:
4.08
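For reference, a compact mask-based sketch of the same idea (it rebuilds the sample frame so it runs on its own; the mask flags the last row of every run of equal col_1 values except the final one):
import pandas as pd

df = pd.DataFrame({'col_1': [4.0, 4.0, 49.0, 49.0, 49.0, 52.0, 52.0],
                   'val_1': [0.89, 0.56, 0.70, 1.23, 0.80, 0.50, 0.20]})

# True where the next row's col_1 differs (and a next row exists)
change_ahead = df['col_1'].ne(df['col_1'].shift(-1)) & df['col_1'].shift(-1).notna()
total = (df['val_1'] - 0.4 * change_ahead).sum()
print(total)  # ~4.08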

pandas how to add column of group by running range

I have a dataframe:
A B
0 0.1
0.1 0.3
0.35 0.48
1.3 1.5
1.5 1.9
2.2 2.9
3.1 3.4
5.1 5.5
And I want to add a column that will be the rank of B after grouping into bins of 1.5, so it will be:
A B T
0 0.1 0
0.1 0.3 0
0.35 0.48 0
1.3 1.5 0
1.5 1.9 1
2.2 2.9 1
3.1 3.4 2
5.1 5.5 3
What is the best way to do so?
Use pd.cut with pd.factorize:
df['T'] = pd.factorize(pd.cut(df.B, bins=np.arange(0, df.B.max() + 1.5, 1.5)))[0]
print (df)
A B T
0 0.00 0.10 0
1 0.10 0.30 0
2 0.35 0.48 0
3 1.30 1.50 0
4 1.50 1.90 1
5 2.20 2.90 1
6 3.10 3.40 2
7 5.10 5.50 3
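An equivalent sketch using the bin codes from cut directly; it gives the same T here, though strictly speaking cat.codes numbers the bins themselves while factorize numbers them in order of first appearance, so the two can diverge when a lower bin is empty:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 0.1, 0.35, 1.3, 1.5, 2.2, 3.1, 5.1],
                   'B': [0.1, 0.3, 0.48, 1.5, 1.9, 2.9, 3.4, 5.5]})

bins = np.arange(0, df.B.max() + 1.5, 1.5)   # [0, 1.5, 3.0, 4.5, 6.0]
df['T'] = pd.cut(df.B, bins=bins).cat.codes  # bin index of each value
print(df)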

Pandas Series Resample

I have the following pandas series:
dummy_array = pd.Series(np.array(range(-10, 11)), index=(np.array(range(0, 21))/10))
This yields the following series:
0.0 -10
0.1 -9
0.2 -8
0.3 -7
0.4 -6
0.5 -5
0.6 -4
0.7 -3
0.8 -2
0.9 -1
1.0 0
1.1 1
1.2 2
1.3 3
1.4 4
1.5 5
1.6 6
1.7 7
1.8 8
1.9 9
2.0 10
If I want to resample, how can I do it? I read the docs and it suggested this:
dummy_array.resample('20S').mean()
But it's not working. Any ideas?
Thank you.
Edit:
I want my final vector to have double the frequency. So something like this:
0.0 -10
0.05 -9.5
0.1 -9
0.15 -8.5
0.2 -8
0.25 -7.5
etc.
Here is a solution using np.linspace(), .reindex() and .interpolate().
The series dummy_array is created as described above.
# get properties of original index
start = dummy_array.index.min()
end = dummy_array.index.max()
num_gridpoints_orig = dummy_array.index.size
# calc number of grid-points in new index
num_gridpoints_new = (num_gridpoints_orig * 2) - 1
# create new index, with twice the number of grid-points (i.e., smaller step-size)
idx_new = np.linspace(start, end, num_gridpoints_new)
# re-index the series. New grid-points get a value of NaN,
# and we replace these NaNs with interpolated values
df2 = dummy_array.reindex(index=idx_new).interpolate()
print(df2.head())
0.00 -10.0
0.05 -9.5
0.10 -9.0
0.15 -8.5
0.20 -8.0
Create a list of the midpoints of the original series (each index shifted by 0.05 and each value by 0.5), then break it down into values and indices to build a new pd.Series. Concatenate it with the original series and reorder.
# new list of midpoints between consecutive samples
ups = [[x + 0.05, y + 0.5] for x, y in zip(dummy_array.index, dummy_array)]
idx = [i[0] for i in ups]
val = [i[1] for i in ups]
d2 = pd.Series(val, index=idx)
d3 = pd.concat([dummy_array, d2], axis=0)
d3.sort_values(inplace=True)  # works here because the values increase with the index; sort_index() would be more robust
d3
0.00 -10.0
0.05 -9.5
0.10 -9.0
0.15 -8.5
0.20 -8.0
0.25 -7.5
0.30 -7.0
0.35 -6.5
0.40 -6.0
0.45 -5.5
0.50 -5.0
0.55 -4.5
0.60 -4.0
0.65 -3.5
0.70 -3.0
0.75 -2.5
0.80 -2.0
0.85 -1.5
0.90 -1.0
0.95 -0.5
1.00 0.0
1.05 0.5
1.10 1.0
1.15 1.5
1.20 2.0
1.25 2.5
1.30 3.0
1.35 3.5
1.40 4.0
1.45 4.5
1.50 5.0
1.55 5.5
1.60 6.0
1.65 6.5
1.70 7.0
1.75 7.5
1.80 8.0
1.85 8.5
1.90 9.0
1.95 9.5
2.00 10.0
2.05 10.5
dtype: float64
Thank you all for your contributions. After looking at the answers and thinking a bit more, I found a more generic solution that should handle every possible case. Here, I wanted to upsample dummy_arrayA onto the same index as dummy_arrayB. I create a new index that contains the points of both A and B, use reindex and interpolate to calculate the new values, and at the end drop the old A-only points so that the result has the same index as dummy_arrayB.
import pandas as pd
import numpy as np

# Create dummy arrays
dummy_arrayA = pd.Series(np.array(range(0, 4)), index=[0, 0.5, 1.0, 1.5])
dummy_arrayB = pd.Series(np.array(range(0, 5)), index=[0, 0.4, 0.8, 1.2, 1.6])

# Merge index A and index B into one grid
new_ind = dummy_arrayA.index.union(dummy_arrayB.index)

# Reindex A onto the merged grid. This copies all existing values and inserts
# NaN at the new grid points; interpolate(method="index") then fills those NaNs
# based on the index values. A point beyond A's last index (here 1.6) is simply
# filled with A's last value rather than extrapolated.
df2 = dummy_arrayA.reindex(index=new_ind).interpolate(method="index")

# Drop the grid points that belong only to A (common points are kept),
# so that the final index matches dummy_arrayB
old_only = dummy_arrayA.index.difference(dummy_arrayB.index)
df2 = df2.drop(old_only)
print(df2)
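As a quick sanity check that the series has been moved onto dummy_arrayB's grid (assuming the arrays defined above):
# The result's index should now match dummy_arrayB's exactly
print(df2.index.equals(dummy_arrayB.index))   # expected: True
print(df2.index.tolist())                     # [0.0, 0.4, 0.8, 1.2, 1.6]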

sum column based on level selected in column header

I have a pd.DataFrame that looks like this. Note that the column names represent the level.
df
PC 0 1 2 3
0 PC_1 0.74 0.25 0.1 0.0
1 PC_1 0.72 0.26 0.1 0.1
2 PC_2 0.80 0.18 0.2 0.0
3 PC_3 0.79 0.19 0.1 0.1
I want to create another 4 columns next to the existing ones and fill their values based on the selected level.
For example, if level = 1, df should look like this:
df
  PC    0     1     2    3    0_1  1_1          2_1  3_1
0 PC_1  0.74  0.25  0.1  0.0  0.0  (0.74+0.25)  0.1  0.0
1 PC_1  0.72  0.26  0.1  0.1  0.0  (0.72+0.26)  0.1  0.1
2 PC_2  0.80  0.18  0.2  0.0  0.0  (0.80+0.18)  0.2  0.0
3 PC_3  0.79  0.19  0.1  0.1  0.0  (0.79+0.19)  0.1  0.1
If level=3,
df
PC 0 1 2 3 0_3 1_3 2_3 3_3
0 PC_1 0.74 0.25 0.1 0.0 0.0 0.0 0.0 sum(0.74+0.25+0.1+0.0)
1 PC_1 0.72 0.26 0.1 0.1 0.0 0.0 0.0 sum(0.72+0.26+0.1+0.1)
2 PC_2 0.80 0.18 0.2 0.0 0.0 0.0 0.0 sum(0.80+0.18+0.20+0.0)
3 PC_3 0.79 0.19 0.1 0.1 0.0 0.0 0.0 sum(0.79+0.19+0.1+0.1)
I don't know how to solve the problem and am looking for help.
Thank you in advance.
Set 'PC' to the index to make things easier. We zero everything before your column, cumsum up to the column, and keep everything as is after your column.
df = df.set_index('PC')

def add_sum(df, level):
    i = df.columns.get_loc(level)
    df_add = (pd.concat([pd.DataFrame(0, index=df.index, columns=df.columns[:i]),
                         df.cumsum(1).iloc[:, i],
                         df.iloc[:, i + 1:]],
                        axis=1)
                .add_suffix(f'_{level}'))
    return pd.concat([df, df_add], axis=1)

add_sum(df, '1')  # use 1 (int) if the column labels are ints
0 1 2 3 0_1 1_1 2_1 3_1
PC
PC_1 0.74 0.25 0.1 0.0 0 0.99 0.1 0.0
PC_1 0.72 0.26 0.1 0.1 0 0.98 0.1 0.1
PC_2 0.80 0.18 0.2 0.0 0 0.98 0.2 0.0
PC_3 0.79 0.19 0.1 0.1 0 0.98 0.1 0.1
add_sum(df, '3')
0 1 2 3 0_3 1_3 2_3 3_3
PC
PC_1 0.74 0.25 0.1 0.0 0 0 0 1.09
PC_1 0.72 0.26 0.1 0.1 0 0 0 1.18
PC_2 0.80 0.18 0.2 0.0 0 0 0 1.18
PC_3 0.79 0.19 0.1 0.1 0 0 0 1.18
As you wrote "based on level selected in column header" in the title, I understand that:
there is no "external" level variable,
the level (how many columns to sum) results just from the source column name.
So the task is actually to combine both of your expected results (you presented only how to compute columns 1_1 and 3_3) and compute the other new columns the same way.
The solution to do it is surprisingly concise.
Run the following one-liner:
df = df.join(df.iloc[:, 1:].cumsum(axis=1)
               .rename(lambda name: str(name) + '_1', axis=1))
Details:
df.iloc[:, 1:] - Take all rows, starting from column 1 (column numbers from 0).
cumsum(axis=1) - Compute the cumulative sum, horizontally.
rename(..., axis=1) - Rename the columns.
lambda name: str(name) + '_1' - Lambda function to compute each new column name.
The result so far - the new columns.
df = df.join(...) - Join with the original DataFrame and save the result back under df.

Calculating percentage of times column values meet varying conditions

I have a DataFrame like the one below, only with about 25 columns and 3000 rows. I need a second DF that displays the percentage of times that all the rows in each column from df_A are >= the target in df_B.
For example, in df_A, column d02 is >= .04 three times out of five (the len of the column), so that should be reflected in df_B as 60%.
I know how to do the comparison and percentages separately, but I am lost on how to put everything together and populate the new DF.
df_A
d01 d02 d03
0 0.028 0.021 0.028
1 0.051 0.063 0.093
2 0.084 0.084 0.084
3 0.061 0.061 0.072
4 0.015 0.015 0.015
Goal...
df_B
target d01 d02 d03
.02 p p p
.04 p .60 p
.06 p p p
.08 p p p
.15 p p p
.20 p p p
.25 p p p
.30 p p p
One way is to use numpy:
a, t, n = df_A.values, df_T.values, len(df_A.index)
res = np.zeros((len(df_T.index), len(df_A.columns)))

for i in range(res.shape[0]):
    for j in range(res.shape[1]):
        res[i, j] = np.sum(a[:, j] >= t[i]) / n

result = df_T.join(pd.DataFrame(res, columns=df_A.columns))
Setup
df_A:
d01 d02 d03
0 0.028 0.021 0.028
1 0.051 0.063 0.093
2 0.084 0.084 0.084
3 0.061 0.061 0.072
4 0.015 0.015 0.015
df_T:
target
0 0.02
1 0.04
2 0.06
3 0.08
4 0.15
5 0.20
6 0.25
7 0.30
Result
target d01 d02 d03
0 0.02 0.8 0.8 0.8
1 0.04 0.6 0.6 0.6
2 0.06 0.4 0.6 0.6
3 0.08 0.2 0.2 0.4
4 0.15 0.0 0.0 0.0
5 0.20 0.0 0.0 0.0
6 0.25 0.0 0.0 0.0
7 0.30 0.0 0.0 0.0
Performance benchmarking
The numpy version can be further optimised using numba.
Setup (data enlarged for the benchmark):
df_A = pd.concat([df_A] * 10)
df_T = pd.concat([df_T] * 5)
target = [.02, .04, .06, .08, .15, .20, .25, .30] * 5

def allen(df_A, target):
    df_B = pd.DataFrame(index=target, columns=df_A.columns).rename_axis('target', axis=0)
    return df_B.apply(lambda x: df_A.ge(x.name).sum().div(len(df_A)), axis=1).reset_index()

def jpp(df_A, df_T):
    a, t, n = df_A.values, df_T.values, len(df_A.index)
    res = np.zeros((len(df_T.index), len(df_A.columns)))
    for i in range(res.shape[0]):
        for j in range(res.shape[1]):
            res[i, j] = np.sum(a[:, j] >= t[i]) / n
    return df_T.join(pd.DataFrame(res, columns=df_A.columns))

def louis(df_A, target):
    dic = {key: [] for key in df_A}
    for t in target:
        for key in dic:
            s = 0
            for val in df_A[key]:
                if val >= t:
                    s += 1
            dic[key].append(s / len(df_A[key]))
    return pd.DataFrame(data=dic, index=target)

%timeit allen(df_A, target)  # 40ms
%timeit louis(df_A, target)  # 7.79ms
%timeit jpp(df_A, df_T)      # 4.29ms
Method
Create a list of targets.
Create a dictionary that will associate, to each column name, the list of percentages corresponding to the targets.
Loop on the targets, and for each target, loop on the columns to calculate the percentage and put it in the dictionary.
Create a DataFrame with the dictionary and the list of targets.
Code
df_A = pd.DataFrame(data={
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015]})
target = [.02, .04, .06, .08, .15, .20, .25, .30]
dic = {key: [] for key in df_A}
for t in target:
    for key in dic:
        s = 0
        for val in df_A[key]:
            if val >= t:
                s += 1
        dic[key].append(s / len(df_A[key]))
df_B = pd.DataFrame(data=dic, index=target)
Result on the example
df_B
d01 d02 d03
0.02 0.8 0.8 0.8
0.04 0.6 0.6 0.6
0.06 0.4 0.6 0.6
0.08 0.2 0.2 0.4
0.15 0.0 0.0 0.0
0.20 0.0 0.0 0.0
0.25 0.0 0.0 0.0
0.30 0.0 0.0 0.0
Suppose you have (copied example data from Louis):
df_A = pd.DataFrame(data={
    "d01": [0.028, 0.051, 0.084, 0.061, 0.015],
    "d02": [0.021, 0.063, 0.084, 0.061, 0.015],
    "d03": [0.028, 0.093, 0.084, 0.072, 0.015]})
target = [.02, .04, .06, .08, .15, .20, .25, .30]
df_B = pd.DataFrame(index=target, columns=df_A.columns).rename_axis('target', axis=0)
You can use a lambda function to calculate the percentage.
df_B.apply(lambda x: df_A.ge(x.name).sum().div(len(df_A)), axis=1).reset_index()
Out[249]:
target d01 d02 d03
0 0.02 0.8 0.8 0.8
1 0.04 0.6 0.6 0.6
2 0.06 0.4 0.6 0.6
3 0.08 0.2 0.2 0.4
4 0.15 0.0 0.0 0.0
5 0.20 0.0 0.0 0.0
6 0.25 0.0 0.0 0.0
7 0.30 0.0 0.0 0.0
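For completeness, a fully vectorised sketch of the same computation via NumPy broadcasting, which avoids the explicit double loop (it rebuilds df_A and df_T so it runs standalone):
import numpy as np
import pandas as pd

df_A = pd.DataFrame({'d01': [0.028, 0.051, 0.084, 0.061, 0.015],
                     'd02': [0.021, 0.063, 0.084, 0.061, 0.015],
                     'd03': [0.028, 0.093, 0.084, 0.072, 0.015]})
df_T = pd.DataFrame({'target': [.02, .04, .06, .08, .15, .20, .25, .30]})

# (8, 1, 1) targets vs (1, 5, 3) values -> (8, 5, 3) booleans;
# the mean over the row axis is the fraction of rows >= each target
ge = df_A.values[None, :, :] >= df_T['target'].values[:, None, None]
df_B = df_T.join(pd.DataFrame(ge.mean(axis=1), columns=df_A.columns))
print(df_B)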
