sum column based on level selected in column header - python

I have a pd.DataFrame that looks like this. Note that the column names represent the level.
df
    PC     0     1    2    3
0  PC_1  0.74  0.25  0.1  0.0
1  PC_1  0.72  0.26  0.1  0.1
2  PC_2  0.80  0.18  0.2  0.0
3  PC_3  0.79  0.19  0.1  0.1
I want to create another four columns next to the existing ones and shift the values based on the selected level.
For example, if level = 1, df should look like this:
df
    PC     0     1    2    3  0_1          1_1  2_1  3_1
0  PC_1  0.74  0.25  0.1  0.0  0.0  (0.74+0.25)  0.1  0.0
1  PC_1  0.72  0.26  0.1  0.1  0.0  (0.72+0.26)  0.1  0.1
2  PC_2  0.80  0.18  0.2  0.0  0.0  (0.80+0.18)  0.2  0.0
3  PC_3  0.79  0.19  0.1  0.1  0.0  (0.79+0.19)  0.1  0.1
If level = 3:
df
    PC     0     1    2    3  0_3  1_3  2_3                      3_3
0  PC_1  0.74  0.25  0.1  0.0  0.0  0.0  0.0  sum(0.74+0.25+0.1+0.0)
1  PC_1  0.72  0.26  0.1  0.1  0.0  0.0  0.0  sum(0.72+0.26+0.1+0.1)
2  PC_2  0.80  0.18  0.2  0.0  0.0  0.0  0.0  sum(0.80+0.18+0.2+0.0)
3  PC_3  0.79  0.19  0.1  0.1  0.0  0.0  0.0  sum(0.79+0.19+0.1+0.1)
I don't know how to solve the problem and am looking for help.
Thank you in advance.

Set 'PC' as the index to make things easier. We zero out everything before the selected column, take the cumulative sum up to that column, and keep everything after it as-is.
df = df.set_index('PC')

def add_sum(df, level):
    i = df.columns.get_loc(level)
    df_add = (pd.concat([pd.DataFrame(0, index=df.index, columns=df.columns[:i]),
                         df.cumsum(1).iloc[:, i],
                         df.iloc[:, i+1:]],
                        axis=1)
                .add_suffix(f'_{level}'))
    return pd.concat([df, df_add], axis=1)

add_sum(df, '1')  # pass 1 if the column labels are ints
         0     1    2    3  0_1   1_1  2_1  3_1
PC
PC_1  0.74  0.25  0.1  0.0    0  0.99  0.1  0.0
PC_1  0.72  0.26  0.1  0.1    0  0.98  0.1  0.1
PC_2  0.80  0.18  0.2  0.0    0  0.98  0.2  0.0
PC_3  0.79  0.19  0.1  0.1    0  0.98  0.1  0.1
add_sum(df, '3')
         0     1    2    3  0_3  1_3  2_3   3_3
PC
PC_1  0.74  0.25  0.1  0.0    0    0    0  1.09
PC_1  0.72  0.26  0.1  0.1    0    0    0  1.18
PC_2  0.80  0.18  0.2  0.0    0    0    0  1.18
PC_3  0.79  0.19  0.1  0.1    0    0    0  1.18
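For reference, a minimal self-contained setup that reproduces the calls above (a sketch; string column labels are assumed, matching the quoted calls):
import pandas as pd

df = pd.DataFrame({'PC': ['PC_1', 'PC_1', 'PC_2', 'PC_3'],
                   '0': [0.74, 0.72, 0.80, 0.79],
                   '1': [0.25, 0.26, 0.18, 0.19],
                   '2': [0.1, 0.1, 0.2, 0.1],
                   '3': [0.0, 0.1, 0.0, 0.1]})
df = df.set_index('PC')
print(add_sum(df, '1'))
print(add_sum(df, '3'))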

As you wrote "based on level selected in column header" in the title, I understand that:
there is no "external" level variable;
the level (how many columns to sum) results just from the source column name.
So the task is actually to "concatenate" both of your expected results (you presented only how to compute columns 1_1 and 3_3) and compute the other new columns the same way.
The solution is surprisingly concise. Run the following one-liner:
df = df.join(df.iloc[:, 1:].cumsum(axis=1)
               .rename(lambda name: str(name) + '_1', axis=1))
Details:
df.iloc[:, 1:] - take all rows, starting from column 1 (column numbers start from 0).
cumsum(axis=1) - compute the cumulative sum, horizontally.
rename(..., axis=1) - rename the columns.
lambda name: str(name) + '_1' - lambda function to compute each new column name.
The result so far is the set of new columns.
df = df.join(...) - join with the original DataFrame and save the result back under df.
A self-contained run on the sample data follows below.
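Here is that run (a sketch; integer column labels are assumed, hence the str(name) conversion):
import pandas as pd

df = pd.DataFrame({'PC': ['PC_1', 'PC_1', 'PC_2', 'PC_3'],
                   0: [0.74, 0.72, 0.80, 0.79],
                   1: [0.25, 0.26, 0.18, 0.19],
                   2: [0.1, 0.1, 0.2, 0.1],
                   3: [0.0, 0.1, 0.0, 0.1]})
df = df.join(df.iloc[:, 1:].cumsum(axis=1)
               .rename(lambda name: str(name) + '_1', axis=1))
print(df)
#      PC     0     1    2    3   0_1   1_1   2_1   3_1
# 0  PC_1  0.74  0.25  0.1  0.0  0.74  0.99  1.09  1.09
# 1  PC_1  0.72  0.26  0.1  0.1  0.72  0.98  1.08  1.18
# 2  PC_2  0.80  0.18  0.2  0.0  0.80  0.98  1.18  1.18
# 3  PC_3  0.79  0.19  0.1  0.1  0.79  0.98  1.08  1.18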

Related

Calculate the sum of a pandas column depending on the change of values in another column

I have a dataframe as follows:
df =
   col_1  val_1
0    4.0   0.89
1    4.0   0.56
2   49.0   0.70
3   49.0   1.23
4   49.0   0.80
5   52.0   0.50
6   52.0   0.20
I want to calculate the sum of the column val_1 with a penalising factor that depends on changes in the values of col_1.
For example: when the value in col_1 changes from one row to the next, the val_1 entry of the row just before the change is reduced by a penalising factor of 0.4:
sum = 0.89 + (0.56 - 0.4) (because col_1 changes from 4.0 to 49.0) + 0.7 + 1.23 + (0.8 - 0.4) (because col_1 changes from 49.0 to 52.0) + 0.5 + 0.2
sum = 4.08
Is there a way to do this?
Use np.where to assign a new column, measuring the change against the next row with .shift(-1):
import numpy as np

df['val_1_adj'] = np.where(df['col_1'].ne(df['col_1'].shift(-1).ffill()),
                           df['val_1'].sub(0.4),
                           df['val_1'])
print(df)
   col_1  val_1  val_1_adj
0    4.0   0.89       0.89
1    4.0   0.56       0.16
2   49.0   0.70       0.70
3   49.0   1.23       1.23
4   49.0   0.80       0.40
5   52.0   0.50       0.50
6   52.0   0.20       0.20
df['val_1_adj'].sum()
4.08
Slight variation on @UmarH's answer:
df['penalties'] = np.where(~df.col_1.diff(-1).isin([0, np.nan]), 0.4, 0)
my_sum = (df['val_1'] - df['penalties']).sum()
print(my_sum)
Output:
4.08
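For clarity, here is the same penalty rule spelled out as a plain loop (a sketch to make the logic explicit; the vectorized versions above are preferable for large frames):
cols = df['col_1'].tolist()
vals = df['val_1'].tolist()
total = 0.0
for i, v in enumerate(vals):
    # penalise a row when col_1 changes on the next row
    if i + 1 < len(cols) and cols[i + 1] != cols[i]:
        v -= 0.4
    total += v
print(round(total, 2))  # 4.08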

I would like to copy a Dataframe and interpolate the values of this new dataframe to achieve Data Augmentation

My original dataframe looks like this:
No index  Value1  Value2  Value3
0         1.0     0.0     0.0
1         1.0     0.2     0.2
2         1.0     0.4     0.4
3         0.8     0.6     0.6
4         0.5     0.4     0.8
5         0.1     0.2     1.0
And what I want to achieve is the following:
No index  Value1  Value2  Value3
0         1.0     0.1     0.1
1         1.0     0.3     0.3
2         0.9     0.5     0.5
3         0.65    0.5     0.7
4         0.3     0.3     0.9
5         0.1     0.2     1.0
I would basically like to shift the new dataframe by one index, compute the average of each pair of original values, and keep the values in the last row the same.
Is there someone who can help me with this? Thank you in advance.
Use a rolling mean, then restore the last row from the original:
out = df.rolling(2).mean().shift(-1)
out.iloc[-1] = df.iloc[-1]  # keep the last row unchanged
Output:
>>> out
   Value1  Value2  Value3
0    1.00     0.1     0.1
1    1.00     0.3     0.3
2    0.90     0.5     0.5
3    0.65     0.5     0.7
4    0.30     0.3     0.9
5    0.10     0.2     1.0
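The same pairwise average can also be written with shift, which may make the intent clearer (a sketch, equivalent to the rolling version above):
out = (df + df.shift(-1)) / 2  # average each row with the next one
out.iloc[-1] = df.iloc[-1]     # keep the last row unchanged
print(out)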

Losing the label column while doing aggregate function on a dataframe

I've been trying to run an aggregate function on a dataframe that consists of numbers and strings. While doing so, I realized that the string data goes missing. I want to keep the string data (the label), as I need it to label the result of the aggregation. Here is what I've coded:
def function(df):
    l_dfrange = []
    step = 10
    gr = df.groupby(['label'], as_index=False)
    l_grouped = list(gr)
    for i in range(len(l_grouped)):
        df_range = pd.DataFrame(l_grouped[i][1])
        df_range["ID"] = np.arange(len(df_range)) // step
        df_range = df_range.groupby("ID").agg([np.mean, np.std])
        l_dfrange.append(df_range)
    return l_dfrange, df_range
Initial dataframe :
gyro_x gyro_y gyro_z label
1 0.05 0.05 0.6 jump
2 0.03 0.03 0.6 jump
3 0.02 0.04 0.6 jump
4 0.08 0.09 0.6 stand
5 0.03 0.03 0.6 stand
6 0.02 0.04 0.6 stand
7 0.05 0.05 0.6 jump
8 0.03 0.03 0.6 jump
9 0.02 0.04 0.6 jump
Result that I want:
Note that for this example, I limit each group to only 3 rows, and the rows are sorted by label and ID to identify the groups.
gyro_x gyro_y gyro_z label ID
1 0.05 0.05 0.6 jump 1
2 0.03 0.03 0.6 jump 1
3 0.02 0.04 0.6 jump 1
7 0.05 0.05 0.6 jump 2
8 0.03 0.03 0.6 jump 2
9 0.02 0.04 0.6 jump 2
4 0.08 0.09 0.6 stand 3
5 0.03 0.03 0.6 stand 3
6 0.02 0.04 0.6 stand 3
The end result that I want:
ID mean_gyro_x std_gyro_x mean_gyro_y std_gyro_y label
1 0.05 0.05 0.6 0.6 jump
2 0.05 0.05 0.6 0.6 jump
3 0.03 0.03 0.6 0.6 stand
I combined the first 3 rows in the example to get the aggregate result while also keeping the label (as the rows were grouped by their label beforehand). Is there any way I could keep the label? Also, can I change the type to a dataframe? When I turn l_dfrange into a dataframe, it always returns the feature names (columns) but no data.
I created tmp.csv as follows:
gyro_x,gyro_y,gyro_z,label
0.05,0.05,0.6,jump
0.03,0.03,0.6,jump
0.02,0.04,0.6,jump
0.08,0.09,0.6,stand
0.03,0.03,0.6,stand
0.02,0.04,0.6,stand
Pythonic style and pandas make this very concise, as you can see below:
import numpy as np
import pandas as pd
df = pd.read_csv('tmp.csv')
print(df)
df = df.groupby('label').agg({'gyro_x': ['mean', 'std'], 'gyro_y': ['mean', 'std']}).reset_index()
df.columns = ['label', 'mean_gyro_x', 'std_gyro_x', 'mean_gyro_y', 'std_gyro_y']
print(df)
Output:
gyro_x gyro_y gyro_z label
0 0.05 0.05 0.6 jump
1 0.03 0.03 0.6 jump
2 0.02 0.04 0.6 jump
3 0.08 0.09 0.6 stand
4 0.03 0.03 0.6 stand
5 0.02 0.04 0.6 stand
label mean_gyro_x std_gyro_x mean_gyro_y std_gyro_y
0 jump 0.033333 0.015275 0.040000 0.010000
1 stand 0.043333 0.032146 0.053333 0.032146
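If you also need the chunking into fixed-size groups per label (the ID column from the question), here is a sketch along the same lines (starting again from the raw frame; step is the chunk size; named aggregation requires pandas >= 0.25):
df = pd.read_csv('tmp.csv')  # start again from the raw frame
step = 3  # rows per chunk, as in the question's example
df['ID'] = df.groupby('label').cumcount() // step
result = (df.groupby(['label', 'ID'])
            .agg(mean_gyro_x=('gyro_x', 'mean'), std_gyro_x=('gyro_x', 'std'),
                 mean_gyro_y=('gyro_y', 'mean'), std_gyro_y=('gyro_y', 'std'))
            .reset_index())
print(result)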

Pandas Series Resample

I have the following pandas series:
dummy_array = pd.Series(np.array(range(-10, 11)), index=(np.array(range(0, 21))/10))
This yields the following series:
0.0 -10
0.1 -9
0.2 -8
0.3 -7
0.4 -6
0.5 -5
0.6 -4
0.7 -3
0.8 -2
0.9 -1
1.0 0
1.1 1
1.2 2
1.3 3
1.4 4
1.5 5
1.6 6
1.7 7
1.8 8
1.9 9
2.0 10
If I want to resample, how can I do it? I read the docs and they suggested this:
dummy_array.resample('20S').mean()
But it's not working. Any ideas?
Thank you.
Edit:
I want my final vector to have double the frequency. So something like this:
0.0 -10
0.05 -9.5
0.1 -9
0.15 -8.5
0.2 -8
0.25 -7.5
etc.
Here is a solution using np.linspace(), .reindex() and interpolate(). Note that .resample() only works on a datetime-like index, which is why your attempt fails.
The series dummy_array is created as described above.
# get properties of the original index
start = dummy_array.index.min()
end = dummy_array.index.max()
num_gridpoints_orig = dummy_array.index.size
# calc the number of grid-points in the new index
num_gridpoints_new = (num_gridpoints_orig * 2) - 1
# create the new index, with twice the number of grid-points (i.e., smaller step-size)
idx_new = np.linspace(start, end, num_gridpoints_new)
# re-index the series. New grid-points get a value of NaN,
# and we replace these NaNs with interpolated values
df2 = dummy_array.reindex(index=idx_new).interpolate()
print(df2.head())
0.00 -10.0
0.05 -9.5
0.10 -9.0
0.15 -8.5
0.20 -8.0
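To upsample by a general factor k instead of 2, only the grid-point count changes (a sketch; k is a name introduced here, and k = 2 recovers the formula above):
k = 4  # hypothetical upsampling factor
num_gridpoints_new = (num_gridpoints_orig - 1) * k + 1
idx_new = np.linspace(start, end, num_gridpoints_new)
df2 = dummy_array.reindex(index=idx_new).interpolate()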
Create a list of the midpoints based on the original array, then break it down into values and indices to create a new pd.Series. Concatenate the new pd.Series with the original and reorder it.
# new list of midpoints (half a step along the index, half a value step up)
ups = [[x + 0.05, y + 0.5] for x, y in zip(dummy_array.index, dummy_array)]
idx = [i[0] for i in ups]
val = [i[1] for i in ups]
d2 = pd.Series(val, index=idx)
d3 = pd.concat([dummy_array, d2], axis=0)
d3.sort_values(inplace=True)
d3
0.00 -10.0
0.05 -9.5
0.10 -9.0
0.15 -8.5
0.20 -8.0
0.25 -7.5
0.30 -7.0
0.35 -6.5
0.40 -6.0
0.45 -5.5
0.50 -5.0
0.55 -4.5
0.60 -4.0
0.65 -3.5
0.70 -3.0
0.75 -2.5
0.80 -2.0
0.85 -1.5
0.90 -1.0
0.95 -0.5
1.00 0.0
1.05 0.5
1.10 1.0
1.15 1.5
1.20 2.0
1.25 2.5
1.30 3.0
1.35 3.5
1.40 4.0
1.45 4.5
1.50 5.0
1.55 5.5
1.60 6.0
1.65 6.5
1.70 7.0
1.75 7.5
1.80 8.0
1.85 8.5
1.90 9.0
1.95 9.5
2.00 10.0
2.05 10.5
dtype: float64
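Note that sort_values happens to work here only because the values increase along with the index; sorting by the index is the more robust choice:
d3 = pd.concat([dummy_array, d2]).sort_index()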
Thank you all for your contributions. After looking at the answers and thinking a bit more, I found a more generic solution that should handle every possible case. In this case, I wanted to upsample dummy_arrayA to the same index as dummy_arrayB. What I did was create a new index containing both A and B. I then used the reindex and interpolate functions to calculate the new values, and at the end I dropped the old index points so that I get the same array size as dummy_arrayB.
import pandas as pd
import numpy as np

# Create dummy arrays
dummy_arrayA = pd.Series(np.array(range(0, 4)), index=[0, 0.5, 1.0, 1.5])
dummy_arrayB = pd.Series(np.array(range(0, 5)), index=[0, 0.4, 0.8, 1.2, 1.6])

# Create a new index based on array A, then merge in B's index
new_ind = pd.Index(dummy_arrayA.index)
new_ind = new_ind.union(dummy_arrayB.index)

# reindex copies all existing values and inserts NaN for the missing points;
# interpolate(method="index") then fills the NaNs based on the index values.
df2 = dummy_arrayA.reindex(index=new_ind).interpolate(method="index")

# Find the points common to both indexes, so they are not deleted too.
common = dummy_arrayA.index.intersection(dummy_arrayB.index)
old_only = dummy_arrayA.index.difference(common)

# Drop the old points so the final array matches dummy_arrayB
df2 = df2.drop(old_only)
print(df2)

How to apply scipy.stats.describe to each group?

I would appreciate it if you could let me know how to apply scipy.stats.describe to calculate summary statistics by group. My data (TrainSet) looks like this:
Financial Distress x1 x2 x3
0 1.28 0.02 0.87
0 1.27 0.01 0.82
0 1.05 -0.06 0.92
1 1.11 -0.02 0.86
0 1.06 0.11 0.81
0 1.06 0.08 0.88
1 0.87 -0.03 0.79
I want to compute the summary statistics by "Financial Distress". I mean something like this post, but via scipy.stats.describe, because I need skewness and kurtosis for x1, x2, and x3 by group. However, my code doesn't produce the statistics by group:
desc = dict()
for col in TrainSet.columns:
    if [TrainSet["Financial Distress"] == 0]:
        desc[col] = describe(TrainSet[col]())
df = pd.DataFrame.from_dict(desc, orient='index')
df.to_csv("Descriptive Statistics3.csv")
In fact, I need something like this:
Group 0 1
statistics nobs minmax mean variance skewness kurtosis nobs minmax mean variance skewness kurtosis
Financial Distress 2569 (0, 1) 0.0 0.0 4.9 22.1 50 (0, 1) 0.0 0.0 2.9 22.1
x1 2569 (0.1, 38) 1.4 1.7 16.5 399.9 50 (-3.6, 3.8) 0.3 0.1 0.5 21.8
x2 2569 (-0.2, 0.7) 0.1 0.0 1.0 1.8 50 (-0.3, 0.7) 0.1 0.0 0.9 1.2
x3 2569 (0.1, 0.9) 0.6 0.0 -0.5 -0.2 50 (0.1, 0.9) 0.6 0.0 -0.6 -0.3
x4 2569 (5.3, 6.3) 0.9 0.3 3.2 19.7 50 (-26, 38) 14.0 12.0 15.1 26.5
x5 2569 (-0.2, 0.8) 0.2 0.0 0.8 1.4 50 (0.3, 0.9) 0.4 0.0 0.5 -0.3
Or
nobs minmax mean variance skewness kurtosis
x1 0 5 (1.05, 1.28) 1.144 0.01433 4.073221e-01 -1.825477
1 2 (0.87, 1.11) 0.990 0.02880 1.380350e-15 -2.000000
x2 0 5 (-0.06, 0.11) 0.032 0.00437 -1.992376e-01 -1.130951
1 2 (-0.03, -0.02) -0.025 0.00005 1.058791e-15 -2.000000
x3 0 5 (0.81, 0.92) 0.860 0.00205 1.084093e-01 -1.368531
1 2 (0.79, 0.86) 0.825 0.00245 4.820432e-15 -2.000000
Thanks in advance,
If you wish to describe three series independently by group, it seems you'll need three dataframes. You can construct these dataframes and then concatenate them:
from scipy.stats import describe

grouper = df.groupby('FinancialDistress')
variables = df.columns[1:]
res = pd.concat([pd.DataFrame(describe(g[x]) for _, g in grouper)
                   .reset_index().assign(cat=x).set_index(['cat', 'index'])
                 for x in variables], axis=0)
print(res)
nobs minmax mean variance skewness kurtosis
cat index
x1 0 5 (1.05, 1.28) 1.144 0.01433 4.073221e-01 -1.825477
1 2 (0.87, 1.11) 0.990 0.02880 1.380350e-15 -2.000000
x2 0 5 (-0.06, 0.11) 0.032 0.00437 -1.992376e-01 -1.130951
1 2 (-0.03, -0.02) -0.025 0.00005 1.058791e-15 -2.000000
x3 0 5 (0.81, 0.92) 0.860 0.00205 1.084093e-01 -1.368531
1 2 (0.79, 0.86) 0.825 0.00245 4.820432e-15 -2.000000
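If staying within pandas is an option, here is a rough equivalent (a sketch; pandas' skew and kurt use unbiased estimators, so the numbers can differ slightly from scipy's biased defaults):
grouped = df.groupby('FinancialDistress')[['x1', 'x2', 'x3']]
summary = grouped.agg(['count', 'min', 'max', 'mean', 'var', 'skew'])
# older pandas has no groupby kurtosis aggregation, so compute it per group
kurt = grouped.apply(lambda g: g.kurt())
print(summary)
print(kurt)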
