I have a dataframe, ran describe() on it, and then transposed the describe() table.
Now I want to add a "Skewness" column and a "Kurtosis" column to the right of the "max" column. The Skewness column should contain the skewness value for each row, and the Kurtosis column the kurtosis value for each row.
Below is the transposed describe() table, which I called "summary_transpose":
count mean std min 25% 50% 75% max
Unnamed: 0 1000.0 499.5 288.8 0.0 249.8 499.5 749.2 999.0
FINAL_MARGIN 1000.0 -1.2 15.3 -39.0 -8.0 -2.0 8.0 28.0
SHOT_NUMBER 1000.0 6.4 4.7 1.0 3.0 5.0 9.0 23.0
PERIOD 1000.0 2.5 1.1 1.0 2.0 2.0 4.0 6.0
SHOT_CLOCK 979.0 11.8 5.4 0.3 8.0 11.5 15.0 24.0
DRIBBLES 1000.0 1.6 2.9 0.0 0.0 1.0 2.0 23.0
TOUCH_TIME 1000.0 2.9 2.6 -4.3 0.9 2.1 4.2 20.4
SHOT_DIST 1000.0 12.3 7.8 0.1 5.6 10.4 18.5 41.6
PTS_TYPE 1000.0 2.2 0.4 2.0 2.0 2.0 2.0 3.0
CLOSE_DEF_DIST 1000.0 3.6 2.3 0.0 2.1 3.1 4.7 19.8
FGM 1000.0 0.5 0.5 0.0 0.0 0.0 1.0 1.0
PTS 1000.0 1.0 1.1 0.0 0.0 0.0 2.0 3.0
The code below adds the Skewness and Kurtosis columns next to the max column.
import scipy.stats as stats
summary = round(df.describe(), 1) # round each value to 1 decimal place
summary_transpose = summary.T # transposes the original summary table
summary_transpose['Skewness'] = stats.skew(df._get_numeric_data(), nan_policy='omit')
summary_transpose['Kurtosis'] = stats.kurtosis(df._get_numeric_data(), nan_policy='omit')
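If you prefer to stay within pandas and avoid the private _get_numeric_data() helper, a minimal sketch of the same idea is below; it assumes df is the frame you called describe() on. Note that pandas and SciPy apply slightly different bias corrections, so the values may not match exactly.
numeric = df.select_dtypes(include='number')      # the same columns describe() summarises
summary_transpose['Skewness'] = numeric.skew()    # aligns on the column names in the index
summary_transpose['Kurtosis'] = numeric.kurtosis()  # excess kurtosis, NaN values skipped by default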
I have a df which looks like this:
A B C
5.1 1.1 7.3
5.0 0.3 7.2
4.9 1.7 7.0
10.2 1.1 7.9
10.3 1.0 7.0
15.4 2.0 7.1
15.1 1.0 7.3
0.0 0.9 7.3
0.0 1.3 7.9
0.0 0.5 7.5
-5.1 1.0 7.3
-10.3 0.8 7.3
-10.1 1.0 7.1
I want to detect ranges (groups) of values in column "A", get the mean and std of all the columns for each group, and save the result in a new df.
Expected Output:
mean_A Std_A mean_B Std_B mean_C Std_C
5.0 ... 1.03 ... 7.17 ...
10.25 ... 1.05 ... 7.45 ...
... ... ... ... ... ...
So, I want to get the averages for each group of data based on column "A".
I am new to Python and SO; I hope I was able to explain my goal.
Groups are defined wherever the absolute difference between consecutive values in A is greater than 5; pass that key to GroupBy.agg and aggregate with mean and std:
df = df.groupby(df.A.diff().abs().gt(5).cumsum()).agg(['mean','std'])
df.columns = df.columns.map(lambda x: f'{x[1]}_{x[0]}')
print(df)
mean_A std_A mean_B std_B mean_C std_C
A
0 5.00 0.100000 1.033333 0.702377 7.166667 0.152753
1 10.25 0.070711 1.050000 0.070711 7.450000 0.636396
2 15.25 0.212132 1.500000 0.707107 7.200000 0.141421
3 0.00 0.000000 0.900000 0.400000 7.566667 0.305505
4 -5.10 NaN 1.000000 NaN 7.300000 NaN
5 -10.20 0.141421 0.900000 0.141421 7.200000 0.141421
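To see how the grouping key is built, here is the intermediate Series produced on the original df (before the aggregation above reassigns it): every time the jump in A exceeds 5, the counter increases, so consecutive "close" values share a label.
print(df.A.diff().abs().gt(5).cumsum())
0     0
1     0
2     0
3     1
4     1
5     2
6     2
7     3
8     3
9     3
10    4
11    5
12    5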
I am currently filtering my dataset with conditions such as the following:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
data1 = pd.DataFrame(data=np.c_[iris['data'], iris['target']],
                     columns=iris['feature_names'] + ['target'])
# filter dataset
data1[(data1['sepal length (cm)'] > 4) | (data1['sepal width (cm)'] > 3)]
I also want to get the next 10 rows following each match, and I am not sure how to even start. For example, when the filter finds a row where the length is greater than 4, I want to return that row as well as the next 10.
Please let me know how I can do this.
The dataset that you loaded has a sequential index starting at 0.
To get the 10 rows following a match, you are looking for rows whose index falls between the index of the most recent matching row and that index plus 10.
However, the filter you provided, (data1['sepal length (cm)'] > 4) | (data1['sepal width (cm)'] > 3), matches every row, because the minimum sepal length in the dataset is 4.3. So for this illustration I will use the filter sepal length (cm) == 4.6 and take the next 5 rows instead of 10.
filt = data1['sepal length (cm)'] == 4.6
data1.loc[filt, 'sentinel'] = data1.index[filt]  # record the index of every matching row
data1.sentinel = data1.sentinel.ffill()          # carry the most recent match index forward
data1[(data1.index >= data1.sentinel) & (data1.index <= data1.sentinel + 5)]  # each match plus the next 5 rows
This returns the 21 rows below:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) target sentinel
3 4.6 3.1 1.5 0.2 0.0 3.0
4 5.0 3.6 1.4 0.2 0.0 3.0
5 5.4 3.9 1.7 0.4 0.0 3.0
6 4.6 3.4 1.4 0.3 0.0 6.0
7 5.0 3.4 1.5 0.2 0.0 6.0
8 4.4 2.9 1.4 0.2 0.0 6.0
9 4.9 3.1 1.5 0.1 0.0 6.0
10 5.4 3.7 1.5 0.2 0.0 6.0
11 4.8 3.4 1.6 0.2 0.0 6.0
22 4.6 3.6 1.0 0.2 0.0 22.0
23 5.1 3.3 1.7 0.5 0.0 22.0
24 4.8 3.4 1.9 0.2 0.0 22.0
25 5.0 3.0 1.6 0.2 0.0 22.0
26 5.0 3.4 1.6 0.4 0.0 22.0
27 5.2 3.5 1.5 0.2 0.0 22.0
47 4.6 3.2 1.4 0.2 0.0 47.0
48 5.3 3.7 1.5 0.2 0.0 47.0
49 5.0 3.3 1.4 0.2 0.0 47.0
50 7.0 3.2 4.7 1.4 1.0 47.0
51 6.4 3.2 4.5 1.5 1.0 47.0
52 6.9 3.1 4.9 1.5 1.0 47.0
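Applying the same idea to the original request (each matching row plus the next 10) only requires swapping in your own condition and widening the window. A minimal sketch, assuming cond is whatever boolean filter you actually need:
cond = data1['sepal length (cm)'] > 4            # placeholder condition
data1.loc[cond, 'sentinel'] = data1.index[cond]  # index of each match
data1['sentinel'] = data1['sentinel'].ffill()
result = data1[(data1.index >= data1['sentinel']) & (data1.index <= data1['sentinel'] + 10)]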
I have a DataFrame df1 that looks like this:
userId movie1 movie2 movie3
0 4.1 0.0 1.0
1 3.1 1.1 3.4
2 2.8 0.0 1.7
3 0.0 5.0 0.0
4 0.0 0.0 0.0
5 2.3 0.0 2.0
and another DataFrame, df2 that looks like this:
userId movie4 movie5 movie6
0 4.1 0.0 1.0
1 3.1 1.1 3.4
2 2.8 0.0 1.7
3 0.0 5.0 0.0
4 0.0 0.0 0.0
5 2.3 0.0 2.0
How do I select one column from df2 and add it to df1? For example, adding movie6 to df1 would result:
userId movie1 movie2 movie3 movie6
0 4.1 0.0 1.0 1.0
1 3.1 1.1 3.4 3.4
2 2.8 0.0 1.7 1.7
3 0.0 5.0 0.0 0.0
4 0.0 0.0 0.0 0.0
5 2.3 0.0 2.0 2.0
You can concatenate along the columns (this relies on both frames sharing the same row index):
df1 = pd.concat([df1, df2['movie6']], axis=1)
You can merge on the shared column, userId:
df1 = df1.merge(df2[["userId","movie6"]], on="userId")
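If some userId values in df1 might be missing from df2 and you still want to keep every row of df1, a left merge is one option (a sketch):
df1 = df1.merge(df2[["userId", "movie6"]], on="userId", how="left")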
Having input dataframe:
x_1 x_2
0 0.0 0.0
1 1.0 0.0
2 2.0 0.2
3 2.5 1.5
4 1.5 2.0
5 -2.0 -2.0
and additional dataframe as follows:
index x_1_x x_2_x x_1_y x_2_y value dist dist_rank
0 0 0.0 0.0 0.1 0.1 5.0 0.141421 2.0
4 0 0.0 0.0 1.5 1.0 -2.0 1.802776 3.0
5 0 0.0 0.0 0.0 0.0 3.0 0.000000 1.0
9 1 1.0 0.0 0.1 0.1 5.0 0.905539 1.0
11 1 1.0 0.0 2.0 0.4 3.0 1.077033 3.0
14 1 1.0 0.0 0.0 0.0 3.0 1.000000 2.0
18 2 2.0 0.2 0.1 0.1 5.0 1.902630 3.0
20 2 2.0 0.2 2.0 0.4 3.0 0.200000 1.0
22 2 2.0 0.2 1.5 1.0 -2.0 0.943398 2.0
29 3 2.5 1.5 2.0 0.4 3.0 1.208305 3.0
30 3 2.5 1.5 2.5 2.5 4.0 1.000000 1.0
31 3 2.5 1.5 1.5 1.0 -2.0 1.118034 2.0
38 4 1.5 2.0 2.0 0.4 3.0 1.676305 3.0
39 4 1.5 2.0 2.5 2.5 4.0 1.118034 2.0
40 4 1.5 2.0 1.5 1.0 -2.0 1.000000 1.0
45 5 -2.0 -2.0 0.1 0.1 5.0 2.969848 2.0
46 5 -2.0 -2.0 1.0 -2.0 6.0 3.000000 3.0
50 5 -2.0 -2.0 0.0 0.0 3.0 2.828427 1.0
I want to create new columns in the input dataframe based on the additional dataframe, with respect to dist_rank: for each row of the input it should pull in the x_1_y, x_2_y and value entries of every dist_rank, so there is one set of new columns per rank.
I tried the following lines:
df['value_dist_rank1']=result.loc[result['dist_rank']==1.0, 'value']
df['value_dist_rank1 ']=result[result['dist_rank']==1.0]['value']
but both gave the same output:
x_1 x_2 value_dist_rank1
0 0.0 0.0 NaN
1 1.0 0.0 NaN
2 2.0 0.2 NaN
3 2.5 1.5 NaN
4 1.5 2.0 NaN
5 -2.0 -2.0 3.0
Here is a way to do it:
(For the sake of clarity I consider the input df as df1 and the additional df as df2.)
# First we group df2 by index so that all the information for each index ends up on one line
df2 = df2.groupby('index').agg(lambda x: list(x)).reset_index()

# Then we expand each list column into three new columns, since there are always three rows (one per rank) for each index
columns = ['dist_rank', 'value', 'x_1_y', 'x_2_y']
column_to_add = ['value', 'x_1_y', 'x_2_y']
for index, row in df2.iterrows():
    for i in range(3):
        column_names = ["{}_dist_rank{}".format(x, row.dist_rank[i])[:-2] for x in column_to_add]
        values = [row[x][i] for x in column_to_add]
        for column, value in zip(column_names, values):
            df2.loc[index, column] = value

# We drop the columns that are no longer useful
df2.drop(columns=columns + ['dist', 'x_1_x', 'x_2_x'], inplace=True)

# Finally we merge the modified df with our initial dataframe
result = df1.merge(df2, left_index=True, right_on='index', how='left')
Output:
x_1 x_2 index value_dist_rank2 x_1_y_dist_rank2 x_2_y_dist_rank2 \
0 0.0 0.0 0 5.0 0.1 0.1
1 1.0 0.0 1 3.0 0.0 0.0
2 2.0 0.2 2 -2.0 1.5 1.0
3 2.5 1.5 3 -2.0 1.5 1.0
4 1.5 2.0 4 4.0 2.5 2.5
5 -2.0 -2.0 5 5.0 0.1 0.1
value_dist_rank3 x_1_y_dist_rank3 x_2_y_dist_rank3 value_dist_rank1 \
0 -2.0 1.5 1.0 3.0
1 3.0 2.0 0.4 5.0
2 5.0 0.1 0.1 3.0
3 3.0 2.0 0.4 4.0
4 3.0 2.0 0.4 -2.0
5 6.0 1.0 -2.0 3.0
x_1_y_dist_rank1 x_2_y_dist_rank1
0 0.0 0.0
1 0.1 0.1
2 2.0 0.4
3 2.5 2.5
4 1.5 1.0
5 0.0 0.0
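A shorter route to the same columns (a sketch, using the original additional dataframe before the modifications above, and assuming each (index, dist_rank) pair appears only once) is to unstack on dist_rank and flatten the resulting column MultiIndex; the column order differs from the loop above but the content is the same:
wide = df2.set_index(['index', 'dist_rank'])[['value', 'x_1_y', 'x_2_y']].unstack()
wide.columns = [f'{col}_dist_rank{int(rank)}' for col, rank in wide.columns]
result = df1.join(wide)  # df1's default index matches the 'index' values 0..5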
I have a dataframe such as:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
5 11.4 5.6 3.2 1.6 0.8 1.0
Where the final row contains averages. I would like to rename the final row label to "A" so that the dataframe will look like this:
0 1 2 3 4 5
0 41.0 22.0 9.0 4.0 2.0 1.0
1 6.0 1.0 2.0 1.0 1.0 1.0
2 4.0 2.0 4.0 1.0 0.0 1.0
3 1.0 2.0 1.0 1.0 1.0 1.0
4 5.0 1.0 0.0 1.0 0.0 1.0
A 11.4 5.6 3.2 1.6 0.8 1.0
I understand columns can be renamed with df.columns = .... But how can I do this for a specific row label?
You can get the last index label using negative indexing, similar to plain Python:
last = df.index[-1]
Then
df = df.rename(index={last: 'A'})
Edit: If you are looking for a one-liner,
df.index = df.index[:-1].tolist() + ['A']
Use the index attribute:
df.index = df.index[:-1].append(pd.Index(['A']))
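If you are the one appending the averages row in the first place, you can also label it at creation time and skip the rename entirely; a minimal sketch, assuming df still holds only the original five rows:
df.loc['A'] = df.mean()  # appends the column means as a new row labelled 'A'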