If I run
import numpy as np
import pandas as pd
import sys
df = pd.read_csv(sys.argv[1]) # argv[0] is the script path; argv[1] is the first command-line argument
description = df.groupby(['option','subcase']).describe()
totals = df.groupby('option').describe().set_index(np.array(['total'] * df['option'].nunique()), append=True)
description = pd.concat([description, totals]).sort_index() # DataFrame.append was removed in pandas 2.0
print(description)
on this .csv file:
option,subcase,cost,time
A,sub1,4,3
A,sub1,2,0
A,sub2,3,8
A,sub2,1,2
B,sub1,13,0
B,sub1,11,0
B,sub2,5,2
B,sub2,3,4
I get output like this:
cost time \
count mean std min 25% 50% 75% max count
option subcase
A sub1 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0 2.0
sub2 2.0 2.0 1.414214 1.0 1.50 2.0 2.50 3.0 2.0
total 4.0 2.5 1.290994 1.0 1.75 2.5 3.25 4.0 4.0
B sub1 2.0 12.0 1.414214 11.0 11.50 12.0 12.50 13.0 2.0
sub2 2.0 4.0 1.414214 3.0 3.50 4.0 4.50 5.0 2.0
total 4.0 8.0 4.760952 3.0 4.50 8.0 11.50 13.0 4.0
mean std min 25% 50% 75% max
option subcase
A sub1 1.50 2.121320 0.0 0.75 1.5 2.25 3.0
sub2 5.00 4.242641 2.0 3.50 5.0 6.50 8.0
total 3.25 3.403430 0.0 1.50 2.5 4.25 8.0
B sub1 0.00 0.000000 0.0 0.00 0.0 0.00 0.0
sub2 3.00 1.414214 2.0 2.50 3.0 3.50 4.0
total 1.50 1.914854 0.0 0.00 1.0 2.50 4.0
This is annoying, especially if you want to save it as a .csv instead of displaying it in a console.
(e.g. python myscript.py my.csv > my.summary)
How do I stop this linebreak from happening?
Add the following option to disable the line wrapping:
pd.set_option('expand_frame_repr', False)
print(description)
cost time
count mean std min 25% 50% 75% max count mean std min 25% 50% 75% max
option subcase
A sub1 2.0 3.0 1.414214 2.0 2.50 3.0 3.50 4.0 2.0 1.50 2.121320 0.0 0.75 1.5 2.25 3.0
sub2 2.0 2.0 1.414214 1.0 1.50 2.0 2.50 3.0 2.0 5.00 4.242641 2.0 3.50 5.0 6.50 8.0
total 4.0 2.5 1.290994 1.0 1.75 2.5 3.25 4.0 4.0 3.25 3.403430 0.0 1.50 2.5 4.25 8.0
B sub1 2.0 12.0 1.414214 11.0 11.50 12.0 12.50 13.0 2.0 0.00 0.000000 0.0 0.00 0.0 0.00 0.0
sub2 2.0 4.0 1.414214 3.0 3.50 4.0 4.50 5.0 2.0 3.00 1.414214 2.0 2.50 3.0 3.50 4.0
total 4.0 8.0 4.760952 3.0 4.50 8.0 11.50 13.0 4.0 1.50 1.914854 0.0 0.00 1.0 2.50 4.0
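If the end goal is a file rather than console display, a more robust route is to bypass the repr entirely and write the frame with to_csv, which never wraps regardless of display options. A minimal sketch with the question's data inlined; in the real script you would pass an output path instead of capturing the string:

```python
import pandas as pd

df = pd.DataFrame({
    'option':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'subcase': ['sub1', 'sub1', 'sub2', 'sub2'] * 2,
    'cost': [4, 2, 3, 1, 13, 11, 5, 3],
    'time': [3, 0, 8, 2, 0, 0, 2, 4],
})
description = df.groupby(['option', 'subcase']).describe()

# With no path argument, to_csv returns the CSV as a string;
# description.to_csv('my.summary') would write the file directly.
csv_text = description.to_csv()
print(csv_text.splitlines()[0])  # top header row: the 'cost'/'time' column level
```

This also spares you the shell redirection (`python myscript.py my.csv > my.summary`), since the script writes the summary itself.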
I have the following pandas DataFrame called df:
timestamp param_1 param_2
0.000 -0.027655 0.0
0.25 -0.034012 0.0
0.50 -0.040369 0.0
0.75 -0.046725 0.0
1.00 -0.050023 0.0
1.25 -0.011015 0.0
1.50 -0.041366 0.0
1.75 -0.056723 0.0
2.00 -0.013081 0.0
Now I need to add two new columns created from the following lists:
timestamp_new = [0.5, 1.0, 1.5, 2.0]
param_3 = [10.0, 25.0, 15.0, 22.0]
The problem is that timestamp_new has a different granularity. Thus, I need to interpolate (linearly) both timestamp_new and param_3 in order to fit the granularity of timestamp in df.
Expected result (please notice that I interpolated param_3 values randomly just to show the format of an expected result):
timestamp param_1 param_2 param_3
0.000 -0.027655 0.0 8.0
0.25 -0.034012 0.0 9.0
0.50 -0.040369 0.0 10.0
0.75 -0.046725 0.0 20.0
1.00 -0.050023 0.0 25.0
1.25 -0.011015 0.0 18.0
1.50 -0.041366 0.0 15.0
1.75 -0.056723 0.0 17.0
2.00 -0.013081 0.0 22.0
Is there any way to do it?
Let's try reindex().interpolate:
ref_df = pd.Series(param_3, index=timestamp_new)
new_vals = (ref_df.reindex(df['timestamp'])
                  .interpolate('index')
                  .bfill()  # fill the leading NaNs
                  .ffill()  # fill the trailing NaNs
            )
df['param_3'] = df['timestamp'].map(new_vals)
Output:
timestamp param_1 param_2 param_3
0 0.00 -0.027655 0.0 10.0
1 0.25 -0.034012 0.0 10.0
2 0.50 -0.040369 0.0 10.0
3 0.75 -0.046725 0.0 17.5
4 1.00 -0.050023 0.0 25.0
5 1.25 -0.011015 0.0 20.0
6 1.50 -0.041366 0.0 15.0
7 1.75 -0.056723 0.0 18.5
8 2.00 -0.013081 0.0 22.0
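For comparison, np.interp gives a shorter route under the same assumptions (linear interpolation between the new timestamps, edge values clamped outside their range, which matches the bfill/ffill behaviour above). A sketch with the question's data inlined:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'timestamp': [0.0, 0.25, 0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 2.0],
    'param_1': [-0.027655, -0.034012, -0.040369, -0.046725, -0.050023,
                -0.011015, -0.041366, -0.056723, -0.013081],
    'param_2': 0.0,
})
timestamp_new = [0.5, 1.0, 1.5, 2.0]
param_3 = [10.0, 25.0, 15.0, 22.0]

# Linearly interpolate param_3 onto df's finer timestamp grid;
# values before 0.5 / after 2.0 are clamped to the edge values.
df['param_3'] = np.interp(df['timestamp'], timestamp_new, param_3)
print(df['param_3'].tolist())
# [10.0, 10.0, 10.0, 17.5, 25.0, 20.0, 15.0, 18.5, 22.0]
```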
I have a dataframe, used describe() on the dataframe, then inverted the describe() table.
Now, I want to add a column to this new table of skewness and kurtosis values.
I want to add a "Skewness" column and "Kurtosis" column to the right of "max" column. The Skewness column will have all of the skewness values of each row. The Kurtosis column will have the Kurtosis values of each row.
What you see below is the transposed describe() table, which I called "summary_transpose":
count mean std min 25% 50% 75% max
Unnamed: 0 1000.0 499.5 288.8 0.0 249.8 499.5 749.2 999.0
FINAL_MARGIN 1000.0 -1.2 15.3 -39.0 -8.0 -2.0 8.0 28.0
SHOT_NUMBER 1000.0 6.4 4.7 1.0 3.0 5.0 9.0 23.0
PERIOD 1000.0 2.5 1.1 1.0 2.0 2.0 4.0 6.0
SHOT_CLOCK 979.0 11.8 5.4 0.3 8.0 11.5 15.0 24.0
DRIBBLES 1000.0 1.6 2.9 0.0 0.0 1.0 2.0 23.0
TOUCH_TIME 1000.0 2.9 2.6 -4.3 0.9 2.1 4.2 20.4
SHOT_DIST 1000.0 12.3 7.8 0.1 5.6 10.4 18.5 41.6
PTS_TYPE 1000.0 2.2 0.4 2.0 2.0 2.0 2.0 3.0
CLOSE_DEF_DIST 1000.0 3.6 2.3 0.0 2.1 3.1 4.7 19.8
FGM 1000.0 0.5 0.5 0.0 0.0 0.0 1.0 1.0
PTS 1000.0 1.0 1.1 0.0 0.0 0.0 2.0 3.0
The code below adds the Skewness and Kurtosis columns next to the max column.
import scipy.stats as stats
numeric = df.select_dtypes(include='number') # public alternative to the private df._get_numeric_data()
summary = round(df.describe(), 1) # round each value to 1 decimal place
summary_transpose = summary.T # transpose the original summary table
summary_transpose['Skewness'] = stats.skew(numeric, nan_policy='omit')
summary_transpose['Kurtosis'] = stats.kurtosis(numeric, nan_policy='omit')
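A pandas-only alternative avoids scipy entirely; pandas' skew/kurt skip NaNs by default and align on column labels. A sketch on hypothetical numeric data (not the shots dataset); note pandas uses bias-corrected estimators, so the values differ slightly from scipy's defaults:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric frame standing in for df.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['a', 'b', 'c'])

summary_transpose = round(df.describe(), 1).T
# skew()/kurt() return Series indexed by column name, so the
# assignment aligns with the transposed summary's row labels.
summary_transpose['Skewness'] = df.skew(numeric_only=True)
summary_transpose['Kurtosis'] = df.kurt(numeric_only=True)
```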
Below is an example DataFrame.
0 1 2 3 4
0 0.0 13.00 4.50 30.0 0.0,13.0
1 0.0 13.00 4.75 30.0 0.0,13.0
2 0.0 13.00 5.00 30.0 0.0,13.0
3 0.0 13.00 5.25 30.0 0.0,13.0
4 0.0 13.00 5.50 30.0 0.0,13.0
5 0.0 13.00 5.75 0.0 0.0,13.0
6 0.0 13.00 6.00 30.0 0.0,13.0
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25
11 2.0 13.25 1.00 30.0 0.0,13.25
12 2.0 13.25 1.25 30.0 0.0,13.25
13 2.0 13.25 1.50 30.0 0.0,13.25
14 2.0 13.25 1.75 30.0 0.0,13.25
15 2.0 13.25 2.00 30.0 0.0,13.25
16 2.0 13.25 2.25 30.0 0.0,13.25
I want to split this into new dataframes when the row in column 0 changes.
0 1 2 3 4
0 0.0 13.00 4.50 30.0 0.0,13.0
1 0.0 13.00 4.75 30.0 0.0,13.0
2 0.0 13.00 5.00 30.0 0.0,13.0
3 0.0 13.00 5.25 30.0 0.0,13.0
4 0.0 13.00 5.50 30.0 0.0,13.0
5 0.0 13.00 5.75 0.0 0.0,13.0
6 0.0 13.00 6.00 30.0 0.0,13.0
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25
11 2.0 13.25 1.00 30.0 0.0,13.25
12 2.0 13.25 1.25 30.0 0.0,13.25
13 2.0 13.25 1.50 30.0 0.0,13.25
14 2.0 13.25 1.75 30.0 0.0,13.25
15 2.0 13.25 2.00 30.0 0.0,13.25
16 2.0 13.25 2.25 30.0 0.0,13.25
I've tried adapting the following solutions without any luck so far: "Split array at value in numpy" and "Split a large pandas dataframe".
Looks like you want to groupby the first column. You could build a dictionary from the groupby object, with the group keys as the dictionary keys:
out = dict(tuple(df.groupby(0)))
Or we could also build a list from the groupby object. This becomes more useful when we only want positional indexing rather than based on the grouping key:
out = [sub_df for _, sub_df in df.groupby(0)]
We could then index the dict based on the grouping key, or the list based on the group's position:
print(out[0])
0 1 2 3 4
0 0.0 13.0 4.50 30.0 0.0,13.0
1 0.0 13.0 4.75 30.0 0.0,13.0
2 0.0 13.0 5.00 30.0 0.0,13.0
3 0.0 13.0 5.25 30.0 0.0,13.0
4 0.0 13.0 5.50 30.0 0.0,13.0
5 0.0 13.0 5.75 0.0 0.0,13.0
6 0.0 13.0 6.00 30.0 0.0,13.0
Based on
I want to split this into new dataframes when the row in column 0 changes.
If you only want to start a new group when the value in column 0 changes, you can try:
d = dict([*df.groupby(df['0'].ne(df['0'].shift()).cumsum())])  # use df[0] if the column label is an int
print(d[1])
print(d[2])
0 1 2 3 4
0 0.0 13.0 4.50 30.0 0.0,13.0
1 0.0 13.0 4.75 30.0 0.0,13.0
2 0.0 13.0 5.00 30.0 0.0,13.0
3 0.0 13.0 5.25 30.0 0.0,13.0
4 0.0 13.0 5.50 30.0 0.0,13.0
5 0.0 13.0 5.75 0.0 0.0,13.0
6 0.0 13.0 6.00 30.0 0.0,13.0
0 1 2 3 4
7 1.0 13.25 0.00 30.0 0.0,13.25
8 1.0 13.25 0.25 0.0 0.0,13.25
9 1.0 13.25 0.50 30.0 0.0,13.25
10 1.0 13.25 0.75 30.0 0.0,13.25
I will use GroupBy.__iter__:
d = dict(df.groupby(df['0'].diff().ne(0).cumsum()).__iter__())
#d = dict(df.groupby(df[0].diff().ne(0).cumsum()).__iter__())
Note that if a value repeats non-consecutively, this creates a separate group for each run; plain groupby(0) would merge those runs into a single group.
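To make the distinction concrete, here is a self-contained sketch on toy data (not the frame above) showing that the shift/cumsum trick keeps repeated non-consecutive values in separate groups:

```python
import pandas as pd

# Toy column where the value 0.0 appears in two separate runs.
df = pd.DataFrame({0: [0.0, 0.0, 1.0, 1.0, 2.0, 0.0]})

run_id = df[0].ne(df[0].shift()).cumsum()  # new id each time the value changes
d = dict(tuple(df.groupby(run_id)))

print(len(d))            # 4 runs: [0,0], [1,1], [2], [0]
print(d[1][0].tolist())  # first run: [0.0, 0.0]
```

A plain `df.groupby(0)` would produce only 3 groups here, merging the two runs of 0.0.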
I'm using pandas in Python, and I have performed some crosstab calculations and concatenations, and at the end up with a data frame that looks like this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
The problem is that I want the last 4 rows, which start with Superior, to be placed right after the Total row, before the Regular block. So, simply, I want to swap the positions of the last 4 rows with the 4 rows that start with Regular. How can I achieve this in pandas, so that I get this:
ID 5 6 7 8 9 10 11 12 13
Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
A more generalized solution, using Categorical and argsort. I know this df is ordered, so ffill is safe here:
import numpy as np
s = df.ID
s = s.where(s.isin(['Total', 'Regular', 'Superior'])).ffill()
s = pd.Categorical(s, ['Total', 'Superior', 'Regular'], ordered=True)
df = df.iloc[np.argsort(s)]
df
Out[188]:
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
5 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
6 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
7 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
8 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
1 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
2 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
3 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
4 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
Here's one way:
import numpy as np
df.iloc[1:,:] = np.roll(df.iloc[1:,:].values, 4, axis=0)
ID 5 6 7 8 9 10 11 12 13
0 Total 87.0 3.0 9.0 6.0 92.0 7.0 3.0 3.0 20.0
1 Superior 15.0 1.0 1.0 1.0 11.0 0.0 0.0 0.0 2.0
2 CR 3.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0
3 HDG 5.0 1.0 1.0 1.0 4.0 0.0 0.0 0.0 0.0
4 PPG 7.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 2.0
5 Regular 72.0 2.0 8.0 5.0 81.0 7.0 3.0 3.0 18.0
6 CR 22.0 0.0 0.0 0.0 17.0 0.0 0.0 0.0 3.0
7 HDG 20.0 0.0 0.0 0.0 24.0 4.0 0.0 0.0 1.0
8 PPG 30.0 2.0 8.0 5.0 40.0 3.0 3.0 3.0 14.0
For a specific answer to this question, just use iloc
df.iloc[[0,5,6,7,8,1,2,3,4],:]
For a more generalized solution,
m = (df.ID.eq('Superior') | df.ID.eq('Regular')).cumsum()
pd.concat([df[m==0], df[m==2], df[m==1]])
or
order = (2,1)
pd.concat([df[m==0], *[df[m==c] for c in order]])
where order defines the mapping from previous ordering to new ordering.
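The mask-and-concat approach can be sanity-checked on a stripped-down frame (toy data, just the ID column and one value column):

```python
import pandas as pd

df = pd.DataFrame({'ID': ['Total', 'Regular', 'CR', 'HDG', 'PPG',
                          'Superior', 'CR', 'HDG', 'PPG'],
                   'val': range(9)})

# cumsum of the block-header hits labels each block:
# 0 = everything before 'Regular', 1 = Regular block, 2 = Superior block.
m = (df.ID.eq('Superior') | df.ID.eq('Regular')).cumsum()
out = pd.concat([df[m == 0], df[m == 2], df[m == 1]])
print(out.ID.tolist())
# ['Total', 'Superior', 'CR', 'HDG', 'PPG', 'Regular', 'CR', 'HDG', 'PPG']
```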
I am trying to compute the mean and variance along axis=1 of a dataframe using all but the last 5 columns (i.e. .iloc[:, :-5]). Naively, I would run:
df.groupby('id').agg([lambda x: x.iloc[:,:-5].mean(axis=1), lambda x: x.iloc[:,:-5].var(axis=1)])
but it throws the 'too many indexers' error.
EDIT
Some data:
0 1 2 3 4 5 6 7 8 9 Q1 Q2 Q3 Q4 id
0 3.0 3.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 3.0 12.0 0.83 80.0 1.000 11.0
1 3.0 3.0 4.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 14.0 1.60 80.0 1.000 11.0
2 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 13.0 1.40 75.0 1.000 11.0
3 3.0 3.0 4.0 4.0 4.0 3.0 3.0 3.0 3.0 3.0 12.0 0.50 80.0 0.848 11.0
4 3.0 4.0 4.0 4.0 7.0 7.0 5.0 4.0 4.0 2.0 13.0 1.74 70.0 0.883 11.0
13 3.0 3.0 2.0 2.0 2.0 2.0 3.0 2.0 3.0 3.0 12.0 3.67 45.0 1.000 14.0
14 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 13.0 3.67 48.0 0.848 14.0
15 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 12.0 1.67 70.0 0.848 14.0
16 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 NaN 2.0 12.0 3.33 60.0 0.848 14.0
17 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 12.0 3.33 60.0 0.848 14.0
25 4.0 4.0 6.0 5.0 NaN 6.0 4.0 3.0 NaN 4.0 11.0 3.36 85.0 0.796 17.0
26 4.0 5.0 4.0 7.0 6.0 5.0 4.0 6.0 7.0 5.0 8.0 4.76 50.0 0.725 17.0
27 4.0 4.0 3.0 4.0 5.0 4.0 5.0 3.0 3.0 5.0 9.0 3.33 50.0 0.725 17.0
28 3.0 4.0 4.0 3.0 4.0 4.0 NaN 3.0 NaN 3.0 10.0 3.12 75.0 0.725 17.0
29 3.0 3.0 2.0 NaN 2.0 1.0 NaN NaN 1.0 2.0 15.0 3.05 79.0 0.725 17.0
39 3.0 3.0 5.0 4.0 4.0 4.0 4.0 4.0 NaN 5.0 12.0 1.19 80.0 0.883 18.0
40 5.0 4.0 5.0 5.0 5.0 5.0 4.0 5.0 7.0 4.0 9.0 1.83 75.0 0.883 18.0
41 5.0 6.0 4.0 4.0 4.0 4.0 4.0 4.0 7.0 7.0 12.0 1.71 35.0 1.000 18.0
42 5.0 5.0 5.0 5.0 4.0 NaN 4.0 4.0 3.0 2.0 13.0 0.86 85.0 1.000 18.0
43 3.0 3.0 3.0 3.0 3.0 3.0 3.0 5.0 3.0 3.0 11.0 1.36 75.0 1.000 18.0
It seems you need to compute these first:
df['m'] = df.iloc[:,:-5].mean(axis=1)
df['v'] = df.iloc[:,:-5].var(axis=1)
and then aggregate if necessary.
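Putting the two steps together on a toy frame (hypothetical columns, with 2 trailing meta columns instead of the question's 5). Note the score block is sliced once, before the new columns are appended, so the positional slice stays correct:

```python
import pandas as pd

# Toy frame: score columns 0-1, then a meta column and the grouping id.
df = pd.DataFrame({0: [3.0, 3.0, 2.0],
                   1: [3.0, 4.0, 2.0],
                   'Q1': [12.0, 14.0, 13.0],
                   'id': [11.0, 11.0, 14.0]})

# Slice the score block BEFORE adding 'm'/'v', otherwise the
# positional :-2 slice would start picking up the new columns.
scores = df.iloc[:, :-2]   # in the question this would be .iloc[:, :-5]
df['m'] = scores.mean(axis=1)
df['v'] = scores.var(axis=1)

# ...then aggregate per id:
per_id = df.groupby('id')[['m', 'v']].mean()
print(per_id)
```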