How to get maximums of multiple groups based on grouping column? - python

I have an initial dataset data grouped by id:
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
3 0.12 1.00
3 0.34 0.71
3 0.64 0.43
3 0.89 0.14
4 0.32 1.00
4 0.33 0.66
4 0.45 0.33
4 0.76 0.00
I am trying to predict the maximum y based on variable x while considering the groups. First, I train_test_split based on the groups:
data_train
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
and
data_test
id x y
3 0.12 1.00
3 0.34 0.66
3 0.64 0.33
3 0.89 0.00
4 0.33 1.00
4 0.32 0.66
4 0.45 0.33
4 0.76 0.00
After training the model and applying the model on data_test, I get:
y_hat
0.65
0.33
0.13
0.00
0.33
0.34
0.21
0.08
I am trying to transform y_hat so that the maximum in each of the initial groups is 1.00; otherwise, it is 0.00:
y_hat_transform
1.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
How would I do that? Note that the groups can be of varying sizes.
Edit: To simplify the problem, I have id_test and y_hat, where
id_test
3
3
3
3
4
4
4
4
and I am trying to get y_hat_transform.

id y
0 3 0.65
1 3 0.65
2 3 0.33
3 3 0.13
4 3 0.00
5 4 0.33
6 4 0.34
7 4 0.21
8 4 0.08
# Flag the row(s) that hold the maximum y within each id group.
# transform('max') returns a Series of the same length as the input, with each group's maximum repeated on every row of that group.
# The comparison yields booleans; casting to float turns them into 1.0 / 0.0.
id_y['y_transform'] = (id_y['y'] == id_y.groupby(['id'])['y'].transform('max')).astype(float)
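As a minimal, self-contained sketch of the approach above, the following rebuilds the frame from the id_test and y_hat values given in the question. The name group_max and the column y_transform_single are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Values copied from the question's id_test and y_hat
id_test = [3, 3, 3, 3, 4, 4, 4, 4]
y_hat = [0.65, 0.33, 0.13, 0.00, 0.33, 0.34, 0.21, 0.08]
id_y = pd.DataFrame({"id": id_test, "y": y_hat})

# Per-group maximum, repeated on every row of its group
group_max = id_y.groupby("id")["y"].transform("max")
id_y["y_transform"] = (id_y["y"] == group_max).astype(float)

# Hypothetical variant: if a group has tied maxima, the comparison above
# flags every tied row. idxmax picks only the first maximum per group.
first_max = id_y.groupby("id")["y"].idxmax()
id_y["y_transform_single"] = 0.0
id_y.loc[first_max.tolist(), "y_transform_single"] = 1.0

print(id_y)
```

With no ties in the data, both columns agree. Groups of varying sizes need no special handling because groupby works on whatever rows share an id.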

Related

Fill all the columns of the dataframe with condition

I am trying to fill a dataframe based on a condition, but I cannot find an appropriate solution. My dataframe is somewhat larger, but let's say that it looks like this:
0 1 2 3 4 5
0.32 0.40 0.60 1.20 3.40 0.00
0.17 0.12 0.00 1.30 2.42 0.00
0.31 0.90 0.80 1.24 4.35 0.00
0.39 0.00 0.90 1.50 1.40 0.00
I want to update the values so that once 0.00 appears in a row (as in rows 2 and 4), all the values from there to the end of that row are 0.00. Something like this:
0 1 2 3 4 5
0.32 0.40 0.60 1.20 3.40 0.00
0.17 0.12 0.00 0.00 0.00 0.00
0.31 0.90 0.80 1.24 4.35 0.00
0.39 0.00 0.00 0.00 0.00 0.00
I have tried with
for t in range(1, T - 1):
    data = np.where(df[t-1] == 0, 0, df[t])
and several other ways, but I couldn't get what I want.
Thanks!
Try as follows:
Select from df with df.eq(0). This will get us all zeros and the rest as NaN values.
Now, add df.ffill along axis=1. This will continue all the zeros through to the end of each row.
Finally, change the dtype to bool by chaining df.astype, thus turning all zeros into False, and all NaN values into True.
We feed the result to df.where. For all True values, we'll pick from the df itself, for all False values, we'll insert 0.
df = df.where(df[df.eq(0)].ffill(axis=1).astype(bool), 0)
print(df)
0 1 2 3 4 5
0 0.32 0.40 0.6 1.20 3.40 0.0
1 0.17 0.12 0.0 0.00 0.00 0.0
2 0.31 0.90 0.8 1.24 4.35 0.0
3 0.39 0.00 0.0 0.00 0.00 0.0
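The steps above can be sketched one at a time on the question's sample data. The intermediate names zeros_only, carried and keep are assumptions introduced here for readability; the final expression is the same as the one-liner in the answer.

```python
import pandas as pd

# Sample data from the question (columns are labeled 0..5)
df = pd.DataFrame([
    [0.32, 0.40, 0.60, 1.20, 3.40, 0.00],
    [0.17, 0.12, 0.00, 1.30, 2.42, 0.00],
    [0.31, 0.90, 0.80, 1.24, 4.35, 0.00],
    [0.39, 0.00, 0.90, 1.50, 1.40, 0.00],
])

# Step 1: keep only the zeros; every other cell becomes NaN
zeros_only = df[df.eq(0)]

# Step 2: carry each zero rightwards along its row
carried = zeros_only.ffill(axis=1)

# Step 3: NaN -> True (keep original value), 0.0 -> False (overwrite)
keep = carried.astype(bool)

# Step 4: where keep is False, write 0
result = df.where(keep, 0)
print(result)
```

Splitting the chain this way makes it easy to print each intermediate frame and see where a given cell switches from kept to overwritten.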

sklearn.svm.LinearSVC is inconsistent: it often predicts only 1 of 4 results, other times more, and rarely all 4

I have been working with this custom dataset that is supposed to predict out of 4 possibilities (A, B, C, D).
The training and test datasets contain an even number of each of the 4 possible results.
This dataset is then passed to a sklearn.svm.LinearSVC() model to predict.
For some reason, most of the time when I run the model on the test dataset, the report shows it either predicting a single result exclusively or, at best, never predicting one of the results while the remaining predictions are spread evenly.
Rarely (about one run in three) do I get a report where, as it should be, the predictions are spread evenly among all 4 results.
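from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# trainer is the asker's dataframe; TResult is the label column
X = trainer.drop(['TResult'], axis=1)
y = trainer['TResult']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
newModel = LinearSVC()
newModel.fit(X_train, y_train)
pred = newModel.predict(X_test)
rep = classification_report(y_test, pred)
print(rep)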
Report/Run 1:
precision recall f1-score support
A 0.00 0.00 0.00 56 <Nothing here
B 0.00 0.00 0.00 79 <Nothing here
C 0.24 1.00 0.38 66
D 0.00 0.00 0.00 78 <Nothing here
avg / total 0.06 0.24 0.09 279
Report/Run 2:
precision recall f1-score support
A 0.20 0.18 0.19 65
B 0.32 0.17 0.22 78 ]expected
C 0.22 0.59 0.32 51
D 0.50 0.26 0.34 85
avg / total 0.33 0.28 0.27 279
Report/Run 3:
precision recall f1-score support
A 0.32 0.75 0.45 67
B 0.00 0.00 0.00 79 <Nothing here
C 0.00 0.00 0.00 68 <Nothing here
D 0.33 0.62 0.43 65
avg / total 0.15 0.32 0.21 279
Report/Run 4:
precision recall f1-score support
A 0.23 0.08 0.12 59
B 0.24 0.75 0.36 69
C 0.00 0.00 0.00 73 <Nothing here
D 0.39 0.21 0.27 78
avg / total 0.22 0.26 0.19 279
Report/Run 5:
precision recall f1-score support
A 1.00 0.02 0.03 66
B 0.25 0.73 0.38 63 ]expected
C 0.60 0.04 0.08 71
D 0.37 0.43 0.40 79
avg / total 0.55 0.30 0.23 279

How can I divide an entire dataframe by certain rows?

I have a dataframe that looks like the one below, except much longer. Together, Var, Type, and Level identify a unique entry. I want to divide every entry by the Unexposed entry of the same group (e.g., 'Any All Exposed' would be divided by 'Any All Unexposed', whereas 'Any Existing Exposed' would be divided by 'Any Existing Unexposed').
Var Type Level Metric1 Metric2 Metric3
Any All Unexposed 34842 30783 -12
Any All Exposed 54167 54247 0.15
Any All LowExposure 20236 20311 0.37
Any All MediumExposure 15254 15388 0.87
Any All HighExposure 18677 18548 0.7
Any New Unexposed 0 23785 0
Any New Exposed 0 43030 0
Any New LowExposure 0 16356 0
Any New MediumExposure 0 12213 0
Any New HighExposure 0 14461 0
Any Existing Unexposed 34843 6998 -80
Any Existing Exposed 54167 11217 -80
Any Existing LowExposure 20236 3955 -81
Any Existing MediumExposure 15254 3175 -79
Any Existing HighExposure 18677 4087 -78
The most straightforward way to do this, I think, would be to create a MultiIndex, but I've tried a variety of methods to no avail (normally I get an error saying it can't divide on a non-unique index).
The expected result would look something like this, where every row is divided by the Unexposed row with the same Var and Type values:
Var Type Level Metric1 Metric2 Metric3 MP1 MP2 MP3
Any All Unexposed 34842 30783 -12 1.00 1.00 1.00
Any All Exposed 54167 54247 0.15 1.55 1.76 -0.01
Any All LowExposure 20236 20311 0.37 0.58 0.66 -0.03
Any All MediumExposure 15254 15388 0.87 0.44 0.50 -0.07
Any All HighExposure 18677 18548 0.7 0.54 0.60 -0.06
Any New Unexposed 0 23785 0 0.00 1.00 0.00
Any New Exposed 0 43030 0 0.00 1.81 0.00
Any New LowExposure 0 16356 0 0.00 0.69 0.00
Any New MediumExposure 0 12213 0 0.00 0.51 0.00
Any New HighExposure 0 14461 0 0.00 0.61 0.00
Any Existing Unexposed 34843 6998 -80 1.00 1.00 1.00
Any Existing Exposed 54167 11217 -80 1.55 1.60 1.00
Any Existing LowExposure 20236 3955 -81 0.58 0.57 1.01
Any Existing MediumExposure 15254 3175 -79 0.44 0.45 0.99
Any Existing HighExposure 18677 4087 -78 0.54 0.58 0.98
To divide every row in each Var/Type grouping by a specific Level, use groupby and divide.
For example, to divide by Unexposed, as in your example output:
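def divide_by(g, denom_lvl):
    # Divide every metric row in the group by the row whose Level is denom_lvl
    cols = ["Metric1", "Metric2", "Metric3"]
    num = g[cols]
    denom = g.loc[g.Level == denom_lvl, cols].iloc[0]
    # 0/0 gives NaN, which is reported as 0 here
    return num.divide(denom).fillna(0).round(2)

# group_keys=False keeps the original row index in the result
df.groupby(['Var', 'Type'], group_keys=False).apply(divide_by, denom_lvl='Unexposed')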
Output:
Metric1 Metric2 Metric3
0 1.00 1.00 1.00
1 1.55 1.76 -0.01
2 0.58 0.66 -0.03
3 0.44 0.50 -0.07
4 0.54 0.60 -0.06
5 0.00 1.00 0.00
6 0.00 1.81 0.00
7 0.00 0.69 0.00
8 0.00 0.51 0.00
9 0.00 0.61 0.00
10 1.00 1.00 1.00
11 1.55 1.60 1.00
12 0.58 0.57 1.01
13 0.44 0.45 0.99
14 0.54 0.58 0.98
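To get the MP columns attached to the original frame, as in the question's expected result, one option that avoids groupby.apply is to merge each row with its group's Unexposed row. This is only a sketch on a subset of the question's data (two Types, three Levels, two metrics); the names base, denom and ratios are assumptions introduced here, and it assumes exactly one Unexposed row per Var/Type group.

```python
import pandas as pd

# Subset of the question's data, with a default 0..n-1 index
df = pd.DataFrame({
    "Var": ["Any"] * 6,
    "Type": ["All"] * 3 + ["Existing"] * 3,
    "Level": ["Unexposed", "Exposed", "LowExposure"] * 2,
    "Metric1": [34842, 54167, 20236, 34843, 54167, 20236],
    "Metric2": [30783, 54247, 20311, 6998, 11217, 3955],
})
metrics = ["Metric1", "Metric2"]

# The Unexposed row of each (Var, Type) group
base = df.loc[df["Level"] == "Unexposed", ["Var", "Type"] + metrics]

# A left merge keeps the row order of df, so denom lines up row by row
denom = df[["Var", "Type"]].merge(base, on=["Var", "Type"], how="left")

# Divide positionally; NaN (from 0/0 or a missing Unexposed row) becomes 0
ratios = (df[metrics] / denom[metrics].to_numpy()).fillna(0).round(2)
ratios.columns = ["MP1", "MP2"]

out = pd.concat([df, ratios], axis=1)
print(out)
```

On this subset the MP1 and MP2 values match the corresponding rows of the expected result in the question.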
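I'm not sure I understood the question correctly. Would something like this do the trick?
You can loop over all unique combinations and perform the division.
metric_cols = ['Metric1', 'Metric2', 'Metric3']
for i in df['Var'].unique():
    for j in df['Type'].unique():
        grp = (df['Var'] == i) & (df['Type'] == j)
        exposed = df.loc[grp & (df['Level'] == 'Exposed'), metric_cols]
        unexposed = df.loc[grp & (df['Level'] == 'Unexposed'), metric_cols]
        # .to_numpy() avoids index alignment, which would otherwise yield NaN
        result = exposed / unexposed.to_numpy()
        ...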

Python(Pandas) - Create a column by matching column's values into dataframe

I have the below assumed dataframe
a b c d e F
0.02 0.62 0.31 0.67 0.27 a
0.30 0.07 0.23 0.42 0.00 a
0.82 0.59 0.34 0.73 0.29 a
0.90 0.80 0.13 0.14 0.07 d
0.50 0.62 0.94 0.34 0.53 d
0.59 0.84 0.95 0.42 0.54 d
0.13 0.33 0.87 0.20 0.25 d
0.47 0.37 0.84 0.69 0.28 e
Column F holds the name of one of the other columns of the dataframe.
For each row, I want to take the value from the column named in F and return these values in one new column.
The outcome will look like this:
a b c d e F To_Be_Filled
0.02 0.62 0.31 0.67 0.27 a 0.02
0.30 0.07 0.23 0.42 0.00 a 0.30
0.82 0.59 0.34 0.73 0.29 a 0.82
0.90 0.80 0.13 0.14 0.07 d 0.14
0.50 0.62 0.94 0.34 0.53 d 0.34
0.59 0.84 0.95 0.42 0.54 d 0.42
0.13 0.33 0.87 0.20 0.25 d 0.20
0.47 0.37 0.84 0.69 0.28 e 0.28
I am able to identify each case with the below, but not sure how to do it across the whole dataframe.
test.loc[test.iloc[:,5]=='a',test.columns=='a']
Many thanks in advance.
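You can use lookup (note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0, so this works only on older versions):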
df['To_Be_Filled'] = df.lookup(np.arange(len(df)), df['F'])
df
Out:
a b c d e F To_Be_Filled
0 0.02 0.62 0.31 0.67 0.27 a 0.02
1 0.30 0.07 0.23 0.42 0.00 a 0.30
2 0.82 0.59 0.34 0.73 0.29 a 0.82
3 0.90 0.80 0.13 0.14 0.07 d 0.14
4 0.50 0.62 0.94 0.34 0.53 d 0.34
5 0.59 0.84 0.95 0.42 0.54 d 0.42
6 0.13 0.33 0.87 0.20 0.25 d 0.20
7 0.47 0.37 0.84 0.69 0.28 e 0.28
np.arange(len(df)) can be replaced with df.index.
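Since DataFrame.lookup is not available in pandas 2.0 and later, the same row-by-row selection can be sketched with NumPy integer indexing. The names value_cols and col_pos are assumptions introduced here; the data is the question's sample.

```python
import numpy as np
import pandas as pd

rows = [
    [0.02, 0.62, 0.31, 0.67, 0.27, "a"],
    [0.30, 0.07, 0.23, 0.42, 0.00, "a"],
    [0.82, 0.59, 0.34, 0.73, 0.29, "a"],
    [0.90, 0.80, 0.13, 0.14, 0.07, "d"],
    [0.50, 0.62, 0.94, 0.34, 0.53, "d"],
    [0.59, 0.84, 0.95, 0.42, 0.54, "d"],
    [0.13, 0.33, 0.87, 0.20, 0.25, "d"],
    [0.47, 0.37, 0.84, 0.69, 0.28, "e"],
]
df = pd.DataFrame(rows, columns=["a", "b", "c", "d", "e", "F"])

value_cols = ["a", "b", "c", "d", "e"]

# Position of the column named in F, for every row.
# get_indexer returns -1 for a name that is not in value_cols, which
# would silently select the last column, so validate F first if needed.
col_pos = pd.Index(value_cols).get_indexer(df["F"])

# Pick one value per row: row i, column col_pos[i]
values = df[value_cols].to_numpy()
df["To_Be_Filled"] = values[np.arange(len(df)), col_pos]
print(df)
```

Restricting the lookup to value_cols keeps the array numeric, which would not be the case if the string column F were included.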

python pandas change dataframe to pivoted columns

I have a dataframe that looks as following:
Type Month Value
A 1 0.29
A 2 0.90
A 3 0.44
A 4 0.43
B 1 0.29
B 2 0.50
B 3 0.14
B 4 0.07
I want to change the dataframe to following format:
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
Is this possible?
Use set_index + unstack
df.set_index(['Month', 'Type']).Value.unstack()
Type A B
Month
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
To match your exact output
df.set_index(['Month', 'Type']).Value.unstack().rename_axis(None)
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
Pivot solution:
In [70]: df.pivot(index='Month', columns='Type', values='Value')
Out[70]:
Type A B
Month
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
In [71]: df.pivot(index='Month', columns='Type', values='Value').rename_axis(None)
Out[71]:
Type A B
1 0.29 0.29
2 0.90 0.50
3 0.44 0.14
4 0.43 0.07
This is a case of a long-format table that you want to transform to wide format.
This is natively handled in pandas:
df.pivot(index='Month', columns='Type', values='Value')
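As a self-contained sketch, the two approaches from the answers above can be run side by side on the question's data. The names wide_unstack and wide_pivot are assumptions introduced here.

```python
import pandas as pd

# Long-format data from the question
df = pd.DataFrame({
    "Type": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "Month": [1, 2, 3, 4, 1, 2, 3, 4],
    "Value": [0.29, 0.90, 0.44, 0.43, 0.29, 0.50, 0.14, 0.07],
})

# set_index + unstack: move Type from the row index into the columns
wide_unstack = df.set_index(["Month", "Type"])["Value"].unstack()

# pivot: name the index, column and value fields directly
wide_pivot = df.pivot(index="Month", columns="Type", values="Value")

print(wide_unstack)
print(wide_pivot)
```

Both calls require each (Month, Type) pair to appear only once; with duplicates they raise a ValueError, and pivot_table with an aggregation function is the usual alternative.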
