Slicing a pandas dataframe based on a less-or-equal criterion - python

I bet this question has been answered a number of times, but I am struggling to find a definitive solution.
I need to delete dataframe rows based on a greater-or-equal condition. Because the column is float64, the "equal" part of the condition never matches exactly. Splitting the condition into two comparisons seems cumbersome and not very pandorable. Can someone help me find a solution?
Thanks.
Dataframe:
Sg Sw temp_S Krg Krw Pc
0 0.00 1.00 -5.263158e-02 0.000000 0.650000 0.000000
1 0.05 0.95 -4.382459e-17 0.000000 0.650000 0.000000
2 0.10 0.90 5.263158e-02 0.000000 0.593548 0.095790
3 0.15 0.85 1.052632e-01 0.000000 0.537097 0.107775
4 0.20 0.80 1.578947e-01 0.000000 0.480645 0.122121
5 0.25 0.75 2.105263e-01 0.000000 0.424194 0.139496
6 0.30 0.70 2.631579e-01 0.000000 0.367742 0.160837
7 0.35 0.65 3.157895e-01 0.000000 0.311290 0.187397
8 0.36 0.64 3.263158e-01 0.000000 0.300000 0.193483
9 0.40 0.60 3.684211e-01 0.014167 0.230400 0.221009
Slicing:
print(object.sc_df[object.sc_df['Sg'].values > 0.05])
Output:
Sg Sw temp_S Krg Krw Pc
2 0.10 0.90 0.052632 0.000000 0.593548 0.095790
3 0.15 0.85 0.105263 0.000000 0.537097 0.107775
4 0.20 0.80 0.157895 0.000000 0.480645 0.122121
5 0.25 0.75 0.210526 0.000000 0.424194 0.139496
6 0.30 0.70 0.263158 0.000000 0.367742 0.160837
7 0.35 0.65 0.315789 0.000000 0.311290 0.187397
8 0.36 0.64 0.326316 0.000000 0.300000 0.193483
9 0.40 0.60 0.368421 0.014167 0.230400 0.221009
As you can see, row 1 is missing. What would be the best way to satisfy the "equal" part of the condition?
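One way to satisfy the "equal" part despite float noise is a tolerance-based comparison. A minimal sketch, assuming numpy's default isclose tolerances suit the data:
import numpy as np

sg = object.sc_df['Sg'].values
# keep rows where Sg is strictly greater than, or numerically equal to, 0.05
print(object.sc_df[(sg > 0.05) | np.isclose(sg, 0.05)])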

Related

How to get maximums of multiple groups based on grouping column?

I have an initial dataset data grouped by id:
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
3 0.12 1.00
3 0.34 0.71
3 0.64 0.43
3 0.89 0.14
4 0.32 1.00
4 0.33 0.66
4 0.45 0.33
4 0.76 0.00
I am trying to predict the maximum y based on variable x while considering the groups. First, I split the data with train_test_split based on the groups:
data_train
id x y
1 0.21 1.00
1 0.34 0.66
1 0.35 0.33
1 0.94 0.00
2 0.11 1.00
2 0.90 0.66
2 0.31 0.33
2 0.33 0.00
and
data_test
id x y
3 0.12 1.00
3 0.34 0.66
3 0.64 0.33
3 0.89 0.00
4 0.33 1.00
4 0.32 0.66
4 0.45 0.33
4 0.76 0.00
After training the model and applying the model on data_test, I get:
y_hat
0.65
0.33
0.13
0.00
0.33
0.34
0.21
0.08
I am trying to transform y_hat so that the maximum in each of the initial groups is 1.00; otherwise, it is 0.00:
y_hat_transform
1.00
0.00
0.00
0.00
0.00
1.00
0.00
0.00
How would I do that? Note that the groups can be of varying sizes.
Edit: To simplify the problem, I have id_test and y_hat, where
id_test
3
3
3
3
4
4
4
4
and I am trying to get y_hat_transform.
id y
0 3 0.65
1 3 0.33
2 3 0.13
3 3 0.00
4 4 0.33
5 4 0.34
6 4 0.21
7 4 0.08
# Mark the max row per group with 1.0 and every other row with 0.0
# (1.0/0.0 is binary, so the boolean can be cast straight to float);
# transform returns a column of the same length with each group's max repeated
id_y['y_transform'] = (id_y['y'] == id_y.groupby('id')['y'].transform('max')).astype(float)
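A self-contained version of the same idea, rebuilding the simplified id_test/y_hat frame by hand (the literal values are copied from the edit above, purely for illustration):
import pandas as pd

id_y = pd.DataFrame({
    'id': [3, 3, 3, 3, 4, 4, 4, 4],
    'y': [0.65, 0.33, 0.13, 0.00, 0.33, 0.34, 0.21, 0.08],
})
# 1.0 where y equals its group's maximum, 0.0 elsewhere
id_y['y_transform'] = (id_y['y'] == id_y.groupby('id')['y'].transform('max')).astype(float)
print(id_y['y_transform'].tolist())
# [1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
Note that if a group had ties at its maximum, every tied row would get 1.0.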

divide all columns by each column in pandas

I have a data frame and I am trying to create a new data frame consisting of the ratios of each pair of its columns.
I tried the logic mentioned below:
df_new = pd.concat([df[df.columns.difference([col])].div(df[col], axis=0)
                    .add_suffix('/R') for col in df.columns], axis=1)
Output is:
B/R C/R D/R A/R C/R D/R A/R B/R D/R A/R B/R C/R
0 0.46 1.16 0.78 2.16 2.50 1.69 0.86 0.40 0.68 1.28 0.59 1.48
1 1.05 1.25 1.64 0.95 1.19 1.55 0.80 0.84 1.30 0.61 0.64 0.77
2 1.56 2.78 2.78 0.64 1.79 1.79 0.36 0.56 1.00 0.36 0.56 1.00
3 0.54 2.23 0.35 1.86 4.14 0.64 0.45 0.24 0.16 2.89 1.56 6.44
However, I am facing two issues here. First, I get both A/B and B/A, which are redundant and double the number of columns. Is there a way to keep only A/B and eliminate B/A?
Second, the add_suffix naming does not convey which column is divided by which. Is there a way to create column names like A/B for column A divided by column B?
Use itertools.combinations and divide the column pairs in a dictionary comprehension:
df = pd.DataFrame({
'A':[5,3,6,9,2,4],
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,8],
})
from itertools import combinations
L = {f'{a}/{b}': df[a].div(df[b]) for a, b in combinations(df.columns, 2)}
df = pd.concat(L, axis=1)
print(df)
A/B A/C A/D B/C B/D C/D
0 1.25 0.714286 5.000000 0.571429 4.000000 7.000000
1 0.60 0.375000 1.000000 0.625000 1.666667 2.666667
2 1.50 0.666667 1.200000 0.444444 0.800000 1.800000
3 1.80 2.250000 1.285714 1.250000 0.714286 0.571429
4 0.40 1.000000 2.000000 2.500000 5.000000 2.000000
5 1.00 1.333333 0.500000 1.333333 0.500000 0.375000

Calculating the share of each code, by ID

I have this data-frame:
ID code X X_total
A 456 40 40
A 789 0 40
B 123 75 100
B 987 25 100
C 789 13 91
C 987 0 91
C 123 35 91
C 456 43 91
I want to calculate the share of each code (from [123, 456, 789, 987]) by dividing X by X_total, for each ID.
Expected result:
ID share_123 share_456 share_789 share_987
A 0.00 1.00 0.00 0.00
B 0.75 0.00 0.00 0.25
C 0.38 0.47 0.14 0.00
Let us do it with crosstab:
s = pd.crosstab(df.ID, df.code, df.X, aggfunc='sum', normalize='index').add_prefix("share_")
Out[70]:
code 123 456 789 987
ID
A 0.000000 1.000000 0.000000 0.00
B 0.750000 0.000000 0.000000 0.25
C 0.384615 0.472527 0.142857 0.00
Or with df.pivot, following your logic:
df.assign(k=df['X'].div(df['X_total'])).pivot(index="ID", columns="code", values="k").fillna(0)
code 123 456 789 987
ID
A 0.000000 1.000000 0.000000 0.00
B 0.750000 0.000000 0.000000 0.25
C 0.384615 0.472527 0.142857 0.00
Adding formatting:
(df.assign(k=df['X'].div(df['X_total'])).pivot(index="ID", columns="code", values="k").fillna(0)
 .add_prefix("share_").round(2).rename_axis(None, axis=1).reset_index())
ID share_123 share_456 share_789 share_987
0 A 0.00 1.00 0.00 0.00
1 B 0.75 0.00 0.00 0.25
2 C 0.38 0.47 0.14 0.00
Another approach with groupby + unstack
df['X'].div(df['X_total']).groupby([df['ID'], df['code']]).sum().unstack(fill_value=0)
code 123 456 789 987
ID
A 0.000000 1.000000 0.000000 0.00
B 0.750000 0.000000 0.000000 0.25
C 0.384615 0.472527 0.142857 0.00

Correlation between values

I want to compute the correlation in this DataFrame, but not leave it the way it is shown below; I want to rank the values from lowest to largest.
import pandas as pd
import numpy as np
rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()
corr.style.background_gradient().set_precision(2)
0 1 2 3 4 5 6 7
0 1 0.42 0.031 -0.16 -0.35 0.23 -0.22 0.4
1 0.42 1 -0.24 -0.55 0.011 0.3 -0.26 0.23
2 0.031 -0.24 1 0.29 0.44 0.29 0.23 0.25
3 -0.16 -0.55 0.29 1 -0.33 -0.42 0.58 -0.37
4 -0.35 0.011 0.44 -0.33 1 0.46 0.074 0.19
5 0.23 0.3 0.29 -0.42 0.46 1 -0.41 0.71
6 -0.22 -0.26 0.23 0.58 0.074 -0.41 1 -0.66
7 0.4 0.23 0.25 -0.37 0.19 0.71 -0.66 1
You can use sort_values:
import pandas as pd
import numpy as np
rs = np.random.RandomState(1)
df = pd.DataFrame(rs.rand(9, 8))
corr = df.corr()
print(corr)
print(corr.sort_values(by=0, axis=1, ascending=False))  # by=0 sorts the columns by the first row; use ascending=True for lowest to largest
Results:
0 1 2 3 4 5 6 7
0 1.000000 0.418246 0.030692 -0.160001 -0.352993 0.230069 -0.216804 0.395662
1 0.418246 1.000000 -0.244115 -0.549013 0.010745 0.299203 -0.262351 0.232681
2 0.030692 -0.244115 1.000000 0.288011 0.435907 0.285408 0.225205 0.253840
3 -0.160001 -0.549013 0.288011 1.000000 -0.326950 -0.415688 0.578549 -0.366539
4 -0.352993 0.010745 0.435907 -0.326950 1.000000 0.455738 0.074293 0.193905
5 0.230069 0.299203 0.285408 -0.415688 0.455738 1.000000 -0.413383 0.708467
6 -0.216804 -0.262351 0.225205 0.578549 0.074293 -0.413383 1.000000 -0.664207
7 0.395662 0.232681 0.253840 -0.366539 0.193905 0.708467 -0.664207 1.000000
0 1 7 5 2 3 6 4
0 1.000000 0.418246 0.395662 0.230069 0.030692 -0.160001 -0.216804 -0.352993
1 0.418246 1.000000 0.232681 0.299203 -0.244115 -0.549013 -0.262351 0.010745
2 0.030692 -0.244115 0.253840 0.285408 1.000000 0.288011 0.225205 0.435907
3 -0.160001 -0.549013 -0.366539 -0.415688 0.288011 1.000000 0.578549 -0.326950
4 -0.352993 0.010745 0.193905 0.455738 0.435907 -0.326950 0.074293 1.000000
5 0.230069 0.299203 0.708467 1.000000 0.285408 -0.415688 -0.413383 0.455738
6 -0.216804 -0.262351 -0.664207 -0.413383 0.225205 0.578549 1.000000 0.074293
7 0.395662 0.232681 1.000000 0.708467 0.253840 -0.366539 -0.664207 0.193905

Group by - select most recent 4 events

I have the following df in pandas:
df:
DATE STOCK DATA1 DATA2 DATA3
01/01/12 ABC 0.40 0.88 0.22
04/01/12 ABC 0.50 0.49 0.13
07/01/12 ABC 0.85 0.36 0.83
10/01/12 ABC 0.28 0.12 0.39
01/01/13 ABC 0.86 0.87 0.58
04/01/13 ABC 0.95 0.39 0.87
07/01/13 ABC 0.60 0.25 0.56
10/01/13 ABC 0.15 0.28 0.69
01/01/11 XYZ 0.94 0.40 0.50
04/01/11 XYZ 0.65 0.19 0.81
07/01/11 XYZ 0.89 0.59 0.69
10/01/11 XYZ 0.12 0.09 0.18
01/01/12 XYZ 0.25 0.94 0.55
04/01/12 XYZ 0.07 0.22 0.67
07/01/12 XYZ 0.46 0.08 0.54
10/01/12 XYZ 0.04 0.03 0.94
...
I want to group by the stocks, sort by date, and then, for specified columns (in this case DATA1 and DATA3), sum the last four items (TTM data).
The output would look like this:
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
01/01/12 ABC 0.40 0.88 0.22 NaN NaN
04/01/12 ABC 0.50 0.49 0.13 NaN NaN
07/01/12 ABC 0.85 0.36 0.83 NaN NaN
10/01/12 ABC 0.28 0.12 0.39 2.03 1.56
01/01/13 ABC 0.86 0.87 0.58 2.49 1.92
04/01/13 ABC 0.95 0.39 0.87 2.94 2.66
07/01/13 ABC 0.60 0.25 0.56 2.69 2.39
10/01/13 ABC 0.15 0.28 0.69 2.55 2.70
01/01/11 XYZ 0.94 0.40 0.50 NaN NaN
04/01/11 XYZ 0.65 0.19 0.81 NaN NaN
07/01/11 XYZ 0.89 0.59 0.69 NaN NaN
10/01/11 XYZ 0.12 0.09 0.18 2.59 2.18
01/01/12 XYZ 0.25 0.94 0.55 1.90 2.23
04/01/12 XYZ 0.07 0.22 0.67 1.33 2.09
07/01/12 XYZ 0.46 0.08 0.54 0.89 1.94
10/01/12 XYZ 0.04 0.03 0.94 0.82 2.70
...
My approach so far has been to sort by date, then group, then iterate through each group; if there are 3 older events than the current event, I sum them. Also, I want to check that the dates fall within 1 year. Can anyone offer a better way in Python? Thank you.
Added: As a clarification for the 1-year part, let's say the last four dates go 1/1/1993, 4/1/12, 7/1/12, 10/1/12 -- a data error. I wouldn't want to sum those four; I would want that one to say NaN.
For this I think you can use transform and a rolling sum. Starting from your dataframe, I might do something like:
>>> df["DATE"] = pd.to_datetime(df["DATE"])  # switch to datetime to ease sorting
>>> df = df.sort_values(["STOCK", "DATE"])
>>> rsum_columns = ["DATA1", "DATA3"]
>>> grouped = df.groupby("STOCK")[rsum_columns]
>>> new_columns = grouped.transform(lambda x: x.rolling(4).sum())
>>> df[new_columns.columns + "_TTM"] = new_columns
>>> df
DATE STOCK DATA1 DATA2 DATA3 DATA1_TTM DATA3_TTM
0 2012-01-01 00:00:00 ABC 0.40 0.88 0.22 NaN NaN
1 2012-04-01 00:00:00 ABC 0.50 0.49 0.13 NaN NaN
2 2012-07-01 00:00:00 ABC 0.85 0.36 0.83 NaN NaN
3 2012-10-01 00:00:00 ABC 0.28 0.12 0.39 2.03 1.57
4 2013-01-01 00:00:00 ABC 0.86 0.87 0.58 2.49 1.93
5 2013-04-01 00:00:00 ABC 0.95 0.39 0.87 2.94 2.67
6 2013-07-01 00:00:00 ABC 0.60 0.25 0.56 2.69 2.40
7 2013-10-01 00:00:00 ABC 0.15 0.28 0.69 2.56 2.70
8 2011-01-01 00:00:00 XYZ 0.94 0.40 0.50 NaN NaN
9 2011-04-01 00:00:00 XYZ 0.65 0.19 0.81 NaN NaN
10 2011-07-01 00:00:00 XYZ 0.89 0.59 0.69 NaN NaN
11 2011-10-01 00:00:00 XYZ 0.12 0.09 0.18 2.60 2.18
12 2012-01-01 00:00:00 XYZ 0.25 0.94 0.55 1.91 2.23
13 2012-04-01 00:00:00 XYZ 0.07 0.22 0.67 1.33 2.09
14 2012-07-01 00:00:00 XYZ 0.46 0.08 0.54 0.90 1.94
15 2012-10-01 00:00:00 XYZ 0.04 0.03 0.94 0.82 2.70
[16 rows x 7 columns]
I don't know what you're asking by "Also, I want to check to see if the dates fall within 1 year", so I'll leave that alone.
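If the 1-year part means that each 4-row sum should only count when its dates span at most a year, one possible sketch (my reading, not part of the original answer): measure the gap between each row and the row three back within its stock, and blank the TTM columns where that gap is too wide.
>>> # time between the current row and the row 3 back, per stock
>>> span = df.groupby("STOCK")["DATE"].transform(lambda s: s.diff(3))
>>> df.loc[span > pd.Timedelta(days=366), ["DATA1_TTM", "DATA3_TTM"]] = float("nan")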
