pyGAM `y data is not in domain of logit link function`

pyGAM `y data is not in domain of logit link function` - python

I'm trying to find to what degree the chemical properties of a wine dataset influence the quality property of the dataset.
The error:
ValueError: y data is not in domain of logit link function. Expected
domain: [0.0, 1.0], but found [3.0, 9.0]
The code:
import pandas as pd
from pygam import LogisticGAM
white_data = pd.read_csv("winequality-white.csv",sep=';');
X = white_data[[
"fixed acidity","volatile acidity","citric acid","residual sugar","chlorides","free sulfur dioxide",
"total sulfur dioxide","density","pH","sulphates","alcohol"
]]
print(X.describe)
y = pd.Series(white_data["quality"]);
print(white_quality.describe)
white_gam = LogisticGAM().fit(X, y)
The output of said code:
<bound method NDFrame.describe of fixed acidity volatile acidity citric acid residual sugar chlorides \
0 7.0 0.27 0.36 20.7 0.045
1 6.3 0.30 0.34 1.6 0.049
2 8.1 0.28 0.40 6.9 0.050
3 7.2 0.23 0.32 8.5 0.058
4 7.2 0.23 0.32 8.5 0.058
... ... ... ... ... ...
4893 6.2 0.21 0.29 1.6 0.039
4894 6.6 0.32 0.36 8.0 0.047
4895 6.5 0.24 0.19 1.2 0.041
4896 5.5 0.29 0.30 1.1 0.022
4897 6.0 0.21 0.38 0.8 0.020
free sulfur dioxide total sulfur dioxide density pH sulphates \
0 45.0 170.0 1.00100 3.00 0.45
1 14.0 132.0 0.99400 3.30 0.49
2 30.0 97.0 0.99510 3.26 0.44
3 47.0 186.0 0.99560 3.19 0.40
4 47.0 186.0 0.99560 3.19 0.40
... ... ... ... ... ...
4893 24.0 92.0 0.99114 3.27 0.50
4894 57.0 168.0 0.99490 3.15 0.46
4895 30.0 111.0 0.99254 2.99 0.46
4896 20.0 110.0 0.98869 3.34 0.38
4897 22.0 98.0 0.98941 3.26 0.32
alcohol
0 8.8
1 9.5
2 10.1
3 9.9
4 9.9
... ...
4893 11.2
4894 9.6
4895 9.4
4896 12.8
4897 11.8
[4898 rows x 11 columns]>
<bound method NDFrame.describe of 0 6
1 6
2 6
3 6
4 6
..
4893 6
4894 5
4895 6
4896 7
4897 6
Name: quality, Length: 4898, dtype: int64>
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-71-e1c5720823a6> in <module>
16 print(white_quality.describe)
17
---> 18 white_gam = LogisticGAM().fit(X, y)
~/miniconda3/lib/python3.7/site-packages/pygam/pygam.py in fit(self, X, y, weights)
893
894 # validate data
--> 895 y = check_y(y, self.link, self.distribution, verbose=self.verbose)
896 X = check_X(X, verbose=self.verbose)
897 check_X_y(X, y)
~/miniconda3/lib/python3.7/site-packages/pygam/utils.py in check_y(y, link, dist, min_samples, verbose)
227 .format(link, get_link_domain(link, dist),
228 [float('%.2f'%np.min(y)),
--> 229 float('%.2f'%np.max(y))]))
230 return y
231
ValueError: y data is not in domain of logit link function. Expected domain: [0.0, 1.0], but found [3.0, 9.0]
The files: (I'm using Jupyter Notebook but I don't think you'd need to): https://drive.google.com/drive/folders/1RAj2Gh6WfdzpwtgbMaFVuvBVIWwoTUW5?usp=sharing

You probably want to use LinearGAM – LogisticGAM is for classification tasks.

Related

Tensorflow dataframe lambda modify the type of object

So i got a dataframe that has stats of fighter that store data as float
fighers_clean.dtypes
sig_str_abs_pM float64
sig_str_def_pct float64
sig_str_land_pM float64
sig_str_land_pct float64
sub_avg float64
td_avg float64
td_def_pct float64
td_land_pct float64
win% float64
fighers_clean
sig_str_abs_pM sig_str_def_pct sig_str_land_pM sig_str_land_pct sub_avg td_avg td_def_pct td_land_pct win%
name
Hunter Azure 1.57 0.56 4.00 0.50 2.5 2.00 0.75 0.33 1.000000
Jessica Eye 3.36 0.60 3.51 0.36 0.7 0.51 0.56 0.50 0.666667
Rolando Dy 4.47 0.52 3.04 0.37 0.0 0.30 0.68 0.20 0.529412
Gleidson Cutis 8.28 0.59 2.99 0.52 0.0 0.00 0.00 0.00 0.700000
Damien Brown 4.86 0.50 3.66 0.38 0.7 0.68 0.53 0.27 0.586207
... ... ... ... ... ... ... ... ... ...
Xiaonan Yan 4.22 0.64 6.85 0.40 0.0 0.25 0.66 0.50 0.916667
Alexander Yakovlev 2.44 0.58 1.79 0.47 0.2 1.56 0.72 0.33 0.705882
Rani Yahya 1.61 0.52 1.59 0.36 2.1 2.92 0.22 0.32 0.722222
Eddie Yagin 5.77 0.42 3.13 0.30 1.0 0.00 0.62 0.00 0.727273
Jamie Yager 2.55 0.63 3.08 0.39 0.0 0.00 0.66 0.00 0.714286
With this line im trying to add data to another dataframe that have stats about matches
for col in statistics:
matches_clean[col] = matches_clean.apply(
lambda row: fighers_clean.loc[row["fighter_1"], col] - fighers_clean.loc[row["fighter_2"], col], axis=1)
matches_clean.dtypes
fighter_1 object
fighter_2 object
result int64
sig_str_abs_pM object
sig_str_def_pct object
sig_str_land_pM object
sig_str_land_pct object
sub_avg object
td_avg object
td_def_pct object
td_land_pct object
win% object
dtype: object
fighter_1 fighter_2 result sig_str_abs_pM sig_str_def_pct sig_str_land_pM sig_str_land_pct sub_avg td_avg td_def_pct td_land_pct win%
fight_id
d7cbe2f23d75afd1 Julio Arce Hakeem Dawodu 0 0.56 0.03 -1.03 -0.1 0.6 0.58 0.08 0.3 -0.046154
f0418c2c989a5cde Grigorii Popov Davey Grant 0 2.52 -0.1 0.5 -0.15 0.0 -2.64 -0.19 -0.47 0.079167
fc16ccf0994c6e50 Jack Shore Nohelin Hernandez 1 -2.16 0.31 2.26 0.16 1.2 3.96 -0.37 0.07 0.285714
18e1b0df8da7010e Vanessa Melo Tracy Cortez 0 4.17 -0.05 -1.23 -0.26 0.0 -3.0 -0.23 -0.37 -0.286765
57ff0eb2351979c4 Khalid Taha Bruno Silva 1 -0.63 -0.11 1.1 0.1 0.5 -2.31 -0.37 -0.18 0.1875
This cause later an error ValueError: setting an array element with a sequence. here at line X_train_scaled = scaler.fit_transform(X_train)
# get ready for deep learning
X, y = matches_clean.iloc[:, 1:], matches_clean.iloc[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# normalization
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
X_train.dtypes
Im pretty sure Its because the float are converted to object during the lambda function
Do you know why the lambda changes the returned values and how to avoid that ?

Pandas dataframe dosen't recognize values in list

I have a list that looks something like this:
[ deptht depthb clay silt sand OM bulk_density pH sat_hidric_cond
0 89 152 NaN NaN NaN NaN NaN NaN 0.000074
1 0 25 0.20 0.72 0.08 2.00 1.30 5.8 0.000917
2 25 89 0.34 0.58 0.08 0.25 1.48 5.0 0.000091,
deptht depthb clay silt sand OM bulk_density pH sat_hidric_cond
29 0 25 0.07 0.12 0.81 3.0 1.20 6.3 0.0055
32 25 44 0.05 0.11 0.84 1.7 1.20 6.1 0.0055
41 44 70 0.04 0.08 0.88 0.6 1.58 6.4 0.0055
50 70 203 0.02 0.03 0.95 0.3 1.60 7.2 0.0055,
deptht depthb clay silt sand OM bulk_density pH \
6 157 203 0.335 0.323 0.342 0.25 1.90 7.9
8 0 25 0.225 0.527 0.248 2.00 1.40 6.2
9 25 66 0.420 0.502 0.078 0.75 1.53 6.5
12 66 109 0.240 0.518 0.242 0.25 1.53 7.5
15 109 157 0.240 0.560 0.200 0.25 1.45 7.9
sat_hidric_cond
6 0.000074
8 0.000917
9 0.000282
12 0.000776
15 0.000776 ,
deptht depthb clay silt sand OM bulk_density pH \
0 71 109 0.100 0.234 0.666 0.25 1.68 5.8
1 109 152 0.100 0.265 0.635 0.25 1.70 8.2
3 0 23 0.085 0.237 0.678 2.00 1.45 6.2
4 23 71 0.210 0.184 0.606 0.25 1.55 5.5
sat_hidric_cond
0 0.0023
1 0.0023
3 0.0028
4 0.0009 ,
deptht depthb clay silt sand OM bulk_density pH \
3 0 25 0.11 0.230 0.660 0.75 1.55 7.2
4 25 76 0.14 0.192 0.668 0.25 1.55 7.2
6 76 152 0.14 0.556 0.304 0.00 1.75 8.2
sat_hidric_cond
3 0.002800
4 0.002800
6 0.000091 ]
when I try to transform my list into a DataFrame with soil = pd.DataFrame(data)
I get this output
0
0 deptht depthb clay silt sand OM bul...
1 deptht depthb clay silt sand OM bul...
2 deptht depthb clay silt sand OM ...
3 deptht depthb clay silt sand OM ...
4 deptht depthb clay silt sand OM b...
Those are the five elements of my list but is not recognizing the values associated to each variable.
However when I use the squeeze function soil = soil.iloc[1].squeeze()
I get something similar to what I want as result:
deptht depthb clay silt sand OM bulk_density pH sat_hidric_cond
29 0 25 0.07 0.12 0.81 3.0 1.20 6.3 0.0055
32 25 44 0.05 0.11 0.84 1.7 1.20 6.1 0.0055
41 44 70 0.04 0.08 0.88 0.6 1.58 6.4 0.0055
50 70 203 0.02 0.03 0.95 0.3 1.60 7.2 0.0055
But I have to use the iloc function to individually select each element of the list.
What I'm looking for is a method that I can apply to the whole list and get an output like I get when I use the pandas squeeze method.
Any help is greatly appreciated.
Thank you very much.

data is a list and it seems you want to extract the second element of the list:
soil = pd.DataFrame(data[1])

I wish to optimize the code using pythonic ways using lambda and pandas

I have the following Dataframe:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067
13 23 NaN NaN NaN NaN NaN NaN NaN NaN 983.5 BQ0067
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067
17 11 NaN NaN NaN NaN NaN NaN NaN NaN 0.0 BQ0068
18 21 4.83 11.9 28.1 44.2 54.63 16.76 6.70 0.19 953.7 BQ0068
19 22 4.40 10.7 26.3 43.4 57.55 19.85 8.59 0.53 974.9 BQ0068
20 23 17.61 43.8 67.9 122.6 221.20 0.75 0.33 58.27 974.9 BQ0068
21 31 15.09 22.3 33.3 45.6 59.45 0.98 0.38 0.73 1773.7 BQ0068
I wish to do the following things:
Steps:
Whenever the TEST_NUMBER 11 is NaN(null values), I need to remove all rows of particular
PRODUCT_NO. For example, in the given dataframe, PRODUCT_NO. BQ0068 has TEST_NUMBER 11
with NaN values, hence all rows of BQ0068 should be removed.
If any TEST_NUMBER other than TEST_NUMBER 11 has NaN values, then only that particular
TEST_NUMBER's row should be removed. For example, PRODUCT_NO BQ0067 has row of TEST_NUMBER 23 with NaN values. Hence only that particular row of TEST_NUMBER 23should be removed.
After doing the above steps, I need to the computation, for example, for PRODUCT_NO BQ0066, I
need compute the the difference between rows in following way,
TEST_NUMBER 21 - TEST_NUMBER 11, TEST_NUMBER 22 - TEST_NUMBER 11, TEST_NUMBER 23 - TEST_NUMBER 11, TEST_NUMBER 24 - TEST_NUMBER 25,
TEST_NUMBER 21 - TEST_NUMBER 11. And then TEST_NUMBER 31 - TEST_NUMBER 25,
TEST_NUMBER 32 - TEST_NUMBER 25, TEST_NUMBER 33 - TEST_NUMBER 25, TEST_NUMBER 34 -
TEST_NUMBER 25. And carry on the same procedure for successive PRODUCT_NO. As you can see
TEST_NUMBERS frequency is different for each PRODUCT_NO. But in all cases, every
PRODUCT_NO will have only one TEST_NUMBER 11 and the other TEST_NUMBERS will be in range
of 21 to 29 i.e. 21, 22, 23, 24, 25, 26, 27, 28, 29 and 31, 32, 33 ,34, 35, 36, 37, 38, 39
PYTHON CODE
def pick_closest_sample(sample_list, sample_no):
sample_list = sorted(sample_list)
buffer = []
for number in sample_list:
if sample_no // 10 == number// 10:
break
else:
buffer.append(number)
if len(buffer) > 0:
return buffer[-1]
return sample_no
def add_closest_sample_df(df):
unique_product_nos = list(df['PRODUCT_NO'].unique())
out = []
for product_no in unique_product_nos:
subset = df[df['PRODUCT_NO'] == product_no]
if subset.iloc[0].isnull().sum() == 0:
subset.dropna(inplace = True)
sample_list = subset['TEST_NUMBER'].to_list()
subset['target_sample'] = subset['TEST_NUMBER'].apply(lambda x: pick_closest_sample(sample_list,x))
out.append(subset)
if len(out)>0:
out = pd.concat(out)
out.dropna(inplace=True)
return out
Output of above two functions:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM PRODUCT_NO target_sample
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 BQ0066 11
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 BQ0066 11
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 BQ0066 11
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 BQ0066 11
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 BQ0066 11
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 BQ0066 11
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 BQ0066 25
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 BQ0066 25
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 BQ0066 25
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 BQ0066 25
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 BQ0067 11
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 BQ0067 11
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 BQ0067 11
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 BQ0067 22
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 BQ0067 22
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 BQ0067 22
As you can see, all rows of PRODUCT_NO BQ0068 are removed as TEST_NUMBER 11 had NaN values. Also only row of TEST_NUMBER 23 of PRODUCT_NO BQ0067 is removed as it had NaN values. So the requirements mentioned in the first two steps are met. Now the computation for PRODUCT_NO BQ0067 will be like TEST_NUMBER 31 - TEST_NUMBER 22, TEST_NUMBER 32 - TEST_NUMBER 22, TEST_NUMBER 33 - TEST_NUMBER 22
PYTHON CODE
def compute_df(df):
unique_product_nos = list(df['PRODUCT_NO'].unique())
out = []
for product_no in unique_product_nos:
subset = df[df['PRODUCT_NO'] == product_no]
target_list = list(subset['target_sample'].unique())
for target in target_list:
target_df = subset[subset['target_sample'] == target]
target_subset = [subset[subset['TEST_NUMBER'] == target]]*len(target_df)
target_subset = pd.concat(target_subset)
if len(target_subset)> 0:
target_subset.index = target_df.index
diff_cols = ['D1','D10','D50','D90','D99','Q3_15','Q3_10','l-Q3_63','RPM']
for col in diff_cols:
target_df[col + '_diff'] = target_df[col] - target_subset[col]
out.append(target_df)
if len(out)>0:
out = pd.concat(out)
return out
Output of the above function:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 RPM ... target_sample D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 953.5 ... 11 -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 904.2 ... 11 -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 945.2 ... 11 -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 964.2 ... 11 -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 964.2 ... 11 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 1694.5 ... 25 -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 1735.4 ... 25 -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 1754.4 ... 25 -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 1754.4 ... 25 -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 0.0 ... 11 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 964.0 ... 11 -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 983.5 ... 11 -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 1753.4 ... 22 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 1773.8 ... 22 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 1773.8 ... 22 -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 790.3
Kindly help me optimize the code of three functions I posted, so I could write them in more pythonic way.

Points 1. and 2. can be achieved in a few line with pandas functions.
You can then calculate "target_sample" and your diff_col in the same loop using groupby:
# 1. Whenever TEST_NUMBER == 11 has D1 value NaN, remove all rows with this PRODUCT_NO
drop_prod_no = df[(df.TEST_NUMBER==11) & (df.D1.isna())]["PRODUCT_NO"]
df.drop(df[df.PRODUCT_NO.isin(drop_prod_no)].index, axis=0, inplace=True)
# 2. Drop remaining rows with NaN values
df.dropna(inplace=True)
# 3. set column "target_sample" and calculate diffs
new_df = pd.DataFrame()
diff_cols = ['D1','D10','D50','D90','D99','Q3_15','Q3_10','l-Q3_63','RPM']
for _, subset in df.groupby("PRODUCT_NO"):
closest_sample = last_sample = 11
for index, row in subset.iterrows():
if row.TEST_NUMBER // 10 > closest_sample // 10 + 1:
closest_sample = last_sample
subset.at[index, "target_sample"] = closest_sample
last_sample = row.TEST_NUMBER
for col in diff_cols:
subset.at[index, col + "_diff"] = subset.at[index, col] - float(subset[subset.TEST_NUMBER==closest_sample][col])
new_df = pd.concat([new_df, subset])
print(new_df)
Output:
TEST_NUMBER D1 D10 D50 D90 D99 Q3_15 Q3_10 l-Q3_63 ... D1_diff D10_diff D50_diff D90_diff D99_diff Q3_15_diff Q3_10_diff l-Q3_63_diff RPM_diff
0 11 4.77 12.7 34.9 93.7 213.90 13.74 5.98 21.44 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
1 21 4.43 10.8 25.2 39.8 49.73 20.04 8.45 0.10 ... -0.34 -1.9 -9.7 -53.9 -164.17 6.30 2.47 -21.34 953.5
2 22 4.52 11.3 27.7 48.0 60.51 17.58 7.50 0.58 ... -0.25 -1.4 -7.2 -45.7 -153.39 3.84 1.52 -20.86 904.2
3 23 4.67 11.5 27.2 44.8 56.64 17.49 7.24 0.25 ... -0.10 -1.2 -7.7 -48.9 -157.26 3.75 1.26 -21.19 945.2
4 24 4.41 10.9 26.8 44.5 57.84 18.95 8.31 0.54 ... -0.36 -1.8 -8.1 -49.2 -156.06 5.21 2.33 -20.90 964.2
5 25 28.88 47.3 71.8 140.0 249.40 0.26 0.12 63.42 ... 24.11 34.6 36.9 46.3 35.50 -13.48 -5.86 41.98 964.2
6 31 16.92 23.1 34.3 48.4 92.41 0.51 0.13 1.78 ... -11.96 -24.2 -37.5 -91.6 -156.99 0.25 0.01 -61.64 730.3
7 32 16.35 22.2 33.0 45.8 59.14 0.53 0.11 0.64 ... -12.53 -25.1 -38.8 -94.2 -190.26 0.27 -0.01 -62.78 771.2
8 33 16.42 21.9 32.6 45.9 56.91 0.51 0.10 0.36 ... -12.46 -25.4 -39.2 -94.1 -192.49 0.25 -0.02 -63.06 790.2
9 34 3.47 7.3 14.1 20.7 26.52 56.59 23.71 0.00 ... -25.41 -40.0 -57.7 -119.3 -222.88 56.33 23.59 -63.42 790.2
10 11 5.16 14.2 38.6 123.4 263.80 11.03 4.82 26.90 ... 0.00 0.0 0.0 0.0 0.00 0.00 0.00 0.00 0.0
11 21 4.72 11.6 27.5 44.5 54.91 17.05 7.05 0.20 ... -0.44 -2.6 -11.1 -78.9 -208.89 6.02 2.23 -26.70 964.0
12 22 4.48 11.2 26.4 42.4 52.22 18.38 7.68 0.12 ... -0.68 -3.0 -12.2 -81.0 -211.58 7.35 2.86 -26.78 983.5
14 31 14.80 22.4 33.2 45.5 58.11 1.05 0.36 0.56 ... 10.32 11.2 6.8 3.1 5.89 -17.33 -7.32 0.44 769.9
15 32 16.30 22.1 32.1 44.7 55.12 0.57 0.13 0.23 ... 11.82 10.9 5.7 2.3 2.90 -17.81 -7.55 0.11 790.3
16 33 3.44 7.2 14.0 21.0 26.34 56.72 24.69 0.00 ... -1.04 -4.0 -12.4 -21.4 -25.88 38.34 17.01 -0.12 79
Edit: you can avoid using ìterrows by applying lambda functions like you did:
# 3. set column "target_sample" and calculate diffs
def get_closest_sample(samples, test_no):
closest_sample = last_sample = 11
for smpl in samples:
if smpl // 10 > closest_sample // 10 + 1:
closest_sample = last_sample
if smpl == test_no:
break
last_sample = smpl
return closest_sample
new_df = pd.DataFrame()
diff_cols = ['D1','D10','D50','D90','D99','Q3_15','Q3_10','l-Q3_63','RPM']
for _, subset in df.groupby("PRODUCT_NO"):
sample_list = list(subset["TEST_NUMBER"])
subset["target_sample"] = subset["TEST_NUMBER"].apply(lambda x: get_closest_sample(sample_list, x))
for col in diff_cols:
subset[col + "_diff"] = subset.apply(lambda row: row[col]-float(subset[subset.TEST_NUMBER==row["target_sample"]][col]), axis=1)
new_df = pd.concat([new_df, subset])
print(new_df)

Sum results of pandas groupby

I have the next DataFrame:
stock color 15M_c 60M_c mediodia 1D_c 1D-15M_c
0 PYPL rojo 0.32 0.32 0.47 -0.18 -0.50
1 MSFT verde -0.11 0.38 0.79 -0.48 -0.35
2 PYPL verde -1.44 -1.23 0.28 -1.13 0.30
3 V rojo -0.07 0.23 0.70 0.80 0.91
4 JD rojo 0.87 1.11 1.19 0.43 -0.42
5 FB verde 0.20 0.05 0.22 -0.66 -0.82
.. ... ... ... ... ... ... ...
282 GM verde 0.14 0.06 0.47 0.51 0.37
283 FB verde 0.09 -0.08 0.12 0.22 0.12
284 MSFT rojo -0.16 -0.23 -0.06 -0.01 0.14
285 PYPL verde -0.14 -0.41 -0.07 0.20 0.30
286 V verde -0.02 0.00 0.28 0.42 0.45
And first I grouped by 'stock' and 'color', I do it with the next code:
marcos = ['15M_c','60M_c','mediodia','1D_c','1D-15M_c']
grouped = data.groupby(['stock','color'])
res = grouped[marcos].agg([np.size, np.sum])
So in 'res' I get the next DataFrame:
15M_c 60M_c mediodia 1D_c 1D-15M_c
size sum size sum size sum size sum size sum
stock color
AAPL rojo 10.0 -0.46 10.0 -0.20 10.0 -0.33 10.0 -0.25 10.0 0.18
verde 8.0 1.39 8.0 2.48 8.0 1.06 8.0 -1.57 8.0 -2.88
... ... .. .. .. .. .. .. .. .. .. ..
FB verde 15.0 0.92 15.0 -0.64 15.0 -0.11 15.0 -0.89 15.0 -1.80
MSFT rojo 11.0 0.47 11.0 2.07 11.0 2.71 11.0 4.37 11.0 3.83
verde 18.0 1.46 18.0 2.12 18.0 1.26 18.0 0.97 18.0 -0.54
PYPL rojo 9.0 1.06 9.0 2.68 9.0 5.02 9.0 3.98 9.0 2.84
verde 17.0 -1.57 17.0 -2.40 17.0 0.29 17.0 -0.48 17.0 1.08
V rojo 1.0 -0.22 1.0 -0.28 1.0 -0.36 1.0 -0.29 1.0 -0.06
verde 9.0 -1.01 9.0 -1.42 9.0 -0.86 9.0 0.58 9.0 1.61
And then I want to sum 'verde' row with 'rojo' row for each 'stock', but multipliying rojo sum by -1. The final result I wanted is:
15M_c 60M_c mediodia 1D_c 1D-15M_c
size sum sum sum sum sum
stock
AAPL 18.0 1.85 2.68 1.39 -1.32 -3.06
... .. .. .. .. .. ..
FB 15.0 0.92 -0.64 -0.11 -0.89 -1.80
MSFT 29.0 0.99 0.05 -1.45 -3.40 -4.37
PYPL 26.0 -2.63 -5.08 .. .. ..
V 10.0 -0.79 -1.14 .. .. ..
Thank you very much in advance for your help.

pandas.IndexSlice
Use loc and IndexSlice to change the sign of appropriate values. Then use sum(level=0)
islc = pd.IndexSlice
res.loc[islc[:, 'rojo'], islc[:, 'sum']] *= -1
res.sum(level=0)

Convert the columns in marcos based on the value of color
import numpy as np
for m in marcos:
data[m] = np.where(data['color'] == 'rojo', -data[m], data[m])
Then you can skip grouping by color altogether:
grouped = foo.groupby(['stock'])
res = grouped[marcos].agg([np.size, np.sum])

What is the effective way to have a pivot-table having pandas dataset columns as its rows?

Let's take as an example the following dataset:
make address all 3d our over length_total y
0 0.0 0.64 0.64 0.0 0.32 0.0 278 1
1 0.21 0.28 0.5 0.0 0.14 0.28 1028 1
2 0.06 0.0 0.71 0.0 1.23 0.19 2259 1
3 0.15 0.0 0.46 0.1 0.61 0.0 1257 1
4 0.06 0.12 0.77 0.0 0.19 0.32 749 1
5 0.0 0.0 0.0 0.0 0.0 0.0 21 1
6 0.0 0.0 0.25 0.0 0.38 0.25 184 1
7 0.0 0.69 0.34 0.0 0.34 0.0 261 1
8 0.0 0.0 0.0 0.0 0.9 0.0 25 1
9 0.0 0.0 1.42 0.0 0.71 0.35 205 1
10 0.0 0.0 0.0 0.0 0.0 0.0 23 0
11 0.48 0.0 0.0 0.0 0.48 0.0 37 0
12 0.12 0.0 0.25 0.0 0.0 0.0 491 0
13 0.08 0.08 0.25 0.2 0.0 0.25 807 0
14 0.0 0.0 0.0 0.0 0.0 0.0 38 0
15 0.24 0.0 0.12 0.0 0.0 0.12 227 0
16 0.0 0.0 0.0 0.0 0.75 0.0 77 0
17 0.1 0.0 0.21 0.0 0.0 0.0 571 0
18 0.51 0.0 0.0 0.0 0.0 0.0 74 0
19 0.3 0.0 0.15 0.0 0.0 0.15 155 0
I want to get pivot-table from the previous dataset, in which the columns (make, address all, 3d, our, over, length_total) will have their mean values processed by the column y. The following table is the expected result:
y
1 0
make 0.048 0.183
address 0.173 0.008
all 0.509 0.098
3d 0.01 0.02
our 0.482 0.123
over 0.139 0.052
length_total 626.7 250
Is it possible to get the desired result through pivot_table method from pandas.data object? If so, how?
Is there a more effective way to do this?

Some people like using stack or unstack, but I prefer good ol' pd.melt to "flatten" or "unpivot" a frame:
>>> df_m = pd.melt(df, id_vars="y")
>>> df_m.pivot_table(index="variable", columns="y")
value
y 0 1
variable
3d 0.020 0.010
address 0.008 0.173
all 0.098 0.509
length_total 250.000 626.700
make 0.183 0.048
our 0.123 0.482
over 0.052 0.139
(If you want to preserve the original column order as the new row order, you can use .loc to index into this, something like df2.loc[df.columns].dropna()).
Melting does the flattening, and preserves y as a column, putting the old column names as a new column called "variable" (which can be changed if you like):
>>> pd.melt(df, id_vars="y").head()
y variable value
0 1 make 0.00
1 1 make 0.21
2 1 make 0.06
3 1 make 0.15
4 1 make 0.06
After that we can call pivot_table as we would ordinarily.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

pyGAM `y data is not in domain of logit link function` - python

You probably want to use LinearGAM – LogisticGAM is for classification tasks.

Related

Tensorflow dataframe lambda modify the type of object

Pandas dataframe dosen't recognize values in list

I wish to optimize the code using pythonic ways using lambda and pandas

Sum results of pandas groupby

What is the effective way to have a pivot-table having pandas dataset columns as its rows?

Categories

Resources