How to split data based on a column value in sklearn - python

I have a data file with the following columns:
'customer',
'calibrat' - Calibration sample = 1; Validation sample = 0;
'churn',
'churndep',
'revenue',
'mou',
The data file contains about 40000 rows, of which 20000 have calibrat = 1. I want to split this data into train and test sets:
X1 = data.loc[:, data.columns != 'churn']
y1 = data.loc[:, data.columns == 'churn']
from imblearn.over_sampling import SMOTE
os = SMOTE(random_state=0)
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.3, random_state=0)
What I want is for X1_train to contain the calibration data (calibrat = 1)
and for X1_test to contain all the validation data (calibrat = 0).

sklearn.model_selection has several options besides train_test_split, and one of them is aimed at what you're asking for. In this case you could use GroupShuffleSplit, which, as mentioned in the docs, provides randomized train/test indices to split data according to a third-party provided group. This is useful when you're doing cross-validation and want to split into train and validation sets multiple times while ensuring that the sets are split by the group field. There is also GroupKFold for these cases, which is very useful.
So, adapting your example, here's what you could do.
Say you have, for instance:
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
cols = ['customer', 'calibrat', 'churn', 'churndep', 'revenue', 'mou']
X = pd.DataFrame(np.random.rand(10, 6), columns=cols)
X['calibrat'] = np.random.choice([0, 1], size=10)
print(X)
   customer  calibrat     churn  churndep   revenue       mou
0  0.523571         1  0.394896  0.933637  0.232630  0.103486
1  0.456720         1  0.850961  0.183556  0.885724  0.993898
2  0.411568         1  0.003360  0.774391  0.822560  0.840763
3  0.148390         0  0.115748  0.089891  0.842580  0.565432
4  0.505548         0  0.370198  0.566005  0.498009  0.601986
5  0.527433         0  0.550194  0.991227  0.516154  0.283175
6  0.983699         0  0.514049  0.958328  0.005034  0.050860
7  0.923172         0  0.531747  0.026763  0.450077  0.961465
8  0.344771         1  0.332537  0.046829  0.047598  0.324098
9  0.195655         0  0.903370  0.399686  0.170009  0.578925
y = X.pop('churn')
You can now instantiate GroupShuffleSplit and use it as you would train_test_split, with the only difference being that you specify a group column, which is used to split X and y so that rows are assigned to train or test according to their group values:
gs = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)
As mentioned, this is more handy when you want to split into multiple groups, generally for cross validation purposes. Here's just an example of how you'd get two splits, as mentioned in the question:
train_ix, test_ix = next(gs.split(X, y, groups=X.calibrat))
X_train = X.loc[train_ix]
y_train = y.loc[train_ix]
X_test = X.loc[test_ix]
y_test = y.loc[test_ix]
Giving:
print(X_train)
   customer  calibrat  churndep   revenue       mou
3  0.148390         0  0.089891  0.842580  0.565432
4  0.505548         0  0.566005  0.498009  0.601986
5  0.527433         0  0.991227  0.516154  0.283175
6  0.983699         0  0.958328  0.005034  0.050860
7  0.923172         0  0.026763  0.450077  0.961465
9  0.195655         0  0.399686  0.170009  0.578925
print(X_test)
   customer  calibrat  churndep   revenue       mou
0  0.523571         1  0.933637  0.232630  0.103486
1  0.456720         1  0.183556  0.885724  0.993898
2  0.411568         1  0.774391  0.822560  0.840763
8  0.344771         1  0.046829  0.047598  0.324098
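Note that GroupShuffleSplit decides at random which group lands in the training set. If the requirement is strictly "calibrat = 1 goes to train, calibrat = 0 goes to test", a plain boolean mask may be enough. The following is only a minimal sketch on made-up data; the frame name data and the fixed calibrat values are assumptions for illustration, not part of the original data file.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real data file (assumption for illustration).
rng = np.random.default_rng(0)
cols = ['customer', 'calibrat', 'churn', 'churndep', 'revenue', 'mou']
data = pd.DataFrame(rng.random((10, 6)), columns=cols)
data['calibrat'] = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# Separate features and target.
X = data.drop(columns='churn')
y = data['churn']

# Rows flagged as calibration sample go to train, the rest to test.
is_calibration = data['calibrat'] == 1
X_train, y_train = X[is_calibration], y[is_calibration]
X_test, y_test = X[~is_calibration], y[~is_calibration]

print(X_train.shape, X_test.shape)
```

Because the split is deterministic, no random_state is involved, and the row counts in each set follow directly from the calibrat column.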

Related

Import several excel files using glob and iteratively work on them and extract a data frame from results

I have a total of 28 Excel files in a folder. Each of them consists of 72 columns (71 variables and 1 target). Using the code below, I try to predict the target value for one Excel file; the final outcome should be A, B, C, and D.
How can I use glob so that the code reads all the Excel files, runs the prediction process, and finally puts the results (A, B, C, and D) of each Excel file in a row of a new data frame (row1=excel1, row2=excel2, and so on)?
import numpy as np
import pandas as pd
import glob
for file in list(glob.glob('*.xlsx')):
data = pd.read_excel(file)
X = data.drop(columns=['HWD'])
Y = data['HWD']
# Split DataSet to Train & Test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y,
test_size=0.2,
random_state=0)
# Prediction 1
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
ABR = AdaBoostRegressor()
ABR.fit(X_train, y_train)
ABRtest=ABR.predict(X_test)
# Calculate metrics 1
from sklearn.metrics import mean_squared_error
import math
from scipy.stats import variation
from sklearn.metrics import mean_poisson_deviance
df_list = []
a = pd.DataFrame({'Obs':(y_test),'Pre':(ABRtest)})
df_list.append(a)
column_1 = a["Obs"]
column_2 = a["Pre"]
A = column_1.corr(column_2)
MSE = mean_squared_error(column_1, column_2)
B = math.sqrt(MSE)
C = variation(column_2 , axis = 0)
D = mean_poisson_deviance(column_1, column_2)
e below is the final row for each Excel file. I need code that constructs a data frame with 28 rows containing the results for each file.
df_list = []
e = pd.DataFrame({'CC':[A],'RMSE':[B],'RSD':[C],'MPD':[D]})
df_list.append(e)
You could add one row per iteration below the existing df. DataFrame.append() was removed in pandas 2.0, so the version below uses pd.concat() instead:
import pandas as pd
e = pd.DataFrame()  # initiate an empty df
for i in range(28):
    A, B, C, D = i, i, i, i  # assume you have the results here
    e = pd.concat([e, pd.DataFrame({'CC': [A], 'RMSE': [B], 'RSD': [C], 'MPD': [D]})])
e.reset_index(drop=True, inplace=True)
Is this your desired output?
CC RMSE RSD MPD
0 0 0 0 0
1 1 1 1 1
2 2 2 2 2
...
25 25 25 25 25
26 26 26 26 26
27 27 27 27 27
To insert into your code (pd.concat replaces DataFrame.append, which was removed in pandas 2.0):
import glob
e = pd.DataFrame({'CC': [], 'RMSE': [], 'RSD': [], 'MPD': []}, dtype='object')  #<--insert here, initiate an empty df
for file in glob.glob('*.xlsx'):
    data = pd.read_excel(file)
    ...
    C = variation(column_2, axis=0)
    D = mean_poisson_deviance(column_1, column_2)
    e = pd.concat([e, pd.DataFrame({'CC': [str(A)], 'RMSE': [str(B)], 'RSD': [str(C)], 'MPD': [str(D)]})])  #<--insert here
e.reset_index(drop=True, inplace=True)  #<--insert this line outside for-loop
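Another option is to collect one dict per file in a list and build the data frame once at the end, which avoids growing a frame inside the loop. The sketch below is only illustrative: collect_results and fake_evaluate are hypothetical names, and fake_evaluate stands in for the real per-file work (read_excel, model fitting, metrics) so that the example runs without any Excel files.

```python
import pandas as pd

def collect_results(paths, evaluate):
    """Build one summary row per file; `evaluate(path)` must return (A, B, C, D)."""
    rows = []
    for path in paths:
        A, B, C, D = evaluate(path)
        rows.append({'file': path, 'CC': A, 'RMSE': B, 'RSD': C, 'MPD': D})
    # Construct the frame once, after the loop.
    return pd.DataFrame(rows)

def fake_evaluate(path):
    # Hypothetical stand-in for reading the file and computing the four metrics.
    n = len(path)
    return n, n * 2, n * 3, n * 4

summary = collect_results(['a.xlsx', 'bb.xlsx'], fake_evaluate)
print(summary)
```

In the real script, the list of paths would come from sorted(glob.glob('*.xlsx')), and evaluate would wrap the existing per-file code and return A, B, C, and D.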

Minmaxscaler inverse_transform doesn't work

My code is as follows:
# transform scale
X = dataset #(100, 18)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(X)
scaled_X = scaler.transform(X)
scaled_series = Series(scaled_X[:, 17])
print(scaled_series.head())
# invert transform
inverted_X = scaler.inverse_transform(scaled_X)
inverted_series = Series(inverted_X[:, 17])
print(inverted_series.head())
The problem is that scaled_series and inverted_series give the same result. How should I correct the code?
I guess the problem is specific to your dataset. For instance, when I use an example dataset, scaled_series and inverted_series give two different outputs:
Scaled Series output:
0 0.729412
1 0.741176
2 0.741176
3 0.670588
4 0.870588
dtype: float32
Inverted Series output:
0 0.698347
1 0.706612
2 0.706612
3 0.657025
4 0.797521
dtype: float32
scaled_series and inverted_series gave different outputs, but the values are close to each other. If you scale your data before using MinMaxScaler:
from sklearn.preprocessing import scale
X = scale(X)
Result:
Scaled Series output:
0 0.729412
1 0.741176
2 0.741176
3 0.670588
4 0.870588
dtype: float32
Inverted Series output:
0 -0.188240
1 -0.123413
2 -0.123413
3 -0.512372
4 0.589678
dtype: float32
Now, the outputs are not close to each other, they are completely different.
Code:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.preprocessing import MinMaxScaler, scale
from pandas import Series
X, _ = fetch_olivetti_faces(return_X_y=True)
X = scale(X)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(X)
scaled_X = scaler.transform(X)
scaled_series = Series(scaled_X[:, 17])
print("\nScaled Series output:")
print(scaled_series.head())
inverted_X = scaler.inverse_transform(scaled_X)
inverted_series = Series(inverted_X[:, 17])
print("\nInverted Series output:")
print(inverted_series.head())
You have to consider the range of your dataset X. Consider the formula for the MinMax scaler with the default feature_range=(0, 1):
X_scaled = (X - X.min) / (X.max - X.min)
If each column of X already has a minimum of 0 and a maximum of 1, scaling makes no difference, because you subtract 0 and divide by 1, which returns the same value.
Min-max normalization therefore only changes values that are not already on the 0-1 scale.
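This can be checked on toy data. The sketch below uses two made-up single-column arrays (already_unit and wider_range are hypothetical names): one already spans [0, 1], the other does not. The first should come back unchanged from transform, while the second is rescaled and then recovered by inverse_transform.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data: one column already spans [0, 1], one does not.
already_unit = np.array([[0.0], [0.25], [1.0]])
wider_range = np.array([[10.0], [15.0], [30.0]])

unit_scaler = MinMaxScaler(feature_range=(0, 1)).fit(already_unit)
wide_scaler = MinMaxScaler(feature_range=(0, 1)).fit(wider_range)

unit_scaled = unit_scaler.transform(already_unit)
wide_scaled = wide_scaler.transform(wider_range)

# Data already on [0, 1] is left as it was, so scaled and inverted values coincide.
print(np.allclose(unit_scaled, already_unit))
# Data on another range is mapped to [0, 1] and mapped back by inverse_transform.
print(wide_scaled.ravel())
print(np.allclose(wide_scaler.inverse_transform(wide_scaled), wider_range))
```

If scaled_series and inverted_series look identical on a real dataset, checking the column's min and max is a quick way to see whether it was already normalized.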

How to predict y=1/x values in Python? [closed]

I have a data frame named df:
import pandas as pd
df = pd.DataFrame({'p': [15-x for x in range(14)]
, 'x': [x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])
x is only for plotting purposes.
I'm trying to predict the y value based on the p values. I am using SVR from sklearn:
from sklearn.svm import SVR
nlm = SVR(kernel='poly').fit(df[['p']], df['y'])
df['nml'] = nlm.predict(df[['p']])
I have already tried all of the kernels, but the fit is still not accurate enough.
p x y nml
0 15 0 666.666667 524.669572
1 14 1 714.285714 713.042459
2 13 2 769.230769 876.338765
3 12 3 833.333333 1016.349674
Do you know which sklearn model or other library I should use to fit the data better?
You missed a fundamental step: normalizing the data.
Fix
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
df = pd.DataFrame({'p': [15-x for x in range(14)]
, 'x': [x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])
# Normalize the data (x - mean(x))/std(x)
s_p = np.std(df['p'])
m_p = np.mean(df['p'])
s_y = np.std(df['y'])
m_y = np.mean(df['y'])
df['p_'] = (df['p'] - m_p)/s_p
df['y_'] = (df['y'] - m_y)/s_y
# Fit and make prediction
nlm = SVR(kernel='rbf').fit(df[['p_']], df['y_'])
df['nml'] = nlm.predict(df[['p_']])
# Plot
plt.plot(df['p_'], df['y_'], 'r')
plt.plot(df['p_'], df['nml'], 'g')
plt.show()
# Rescale back and plot
plt.plot(df['p_']*s_p+m_p, df['y_']*s_y+m_y, 'r')
plt.plot(df['p_']*s_p+m_p, df['nml']*s_y+m_y, 'g')
plt.show()
As @mujjiga pointed out, scaling is an important part of the process.
I would like to draw your attention to two other key points:
model selection, which determines your ability to solve a class of problems;
the new scikit-learn API, which helps you standardize solution development.
Let's start with your dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.arange(14)
df = pd.DataFrame({'x': x, 'p': 15-x})
df['y'] = 1e4/df['p']
Then we import some sklearn API objects of interest:
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, FunctionTransformer
First we create a scaler function for target values:
ysc = StandardScaler()
Notice that we can use different scalers, or build a custom transformation.
# Scaler robust against outliers:
ysc = RobustScaler()
# Logarithmic Transformation:
ysc = FunctionTransformer(func=np.log, inverse_func=np.exp, check_inverse=True)
We scale target using the scaler of our choice:
ysc.fit(df[['y']])
df['yn'] = ysc.transform(df[['y']])
We also build a pipeline with features standardizer and the selected model (we adjusted parameters to improve the fit). We fit it to your dataset using the pipeline:
reg = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1e3, epsilon=1e-3))
reg.fit(df[['p']], df['yn'])
At this point we can predict values and transform them back to the original scale:
df['ynhat'] = reg.predict(df[['p']])
df['yhat'] = ysc.inverse_transform(df[['ynhat']])
We check the fit score:
reg.score(df[['p']], df['yn']) # 0.9999646718755011
We can also compute absolute and relative error for each point:
df['yaerr'] = df['yhat'] - df['y']
df['yrerr'] = df['yaerr']/df['y']
Final result is:
x p y yn ynhat yhat yaerr yrerr
0 0 15 666.666667 -0.834823 -0.833633 668.077018 1.410352 0.002116
1 1 14 714.285714 -0.794636 -0.795247 713.562403 -0.723312 -0.001013
2 2 13 769.230769 -0.748267 -0.749627 767.619013 -1.611756 -0.002095
3 3 12 833.333333 -0.694169 -0.693498 834.128425 0.795091 0.000954
4 4 11 909.090909 -0.630235 -0.629048 910.497550 1.406641 0.001547
5 5 10 1000.000000 -0.553514 -0.555029 998.204445 -1.795555 -0.001796
6 6 9 1111.111111 -0.459744 -0.460002 1110.805275 -0.305836 -0.000275
7 7 8 1250.000000 -0.342532 -0.341099 1251.697707 1.697707 0.001358
8 8 7 1428.571429 -0.191830 -0.193295 1426.835676 -1.735753 -0.001215
9 9 6 1666.666667 0.009105 0.010458 1668.269984 1.603317 0.000962
10 10 5 2000.000000 0.290414 0.291060 2000.764717 0.764717 0.000382
11 11 4 2500.000000 0.712379 0.690511 2474.088446 -25.911554 -0.010365
12 12 3 3333.333333 1.415652 1.416874 3334.780642 1.447309 0.000434
13 13 2 5000.000000 2.822199 2.821420 4999.076799 -0.923201 -0.000185
Graphically it leads to:
fig, axe = plt.subplots()
axe.plot(df['p'], df['y'], label='$y(p)$')
axe.plot(df['p'], df['yhat'], 'o', label=r'$\hat{y}(p)$')
axe.set_title(r"SVR Fit for $y(x) = \frac{k}{x-a}$")
axe.set_xlabel('$p = x-a$')
axe.set_ylabel(r'$y, \hat{y}$')
axe.legend()
axe.grid()
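The manual target scaling and inverse transform used above can also be delegated to scikit-learn's TransformedTargetRegressor, which scales y before fitting and applies the inverse transform inside predict. This is only a sketch of that alternative, reusing the same data and the same SVR parameters as above; it is not part of the original answer.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Same dataset as in the walkthrough above.
x = np.arange(14)
df = pd.DataFrame({'x': x, 'p': 15 - x})
df['y'] = 1e4 / df['p']

# The inner pipeline standardizes the feature; the wrapper standardizes the target.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1e3, epsilon=1e-3)),
    transformer=StandardScaler(),
)
model.fit(df[['p']], df['y'])

# Predictions come back on the original scale of y.
df['yhat'] = model.predict(df[['p']])
print(model.score(df[['p']], df['y']))
```

With this wrapper there is no separate yn column to maintain, and the score is reported on the original scale of y.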
Linearization
In the example above we could not use the poly kernel; we had to use the rbf kernel instead. This is because, if we aim to fit a rational function using a polynomial, it is better to transform the data before fitting, using a p = x/(x-b) substitution in the first place. In this case the problem merely boils down to performing a linear regression. The example below shows that it works:
The scaler and the transformation can be composed into a pipeline as well. We define a pipeline that linearizes and scales the target:
# Rational Fraction Substitution with consecutive Standardization
ysc = make_pipeline(
    FunctionTransformer(func=lambda x: x/(x+1),
                        inverse_func=lambda x: x/(1-x),
                        check_inverse=True),
    StandardScaler()
)
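Then we recompute the scaled target with this transformation and regress the data using classical OLS:
# Refit the target transformation so that 'yn' holds the linearized, scaled target
df['yn'] = ysc.fit_transform(df[['y']]).ravel()
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(df[['p']], df['yn'])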
Which provides correct result:
reg.score(df[['p']], df['yn']) # 0.9999998722172933
This second solution takes advantage of a known linearization and thus removes the need to tune the model's parameters.
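To see why the substitution works, note that for y = 1e4/p the transformed target y/(y+1) equals 1e4/(1e4+p), which is almost exactly linear in p over this range. The sketch below checks this with plain NumPy; np.polyfit stands in for the OLS step, and the variable names are chosen for illustration only.

```python
import numpy as np

# Same data as above: p runs from 15 down to 2 and y = 1e4 / p.
p = 15 - np.arange(14)
y = 1e4 / p

# Forward substitution: z = y / (y + 1), which equals 1e4 / (1e4 + p).
z = y / (y + 1)

# Ordinary least squares line through (p, z).
slope, intercept = np.polyfit(p, z, 1)
z_hat = slope * p + intercept

# Inverse substitution brings the fitted values back to the scale of y.
y_hat = z_hat / (1 - z_hat)

max_rel_err = np.max(np.abs(y_hat - y) / y)
print(max_rel_err)
```

The straight-line fit is made on the transformed target, and the inverse substitution maps it back to y, so no kernel or regularization parameters need to be chosen.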

How to scale a dataframe with datetime field in it (as a index)?

I want to scale a dataframe, but doing so raises the error shown below.
My data:
df.head()
timestamp open high low close volume
0 2020-06-25 303.4700 305.26 301.2800 304.16 46340400
1 2020-06-24 309.8400 310.51 302.1000 304.09 123867696
2 2020-06-23 313.4801 314.50 311.6101 312.05 68066900
3 2020-06-22 307.9900 311.05 306.7500 310.62 74007212
4 2020-06-19 314.1700 314.38 306.5300 308.64 135211345
My code:
# Converting the index as date
from datetime import datetime
df.index = pd.to_datetime(df.index)
# Split data
split = len(df) - int(len(df) * 0.8)
df_train = df.iloc[split:]
df_test = df.iloc[:split]
# Normalize
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
df_train = df_train.values.reshape(-1,1) #df_train = scaler.fit_transform(df_train)
df_test = df_test.values.reshape(-1,1) #df_test = scaler.fit_transform(df_train)
# Train the Scaler with training data and smooth data
timestep = 21
for i in range(0,len(df),timestep):
    df_train = scaler.fit_transform(df_train[i:i+timestep,:])
    #train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])
# You normalize the last bit of remaining data
df_test = scaler.fit_transform(df_test[i+timestep:,:])
#train_data[di+timestep:,:] = scaler.transform(train_data[di+timestep:,:])
The error:
2 timestep = 21
3 for i in range(0,len(df),timestep):
----> 4 df_train = scaler.fit_transform(df_train[i:i+timestep,:])
5 #train_data[di:di+smoothing_window_size,:] = scaler.transform(train_data[di:di+smoothing_window_size,:])
ValueError: could not convert string to float: '2020-05-28'
Help would be appreciated.
Simply iterate through the columns and scale each one individually, like this:
for col in X.columns:
    X[col] = StandardScaler().fit_transform(X[col].to_numpy().reshape(-1, 1)).ravel()
If you want to do something similar within a scikit-learn pipeline, you can create your own scaler like this:
class Scaler(StandardScaler):
    def __init__(self):
        super().__init__()
    def fit_transform(self, X, y=None):
        # Scale every column independently on a copy of the input frame
        X = X.copy()
        for col in X.columns:
            X[col] = StandardScaler().fit_transform(X[col].to_numpy().reshape(-1, 1)).ravel()
        return X
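The ValueError in the question comes from the date strings reaching the scaler as data. One way around it, sketched below on a few rows copied from the question, is to move timestamp into the index so that only numeric columns are scaled, fit the scaler on the training rows, and reuse it for the test rows. The chronological 80/20 split and the choice of columns are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# A few rows from the question; timestamp starts out as a string column.
df = pd.DataFrame({
    'timestamp': ['2020-06-25', '2020-06-24', '2020-06-23', '2020-06-22', '2020-06-19'],
    'open': [303.47, 309.84, 313.4801, 307.99, 314.17],
    'close': [304.16, 304.09, 312.05, 310.62, 308.64],
    'volume': [46340400, 123867696, 68066900, 74007212, 135211345],
})

# Parse the dates and move them into the index, leaving only numeric columns.
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.set_index('timestamp').sort_index()

# Chronological split: oldest 80% for training, the rest for testing.
split = int(len(df) * 0.8)
train, test = df.iloc[:split], df.iloc[split:]

# Fit on the training rows only, then apply the same scaling to the test rows.
scaler = MinMaxScaler()
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            index=train.index, columns=train.columns)
test_scaled = pd.DataFrame(scaler.transform(test),
                           index=test.index, columns=test.columns)
print(train_scaled)
```

Wrapping the scaled arrays back into DataFrames keeps the datetime index and the column names available for later steps.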

Filter Dataset to get just images from specific class

I want to prepare the omniglot dataset for n-shot learning.
Therefore I need 5 samples from 10 classes (alphabet)
Code to reproduce
import tensorflow as tf
import tensorflow_datasets as tfds
import numpy as np
builder = tfds.builder("omniglot")
# assert builder.info.splits['train'].num_examples == 60000
builder.download_and_prepare()
# Load data from disk as tf.data.Datasets
datasets = builder.as_dataset()
dataset, test_dataset = datasets['train'], datasets['test']
def resize(example):
    image = example['image']
    image = tf.image.resize(image, [28, 28])
    image = tf.image.rgb_to_grayscale(image, )
    image = image / 255
    one_hot_label = np.zeros((51, 10))
    return image, one_hot_label, example['alphabet']
def stack(image, label, alphabet):
    return (image, label), label[-1]
def filter_func(image, label, alphabet):
    # get just images from alphabet in array, not just 2
    arr = np.array([2, 3, 4, 5])
    result = tf.reshape(tf.equal(alphabet, 2 ), [])
    return result
# correct size
dataset = dataset.map(resize)
# now filter the dataset for the batch
dataset = dataset.filter(filter_func)
# infinite stream of batches (classes*samples + 1)
dataset = dataset.repeat().shuffle(1024).batch(51)
# stack the images together
dataset = dataset.map(stack)
dataset = dataset.shuffle(buffer_size=1000)
dataset = dataset.batch(32)
for i, (image, label) in enumerate(tfds.as_numpy(dataset)):
    print(i, image[0].shape)
Now I want to filter the images in the dataset by using the filter function.
tf.equal only lets me filter by one class; I want something like "tensor in array".
Do you see a way of doing this with the filter function?
Or is this the wrong way and there is a much simpler way?
I want to create a batch of 51 images and corresponding labels, which are from the same N=10 classes. From every class, I need K=5 different images and an additional one (which I need to classify). Every batch of N*K+1 (51) images should be from 10 new random classes.
Thank you very much in advance.
To KEEP only specific labels use this predicate:
dataset = datasets['train']
def predicate(x, allowed_labels=tf.constant([0, 1, 2])):
    label = x['label']
    isallowed = tf.equal(allowed_labels, tf.cast(label, allowed_labels.dtype))
    reduced = tf.reduce_sum(tf.cast(isallowed, tf.float32))
    return tf.greater(reduced, tf.constant(0.))
dataset = dataset.filter(predicate).batch(20)
for i, x in enumerate(tfds.as_numpy(dataset)):
    print(x['label'])
# [1 0 0 1 2 1 1 2 1 0 0 1 2 0 1 0 2 2 0 1]
# [1 0 2 2 0 2 1 2 1 2 2 2 0 2 0 2 1 2 1 1]
# [2 1 2 1 0 1 1 0 1 2 2 0 2 0 1 0 0 0 0 0]
allowed_labels specifies labels you want to keep. All labels that are not in this tensor will be filtered out.
