I have a dataset of 5 features. Two of these features are very similar but do not have the same min and max values.
... | feature 2 | feature 3 | ...
--------------------------------
..., 208.429993, 206.619995, ...
..., 207.779999, 205.050003, ...
..., 206.029999, 203.410004, ...
..., 204.429993, 202.600006, ...
..., 206.429993, 204.25, ...
Feature 3 is always smaller than feature 2, and it is important that it stays that way after scaling. But since feature 2 and feature 3 do not have exactly the same min and max values, after scaling they will both end up with 0 and 1 as min and max by default. This will remove the relationship between the values. In fact, after scaling, the first sample becomes:
... | feature 2 | feature 3 | ...
--------------------------------
..., 0.00268, 0.00279, ...
This is something that I do not want. I cannot seem to find a way to manually change the min and max values of MinMaxScaler. There are other ugly hacks, such as manipulating the data and combining feature 2 and feature 3 into one column for the scaling and splitting them again afterward. But I would like to know first whether there is a solution handled by sklearn itself, such as using the same min and max for multiple features.
Otherwise, the simplest workaround would do.
Fit the scaler with one column and transform both. Trying it with the data you posted:
feature_1 feature_2
0 208.429993 206.619995
1 207.779999 205.050003
2 206.029999 203.410004
3 204.429993 202.600006
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# learn min and max from feature_2 only
scaler.fit(df['feature_2'].values.reshape(-1, 1))
# apply that single min/max to both columns
scaler.transform(df)
array([[1.45024949, 1. ],
[1.288559 , 0.60945366],
[0.85323442, 0.20149259],
[0.45522189, 0. ]])
If you scale data that are outside of the range you used to fit the scaler, the scaled data will be outside of [0,1].
The only way to avoid it is to scale each column individually.
Whether or not this is a problem depends on what you want to do with the data after scaling.
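Note that newer scikit-learn releases may refuse to transform a frame with a different number of columns than was used for fit; a column-by-column variant of the same idea (a sketch, not part of the original answer) would then be:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({
    'feature_1': [208.429993, 207.779999, 206.029999, 204.429993],
    'feature_2': [206.619995, 205.050003, 203.410004, 202.600006],
})

scaler = MinMaxScaler()
scaler.fit(df[['feature_2']])            # learn min/max from one column only

scaled = df.copy()
for col in df.columns:                   # apply the same min/max to every column
    scaled[col] = scaler.transform(df[[col]]).ravel()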
Related
I have a dataframe, df, as below:
Name   Sales   Book   Sign
--------------------------
Andy   10000      2     10
Bobo   20000      3     20
Tim        0      2     14
...      ...    ...    ...
And I would like to perform PCA with n_components = 0.9
So I first use StandardScaler() from sklearn to standardize the values, getting the values below:
[[ 1.33865216 1.80350169 1.90692518 1.40305228]
[ 0.98050987 0.68720789 0.33371191 0.67278892]
[ 0.95059432 1.10958933 0.47673129 0.85535476]
[-0.20264719 -0.54976631 -0.23836565 -0.81816542]
[-1.01921185 -1.63589 -0.52440442 -1.85270517]
[ 0.89958047 -0.03687457 0.90578946 0.79449948]
[-1.16715811 -0.85146734 -1.23950137 -0.81816542]
[-0.3867463 0.05363574 -0.09534626 0.33808489]]
Then I used from advanced_pca import CustomPCA to perform PCA with varimax rotation. Below is the code:
varimax_pca = CustomPCA(n_components=n_components, rotation='varimax', random_state = 9527)
However, I found something strange: the cumulative sum of explained_variance_ratio_ is greater than 1.
pca_var_ratio = varimax_pca.fit(Z).explained_variance_ratio_
print(pca_var_ratio)
>>>[0.57124482 1.09019268]
Is there a bug? Is it normal for the cumulative sum of the explained variance ratios to be greater than 1?
Thanks!
This is the link of advanced_pca module: https://pypi.org/project/advanced-pca/0.1/
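For comparison, here is a minimal sanity-check sketch on hypothetical data (the real Z is not shown in full): plain, unrotated sklearn PCA always reports ratios that sum to at most 1, which helps isolate whether the rotation step is responsible for the excess.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# hypothetical stand-in for the standardized matrix Z from the question
rng = np.random.default_rng(9527)
Z = StandardScaler().fit_transform(rng.normal(size=(8, 4)))

plain_pca = PCA(n_components=0.9).fit(Z)
print(plain_pca.explained_variance_ratio_)
print(plain_pca.explained_variance_ratio_.sum())  # never exceeds 1.0 for unrotated PCA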
Apologies in advance for any incorrect wording. The reason I am not finding answers to this might be because I am not using the right terminology.
I have a dataframe that looks something like
0 -0.004973 0.008638 0.000264 -0.021122 -0.017193
1 -0.003744 0.008664 0.000423 -0.021031 -0.015688
2 -0.002526 0.008688 0.000581 -0.020937 -0.014195
3 -0.001322 0.008708 0.000740 -0.020840 -0.012715
4 -0.000131 0.008725 0.000898 -0.020741 -0.011249
5 0.001044 0.008738 0.001057 -0.020639 -0.009800
6 0.002203 0.008748 0.001215 -0.020535 -0.008368
7 0.003347 0.008755 0.001373 -0.020428 -0.006952
8 0.004476 0.008758 0.001531 -0.020319 -0.005554
9 0.005589 0.008758 0.001688 -0.020208 -0.004173
10 0.006687 0.008754 0.001845 -0.020094 -0.002809
...
For each column, I would like to scale the data to floats between -1.0 and 1.0 based on that column's min and max.
I have tried scikit-learn's MinMaxScaler with scaler = MinMaxScaler(feature_range=(-1, 1)), but some values change sign as a result, which I need to preserve.
Is there a way to 'centre' the scaling on zero?
Have you tried using StandardScaler from sklearn?
It has with_mean and with_std options, which you can use to get the data you want.
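For reference, here is a minimal sketch of that suggestion (the column values are made up); note that StandardScaler divides by the standard deviation, so signs are preserved but the result is not guaranteed to stay within [-1, 1]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'a': [-0.004973, -0.003744, -0.002526, 0.001044, 0.006687]})

# with_mean=False keeps the values centred where they already are,
# with_std=True divides each column by its standard deviation
scaled = StandardScaler(with_mean=False, with_std=True).fit_transform(df)
print(scaled)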
The problem with scaling the negative values to the column's minimum value and the positive values to the column's maximum value is that the scale of the negative numbers may be different from the scale of the positive numbers. If you want to use the same scale for both negative and positive values, try the following:
def zero_centered_min_max_scaling(dataframe):
    """
    Scale the numerical values in the dataframe to be between -1 and 1,
    preserving the sign of all values.
    """
    df_copy = dataframe.copy(deep=True)
    for column in df_copy.columns:
        # dividing by the largest absolute value keeps the sign and bounds the column to [-1, 1]
        max_absolute_value = df_copy[column].abs().max()
        df_copy[column] = df_copy[column] / max_absolute_value
    return df_copy
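For example, applied to two of the columns above (the column names are invented for the sketch):
import pandas as pd

df = pd.DataFrame({
    'a': [-0.004973, -0.003744, 0.006687],
    'b': [-0.021122, -0.021031, -0.020094],
})
# each column is divided by its own maximum absolute value, so every result
# lies in [-1, 1] and keeps its original sign
print(zero_centered_min_max_scaling(df))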
I have a Pandas Series that needs to be log-transformed to become normally distributed. But I can't log-transform yet, because there are values equal to 0 and values below 1 (the range is 0-4000). Therefore I want to normalize the Series first. I have heard of StandardScaler (scikit-learn), Z-score standardization, and Min-Max scaling (normalization).
I want to cluster the data later, which would be the best method?
StandardScaler and Z-score standardization use the mean, variance, etc. Can I use them on data that is not yet normally distributed?
To take logarithms, you need positive values, so translate your range of values (-1, 1] to the normalized range (0, 1] as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.uniform(-1,1,(10,1)))
df['norm'] = (1+df[0])/2 # (-1,1] -> (0,1]
df['lognorm'] = np.log(df['norm'])
results in a dataframe like
0 norm lognorm
0 0.360660 0.680330 -0.385177
1 0.973724 0.986862 -0.013225
2 0.329130 0.664565 -0.408622
3 0.604727 0.802364 -0.220193
4 0.416732 0.708366 -0.344795
5 0.085439 0.542719 -0.611163
6 -0.964246 0.017877 -4.024232
7 0.738281 0.869141 -0.140250
8 0.558220 0.779110 -0.249603
9 0.485144 0.742572 -0.297636
If your data is in the range (-1;+1) (assuming you lost the minus in your question) then log transform is probably not what you need. At least from a theoretical point of view, it's obviously the wrong thing to do.
Maybe your data has already been preprocessed (inadequately)? Can you get the raw data? Why do you think log transform will help?
If you don't care about what the meaningful thing to do is, you can call log1p, which is the same as log(1+x) and thus works on (-1; ∞).
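A quick numpy sketch of that:
import numpy as np

x = np.array([-0.9, 0.0, 0.5, 4000.0])
print(np.log1p(x))  # log(1 + x); defined for every x > -1, including 0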
I'm using RandomForestRegressor (from the great scikit-learn library in Python) for my project.
It gives me good results, but I think I can do better.
When I'm passing features to the fit(..) function,
is it better to encode categorical features as binary features?
Example:
instead of:
continent
---------
        1
        2
        3
        2
make something like:
is_europe | is_asia | ...
-------------------------
        1 |       0 |
        0 |       1 |
Because it works as a tree, maybe the second option is better,
or will it work the same with the first option?
Thanks a lot!
Binarizing categorical variables is highly recommended and is expected to outperform a model trained without the binarizing transform. If scikit-learn treats continent = [1, 2, 3, 2] as numeric values (a continuous [quantitative] variable instead of a categorical [qualitative] one), it imposes an artificial order constraint on that feature. For example, suppose continent=1 means is_europe, continent=2 means is_asia, and continent=3 means is_america; then it implies that is_asia always lies between is_europe and is_america when examining the relation of the continent feature to your response variable y, which is not necessarily true and may reduce the model's effectiveness. In contrast, encoding it as dummy variables has no such problem, and scikit-learn will treat each binary feature separately.
To binarize your categorical variables in scikit-learn, you can use LabelBinarizer.
from sklearn.preprocessing import LabelBinarizer
# your data
# ===========================
continent = [1, 2, 3, 2]
continent_dict = {1:'is_europe', 2:'is_asia', 3:'is_america'}
print(continent_dict)
{1: 'is_europe', 2: 'is_asia', 3: 'is_america'}
# processing
# =============================
binarizer = LabelBinarizer()
# fit on the categorical feature
continent_dummy = binarizer.fit_transform(continent)
print(continent_dummy)
[[1 0 0]
[0 1 0]
[0 0 1]
[0 1 0]]
If you process your data in pandas, then its top-level function pandas.get_dummies also helps.
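For instance, here is a minimal get_dummies sketch of the same encoding (category names taken from the dictionary above):
import pandas as pd

continent = pd.Series([1, 2, 3, 2]).map({1: 'is_europe', 2: 'is_asia', 3: 'is_america'})
print(pd.get_dummies(continent))
# one indicator column per category, analogous to the LabelBinarizer output above
# (columns are ordered alphabetically, so the order may differ)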
I'm quite new to scikit-learn and was going through some of the examples of learning and predicting the samples in the iris dataset. But how do I load an external dataset for this purpose?
I downloaded a dataset that has data in the following form:
id attr1 attr2 .... label
123 0 0 ..... abc
234 0 0 ..... dsf
....
....
So how should I load this dataset in order to learn and draw prediction? Thanks.
One option is to use pandas. Assuming the data is space-separated:
import pandas as pd
X = pd.read_csv('data.txt', sep=' ').values
where read_csv returns a DataFrame, and the values attribute returns a numpy array containing the data. You might want to separate out the last column of the above X as the labels, say into a one-dimensional array y:
X, y = X[:, :-1], X[:, -1]
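From there, a hedged sketch of feeding the arrays to an estimator (the choice of RandomForestClassifier and the split sizes are just examples):
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X and y as built above; you may want to drop the id column first, e.g. X = X[:, 1:]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))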