Scikit-learn: How to normalize row values horizontally? - python

I would like to normalize the values below horizontally (per row) instead of vertically (per column). The code reads the csv file shown after the code and writes a new csv file with the normalized values. How can I make it normalize horizontally? The code is as follows:
Code
# norm_code.py
# normalization = (x - min) / (max - min)
import numpy as np
from sklearn import preprocessing
all_data = np.loadtxt(open("c:/Python27/test.csv", "r"),
                      delimiter=",",
                      skiprows=0,
                      dtype=np.float64)
x = all_data[:]
print('total number of samples (rows):', x.shape[0])
print('total number of features (columns):', x.shape[1])
# MinMaxScaler scales each column (feature) independently
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit(x)
X_minmax = minmax_scale.transform(x)
with open('test_norm.csv', "w") as f:
    f.write("\n".join(",".join(map(str, row)) for row in X_minmax))
test.csv
1 2 0 4 3
3 2 1 1 0
2 1 1 0 1

MinMaxScaler scales each column independently, so you can simply operate on the transpose and then transpose the result back:
minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit(x.T)
X_minmax=minmax_scale.transform(x.T).T

One-liner answer without sklearn (note the x.T, which lets the per-row minima and ranges broadcast correctly):
X_minmax = np.transpose((x.T - np.min(x, axis=1)) / (np.max(x, axis=1) - np.min(x, axis=1)))
This is about 8x faster than using the MinMaxScaler from preprocessing.

import numpy as np
from sklearn.preprocessing import MinMaxScaler
data = np.array([[1, 2, 0, 4, 3],
                 [3, 2, 1, 1, 0],
                 [2, 1, 1, 0, 1]])
scaler = MinMaxScaler()
print(data)
print(scaler.fit_transform(data.T).T)# row-wise transform
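As a possible alternative to the transpose trick, here is a minimal NumPy-only sketch of row-wise min-max scaling. It uses keepdims=True so the per-row minima and ranges keep shape (n_rows, 1) and broadcast against the original array. The helper name row_minmax is hypothetical, and mapping constant rows to zeros is an assumption made here to avoid dividing by zero.

```python
import numpy as np

def row_minmax(x):
    """Scale every row of a 2D array to the range [0, 1]."""
    x = np.asarray(x, dtype=np.float64)
    row_min = x.min(axis=1, keepdims=True)            # shape (n_rows, 1)
    row_range = x.max(axis=1, keepdims=True) - row_min
    # Assumption: a constant row has range 0; divide by 1 so it maps to all zeros.
    safe_range = np.where(row_range == 0, 1.0, row_range)
    return (x - row_min) / safe_range

data = np.array([[1, 2, 0, 4, 3],
                 [3, 2, 1, 1, 0],
                 [2, 1, 1, 0, 1]])
scaled = row_minmax(data)
print(scaled)
```

Each row of scaled should then have a minimum of 0 and a maximum of 1, independently of the other rows.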

Related

How to get subset of dataset after K-means clustering

I have a dataset val_lab as follows:
[[ 52.85560436 -23.61958699 34.40273147]
[ 70.44462451 -2.74272277 80.32988099]
[ 38.32222473 -11.22753928 24.09593474]
[ 84.83470029 -7.73898094 28.03636332]
[ 76.48246093 0.13784934 76.23718213]
[ 61.21154496 2.24080039 9.38927616]
[ 39.88027333 37.32959609 -19.0592156 ]...]
I use K-means clustering from sklearn and got the prediction value:
from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters= 5 , random_state=0 ).fit_predict(val_lab)
>>>[3 0 1 3 0 3 4 1 4 1 1 1 1 1 1 4 0 3 1 0 3...]
Now I want to get the values in every cluster. For example, for y_pred == 3
I would get:
[[ 52.85560436 -23.61958699 34.40273147]
[ 84.83470029 -7.73898094 28.03636332]
... ]
(rows 0 and 3)
Right now, my idea is:
val_lab_3 = []
for i in range(y_pred.shape[0]):
    if y_pred[i] == 3:
        val_lab_3.append(val_lab[i, :])
Is there a better way? I want to get the subsets for all the clusters. Is this too complicated, especially with more clusters?
So if I'm understanding this correctly, your rows above are being classified as 0,1,2,3,4 (5 clusters from what I see) and you want to get all of them together.
Pandas would be a good utility. You can use this cluster prediction and make it a new column, then just select those rows where your cluster label is 3
e.g. (assuming you call your new column preds and your original numpy array is called val_lab):
import pandas as pd
df = pd.DataFrame(val_lab)
df['preds'] = y_pred
threes = df[df['preds'] == 3] # This is what you want
print(threes)
I assume val_lab is a numpy array. In that case,
val_lab[y_pred == 3, :]
Will work.
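To get the subsets for all clusters at once, the same boolean mask can be applied once per label. The following is only a sketch: the small val_lab and y_pred arrays are hypothetical stand-ins for the real data and the fit_predict output, and subsets is a name chosen for illustration.

```python
import numpy as np

# Hypothetical stand-ins for val_lab and the fit_predict output.
val_lab = np.array([[52.9, -23.6, 34.4],
                    [70.4,  -2.7, 80.3],
                    [38.3, -11.2, 24.1],
                    [84.8,  -7.7, 28.0],
                    [76.5,   0.1, 76.2]])
y_pred = np.array([3, 0, 1, 3, 0])

# One boolean mask per cluster label; keys are the labels that actually occur.
subsets = {int(label): val_lab[y_pred == label] for label in np.unique(y_pred)}

for label, rows in subsets.items():
    print(label, rows.shape)
```

subsets[3] then holds the rows assigned to cluster 3, and the approach does not change as the number of clusters grows.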

How to convert pandas data frame to NumPy array?

Following the suggestions I got from my previous question here, I'm converting a Pandas data frame to a numeric NumPy array. To do this I used numpy.asarray.
My data frame:
DataFrame
----------
label vector
0 0 1:0.0033524514 2:-0.021896651 3:0.05087798 4:...
1 0 1:0.02134219 2:-0.007388343 3:0.06835007 4:0....
2 0 1:0.030515702 2:-0.0037591448 3:0.066626 4:0....
3 0 1:0.0069114454 2:-0.0149497045 3:0.020777626 ...
4 1 1:0.003118149 2:-0.015105667 3:0.040879637 4:...
... ... ...
19779 0 1:0.0042634667 2:-0.0044222944 3:-0.012995412...
19780 1 1:0.013818732 2:-0.010984628 3:0.060777966 4:...
19781 0 1:0.00019213723 2:-0.010443398 3:0.01679976 4...
19782 0 1:0.010373874 2:0.0043582567 3:-0.0078354385 ...
19783 1 1:0.0016790542 2:-0.028346825 3:0.03908631 4:...
[19784 rows x 2 columns]
DataFrame datatypes :
label object
vector object
dtype: object
To convert into a Numpy Array I'm using this script:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn import metrics
from sklearn.preprocessing import OneHotEncoder
import numpy as np
import matplotlib.pyplot as plt
r_filenameTSV = 'TSV/A19784.tsv'
tsv_read = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame(tsv_read)
df = pd.DataFrame(df.vector.str.split(' ',1).tolist(),
columns = ['label','vector'])
print('DataFrame\n----------\n', df)
print('\nDataFrame datatypes :\n', df.dtypes)
arr = np.asarray(df, dtype=np.float64)
print('\nNumpy Array\n----------\n', arr)
print('\nNumpy Array Datatype :', arr.dtype)
I'm getting this error from line 22, arr = np.asarray(df, dtype=np.float64):
ValueError: could not convert string to float: ' 1:0.0033524514 2:-0.021896651 3:0.05087798 4:0.0072637126 5:-0.013740167 6:-0.0014883851 7:0.02230502 8:0.0053563705 9:0.00465044 10:-0.0030826542 11:0.010156203 12:-0.021754289 13:-0.03744049 14:0.011198468 15:-0.021201309 16:-0.0006497681 17:0.009229079 18:0.04218278 19:0.020572046 20:0.0021593391 ...
How can I solve this issue?
Regards and thanks for your time
Use a list comprehension that builds one dictionary per row (splitting each index:value pair on the colon) and pass the result to DataFrame:
df = pd.read_csv(r_filenameTSV, sep='\t',names=["vector"])
df = pd.DataFrame([dict(y.split(':') for y in x.split()) for x in df['vector']])
print (df)
1 2 3 4
0 0.0033524514 -0.021896651 0.05087798 0
1 0.02134219 -0.007388343 0.06835007 0
2 0.030515702 -0.0037591448 0.066626 0
3 0.0069114454 -0.0149497045 0.020777626 0
4 0.003118149 -0.015105667 0.040879637 0.4
And then convert to floats and to numpy array:
print (df.astype(float).to_numpy())
[[ 0.00335245 -0.02189665 0.05087798 0. ]
[ 0.02134219 -0.00738834 0.06835007 0. ]
[ 0.0305157 -0.00375914 0.066626 0. ]
[ 0.00691145 -0.0149497 0.02077763 0. ]
[ 0.00311815 -0.01510567 0.04087964 0.4 ]]
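If the label column is needed as well, the rows can also be parsed without pandas. This is a rough sketch under stated assumptions: the two sample lines are hypothetical and shortened, every row is assumed to list the same indices in the same order, and the names lines, X and y are chosen for illustration.

```python
import numpy as np

# Hypothetical sample lines in the "label index:value ..." layout from the question.
lines = [
    "0 1:0.0033524514 2:-0.021896651 3:0.05087798",
    "1 1:0.003118149 2:-0.015105667 3:0.040879637",
]

labels = []
rows = []
for line in lines:
    label, *pairs = line.split()
    labels.append(int(label))
    # Keep only the value after the colon; assumes identical, ordered indices per row.
    rows.append([float(pair.split(":")[1]) for pair in pairs])

y = np.array(labels)
X = np.array(rows, dtype=np.float64)
print(X.shape, y)
```

The layout resembles the svmlight/libsvm format, so if the file really follows that format, sklearn.datasets.load_svmlight_file may be able to read it directly.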
It seems one of your columns is a string, not a number. Either remove that column or encode it as numbers before converting the dataframe to an array.

How to select multiple rows from a (geo)pandas dataframe based on an array or propagate metadata of a clustering algorithm result?

I have a geopandas data frame that contains a polygon, a region_id, and the center point lat and lon in radians.
I then wanted to cluster the regions by their center points and did the following:
# Set up
import pandas as pd
kms_per_radian = 6371.0088
epsilon = 0.1 / kms_per_radian
coords = blocks_meta[['lat', 'lon']].to_numpy()
# Cluster
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=epsilon, algorithm='ball_tree', metric='haversine', min_samples=1).fit(coords)
labels = db.labels_
clusters = pd.Series([coords[labels == n] for n in range(len(set(labels)))])
which yields an array of clusters of center points that looks like this.
array([[ 0.0703843 , 0.170845 ],
[ 0.07037922, 0.17084981],
[ 0.07036705, 0.17085678],
[ 0.0703715 , 0.17083775]])
What I am struggling with is how to get the region_ids associated with each cluster, so that I can merge the polygons into one bigger region, without looping through each cluster and each lat, lon pair and querying the dataframe.
Is there a way of propagating the ids or querying the dataframe per cluster?
Any help here would be appreciated.
Thanks!
EDIT
What I want to avoid is doing this:
clusters_of_regions = []
for cluster in clusters:
    cluster_of_regions_ids = []
    for entry in cluster:
        print(cluster[0][0])
        region_id = blocks_meta.loc[blocks_meta['lat'] == cluster[0][0]]['region_id'][1]
        cluster_of_regions_ids.append(region_id)
    clusters_of_regions.append(cluster_of_regions_ids)
I want to avoid the nested for loop, and whenever I try it I keep getting a KeyError.
Is there a way to cluster on the regions themselves, using the center points as properties?
Thanks
Check the example from sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). I modified it here to use a dataframe and resemble your example.
from sklearn.cluster import DBSCAN
import pandas as pd
import numpy as np
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
df = pd.DataFrame(X, index=list(range(len(X))), columns = ['col1', 'col2'])
clustering = DBSCAN(eps = 3, min_samples = 2).fit(df)
labels = clustering.labels_
df = df.merge(pd.Series(labels).to_frame().rename(columns={0:'clusters'}), left_index = True, right_index = True, how = 'outer')
df
Gives you:
col1 col2 clusters
0 1 2 0
1 2 2 0
2 2 3 0
3 8 7 1
4 8 8 1
5 25 80 -1
According to the description:
labels_ : array, shape = [n_samples] Cluster labels for each point in
the dataset given to fit(). Noisy samples are given the label -1.
In the example, you get two groups (labels 0 and 1). The -1 marks a 'noisy' sample; here that sample is clearly far away from the others.
If you do something similar, you can have your region_id and the label next to each other and check whether there is a 1:1 relation or not.
I think your groups are in your labels.
I think what you want is this (I am using labels = [1,2,3,4], and ar is the array of center points from your question):
df1 = pd.DataFrame(ar)
df1.loc[:,'labels'] = pd.Series(labels)
df1
That will create a df like this one :
0 1 labels
0 0.070384 0.170845 1
1 0.070379 0.170850 2
2 0.070367 0.170857 3
3 0.070372 0.170838 4
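Building on the idea of keeping the labels next to the rows, the region ids per cluster can be collected with a groupby instead of nested loops. This is a sketch under assumptions: the small blocks_meta frame, its region_id values and the labels array are hypothetical stand-ins, and a real GeoDataFrame would also carry a geometry column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for blocks_meta (no geometry column here).
blocks_meta = pd.DataFrame({
    "region_id": [101, 102, 103, 104],
    "lat": [0.0703843, 0.07037922, 0.07036705, 0.0703715],
    "lon": [0.170845, 0.17084981, 0.17085678, 0.17083775],
})
# Hypothetical labels; in the question these would come from db.labels_.
labels = np.array([0, 0, 1, 1])

# labels_ follows the order of the rows passed to fit(), so it can be assigned positionally.
blocks_meta["cluster"] = labels
regions_per_cluster = blocks_meta.groupby("cluster")["region_id"].apply(list)
print(regions_per_cluster)
```

With the cluster column in place, geopandas' dissolve(by="cluster") is probably the most direct way to merge the polygons of each cluster, although that step is not shown or tested here.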

Python SKLearn: How to Get Feature Names After OneHotEncoder?

I would like to get the feature names of a data set after it has been transformed by SKLearn OneHotEncoder.
The documentation of the active_features_ attribute of OneHotEncoder explains very well how the attributes n_values_, feature_indices_ and active_features_ get filled after transform() has been executed.
My question is:
For e.g. DataFrame based input data:
data = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]}).as_matrix()
What does the code look like that gets from the original feature names a, b and c to a list of the transformed feature names
(like e.g:
a-0,a-1, a-2, b-0, b-1, b-2, b-3, c-0, c-1, c-2, c-3
or
a-0,a-1, a-2, b-0, b-1, b-2, b-3, b-4, b-5, b-6, b-7, b-8
or anything that helps to see the assignment of encoded columns to the original columns).
Background: I would like to see the feature importances of some of the algorithms to get a feeling for which feature have the most effect on the algorithm used.
You can use pd.get_dummies():
pd.get_dummies(data["a"],prefix="a")
will give you:
a_0 a_1 a_2
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
which automatically generates the column names. You can apply this to all your columns and then read off the column names. There is no need to convert them to a numpy matrix.
So with:
df = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]})
data = df.as_matrix()
the solution looks like:
columns = df.columns
my_result = pd.DataFrame()
temp = pd.DataFrame()
for runner in columns:
    temp = pd.get_dummies(df[runner], prefix=runner)
    my_result[temp.columns] = temp
print(my_result.columns)
>>Index(['a_0', 'a_1', 'a_2', 'b_0', 'b_1', 'b_4', 'b_5', 'c_0', 'c_1', 'c_4',
'c_5'],
dtype='object')
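The loop above can probably be shortened, because pd.get_dummies also accepts a whole DataFrame. A minimal sketch, assuming the same toy frame; the names encoded and feature_names are chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 1, 2, 0], "b": [0, 1, 4, 5], "c": [0, 1, 4, 5]})

# Passing columns= forces the integer columns to be treated as categorical.
encoded = pd.get_dummies(df, columns=["a", "b", "c"])
feature_names = list(encoded.columns)
print(feature_names)
```

The resulting names use the original column name as prefix, so each encoded column can be traced back to its source column.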
If I understand correctly you can use feature_indices_ to identify which columns correspond to which feature.
e.g.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]}).as_matrix()
ohe = OneHotEncoder(sparse=False)
ohe_fitted = ohe.fit_transform(data)
print(ohe_fitted)
print(ohe.feature_indices_) # [ 0 3 9 15]
From the above feature_indices_ we know that if we slice the one-hot encoded data from 0:3, we get the features corresponding to the first column in data, like so:
print(ohe_fitted[:,0:3])
Each column in the sliced data represents a value of the first feature. The first column is 0, the second is 1 and the third is 2. To illustrate this on the sliced data, the column labels would look like:
a_0 a_1 a_2
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]]
Note that features are sorted first before they are encoded.
You can do that with the open source package feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder
# load titanic data from OpenML
data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
# divide into train and test
X_train, X_test, y_train, y_test = train_test_split(
    data[['sex', 'embarked']],  # predictors for this example
    data['survived'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility
ohe_enc = OneHotEncoder(
    top_categories=None,
    variables=['sex', 'embarked'],
    drop_last=True)
ohe_enc.fit(X_train)
X_train = ohe_enc.transform(X_train)
X_test = ohe_enc.transform(X_test)
X_train.head()
You should see this output returned:
sex_female embarked_S embarked_C embarked_Q
501 1 1 0 0
588 1 1 0 0
402 1 0 1 0
1193 0 0 0 1
686 1 0 0 1
More details about feature engine here:
https://www.trainindata.com/feature-engine
https://github.com/feature-engine/feature_engine
https://feature-engine.readthedocs.io/en/latest/
There is a OneHotEncoder that does all the work for you.
The package sksurv has a OneHotEncoder that returns a pandas DataFrame with all the column names set up for you. Check it out. Make sure you set up a separate environment to try the encoder, so that it doesn't break your current environment. This encoder saved me a lot of time and effort.
scikit-survival GitHub
OneHotEncoder Documentation
OneHotEncoder now has a method get_feature_names (renamed to get_feature_names_out in scikit-learn 1.0). You can pass input_features=data.columns to match the names to the training data.

numpy wrong shape of imported data and separating the y value

I have a large csv file ~90k rows and 355 columns. The first 354 columns correspond to the presence of different words, showing a 1 or 0 and the last column to a numerical value.
Eg:
table, box, cups, glasses, total
1,0,0,1,30
0,1,1,1,28
1,1,0,1,55
When I use:
d = np.recfromcsv('clean.csv', dtype=None, delimiter=',', names=True)
d.shape
# I get: (89460,)
So my questions are:
How do I get a 2d array/matrix? Does it matter?
How can I separate the 'total' column so I can create train, cross_validation and test sets and train a model?
np.recfromcsv returns a 1-dimensional record array.
When you have a structured array, you can access the columns by field title:
d['total']
returns the totals column.
You can access rows using integer indexing:
d[0]
returns the first row, for example.
If you wish to select all the columns except the last one, then you'd be better off using a plain 2D NumPy array. With a plain NumPy array (as opposed to a structured array) you can select all the columns except the last one using integer indexing.
You could use np.genfromtxt to load the data into a 2D array:
import numpy as np
d = np.genfromtxt('data', dtype=None, delimiter=',', skip_header=1)
print(d.shape)
# (3, 5)
print(d)
# [[ 1 0 0 1 30]
# [ 0 1 1 1 28]
# [ 1 1 0 1 55]]
This selects the last column:
print(d[:,-1])
# [30 28 55]
This selects everything but the last column:
print(d[:,:-1])
# [[1 0 0 1]
# [0 1 1 1]
# [1 1 0 1]]
Ok after much googling and time wasting this is what anyone trying to get numpy out of the way so they can read a CSV and get on with Scikit Learn needs to do:
# Say your csv file has 10 columns, 1-9 are features and 10
# is the Y you're trying to predict.
cols = range(0, 9)  # columns 0-8 are the features; column 9 is the Y
X = np.loadtxt('data.csv', delimiter=',', dtype=float, usecols=cols, ndmin=2, skiprows=1)
Y = np.loadtxt('data.csv', delimiter=',', dtype=float, usecols=(9,), ndmin=2, skiprows=1)
# note how for Y the usecols argument only takes a sequence,
# even though I only want 1 column I have to give it a sequence.
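Putting the pieces together, here is a hedged sketch that loads the data once, separates the last column, and splits the row indices into train, validation and test parts. The in-memory CSV is a hypothetical stand-in for clean.csv, and the 60/20/20 split fractions and the seed are assumptions.

```python
import io
import numpy as np

# Hypothetical in-memory stand-in for clean.csv, using the sample rows from the question.
csv_text = """table,box,cups,glasses,total
1,0,0,1,30
0,1,1,1,28
1,1,0,1,55
"""

# skip_header=1 drops the column names so the result is a plain 2D float array.
data = np.genfromtxt(io.StringIO(csv_text), delimiter=",", skip_header=1)
X, y = data[:, :-1], data[:, -1]

# Shuffle the row indices once, then cut them into train / validation / test parts.
rng = np.random.default_rng(0)
idx = rng.permutation(len(data))
n_train = int(0.6 * len(data))  # assumed 60% for training
n_val = int(0.2 * len(data))    # assumed 20% for validation
train_idx, val_idx, test_idx = np.split(idx, [n_train, n_train + n_val])

X_train, y_train = X[train_idx], y[train_idx]
print(X.shape, y.shape)
```

With only three sample rows the parts are tiny, but the same slicing should carry over to the full file of roughly 90k rows.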
