How to get subset of dataset after K-means clustering - python

I have a dataset val_lab as follows:
[[ 52.85560436 -23.61958699 34.40273147]
[ 70.44462451 -2.74272277 80.32988099]
[ 38.32222473 -11.22753928 24.09593474]
[ 84.83470029 -7.73898094 28.03636332]
[ 76.48246093 0.13784934 76.23718213]
[ 61.21154496 2.24080039 9.38927616]
[ 39.88027333 37.32959609 -19.0592156 ]...]
I used K-means clustering from sklearn and got the predicted labels:
from sklearn.cluster import KMeans
y_pred = KMeans(n_clusters=5, random_state=0).fit_predict(val_lab)
>>>[3 0 1 3 0 3 4 1 4 1 1 1 1 1 1 4 0 3 1 0 3...]
Now I want to get the values in every cluster; for example, for y_pred == 3
I get:
[[ 52.85560436 -23.61958699 34.40273147]
[ 84.83470029 -7.73898094 28.03636332]
... ]
(rows 0 and 3)
Right now, my idea is:
val_lab_3 = []
for i in range(y_pred.shape[0]):
    if y_pred[i] == 3:
        val_lab_3.append(val_lab[i, :])
Is there a better way? I want to get the subsets for all the clusters, and this seems too complicated, especially with more clusters.

So if I'm understanding this correctly, your rows above are being classified into clusters 0, 1, 2, 3, 4 (5 clusters from what I see) and you want to collect the rows belonging to each cluster.
Pandas would be a good utility. You can make the cluster predictions a new column, then just select those rows where your cluster label is 3,
e.g. (assuming you call your new column preds and your original numpy array is called val_lab):
import pandas as pd
df = pd.DataFrame(val_lab)
df['preds'] = y_pred
threes = df[df['preds'] == 3] # This is what you want
print(threes)
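If you want every cluster at once rather than just label 3, groupby does the splitting for you; a small sketch continuing from the df above:
# one sub-DataFrame per cluster label (dropping the helper column again)
subsets = {label: group.drop(columns='preds')
           for label, group in df.groupby('preds')}
print(subsets[3])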

I assume val_lab is a numpy array. In that case,
val_lab[y_pred == 3, :]
will work.
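The same idea extends to every cluster with one boolean mask per label. A minimal sketch, assuming val_lab is the NumPy array and y_pred the labels from fit_predict above:
import numpy as np

# dict mapping each cluster label to the rows assigned to it
clusters = {label: val_lab[y_pred == label] for label in np.unique(y_pred)}
print(clusters[3])                                    # rows in cluster 3
print({k: v.shape[0] for k, v in clusters.items()})   # cluster sizes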

Pandas to_csv is excluding rows from my target variable

I'm using a simple RandomForestRegressor script to predict a target variable. I'm trying to write a new CSV based on my training / validation data to include the actual value and the predicted value. However, when I export the data, the "Predicted Values" column is missing about half the values, and the values that do show up don't correlate well with the features / actual values. It seems like the values are randomized and then assigned to the first half of the rows.
To test, I've tried not splitting the data between validation and training data in the first place. I'm still finding the same problem.
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
#file path
My_File_Path = "//path.csv"
#read the file
My_Data = pd.read_csv(My_File_Path)
#drop the null values
My_Data = My_Data.dropna(axis=0)
#define the target variable
y = My_Data.Annualized_2018_Payments
my_features = ['feature1','feature2','feature3']
#define the features
x = My_Data[my_features]
#set the split data
train_x, val_x, train_y, val_y = train_test_split(x, y, random_state = 1)
forest_model = RandomForestRegressor(random_state = 1)
forest_model.fit(train_x, train_y)
WA_My_preds = forest_model.predict(val_x)
print("MAE for validation data is ", mean_absolute_error(val_y, WA_My_preds))
#print(My_Data.columns)
My_Data_Predicted = My_Data
#My_Data_Predicted.append(prediction_column, ignore_index = False, sort=None)
My_Data_Predicted['Predicted_Value'] = pd.DataFrame(data = forest_model.predict(My_Data_Predicted[my_features]))
print("The average predicted value is ", My_Data_Predicted['Predicted_Value'].mean())
print("The average true value is ", My_Data_Predicted['Annualized_2018_Payments'].mean())
#write to csv
My_Data_Predicted.to_csv("//path….Preds.csv")
I expect every row to have a column that reads "predicted values" with the values predicted by the random forest regressor. But the last half of the rows are missing that value.
For a short answer and resolution:
Based on testing your code, you should try this line instead:
My_Data_Predicted['Predicted_Value'] = forest_model.predict(My_Data_Predicted[my_features])
And now here's why this is happening:
I tested this using my own dataset and it looks like the issue is this line:
My_Data_Predicted['Predicted_Value'] = pd.DataFrame(data = forest_model.predict(My_Data_Predicted[my_features]))
What is happening, it would seem, is that when you drop the null rows here:
My_Data = My_Data.dropna(axis=0)
you are also dropping the indexes along with the rows, which is not wrong, but important for your issue. To test this, try My_Data_Predicted.index.max() to get the highest index and compare that to My_Data_Predicted.shape and you will see that there are skipped indexes.
The reason this is a problem is that by making your predicted column a dataframe instead of a series, it is automatically trying to merge the new data based on indexes. The issue is that the original dataframe has a higher max index with some gaps, and this new one for predictions has sequential indexes, so some of your predictions are getting dropped in the process of merging.
Here is a dumbed-down example of what's going on (pay attention to the indexes):
My_Data_Predicted        predictions      My_Data_Predicted (merged)
index  a  b  c           index  d         index  a  b  c    d
    0  1  4  3               0  1             0  1  4  3    1
    3  3  2  7               1  2             3  3  2  7    4
    4  2  2  2               2  3             4  2  2  2    5
    6  4  3  5               3  4             6  4  3  5  NaN
    8  6  2  1               4  5             8  6  2  1  NaN
Notice that in the merged dataframe the last two are NaN because there is no index 6 or 8 in the predictions dataframe.
All of this is resolved by passing in the result of the predictions directly, like this:
My_Data_Predicted['Predicted_Value'] = forest_model.predict(My_Data_Predicted[my_features])
since the result is a plain numpy array, which is assigned by position and does not try to align on the index.
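An alternative, if you prefer building the column from a pandas object, is to make the indexes line up explicitly. A small sketch reusing the names from the question (either option on its own is enough):
# Option 1: make the row labels contiguous again right after dropping NaNs,
# so positional order and index labels agree from then on
My_Data = My_Data.dropna(axis=0).reset_index(drop=True)

# Option 2: build the prediction Series with the original (gappy) index,
# so pandas aligns it to the right rows when assigning
My_Data_Predicted['Predicted_Value'] = pd.Series(
    forest_model.predict(My_Data_Predicted[my_features]),
    index=My_Data_Predicted.index)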

How to select multiple rows from a (geo)pandas dataframe based on an array or propagate metadata of a clustering algorithm result?

I have a geopandas data frame that contains a polygon, region_id and center_point lat and lon in radians that looks like this:
I then wanted to go about clustering each region by its center point and did the following:
#Set Up
kms_per_radian = 6371.0088
epsilon = 0.1 / kms_per_radian
coords = blocks_meta.as_matrix(columns=['lat', 'lon'])
#Cluster
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=epsilon, algorithm='ball_tree', metric='haversine', min_samples=1).fit(coords)
labels = db.labels_
clusters = pd.Series([coords[labels == n] for n in range(len(set(labels)))])
which yields an array of clusters of center points that looks like this:
array([[ 0.0703843 , 0.170845 ],
[ 0.07037922, 0.17084981],
[ 0.07036705, 0.17085678],
[ 0.0703715 , 0.17083775]])
What I am struggling to figure out is how to get the region_ids associated with each cluster, so I can merge the polygons into one bigger region, without looping over each cluster and each lat/lon pair and querying the dataframe.
Is there a way of propagating the ids or querying the dataframe per cluster?
Any help here would be appreciated.
Thanks!
EDIT
What I want to avoid is doing this:
clusters_of_regions = []
for cluster in clusters:
    cluster_of_regions_ids = []
    for entry in cluster:
        print(cluster[0][0])
        region_id = blocks_meta.loc[blocks_meta['lat'] == cluster[0][0]]['region_id'][1]
        cluster_of_regions_ids.append(region_id)
    clusters_of_regions.append(cluster_of_regions_ids)
Both to avoid the nested for loop, and because whenever I try it I keep getting a KeyError.
Is there a way to cluster on the regions themselves, using the center points as properties?
Thanks
Check the example from sklearn (https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html). I modified it here to use a dataframe and resemble your example.
from sklearn.cluster import DBSCAN
import pandas as pd
import numpy as np
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
df = pd.DataFrame(X, index=list(range(len(X))), columns = ['col1', 'col2'])
clustering = DBSCAN(eps = 3, min_samples = 2).fit(df)
labels = clustering.labels_
df = df.merge(pd.Series(labels).to_frame().rename(columns={0:'clusters'}), left_index = True, right_index = True, how = 'outer')
df
Gives you:
col1 col2 clusters
0 1 2 0
1 2 2 0
2 2 3 0
3 8 7 1
4 8 8 1
5 25 80 -1
According to the description:
labels_ : array, shape = [n_samples] Cluster labels for each point in
the dataset given to fit(). Noisy samples are given the label -1.
In the example, you get two groups (labels 0 and 1). The -1 is a 'noisy' sample; here that sample clearly lies far from the others.
If you do something similar to this you can have your regions_id and the label next to each other and compare whether there is a 1:1 relation or not.
I think your groups are in your labels.
I think what you want is this (I am using labels = [1,2,3,4]):
df1 = pd.DataFrame(ar)   # ar is your array of cluster center points from above
df1.loc[:,'labels'] = pd.Series(labels)
df1
That will create a df like this one:
0 1 labels
0 0.070384 0.170845 1
1 0.070379 0.170850 2
2 0.070367 0.170857 3
3 0.070372 0.170838 4
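To get back to the original goal of merging the polygons per cluster without the nested loop: db.labels_ is aligned row for row with the coordinates you passed to fit, so you can attach it straight to the GeoDataFrame and dissolve. A sketch under the assumption that blocks_meta is a GeoDataFrame with region_id and geometry columns:
blocks_meta = blocks_meta.copy()
blocks_meta['cluster'] = db.labels_   # same row order as coords

# region_ids belonging to each cluster
ids_per_cluster = blocks_meta.groupby('cluster')['region_id'].apply(list)

# one merged polygon per cluster (geopandas unions the geometries)
merged_regions = blocks_meta[['cluster', 'geometry']].dissolve(by='cluster')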

Python SKLearn: How to Get Feature Names After OneHotEncoder?

I would like to get the feature names of a data set after it has been transformed by SKLearn OneHotEncoder.
The documentation of the active_features_ attribute of OneHotEncoder gives a very good explanation of how the attributes n_values_, feature_indices_ and active_features_ get filled after transform() is executed.
My question is:
For example, for DataFrame-based input data:
data = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]}).as_matrix()
What would the code look like to get from the original feature names a, b and c to a list of the transformed feature names
(like e.g:
a-0,a-1, a-2, b-0, b-1, b-2, b-3, c-0, c-1, c-2, c-3
or
a-0,a-1, a-2, b-0, b-1, b-2, b-3, b-4, b-5, b-6, b-7, b-8
or anything that helps to see the assignment of encoded columns to the original columns).
Background: I would like to see the feature importances of some of the algorithms to get a feeling for which features have the most effect on the algorithm used.
You can use pd.get_dummies():
pd.get_dummies(data["a"],prefix="a")
will give you:
a_0 a_1 a_2
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
which automatically generates the column names. You can apply this to all your columns and then get the column names. No need to convert them to a numpy matrix.
So with:
df = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]})
data = df.as_matrix()
the solution looks like:
columns = df.columns
my_result = pd.DataFrame()
temp = pd.DataFrame()
for runner in columns:
    temp = pd.get_dummies(df[runner], prefix=runner)
    my_result[temp.columns] = temp
print(my_result.columns)
>>Index(['a_0', 'a_1', 'a_2', 'b_0', 'b_1', 'b_4', 'b_5', 'c_0', 'c_1', 'c_4',
'c_5'],
dtype='object')
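As a side note, the whole loop can be collapsed into a single call by telling get_dummies which columns to encode; a small sketch with the same df as above:
my_result = pd.get_dummies(df, columns=df.columns)
print(my_result.columns)
# Index(['a_0', 'a_1', 'a_2', 'b_0', 'b_1', 'b_4', 'b_5',
#        'c_0', 'c_1', 'c_4', 'c_5'], dtype='object')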
If I understand correctly you can use feature_indices_ to identify which columns correspond to which feature.
e.g.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
data = pd.DataFrame({"a": [0, 1, 2,0], "b": [0,1,4, 5], "c":[0,1,4, 5]}).as_matrix()
ohe = OneHotEncoder(sparse=False)
ohe_fitted = ohe.fit_transform(data)
print(ohe_fitted)
print(ohe.feature_indices_) # [ 0 3 9 15]
From the above feature_indices_ we know that if we slice the one-hot encoded data from 0:3 we get the features corresponding to the first column in data, like so:
print(ohe_fitted[:,0:3])
Each column in the sliced data represents a value of the first feature: the first column corresponds to the value 0, the second to 1, and the third to 2. To illustrate this on the sliced data, the column labels would look like:
a_0 a_1 a_2
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]
[ 1. 0. 0.]]
Note that features are sorted first before they are encoded.
You can do that with the open source package feature-engine:
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.encoding import OneHotEncoder
# load titanic data from openML
data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
# divide into train and test
X_train, X_test, y_train, y_test = train_test_split(
    data[['sex', 'embarked']],  # predictors for this example
    data['survived'],           # target
    test_size=0.3,              # percentage of obs in test set
    random_state=0)             # seed to ensure reproducibility
ohe_enc = OneHotEncoder(
    top_categories=None,
    variables=['sex', 'embarked'],
    drop_last=True)
ohe_enc.fit(X_train)
X_train = ohe_enc.transform(X_train)
X_test = ohe_enc.transform(X_test)
X_train.head()
You should see this output returned:
sex_female embarked_S embarked_C embarked_Q
501 1 1 0 0
588 1 1 0 0
402 1 0 1 0
1193 0 0 0 1
686 1 0 0 1
More details about feature engine here:
https://www.trainindata.com/feature-engine
https://github.com/feature-engine/feature_engine
https://feature-engine.readthedocs.io/en/latest/
There is a OneHotEncoder that does all the work for you.
The package sksurv has a OneHotEncoder that will return a pandas DataFrame with all the column names set up for you. Check it out. Make sure you set up a separate environment to play with the encoder, to ensure it doesn't break your current environment. This encoder saved me a lot of time and effort.
scikit-survival GitHub
OneHotEncoder Documentation
OneHotEncoder now has a method get_feature_names. You can use input_features=data.columns to match to the training data.
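A minimal sketch of that approach, assuming data is the small DataFrame from the question (on recent scikit-learn versions the method is called get_feature_names_out; older versions expose get_feature_names instead):
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({"a": [0, 1, 2, 0], "b": [0, 1, 4, 5], "c": [0, 1, 4, 5]})
ohe = OneHotEncoder()
ohe.fit(data)

# older versions: ohe.get_feature_names(input_features=data.columns)
print(ohe.get_feature_names_out(input_features=data.columns))
# e.g. ['a_0' 'a_1' 'a_2' 'b_0' 'b_1' 'b_4' 'b_5' 'c_0' 'c_1' 'c_4' 'c_5']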

Using data from Python's pandas DataFrames to sample from normal distributions

I'm trying to sample from a normal distribution using means and standard deviations that are stored in pandas DataFrames.
For example:
means= numpy.arange(10)
means=means.reshape(5,2)
produces:
0 1
0 0 1
1 2 3
2 4 5
3 6 7
4 8 9
and:
sts=numpy.arange(10,20)
sts=sts.reshape(5,2)
produces:
0 1
0 10 11
1 12 13
2 14 15
3 16 17
4 18 19
How would I produce another pandas dataframe with the same shape but with values sampled from the normal distribution using the corresponding means and standard deviations?
i.e. position 0,0 of this new dataframe would sample from a normal distribution with mean=0 and standard deviation=10, and so on.
My function so far:
def make_distributions(self):
    num_data_points, num_species = self.means.shape
    samples = []
    for i, j in zip(self.means, self.stds):
        for k, l in zip(self.means[i], self.stds[j]):
            samples.append(numpy.random.normal(k, l, self.n))
will sample from the distributions for me but I'm having difficulty putting the data back into the same shaped dataframe as the mean and standard deviation dfs. Does anybody have any suggestions as to how to do this?
Thanks in advance.
You can use numpy.random.normal to sample from a random normal distribution.
IIUC, then this might be easiest, taking advantage of broadcasting:
import numpy as np
np.random.seed(1) # only for demonstration
np.random.normal(means,sts)
array([[ 16.24345364, -5.72932055],
[ -4.33806103, -10.94859209],
[ 16.11570681, -29.52308045],
[ 33.91698823, -5.94051732],
[ 13.74270373, 4.26196287]])
Check that it works:
np.random.seed(1)
print(np.random.normal(0, 10))
print(np.random.normal(1, 11))
16.2434536366
-5.72932055015
If you need a pandas DataFrame:
import pandas as pd
pd.DataFrame(np.random.normal(means,sts))
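If you want the result to keep the same index and columns (assuming means and sts are DataFrames, as in the question's printed output, rather than raw arrays), you can pass them back in explicitly:
import numpy as np
import pandas as pd

samples = pd.DataFrame(np.random.normal(means, sts),
                       index=means.index, columns=means.columns)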
I will use a dictionary to construct this dataframe. Suppose indices and columns are the same for means and stds:
means= numpy.arange(10)
means=pd.DataFrame(means.reshape(5,2))
stds=numpy.arange(10,20)
stds=pd.DataFrame(stds.reshape(5,2))
samples={}
for i in means.columns:
    col={}
    for j in means.index:
        col[j] = numpy.random.normal(means.loc[j, i], stds.loc[j, i], 2)
    samples[i] = col
print(pd.DataFrame(samples))
# 0 1
#0 [0.0760974520154, 3.29439282825] [11.1292510583, 0.318246201796]
#1 [-25.4518020981, 19.2176263823] [17.0826945017, 9.36179435872]
#2 [14.5402484325, 8.33808246538] [6.96459947914, 26.5552235093]
#3 [0.775891790613, -2.09168601369] [2.38723023677, 15.8099942902]
#4 [-0.828518484847, 45.4592922652] [26.8088977308, 16.0818556353]
Or reset the dtype of a DataFrame and reassign values:
import itertools
samples = means * 0
samples = samples.astype(object)
for i, j in itertools.product(means.index, means.columns):
    # note: set_value was removed in recent pandas versions
    samples.set_value(i, j, numpy.random.normal(means.loc[i, j], stds.loc[i, j], 2))

Transform Pandas DataFrame with n-level hierarchical index into n-D Numpy array

Question
Is there a good way to transform a DataFrame with an n-level index into an n-D Numpy array (a.k.a n-tensor)?
Example
Suppose I set up a DataFrame like
from pandas import DataFrame, MultiIndex
index = range(2), range(3)
value = range(2 * 3)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0))
print(frame)
which outputs
     value
0 0      0
  1      1
  2      2
1 1      4
  2      5
The index is a 2-level hierarchical index. I can extract a 2-D Numpy array from the data using
print(frame.unstack().values)
which outputs
[[ 0. 1. 2.]
[ nan 4. 5.]]
How does this generalize to an n-level index?
Playing with unstack(), it seems that it can only be used to massage the 2-D shape of the DataFrame, but not to add an axis.
I cannot use e.g. frame.values.reshape(x, y, z), since this would require that the frame contains exactly x * y * z rows, which cannot be guaranteed. This is what I tried to demonstrate by drop()ing a row in the above example.
Any suggestions are highly appreciated.
Edit. This approach is much more elegant (and two orders of magnitude faster) than the one I gave below.
import numpy as np

# create an empty array of NaN of the right dimensions
shape = list(map(len, frame.index.levels))
arr = np.full(shape, np.nan)
# fill it using Numpy's advanced indexing
arr[tuple(frame.index.codes)] = frame.values.flat
# ...or in Pandas < 0.24.0, use
# arr[tuple(frame.index.labels)] = frame.values.flat
Original solution. Given a setup similar to above, but in 3-D,
from pandas import DataFrame, MultiIndex
from itertools import product
index = range(2), range(2), range(2)
value = range(2 * 2 * 2)
frame = DataFrame(value, columns=['value'],
                  index=MultiIndex.from_product(index)).drop((1, 0, 1))
print(frame)
we have
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
  1 0      6
    1      7
Now, we proceed using the reshape() route, but with some preprocessing to ensure that the length along each dimension will be consistent.
First, reindex the data frame with the full cartesian product of all dimensions. NaN values will be inserted as needed. This operation can be both slow and consume a lot of memory, depending on the number of dimensions and on the size of the data frame.
levels = map(tuple, frame.index.levels)
index = list(product(*levels))
frame = frame.reindex(index)
print(frame)
which outputs
       value
0 0 0      0
    1      1
  1 0      2
    1      3
1 0 0      4
    1    NaN
  1 0      6
    1      7
Now, reshape() will work as intended.
shape = list(map(len, frame.index.levels))
print(frame.values.reshape(shape))
which outputs
[[[ 0. 1.]
[ 2. 3.]]
[[ 4. nan]
[ 6. 7.]]]
The (rather ugly) one-liner is
frame.reindex(list(product(*map(tuple, frame.index.levels)))).values\
     .reshape(list(map(len, frame.index.levels)))
This can be done quite nicely using the Python xarray package which can be found here: http://xarray.pydata.org/en/stable/. It has great integration with Pandas and is quite intuitive once you get to grips with it.
If you have a multiindex series you can call the built-in method multiindex_series.to_xarray() (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_xarray.html). This will generate a DataArray object, which is essentially a name-indexed numpy array, using the index values and names as coordinates. Following this you can call .values on the DataArray object to get the underlying numpy array.
If you need your tensor to conform to a set of keys in a specific order, you can also call .reindex(index_name = index_values_in_order) (http://xarray.pydata.org/en/stable/generated/xarray.DataArray.reindex.html) on the DataArray. This can be extremely useful and makes working with the newly generated tensor much easier!
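A minimal sketch of that workflow, reusing the 3-level frame built above (xarray needs to be installed for to_xarray to work):
# frame['value'] is a Series with a 3-level MultiIndex
da = frame['value'].to_xarray()   # DataArray with one dimension per index level
arr = da.values                   # plain n-D numpy array, NaN where rows were dropped
print(arr.shape)                  # (2, 2, 2) for the example above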
