I have x and y arrays: x consists of three single-value arrays, and y consists of three arrays of seven values each.
x= [np.array([6.03437288]), np.array([6.39850922]), np.array([6.07835145])]
y= [np.array([[-1.06565856, -0.16222044, 7.85850477, -2.62498475, -0.46315498,
-0.33087472, -0.1394244 ]]),
np.array([[-1.41487104e+00, 5.81421750e-03, 7.92917001e+00,
-3.37987517e+00, 1.14685839e-01, -2.91779263e-01,
2.51753851e-01]]),
np.array([[-1.56496814, 0.2612637 , 7.60577761, -3.55727614, 0.18844392,
-0.75112678, -0.48055978]])]
I combine x and y into one dataframe:
df = pd.DataFrame({'x': x,'y': y})
then I tried to cluster this dataframe with k-medoids:
kmedoids = KMedoids(n_clusters=3, random_state=0).fit(df)
cluster_labels = kmedoids.predict(df)
but I faced this error:
ValueError: setting an array element with a sequence.
I tried to search for a solution to this problem but haven't found a concrete one. Any suggestions are welcome, even ones that modify the code.
Given arrays x and y as provided in the question:
import numpy as np
import pandas as pd
from sklearn_extra.cluster import KMedoids
df = pd.DataFrame({'x': x, 'y': y})
First concatenate x and y of dataframe into one array per row:
df2 = df.apply(lambda r: np.append(r.x, r.y), axis=1)
Then create one X array:
X = np.array(df2.values.tolist())
that can be passed to clustering method:
kmedoids = KMedoids(n_clusters=3, random_state=0).fit(X)
cluster_labels = kmedoids.predict(X)
result of clustering:
array([2, 0, 1], dtype=int64)
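For comparison, the same X matrix can be built without the intermediate DataFrame. A minimal sketch, assuming x and y exactly as defined above:

import numpy as np
# np.vstack(x) stacks the three single-value arrays into shape (3, 1);
# np.vstack(y) stacks the three (1, 7) arrays into shape (3, 7).
# hstack then joins them into the (3, 8) feature matrix.
X = np.hstack([np.vstack(x), np.vstack(y)])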
Consider the following minimal example:
from time import sleep # To (try to) get warnings printed at the right places
import pandas as pd
from sklearn.neural_network import MLPClassifier
from sklearn.dummy import DummyClassifier
df = pd.DataFrame([[1, 1, 1, 1], [0, 0, 0, 0]])
mlp = MLPClassifier(tol=10)
dummy = DummyClassifier(strategy='uniform')
for size in [1, 2]:
    input_columns = [0, 1]
    output_columns = [j + 2 for j in range(size)]
    print('Dimension of output: ', len(output_columns))  # Is 1 or 2
    X = df[input_columns]
    Y = df[output_columns]
    print('MLPClassifier')
    mlp.fit(X, Y)
    sleep(3)
    print('DummyClassifier')
    dummy.fit(X, Y)
    sleep(3)
    print('\n\n\n')
At the first iteration, during the training of the MLPClassifier, scikit-learn complains:
lib/python3.6/site-packages/sklearn/neural_network/_multilayer_perceptron.py:934: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
y = column_or_1d(y, warn=True)
The second iteration runs fine. The DummyClassifier (dummy.fit) runs fine in both iterations.
The warning appears because I'm passing a one-column dataframe (Y) to mlp.fit. It doesn't happen in the second iteration, where Y is a two-column dataframe.
The question is: how can I properly pass the data to fit in the case of MLPClassifier? I've learned I can do Y = Y.values.ravel(), which works when the dataframe is one-column, but then it doesn't work for two-column dataframes. I'm looking for a consistent way to solve this generically for any number of columns.
One approach is to check beforehand whether the number of columns is 1:
if len(output_columns) == 1:
    mlp.fit(X, Y.values.ravel())
else:
    mlp.fit(X, Y)
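A more generic alternative (a sketch, not tested against every pandas version) is DataFrame.squeeze, which collapses a one-column frame into a Series and leaves wider frames unchanged, so one call covers any number of output columns:

# squeeze(axis=1) turns an (n, 1) DataFrame into a Series of shape (n,),
# and returns a multi-column DataFrame as-is, so both cases work.
mlp.fit(X, Y.squeeze(axis=1))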
I am trying to subset a pandas dataframe using two conditions. However, I am not getting the same results as when done with numpy. What am I doing wrong?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(20,120,101)
y = np.linspace(-45,25,101)
xs,ys = np.meshgrid(x,y)
idx = (xs >=100) & (ys >= 0)
plt.scatter(xs,ys,s=2,c='b')
plt.scatter(xs[idx],ys[idx],s=2,c='r')
I need to remove the red block from my dataset, which I can do with numpy by using:
plt.scatter(xs[~idx],ys[~idx],s=2,c='b')
How do I replicate this with a pandas dataframe?
I've tried using the same logic as I used above:
data = {'x':x,'y':y}
df = pd.DataFrame(data)
mask = (df.x >=100) & (df.y >= 0)
df2 = df[~mask]
I've also tried using loc:
df.loc[(df.x >=100) & (df.y >= 0),['x','y']] = np.nan
Both of these methods give the same result, which does not match the numpy version (plot omitted).
How do I replicate the results from numpy?
Many thanks.
You don't obtain the same result because you didn't create all the coordinate pairs before passing them to pandas. Here is a quick solution:
data = {'x':xs.flatten(),'y':ys.flatten()}
df = pd.DataFrame(data)
mask = (df.x >=100) & (df.y >= 0)
df2 = df[~mask]
plt.scatter(df2.x,df2.y,s=2,c='b')
flatten() reshapes your arrays to one dimension so that they can be used to construct a DataFrame containing one coordinate pair per row rather than whole rows of the grid.
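To see why the flattening matters, compare the shapes (a quick check, assuming the 101-point meshgrid above):

print(xs.shape)            # (101, 101): the full grid of coordinate pairs
print(xs.flatten().shape)  # (10201,): one coordinate per DataFrame row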
Output: (scatter plot with the red block removed)
Edit: Same result but with dataframe containing x and y
Split the df into chunks:
data_x = np.linspace(20,120,101)
data_y = np.linspace(-45,25,101)
dataframe = pd.DataFrame({'x':data_x,'y':data_y})
chunk_size = 25
dfs = [dataframe[i:i+chunk_size] for i in range(0,dataframe.shape[0],chunk_size)]
Define the function that will generate the points you are interested in. Two nested loops are needed to produce every combination of x and y values:
def generatorPoints(dfs):
    for i in range(len(dfs)):
        x = dfs[i].x
        for j in range(len(dfs)):
            y = dfs[j].y
            xs, ys = np.meshgrid(x, y)
            idx = (xs >= 100) & (ys >= 0)
            yield xs[~idx], ys[~idx]
x, y = [], []
for xs, ys in generatorPoints(dfs):
    x.extend(xs), y.extend(ys)
plt.scatter(x, y, s=2, c='b')
This gives the same result as the previous code. There is certainly room for optimization, but this is a start for your request :).
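Note that the chunking mainly keeps each meshgrid small. Assuming the full 101x101 grid fits in memory, the same points come from a single call on the dataframe columns:

xs, ys = np.meshgrid(dataframe.x, dataframe.y)
idx = (xs >= 100) & (ys >= 0)
plt.scatter(xs[~idx], ys[~idx], s=2, c='b')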
I have a list of dictionaries in Python whose values are both floats and NumPy arrays. I want to store it in a pandas dataframe so that I can plot xarr against yarr for a given set of parameters.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dic1 = {'p1': 1, 'p2': 34, 'xarr': np.array([1, 2, 3]), 'yarr': np.array([4, 4, 6])}
dic2 = {'p1': 2, 'p2': 45, 'xarr': np.array([1, 2, 3]), 'yarr': np.array([6, 6, 4])}
listdic = [dic1, dic2]
df = pd.DataFrame(listdic)
dfplot= df[df['p1'] > 1]
x, y = dfplot['xarr'], dfplot['yarr']
plt.plot(x, y)
But I get this error
ValueError: setting an array element with a sequence.
I think this is because I am also getting the index of the pandas df along with the values.
Is there an efficient way of doing this?
The resulting dataframe you have created looks like this:

   p1  p2       xarr       yarr
0   1  34  [1, 2, 3]  [4, 4, 6]
1   2  45  [1, 2, 3]  [6, 6, 4]
When you ask for x, y = dfplot['xarr'], dfplot['yarr'], you are asking for the contents of the columns xarr and yarr, each of which holds arrays.
Thus for each of them you are getting a Series of arrays.
See for yourself: type(x) will output:
pandas.core.series.Series
You can access the series content using iloc:
idx = 0
x = x_series.iloc[idx]
In your example you have only one record, so you can simply use:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
dic1 = {'p1':1, 'p2': 34, 'xarr' : np.array([1,2,3]) , 'yarr': np.array([4,4,6])}
dic2 = {'p1':2, 'p2': 45 ,'xarr' : np.array([1,2,3]) , 'yarr': np.array([6,6,4])}
listdic = [dic1, dic2]
df = pd.DataFrame(listdic)
dfplot= df[df['p1'] > 1]
x_series, y_series = dfplot['xarr'], dfplot['yarr']
x,y = x_series.iloc[0], y_series.iloc[0]
plt.plot(x, y)
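If dfplot matched more than one row, you could plot each stored pair in turn; a small sketch under that assumption:

# One line per matching record; each cell already holds a full array.
for xarr, yarr in zip(dfplot['xarr'], dfplot['yarr']):
    plt.plot(xarr, yarr)
plt.show()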
I have a dataframe X with several columns and a dataframe y with only one column (a series). The rows in X represent timesteps, and I want to find the interval by which I need to shift each column of X to obtain the highest correlation with y. I wrote a function that loops over all columns and, for each one, over all timesteps, correlating the shifted X column with y. If the R² is better than before, I store the timestep. However, with over 300 columns this routine really takes some time, and I need to improve the performance. Is there a nice way to simplify this code?
(In the example I used the iris data set, which is of course not a time series...)
from sklearn import datasets
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt
from copy import deepcopy
def get_best_shift(dfX, dfy, ti=60, maxt=1440):
    """
    Determines the best correlation for the last maxt minutes based on a
    timestep of ti minutes. Creates a dataframe with the shifted variables
    based on the best match (strongest correlation).
    """
    df_out = deepcopy(dfX)
    for xcol in dfX:
        bestshift = 0
        Rmax = 0
        for ishift in range(0, int(maxt / ti)):
            xvals = dfX[xcol].iloc[0:(dfX.shape[0] - ishift)].values
            yvals = np.array([val[0] for val in dfy.iloc[ishift:dfy.shape[0]].values])
            selector = np.array([str(val) != "nan" for val in (xvals * yvals)], dtype=bool)
            xvals = xvals[selector]
            yvals = yvals[selector]
            R = np.corrcoef(xvals, yvals)[0][1]
            # plt.figure()
            # plt.plot(xvals, yvals, 'k.')
            # plt.show()
            if R ** 2 > Rmax:
                Rmax = R ** 2
                # print(Rmax)
                bestshift = ishift
        df_out[xcol] = list(np.zeros(bestshift)) + list(dfX[xcol].iloc[0:dfX.shape[0] - bestshift].values)
        df_out = df_out.rename(columns={xcol: ''.join([str(xcol), '_t-', str(bestshift)])})
    return df_out
iris = datasets.load_iris()
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)
df = get_best_shift(X,y)
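One direction worth trying (an untested sketch, not benchmarked) is to let pandas do the alignment: Series.corr drops NaN pairs automatically, and shift() replaces the manual slicing, so the inner loop collapses to one expression per lag:

def get_best_shift_pandas(dfX, dfy, ti=60, maxt=1440):
    # Hypothetical variant of get_best_shift: y.shift(-s) aligns y[i+s]
    # with x[i], matching the slicing in the original inner loop.
    y = dfy.iloc[:, 0]
    df_out = dfX.copy()
    for xcol in dfX:
        r2 = [dfX[xcol].corr(y.shift(-s)) ** 2 for s in range(int(maxt / ti))]
        bestshift = int(np.nanargmax(r2))
        # Shift the column down by bestshift, zero-padded, as before.
        df_out[xcol] = dfX[xcol].shift(bestshift, fill_value=0)
        df_out = df_out.rename(columns={xcol: str(xcol) + '_t-' + str(bestshift)})
    return df_out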
Is there a way to vectorize this code to eliminate the for loop:
import numpy as np
Z = np.concatenate((X, labels[:,None]), axis=1)
centroids = np.empty([len(np.unique(labels)) - 1, 2])
for i in np.unique(labels[labels > -1]):
    centroids[i, :] = Z[Z[:, -1] == i][:, :-1].mean(0)
centroids
This code produces pseudo-centroids from the DBSCAN scikit-learn example, in case you want to play with it to find a vectorized form; X and labels are defined in that example.
Thanks for your help!
You can use bincount() three times:
count = np.bincount(labels)
x = np.bincount(labels, X[:, 0])
y = np.bincount(labels, X[:, 1])
centroids = np.c_[x, y] / count[:, None]
print(centroids)
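One caveat to check against your data (an assumption on my part, not part of the answer above): np.bincount rejects negative values, and DBSCAN labels noise points as -1, so you may need to drop those first:

# Keep only points assigned to a real cluster before counting.
mask = labels > -1
count = np.bincount(labels[mask])
x = np.bincount(labels[mask], X[mask, 0])
y = np.bincount(labels[mask], X[mask, 1])
centroids = np.c_[x, y] / count[:, None]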
But if you can use pandas, this is very simple:
import pandas as pd

Z = np.concatenate((X, labels[:, None]), axis=1)
df = pd.DataFrame(Z, columns=("x", "y", "label"))
df[df['label'] > -1].groupby("label").mean()
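If you then need a plain array back (for example, to compare with the bincount result), to_numpy() should do it; a minor addition:

centroids = df[df['label'] > -1].groupby("label").mean().to_numpy()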