Minimal reproducible example:
import cudf
from cuml.neighbors import KNeighborsRegressor
d = {
    'id': ['a', 'b', 'c', 'd', 'e', 'f'],
    'latitude': [50, -22, 13, 37, 43, 14],
    'longitude': [3, -43, 100, 27, -4, 121],
}
df = cudf.DataFrame(d)
knn = KNeighborsRegressor(n_neighbors = 4, metric = 'haversine')
knn.fit(df[['latitude','longitude']],df.index)
dists, nears = knn.kneighbors(df[['latitude','longitude']], return_distance = True)
This throws the error "number of landmark samples must be >= k". The whole traceback is:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_33/1073358290.py in <module>
10 knn = KNeighborsRegressor(n_neighbors = 4, metric = 'haversine')
11 knn.fit(df[['latitude','longitude']],df.index)
---> 12 dists, nears = knn.kneighbors(df[['latitude','longitude']], return_distance = True)
/opt/conda/lib/python3.7/site-packages/cuml/internals/api_decorators.py in inner_get(*args, **kwargs)
584
585 # Call the function
--> 586 ret_val = func(*args, **kwargs)
587
588 return cm.process_return(ret_val)
cuml/neighbors/nearest_neighbors.pyx in cuml.neighbors.nearest_neighbors.NearestNeighbors.kneighbors()
cuml/neighbors/nearest_neighbors.pyx in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors()
cuml/neighbors/nearest_neighbors.pyx in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors_dense()
RuntimeError: exception occured! file=_deps/raft-src/cpp/include/raft/spatial/knn/detail/ball_cover.cuh line=326: number of landmark samples must be >= k
Obtained 64 stack frames
...
I have been trying hard to get around this error for days, but the only way I know of is to convert the cuDF to a pandas DataFrame and use sklearn, which works perfectly:
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
d = {
    'id': ['a', 'b', 'c', 'd', 'e', 'f'],
    'latitude': [50, -22, 13, 37, 43, 14],
    'longitude': [3, -43, 100, 27, -4, 121],
}
df = pd.DataFrame(d)
knn = KNeighborsRegressor(n_neighbors = 4, metric = 'haversine')
knn.fit(df[['latitude','longitude']],df.index)
dists, nears = knn.kneighbors(df[['latitude','longitude']], return_distance = True)
dists
which gives us the distances array.
Can you help me find a pure RAPIDS solution?
UPDATE: I found out that it works when the number of neighbors is <= total number of rows // 2.
UPDATE: It's a bug, and an appropriate issue has been opened here. We can pass algorithm='brute' as a workaround until the issue gets resolved.
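For completeness, here is a minimal sketch of that workaround applied to the example above; the only change is adding algorithm='brute' to the constructor, which sidesteps the ball-cover index that raises the error:

import cudf
from cuml.neighbors import KNeighborsRegressor

d = {
    'id': ['a', 'b', 'c', 'd', 'e', 'f'],
    'latitude': [50, -22, 13, 37, 43, 14],
    'longitude': [3, -43, 100, 27, -4, 121],
}
df = cudf.DataFrame(d)

# algorithm='brute' avoids the ball-cover code path behind
# "number of landmark samples must be >= k"
knn = KNeighborsRegressor(n_neighbors=4, metric='haversine', algorithm='brute')
knn.fit(df[['latitude', 'longitude']], df.index)
dists, nears = knn.kneighbors(df[['latitude', 'longitude']], return_distance=True)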
So, I am pretty new to Python, and I am trying to load a dataset from my computer using scikit-learn. This is what my code looks like:
**whatever.py**
import numpy as np
import csv
from sklearn.datasets.base import Bunch

class Cortex_nuc:
    def cortex_nuclear():
        with open('C:/Users/User/Desktop/Data_Cortex_Nuclear4.csv') as csv_file:
            data_file = csv.reader(csv_file)
            temp = next(data_file)
            n_samples = int(float(temp[0]))
            n_features = int(float(temp[1]))
            data = np.empty((n_samples, n_features))
            target = np.empty((n_samples,), dtype=np.float64)

            for i, sample in enumerate(data_file):
                data[i] = np.asarray(sample[:-1], dtype=np.float64)
                target[i] = np.asarray(sample[-1], dtype=np.float64)

        return Bunch(data=data, target=target)
Then I import it into my project:
from whatever import Cortex_nuc
and after that I try to save it into df:
df = Cortex_nuc.cortex_nuclear()
By the way, this is what part of the dataset looks like; the full file has 77 columns and about a thousand rows.
But I get an error message and I can't seem to figure out why it's happening. Here's the error message:
IndexError Traceback (most recent call last)
<ipython-input-5-a4935f2c187f> in <module>
----> 1 df = Cortex_nuc.cortex_nuclear()
~\whatever.py in cortex_nuclear()
20
21 for i, sample in enumerate(data_file):
---> 22 data[i] = np.asarray(sample[:-1], dtype=np.float64)
23 target[i] = np.asarray(sample[-1], dtype=np.float64)
24
IndexError: index 0 is out of bounds for axis 0 with size 0
Can someone please help me? Thanks!
If you want to create a "sklearn-like" dataset in a Bunch object, you probably want something like this:
import pandas as pd
import numpy as np
from sklearn.utils import Bunch
# For reproducing
from io import StringIO
csv_file = StringIO("""
target,A,B
0,0,0
1,0,1
1,1,0
0,1,1
""")
def load_xor(*, return_X_y=False):
    """Describe your data here."""
    _data_file = pd.read_csv(csv_file)
    _data = Bunch()
    _data["DESCR"] = load_xor.__doc__
    _data["data"] = _data_file[["A", "B"]].to_numpy(dtype=np.float64)
    _data["target"] = _data_file["target"].to_numpy(dtype=np.float64)
    _data["target_names"] = np.array(["false", "true"])
    _data["feature_names"] = np.array(list(_data_file.drop(["target"], axis=1)))

    if return_X_y:
        return _data.data, _data.target

    return _data

if __name__ == "__main__":
    # Return and unpack the `X`, `y` tuple
    X, y = load_xor(return_X_y=True)
    print(X, y)
This is because the sklearn.datasets loaders typically return Bunch objects with specific attributes/keys (for explanations, see the "Returns" section of the load_iris documentation):
>>> from sklearn.datasets import load_iris
>>> data = load_iris()
>>> dir(data)
['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']
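Applying the same pattern to the file from the question, a hypothetical loader might look roughly like this. The function name and path are made up for illustration, and it assumes, as the question's code does, that the last column is the target and all other columns are numeric features:

import pandas as pd
import numpy as np
from sklearn.utils import Bunch

def load_cortex_nuclear(path='C:/Users/User/Desktop/Data_Cortex_Nuclear4.csv'):
    """Describe the Data_Cortex_Nuclear dataset here."""
    frame = pd.read_csv(path)
    return Bunch(
        DESCR=load_cortex_nuclear.__doc__,
        data=frame.iloc[:, :-1].to_numpy(dtype=np.float64),  # all columns except the last
        target=frame.iloc[:, -1].to_numpy(),                 # last column assumed to be the target
        feature_names=np.array(frame.columns[:-1]),
    )

df = load_cortex_nuclear()
print(df.data.shape, df.target.shape)

Note that this lets pandas handle the header row instead of expecting the first row to hold n_samples and n_features counts, which appears to be what tripped up the original code.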
I am using playerStats.csv, which includes 8 columns, of which I only need 2. So I'm trying to create a new DataFrame with only those 2 columns.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_csv("HLTVData/playerStats.csv")
dataset.head(20)
I only need the ADR and the Rating.
So I first create a matrix from the data set.
mat = dataset.as_matrix()
#4 is the ADR and 6 is the Rating
newDAtaSet = pd.DataFrame(dataset, index=indexMatrix,columns=(mat[:,4],mat[:,6]) )
But it didn't work; it threw an exception:
NameError Traceback (most recent call last)
<ipython-input-10-1f975cc2514a> in <module>()
1 #4 is the ADR and 6 is the Rating
----> 2 newDataSet = pd.DataFrame(dataset, index=indexMatrix,columns=(mat[:,4],mat[:,6]) )
NameError: name 'indexMatrix' is not defined
I also tried using the dataset directly:
newDataSet = pd.DataFrame(dataset, index=np.array(range(dataset.shape[0])), columns=dataset['ADR'])
/home/tensor/miniconda3/envs/tensorflow35openvc/lib/python3.5/site-packages/pandas/core/internals.py in _make_na_block(self, placement, fill_value)
3984
3985 dtype, fill_value = infer_dtype_from_scalar(fill_value)
-> 3986 block_values = np.empty(block_shape, dtype=dtype)
3987 block_values.fill(fill_value)
3988 return make_block(block_values, placement=placement)
MemoryError:
I think you need the usecols parameter in read_csv:
dataset = pd.read_csv("HLTVData/playerStats.csv", usecols=['ADR','Rating'])
Or:
dataset = pd.read_csv("HLTVData/playerStats.csv", usecols=[4,6])
I'm a beginner in Python and I am trying to plot the cluster centers, but I can't manage to do it. Here is my code:
import pandas as pd
import numpy as np
df = pd.read_csv("InputClusterModel.txt")
df.columns = ["Major","Quantity","rating","rating_2","RightWindoWeek","Ranking","CopiesQuant","Content","Trump","Movies","Carton","Serial","Before1014","categor","Purchase","Revenue"]
df.head()
from sklearn.cluster import KMeans
cluster = KMeans(n_clusters=2)
df['cluster'] = cluster.fit_predict(df[df.columns[:15]])
from sklearn.decomposition import PCA
x_cols = df.columns[1:]
pca = PCA()
df['x'] = pca.fit_transform(df[x_cols])[:,0]
df['y'] = pca.fit_transform(df[x_cols])[:,1]
df = df.reset_index()
clusters = df[['Purchase', 'cluster', 'x', 'y']]
clusters.head()
%matplotlib inline
from ggplot import *
ggplot(df, aes(x='x', y='y', color='cluster')) + \
geom_point(size=75) + \
ggtitle("Grouped by Cluster")
df.cluster.value_counts()
# the part below is where I see the error:
cluster_centers = pca.transform(cluster.cluster_centers_)
cluster_centers = pd.DataFrame(cluster_centers, columns=['x', 'y'])
cluster_centers['cluster'] = range(0, len(cluster_centers))
ggplot(cluster, aes(x='x', y='y', color='cluster')) + \
geom_point(size=100) + \
geom_point(cluster_centers, size=500) +\
ggtitle("Customers Grouped by Cluster")
print(pca.explained_variance_ratio_)
This is the error I get:
ValueError                                Traceback (most recent call last)
<ipython-input-18-c2ac22e32b75> in <module>()
----> 1 cluster_centers = pca.transform(cluster.cluster_centers_)
2 cluster_centers = pd.DataFrame(cluster_centers, columns=['x', 'y'])
3 cluster_centers['cluster'] = range(0, len(cluster_centers))
4
5 ggplot(cluster, aes(x='x', y='y', color='cluster')) + geom_point(size=100) + geom_point(cluster_centers, size=500) +
ggtitle("Customers Grouped by Cluster")
/home/belotelov/anaconda2/lib/python2.7/site-packages/sklearn/decomposition/base.pyc
in transform(self, X, y)
130 X = check_array(X)
131 if self.mean_ is not None:
--> 132 X = X - self.mean_
133 X_transformed = fast_dot(X, self.components_.T)
134 if self.whiten:
ValueError: operands could not be broadcast together with shapes (2,15) (16,)
The structure of my data looks like this (first rows shown):
0,122,7,8,6,8,105.704,1,0,1,0,0,0,0,37426,11831762
1,278,8,8,12,2,2246,1,1,1,0,0,0,0,29316,7371029
1,275,6,6,14,1,1268,1,1,1,0,0,0,0,30693,7368787
0,125,5,5,5,1,105.704,1,0,1,0,0,0,0,20661,7337545
1,193,8,8,11,2,1063,1,1,1,0,0,0,0,29141,7279077
1,1,6,6,11,0,1236,1,1,0,1,0,0,0,879,325151
1,116,8,8,14,0,1209,1,1,0,1,0,0,0,17751,5529657
0,39,7,7,11,1,1128,1,1,1,0,0,0,0,15044,5643468
1,65,6,6,11,0,1209,1,1,0,1,0,0,0,9902,2612669
0,170,6,7,2,0,105.704,1,1,1,0,0,0,0,19167,5195321
p.s. Python 2.7.12 :: Anaconda custom (64-bit) on Debian Jessie
I have not reviewed your code line by line. Here's a comment on the error:
ValueError: operands could not be broadcast together with shapes (2,15) (16,)
As the error implies, you're trying to evaluate X = X - self.mean_ with two arrays whose shapes are incompatible. The broadcasting rule is that the trailing dimensions must either be equal or one of them must be 1; here they are 15 and 16, so the subtraction cannot be broadcast.
I recommend searching for the generated error and having a look at this.
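Concretely, the mismatch comes from fitting KMeans on the first 15 columns but fitting PCA on a different 16-column selection (df.columns[1:], which by that point also contains the added cluster column). A minimal sketch of one way to make the shapes line up, assuming df is the DataFrame loaded above and you want to project everything onto the same 15 features used for clustering:

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

feature_cols = df.columns[:15]                 # the same 15 columns passed to KMeans

cluster = KMeans(n_clusters=2)
df['cluster'] = cluster.fit_predict(df[feature_cols])

pca = PCA(n_components=2)                      # fit once, reuse for points and centers
coords = pca.fit_transform(df[feature_cols])
df['x'], df['y'] = coords[:, 0], coords[:, 1]

# cluster_centers_ has shape (2, 15), matching the PCA fitted on the same 15 features
cluster_centers = pd.DataFrame(pca.transform(cluster.cluster_centers_), columns=['x', 'y'])
cluster_centers['cluster'] = range(len(cluster_centers))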
I am trying to solve the following problem:
import numpy as np
import pandas as pd
from scipy import sparse
X1 = sparse.rand(10, 10000)
df = pd.DataFrame({ 'a': range(10)})
In fact, I get X1 from TfidfVectorizer, but I have left that code out for the sake of brevity.
I want to apply sparse.hstack to use both variables in a regression.
I convert the pandas Series to a numpy.ndarray as below:
X2 = df['a'].as_matrix()
type(X2)
numpy.ndarray
X = sparse.hstack((X1,X2))
ValueError Traceback (most recent call last)
<ipython-input-38-9493e3833c5d> in <module>()
----> 1 X = sparse.hstack((X1,X2))
C:\Program Files\Anaconda3\lib\site-packages\scipy\sparse\construct.py in hstack(blocks, format, dtype)
462
463 """
--> 464 return bmat([blocks], format=format, dtype=dtype)
465
466
C:\Program Files\Anaconda3\lib\site-packages\scipy\sparse\construct.py in bmat(blocks, format, dtype)
579 elif brow_lengths[i] != A.shape[0]:
580 raise ValueError('blocks[%d,:] has incompatible '
--> 581 'row dimensions' % i)
582
583 if bcol_lengths[j] == 0:
ValueError: blocks[0,:] has incompatible row dimensions
What's wrong?
I've done it as below, and it works:
import numpy as np
import pandas as pd
from scipy import sparse
X1 = sparse.rand(10, 10000)
df = pd.DataFrame({ 'a': range(10)})
X2 = df['a'].reset_index()
X2 = X2.iloc[:,[1]].values
X = sparse.hstack((X1,X2))
Your arrays must have the same first-dimension size and must each contain at least one row.
You can check that with X1.shape and X2.shape (note that shape is an attribute, not a method).
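To spell that out with a short sketch (using .to_numpy(), the modern replacement for the deprecated .as_matrix()): df['a'] comes back as a 1-D array of shape (10,), which sparse.hstack treats as a single row, hence the "incompatible row dimensions" error. Reshaping it into a (10, 1) column fixes it, just like the reset_index/iloc approach above:

import numpy as np
import pandas as pd
from scipy import sparse

X1 = sparse.rand(10, 10000)
df = pd.DataFrame({'a': range(10)})

X2 = df['a'].to_numpy().reshape(-1, 1)   # shape (10, 1) instead of (10,)
X = sparse.hstack((X1, X2))

print(X1.shape, X2.shape, X.shape)       # (10, 10000) (10, 1) (10, 10001)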