RAPIDS cuml KNeighbors: number of landmark samples must be >= k - python

Minimum reproducible example:
import cudf
from cuml.neighbors import KNeighborsRegressor
d = {
'id':['a','b','c','d','e','f'],
'latitude':[50,-22,13,37,43,14],
'longitude':[3,-43,100,27,-4,121],
}
df = cudf.DataFrame(d)
knn = KNeighborsRegressor(n_neighbors = 4, metric = 'haversine')
knn.fit(df[['latitude','longitude']],df.index)
dists, nears = knn.kneighbors(df[['latitude','longitude']], return_distance = True)
This throws the error "number of landmark samples must be >= k".
The full traceback is:
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_33/1073358290.py in <module>
10 knn = KNeighborsRegressor(n_neighbors = 4, metric = 'haversine')
11 knn.fit(df[['latitude','longitude']],df.index)
---> 12 dists, nears = knn.kneighbors(df[['latitude','longitude']], return_distance = True)
/opt/conda/lib/python3.7/site-packages/cuml/internals/api_decorators.py in inner_get(*args, **kwargs)
584
585 # Call the function
--> 586 ret_val = func(*args, **kwargs)
587
588 return cm.process_return(ret_val)
cuml/neighbors/nearest_neighbors.pyx in cuml.neighbors.nearest_neighbors.NearestNeighbors.kneighbors()
cuml/neighbors/nearest_neighbors.pyx in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors()
cuml/neighbors/nearest_neighbors.pyx in cuml.neighbors.nearest_neighbors.NearestNeighbors._kneighbors_dense()
RuntimeError: exception occured! file=_deps/raft-src/cpp/include/raft/spatial/knn/detail/ball_cover.cuh line=326: number of landmark samples must be >= k
Obtained 64 stack frames
...
I have been trying to get around this error for days, but the only way I know is to convert the cuDF DataFrame to a pandas DataFrame and use scikit-learn, which works perfectly:
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
d = {
'id':['a','b','c','d','e','f'],
'latitude':[50,-22,13,37,43,14],
'longitude':[3,-43,100,27,-4,121],
}
df = pd.DataFrame(d)
knn = KNeighborsRegressor(n_neighbors = 4, metric = 'haversine')
knn.fit(df[['latitude','longitude']],df.index)
dists, nears = knn.kneighbors(df[['latitude','longitude']], return_distance = True)
dists
gives us the distances array
Can you help me find a pure RAPIDS solution?
UPDATE: I found that it works when the number of neighbors is <= len(data) // 2.
UPDATE: It's a bug, and an issue has been opened here. We can pass algorithm='brute' as a workaround until the issue is resolved.
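For a small input like this one, the result of a brute-force search can be cross-checked on the CPU. The following is a minimal NumPy sketch, not cuML's implementation; the helper name haversine_knn is hypothetical. It assumes the coordinates are in degrees and converts them to radians first, since scikit-learn documents its haversine metric as taking [latitude, longitude] in radians.

```python
import numpy as np


def haversine_knn(lat_deg, lon_deg, k):
    """Brute-force k nearest neighbours under the haversine metric (unit sphere)."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Pairwise differences between every point and every other point
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    h = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    dist = 2 * np.arcsin(np.sqrt(np.clip(h, 0.0, 1.0)))
    # Sort each row and keep the k closest points (each point is its own nearest)
    idx = np.argsort(dist, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(dist, idx, axis=1), idx


dists, nears = haversine_knn([50, -22, 13, 37, 43, 14],
                             [3, -43, 100, 27, -4, 121], k=4)
print(dists.shape, nears.shape)
```

The returned distances are angles in radians on the unit sphere; multiply by the Earth's radius to get kilometres.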

Related

ValueError: Unable to coerce to Series, length must be 1: given n

I have been trying to use RF regression from scikit-learn, but I’m getting an error with my standard (from docs and tutorials) model. Here is the code:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
df = pd.read_excel('/home/artyom/myprojects//valuevo/field2019/report/segs_inventar_dataframe/excel_var/invcents.xlsx')
age = df[['AGE_1', 'AGE_2', 'AGE_3', 'AGE_4', 'AGE_5']]
hight = df [['HIGHT_','HIGHT_1', 'HIGHT_2', 'HIGHT_3', 'HIGHT_4', 'HIGHT_5']]
diam = df[['DIAM_', 'DIAM_1', 'DIAM_2', 'DIAM_3', 'DIAM_4', 'DIAM_5']]
za = df[['ZAPSYR_', 'ZAPSYR_1', 'ZAPSYR_2', 'ZAPSYR_3', 'ZAPSYR_4', 'ZAPSYR_5']]
tova = df[['TOVARN_', 'TOVARN_1', 'TOVARN_2', 'TOVARN_3', 'TOVARN_4', 'TOVARN_5']]
#df['average'] = df.mean(numeric_only=True, axis=1)
df['meanage'] = age.mean(numeric_only=True, axis=1)
df['meanhight'] = hight.mean(numeric_only=True, axis=1)
df['mediandiam'] = diam.mean(numeric_only=True, axis=1)
df['medianza'] = za.mean(numeric_only=True, axis=1)
df['mediantova'] = tova.mean(numeric_only=True, axis=1)
unite = df[['gapA_segA','gapP_segP', 'A_median', 'p_median', 'circ_media','fdi_median', 'pfd_median', 'p_a_median', 'gsci_media','meanhight']].dropna()
from sklearn.model_selection import train_test_split as ttsplit
df_copy = unite.copy()
trainXset = df_copy[['gapA_segA','gapP_segP', 'A_median', 'p_median', 'circ_media','fdi_median', 'pfd_median', 'p_a_median', 'gsci_media']]
trainYset = df_copy [['meanhight']]
trainXset_train, trainXset_test, trainYset_train, trainYset_test = ttsplit(trainXset, trainYset, test_size=0.3) # 70% training and 30% test
rf = RandomForestRegressor(n_estimators = 100, random_state = 40)
rf.fit(trainXset_train, trainYset_train)
predictions = rf.predict(trainXset_test)
errors = abs(predictions - trainYset_test)
mape = 100 * (errors / trainYset_test)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
But the output is an error:
---> 24 errors = abs(predictions - trainYset_test)
25 # Calculate mean absolute percentage error (MAPE)
26 mape = 100 * (errors / trainYset_test)
... (more traceback)
ValueError: Unable to coerce to Series, length must be 1: given 780
How can I fix it? 780 is the shape of trainYset_test. I’m not asking for a solution (i.e. write code for me), but for advice on why this error happened. I followed everything as in tutorials.
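The error shows that the two operands have incompatible shapes: predictions is a 1-D array of length 780, while trainYset_test is a DataFrame with a single column, so pandas tries to align the array with the columns (length 1) and fails.
Reshape the predictions into a column so the shapes match:
predictions = predictions.reshape(-1, 1)  # here equivalent to reshape(780, 1)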
I solved this by making sure the predictions were the same data type as the actual data. In my case, the failing line was:
MSE = (sum((y_test-predictions)**2))/(len(newX)-len(newX.columns))
I resolved this by casting y_test to be a numpy array:
MSE = (sum((np.array(y_test)-predictions)**2))/(len(newX)-len(newX.columns))
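The shape issue discussed in these answers can be reproduced on a toy example. This is a minimal sketch with made-up numbers standing in for trainYset_test and the output of rf.predict; it shows two ways to bring both operands to the same shape before computing the error.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for trainYset_test and rf.predict(...); the values are made up
y_test = pd.DataFrame({"meanhight": [10.0, 20.0, 40.0]})  # 2-D: shape (3, 1)
predictions = np.array([11.0, 18.0, 44.0])                # 1-D: shape (3,)
print(y_test.shape, predictions.shape)

# Option 1: flatten the single-column DataFrame to a 1-D array
y_true = y_test["meanhight"].to_numpy()
errors = np.abs(predictions - y_true)
mape = 100 * np.mean(errors / y_true)
accuracy = 100 - mape
print("Accuracy:", round(accuracy, 2), "%.")

# Option 2: reshape the predictions into a column instead
errors_2d = np.abs(predictions.reshape(-1, 1) - y_test.to_numpy())
print(errors_2d.shape)
```

Selecting the target as a Series (single brackets, df_copy['meanhight']) when building the training set avoids the mismatch in the first place.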

pymc3 : Dirichlet with multidimensional concentration factor

I am struggling with implementing a model where the concentration factor of the Dirichlet variable is dependent on another variable.
The situation is the following:
A system fails due to faulty components (there are three components, only one fails at each test/observation).
The probability of failure of the components is dependent on the temperature.
Here is a (commented) short implementation of the situation:
import numpy as np
import pymc3 as pm
import theano.tensor as tt
# Temperature data : 3 cold temperatures and 3 warm temperatures
T_data = np.array([10, 12, 14, 80, 90, 95])
# Data of failures of 3 components : [0,0,1] means component 3 failed
F_data = np.array([[0, 0, 1], \
[0, 0, 1], \
[0, 0, 1], \
[1, 0, 0], \
[1, 0, 0], \
[1, 0, 0]])
n_component = 3
# When temperature is cold : Component 1 fails
# When temperature is warm : Component 3 fails
# Component 2 never fails
# Number of observations :
n_obs = len(F_data)
# The number of failures can be modeled as a Multinomial F ~ M(n_obs, p) with parameters
# - n_test : number of tests (Fixed)
# - p : probability of failure of each component (shape (n_obs, 3))
# The probability of failure of components follows a Dirichlet distribution p ~ Dir(alpha) with parameters:
# - alpha : concentration (shape (n_obs, 3))
# The Dirichlet distribution ensures the probabilities sum to 1
# The alpha parameters (and thus the probabilities of failure) depend on the temperature: alpha ~ a + b * T
# - a : bias term (shape (1,3))
# - b : describes temperature dependency of alpha (shape (1,3))
# The prior on "a" is a normal distribution with mean 1/2 and std 0.001
# a ~ N(1/2, 0.001)
# The prior on "b" is a normal distribution with mean 0 and std 0.001
# b ~ N(0, 0.001)
# Coding it all with pymc3
with pm.Model() as model:
    a = pm.Normal('a', 1/2, 1/(0.001**2), shape = n_component)
    b = pm.Normal('b', 0, 1/(0.001**2), shape = n_component)
    # I generate 3 alphas values (corresponding to the 3 components) for each of the 6 temperatures
    # I tried different ways to compute alpha but nothing worked out
    alphas = pm.Deterministic('alphas', a + b * tt.stack([T_data, T_data, T_data], axis=1))
    #alphas = pm.Deterministic('alphas', a + b[None, :] * T_data[:, None])
    #alphas = pm.Deterministic('alphas', a + tt.outer(T_data,b))
    # I think I should get 3 probabilities (corresponding to the 3 components) for each of the 6 temperatures
    #p = pm.Dirichlet('p', alphas, shape = n_component)
    p = pm.Dirichlet('p', alphas, shape = (n_obs,n_component))
    # Multinomial is observed and take values from F_data
    F = pm.Multinomial('F', 1, p, observed = F_data)
with model:
    trace = pm.sample(5000)
I get the following error in the sample function:
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/anaconda3/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 73, in run
self._start_loop()
File "/anaconda3/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 113, in _start_loop
point, stats = self._compute_point()
File "/anaconda3/lib/python3.6/site-packages/pymc3/parallel_sampling.py", line 139, in _compute_point
point, stats = self._step_method.step(self._point)
File "/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/arraystep.py", line 247, in step
apoint, stats = self.astep(array)
File "/anaconda3/lib/python3.6/site-packages/pymc3/step_methods/hmc/base_hmc.py", line 117, in astep
'might be misspecified.' % start.energy)
ValueError: Bad initial energy: inf. The model might be misspecified.
"""
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
ValueError: Bad initial energy: inf. The model might be misspecified.
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-5-121fdd564b02> in <module>()
1 with model:
2 #start = pm.find_MAP()
----> 3 trace = pm.sample(5000)
/anaconda3/lib/python3.6/site-packages/pymc3/sampling.py in sample(draws, step, init, n_init, start, trace, chain_idx, chains, cores, tune, nuts_kwargs, step_kwargs, progressbar, model, random_seed, live_plot, discard_tuned_samples, live_plot_kwargs, compute_convergence_checks, use_mmap, **kwargs)
438 _print_step_hierarchy(step)
439 try:
--> 440 trace = _mp_sample(**sample_args)
441 except pickle.PickleError:
442 _log.warning("Could not pickle model, sampling singlethreaded.")
/anaconda3/lib/python3.6/site-packages/pymc3/sampling.py in _mp_sample(draws, tune, step, chains, cores, chain, random_seed, start, progressbar, trace, model, use_mmap, **kwargs)
988 try:
989 with sampler:
--> 990 for draw in sampler:
991 trace = traces[draw.chain - chain]
992 if trace.supports_sampler_stats and draw.stats is not None:
/anaconda3/lib/python3.6/site-packages/pymc3/parallel_sampling.py in __iter__(self)
303
304 while self._active:
--> 305 draw = ProcessAdapter.recv_draw(self._active)
306 proc, is_last, draw, tuning, stats, warns = draw
307 if self._progress is not None:
/anaconda3/lib/python3.6/site-packages/pymc3/parallel_sampling.py in recv_draw(processes, timeout)
221 if msg[0] == 'error':
222 old = msg[1]
--> 223 six.raise_from(RuntimeError('Chain %s failed.' % proc.chain), old)
224 elif msg[0] == 'writing_done':
225 proc._readable = True
/anaconda3/lib/python3.6/site-packages/six.py in raise_from(value, from_value)
RuntimeError: Chain 1 failed.
Any suggestions?
Misspecified model. The alphas are taking on nonpositive values under your current parameterization, whereas the Dirichlet distribution requires them to be positive, making the model misspecified.
In Dirichlet-Multinomial regression, one uses an exponential link function to mediate between the range of the linear model and the domain of the Dirichlet-Multinomial, namely,
alpha = exp(beta*X)
There are details on this in the MGLM package documentation.
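To see why the link function matters, here is a minimal NumPy sketch (outside PyMC3) that compares the linear predictor with its exponential. The coefficient values are hypothetical and chosen only for illustration; the temperatures are the ones from the question, standardized.

```python
import numpy as np

T_data = np.array([10, 12, 14, 80, 90, 95], dtype=float)
T_z = (T_data - T_data.mean()) / T_data.std()  # standardized temperatures

# Hypothetical coefficient values, one per component
a = np.array([0.5, 0.5, 0.5])
b = np.array([1.0, 0.0, -1.0])

linear = a + np.outer(T_z, b)  # shape (6, 3); entries can be zero or negative
alpha = np.exp(linear)         # shape (6, 3); every entry is strictly positive

print(linear.min(), alpha.min())
```

The outer product gives one row per observation and one column per component, which is the (n_obs, 3) shape the Dirichlet concentration needs.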
Dirichlet-Multinomial Regression Model
If we implement this model we can achieve decent model convergence and sampling.
import numpy as np
import pymc3 as pm
import theano
import theano.tensor as tt
from sklearn.preprocessing import scale
T_data = np.array([10,12,14,80,90,95])
# standardize the data for better sampling
T_data_z = scale(T_data)
# transform to theano tensor, so it works with tt.outer
T_data_z = theano.shared(T_data_z)
F_data = np.array([
[0,0,1],
[0,0,1],
[0,0,1],
[1,0,0],
[1,0,0],
[1,0,0],
])
# N = num_obs, K = num_components
N, K = F_data.shape
with pm.Model() as dmr_model:
    a = pm.Normal('a', mu=0, sd=1, shape=K)
    b = pm.Normal('b', mu=0, sd=1, shape=K)
    alpha = pm.Deterministic('alpha', pm.math.exp(a + tt.outer(T_data_z, b)))
    p = pm.Dirichlet('p', a=alpha, shape=(N, K))
    F = pm.Multinomial('F', 1, p, observed=F_data)
    trace = pm.sample(5000, tune=10000, target_accept=0.9)
Model Outcomes
The sampling in this model isn't perfect. For example, there are still a number of divergences even with the increased target acceptance rate and additional tuning.
There were 501 divergences after tuning. Increase target_accept or reparameterize.
There were 477 divergences after tuning. Increase target_accept or reparameterize.
The acceptance probability does not match the target. It is 0.5858954056820339, but should be close to 0.8. Try to increase the number of tuning steps.
The number of effective samples is smaller than 10% for some parameters.
Trace Plots
We can see the traces for a and b look good, and the mean locations make sense with data.
Pair Plot
While correlation is less of a problem for NUTS, having uncorrelated posterior sampling is ideal. For the most part we're seeing low correlation, with some slight structure within the a components.
Posterior Plots
Finally, we can look at the posterior plots of p and confirm they make sense with the data.
Alternative Model
The advantage of the Dirichlet-Multinomial is handling overdispersion. It might be worth trying the simpler Multinomial Logistic Regression / Softmax Regression, since it runs significantly faster and doesn't exhibit any of the sampling problems coming up in the DMR model.
In the end, you could run both and perform model comparison to see if the Dirichlet-Multinomial really is adding explanatory value.
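For intuition about the softmax link, here is a minimal NumPy sketch of what it does to the linear predictor: each row of scores becomes a probability vector. The temperatures and coefficients below are made up for illustration and are not fitted values.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax; subtracting the row maximum keeps exp() from overflowing."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)


T_z = np.array([-1.0, -0.9, -0.8, 0.8, 0.9, 1.0])  # made-up standardized temperatures
a = np.array([0.0, 0.0, 0.0])    # hypothetical intercepts
b = np.array([2.0, 0.0, -2.0])   # hypothetical slopes

p = softmax(a + np.outer(T_z, b))  # shape (6, 3), each row sums to 1
print(p.round(3))
```

With these slopes, the third component is the most likely to fail at low temperatures and the first at high temperatures, which is the pattern in F_data.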
Model
with pm.Model() as softmax_model:
    a = pm.Normal('a', mu=0, sd=1, shape=K)
    b = pm.Normal('b', mu=0, sd=1, shape=K)
    p = pm.Deterministic('p', tt.nnet.softmax(a + tt.outer(T_data_z, b)))
    F = pm.Multinomial('F', 1, p, observed = F_data)
    trace_sm = pm.sample(5000, tune=10000)
Posterior Plots

"RecursionError: maximum recursion depth exceeded" when using statsmodels OLS?

I am trying to do a linear regression for a fairly large dataset where I can actually get p-values for each of my coefficients. This is fairly straightforward when the dataset is smaller but everything breaks when I use it on my actual dataset. This code replicates the problem with a toy dataset. I could do the linear regression w/ sklearn but I can't get p-values using this method and I also prefer statsmodels for this task in particular b/c the way it handles categorical data.
How can I make statsmodels run linear models on more complex datasets? I know this is not statistically recommended b/c there are more attributes than observations but I am trying out some exercises.
This happens using OLS, GLM, and MixedLM as well.
I even tried setting my recursion limit higher but it did not work...
A few posts cover this topic but none deal with datasets that yield a recursion error:
Find p-value (significance) in scikit-learn LinearRegression
https://datascience.stackexchange.com/questions/15398/how-to-get-p-value-and-confident-interval-in-logisticregression-with-sklearn
# Make dataset
from sklearn.datasets import make_regression
import numpy as np
import pandas as pd
X, y = make_regression(n_features = 4000)
X = pd.DataFrame(X,
index=[*map(lambda i:f"sample_{i}", range(X.shape[0]))],
columns=[*map(lambda j:f"attr_{j}", range(X.shape[1]))],
)
y = pd.Series(y,index=X.index)
# X.iloc[:5,:5]
# attr_0 attr_1 attr_2 attr_3 attr_4
# sample_0 -2.077675 -0.222409 -0.782709 1.265239 1.606933
# sample_1 0.040124 -1.427598 -0.595388 0.403271 2.098169
# sample_2 -0.864165 0.465151 0.636452 -0.127071 -0.405423
# sample_3 -1.725911 0.148566 0.343320 -0.351172 1.755546
# sample_4 0.695828 1.313974 1.149156 1.846968 -0.009125
# Import statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf
data = X.copy()
data["y"] = y
formula = "y ~ " + " + ".join(X.columns)
model = smf.ols(formula=formula, data=data).fit()
# ---------------------------------------------------------------------------
# RecursionError Traceback (most recent call last)
# <ipython-input-11-4479099d07d7> in <module>()
# 24 data["y"] = y
# 25 formula = "y ~ " + " + ".join(X.columns)
# ---> 26 model = smf.ols(formula=formula, data=data)
# ...
# ~/anaconda/envs/python3/lib/python3.6/site-packages/patsy/desc.py in eval(self, tree, require_evalexpr)
# 398 "'%s' operator" % (tree.type,),
# 399 tree.token)
# --> 400 result = self._evaluators[key](self, tree)
# 401 if require_evalexpr and not isinstance(result, IntermediateExpr):
# 402 if isinstance(result, ModelDesc):
# RecursionError: maximum recursion depth exceeded
# https://pastebin.com/JhmqPKp4
Alternatively, I tried tweaking some code I found for using sklearn, but I also got a RecursionError:
# Sklearn method
# https://gist.github.com/rspeare/77061e6e317896be29c6de9a85db301d
from sklearn.linear_model import LinearRegression
class LinearRegression:
    """
    Wrapper Class for Logistic Regression which has the usual sklearn instance
    in an attribute self.model, and pvalues, z scores and estimated
    errors for each coefficient in
    self.z_scores
    self.p_values
    self.sigma_estimates
    as well as the negative hessian of the log Likelihood (Fisher information)
    self.F_ij
    """
    def __init__(self,*args,**kwargs):#,**kwargs):
        self.model = LinearRegression(*args,**kwargs)#,**args)
    def fit(self,X,y):
        self.model.fit(X,y)
        #### Get p-values for the fitted model ####
        denom = (2.0*(1.0+np.cosh(self.model.decision_function(X))))
        F_ij = np.dot((X/denom[:,None]).T,X) ## Fisher Information Matrix
        Cramer_Rao = np.linalg.inv(F_ij) ## Inverse Information Matrix
        sigma_estimates = np.array([np.sqrt(Cramer_Rao[i,i]) for i in range(Cramer_Rao.shape[0])]) # sigma for each coefficient
        z_scores = self.model.coef_[0]/sigma_estimates # z-score for each model coefficient
        p_values = [stat.norm.sf(abs(x))*2 for x in z_scores] ### two tailed test for p-values
        self.z_scores = z_scores
        self.p_values = p_values
        self.sigma_estimates = sigma_estimates
        self.F_ij = F_iJ
model = LinearRegression().fit(X,y)
# RecursionError Traceback (most recent call last)
# <ipython-input-18-6f8d228c181e> in <module>()
# 35 self.F_ij = F_iJ
# 36
# ---> 37 model = LinearRegression().fit(X,y)
# <ipython-input-18-6f8d228c181e> in __init__(self, *args, **kwargs)
# 18
# 19 def __init__(self,*args,**kwargs):#,**kwargs):
# ---> 20 self.model = LinearRegression(*args,**kwargs)#,**args)
# 21
# 22 def fit(self,X,y):
# ... last 1 frames repeated, from the frame below ...
# <ipython-input-18-6f8d228c181e> in __init__(self, *args, **kwargs)
# 18
# 19 def __init__(self,*args,**kwargs):#,**kwargs):
# ---> 20 self.model = LinearRegression(*args,**kwargs)#,**args)
# 21
# 22 def fit(self,X,y):
# RecursionError: maximum recursion depth exceeded

Python rpy2 - nls regression RRuntimeError

I am trying to do some nls regression using R within Python. I am stuck on an RRuntimeError that is well outside my expertise. I have struggled for a few days to get it to work, so I would appreciate some help.
This is my csv of data:
http://www.sharecsv.com/s/4cdd4f832b606d6616260f9dc0eedf38/ratedata.csv
This is my code:
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
pandas2ri.activate()
dfData = pd.read_csv('C:\\Users\\nick\\Desktop\\ratedata.csv')
rdf = pandas2ri.py2ri(dfData)
a = 0.5
b = 1.1
count = rdf.rx(True, 'Trials')
rates = rdf.rx(True, 'Successes')
base = importr('base', robject_translations={'with': '_with'})
stats = importr('stats', robject_translations={'format_perc': '_format_perc'})
my_formula = stats.as_formula('rates ~ 1-(1/(10^(a * count ^ (b-1))))')
d = ro.ListVector({'a': a, 'b': b})
fit = stats.nls(my_formula, weights=count, start=d)
Everything runs apart from:
fit = stats.nls(my_formula, weights=count, start=d)
I am getting the following traceback:
---------------------------------------------------------------------------
RRuntimeError Traceback (most recent call last)
<ipython-input-12-3f7fcd7d7851> in <module>()
6 d = ro.ListVector({'a': a, 'b': b})
7
----> 8 fit = stats.nls(my_formula, weights=count, start=d)
~\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2\robjects\functions.py in __call__(self, *args, **kwargs)
176 v = kwargs.pop(k)
177 kwargs[r_k] = v
--> 178 return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
179
180 pattern_link = re.compile(r'\\link\{(.+?)\}')
~\AppData\Local\Continuum\anaconda3\lib\site-packages\rpy2\robjects\functions.py in __call__(self, *args, **kwargs)
104 for k, v in kwargs.items():
105 new_kwargs[k] = conversion.py2ri(v)
--> 106 res = super(Function, self).__call__(*new_args, **new_kwargs)
107 res = conversion.ri2ro(res)
108 return res
RRuntimeError: Error in (function (formula, data = parent.frame(), start, control = nls.control(), :
parameters without starting value in 'data': rates, count
I would be eternally thankful if anyone can see where I am going wrong, or can offer advice. All I want is the two numbers from that formula back in Python so I can use those to construct some confidence intervals.
Thank you
Consider putting all of your formula variables into a single data frame and passing it via the data argument. The formula is evaluated in the R environment, but rates and count exist only in the Python scope, so nls cannot find them. Keeping everything in one object avoids this. Then run nls with either the pandas DataFrame or the R data frame:
import pandas as pd
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
base = importr('base', robject_translations={'with': '_with'})
stats = importr('stats', robject_translations={'format_perc': '_format_perc'})
a = 0.05
b = 1.1
d = ro.ListVector({'a': a, 'b': b})
dfData = pd.read_csv('Input.csv')
dfData['count'] = dfData['Trials'].astype('float')
dfData['rates'] = dfData['Successes'] / dfData['Trials']
dfData['a'] = a
dfData['b'] = b
pandas2ri.activate()
rdf = pandas2ri.py2ri(dfData)
my_formula = stats.as_formula('rates ~ 1-(1/(10^(a * count ^ (b-1))))')
# WITH PANDAS DATAFRAME
fit = stats.nls(formula=my_formula, data=dfData, weights=dfData['count'], start=d)
print(fit)
# WITH R DATAFRAME
fit = stats.nls(formula=my_formula, data=rdf, weights=rdf.rx(True, 'count'), start=d)
print(fit)
Alternatively, you can assign the variables into robjects.globalenv and omit the data argument:
ro.globalenv['rates'] = dfData['rates']
ro.globalenv['count'] = dfData['count']
ro.globalenv['a'] = dfData['a']
ro.globalenv['b'] = dfData['b']
fit = stats.nls(formula=my_formula, weights=dfData['count'], start=d)
print(fit)
# Nonlinear regression model
# model: rates ~ 1 - (1/(10^(a * count^(b - 1))))
# data: parent.frame()
# a b
# 0.01043 1.24943
# weighted residual sum-of-squares: 14.37
# Number of iterations to convergence: 6
# Achieved convergence tolerance: 9.793e-07
# To return parameters
num = fit.rx('m')[0].names.index('getPars')
obj = fit.rx('m')[0][num]()
print(obj[0])
# 0.010425686223717435
print(obj[1])
# 1.2494303314553932
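Once the two numbers are back in Python, the fitted curve can be evaluated without going through R again. This is a minimal NumPy sketch of the model from the formula string; the function name predicted_rate and the trial counts are hypothetical, and the parameter values are the ones printed above.

```python
import numpy as np


def predicted_rate(count, a, b):
    """Model from the formula above: 1 - 1 / 10**(a * count**(b - 1))."""
    count = np.asarray(count, dtype=float)
    return 1.0 - 1.0 / 10.0 ** (a * count ** (b - 1.0))


# Fitted values printed above
a_hat, b_hat = 0.010425686223717435, 1.2494303314553932

counts = np.array([1, 10, 100, 1000])  # hypothetical trial counts
rates = predicted_rate(counts, a_hat, b_hat)
print(rates)
```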
Equivalently in R:
dfData <- read.csv('Input.csv')
a <- .05
b <- 1.1
d <- list(a=a, b=b)
dfData$count <- dfData$Trials
dfData$rates <- dfData$Successes / dfData$Trials
dfData$a <- a
dfData$b <- b
my_formula <- stats::as.formula("rates ~ 1-(1/(10^(a * count ^ (b-1))))")
fit <- stats::nls(my_formula, data=dfData, weights=dfData$count, start=d)
print(fit)
# Nonlinear regression model
# model: rates ~ 1 - (1/(10^(a * count^(b - 1))))
# data: dfData
# a b
# 0.01043 1.24943
# weighted residual sum-of-squares: 14.37
# Number of iterations to convergence: 6
# Achieved convergence tolerance: 9.793e-07
# To return parameters
fit$m$getPars()['a']
# 0.01042569
fit$m$getPars()['b']
# 1.24943

SciPy in Jupyter

Can someone help me figure out why I'm getting this error: ValueError: n_components must be < n_features; got 10 >= 0
import pandas as pd
from scipy.sparse import csr_matrix
users = pd.read_table(open('ml-1m/users.dat', encoding = "ISO-8859-1"), sep=':', header=None, names=['user_id', 'gender', 'age', 'occupation', 'zip'])
ratings = pd.read_table(open('ml-1m/ratings.dat', encoding = "ISO-8859-1"), sep=':', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'])
movies = pd.read_table(open('ml-1m/movies.dat', encoding = "ISO-8859-1"), sep=':', header=None, names=['movie_id', 'title', 'genres'])
MovieLens = pd.merge(pd.merge(ratings, users), movies)
ratings_mtx_df = MovieLens.pivot_table(values='rating', index='user_id', columns='title', fill_value=0)
movie_index = ratings_mtx_df.columns
from sklearn.decomposition import TruncatedSVD
recom = TruncatedSVD(n_components=10, random_state=101)
R = recom.fit_transform(ratings_mtx_df.values.T)
ValueError Traceback (most recent call last)
<ipython-input-8-0bd6c9bda95a> in <module>()
1 from sklearn.decomposition import TruncatedSVD
2 recom = TruncatedSVD(n_components=10, random_state=101)
----> 3 R = recom.fit_transform(ratings_mtx_df.values.T)
C:\Users\renau\Anaconda3\lib\site-packages\sklearn\decomposition\truncated_svd.py in fit_transform(self, X, y)
168 if k >= n_features:
169 raise ValueError("n_components must be < n_features;"
--> 170 " got %d >= %d" % (k, n_features))
171 U, Sigma, VT = randomized_svd(X, self.n_components,
172 n_iter=self.n_iter,
ValueError: n_components must be < n_features; got 10 >= 0
You're trying to reduce your data to 10 components, but as per the documentation for TruncatedSVD, the number of features (columns) in the matrix you pass to fit_transform must be greater than the number of components you want to extract. Note that the message reports n_features as 0 ("got 10 >= 0"), so the transposed matrix has no columns at all, i.e. ratings_mtx_df has no rows, and lowering n_components alone will not help. Check ratings_mtx_df.shape before fitting; once the matrix has data, choose n_components smaller than the number of features.
Also, you're transposing your input data with .T in:
R = recom.fit_transform(ratings_mtx_df.values.T)
That swaps features (columns) and observations (rows), which may explain why the fit_transform method isn't working.
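One thing worth checking is how the files are parsed. As far as I know, the MovieLens 1M .dat files separate fields with a double colon, so reading them with sep=':' would produce misaligned columns. Whether that is what empties the matrix here is an assumption. The sketch below parses two sample rows in that layout from an in-memory string and adds a hypothetical helper, check_svd_input, that fails early with a readable message.

```python
import io

import pandas as pd

# Two sample rows in the '::'-delimited layout (an in-memory stand-in for users.dat)
raw = "1::F::1::10::48067\n2::M::56::16::70072\n"
names = ["user_id", "gender", "age", "occupation", "zip"]

# A multi-character separator needs the Python parsing engine
users = pd.read_csv(io.StringIO(raw), sep="::", engine="python",
                    header=None, names=names)
print(users.shape)


def check_svd_input(matrix, n_components):
    """Hypothetical helper: validate the shape before calling fit_transform."""
    n_samples, n_features = matrix.shape
    if n_features == 0:
        raise ValueError("matrix has no columns; check the loading and merge steps")
    if n_components >= n_features:
        raise ValueError(
            f"n_components={n_components} must be < n_features={n_features}")
    return n_samples, n_features


shape = check_svd_input(users[["age", "occupation"]].to_numpy(), n_components=1)
print(shape)
```

Printing the shape after each read and merge step shows where the rows disappear.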
