Lightgbm can't access data from Dataset get_field method - python

I have got a simple lgbm dataset:
import lightgbm as lgbm
dataset = lgbm.Dataset(data=X, label=y, feature_name=X.columns.tolist())
Here X is a pandas DataFrame and y is a pandas Series. I want to access a specific column of X in my custom objective function. But when I try:
data = dataset.get_field('data')
I get this error message:
Traceback (most recent call last):
File "<ipython-input-71-34d27860b9e3>", line 1, in <module>
data = dataset.get_field('data')
File "/Users/***/anaconda3/envs/py3k/lib/python3.6/site-packages/lightgbm/basic.py", line 1007, in get_field
ctypes.byref(out_type)))
File "/Users/***/anaconda3/envs/py3k/lib/python3.6/site-packages/lightgbm/basic.py", line 48, in _safe_call
raise LightGBMError(_LIB.LGBM_GetLastError())
LightGBMError: b'Field not found'
Whereas this works well:
y = dataset.get_field('label')
Thank you!

It doesn't seem to be possible.
The data is the core of a Dataset, whereas the remaining lgb.Dataset constructor arguments are handled as additional fields. As you can trace in the _lazy_init function, every argument other than data ends up in lgb.Dataset.set_field. In the C back end, field setting goes through LGBM_DatasetSetField, which dispatches to the SetXXXField functions. Those calls do not appear anywhere else in c_api.cpp, which suggests that 'data' is never registered as a field and explains why get_field('data') reports 'Field not found'.
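One possible workaround, given that the raw data cannot be read back through get_field, is to let the custom objective close over the column it needs. The sketch below is only an illustration: the weighted squared-error loss, the name make_weighted_l2_objective, and the stand-in class _LabelOnlyDataset are assumptions made so the example runs without LightGBM installed. It also assumes that the row order of the captured column matches the row order of the Dataset.

```python
import numpy as np


def make_weighted_l2_objective(column_values):
    """Build a custom objective that closes over one feature column.

    Hypothetical helper: the column is used as a per-row weight in a
    squared-error loss, 0.5 * w * (pred - y) ** 2.
    """
    weights = np.asarray(column_values, dtype=float)

    def objective(preds, train_data):
        # train_data only needs get_label(), which lightgbm.Dataset provides.
        labels = np.asarray(train_data.get_label(), dtype=float)
        residual = np.asarray(preds, dtype=float) - labels
        grad = weights * residual                  # first derivative w.r.t. preds
        hess = weights * np.ones_like(residual)    # second derivative w.r.t. preds
        return grad, hess

    return objective


class _LabelOnlyDataset:
    """Stand-in for lightgbm.Dataset (assumption, for demonstration only)."""

    def __init__(self, label):
        self._label = np.asarray(label, dtype=float)

    def get_label(self):
        return self._label


objective = make_weighted_l2_objective([1.0, 2.0, 0.5])
grad, hess = objective(np.array([0.0, 1.0, 3.0]),
                       _LabelOnlyDataset([1.0, 1.0, 1.0]))
print(grad, hess)
```

With real data you would build the objective from the column itself, for example X['some_column'].to_numpy() (a hypothetical column name), and hand it to LightGBM as the custom objective. How it is passed depends on the LightGBM version, so check the documentation for the release you have installed.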


Python - Rolling Function (Step - Pandas 1.5.0)

I updated all the relevant libraries to the latest version, hence I thought that I could now use the new feature (step) in the rolling-function:
print(df['600028.SS'].rolling(window=125, step=20).corr(df['600121.SS']))
However, when executing the command, I always get the following error message:
NotImplementedError: step not implemented for corr
How can I implement the step feature for corr, or is there any way to work around the error? By the way, '600028.SS' and '600121.SS' are just the names of two columns in the dataframe.
I want to get the correlation coefficient for those two stocks on a rolling basis. Every correlation coefficient should include the last 125 observations, and the step size should be 20. With the new step feature in the latest pandas release (1.5.0) I thought this would work, but I still receive the message that step is not implemented for corr.
The symptom is easily reproduced:
>>> df = pd.DataFrame([dict(a=1, b=2)])
>>> df.a.rolling(window=125, step=20).corr(df.b)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/jhanley/miniconda3/envs/problems/lib/python3.10/site-packages/pandas/core/window/rolling.py", line 2829, in corr
return super().corr(
File "/Users/jhanley/miniconda3/envs/problems/lib/python3.10/site-packages/pandas/core/window/rolling.py", line 1757, in corr
raise NotImplementedError("step not implemented for corr")
NotImplementedError: step not implemented for corr
I am reading the 1.5.1 documentation, https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.window.rolling.Rolling.corr.html .
It lists four possible args:
other
pairwise
ddof
numeric_only
It doesn't mention a step parameter.
It explains that additional args are accepted "for NumPy compatibility and will not have an effect on the result."
The diagnostic error message is accurate: you are attempting to use something that is not implemented, so that call won't work.
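A way around the missing feature, as a sketch: the pandas documentation describes step as equivalent to slicing the result with [::step], so you can compute the rolling correlation without step and slice afterwards. The random data below is made up for illustration; only the two column names come from the question.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the two price series from the question.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "600028.SS": rng.normal(size=300),
    "600121.SS": rng.normal(size=300),
})

window, step = 125, 20

# Rolling correlation at every row (no step argument), then keep every 20th result.
full = df["600028.SS"].rolling(window=window).corr(df["600121.SS"])
stepped = full.iloc[::step]

# Rows before the first complete window are NaN; drop them if not needed.
print(stepped.dropna())
```

This computes more windows than strictly necessary, so it trades some speed for simplicity. If the first kept row should be the first complete window, slice with full.iloc[window - 1::step] instead.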

TypeError: ratio() missing 1 required positional argument: 'metric_fun'

I'm trying to use IBM's aif360 library for debiasing.
I'm working on a linear regression model and want to try out a metric that calculates the difference between the privileged and unprivileged groups.
However when this code is run I get the following error:
TypeError: difference() missing 1 required positional argument: 'metric_fun'
I've looked into the class for this function but they are referring to a metric_fun, also read the docs but didn't get any further.
The function is missing an argument, but I don't know which argument it expects.
A short snippet of the code is:
train_pp_bld = StructuredDataset(df=pd.concat((x_train, y_train),
                                              axis=1),
                                 label_names=['decile_score'],
                                 protected_attribute_names=['sex_Male'],
                                 privileged_protected_attributes=1,
                                 unprivileged_protected_attributes=0)
privileged_groups = [{'sex_Male': 1}]
unprivileged_groups = [{'sex_Male': 0}]
# Create the metric object
metric_train_bld = DatasetMetric(train_pp_bld,
                                 unprivileged_groups=unprivileged_groups,
                                 privileged_groups=privileged_groups)
# Metric for the original dataset
metric_orig_train = DatasetMetric(train_pp_bld,
                                  unprivileged_groups=unprivileged_groups,
                                  privileged_groups=privileged_groups)
display(Markdown("#### Original training dataset"))
print("Difference in mean outcomes between unprivileged and privileged groups = %f" % metric_orig_train.difference())
The stack trace that was given is:
Traceback (most recent call last):
File "/Users/sef/Desktop/Thesis/Python Projects/Stats/COMPAS_Debias_AIF360_Continuous_Variable.py", line 116, in <module>
print("Difference in mean outcomes between unprivileged and privileged groups = %f" % metric_orig_train.difference())
File "/Users/sef/opt/anaconda3/envs/AI/lib/python3.8/site-packages/aif360/metrics/metric.py", line 37, in wrapper
result = func(*args, **kwargs)
TypeError: difference() missing 1 required positional argument: 'metric_fun'
After creating a function:
def privileged_value(self, privileged=False):
    if privileged:
        return unprivileged_groups['sex_Male']
    else:
        return privileged_groups['sex_Male']
display(Markdown("#### Original training dataset"))
print("Difference in mean outcomes between unprivileged and privileged groups = %f" % metric_orig_train.difference(privileged_value))
still get a similar error traceback:
Traceback (most recent call last):
File "/Users/sef/Desktop/Thesis/Python Projects/Stats/COMPAS_Debias_AIF360_Continuous_Variable.py", line 123, in <module>
print("Difference in mean outcomes between unprivileged and privileged groups = %f" % metric_orig_train.difference(privileged_value))
File "/Users/sef/opt/anaconda3/envs/AI/lib/python3.8/site-packages/aif360/metrics/metric.py", line 37, in wrapper
result = func(*args, **kwargs)
File "/Users/sef/opt/anaconda3/envs/AI/lib/python3.8/site-packages/aif360/metrics/dataset_metric.py", line 77, in difference
return metric_fun(privileged=False) - metric_fun(privileged=True)
File "/Users/youssefennali/Desktop/Thesis/Python Projects/Stats/COMPAS_Debias_AIF360_Continuous_Variable.py", line 120, in privileged_value
return privileged_groups['sex_Male']
TypeError: list indices must be integers or slices, not str
Could someone please point me in the right direction?
There are no examples available of similar code online.
Regards,
Sef
Looking at the source code for the library on GitHub, a reference to a function needs to be passed into difference(self, metric_fun). All difference does is call your function twice and subtract the result for privileged=True from the result for privileged=False.
def difference(self, metric_fun):
    """Compute difference of the metric for unprivileged and privileged
    groups.
    """
    return metric_fun(privileged=False) - metric_fun(privileged=True)
Create a function like this and pass it into difference.
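def privilege_value(privileged=False) -> int:
    # Both variables are lists of dicts, so index the first dict before the key.
    # The key must match the dict exactly: 'sex_Male', not 'sex_male'.
    if privileged:
        return privileged_groups[0]['sex_Male']
    else:
        return unprivileged_groups[0]['sex_Male']
metric_orig_train.difference(privilege_value)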
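To make the calling convention concrete, here is a small stand-in that does not use aif360 at all. MiniDatasetMetric, its mean_outcome method, and the four sample rows are hypothetical; only the body of difference is copied from the library source quoted above. It shows that metric_fun is expected to compute a per-group metric (here the mean label), which is probably closer to what "difference in mean outcomes" is meant to report than returning the group value itself.

```python
class MiniDatasetMetric:
    """Hypothetical stand-in mirroring only the quoted difference() method."""

    def __init__(self, rows, label_name, group_name):
        self.rows = rows
        self.label_name = label_name
        self.group_name = group_name

    def mean_outcome(self, privileged=False):
        # Select the rows of the requested group (1 = privileged, 0 = unprivileged).
        wanted = 1 if privileged else 0
        values = [row[self.label_name] for row in self.rows
                  if row[self.group_name] == wanted]
        return sum(values) / len(values)

    def difference(self, metric_fun):
        # Same expression as in the library source quoted above.
        return metric_fun(privileged=False) - metric_fun(privileged=True)


# Made-up rows using the column names from the question.
rows = [
    {"sex_Male": 1, "decile_score": 4},
    {"sex_Male": 1, "decile_score": 6},
    {"sex_Male": 0, "decile_score": 2},
    {"sex_Male": 0, "decile_score": 4},
]
metric = MiniDatasetMetric(rows, "decile_score", "sex_Male")

# Pass the bound method itself, without calling it.
gap = metric.difference(metric.mean_outcome)
print("Difference in mean outcomes = %f" % gap)
```

Note that the function is passed by reference (metric.mean_outcome, no parentheses), and difference supplies the privileged keyword itself.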
Well, without knowing anything about the library you're using, the error message still seems pretty clear, especially since you only call difference once, like this:
metric_orig_train.difference()
The error message is telling you that you should be passing an argument in this call. The name of the argument is metric_fun, which suggests to me that you are supposed to pass it a function reference.
NOTE: It is possible that difference() is being called outside your code. When you supply an error message, please always submit the stack trace that came along with it, if there is one. Then we can see exactly where in the code the problem occurred.

Error in loading model in PyTorch

I have the following code snippet:
from train import predict
import random
import torch
ann=torch.load('ann.pt') #importing trained model
while True:
    k=raw_input("User:")
    intent,top_value,top_index = predict(str(k),ann)
    print(intent)
When I run the script, it throws the error below:
Traceback (most recent call last):
File "test.py", line 6, in <module>
ann=torch.load('ann.pt') #importing trained model
File "/home/local/ZOHOCORP/raghav-5305/miniconda2/lib/python2.7/site-packages/torch/serialization.py", line 261, in load
return _load(f, map_location, pickle_module)
File "/home/local/ZOHOCORP/raghav-5305/miniconda2/lib/python2.7/site-packages/torch/serialization.py", line 409, in _load
result = unpickler.load()
AttributeError: 'module' object has no attribute 'ANN'
The ann.pt file is in the same folder as my script.
Kindly help me identify and fix the error so that I can load the model.
Thanks in advance.
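When you save the whole model rather than just its parameters, PyTorch pickles the parameters but stores only a reference (the import path) to the model class. Loading therefore fails if the class can no longer be found at that path, for instance after changing the directory structure or refactoring.
Therefore, as the documentation points out, saving the whole model is not recommended; prefer saving and loading only the parameters: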
...the serialized data is bound to the specific classes and the exact directory structure used, so it can break in various ways when used in other projects, or after some serious refactors.
For more help, it'll be useful to show your saving code.
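The mechanism can be reproduced with plain pickle, without PyTorch. In the sketch below the module name train_demo and the toy ANN class are made up for illustration. The instance is pickled while the class is importable, then the class is removed from its module and loading fails with an AttributeError much like the one in the question.

```python
import pickle
import sys
import types

# Hypothetical module that "owns" the model class at save time.
demo_module = types.ModuleType("train_demo")
sys.modules["train_demo"] = demo_module


class ANN:
    """Toy model class; only its import path is stored in the pickle."""

    def __init__(self):
        self.weights = [0.1, 0.2]


# Make pickle treat the class as train_demo.ANN.
ANN.__module__ = "train_demo"
ANN.__qualname__ = "ANN"
demo_module.ANN = ANN

payload = pickle.dumps(ANN())

# Loading works while train_demo.ANN still exists.
restored = pickle.loads(payload)

# Simulate a refactor: the class is no longer available under that path.
del demo_module.ANN
try:
    pickle.loads(payload)
    load_error = None
except AttributeError as exc:
    load_error = exc

print(restored.weights, "|", load_error)
```

In PyTorch terms, the usual way to avoid this is to save model.state_dict() with torch.save and later call load_state_dict on a freshly constructed model, so that only tensors are stored in the file and the class definition comes from your current code.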

Which unsupervised clustering algorithm from the sklearn library can I use with custom distance?

I have a function that takes two samples as input and returns their distance, and from this function I have defined a metric:
def TwoPointsDistance(x1, x2):
    cord1 = f.rf.apply(x1)
    cord2 = f.rf.apply(x2)
    return 1 - (cord1==cord2).sum()/f.n_trees
metric = sk.neighbors.DistanceMetric.get_metric('pyfunc',
                                                func=TwoPointsDistance)
Now I would like to cluster my data according to this metric. I would like to see some examples of algorithms for unsupervised clustering that use this as a distance metric.
EDIT: I am particularly interested in this algorithm:
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html#sklearn.cluster.DBSCAN
EDIT: I have tried
DBSCAN(metric=metric, algorithm='brute').fit(Xor)
but I receive an error:
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python3.4/dist-packages/sklearn/cluster/dbscan_.py", line 249, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python3.4/dist-packages/sklearn/cluster/dbscan_.py", line 100, in dbscan
metric=metric, p=p)
File "/usr/local/lib/python3.4/dist-packages/sklearn/neighbors/unsupervised.py", line 83, in __init__
leaf_size=leaf_size, metric=metric, **kwargs)
File "/usr/local/lib/python3.4/dist-packages/sklearn/neighbors/base.py", line 127, in _init_params
% (metric, algorithm))
ValueError: Metric '<sklearn.neighbors.dist_metrics.PyFuncDistance object at 0x7ff5c299f358>' not valid for algorithm 'brute'
>>>
I've tried to figure out why this error arises... I first thought sklearn.neighbors.NearestNeighbors (which is what DBSCAN is based upon) would be constrained to those distances listed in sklearn.neighbors.base.VALID_METRICS["brute"]. But judging from the source code, any callable function should be okay - so it seems your distance isn't callable?
Please try this:
DBSCAN(metric=TwoPointsDistance, algorithm='brute').fit(Xor)
i.e. without wrapping your distance as neighbors.DistanceMetric. It seems a bit inconsistent to me to not allow these to be used here...
Myself, I have used ELKI with great success with a custom distance function, and there is a short tutorial on how to write these available: http://elki.dbs.ifi.lmu.de/wiki/Tutorial/DistanceFunctions
Today, years later, I still stumbled over this in a different context. The solution is simple: pass the function directly as a metric.
DBSCAN(metric=TwoPointsDistance, algorithm='brute').fit(Xor)
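For reference, a small self-contained sketch of the same idea. The function leaf_mismatch_distance and the array X_leaves are invented for the example. They imitate the question's distance (the share of trees in which two samples land in different leaves) on precomputed leaf indices, so no random forest is needed. The eps and min_samples values are arbitrary choices for this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def leaf_mismatch_distance(x1, x2):
    """Fraction of positions where two leaf-index vectors differ (0 = identical)."""
    return 1.0 - np.mean(x1 == x2)


# Made-up leaf indices: one row per sample, one column per tree.
X_leaves = np.array([
    [1, 2, 3, 4],
    [1, 2, 3, 4],
    [1, 2, 3, 5],
    [9, 8, 7, 6],
    [9, 8, 7, 6],
    [9, 8, 7, 0],
])

# Pass the plain Python function as the metric, not a DistanceMetric wrapper.
clustering = DBSCAN(eps=0.3, min_samples=2,
                    metric=leaf_mismatch_distance,
                    algorithm="brute").fit(X_leaves)
labels = clustering.labels_
print(labels)
```

A Python callable is evaluated pair by pair, so this gets slow on large datasets. Precomputing the distance matrix and using metric='precomputed' is an alternative worth considering.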

Trying to parallelize parameter search in scikit-learn leads to "SystemError: NULL result without error in PyObject_Call"

I'm using the sklearn.grid_search.RandomizedSearchCV class from scikit-learn 0.14.1, and I get an error when running the following code:
X, y = load_svmlight_file(inputfile)
min_max_scaler = preprocessing.MinMaxScaler()
X_scaled = min_max_scaler.fit_transform(X.toarray())
parameters = {'kernel':'rbf', 'C':scipy.stats.expon(scale=100), 'gamma':scipy.stats.expon(scale=.1)}
svr = svm.SVC()
classifier = grid_search.RandomizedSearchCV(svr, parameters, n_jobs=8)
classifier.fit(X_scaled, y)
When I set the n_jobs parameter to more than 1, I get the following error output:
Traceback (most recent call last):
File "./svm_training.py", line 185, in <module>
main(sys.argv[1:])
File "./svm_training.py", line 63, in main
gridsearch(inputfile, kerneltype, parameterfile)
File "./svm_training.py", line 85, in gridsearch
classifier.fit(X_scaled, y)
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/grid_search.py", line 860, in fit
return self._fit(X, y, sampled_params)
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/grid_search.py", line 493, in _fit
for parameters in parameter_iterable
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/externals/joblib/parallel.py", line 519, in __call__
self.retrieve()
File "/usr/local/lib/python2.7/dist-packages/scikit_learn-0.14.1-py2.7-linux-x86_64.egg/sklearn/externals/joblib/parallel.py", line 419, in retrieve
self._output.append(job.get())
File "/usr/lib/python2.7/multiprocessing/pool.py", line 558, in get
raise self._value
SystemError: NULL result without error in PyObject_Call
It seems to have something to do with the Python multiprocessing functionality, but I'm not sure how to work around it other than implementing the parallelization of the parameter search by hand. Has anyone had a similar issue when parallelizing the randomized parameter search and been able to solve it?
It turns out the problem was with the use of MinMaxScaler. Since MinMaxScaler only accepts dense arrays, I was converting the sparse representation of the feature vectors to a dense array before scaling. Since each feature vector has thousands of elements, my assumption is that the dense arrays caused a memory error when the parameter search was parallelized. Instead, I switched to StandardScaler, which accepts sparse input (as long as centering is disabled with with_mean=False) and should be better suited to my problem space anyway.
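A minimal sketch of scaling without densifying, using a tiny made-up matrix in place of the svmlight data. The with_mean=False setting is what lets StandardScaler work on sparse input, because centering would turn the zeros into non-zeros.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import StandardScaler

# Made-up sparse feature matrix standing in for the output of load_svmlight_file.
X_sparse = sparse.csr_matrix(np.array([
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 4.0],
    [1.0, 0.0, 0.0],
    [0.0, 6.0, 0.0],
]))

# Scale each column to unit variance without centering, so zeros stay zeros.
scaler = StandardScaler(with_mean=False)
X_scaled = scaler.fit_transform(X_sparse)

print(type(X_scaled), X_scaled.nnz)
```

The result stays sparse and can be passed to the estimator directly, which keeps the memory footprint of each parallel worker small.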
