Initialization of XGBoost DMatrix reduces the number of features - python

I am trying to understand the following case:
when I create a new XGBoost DMatrix
xgX = xgb.DMatrix(X, label=Y, missing=np.nan)
based on input data X with 64 features,
I get a new DMatrix with only 55 features.
What magic is happening here? Any advice would be great!

Take a look at
xgboost issue #1223
There, khotilov makes the comment:
The problem with CSR is that when you have completely sparse columns at the end, you cannot figure out that they exist by just looking at CSR's indices and pointers.
The consequence is that XGDMatrixCreateFromCSR, the function that creates the DMatrix from X, does not account for the empty columns at the end, which in your case is 9 columns (64 - 55). You may want to check whether the last 9 columns of X are entirely empty, and whether you really have 64 features in X.
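The effect khotilov describes can be reproduced without XGBoost at all. The sketch below is only an illustration using SciPy (an assumption, since the question does not say how X is stored): a CSR matrix rebuilt from its three arrays alone loses trailing all-zero columns, because the column count has to be inferred from the largest column index. The helper trailing_empty_columns is a hypothetical name introduced here.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Small stand-in for X: 5 columns, the last 2 are entirely zero.
dense = np.array([[1.0, 0.0, 2.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0, 0.0, 0.0]])

full = csr_matrix(dense)  # shape is taken from the dense array
# Rebuild from data/indices/indptr only, without passing shape.
rebuilt = csr_matrix((full.data, full.indices, full.indptr))

print(full.shape)     # column count known from the dense input
print(rebuilt.shape)  # column count inferred from the largest column index


def trailing_empty_columns(matrix):
    """Count all-zero columns at the right edge (hypothetical helper)."""
    csr = csr_matrix(matrix)
    used = int(csr.indices.max()) + 1 if csr.nnz else 0
    return csr.shape[1] - used


print(trailing_empty_columns(dense))
```

Running the helper on your own X should tell you whether the 9 missing features are simply empty columns at the end.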

Related

PCA of stock returns

I have a set of stock returns and want to find which of them can be used to explain the whole set of returns. I am therefore using PCA to find the top 2 components that explain the returns of a stock. I have taken the log return of the stock.
My code looks like this:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pcadata = stock['lr']
pca.fit(pcadata)
first_pc= pca.components_[0]
second_pc = pca.components_[1]
When I run this, I get this error:
ValueError: Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
How do I resolve this error?
PCA is a dimension-reduction procedure, so it needs a 2D array of samples x variables. PCA then looks for the combinations of variables that vary the most across these samples. It looks like you are passing only one variable, stock['lr'], which is why you get the error. Perhaps you could tell us a little more about your data so that we can work out how you should pass it in.
Reading your comments (I can't reply because I need 50 reputation to do that...), I think you might have mistaken the use of PCA. You are looking for representative samples, while PCA gives you 'representative' variables.
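As a rough illustration of the "samples x variables" layout, here is a minimal sketch. The data is random and purely hypothetical (250 days of log returns for 5 stocks); the real input would be a table with one column per return series.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: rows are days (samples), columns are stocks (variables).
returns = rng.normal(scale=0.01, size=(250, 5))

pca = PCA(n_components=2)
pca.fit(returns)
first_pc = pca.components_[0]   # weights of the 5 stocks in component 1
second_pc = pca.components_[1]  # weights of the 5 stocks in component 2
print(pca.components_.shape)
print(pca.explained_variance_ratio_)

# A single series can be reshaped to 2D, which avoids the ValueError,
# but with one column PCA can return at most one component.
single_2d = returns[:, 0].reshape(-1, 1)
print(single_2d.shape)
```

With only stock['lr'] available, reshaping makes the call valid but there is nothing left to reduce, which is the point the answer makes.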

Convert Categorical Features (Enum) in H2o to Boolean

In my pandas DataFrame I have loads of boolean features (True/False). pandas correctly represents them as bool if I do df.dtypes. If I pass my data frame to H2O (h2o.H2OFrame(df)), the boolean features are represented as enum, so they are interpreted as categorical features with 2 categories.
Is there a way to change the type of the features from enum to bool? In pandas I can use df.astype('bool'); is there an equivalent in H2O?
One idea was to encode True/False as their numeric representation (1/0) before converting df to an H2OFrame. But H2O then recognises this as int64.
Thanks in advance for any help!
The enum type is used for categorical variables with two or more categories, so it includes boolean. That is, there is no distinct bool type in H2O, and there is nothing you need to fix here.
By the way, if you have a lot of boolean features because you have manually done one-hot encoding, don't do that. Instead give H2O the original (multi-level categorical) data, and it will do one-hot encoding when needed, behind the scenes. This is better because algorithms like decision trees can use multi-level categorical data directly, so it will be more efficient.
See http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/categorical_encoding.html for some alternatives you can try. The missing category is added for when that column is missing in production.
(But "What happens when you try to predict on a categorical level not seen during training?" at http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/deep-learning.html#faq does not seem to describe the behaviour you see?)
Also see http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/use_all_factor_levels.html (I cannot work out from that description if you want it to be true or false, so try both ways!)
UPDATE: set use_all_factor_levels = F and it will only have one input neuron (plus the NA one) for each boolean input, instead of two. If your categorical inputs are almost all boolean types I'd recommend setting this. If your categorical inputs mostly have quite a lot of levels I wouldn't (because, overall, it won't make much difference to the number of input neurons, but it might make the network easier to train).
WHY MISSING(NA)?
If I have a boolean input, e.g. "isBig", there will be 3 input neurons created for it. If you look at varimp() you can see they are named:
isBig.1
isBig.0
isBig.missing(NA)
Imagine you now put it into production, and the user does not give a value (or gives an NA, or gives an illegal value such as "2") for the isBig input. This is when the NA input neuron gets fired, to signify that we don't know if it is big or not.
To be honest, I think this cannot be any more useful than firing both the .0 and the .1 neurons, or firing neither of them. But if you are using use_all_factor_levels=F then it is useful. Otherwise all NA data gets treated as "not-big" rather than "could be big or not-big".
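To make the neuron counts above concrete, here is a small pure-Python sketch of the encoding as described in this answer. It is not H2O's implementation: encode_boolean is a hypothetical function, and the choice of dropping the "0" level when use_all_factor_levels is false is an assumption based on the remark that NA would otherwise look like "not-big".

```python
def encode_boolean(value, use_all_factor_levels=True):
    """Illustrative one-hot encoding of the isBig input with an NA slot."""
    levels = ["0", "1"]
    if not use_all_factor_levels:
        # Assumed behaviour: the first level becomes the all-zeros baseline.
        levels = levels[1:]
    # Anything other than True/False (None, 2, "2", ...) counts as missing.
    key = {True: "1", False: "0"}.get(value)
    encoded = {f"isBig.{level}": int(key == level) for level in levels}
    encoded["isBig.missing(NA)"] = int(key is None)
    return encoded


print(encode_boolean(True))                                # three inputs
print(encode_boolean(False, use_all_factor_levels=False))  # two inputs, all zero
print(encode_boolean(None, use_all_factor_levels=False))   # only the NA input fires
```

In the two-input case, a "not-big" row and a missing row would be identical without the NA input, which is why the answer calls it useful there.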

h2o python balance classes

I'm having problems implementing simple class balancing for an H2ORandomForestEstimator. I'm trying to reproduce a simple example, written in R, from Darren Cook's book ('Practical Machine Learning with H2O', p. 107).
Working on the Iris dataset, I first artificially unbalance the target variable by cutting out a good share of virginica, keeping only the first 120 rows.
Then I build 3 models: a vanilla one, one where I set balance_classes to True, and a last one where I set balance_classes to True and pass a list for class_sampling_factors to oversample virginica. The list is [1.0,1.0,2.5], referring to the classes sorted alphabetically.
I train them, and then output the confusion matrix on the training data for each one.
I'm expecting an unbalanced output for the first one and a balanced one for the last two, but I always get the same result. I checked the documentation example in Python, and I can't see anything wrong (I may be tired as well).
This is my code:
data_unb = data[0:120,:] # keep the first 120 rows to unbalance the target (Python slices are 0-based)
train, valid = data_unb.split_frame([0.8], seed=12345)
m1 = h2o.estimators.random_forest.H2ORandomForestEstimator(seed=12345)
m2 = h2o.estimators.random_forest.H2ORandomForestEstimator(balance_classes=True, seed=12345)
m3 = h2o.estimators.random_forest.H2ORandomForestEstimator(balance_classes=True, class_sampling_factors=[1.0,1.0,2.5], seed=12345)
m1.train(x=list(range(4)),y=4,training_frame=train,validation_frame=valid,model_id='RF_defaults')
m2.train(x=list(range(4)),y=4,training_frame=train,validation_frame=valid,model_id='RF_balanced')
m3.train(x=list(range(4)),y=4,training_frame=train,validation_frame=valid,model_id='RF_class_sampling')
m1.confusion_matrix(train)
m2.confusion_matrix(train)
m3.confusion_matrix(train)
This is my output:
[image: my confusion matrices (wrong)]
This is my expected output:
[image: expected confusion matrices]
What am I evidently missing? Thanks in advance.
You're not missing anything. The offset_column is available in H2O Random Forest, but it's not actually functional. The bug is documented here and should be fixed in the next stable release of H2O. Sorry about the confusion!
It should work for the rest of the H2O algos (except XGBoost). If you wanted to try on a GBM, for example, you'd see it working.
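For reference, the intent of class_sampling_factors=[1.0,1.0,2.5] in the question can be checked with simple arithmetic. The sketch below is not H2O's sampling code; the class counts are hypothetical (the full 50/50 split of the first two Iris classes plus 20 virginica rows, before split_frame) and the factors are applied in sorted class order, as the question describes.

```python
# Hypothetical class counts after truncating Iris to its first 120 rows.
counts = {"Iris-setosa": 50, "Iris-versicolor": 50, "Iris-virginica": 20}
factors = [1.0, 1.0, 2.5]  # same order as the sorted class names

# Target number of rows per class if each factor is an over/under-sampling ratio.
targets = {
    name: round(counts[name] * factor)
    for name, factor in zip(sorted(counts), factors)
}
print(targets)
```

So a factor of 2.5 on 20 virginica rows aims for roughly the same count as the other two classes, which is what a working balanced model would be trained on.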

CountVectorizer matrix varies with new test data for classification?

I have created a model for text classification using Python. I use CountVectorizer, which produces a document-term matrix of 2034 rows and 4063 columns (unique words). I saved the model to use on new test data. My new test data is:
test_data = ['Love', 'python', 'every','time']
The problem is that when I convert the test data tokens above into a feature vector, it has a different shape, because the model expects a vector of length 4063. I know I could solve it by taking the vocabulary of the CountVectorizer, searching for each test token in it, and setting the value at that index. But is there an easy way to handle this in scikit-learn itself?
You should not fit a new CountVectorizer on the test data; you should use the one you fit on the training data and call transform(test_data) on it.
There are two ways to solve this:
1. Use the same CountVectorizer that you used for your training features, like this:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()  # pass your desired parameters here
X_train = cv.fit_transform(train_data)
X_test = cv.transform(test_data)
2. You can also create another CountVectorizer if you really want to (not advisable, since you would be wasting space and you'd still need to use the same parameters), and give it the same vocabulary:
cv_train = CountVectorizer()  # pass your desired parameters here
X_train = cv_train.fit_transform(train_data)
# get_feature_names() was removed in scikit-learn 1.2; vocabulary_ holds the fitted term-to-column mapping
cv_test = CountVectorizer(vocabulary=cv_train.vocabulary_)  # plus the same parameters as cv_train
X_test = cv_test.fit_transform(test_data)
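A minimal end-to-end sketch of the first approach follows. The training sentences are hypothetical; test_data is the list from the question. The point is only that transform reuses the fitted vocabulary, so the column counts match.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training documents; the real model was fit on 2034 documents.
train_data = ["I love python", "python every day", "no time for java"]
test_data = ["Love", "python", "every", "time"]

cv = CountVectorizer()
X_train = cv.fit_transform(train_data)  # learns the vocabulary
X_test = cv.transform(test_data)        # reuses it, no refitting

# Both matrices have one column per word learned from the training data.
print(X_train.shape, X_test.shape)
print(sorted(cv.vocabulary_))
```

Words in the test data that were never seen during training are simply ignored, so the width of X_test never changes.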
try to use:
test_features = inverse_transform(test_data)
this should return you what you wish for.
I added .toarray() to the whole command in order to see the results as a matrix,
so you should write:
X_test_analyst = Pipeline.named_steps['count_vectorizer'].transform(X_test).toarray()
I'm very late to this discussion, but I want to leave something for people who arrive here from a search engine.
Sorry for my bad English.
;)
As mentioned by @Andreas Mueller, you shouldn't create a new CountVectorizer with your new data(set). You can imagine that what CountVectorizer does is build a 2D array (think of an Excel table): every column is a unique word, every row represents a document (or sentence), and the value at (i, j) is the frequency of the j-th word in the i-th sentence.
If you build a new CountVectorizer from your new data, the set of unique words will probably (if not certainly) be different. When you call model.predict with this data, it will raise some sort of error telling you the dimensions are not correct.
What I did in my code is the following:
If you train your model in a different .py / .ipynb file, you can import pickle and use its dump function to save your fitted CountVectorizer. You can follow the details in this post.
If you train your model in the same .py / .ipynb file, you can directly follow what @Andreas Mueller said.
code:
import pickle as pk
# save the fitted objects
pk.dump(vectorizer, open(r'/relative path', 'wb'))
pk.dump(pca, open(r'/relative path', 'wb'))
# ...
# When you want to use them:
import pickle as pk
vectorizer = pk.load(open(r'/relative path', 'rb'))
pca = pk.load(open(r'/relative path', 'rb'))
# ...
Side note:
If I remember correctly, you can also export classes or other things using pickle, but when you do so, make sure the class is already defined when you load the object. I'm not sure if this matters in this case, but I still import PCA and CountVectorizer before I call pk.load.
I'm just a beginner in coding, so please test my code before using it in your project.
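A self-contained version of the pickle round trip might look like the sketch below. The training text and the file name vectorizer.pkl are hypothetical, and a temporary directory stands in for the real path.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training text, fitted in the "training" script.
vectorizer = CountVectorizer().fit(["I love python", "python every day"])

# Save the fitted vectorizer; the file name is only an example.
path = os.path.join(tempfile.mkdtemp(), "vectorizer.pkl")
with open(path, "wb") as fh:
    pickle.dump(vectorizer, fh)

# Later, in the "prediction" script: load it and transform new data.
with open(path, "rb") as fh:
    restored = pickle.load(fh)

X_new = restored.transform(["Love", "python", "every", "time"])
print(X_new.shape)
```

Using with blocks closes the files automatically, which the one-line open(...) calls above do not guarantee.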

Why does netCDF4 give different results depending on how data is read?

I am coding in Python and trying to use netCDF4 to read in some floating-point netCDF data. My original code looked like
from netCDF4 import Dataset
import numpy as np
infile='blahblahblah'
ds = Dataset(infile)
start_pt = 5 # or whatever
x = ds.variables['thedata'][start_pt:start_pt+2,:,:,:]
Because of various and sundry other things, I now have to read 'thedata' one slice at a time:
x = np.zeros([2,I,J,K]) # I,J,K match size of input array
for n in range(2):
    x[n,:,:,:] = ds.variables['thedata'][start_pt+n,:,:,:]
The thing is that the two methods of reading give slightly different results. Nothing big, about one part in 10^5, but still...
So can anyone tell me why this is happening and how I can guarantee the same results from the two methods? My thought was that the first method perhaps automatically establishes x as being the same type as the input data, while the second method establishes x as the default type for a numpy array. However, the input data is 64 bit and I thought the default for a numpy array was also 64 bit. So that doesn't explain it. Any ideas? Thanks.
The first example pulls the data into a NetCDF4 Variable object, while the second example pulls the data into a numpy array. Is it possible that the Variable object is just displaying the data with a different amount of precision?
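One way to test the display-precision idea is to compare dtypes and values directly instead of printed output. The sketch below uses a synthetic NumPy array as a stand-in for ds.variables['thedata'], and it assumes float32 storage purely for illustration (the question states the file holds 64-bit data, so check the dtype of your own variable first).

```python
import numpy as np

# Stand-in for the netCDF variable; float32 storage is an assumption here.
source = np.linspace(0.0, 1.0, 24, dtype=np.float32).reshape(2, 3, 4)

# Method 1: read both slices at once, keeping the stored dtype.
bulk = source[0:2]

# Method 2: copy slice by slice into a preallocated float64 array.
sliced = np.zeros(source.shape)
for n in range(2):
    sliced[n] = source[n]

print(bulk.dtype, sliced.dtype)
# The values are identical once both are compared as float64 ...
print(np.array_equal(bulk, sliced))
# ... but the same number prints with a different number of digits.
print(repr(bulk[0, 0, 1]), repr(sliced[0, 0, 1]))
```

If np.array_equal returns False for your real data, the difference is not just a matter of display and the dtypes or any scaling applied on read are worth checking next.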
