How do feature columns work in TensorFlow? - python

I read that feature columns in TensorFlow are used to define our data, but how and why? How do feature columns work, and why do they even exist if we can make a custom estimator without them?
And if they are necessary, why don't libraries like Keras use them?

Broadly Speaking
This may be too general to answer. You may want to watch some videos or do more reading on machine learning, because this is a broad topic.
I will try to explain what features of data are used for.
A "feature" of the data is a meaningful variable that should separate two classes from each other. For example, if we choose the feature "weight", we can tell elephants apart from squirrels. They have very different weights, and our machine learning algorithm can learn to "understand" that an animal with a heavy weight is more likely to be an elephant than it is to be a squirrel. In a real scenario you would generally have more than one feature.
I'm not sure why you would say that Keras does not use features. They are a fundamental aspect of many classification problems. Some datasets may contain labelled data or labelled features, like this one: https://keras.io/datasets/#cifar100-small-image-classification
When we "don't use features", I think a more accurate way to state that would be that the data is unlabelled. In this case, a machine learning algorithm can still find relationships in the data, but without human labels applied to the data.
If you Ctrl+F for the word "features" on this page you will see places where Keras accepts them as an argument: https://keras.io/layers/core/
I am not a machine learning expert so if anyone is able to correct my answer, I would appreciate that too.
In TensorFlow
My understanding of TensorFlow's feature column implementation in particular is that it allows you to cast raw data into a typed column, which helps the algorithm distinguish what type of data you are passing. For example, latitude and longitude could be passed as two numeric columns, but as the docs say here, using a Crossed Column for latitude x longitude may allow the model to train on the data in a more meaningful and effective way. After all, what "latitude" and "longitude" really mean together is "location." As for why Keras does not have this functionality, I am not sure; hopefully someone else can offer insight on this topic.
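To make that concrete, here is a rough sketch using the tf.feature_column API that the Estimator docs describe (the feature names "latitude" and "longitude" and the bucket boundaries are just illustrative assumptions, and this API has since been superseded by Keras preprocessing layers):

```python
import tensorflow as tf

# Raw numeric columns for the two coordinates (hypothetical feature names).
latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")

# Bucketize the raw coordinates so they become categorical, then cross the
# buckets so the model can learn weights for specific locations.
lat_buckets = tf.feature_column.bucketized_column(
    latitude, boundaries=list(range(-90, 91, 10)))
lon_buckets = tf.feature_column.bucketized_column(
    longitude, boundaries=list(range(-180, 181, 10)))
location = tf.feature_column.crossed_column(
    [lat_buckets, lon_buckets], hash_bucket_size=1000)

# A linear estimator can consume categorical columns (including crosses) directly.
estimator = tf.estimator.LinearRegressor(
    feature_columns=[lat_buckets, lon_buckets, location])
```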

Related

For a binary classification model based on multiple continuous variables, what model should be used?

I am working on waste water data. The data is collected every 5 minutes. This is the sample data.
The thresholds of the individual parameters are provided. My question is: what kind of model should I go for to classify the water as usable or not usable, and also to output the anomaly because of which it is unusable (if possible, since it is a combination of the variables)? The yes/no column is yet to be created and will be provided to me.
The other question I have is how do I keep it running since the data is collected every 5 minutes?
Your data and use case seem like a fit for a decision tree classifier. Decision trees are easy to train and interpret (which is one of your requirements, since you want to know why a given sample was classified as usable or not usable), do not require large amounts of labeled data, can be trained and used for prediction on most hardware, and are well suited for structured data with no missing values and low dimensionality. They also work well without normalizing your variables.
Scikit-learn is very mature and easy to use, so you should be able to get something working without too much trouble.
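As a rough starting point, something like the following sketch could work (the file name and the "usable" label column are assumptions about your data, since the sample wasn't shared):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Assumes a table of numeric sensor columns plus a binary "usable" label
# (file and column names are hypothetical).
df = pd.read_csv("wastewater_samples.csv")
X = df.drop(columns=["usable"])
y = df["usable"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = DecisionTreeClassifier(max_depth=4, random_state=42)
clf.fit(X_train, y_train)
print("Test accuracy:", clf.score(X_test, y_test))

# The learned rules show which parameter thresholds drive each decision,
# which is what lets you explain why a sample was flagged as unusable.
print(export_text(clf, feature_names=list(X.columns)))
```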
As for the timing, I'm not sure how you or your employee will be taking samples, so I can't say for certain. If you will be getting and reading samples at that rate, using your model to label each new sample as it arrives should not be a problem, but I'm not sure I fully understood your situation.
Note that Stack Overflow is aimed at questions of the form "here's my code, how do I fix this?", and not so much at general questions such as this one. There are other Stack Exchange sites specifically dedicated to statistics and data science. If you don't find what you need here, maybe you can try those other sites!

Training a model from multiple corpora

Imagine I have a fastText model that has been trained on Wikipedia articles (as explained on the official website).
Would it be possible to train it again with another corpus (scientific documents) that could add new or more pertinent links between words, especially scientific ones?
To summarize, I need the classic links that exist between all the English words, coming from Wikipedia. But I would like to enhance this model with new documents about specific sectors. Is there a way to do that? And if yes, is there a way to 'weight' the training runs so that relations coming from my custom documents are 'more important'?
My final goal is to compute cosine similarity between documents that can be very scientific (that's why I thought adding more scientific documents would give better results).
Adjusting more-generic models with your specific domain training data is often called "fine-tuning".
The gensim implementation of FastText allows an existing model to expand its known vocabulary with what's seen in new training data (via build_vocab(..., update=True)) and then to run further training cycles that include that new vocabulary (through train()).
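As a rough sketch of that update path (assuming gensim 4.x; the model file name and the tokenised scientific sentences are placeholders):

```python
from gensim.models import FastText

# Load a previously trained gensim FastText model (hypothetical file name).
model = FastText.load("wiki_fasttext.model")

scientific_corpus = [
    ["the", "protein", "folds", "into", "a", "stable", "conformation"],
    # ... more tokenised sentences ...
]

# Expand the known vocabulary with tokens seen only in the new corpus,
# then run further training cycles over that corpus.
model.build_vocab(scientific_corpus, update=True)
model.train(scientific_corpus,
            total_examples=len(scientific_corpus),
            epochs=model.epochs)
```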
But, doing this particular form of updating introduces murky issues of balance between older and newer training data, with no clear best practices.
As just one example, to the extent there are tokens/n-grams in the original model that don't recur in the new data, further training pulls the tokens that do appear in the new data into new positions that are optimal for that data... but potentially arbitrarily far from comparable compatibility with the older tokens/n-grams that stayed put.
Further, it's likely some model modes (like negative-sampling versus hierarchical-softmax), and some mixes of data, have a better chance of net-benefiting from this approach than others – but you pretty much have to hammer out the tradeoffs yourself, without general rules to rely upon.
(There may be better fine-tuning strategies for other kinds of models; this is just speaking to the gensim FastText implementation's ability to update its vocabulary and repeat training.)
But perhaps your domain of interest is scientific texts, and maybe you also have a lot of representative texts – perhaps even, at training time, the complete universe of papers you'll want to compare.
In that case, are you sure you want to deal with the complexity of starting with a more-generic word-model? Why would you want to contaminate your analysis with any of the dominant word-senses in generic reference material, like Wikipedia, if in fact you already have sufficiently-varied and representative examples of your domain words in your domain contexts?
So I would recommend first trying to train your own model, from your own representative data. Only if you then fear you're missing important words/senses should you try mixing in Wikipedia-derived senses. (At that point, another way to mix in that influence would be to mix Wikipedia texts into your other corpus. And you should also be ready to test whether that really helps or hurts – because it could be either.)
Also, to the extent your real goal is comparing full papers, you might want to look into other document-modeling strategies, including bag-of-words representations, the Doc2Vec ('Paragraph Vector') implementation in gensim, or others. Those approaches will not necessarily require per-word vectors as an input, but might still work well for quantifying text-to-text similarities.
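For instance, a minimal Doc2Vec sketch for the document-similarity goal might look like this (gensim 4.x API; the two toy "papers" are made up):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Each paper is a (tag, list-of-tokens) pair; tags let us look up vectors later.
papers = [
    ("paper_1", ["the", "protein", "folds", "into", "a", "stable", "state"]),
    ("paper_2", ["we", "measure", "the", "folding", "rate", "of", "proteins"]),
]
corpus = [TaggedDocument(words=tokens, tags=[pid]) for pid, tokens in papers]

model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=40)

# Cosine similarity between two learned document vectors.
print(model.dv.similarity("paper_1", "paper_2"))
```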

How does DAI handle new (unseen in training) categorical values within a production environment?

I would like confirmation that DAI follows a similar approach for dealing with categorical variables it didn't encounter during training as the one described in this answer: h2o DRF unseen categorical values handling. I could not find it stated explicitly in the H2O Driverless AI documentation.
Please also state whether parts of that link are outdated (as mentioned in the answer), and how the handling works if it is done differently. Please also note which version of H2O DAI this applies to. Thank you!
EDIT: this information is now detailed in the documentation here
Below is a description of what happens when you try to predict on a categorical level not seen during training. Depending on the version of DAI you use, you may not have access to a certain algorithm, but given an algorithm, the details should apply to your version of DAI.
XGBoost, LightGBM, RuleFit, TensorFlow, GLM
Driverless AI's feature engineering pipeline will compute a numeric value for every categorical level present in the data, whether it's a previously seen value or not. For frequency encoding, unseen levels will be replaced by 0. For target encoding, the global mean of the target value will be used. Etc.
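To illustrate what that means in practice, here is a toy pandas sketch of those two encodings (this is not DAI's internal code, just the described behaviour for an unseen level):

```python
import pandas as pd

train = pd.DataFrame({"city": ["NY", "NY", "LA"], "y": [1, 0, 1]})
test = pd.DataFrame({"city": ["NY", "SF"]})            # "SF" was never seen

# Frequency encoding: unseen levels fall back to 0.
freq = train["city"].value_counts()
test["city_freq"] = test["city"].map(freq).fillna(0)

# Target encoding: unseen levels fall back to the global mean of the target.
target_means = train.groupby("city")["y"].mean()
test["city_target"] = test["city"].map(target_means).fillna(train["y"].mean())

print(test)
```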
and
FTRL
The FTRL model doesn't distinguish between categorical and numeric values. Whether or not FTRL saw a particular value during training, it will hash all the data, row by row, into numeric values and then make predictions. Since you can think of FTRL as learning all the possible values in the dataset by heart, there is no guarantee it will make accurate predictions for unseen data. Therefore, it is important to ensure that the training dataset has a reasonable "overlap", in terms of unique values, with the data used to make predictions.
Since DAI uses different algorithms than H2O-3 (except for XGBoost), it's best to consider these as separate products with potentially different handling of unseen levels or missing values - though in some cases there are similarities.
As mentioned in the comment, the DRF documentation for H2O-3 should be up to date now.
Hope this explanation helps!

Predicting Energy Consumption of different buildings

I have a dataset (you can find the updated file here) containing many different characteristics of office buildings, including their surface area and the number of people working in them. In total there are about 200 records. I want to use an algorithm that can be trained on this dataset in order to predict the electricity consumption (given in the column 'kwh') of a building that is not in the set.
I have tried most of the applicable machine learning algorithms from the scikit-learn library in Python (linear regression, Ridge, Lasso, SVC, etc.) to predict this continuous variable. Surface area and number of workers had a correlation with the target variable of between 0.3 and 0.4, so I assumed they were good features for the model and included them in training. However, I got a mean absolute error of about 13350 and an R-squared value of about 0.22-0.35, which is not good at all.
I would be very grateful if someone could give me some advice, or if you could examine the dataset a little and run some algorithms on it. What type of preprocessing should I use, and what type of algorithm? Is the number of records too low to train a regression model for predicting continuous variables?
Any feedback would be helpful as I am new to machine learning :)
The first thing that should be done in these kinds of machine learning problems is to understand the data. Yes, the number of features in your dataset is small, and yes, the number of data samples is very small, but it is important to do the best we can with what we have.
The data set header is in a language other than English, it is important to convert it to a language most of the people in the community would understand (in this case English). After doing a bit of tinkering, I found out that the language being used is Dutch.
There are some key features missing in the dataset, from something as obvious as the number of floors in the building to something less obvious like the number of working hours. Surface area and the number of workers seem to me to be the most important features, but you are missing out on a feature called building_function which (after using Google Translate) tells what the purpose of the building is. Intuitively, this should have a large correlation with power consumption; industries tend to use more power than normal households. After translation, I found out that the main types were Residential, Office, Accommodation and Meeting. This feature thus has to be encoded as a nominal variable to train the model.
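As a minimal sketch of that encoding step (the file name and column names are assumptions about your translated dataset):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# One-hot encode the nominal building_function column before fitting a regressor.
df = pd.read_csv("buildings_translated.csv")      # hypothetical file name
df = pd.get_dummies(df, columns=["building_function"])

X = df.drop(columns=["kwh"])
y = df["kwh"]

model = LinearRegression().fit(X, y)
print(model.score(X, y))   # R-squared on the training data
```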
Another feature hoofsbi also seems to have some variance. But I do not know what that feature means.
If you could translate the headers in the data and share it, I will be able to provide you some code to perform this regression task. It is very important in such tasks to understand what the data is and thus perform feature engineering.

Twitter/Facebook comments classification into various categories

I have a comments dataset which I want to classify into five categories:
jewelries, clothes, shoes, electronics, food & beverages
So if someone is talking about pork, steak, wine, soda, or eating, it's classified into food & beverages.
Whereas if somebody is talking about, say, gold, pendants, lockets, etc., it's classified into jewelries.
I want to know what tags/tokens I should be looking for in a comment/tweet so as to classify it into any of these categories, and finally which classifier to use. I just need some guidance and suggestions; I'll take it from there.
Please help. Thanks
This answer may be a bit long and perhaps I abstract a few things away, but it's just to give you an idea and some advice.
Supervised Vs Unsupervised
As others have already mentioned, in the land of machine learning there are two main roads: supervised and unsupervised learning. As you probably already know by now, if your corpus (documents) is labeled, you are talking about supervised learning. The labels are the categories and are in this case boolean values.
For instance, if a text is related to clothes and shoes, the labels for those categories should be true.
Since a text can be related to multiple categories (multiple labels), we are looking at a multi-label classifier.
What to use?
I presume that the dataset is not yet labeled, since Twitter does not do this categorisation for you. So here comes a big decision on your part.
Either you label the data manually, which means you look at as many tweets/FB messages in your dataset as you can and, for each of them, consider the 5 categories and answer them with True/False.
Or you decide to use an unsupervised learning algorithm and hope that you discover these 5 categories. Approaches like clustering will just try to find categories on their own, and these don't have to match your 5 predefined categories by default.
I've used supervised learning quite a bit in the past and have had good experience with this type of learning, so I will continue explaining this path.
Feature Engineering
You have to come up with the features that you want to use. For text classification, a good approach is to use each possible word in the document as a feature. A value of True indicates that the word is present in the document; False indicates absence.
Before doing this, you need to do some preprocessing. This can be done by using various features provided by the NLTK library.
Tokenization: this will break your text up into a list of words. You can use this module.
Stopword removal: this will remove common words from the tokens, words like 'a', 'the', ... You can take a look at this.
Stemming: stemming will transform words into their stem form. For example, the words 'working', 'worked', 'works' will all be transformed to 'work'. Take a look at this.
Once you have preprocessed the data, generate a featureset with a feature for each word that exists in the documents. There are automatic methods and filters for this, but I'm not sure how to do this in Python.
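As a rough sketch of those preprocessing steps plus the boolean word features with NLTK (the example sentence is made up):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(text):
    tokens = word_tokenize(text.lower())                  # tokenization
    tokens = [t for t in tokens if t.isalpha()]           # drop punctuation/numbers
    tokens = [t for t in tokens if t not in stop_words]   # stopword removal
    return [stemmer.stem(t) for t in tokens]              # stemming

def word_features(tokens):
    # NLTK-style featureset: each present word becomes a True-valued feature.
    return {token: True for token in tokens}

print(word_features(preprocess("Loving my new gold pendant and locket!")))
```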
Classification
There are multiple classifiers that you can use for this purpose. I suggest taking a deeper look at the ones that exist and their benefits. You can use the NLTK classifier, which supports multi-label classification, but to be honest I have never tried that one before. In the past I've used Logistic Regression and SVM.
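If you go the scikit-learn route, a minimal multi-label sketch could look like this (the tiny hand-labelled training set is made up; in practice you would need far more labelled examples per category):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

texts = ["loving my new gold pendant and locket",
         "great steak and red wine tonight",
         "these sneakers are so comfortable",
         "pork ribs with a cold soda"]
labels = [["jewelries"], ["food & beverages"], ["shoes"], ["food & beverages"]]

# Turn the label lists into a binary indicator matrix (one column per category).
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(labels)

# Bag-of-words features plus one Logistic Regression per category.
clf = make_pipeline(CountVectorizer(binary=True),
                    OneVsRestClassifier(LogisticRegression()))
clf.fit(texts, y)

pred = clf.predict(["gold locket with my dinner wine"])
print(mlb.inverse_transform(pred))
```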
Training & testing
You will use a part of your data for training and a part for validating whether the trained model performs well. I suggest you use cross-validation, because you will have a small dataset (you have to manually label the data, which is cumbersome). The benefit of cross-validation is that you don't have to split your dataset into one fixed training set and testing set. Instead it runs in multiple rounds, each round using one part of the data for testing and the rest for training, so that all of the data ends up being used for both training and testing.
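A small standalone sketch of 5-fold cross-validation with scikit-learn (on synthetic data, just to show the mechanics):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Every sample serves as test data exactly once across the five rounds.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("fold accuracies:", scores, "mean:", scores.mean())
```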
Predicting
Once your model is built and the predictions on the test data look plausible, you can use your model in the wild to predict the categories of new Facebook messages/tweets.
Tools
The NLTK library is great for preprocessing and natural language processing, but I have never used it for classification. I've heard a lot of great things about the scikit-learn Python library. But to be honest, I prefer to use Weka, which is a data mining tool written in Java that offers a great UI and speeds up your task a lot!
From a different angle: Topic modelling
In your question you state that you want to classify the dataset into five categories. I would like to show you the idea of topic modelling. It might not be useful in your scenario if you are really only targeting those categories (that's why I leave this part at the end of my answer). However, if your goal is to categorise the tweets/FB messages into non-predefined categories, topic modelling is the way to go.
Topic modelling is an unsupervised learning method where you decide in advance the number of topics (categories) you want to 'discover'. This number can be high (e.g. 40). The cool thing is that the algorithm will find 40 topics that each contain words related to one another. It will also output, for each document, a distribution that indicates which topics the document is related to. This way you can discover a lot more categories than your 5 predefined ones.
I'm not going to go much deeper into this, but just google it if you want more information. In addition, you could consider using MALLET, which is an excellent tool for topic modelling.
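Still, to give you a feel for it, here is a toy LDA sketch in gensim (the tokenised documents are made up; MALLET, mentioned above, is an alternative tool):

```python
from gensim import corpora, models

docs = [["gold", "pendant", "locket", "ring"],
        ["steak", "wine", "soda", "dinner"],
        ["gold", "ring", "dinner", "wine"]]

# Build a word dictionary and a bag-of-words corpus, then fit LDA.
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)

print(lda.print_topics())       # the words that define each discovered topic
print(lda[bow_corpus[0]])       # topic distribution for the first document
```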
Well this is kind of a big subject.
You mentioned Python, so you should have a look at the NLTK library which allows you to process natural language, such as your comments.
After this step, you should have a classifier which will map the words you retrieved to a certain class. NLTK also has tools for classification which are linked to knowledge databases. If you are lucky, the categories you are looking for are already available; otherwise you may have to build them yourself. You can have a look at this example which uses NLTK and the WordNet database. You have access to the synsets, which seem to be pretty broad, and you can also have a look at the hypernym sets (see for example list(dog.closure(hyper))).
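For example, the WordNet lookup mentioned above looks roughly like this (it requires downloading the WordNet corpus once via nltk.download("wordnet")):

```python
from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
hyper = lambda s: s.hypernyms()

# Walk up the hypernym hierarchy: dog -> canine -> carnivore -> ... -> entity
print(list(dog.closure(hyper)))
```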
Basically you should consider using a multi-label classifier on the whole tokenized text (comments on Facebook and tweets are usually short; you might also decide to only consider FB comments below 200 characters, your choice). The choice of a multi-label classifier is motivated by the non-orthogonality of your classification set (clothes, shoes and jewelries can be the same object; you could have electronic jewelry [e.g. smartwatches], etc.). This is a fairly simple setup, but it's an interesting first step whose strengths and weaknesses will allow you to iterate easily (if needed).
Good luck!
What you're looking for falls under two subjects:
Natural Language Processing (NLP): processing text data, and
Machine Learning: where the classification models are built.
First I would suggest going through NLP tutorials and then text classification tutorials; the most appropriate is https://class.coursera.org/nlp/lecture
If you're looking for libraries available in python or java, take a look at Java or Python for Natural Language Processing
If you're new to text processing, please take a look at the NLTK library that provides a nice introduction to doing NLP, see http://www.nltk.org/book/ch01.html
Now to the hard core details:
First, ask yourself whether you have Twitter/Facebook comments (let's call them documents from now on) that are manually labelled with the categories you want.
1a. If YES, look at supervised machine learning, see http://scikit-learn.org/stable/tutorial/basic/tutorial.html
1b. If NO, look at UNsupervised machine learning; I suggest clustering and topic modelling, http://radimrehurek.com/gensim/
After knowing which kind of machine learning you need, split the documents up into at least a training (70-90%) and a testing (10-30%) set, see
Note: I say "at least" because there are other ways to split up your documents, e.g. adding a development set or using cross-validation. (If you don't understand this, it's all right; just follow step 2.)
Finally, Train and Test your model
3a. If supervised, use the training set to train your supervised model. Apply your model to the test set and then see how well it performed.
3b. If unsupervised, use the training set to generate document clusters (that is, to group similar documents), but they still have no labels. So you need to think of some smart way to label the groups of documents correctly. (To this date, there is no really good solution to this; even very effective neural networks cannot tell you what their neurons are firing on, only that each neuron is firing on something specific.)
