I am trying to solve a problem. A production plant has an extensive data set of 20 inputs (independent variables: feedstock and process conditions) and 6 outputs (dependent variables: production yields). We are trying to find the relationship between the 20 inputs and the 6 outputs, and also to apply some constraints to the model (e.g. the sum of the outputs must not exceed 100%).
I am still learning Python. May I ask what type of problem this is and how it can be analysed using Python? I've been searching for answers online; it seems like it might be a kind of "multivariate regression", but I am not sure.
Thank you in advance for your advice!
This is a "Multivariate Multiple Regression" problem. Such a problem aims at modelling multiple outputs/dependent variables with the same set of inputs/features/independent variables. In essence, you create one regressor per output on the shared set of inputs and then combine them into a single model.
Here is an article with further information:
https://data.library.virginia.edu/getting-started-with-multivariate-multiple-regression/
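As a minimal sketch of that idea with scikit-learn and synthetic placeholder data (LinearRegression fits one least-squares model per output column on the same inputs; the constraint handling shown is a simple post-hoc rescaling I am adding for illustration, not a true constrained fit):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Placeholder data: X is (n_samples, 20) inputs, Y is (n_samples, 6) outputs.
rng = np.random.default_rng(0)
X = rng.random((500, 20))
Y = rng.random((500, 6)) * 15  # each output a yield percentage

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# LinearRegression natively supports multiple outputs: it fits one
# regressor per output column on the shared inputs.
model = LinearRegression().fit(X_train, Y_train)
Y_pred = model.predict(X_test)  # shape (n_test, 6)

# Simple post-hoc way to honour "sum of outputs <= 100%": rescale any
# predicted row whose outputs sum above 100.
row_sums = Y_pred.sum(axis=1, keepdims=True)
Y_pred = np.where(row_sums > 100, Y_pred * (100 / row_sums), Y_pred)

For a genuinely constrained fit you would need an optimizer such as scipy.optimize.minimize with an explicit constraint, but post-hoc rescaling is often a reasonable first pass.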
Related
I'm working on a clinical problem. I need to build a model to predict a numerical variable. I extracted 76 features from ECG signals; these features are correlated with my label, and for some features the signals were filtered first. I know that these features are good for predicting my label.
I have more than 200 patients. I built different data sets with different types of normalization to try models, and I built different ML models with different parameters, for example SVR, Random Forest, and XGBoost. I tried different types of feature selection too.
My problem is overfitting: I have tried a lot of things but can't get rid of it. In the coming months I will try to tackle it with deep learning, but I want to improve my ML models too.
I attach some screenshots.
I need some ideas. Thanks!
I am working on wastewater data. The data is collected every 5 minutes. This is the sample data.
The thresholds for the individual parameters are provided. My question is: what kind of models should I go for to classify the water as usable or not usable, and also to output the anomaly because of which it is unusable (if possible, since it may be a combination of the variables)? The yes/no label column has not been created yet and will be provided to me.
The other question I have is: how do I keep the model running, since the data is collected every 5 minutes?
Your data and use case seem a good fit for a decision tree classifier. Decision trees are easy to train and interpret (which is one of your requirements, since you want to know why a given sample was classified as usable or not usable), do not require large amounts of labeled data, can be trained and used for prediction on most hardware, and are well suited for structured data with no missing values and low dimensionality. They also work well without normalizing your variables.
Scikit-learn is super mature and easy to use, so you should be able to get something working without too much trouble.
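As a minimal sketch with scikit-learn (the column names, readings, and labels below are invented placeholders; swap in your real parameters once the yes/no column arrives):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical wastewater readings; replace with your real 5-minute data.
df = pd.DataFrame({
    "ph":        [7.1, 6.2, 8.9, 7.4, 5.8, 7.0],
    "turbidity": [1.2, 4.8, 0.9, 1.5, 6.1, 1.1],
    "usable":    ["yes", "no", "no", "yes", "no", "yes"],  # label column, to be provided
})

X = df[["ph", "turbidity"]]
y = df["usable"]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)

# The learned rules explain *why* a sample was classified as unusable.
print(export_text(clf, feature_names=list(X.columns)))

# Classify a new 5-minute sample as it arrives.
print(clf.predict(pd.DataFrame({"ph": [6.0], "turbidity": [5.5]})))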
As regards timing, I'm not sure how you or your employee will be taking samples. If you will be receiving and reading samples at that rate, using your model to label each new sample should not be a problem, but I'm not sure I've understood your situation.
Note that Stack Overflow is aimed at questions of the form "here's my code, how do I fix this?", and not so much at general questions such as this one. There are other Stack Exchange sites specifically dedicated to statistics and data science. If you don't find what you need here, maybe you can try those other sites!
I have searched and seen some questions on the matter, but without answers (the questions were asked more than a year ago, so I hoped something had changed).
I am looking for a library to infer a Bayesian network from a file of continuous variables. Is there anything simple/out-of-the-box that anyone has encountered? I have tried pyAgrum, for example, but when I run
pyAgrum.BNLearner(numdata).learnDAG()
I get
Exception: [pyAgrum] Wrong type: Counts cannot be performed on continuous variables. Unfortunately the following variable is continuous: V0
I have tried several libraries, but they all seem to work only on discrete variables. Would love some help; thanks in advance.
The main question is what kind of model you want for your continuous variables.
1- Do you want them to be discretized: you can have a look, for instance, at http://webia.lip6.fr/~phw/aGrUM/docs/last/notebooks/Discretizer.ipynb.html (a minimal pre-discretization sketch follows this list).
2- Do you want to assume a linear Gaussian model: you can have a look, for instance, at bnlearn (https://haipengu.github.io/Rmd/GBN.html).
3- Do you want to learn a more general continuous model: you can have a look, for instance, at otagrum (http://openturns.github.io/otagrum/master/), which learns copula Bayesian networks.
4- etc.
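As referenced in option 1, here is a minimal sketch of pre-discretizing the data with pandas before handing it to pyAgrum (the file names and bin count are assumptions for illustration; pyAgrum's own Discretizer, linked above, is the more integrated route):

import pandas as pd
import pyAgrum as gum

# Hypothetical continuous data; replace with your own file.
numdata = pd.read_csv("continuous_data.csv")

# Discretize every continuous column into 5 quantile bins, so that
# counts can be performed on the resulting categorical values.
discretized = numdata.apply(lambda col: pd.qcut(col, q=5, duplicates="drop").astype(str))

# BNLearner now sees only discrete values; writing to CSV and passing
# the filename is the safest route across pyAgrum versions.
discretized.to_csv("discrete_data.csv", index=False)
learner = gum.BNLearner("discrete_data.csv")
dag = learner.learnDAG()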
I am trying to get a sense of the relationship between some independent variables and a dependent variable, and to quantify their importance for that dependent variable. I came across methods like random forests that can quantify the importance of variables and then predict the outcome. However, I have an issue with the nature of the data to be used with random forests or similar methods. An example of the data structure is provided below. As you can see, the time series has some variables, like population and age, that do not change with time, though they differ between cities, while other variables, such as temperature and #internet users, change through time and within the cities. My question is: how can I quantify the importance of these variables for the "Y" variable? BTW, I prefer to apply the method in a Python environment.
"How can I quantity the importance" is very common question also known as "feature-importance".
Feature importance depends on your model: with a regression, the importance is in your coefficients; in a random forest, you can use the built-in feature_importances_ (which some would not recommend) or, better, SHAP values. Furthermore, you can use a correlation measure, e.g. Spearman/Pearson correlation, between your features and your target.
Unfortunately there is no "free lunch"; you will need to decide based on what you want to use it for, what your data looks like, etc.
I think the one you came across might be Boruta, where you shuffle your variables, add the shuffled copies to your data set, and then create a threshold based on the "best shuffled variable" in a random forest.
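A minimal sketch of two of the options above, the built-in random-forest importances and Spearman correlation, on synthetic stand-in data (the column names are invented placeholders for your panel):

import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the panel data; replace with your real frame.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "temperature":    rng.normal(20, 5, 200),
    "internet_users": rng.normal(1e4, 2e3, 200),
    "population":     rng.choice([5e5, 1e6, 2e6, 3e6], 200),
})
df["Y"] = 2 * df["temperature"] + 0.001 * df["internet_users"] + rng.normal(0, 1, 200)

X, y = df.drop(columns="Y"), df["Y"]

# Built-in impurity-based importances (biased toward high-cardinality features).
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
print(dict(zip(X.columns, rf.feature_importances_.round(3))))

# Rank correlation between each feature and the target.
for col in X.columns:
    rho, _ = spearmanr(X[col], y)
    print(col, round(rho, 3))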
My idea is as follows. Your outcome variable 'Y' has only a few possible values, so you can build a classifier (Random Forest is one of many existing classifiers) to predict, say, 'Y in [25-94, 95-105, 106-150]'. You then have three mutually exclusive outcomes. (Interval limits other than 95 and 105 are possible, if that better suits your application.)
Some of your predictive variables are time series whereas others are constant, as you explain. You should use a sliding-window technique where your classifier predicts 'Y' based on the time-related variables in, say, the month of January. It doesn't matter that some variables are constant, as the actual variable 'City' has the four outcomes '[City_1, City_2, City_3, City_4]'. Similarly, use 'Population' and 'Age_mean' as the actual variables.
Once you use classifiers, many approaches to feature ranking and feature selection become available. You can use a web service like insight classifiers to do it for you, or download a package like Weka for that.
The key point is to organize your model and its predictive variables so that a classifier can learn correctly.
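A sketch of the binning step on made-up data (the interval limits follow the example above; the feature set is an invented placeholder):

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "City":        rng.choice(["City_1", "City_2", "City_3", "City_4"], 300),
    "temperature": rng.normal(15, 8, 300),
    "Y":           rng.integers(25, 151, 300),
})

# Turn the numeric outcome into three mutually exclusive classes.
df["Y_class"] = pd.cut(df["Y"], bins=[24, 94, 105, 150],
                       labels=["25-94", "95-105", "106-150"])

# Constant-per-city variables like City (and Population, Age_mean) enter
# as ordinary categorical features; one-hot encode them first.
X = pd.get_dummies(df[["City", "temperature"]], columns=["City"])
clf = RandomForestClassifier(random_state=0).fit(X, df["Y_class"])
print(dict(zip(X.columns, clf.feature_importances_.round(3))))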
If city and month are also independent variables, you should convert them from the index into columns. Use pandas to read your file; then df.reset_index() can do the job for you.
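A quick sketch of that step, assuming 'City' and 'Month' form the index of your frame:

import pandas as pd

# Hypothetical file with City and Month as a MultiIndex.
df = pd.read_csv("data.csv", index_col=["City", "Month"])

# Move the index levels back into ordinary columns so they can be
# used as features like any other variable.
df = df.reset_index()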
I read that feature columns in TensorFlow are used to define our data, but how and why? How do feature columns work, and why do they even exist if we can make a custom estimator without them?
And if they are necessary, why don't libraries like Keras use them?
Broadly Speaking
This may be too general to answer. You may want to watch some videos or do more reading on machine learning, because this is a broad topic.
I will try to explain what features of data are used for.
A "feature" of the data is a meaningful variable that should separate two classes from each other. For example, if we choose the feature "weight", we can tell elephants apart from squirrels. They have very different weights, and our machine learning algorithm can learn to "understand" that an animal with a heavy weight is more likely to be an elephant than it is to be a squirrel. In a real scenario you would generally have more than one feature.
I'm not sure why you would say that Keras does not use features. They are a fundamental aspect of many classification problems. Some datasets may contain labelled data or labelled features, like this one: https://keras.io/datasets/#cifar100-small-image-classification
When we "don't use features", I think a more accurate way to state that would be that the data is unlabelled. In this case, a machine learning algorithm can still find relationships in the data, but without human labels applied to the data.
If you Ctrl+F for the word "features" on this page you will see places where Keras accepts them as an argument: https://keras.io/layers/core/
I am not a machine learning expert so if anyone is able to correct my answer, I would appreciate that too.
In TensorFlow
My understanding of TensorFlow's feature-column implementation in particular is that it lets you cast raw data into a typed column, allowing the algorithm to better distinguish what type of data you are passing. For example, Latitude and Longitude could be passed as two numeric columns, but as the docs say here, using a Crossed Column for Latitude X Longitude may allow the model to train on the data in a more meaningful/effective way. After all, what "Latitude" and "Longitude" together really mean is "Location". As for why Keras does not have this functionality, I am not sure; hopefully someone else can offer insight on this topic.
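A minimal sketch of that Latitude X Longitude cross, using the estimator-era tf.feature_column API the docs describe (the bucket boundaries are arbitrary placeholders):

import tensorflow as tf

# Raw numeric inputs.
latitude = tf.feature_column.numeric_column("latitude")
longitude = tf.feature_column.numeric_column("longitude")

# Crossed columns need categorical inputs, so bucketize the numeric
# columns first. Boundaries here are arbitrary placeholders.
lat_buckets = tf.feature_column.bucketized_column(
    latitude, boundaries=[-60.0, -30.0, 0.0, 30.0, 60.0])
lon_buckets = tf.feature_column.bucketized_column(
    longitude, boundaries=[-120.0, -60.0, 0.0, 60.0, 120.0])

# The cross lets the model learn about specific (lat, lon) cells,
# i.e. "location", rather than latitude and longitude separately.
location = tf.feature_column.crossed_column(
    [lat_buckets, lon_buckets], hash_bucket_size=1000)

# Feature columns are then handed to an estimator, e.g.:
# tf.estimator.LinearClassifier(feature_columns=[location, ...])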