I have searched and seen some questions on this matter, but without answers (the questions were asked more than a year ago, so I hoped something had changed).
I am looking for a library to infer a Bayesian network from a file of continuous variables. Is there anything simple/out of the box that anyone has encountered? I have tried pyAgrum, for example, but when I run
pyAgrum.BNLearner(numdata).learnDAG()
I get
Exception: [pyAgrum] Wrong type: Counts cannot be performed on continuous variables. Unfortunately the following variable is continuous: V0
I have tried several libraries, but they all seem to work only on discrete variables. I would love some help. Thanks in advance.
The main question is what kind of model do you want for your continuous variables.
1- Do you want them to be discretized? You can have a look, for instance, at http://webia.lip6.fr/~phw/aGrUM/docs/last/notebooks/Discretizer.ipynb.html (see the sketch after this list).
2- Do you want to assume a linear Gaussian model? You can have a look, for instance, at bnlearn (https://haipengu.github.io/Rmd/GBN.html).
3- Do you want to learn a more general continuous model? You can have a look, for instance, at otagrum (http://openturns.github.io/otagrum/master/), which learns copula Bayesian networks.
4- etc.
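For option 1, here is a minimal sketch of the discretize-then-learn route using pandas before handing the data to pyAgrum. The file name and the bin count are assumptions; pyAgrum also ships its own Discretizer (see the notebook linked above), which is usually the cleaner choice.

import pandas as pd
import pyAgrum as gum

# Hypothetical CSV of continuous columns (V0, V1, ...); adjust the path.
df = pd.read_csv("numdata.csv")

# Crude discretization: 5 equal-width bins per column (tune per variable).
binned = df.apply(lambda col: pd.cut(col, bins=5, labels=False))

# BNLearner counts over discrete values, so feed it the binned data.
binned.to_csv("numdata_binned.csv", index=False)
dag = gum.BNLearner("numdata_binned.csv").learnDAG()
print(dag)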
Related
I am trying to solve a problem. A production plant has an extensive data set with 20 inputs (independent variables: feedstock and process conditions) and 6 outputs (dependent variables: production yields). We are trying to find the relationship between the 20 inputs and the 6 outputs, and also to apply some constraints to the model (e.g. the sum of the outputs must not exceed 100%).
I am still learning Python. May I ask what type of problem this is and how it can be analysed using Python? I've been searching for answers online; it seems like it might be a kind of "multivariate regression", but I am not sure.
Thank you in advance for your advice!
This is a "Multivariate Multiple Regression" problem. Such a problem aims at modelling multiple outputs/dependent variables with the same set of inputs/features/independent variables. It is basically creating several regressors to model each output with the set of inputs and then combining them into one single model.
I would like to link an article for further information:
https://data.library.virginia.edu/getting-started-with-multivariate-multiple-regression/
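If it helps, here is a minimal scikit-learn sketch of multivariate multiple regression with the same shapes as the plant example (20 inputs, 6 outputs). The random data is only a stand-in for the real data set, and the note on the constraint is an assumption about how it would have to be handled, not part of the model.

import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data: 500 samples, 20 inputs, 6 outputs (replace with the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
Y = rng.normal(size=(500, 6))

# LinearRegression handles multiple outputs natively: one coefficient row per output.
model = LinearRegression().fit(X, Y)
print(model.coef_.shape)     # (6, 20)
print(model.predict(X[:3]))  # predictions for the first three samples

# Note: a constraint such as "outputs must sum to at most 100%" is not enforced by
# plain least squares; it needs a constrained optimizer or a post-processing step.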
This might sound silly, but I'm wondering about the possibility of modifying a neural network to obtain a probability density function rather than a single value when predicting a scalar. I know that when classifying images or words you can get a probability for each class, so I'm thinking there might be a way to do something similar with a continuous value and plot it (similar to the posterior plot in Bayesian optimisation).
Such a distribution could be interesting when deploying a model for prediction and would provide more flexibility than a single value.
Does anyone know a way to obtain such an output?
Thanks!
OK, so I found a solution to this issue, though it adds a lot of overhead.
Initially I thought the Keras callback could be of use: it provides the flexibility I wanted (i.e. train only on test data, or only on a subset, and not for every test), but it seems that callbacks are only given summary data from the logs.
So the first step was to create a custom metric that does the same calculation as any metric on the two arrays (the true values and the predicted values) and, once those calculations are done, writes them to a file for later use.
Then, having found a way to gather the data for every sample, the next step was to implement a method that gives a good measure of error. I'm currently implementing a handful of methods, but the most fitting one seems to be Bayesian bootstrapping (user lmc2179 has a great Python implementation). I also implemented ensemble methods and Gaussian processes as alternatives, or to use as additional metrics, along with some other Bayesian methods.
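For reference, a minimal sketch of the Bayesian bootstrap idea (not lmc2179's implementation): draw Dirichlet weights over the per-sample errors and compute weighted means, giving posterior samples of the mean error. The example arrays are made up.

import numpy as np

def bayesian_bootstrap_mean(errors, n_replications=1000, seed=0):
    # Each replication draws Dirichlet(1, ..., 1) weights over the observations
    # and computes the weighted mean of the errors.
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors, dtype=float)
    weights = rng.dirichlet(np.ones(len(errors)), size=n_replications)
    return weights @ errors

# Made-up true values and predictions for illustration.
y_true = np.array([3.1, 2.9, 4.2, 5.0, 3.8])
y_pred = np.array([3.0, 3.1, 4.0, 4.8, 4.1])
posterior = bayesian_bootstrap_mean(np.abs(y_true - y_pred))
print(posterior.mean(), np.percentile(posterior, [2.5, 97.5]))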
I'll try to find out whether there are internals in Keras that are set during the training and testing phases, so I can set a trigger for my metric. The main issue with using all the data is that you obtain a lot of unreliable data points at the start, since the network is not yet optimized. Some filtering could be useful to remove a good portion of those points and improve the results of the error predictors.
I'll update if I find anything interesting.
I think this is kind of "blasphemy" for someone coming from the AI world, but since I come from the world where we program and get a result, and where there is the concept of storing something in memory, here is my question:
Machine learning works by iterations: the more iterations there are, the better our algorithm becomes. But after those iterations, is there a result stored somewhere? Because, thinking as a programmer, if I re-run the program, I must store previous results somewhere or they will be overwritten. Or do I need to use, for example, an array to store my results?
For example, if I train my image-recognition algorithm with a bunch of cat-picture data sets, what are the variables I need to add to my algorithm so that, when I use it on an image library, it succeeds every time it encounters a cat? What will I actually use, since nothing is saved for the next step?
All the videos and tutorials I have seen only draw a graph as a visual decision aid, without producing something to use in a future program.
For example, in this example, kNN is used to teach how to detect a handwritten digit, but where is the explicit value to use?
https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/2_BasicModels/nearest_neighbor.py
NB: to people voting to close or downvoting, please at least give a reason.
the more iterations there are, the better our algorithm becomes, but after those iterations, is there a result stored somewhere
What you're alluding to here is the optimization part.
However, to optimize a model, we first have to represent it.
For example, if I'm creating a very simple linear model to predict house prices from their surface in square meters, I might go for this model:
price = a * surface + b
That's the representation.
Now that you have represented the model, you want to optimize it, i.e. find the parameters a and b that minimize the prediction error.
is there a result stored somewhere?
In the above, we say that we have learned the parameters, or weights, a and b.
That's what you keep: the weights, which come from the optimization (also called training), and of course the model itself.
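As a toy illustration of that idea (made-up numbers, ordinary least squares via numpy), the "learning" step produces nothing more than the two numbers a and b, which you save and reload:

import numpy as np

# Made-up training data: surfaces in m^2 and observed prices.
surface = np.array([30.0, 45.0, 60.0, 80.0, 100.0])
price = np.array([90_000, 135_000, 175_000, 240_000, 300_000])

# Optimization here is plain least squares: find a and b minimizing the error
# of price = a * surface + b.
a, b = np.polyfit(surface, price, deg=1)

# These two numbers ARE the learned result; store them and reuse them.
np.save("house_model_params.npy", np.array([a, b]))

a, b = np.load("house_model_params.npy")  # reload the weights in another run
print(a * 70 + b)                         # predicted price for a 70 m^2 house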
I think there is some confusion. Let's clear it up.
Machine learning models usually have parameters, and these parameters are trainable. This means a training algorithm finds the "right" values of these parameters so that the model works properly for a given task.
This is the learning part. The actual parameter values are "inferred" from training data.
What you would call the result of the training process is a model. The model is represented by formulas with parameters, and these parameters must be stored. Typically when you use a ML/DL framework (like scikit-learn or Keras), the parameters are stored alongside some information about the type of model, so it can be reconstructed at runtime.
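For instance, with scikit-learn you can persist the trained model (parameters plus model type) with joblib and reconstruct it later; the data below is made up:

import joblib
from sklearn.linear_model import LinearRegression

X = [[30.0], [45.0], [60.0], [80.0]]          # surfaces in m^2 (made up)
y = [90_000, 135_000, 175_000, 240_000]       # prices (made up)

model = LinearRegression().fit(X, y)          # training infers the parameters
joblib.dump(model, "price_model.joblib")      # store parameters + model type

restored = joblib.load("price_model.joblib")  # reconstruct at runtime
print(restored.predict([[70.0]]))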
I need to write a program that, given an object with certain attributes, knows how to classify it. It should learn to classify new objects by being trained with a list of known objects with known attributes.
For example, I have object A with the following attributes: a=10 and b=1. I have also trained the program so that it knows that values between 5..15 for a and 0..2 for b classify the given object as label1.
As the program evolves, I need to train it further with known data so that the attribute intervals (and hence the classification) become more accurate.
Now, I haven't got any experience with machine learning or anything of the kind, and I would like to know how I should start with this. I've seen plenty of tutorials, but only for text classification, and only for two-way classification (that is, positive or negative, yes or no... only two values to choose from). I would have 5-6 labels to start with, and their number will soon increase. Also, the object attributes are integers.
Any tip is highly appreciated!
Machine learning is a very broad field, so the first step would be knowing exactly what you're looking for and familiarizing yourself with the subproblem you're trying to solve.
By your description, you're trying to solve a classification problem in a supervised learning approach.
I'll paraphrase a bit from here:
The classification problem consists in identifying which class an observation belongs to.
Supervised learning is a way of "teaching" a machine. Basically, an algorithm is trained through examples (i.e. this particular object belongs to class X). After training, the machine should be able to apply its acquired knowledge to new data.
The k-NN algorithm is one of the simplest algorithms for solving this kind of problem. I suggest you familiarize yourself with it.
You have an implementation of k-NN in scikit-learn (KNeighborsClassifier). Here's a link to a tutorial on using it.
Now, answering your specific questions:
only for two-way classification (that is, positive or negative, yes or no... only two values to choose from)
k-NN can handle any (finite) number of classes, so you're covered.
Also, the object attributes are integers
k-NN usually operates in a continuous space, so you'll have to convert those integers to floats.
Mapping the attribute values into points in the algorithm's space is not a trivial problem (see data pre-processing, especially the articles on normalization, feature extraction and selection).
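To make this concrete, here is a minimal scikit-learn sketch with made-up integer attributes (a, b) and three labels; k-NN happily handles more than two classes:

from sklearn.neighbors import KNeighborsClassifier

# Made-up training set: each object is (a, b) with one of several labels.
X_train = [[10, 1], [12, 2], [6, 0],   # label1
           [30, 8], [28, 9],           # label2
           [50, 20], [55, 18]]         # label3
y_train = ["label1", "label1", "label1",
           "label2", "label2",
           "label3", "label3"]

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(clf.predict([[11, 1], [52, 19]]))  # -> ['label1' 'label3']

# As new labelled objects arrive, append them to the training set and refit;
# the decision boundaries become more accurate with more data.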
Is there a way to give an x, y pair dataset to a function that will return a list of curve-fit models and their coefficients? The program DataFit does this with about 200 different models, but we are looking for a Pythonic way, covering everything from exponential to inverse polynomial, etc.
I have seen many posts on manually typing each model into scipy, but this is not feasible for the number of models we want to test.
The closest I found was pyeq2, but it does not return the list of functions, and it seems to be a rabbit hole to code for.
If R has this available, we could use that, but Python is really the goal.
Below is an example of the data; we want to find the best way to describe this curve.
You can try the splines library in R. I have used it for higher-order curve fitting of univariate data. You can vary the fit and compare the corresponding R^2 errors.
You can decide to do one of the following:
Choose a model and fit its parameters. This model should be based on a single independent variable. This can be done with Python's scipy.optimize curve_fit function. You can choose something like a hyperbola (see the sketch after this list).
Choose a model that is complex and likely represents an underlying mechanism at work, like the system of ODEs from a disease SIR model. Fitting the parameters will be no easy task; it would be done with Markov chain Monte Carlo (MCMC) methods. This is VERY difficult.
Realise that you have data and can use machine learning via scikit-learn to make predictions from your data. This approach doesn't require a parametric model.
Machine learning and neural networks don't fit an explicit formula and can't really tell you about the underlying mechanism, but they can make predictions just as a best-fit model would... dare I say even better.
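As a sketch of the first option, you can loop scipy.optimize.curve_fit over a small catalogue of candidate model functions and rank them by R^2; the catalogue and the synthetic data below are only placeholders to extend.

import numpy as np
from scipy.optimize import curve_fit

# Placeholder x, y data; replace with your own arrays.
x = np.linspace(1, 10, 50)
y = 3.0 / x + 0.5 + np.random.normal(0, 0.05, x.size)

# A small catalogue of candidate models; extend as needed.
models = {
    "linear":      lambda x, a, b: a * x + b,
    "exponential": lambda x, a, b: a * np.exp(b * x),
    "hyperbola":   lambda x, a, b: a / x + b,
    "power":       lambda x, a, b: a * np.power(x, b),
}

results = []
for name, f in models.items():
    try:
        params, _ = curve_fit(f, x, y, maxfev=10000)
        ss_res = np.sum((y - f(x, *params)) ** 2)
        ss_tot = np.sum((y - y.mean()) ** 2)
        results.append((name, params, 1 - ss_res / ss_tot))  # R^2
    except RuntimeError:
        pass  # skip models that fail to converge

# Best-fitting models first.
for name, params, r2 in sorted(results, key=lambda r: -r[2]):
    print(f"{name}: R^2 = {r2:.4f}, params = {params}")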
In the end, we found that the Eureqa software was able to achieve this: https://www.nutonian.com/products/eureqa/