I'd like to predict remaining survival time with time-varying covariates using Python. I have already used lifelines' CoxTimeVaryingFitter and would like to compare it to a decision-tree-based approach, such as Random Survival Forest. From this paper I understand that the "normal" Random Survival Forest cannot cope with time-varying covariates, but that there are extensions to solve that. I could not find any of them implemented in Python. Have I missed something? I'd also appreciate advice on other modules that can cope with time-varying covariates.
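For reference, this is roughly the lifelines setup I am comparing against (a minimal sketch; the column names and values are placeholders for my actual long-format data):

    import pandas as pd
    from lifelines import CoxTimeVaryingFitter

    # Long-format data: one row per (subject, interval); placeholder values
    df = pd.DataFrame({
        "id":    [1, 1, 2, 2, 3, 3],
        "start": [0, 4, 0, 3, 0, 2],
        "stop":  [4, 9, 3, 8, 2, 7],
        "event": [0, 1, 0, 1, 0, 0],
        "cov1":  [0.2, 0.7, 0.1, 0.4, 0.5, 0.3],
    })

    ctv = CoxTimeVaryingFitter()
    ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop")
    ctv.print_summary()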
Is there an easy(ish) way to fit a two-phase Coxian distribution, preferably in R or, if necessary, Python? This is a distribution with two transient states in sequence, each described by an exponential distribution, and each of which can lead to the absorbing state with some probability. I have some real-world data that I think is best described by this distribution, and I would like to estimate the exponential parameters of the two phases, ideally as a linear function of some covariates I have. If there is a package or library, or any sort of resource about fitting a model like this, I would really appreciate it. Thank you for your time.
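In case it helps, this is the kind of thing I have in mind if no package exists: a maximum-likelihood fit with scipy. A minimal sketch only; the density below is my own derivation for the case lam1 != lam2, so please check it, and covariates could then enter via, e.g., lam = exp(X @ beta) inside the likelihood.

    import numpy as np
    from scipy.optimize import minimize

    def coxian2_pdf(t, lam1, lam2, p):
        # Two-phase Coxian density (my derivation, assumes lam1 != lam2):
        #   with prob (1 - p): absorb straight from phase 1 -> Exp(lam1)
        #   with prob p:       pass through phase 2 -> Exp(lam1) + Exp(lam2)
        direct = (1.0 - p) * lam1 * np.exp(-lam1 * t)
        through = p * lam1 * lam2 / (lam1 - lam2) * (np.exp(-lam2 * t) - np.exp(-lam1 * t))
        return direct + through

    def neg_log_lik(params, t):
        lam1, lam2, p = params
        return -np.sum(np.log(np.maximum(coxian2_pdf(t, lam1, lam2, p), 1e-300)))

    # Synthetic data standing in for my real observations
    rng = np.random.default_rng(0)
    n = 2000
    phase1 = rng.exponential(1 / 1.5, n)                     # Exp(lam1 = 1.5)
    phase2 = rng.exponential(1 / 0.4, n)                     # Exp(lam2 = 0.4)
    t = phase1 + np.where(rng.random(n) < 0.6, phase2, 0.0)  # p = 0.6

    res = minimize(neg_log_lik, x0=[1.0, 0.5, 0.5], args=(t,),
                   bounds=[(1e-6, None), (1e-6, None), (1e-6, 1 - 1e-6)],
                   method="L-BFGS-B")
    print(res.x)  # estimated (lam1, lam2, p)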
I have a dataset consisting of census data (age, sex, employment type, race, education level, etc.). My task is to write an algorithm that predicts whether a data point (e.g., 30, male, white) will have a gross annual income above $50k.
So far I have implemented a KNN algorithm that runs for 30 hours but achieves ~90% accuracy on test data. I was hoping to achieve higher accuracy using an SVM, Naive Bayes, or anything else that might work here.
I'm looking for an algorithm that is relatively simple to implement (about as hard as KNN) in Python and is likely to achieve good accuracy. What is the best choice in this case? If KNN is the best choice, which algorithm would be easiest to implement for comparison purposes?
It is hard to tell a priori which algorithm will perform better. Usually, for traditional classification tasks such as yours, random forests, gradient boosted machines, and SVMs give the best results.
I don't know exactly what you mean by an algorithm that is "relatively simple to implement", but if you use scikit-learn, many algorithms are already implemented and will fit in one or two lines of code, so you can try them all!
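For instance, a minimal sketch (the make_classification data is just a stand-in for your encoded census features):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    # Stand-in for your encoded census data; replace with your real X, y
    X, y = make_classification(n_samples=5000, n_features=12, random_state=0)

    models = {
        "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
        "gradient boosting": GradientBoostingClassifier(random_state=0),
        "svm": make_pipeline(StandardScaler(), SVC()),  # SVMs want scaled inputs
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")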
I am working with a complex system that has five input variables; the response of the system is measured depending on the values of these five variables. Seven output variables are measured in order to completely define the response.
I have been using an artificial neural network to model the relationship between the five input variables and the seven output parameters. This has been successful so far: the ANN predicts the output really well (I have also tested the trained network on a validation set of test cases). I used Python with Keras/TensorFlow for this.
By the way, I also tried linear regression as the function approximator, but it produces large errors. These errors are expected, considering that the system is highly non-linear and may not be continuous everywhere.
Now, I would like to predict the values of the five input variables from a vector of the seven output parameters (the target vector). I tried using a genetic algorithm (GA) for this. After a lot of effort designing the GA, I still end up with large differences between the target vector and the GA's prediction. I simply minimize the mean squared error between the ANN prediction (the function approximator) and the target vector.
Is this the right approach: using the ANN as a function approximator and a GA for design space exploration?
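For reference, the fitness function I hand to the GA is essentially this (a sketch; ann and target stand for my trained Keras model and the measured output vector):

    import numpy as np

    def fitness(x, ann, target):
        # x:      candidate values of the five input variables, shape (5,)
        # ann:    trained Keras model mapping the 5 inputs to the 7 outputs
        # target: the measured 7-dimensional output vector
        pred = ann.predict(x.reshape(1, -1), verbose=0)[0]
        return np.mean((pred - target) ** 2)  # the MSE the GA minimizes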
Yes, it is a good approach to do search space exploration using a GA. But the design of the crossover, mutation, generation-evolution logic, etc. plays a major role in determining the performance of the genetic algorithm.
If your search space is limited, you can use exact methods (which solve to optimality).
There are a few implementations in Python's scipy itself (see the sketch after the list below).
If you prefer to go with metaheuristics, there is a wide range of options other than genetic algorithms:
Memetic algorithm
Tabu Search
Simulated annealing
Particle swarm optimization
Ant colony optimization
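As an illustration, here is a minimal sketch of the scipy route, reusing the same MSE objective. ann_predict, target, and the bounds are placeholders you would swap for the trained Keras model, the measured vector, and the real design-space limits; dual_annealing is scipy's simulated-annealing variant and differential_evolution is an evolutionary alternative.

    import numpy as np
    from scipy.optimize import differential_evolution, dual_annealing

    def ann_predict(x):
        # Stand-in for the trained ANN (5 inputs -> 7 outputs);
        # replace with your Keras model's predict call.
        return np.tanh(x @ np.ones((5, 7)))

    target = np.zeros(7)  # placeholder for the measured target vector

    def objective(x):
        # Same MSE objective the GA minimizes
        return np.mean((ann_predict(x) - target) ** 2)

    bounds = [(-1.0, 1.0)] * 5  # placeholder box bounds on the five inputs

    result = dual_annealing(objective, bounds=bounds)            # simulated annealing
    # result = differential_evolution(objective, bounds=bounds)  # alternative
    print(result.x, result.fun)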
Has there been work in featuretools (or an additional Python package) to integrate it with a common ML library such as sklearn? E.g., it would be nice to test a feature for its predictive power and, if it's high enough, generate more features like it (e.g., using the same initial variable). In other words, can the process of generating new features be guided by their predictive power?
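To make the question concrete, here is the kind of loop I am imagining, sketched on featuretools' demo entityset (the target column is an arbitrary placeholder, and target_dataframe_name is the argument name in recent featuretools versions; older ones use target_entity):

    import featuretools as ft
    from sklearn.feature_selection import mutual_info_regression

    # Demo entityset shipped with featuretools, as a stand-in for real data
    es = ft.demo.load_mock_customer(return_entityset=True)
    feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name="customers")

    # Score each generated numeric feature against one chosen as the target
    X = feature_matrix.select_dtypes("number").dropna(axis=1)
    y = X.pop("COUNT(transactions)")  # placeholder target column
    scores = mutual_info_regression(X, y)
    for feat, score in sorted(zip(X.columns, scores), key=lambda s: -s[1])[:10]:
        print(f"{score:.3f}  {feat}")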
I have a dataset, for which you can find the (updated) file here, containing many different characteristics of different office buildings, including their surface area and the number of people working in them. In total there are about 200 records. I want to use an algorithm that can be trained on the dataset above in order to predict the electricity consumption (given in the column 'kwh') of a building that is not in the set.
I have tried most of the applicable machine learning algorithms from the scikit-learn library in Python (linear regression, Ridge, Lasso, SVC, etc.) in order to predict this continuous variable. Surface area and the number of workers had a correlation with the target variable between 0.3 and 0.4, so I assumed they were good features and included them in the training of the model. However, I got a mean absolute error of about 13350 and an R-squared value of about 0.22-0.35, which is not good at all.
I would be very grateful if someone could give me some advice, or examine the dataset a little and run some algorithms on it. What type of preprocessing should I use, and what type of algorithm? Is the number of records too low to train a regression model for predicting a continuous variable?
Any feedback would be helpful as I am new to machine learning :)
The first thing that should be done in these kinds of machine learning problems is to understand the data. Yes, the number of features in your dataset is small, and yes, the number of data samples is very small, but it is important to do the best we can with what we have.
The dataset's headers are in a language other than English; it is important to translate them into a language most people in the community will understand (in this case, English). After a bit of tinkering, I found out that the language is Dutch.
There are some key features missing from the dataset, from something as obvious as the number of floors in the building to something less obvious like the number of working hours. Surface area and the number of workers seem to me to be the most important features, but you are missing out on a feature called building_function, which (after using Google Translate) tells what the purpose of the building is. Intuitively, this should have a large correlation with power consumption: industrial buildings tend to use more power than normal households. After translation, I found that the main types are Residential, Office, Accommodation, and Meeting. This feature thus has to be encoded as a nominal variable to train the model.
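A minimal sketch of that encoding step plus a simple baseline model (the filename and column names are my guesses at the translated headers, so adjust them to the actual file):

    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    df = pd.read_csv("buildings.csv")  # placeholder filename for the translated data

    numeric = ["surface_area", "num_workers"]  # assumed (translated) column names
    categorical = ["building_function"]        # Residential / Office / Accommodation / Meeting
    X, y = df[numeric + categorical], df["kwh"]

    # One-hot encode the nominal building_function feature, pass numerics through
    pre = ColumnTransformer(
        [("onehot", OneHotEncoder(handle_unknown="ignore"), categorical)],
        remainder="passthrough",
    )
    model = make_pipeline(pre, RandomForestRegressor(n_estimators=300, random_state=0))

    # With ~200 records, cross-validation gives a more honest error estimate
    print(cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error"))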
Another feature, hoofsbi, also seems to have some variance, but I do not know what that feature means.
If you could translate the headers in the data and share it, I would be able to provide you some code to perform this regression task. In such tasks, it is very important to understand what the data is and then perform feature engineering accordingly.