I am trying to get some insights from a dataset using Logistic Regression. The starting dataframe contains a flag indicating whether something was removed unscheduled (1 = Yes, 0 = No), together with the data that was recorded for that removal.
This looks like this:
This data is then 'dummified' using pandas.get_dummies, and the result is as expected.
Then I normalized the fitted coefficients (coef_) so that everything is on the same scale. I put these in a dataframe with a 'Parameter' column (the column names of the dummy dataframe) and a 'Value' column (the normalized coefficient values).
This gives the following result.
This result shows that On-wing Time is the biggest contributor to unscheduled removals.
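For reference, here is a minimal sketch of the pipeline described above (the target column name Removed_Unscheduled is an assumption, and dividing by the largest absolute coefficient is just one way to put everything on the same scale):

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# df is the starting dataframe; 'Removed_Unscheduled' (1/0) is the assumed target column
y = df['Removed_Unscheduled']
X = pd.get_dummies(df.drop(columns=['Removed_Unscheduled']))

model = LogisticRegression(max_iter=1000).fit(X, y)

# Normalized coefficients, one row per dummy column
coefs = model.coef_[0]
coef_df = pd.DataFrame({'Parameter': X.columns,
                        'Value': coefs / np.abs(coefs).max()}).sort_values('Value', ascending=False)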
Now the question: how can I predict the chance that there will be an unscheduled removal for this reason (this column)? In other words, what is the chance that I will get another unscheduled removal caused by On-wing Time?
Note that these parameters can change, since this is fake data and more data may be added later on. So when the biggest contributor changes, the prediction should also focus on the new biggest contributor.
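One way to approach this, continuing from the sketch above: predict_proba gives the chance of an unscheduled removal for any (dummified) input row, and the biggest contributor can be re-derived from the coefficients every time the model is refit, so the analysis automatically follows whichever parameter dominates in the current data.

# Probability of an unscheduled removal (class 1) for new, already-dummified rows
new_rows = X.iloc[:5]                      # placeholder: replace with the new observations
probabilities = model.predict_proba(new_rows)[:, 1]

# Re-derive the biggest contributor from the current fit
biggest = coef_df.loc[coef_df['Value'].abs().idxmax(), 'Parameter']
print("Current biggest contributor:", biggest)
print("P(unscheduled removal) per row:", probabilities)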
I hope you understand the question.
EDIT
The complete code and dataset (fake one) can be found here: 1drv.ms/u/s!AjkQWQ6EO_fMiSEfu3vYgSTBR0PZ
I have a data frame like this source data frame, and using it I need to build a data frame with the conditional probability over time, given images in specific years, like the target dataframe.
I have used melt and then merge to get the probability; however, I am not getting the correct probability with my code. For example, if I run the code below (from my inspiration) on my source data frame above, then given 'At least one mountain', the probability of 'At least one mountain' comes out as 0.32301, whereas it should be 1 for all years. Am I calculating the conditional probability incorrectly?
I tried taking some inspiration from the Conditional Probability in Python solution; however, it still gives me the wrong answer.
Here is how I am populating my final dataframe, which is wrong according to the target data frame given above.
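For reference, a minimal sketch of one way to compute P(target | given) per year directly from counts (the structure and the column names, such as 'year' and the boolean label columns, are assumptions, since the actual source frame isn't shown). Computed this way, P('At least one mountain' | 'At least one mountain') is 1 for every year by construction.

import pandas as pd

def conditional_prob(df, given, target):
    # Restrict to the rows where the conditioning label holds,
    # then take the mean of the boolean target column within each year
    subset = df[df[given]]
    return subset.groupby('year')[target].mean()

# conditional_prob(df, 'At least one mountain', 'At least one mountain')  -> 1.0 for every year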
I stumbled upon the following problem:
I'm working on a beginner's project in data science. I have my train and test splits, and right now I'm analysing every feature and then adding it to either a dataframe for discretised continuous variables or a dataframe for continuous variables.
Doing so, I encountered a feature with big outliers. If I were to delete them, the other features I have already added to my sub-dataframes would have more entries than this one.
Should I just find a strategy to overwrite the outliers with "better" values, or should I reconsider my strategy of splitting the train data into both types of variables in the first place? I don't think that getting rid of the outlier rows in the real train_data would be useful, though...
There are many ways to deal with outliers.
In my data science course we used "data imputation".
But before you start to replace or remove data, it's important to analyse what difference the outlier makes and, of course, whether the outlier is valid.
If the outlier is invalid, you can delete the outlier and use data imputation as explained below.
If your outlier is valid, check the difference in outcome with and without it. If the difference is very small, there isn't a problem. If the difference is significant, you can use standardization or normalization.
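For example (a small sketch, not tied to any particular dataset), scikit-learn offers StandardScaler for standardization and MinMaxScaler for normalization:

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])   # one obvious outlier

print(StandardScaler().fit_transform(x).ravel())   # standardization: zero mean, unit variance
print(MinMaxScaler().fit_transform(x).ravel())     # normalization: rescaled to the [0, 1] range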
You can replace the outlier with:
a random value (not recommended)
a value based on heuristic logic
a value based on its neighbours
the median, mean or mode
a value based on interpolation (i.e. a prediction from some ML model)
I recommend using the strategy with the best outcome.
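A small pandas sketch of a few of these options, assuming a numeric column value in which the outliers have already been set to NaN:

import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [10.0, 12.0, np.nan, 11.0, np.nan, 13.0]})   # NaNs mark removed outliers

median_fill = df['value'].fillna(df['value'].median())       # the median (robust against remaining outliers)
mean_fill = df['value'].fillna(df['value'].mean())           # the mean
mode_fill = df['value'].fillna(df['value'].mode()[0])        # the mode
neighbour_fill = df['value'].interpolate(method='linear')    # a value based on its neighbours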
StatQuest explains data science and machine learning concepts in a very easy and understandable way, so refer to it if you encounter more theoretical questions: https://www.youtube.com/user/joshstarmer
My data set has one output, which I call Y, and 5 inputs, called X. I read the output and inputs from my system in Python and store them in an array called FinalArray.
Later on I use StandardScaler from sklearn to scale my data set as follows:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(FinalArray)                           # learn the per-column mean and standard deviation
FinalArrayScaled = scaler.transform(FinalArray)  # scale every column (the output and the 5 inputs)
FinalArrayScaled is later divided into train and test sets, as is usually recommended for regression techniques. Since I am using surrogate models (more specifically, Kriging), I implement infill to add more points to my sample domain and improve the confidence in my model (RMSE and r^2). The infill method returns its values scaled. Remember that the input used for the surrogate model has been scaled previously.
Here is a short example of what the output looks like for 4 samples and 5 features (Image 1).
The first column (0) represents the output of the system and the other columns represent the inputs, so each row represents a specific experiment.
To get the values back in their original units, I applied scaler.inverse_transform to the output of the infill method. However, the output seems strange: once I apply scaler.inverse_transform, I obtain very different values for the same input; refer to Figure 2.
Notice elements (0,1) and (0,2) in Figure 1. Although they are exactly the same, they lead to totally different values in Figure 2. The same applies to many others. Why is this?
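As a sanity check (a sketch with random stand-in data, not the actual FinalArray): identical scaled rows always invert to identical original values, so a mismatch like the one described usually means the rows are assembled or indexed differently before inverse_transform is called.

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
data = rng.normal(size=(4, 6))                   # 4 samples, 1 output + 5 inputs, standing in for FinalArray

scaler = StandardScaler().fit(data)
scaled = scaler.transform(data)

row_twice = scaled[[0, 0], :]                    # the same scaled row, twice
back = scaler.inverse_transform(row_twice)
assert np.allclose(back[0], back[1])             # identical inputs give identical inverse-transformed values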
I found a bug in my source code. For the sake of completeness, I am sharing the final results here.
I had an error in a for loop. I wanted to delete the question to avoid confusion in the future, but I was not able to.
I have created a 4-cluster k-means customer segmentation in scikit learn (Python). The idea is that every month, the business gets an overview of the shifts in size of our customers in each cluster.
My question is how to make these clusters 'durable'. If I rerun my script with updated data, the 'boundaries' of the clusters may slightly shift, but I want to keep the old clusters (even though they fit the data slightly worse).
My guess is that there should be a way to extract the parameters that decide which case goes to which cluster, but I haven't found the solution yet.
Got the answer in a different topic:
Just record the cluster means. Then when new data comes in, compare it to each mean and put it in the one with the closest mean.
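In scikit-learn terms, that means fitting once, freezing the centroids (cluster_centers_), and reusing them for every monthly refresh. A sketch with placeholder data (the variable names and the joblib persistence are assumptions):

import numpy as np
from sklearn.cluster import KMeans
import joblib

# Placeholder data standing in for the scaled customer features
rng = np.random.default_rng(0)
X_reference = rng.normal(size=(200, 3))   # data the segmentation was originally built on
X_new = rng.normal(size=(50, 3))          # this month's refreshed data

# Fit once and freeze the model (and thus the centroids / cluster boundaries)
kmeans = KMeans(n_clusters=4, random_state=0, n_init=10).fit(X_reference)
joblib.dump(kmeans, 'customer_kmeans.joblib')

# Every month: load the frozen model and assign the new data to the old clusters
kmeans = joblib.load('customer_kmeans.joblib')
new_labels = kmeans.predict(X_new)        # nearest stored centroid; boundaries do not shift

# Equivalent manual "closest mean" assignment using only the recorded centroids
centroids = kmeans.cluster_centers_
manual_labels = np.argmin(np.linalg.norm(X_new[:, None, :] - centroids[None, :, :], axis=2), axis=1)
assert np.array_equal(new_labels, manual_labels)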
Attached is the data. Can you please help me handle the missing data in the "Outlet_Size" column, so that I can use the complete data for preparing data science models?
This is one of the major challenges of data mining (or machine learning) problems. YOU decide what to do with the missing data, based on experience. You mustn't look at data science as a black box where following a series of steps guarantees success!
Some guidelines about missing data.
A. If more than 40% of the data in a column is missing, drop it! (Again, the 40% depends on the type of problem you're working on and on whether the data is crucial or trivial enough to ignore.)
B. Check whether there is some way you can impute the missing data from the internet. You're looking at item weight! If there were any way to know which product you're dealing with instead of the hashed Item_Identifier, you could quite literally Google it and figure it out.
C. Missing data can be classified into two types:
MCAR: missing completely at random. This is the desirable scenario in case of missing data.
MNAR: missing not at random. Missing not at random data is a more serious issue and in this case it might be wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?
Assuming the data is MCAR, too much missing data can still be a problem. A safe maximum threshold is usually 5% of the total for large datasets. If more than 5% of the data is missing for a certain feature or sample, you should probably leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing, using a simple function.
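That check is a one-liner in pandas (a sketch; df is the dataframe in question):

# Fraction of missing values per feature (column) and per sample (row)
col_missing = df.isnull().mean()
row_missing = df.isnull().mean(axis=1)

# Candidates to drop under the 5% rule of thumb
cols_to_drop = col_missing[col_missing > 0.05].index
rows_to_drop = row_missing[row_missing > 0.05].index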
D. As posted in the comments, you can simply drop the rows using df.dropna(), fill them with infinity, or fill them with the group mean using df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))
This groups the value column of dataframe df by the name category, finds the mean within each group, and fills each missing value with the corresponding group mean.
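A tiny worked example of that group-wise fill (made-up values, just to show the mechanics):

import pandas as pd
import numpy as np

df = pd.DataFrame({'name': ['A', 'A', 'A', 'B', 'B'],
                   'value': [1.0, 3.0, np.nan, 10.0, np.nan]})

# The NaN in group A becomes 2.0 (the mean of 1 and 3); the NaN in group B becomes 10.0
df['value'] = df.groupby('name')['value'].transform(lambda x: x.fillna(x.mean()))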
E. Apart from dropping missing values or replacing them with the mean or median, there are more advanced techniques that predict missing values and fill them in, e.g. MICE (Multivariate Imputation by Chained Equations). It's worth reading more about where such advanced imputation techniques are helpful.
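scikit-learn ships a MICE-style estimator as IterativeImputer (at the time of writing it still sits behind an experimental import); a minimal sketch for numeric columns:

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0], [3.0, np.nan], [np.nan, 6.0], [8.0, 9.0]])

# Each feature with missing values is modelled as a function of the other features, iteratively
X_filled = IterativeImputer(random_state=0).fit_transform(X)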
The accepted answer is really nice.
In your specific case I'd say either drop the column or assign a new value called Missing. Since it's a categorical variable, there's a good chance it ends up going through a one-hot or target encoder (or being understood by the model as a category directly). Also, the fact that the value is NaN is information in itself; it can come from multiple factors (from bad data to technical difficulties in getting an answer, etc.). Be careful and watch that this doesn't introduce bias or information you shouldn't have (for example: the products have NaN because they are not in a certain database, something that would never happen in a real situation, which would make your results unrepresentative of the true situation).
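Treating the missing values as their own category is a one-liner before encoding (a sketch; df stands for the loaded dataset):

import pandas as pd

df['Outlet_Size'] = df['Outlet_Size'].fillna('Missing')   # missingness becomes an explicit category
encoded = pd.get_dummies(df, columns=['Outlet_Size'])     # it then flows through one-hot encoding like any other level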
The "Outlet_Size" column contains categorical data, so instead of dropping the data, use a measure to fill it.
Since it is categorical data, use a measure of central tendency: the mode.
Use the mode to find which category occurs most frequently and fill the missing values with that category.
Code:
Dataframe['Outlet_Size'].mode()                                                     # inspect the most frequent category
Dataframe['Outlet_Size'].fillna(Dataframe['Outlet_Size'].mode()[0], inplace=True)   # fill the NaNs with that category