Statsmodels Python Missing Values - python

I'm currently working on a project in which I have to fill in some missing values.
I use Python, and I read that there is an algorithm that can impute missing data.
This algorithm is called NIPALS. I looked for a way to use it and found that statsmodels.multivariate.pca.PCA might help.
I have a NumPy array named A with n rows and p columns. Some of its values are missing (NaN). I would like to use PCA to fill A, but I could not find any example showing how to do it.
Can someone help me fill A using the NIPALS algorithm?
Thank you.
N.B. Sorry, I'm a French beginner, and it's not easy for me to use English documentation.

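I found a way to fill the missing values.
Let's assume you have a NumPy array named A:
from statsmodels.multivariate.pca import PCA
# missing='fill-em' asks PCA to fill the NaN values with an EM-style algorithm
pc = PCA(data=A, ncomp=1, missing='fill-em')
# _adjusted_data is a private attribute, so it may change between statsmodels versions
A = pc._adjusted_data
Enjoy!
Note that this relies on the EM fill rather than on NIPALS itself; PCA also accepts method='nipals' if you want that algorithm for the decomposition.
You can also fill missing values in other ways: mean, median, k-nearest neighbors, MCMC (Markov chain Monte Carlo), most frequent value...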
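To make the idea behind this kind of fill more concrete, here is a minimal NumPy-only sketch of iterative low-rank imputation: start from column means, fit a rank-ncomp approximation, overwrite only the missing cells with it, and repeat. The function name fill_with_low_rank and its parameters are made up for this illustration, and this is not the statsmodels implementation, only an approximation of the general technique.

```python
import numpy as np

def fill_with_low_rank(A, ncomp=1, max_iter=200, tol=1e-8):
    """Hypothetical helper: fill NaNs with an iterated rank-`ncomp` PCA approximation."""
    A = np.asarray(A, dtype=float)
    mask = np.isnan(A)                       # True where a value is missing
    filled = A.copy()
    # Initial guess: each missing cell gets the mean of its column.
    col_means = np.nanmean(A, axis=0)
    filled[mask] = np.take(col_means, np.where(mask)[1])
    for _ in range(max_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        # Rank-`ncomp` reconstruction of the centered data, shifted back by the means.
        approx = (U[:, :ncomp] * s[:ncomp]) @ Vt[:ncomp] + mu
        change = np.max(np.abs(filled[mask] - approx[mask])) if mask.any() else 0.0
        filled[mask] = approx[mask]          # observed cells are never modified
        if change < tol:
            break
    return filled

# Small demo matrix with one missing cell.
A_demo = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
A_demo[1, 2] = np.nan
A_filled = fill_with_low_rank(A_demo, ncomp=1)
```

Only the NaN cells change; every observed value in A_demo is returned untouched.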

Related

How should I handle NaN values in a Finance DF?

I am a beginner in machine learning. My question is: how should I encode the column "OECDSTInterbkRate"? I don't know how to replace the missing values, and especially with what. Should I just delete them, or replace them with the mean or median of the values?
There are many approaches to this issue.
The simplest: if you have a huge amount of data, drop the NaNs.
Replace the NaNs with the mean, median, etc. of the whole non-NaN dataset, or of the dataset grouped by one or several columns. E.g. for your dataset you can fill the Australian NaNs with the mean of the Australian non-NaN values, and do the same for the other countries.
A common approach is to create an additional indicator column that records which rows had a missing value replaced during imputation. This column is then passed as yet another input to your ML algorithm.
Take a look at the docs (assuming you work with Pandas) - the developers of the library have already created some tools for the missing data: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
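As a rough sketch of the group-wise fill and the indicator column described above, assuming pandas: the column "OECDSTInterbkRate" comes from the question, while the "Country" column, the "rate_was_missing" name, and the sample values are invented for the example.

```python
import numpy as np
import pandas as pd

# Toy data; "Country" and the numbers are assumptions made for illustration.
df = pd.DataFrame({
    "Country": ["Australia", "Australia", "Australia", "France", "France"],
    "OECDSTInterbkRate": [1.0, np.nan, 3.0, 0.5, np.nan],
})

# Record where values were missing BEFORE filling them.
df["rate_was_missing"] = df["OECDSTInterbkRate"].isna()

# Fill each NaN with the mean of the non-missing values of the same country.
df["OECDSTInterbkRate"] = (
    df.groupby("Country")["OECDSTInterbkRate"]
      .transform(lambda s: s.fillna(s.mean()))
)
```

The indicator has to be computed first, because after the fill there is no way to tell imputed values from observed ones.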
There's no single answer to your question; it's a general problem in statistics called "imputation". Depending on the application, the answer could be many things.
A few alternatives come to mind, but don't forget that "no data" is almost always better than "bad/wrong data". If you still have more than enough rows after removing those with NaNs, you may simply drop them. Otherwise you can consider the following:
Can you calculate the column you need mathematically from the other columns in your dataset? If so, you have your answer.
Check the correlation between this column (using its non-missing rows) and the other columns. If they are highly correlated, you might as well try dropping the whole column (not always a good idea, but generally a reasonable one).
Can you build an estimator (such as a regression model) that predicts the missing values with really good accuracy, learning from the values you already have and from the other columns? Then you might have an answer (it needs benchmarking against the following options). Please keep in mind that this is a very risky operation that could give bad estimates and decrease the performance of your overall model. Try this only if your estimates are really good!
Is it a regression problem? Using the statistical mean could be a good idea.
Is it a classification problem? Using median could be a good idea.
In some cases using mode might also be a good idea depending on the distribution.
I suggest that you try all of these and see which one works best, because there really is no concrete answer to your problem. You can build a machine learning model without the column, use its performance as a baseline, and benchmark the performance (accuracy) of each option against that baseline.
Note: I am just a graduate student with some insights, so please comment if anything I said is not correct!
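One cheap way to compare fill strategies before training any model is to hide some values you do know, impute them, and measure the error. This is a different and much simpler check than the model-level benchmark suggested above, and the data here is synthetic; the names known, hidden, and mae are only for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic, skewed stand-in for a real column with no missing values.
known = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Pretend roughly 20% of the values are missing.
hidden = rng.random(known.size) < 0.2
observed = known[~hidden]

# Candidate constant fills, computed from the observed part only.
candidates = {"mean": observed.mean(), "median": np.median(observed)}

# Mean absolute error of each fill on the values we hid.
mae = {name: float(np.abs(known[hidden] - value).mean())
       for name, value in candidates.items()}
best = min(mae, key=mae.get)
```

The same loop extends to any other strategy, as long as it is fitted on the observed values only and scored on the hidden ones.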

How can I save filled missing data after using XGBClassifier?

I have a dataset with missing values, which is not a problem for XGBClassifier because it handles them dynamically. I want to save the features as XGBClassifier fills them. My aim is to use XGBoost to impute the missing data and then try other algorithms that don't allow NaN values. Is this possible?
XGBoost can handle missing values, but it does not fill them. So the answer is no, you cannot use it to somehow populate the missing values in a feature.
At training time, it handles missing data by choosing, at each split, the direction that minimises the loss. So handling missing data consists entirely of selecting the optimal path based on how much the loss function is reduced; no value imputation is involved.
This is mentioned in the publication:
"The optimal default directions are learnt from the data. The key improvement is to only visit the non-missing entries I_k. The presented algorithm treats the non-presence as a missing value and learns the best direction to handle missing values."
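To illustrate what "learning a default direction" means, here is a toy sketch for a single split with a squared-error loss. It is not XGBoost's code, which uses gradient statistics and regularisation; the helper names sse and best_default_direction are invented for this example.

```python
import numpy as np

def sse(y):
    """Sum of squared errors around the mean (0.0 for an empty group)."""
    return float(((y - y.mean()) ** 2).sum()) if y.size else 0.0

def best_default_direction(x, y, threshold):
    """Toy version: try sending the missing rows left, then right, keep the lower loss."""
    missing = np.isnan(x)
    goes_left = np.zeros(x.shape, dtype=bool)
    goes_left[~missing] = x[~missing] < threshold   # only non-missing entries are compared
    losses = {}
    for direction in ("left", "right"):
        left_mask = (goes_left | missing) if direction == "left" else goes_left
        losses[direction] = sse(y[left_mask]) + sse(y[~left_mask])
    return min(losses, key=losses.get), losses

x = np.array([1.0, 2.0, np.nan, 8.0, 9.0, np.nan])
y = np.array([1.0, 1.0, 5.0, 5.0, 5.0, 5.0])
direction, losses = best_default_direction(x, y, threshold=5.0)
```

The rows with a missing x are routed to one side of the split, but x itself is never given a value, which is why there is nothing to save afterwards.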

How to select only missing values for testing the model?

I am working on a logistic regression project with 850 observations and 8 variables. I found 150 missing values and have decided to use them as test data. How can I select only the rows with missing values as test data in Python?
I am still learning data science, so if there's a mistake in this approach, please let me know.
Thank you :)
You could use pd.isna() from the pandas library.
It returns a boolean array that you can use to filter your data.
You can select all rows that contain at least one missing value with the following code:
df[df.isnull().values.any(axis=1)]
I do not recommend using all the rows with missing values for testing. The missing values in the test dataset should be filled in completely, or at least partially.
Let's see what other Machine Learning experts advise you.
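A short sketch of the row selection above, assuming pandas, with the complementary set of complete rows included. The DataFrame and its column names are placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder data; the column names are assumptions.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [3000.0, 4200.0, np.nan, 5100.0],
    "defaulted": [0, 1, 0, 1],
})

has_missing = df.isna().any(axis=1)   # True for rows with at least one NaN
incomplete = df[has_missing]          # rows that contain a missing value
complete = df[~has_missing]           # rows with no missing value
```

Every row lands in exactly one of the two frames, so together they cover the whole dataset.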

Missing data in Dataframe using Python

Hi,
Attached is the data. Can you please help me handle the missing data in the "Outlet_Size" column,
so that I can use the complete data to build data science models?
Thanks,
This is one of the major challenges in data mining (or machine learning) problems. YOU decide what to do with the missing data, based on PURE EXPERIENCE. You mustn't look at data science as a black box where following a series of steps guarantees success!
Some guidelines about missing data.
A. If more than 40% of a column's data is missing, drop it! (Again, the 40% depends on the type of problem you're working on: whether the data is crucial, or so trivial that you can ignore it.)
B. Check if there is some way you can impute the missing data from the internet. You're looking at item weight! If there is any way to know which product you're dealing with instead of the hash-coded Item_Identifier, then you can literally Google it and figure it out.
C. Missing data is commonly classified by why it is missing. Two important cases are:
MCAR: missing completely at random. This is the desirable scenario when data is missing.
MNAR: missing not at random. This is a more serious issue, and in this case it might be wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why not? Was the question unclear?
Even assuming the data is MCAR, too much missing data can be a problem. A commonly quoted safe maximum threshold is 5% of the total for large datasets. If more than 5% of the data for a certain feature or sample is missing, you should probably leave that feature or sample out. It is therefore worth checking for features (columns) and samples (rows) where more than 5% of the data is missing.
D. As posted in the comments, you can simply drop the rows using df.dropna(), fill them with infinity, or fill them with the group mean using df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))
This groups the column value of dataframe df by category name, finds the mean of each category, and fills the missing values in value with the mean of the corresponding category!
E. Apart from dropping missing values or replacing them with the mean or median, there are more advanced regression-based techniques that predict the missing values and fill them in, e.g. MICE (Multivariate Imputation by Chained Equations). You should read more about when advanced imputation techniques are helpful.
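The threshold checks in A and C can be done with a couple of lines of pandas. The following is a sketch, not a recommendation for a particular cutoff: the helper names missing_share and drop_sparse_columns, the "Sales" column, and the sample values are all assumptions.

```python
import numpy as np
import pandas as pd

def missing_share(df):
    """Fraction of missing values per column, between 0.0 and 1.0."""
    return df.isna().mean()

def drop_sparse_columns(df, threshold=0.05):
    """Keep only the columns whose missing share does not exceed `threshold`."""
    share = missing_share(df)
    return df.loc[:, share <= threshold]

# Toy data; "Sales" and all values are made up.
demo = pd.DataFrame({
    "Item_Weight": [9.3, np.nan, 17.5, 19.2],
    "Outlet_Size": ["Medium", None, None, "High"],
    "Sales": [249.8, 48.3, 141.6, 182.1],
})

share = missing_share(demo)                       # per-column fraction of NaNs
kept = drop_sparse_columns(demo, threshold=0.4)   # the 40% rule from point A
```

Passing axis=1 to mean() in the same way gives the share per row, for checking samples instead of features.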
The accepted answer is really nice.
In your specific case I'd say either drop the column or assign a new value called Missing. Since it's a categorical variable, there's a good chance it ends up going into a one-hot or target encoder (or being understood by the model as a category directly). Also, the fact that the value is NaN is information in itself; it can come from several causes (from bad data to technical difficulties in getting an answer, etc.). Be careful that this doesn't introduce bias or leak information you shouldn't have (example: products have NaN because they are absent from a certain database, something that would never happen in a real situation, which makes your result unrepresentative of reality).
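A minimal sketch of the "Missing" category approach, assuming pandas. The sample values are invented, and pd.get_dummies stands in for whichever encoder you actually use.

```python
import pandas as pd

# Invented sample of the Outlet_Size column.
sizes = pd.Series(["Medium", None, "Small", None, "High"], name="Outlet_Size")

# Treat "not recorded" as its own category instead of guessing a size.
with_label = sizes.fillna("Missing")

# One-hot encode; the missing rows get their own indicator column.
encoded = pd.get_dummies(with_label, prefix="Outlet_Size")
```

The model can then learn whether the absence of a size carries any signal of its own.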
The column "Outlet_Size" contains categorical data, so instead of dropping the data, fill it in.
Since it is categorical, use a measure of central tendency suited to categories: the mode.
Use the mode to find which category occurs most frequently and fill the column with that value.
Code:
# mode() returns a Series (there can be ties), so take its first entry
most_frequent = Dataframe['Outlet_Size'].mode()[0]
Dataframe['Outlet_Size'] = Dataframe['Outlet_Size'].fillna(most_frequent)

How to find the minimum value of netCDF array excluding zero

I am using Python 2 and working with netCDF data.
The array is a variable called cloud water mixing ratio, an output of the WRF climate model, and it has 4 dimensions:
QC(time (25), vertical level (69), latitude (119), longitude (199))
I'm trying to get the minimum of the values in this array. From an initial analysis using the NCVIEW visualisation tool, I found that the minimum value is approximately 1x10^-5 and the maximum is 1x10^-3.
I've used
var = fh.variables['QC']
var[:].max()
var[:].min()
The max works fine, but the min gives me 0.0.
Then I tried a solution I found in another answer, which is
var[var>0].min()
but I also get zero. Then I realised that the above code works for arrays with negatives, while mine doesn't have negatives.
I've tried looking for solutions here and there but found nothing that works for my situation. Please, if anyone could point me to the right directions, I'd appreciate it a lot.
Thanks.
var[var>0].min is a function; you need to call it using ().
var[var>0].min() should work much better.
Sorry for not being able to post the data; I don't have permission to share it. I tried creating a random 4D array similar to the data and used all the solutions you provided, especially the one by @Joao Abrantes, and they all seemed to work fine. So I thought there might be some problem with the data.
Fortunately, there is nothing wrong with the data. I discussed this with my friend and we finally found the solution.
The solution is
qc[:][qc[:]>0].min()
I have to add [:] after the variable instead of just doing
qc[qc>0].min()
There is also another way, which is to convert the variable to a NumPy array first, because qc = fh.variables['QC']
returns a netCDF4.Variable. Adding the second line, qc2 = qc[:], turns it into a numpy.ndarray (possibly a masked array, if the variable has fill values).
qc = fh.variables['QC']
qc2 = qc[:] # create numpy array
qc2[qc2>0].min()
I'm sorry if my question was not clear when I posted it yesterday; I only learned about this today.
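For anyone who wants to try this without the file, here is a sketch on a synthetic NumPy array that stands in for qc[:]. The shape and values are invented; the real variable would come from fh.variables['QC'][:].

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for qc[:]: small positive values with many exact zeros.
qc2 = rng.uniform(1e-5, 1e-3, size=(2, 3, 4, 5))
qc2[rng.random(qc2.shape) < 0.5] = 0.0

plain_min = qc2.min()                  # dominated by the zeros
positive_min = qc2[qc2 > 0].min()      # boolean indexing keeps only values > 0

# Equivalent alternative: mask the zeros instead of selecting the rest.
masked_min = np.ma.masked_equal(qc2, 0.0).min()
```

Both approaches operate on an in-memory array, which is why the [:] step matters when starting from a netCDF4.Variable.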
