I'm trying to do a bootstrap analysis using pandas' bootstrap_plot. My dataset has NaNs, and I get the following error message:
AttributeError: max must be larger than min in range parameter.
If I fill the data with fillna(0), it works, but then I'm changing my data set. Is there a reason why bootstrap_plot (and autocorrelation_plot, for that matter) don't do the Right Thing about the NaNs?
It's a little clunky, but maybe this:
bootstrap_plot(df[df['x'].notnull()]['x'])
Re your question about bootstrap_plot doing the Right Thing: this is an area where pandas is still improving, but there will often be a bit of manual labor involved, and it's generally not hard to do something with fillna or notnull. And honestly, it's often a feature to be forced to handle this explicitly, rather than have missing values handled automatically in a way you might not have liked, or even been aware of.
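A minimal sketch of the same idea (the data here is made up): dropna() returns a copy with the NaNs removed, so the underlying data is left untouched, unlike fillna(0).

```python
import numpy as np
import pandas as pd

# Hypothetical series with missing values.
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0], name='x')

# dropna() returns a new series without the NaNs; the original
# series s is unchanged, so no data is being overwritten.
clean = s.dropna()

# The cleaned series can then be plotted, e.g.:
# from pandas.plotting import bootstrap_plot
# bootstrap_plot(clean, size=3, samples=100)
```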
I am a beginner in machine learning. My question is: how should I encode the column "OECDSTInterbkRate"? I don't know how to replace the missing values, and especially with what. Should I just delete them? Or replace them with the mean/median of the values?
There are many approaches to this issue.
The simplest: if you have a huge amount of data, drop the NaNs.
Replace the NaNs with the mean/median/etc. of the whole non-NaN dataset, or of the dataset grouped by one or several columns. E.g. for your dataset you can fill the Australia NaNs with the mean of the Australian non-NaNs, and the same for the other countries.
A common approach is to create an additional indicator column during imputation, marking the rows where the missing data was replaced with a value. This column is then fed as yet another input to your ML algorithm.
Take a look at the docs (assuming you work with Pandas) - the developers of the library have already created some tools for the missing data: https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html
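A sketch combining the grouped-mean fill with the indicator column (the frame and country names here are made up, loosely following the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical frame: an interbank rate with gaps, grouped by country.
df = pd.DataFrame({
    'Country': ['Australia', 'Australia', 'Australia', 'Chile', 'Chile'],
    'OECDSTInterbkRate': [4.5, np.nan, 5.5, np.nan, 3.0],
})

# Indicator column: remember which rows were imputed, to use as an
# extra feature later.
df['RateWasMissing'] = df['OECDSTInterbkRate'].isna().astype(int)

# Fill each country's NaNs with that country's own mean.
df['OECDSTInterbkRate'] = (
    df.groupby('Country')['OECDSTInterbkRate']
      .transform(lambda x: x.fillna(x.mean()))
)
```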
There's no single answer to your question; it's a general problem in statistics called "imputation". Depending on the application, the answer could be many things.
A few alternatives come to mind, but don't forget that "no data" is almost always better than "bad/wrong data". If you have more than enough rows besides the rows with NaNs, you may simply drop them. Otherwise, consider the following:
Can you mathematically calculate the column that you need by the other columns that you already have in your dataset? If so, you have your answer.
Check the correlation of the particular column (using its non-missing rows) with the other columns and see if they are highly correlated. If so, you might as well try dropping the whole column (not always a good idea, but often reasonable, since the information is largely redundant).
Can you create an estimator (such as a regression model) to predict the missing values with really good accuracy, by learning the pattern from the values you already have in the other columns? If so, you might have an answer (this needs benchmarking against a baseline). Keep in mind that this is a risky operation that can produce bad estimates and decrease the performance of your overall model. Try it only if your estimates are really good!
Is it a regression problem? Using the statistical mean could be a good idea.
Is it a classification problem? Using median could be a good idea.
In some cases using mode might also be a good idea depending on the distribution.
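A minimal sketch (with made-up numbers) of how those three choices differ; note the outlier pulls the mean far away while the median and mode stay robust:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value and one large outlier.
s = pd.Series([1.0, 2.0, 2.0, np.nan, 100.0])

filled_mean = s.fillna(s.mean())       # sensitive to the outlier
filled_median = s.fillna(s.median())   # robust to the outlier
filled_mode = s.fillna(s.mode()[0])    # most frequent value
```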
I suggest you try all of these and see which works best, because there's really no single answer to your problem. You can create a machine learning model without using the column, use its performance as a baseline, and then benchmark the performance (accuracy) of each of the steps above against that baseline.
Note: I am just a graduate student with some insights; please comment if anything I said is not correct!
So, I am trying to pivot this data (link here) so that all the metrics/numbers are in one column, with another column being the ID column. Obviously, having a ton of data/metrics spread across a bunch of columns is much harder to compare and do calculations on than having it all in one column.
So, I know what tools I need for this: pandas, findall, wide_to_long (or melt), and maybe stack. However, I am having difficulty putting them all in the right place.
I can easily import the data into the df and view it, but when it comes to using findall with wide_to_long to pivot the data, I get pretty confused. I am basing my idea on this example (about halfway down, where they use findall / regex to define new column names). I am looking to create a new column for each category of metric (i.e. population estimate is one column and % change is another; they should not all be one column).
Can someone help me set up the syntax correctly for this part? I am not good at the expressions dealing with pattern recognition.
Thank you.
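A hedged sketch of the wide_to_long part (the column names here are invented, since the actual data was only linked, not shown): the stubnames are the metric prefixes, and the numeric suffix becomes its own column.

```python
import pandas as pd

# Hypothetical wide frame: one row per ID, each metric repeated per year.
wide = pd.DataFrame({
    'id': ['A', 'B'],
    'pop_est2019': [100, 200],
    'pop_est2020': [110, 210],
    'pct_change2019': [1.0, 2.0],
    'pct_change2020': [1.5, 2.5],
})

# stubnames name the metric prefixes; the default suffix regex (\d+)
# peels the trailing year off into the j column. Each stub becomes its
# own column, as asked for, rather than everything melting into one.
tidy = pd.wide_to_long(wide, stubnames=['pop_est', 'pct_change'],
                       i='id', j='year').reset_index()
```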
I have a function that creates new columns and data in a pandas dataframe. I am now trying to move this testing method to dask, to be able to test on larger sets of data. I am having trouble finding the problem, as my function does not throw any errors; the data is just wrong. I came to the conclusion that it must be an issue with the functions I am calling. What am I missing? I think it's here, but if it were, I would expect Python to give me an error, and it doesn't. I recently saw that transform is not supported. I also believe between_time is not supported.
validSignalTime = (df1.index.time >= en) & (df1.index.time <= NoSignalAfter)
time_condition = df1.index.isin(df1.between_time(st, en, include_start=True,
                                                 include_end=False).index)
df1['Entry_Price'] = (df1[time_condition]
                      .groupby(df1[time_condition].index.date)['High']
                      .transform('cummax'))
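One way to localize the bug is to run the same logic in plain pandas on a tiny frame first and use the result as ground truth to compare the dask output against. A sketch with made-up prices and times (note that in recent pandas versions, include_start/include_end were replaced by the single inclusive argument):

```python
import pandas as pd

# Tiny hypothetical intraday frame: six 30-minute bars on one day.
idx = pd.date_range('2021-01-04 09:00', periods=6, freq='30min')
df1 = pd.DataFrame({'High': [10, 12, 11, 15, 14, 13]}, index=idx)

st, en = '09:30', '10:30'

# inclusive='left' matches the old include_start=True, include_end=False.
time_condition = df1.index.isin(
    df1.between_time(st, en, inclusive='left').index)

# Running cummax of 'High' within the window, grouped per date;
# rows outside the window are left as NaN by index alignment.
df1['Entry_Price'] = (df1[time_condition]
                      .groupby(df1[time_condition].index.date)['High']
                      .transform('cummax'))
```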
Hi,
Attached is the data. Can you please help me handle the missing data in the "Outlet_Size" column, so that I can use the complete data for preparing data science models?
Thanks,
This is one of the major challenges of data mining (or machine learning) problems: YOU decide what to do with the missing data, based on experience. You mustn't look at data science as a black box that follows a fixed series of steps to success!
Some guidelines about missing data.
A. If more than 40% of the data in a column is missing, drop it! (Again, the 40% threshold depends on the type of problem you're working with: whether the data is super crucial, or trivial enough to ignore.)
B. Check if there is some way you can impute the missing data from external sources. You're looking at item weight! If there is any way to know which product you're dealing with, instead of the hashed Item_Identifier, then you can literally Google it and figure it out.
C. Missing data can be classified into two types:
MCAR: missing completely at random. This is the desirable scenario in case of missing data.
MNAR: missing not at random. Missing not at random data is a more serious issue and in this case it might be wise to check the data gathering process further and try to understand why the information is missing. For instance, if most of the people in a survey did not answer a certain question, why did they do that? Was the question unclear?
Assuming data is MCAR, too much missing data can be a problem too. Usually a safe maximum threshold is 5% of the total for large datasets. If missing data for a certain feature or sample is more than 5%, you should probably leave that feature or sample out. We therefore check for features (columns) and samples (rows) where more than 5% of the data is missing, using a simple function.
D. As posted in the comments, you can simply drop the rows using df.dropna(), or fill them with infinity, or fill them with the group mean using df["value"] = df.groupby("name")["value"].transform(lambda x: x.fillna(x.mean()))
This groups the value column of dataframe df by the name category, finds the mean within each category, and fills each missing value in value with the corresponding mean of its category!
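A self-contained sketch of that grouped fill, with made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one missing value in group 'a'.
df = pd.DataFrame({
    'name': ['a', 'a', 'b', 'b'],
    'value': [1.0, np.nan, 10.0, 20.0],
})

# Fill each group's NaNs with that group's mean:
# group 'a' mean is 1.0, group 'b' has no NaNs to fill.
df['value'] = (df.groupby('name')['value']
                 .transform(lambda x: x.fillna(x.mean())))
```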
E. Apart from dropping missing values or replacing them with the mean or median, there are more advanced regression-based techniques that predict the missing values and fill them in, e.g. MICE (Multivariate Imputation by Chained Equations). You should read more about when such advanced imputation techniques are helpful.
The accepted answer is really nice.
In your specific case, I'd say either drop the column or assign a new category called Missing. Since it's a categorical variable, there's a good chance it ends up going through a one-hot or target encoder (or being understood by the model as a category directly). Also, the fact that the value is NaN is information in itself; it can come from multiple factors (from bad data to technical difficulties getting an answer, etc.). Be careful, though, and watch that this doesn't introduce bias or leak information you shouldn't have (for example: the products have NaN because they're not in a certain base, something that would never happen in a real situation, which would make your result unrepresentative of the true situation).
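A minimal sketch of the "Missing becomes its own category" idea (the values here are made up); the new category then flows through one-hot encoding like any other:

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps.
df = pd.DataFrame({'Outlet_Size': ['Small', np.nan, 'Medium', np.nan]})

# Treat missingness itself as a category.
df['Outlet_Size'] = df['Outlet_Size'].fillna('Missing')

# One-hot encode: 'Missing' gets its own indicator column.
dummies = pd.get_dummies(df['Outlet_Size'])
```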
The column "Outlet_Size" contains categorical data, so instead of dropping the rows, use a measure of central tendency to fill them in. Since it is categorical data, that measure is the mode: find which category occurs most frequently and fill the column with that value.
Code (note that .mode() returns a Series, so take its first element with [0] before passing it to fillna):
Dataframe['Outlet_Size'].mode()
Dataframe['Outlet_Size'].fillna(Dataframe['Outlet_Size'].mode()[0], inplace=True)
I have been trying to find proper documentation for the freq argument used throughout pandas. For example, to resample a dataframe we can do something like
df.resample(rule='W', how='sum')
which will resample this weekly. I was wondering what are the other options and how can I define custom frequency/rules.
EDIT: To clarify, I am asking what the other legal options for rule are.
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
And, almost immediately below that: W-SAT and others.
I'll admit, links to this particular piece of documentation are pretty scarce. More general frequencies can be represented by supplying a DateOffset instance. Even more general resamplings can be done via groupby.