Take a "nanmean" in xarray - python

I know that I can use the numpy nanmean function to take the mean of a numpy array, while ignoring NaN values. Is there an analogous way to accomplish this with xarray? I will give an example...
numpy_array=[1,2,3,4,float('nan'),5]
np.mean(numpy_array)
> NaN
np.nanmean(numpy_array)
> 3.0
In xarray, I can do
xarray_example.mean(dim='dimension')
How can I change this to a nanmean?
Thanks.

According to the xarray docs for the mean function, you can set the skipna parameter as True (it will skip missing float values by default). So:
xarray_example.mean(dim='dimension', skipna=True)

Related

'ignore nan' in numpy functions

The elementary functions in numpy, like mean() and std() returns np.nan when encounter np.nan. Can I make them ignore it?
The "normal" functions like np.mean and np.std evalutates the NaN i.e the result you've provided evaluates to NaN.
If you want to avoid that, use np.nanmean and np.nanstd. Note that since you have only one non-nan element the std is 0, thus you are dividing by zero.

why numpy max function(np.max) return wrong output?

I have pandas DataFrame and I turn it to numpy ndarray.I use max function for one column in my DataFrame like this:
print('column: ',df[:,3])
print('max: ',np.max(df[:,3]))
And the output was:
column: [0.6559999999999999 0.48200000000000004 0.9990000000000001 ..., 1.64 nan 0.07]
max: 0.07
But as you can see for example first value is greater than 0.07!!
What is the problem?
There are two problems here
It looks like column you are trying to find maximum for has the data type object. It's not recommended if you are sure that your column should contain numerical data since it may cause unpredictable behaviour not only in this particular case. Please check data types for your dataframe(you can do this by typing df.dtypes) and change it so that it corresponds to data you expect(for this case df[column_name].astype(np.float64)). This is also the reason for np.nanmax not working properly.
You don't want to use np.max on arrays, containing nans.
Solution
If you are sure about having object data type of column:
1.1. You can use the max method of Series, it should cast data to float automatically.
df.iloc[3].max()
1.2. You can cast data to propper type only for nanmax function.
np.nanmax(df.values[:,3].astype(np.float64)
1.3 You can drop all nan's from dataframe and find max[not recommended]:
np.max(test_data[column_name].dropna().values)
If type of your data is float64 and it shouldn't be object data type [recommended]:
df[column_name] = df[column_name].astype(np.float64)
np.nanmax(df.values[:,3])
Code to illustrate problem
#python
import pandas as pd
import numpy as np
test_data = pd.DataFrame({
'objects_column': np.array([0.7,0.5,1.0,1.64,np.nan,0.07]).astype(object),
'floats_column': np.array([0.7,0.5,1.0,1.64,np.nan,0.07]).astype(np.float64)})
print("********Using np.max function********")
print("Max of objects array:", np.max(test_data['objects_column'].values))
print("Max of floats array:", np.max(test_data['floats_column'].values))
print("\n********Using max method of series function********")
print("Max of objects array:", test_data["objects_column"].max())
print("Max of floats array:", test_data["objects_column"].max())
Returns:
********Using np.max function********
Max of objects array: 0.07
Max of floats array: nan
********Using max method of series function********
Max of objects array: 1.64
Max of floats array: 1.64
np.max is an alias for the function np.amax which according to documentation doesn't play well with NaN values. In order to ignore NaN values you should use np.nanmax instead

pandas - vectorized formula computation with nans

I have a DataFrame (Called signal) that is a simple timeseries with 5 columns. This is what its .describe() looks like:
ES NK NQ YM
count 5294.000000 6673.000000 4798.000000 3415.000000
mean -0.000340 0.000074 -0.000075 -0.000420
std 0.016726 0.018401 0.023868 0.015399
min -0.118724 -0.156342 -0.144667 -0.103101
25% -0.008862 -0.010297 -0.011481 -0.008162
50% -0.001422 -0.000590 -0.001747 -0.001324
75% 0.007069 0.009163 0.009841 0.006304
max 0.156365 0.192686 0.181245 0.132630
I want to apply a simple function on every single row, and receive back a matrix with the same dimensions:
weights = -2*signal.subtract( signal.mean(axis=1), axis=0).divide( signal.sub( signal.mean(axis=1), axis=0).abs().sum(axis=1), axis=0 )
However, when I run this line, the program gets stuck. I believe this issue comes from the difference in length/presence of nans. Dropping the nans/filling it is not an option, for any given row that has a nan I want that nan to simply be excluded from the computation. A temporary solution would be to do this iteratively using .iterrows(), but this is not an efficient solution.
Are there any smart solutions to this problem?
The thing is, the pandas mean and sum methods already exclude NaN values by default (see the description of the skipna keyword in the linked docs). Additionally, subtract and divide allow for the use of a fill_value keyword arg:
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
So you may be able to get what you want by setting fill_value=0 in the calls to subtract, and fill_value=1 in the calls to divide.
However, I suspect that the default behavior (NaN is ignored in mean and sum, NaN - anything = NaN, NaN\anything = NaN) is what you actually want. In that case, your problem isn't directly related to NaNs, and you're going to have to clarify your statement "when I run this line, the program gets stuck" in order to get a useful answer.

How to impute each categorical column in numpy array

There are good solutions to impute panda dataframe. But since I am working mainly with numpy arrays, I have to create new panda DataFrame object, impute and then convert back to numpy array as follows:
nomDF=pd.DataFrame(x_nominal) #Convert np.array to pd.DataFrame
nomDF=nomDF.apply(lambda x:x.fillna(x.value_counts().index[0])) #replace NaN with most frequent in each column
x_nominal=nomDF.values #convert back pd.DataFrame to np.array
Is there a way to directly impute in numpy array?
We could use Scipy's mode to get the highest value in each column. Leftover work would be to get the NaN indices and replace those in input array with the mode values by indexing.
So, the implementation would look something like this -
from scipy.stats import mode
R,C = np.where(np.isnan(x_nominal))
vals = mode(x_nominal,axis=0)[0].ravel()
x_nominal[R,C] = vals[C]
Please note that for pandas, with value_counts, we would be choosing the highest value in case of many categories/elements with the same highest count. i.e. in tie situations. With Scipy's mode, it would be lowest one for such tie cases.
If you are dealing with such mixed dtype of strings and NaNs, I would suggest few modifications, keeping the last step unchanged to make it work -
x_nominal_U3 = x_nominal.astype('U3')
R,C = np.where(x_nominal_U3=='nan')
vals = mode(x_nominal_U3,axis=0)[0].ravel()
This throws a warning for the mode calculation : RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored.
"values. nan values will be ignored.", RuntimeWarning). But since, we actually want to ignore NaNs for that mode calculation, we should be okay there.

mean() of column in pandas DataFrame returning inf: how can I solve this?

I'm trying to implement some machine learning algorithms, but I'm having some difficulties putting the data together.
In the example below, I load a example data-set from UCI, remove lines with missing data (thanks to the help from a previous question), and now I would like to try to normalize the data.
For many datasets, I just used:
valores = (valores - valores.mean()) / (valores.std())
But for this particular dataset the approach above doesn't work. The problem is that the mean function is returning inf, perhaps due to a precision issue. See the example below:
bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
for col in bcw.columns:
if bcw[col].dtype != 'int64':
print "Removendo possivel '?' na coluna %s..." % col
bcw = bcw[bcw[col] != '?']
valores = bcw.iloc[:,1:10]
#mean return inf
print valores.iloc[:,5].mean()
My question is how to deal with this. It seems that I need to change the type of this column, but I don't know how to do it.
not so familiar with pandas but if you convert to a numpy array it works, try
np.asarray(valores.iloc[:,5], dtype=np.float).mean()
NaN values should not matter when computing the mean of a pandas.Series. Precision is also irrelevant. The only explanation I can think of is that one of the values in valores is equal to infinity.
You could exclude any values that are infinite when computing the mean like this:
import numpy as np
is_inf = valores.iloc[:, 5] == np.inf
valores.ix[~is_inf, 5].mean()
If the elements of the pandas series are strings you get inf and the mean result. In this specific case you can simply convert the pandas series elements to float and then calculate the mean. No need to use numpy.
Example:
valores.iloc[:,5].astype(float).mean()
I had the same problem with a column that was of dtype 'o', and whose max value was 9999. Have you tried using the convert_objects method with the convert_numeric=True parameter? This fixed the problem for me.
For me, the reason was an overflow: my original data was in float16 and calling .mean() on that would return inf. After converting my data to float32 (e.g. via .astype("float32")), .mean worked as expected.

Categories

Resources