How to impute each categorical column in numpy array - python

There are good solutions for imputing a pandas DataFrame. But since I am working mainly with numpy arrays, I have to create a new pandas DataFrame object, impute it, and then convert back to a numpy array, as follows:
nomDF=pd.DataFrame(x_nominal) #Convert np.array to pd.DataFrame
nomDF=nomDF.apply(lambda x:x.fillna(x.value_counts().index[0])) #replace NaN with most frequent in each column
x_nominal=nomDF.values #convert back pd.DataFrame to np.array
Is there a way to directly impute in numpy array?

We could use SciPy's mode to get the most frequent value in each column. The leftover work would be to get the NaN indices and replace those in the input array with the corresponding column's mode value by indexing.
So, the implementation would look something like this -
import numpy as np
from scipy.stats import mode
R, C = np.where(np.isnan(x_nominal))       # row/column indices of the NaNs
vals = mode(x_nominal, axis=0)[0].ravel()  # most frequent value per column
x_nominal[R, C] = vals[C]                  # fill each NaN with its column's mode
Please note that for pandas, with value_counts, we would be choosing the highest value when several categories/elements share the same highest count, i.e. in tie situations. With SciPy's mode, it would be the lowest one in such tie cases.
If you are dealing with such a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work -
x_nominal_U3 = x_nominal.astype('U3')         # cast to fixed-width strings; NaN becomes the string 'nan'
R,C = np.where(x_nominal_U3=='nan')           # locate the 'nan' entries
vals = mode(x_nominal_U3,axis=0)[0].ravel()   # most frequent string per column
This throws a warning for the mode calculation: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. But since we actually want to ignore NaNs for that mode calculation, we should be okay there.

Related

why does the numpy max function (np.max) return the wrong output?

I have a pandas DataFrame and I turn it into a numpy ndarray. I use the max function on one column of it like this:
print('column: ',df[:,3])
print('max: ',np.max(df[:,3]))
And the output was:
column: [0.6559999999999999 0.48200000000000004 0.9990000000000001 ..., 1.64 nan 0.07]
max: 0.07
But as you can see, the first value, for example, is greater than 0.07!
What is the problem?
There are two problems here:
1. It looks like the column you are trying to find the maximum of has the data type object. That is not recommended if you are sure your column should contain numerical data, since it may cause unpredictable behaviour, not only in this particular case. Please check the data types of your DataFrame (you can do this by typing df.dtypes) and change them so that they correspond to the data you expect (for this case, df[column_name].astype(np.float64)). This is also the reason np.nanmax is not working properly.
2. You don't want to use np.max on arrays containing NaNs.
Solution
1. If you are sure about the column having the object data type:
1.1. You can use the max method of Series; it should cast the data to float automatically.
df.iloc[:, 3].max()
1.2. You can cast the data to the proper type just for the nanmax call.
np.nanmax(df.values[:,3].astype(np.float64))
1.3. You can drop all NaNs from the DataFrame and find the max [not recommended]:
np.max(test_data[column_name].dropna().values)
2. If your data is really float64 and it shouldn't have the object data type [recommended]:
df[column_name] = df[column_name].astype(np.float64)
np.nanmax(df.values[:,3])
Code to illustrate the problem
import pandas as pd
import numpy as np
test_data = pd.DataFrame({
    'objects_column': np.array([0.7, 0.5, 1.0, 1.64, np.nan, 0.07]).astype(object),
    'floats_column': np.array([0.7, 0.5, 1.0, 1.64, np.nan, 0.07]).astype(np.float64)})
print("********Using np.max function********")
print("Max of objects array:", np.max(test_data['objects_column'].values))
print("Max of floats array:", np.max(test_data['floats_column'].values))
print("\n********Using max method of series function********")
print("Max of objects array:", test_data["objects_column"].max())
print("Max of floats array:", test_data["objects_column"].max())
Returns:
********Using np.max function********
Max of objects array: 0.07
Max of floats array: nan
********Using max method of series function********
Max of objects array: 1.64
Max of floats array: 1.64
np.max is an alias for the function np.amax, which, according to the documentation, doesn't play well with NaN values. In order to ignore NaN values, you should use np.nanmax instead.
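A quick way to see the difference on a plain float64 array (the values below are made up):
import numpy as np
a = np.array([0.656, 0.482, 0.999, 1.64, np.nan, 0.07])
print(np.max(a))     # nan: np.max/np.amax propagates NaN
print(np.nanmax(a))  # 1.64: np.nanmax ignores NaN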

loss of precision when operating on a pandas dataframe with NaN values

I have a pandas DataFrame where I would like to subtract two column values:
df = pd.DataFrame({"Label":["NoPrecisionLoss"],
"FirstNsae":[1577434369549916003],
"SecondNsae":[1577434369549938679]})
print(df.SecondNsae - df.FirstNsae)
The result of the subtraction is the correct 22676.
Now, when the input dataframe gets a second row with a nan value in it:
df2 = pd.DataFrame({"Label":["PrecisionLoss","NeedsToBeRemoved"],
"FirstNsae":[1577434369549916003,np.nan],
"SecondNsae":[1577434369549938679,66666666666666]})
This nan value is nasty so we will remove the row that contains it:
df2 = df2[np.isfinite(df2.FirstNsae) & np.isfinite(df2.SecondNsae)]
Let's convert the FirstNsae column back to being an int (FirstNsae is assigned to be float because of the nan value in the second row):
df2 = df2.astype({"FirstNsae":int}) # this is futile since precision has already been lost
print(df2.SecondNsae - df2.FirstNsae)
Printing the difference between the two columns produces 22775.
How can I avoid losing precision when constructing dataframes with extremely large integers in the possible presence of NaNs?
Thank you!
To elaborate on piRSquared's answer (in the comments to the original question), here is an approach that solved the original issue:
df2 = pd.DataFrame({"Label": ["PrecisionLoss", "NeedsToBeRemoved"],
                    "FirstNsae": [1577434369549916003, np.nan],
                    "SecondNsae": [1577434369549938679, 66666666666666]},
                   dtype=object)
df2 = df2[np.isfinite(df2.FirstNsae.astype(float)) &
          np.isfinite(df2.SecondNsae.astype(float))]
print(df2.SecondNsae - df2.FirstNsae)
prints 22676!
Update: Since pandas version 1.0.0, this is not an issue anymore; integer columns are allowed to contain missing values. https://pandas.pydata.org/pandas-docs/version/1.0.0/user_guide/missing_data.html#missing-data-na
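For reference, a minimal sketch of the nullable integer dtype that the update refers to, reusing the values from the question; the columns stay integer even with a missing entry, so no precision is lost:
import pandas as pd
df = pd.DataFrame({"Label": ["PrecisionLoss", "NeedsToBeRemoved"],
                   "FirstNsae": pd.array([1577434369549916003, pd.NA], dtype="Int64"),
                   "SecondNsae": pd.array([1577434369549938679, 66666666666666], dtype="Int64")})
print(df.SecondNsae - df.FirstNsae)  # first row: 22676, second row: <NA>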

pandas - vectorized formula computation with nans

I have a DataFrame (called signal) that is a simple time series with 5 columns. This is what its .describe() looks like:
ES NK NQ YM
count 5294.000000 6673.000000 4798.000000 3415.000000
mean -0.000340 0.000074 -0.000075 -0.000420
std 0.016726 0.018401 0.023868 0.015399
min -0.118724 -0.156342 -0.144667 -0.103101
25% -0.008862 -0.010297 -0.011481 -0.008162
50% -0.001422 -0.000590 -0.001747 -0.001324
75% 0.007069 0.009163 0.009841 0.006304
max 0.156365 0.192686 0.181245 0.132630
I want to apply a simple function on every single row, and receive back a matrix with the same dimensions:
weights = -2 * signal.subtract(signal.mean(axis=1), axis=0).divide(
    signal.sub(signal.mean(axis=1), axis=0).abs().sum(axis=1), axis=0)
However, when I run this line, the program gets stuck. I believe this issue comes from the difference in length / the presence of NaNs. Dropping or filling the NaNs is not an option; for any given row that has a NaN, I want that NaN to simply be excluded from the computation. A temporary solution would be to do this iteratively using .iterrows(), but that is not an efficient solution.
Are there any smart solutions to this problem?
The thing is, the pandas mean and sum methods already exclude NaN values by default (see the description of the skipna keyword in the linked docs). Additionally, subtract and divide allow for the use of a fill_value keyword arg:
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
So you may be able to get what you want by setting fill_value=0 in the calls to subtract, and fill_value=1 in the calls to divide.
However, I suspect that the default behavior (NaN is ignored in mean and sum, NaN - anything = NaN, NaN / anything = NaN) is what you actually want. In that case, your problem isn't directly related to NaNs, and you're going to have to clarify your statement "when I run this line, the program gets stuck" in order to get a useful answer.
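For reference, here is a small sketch of the same computation broken into steps under the default NaN handling (the toy frame is made up; NaN cells are skipped by mean and sum and simply stay NaN in the result):
import numpy as np
import pandas as pd
signal = pd.DataFrame({"ES": [0.01, np.nan, -0.02],
                       "NK": [0.02, 0.01, np.nan],
                       "NQ": [-0.01, 0.03, 0.00],
                       "YM": [0.00, np.nan, 0.01]})
row_mean = signal.mean(axis=1)                         # skipna=True by default
dev = signal.sub(row_mean, axis=0)                     # NaN cells stay NaN
weights = -2 * dev.div(dev.abs().sum(axis=1), axis=0)  # abs().sum() also skips NaN
print(weights)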

Values being altered in numpy array

So I have a 2D numpy array (256, 256) containing values between 0 and 10, which is essentially an image. I need to remove the 0 values and set them to NaN so that I can plot the array using a specific library (APLpy). However, whenever I try to change all of the 0 values, some of the other values get altered, in some cases to 100 times their original value (no idea why).
The code I'm using is:
for index, value in np.ndenumerate(tex_data):
    if value == 0:
        tex_data[index] = 'NaN'
where tex_data is the data array from which I need to remove the zeros. Unfortunately I can't just use a mask for the values I don't need, as APLpy won't accept masked arrays as far as I can tell.
Is there anyway I can set the 0 values to NaN without changing the other values in the array?
Use fancy-indexing. Like this:
tex_data[tex_data==0] = np.nan
I don't know why your original code was failing. It looks correct to me, although terribly inefficient.
Using floating-point rules,
tex_data/tex_data*tex_data
also does the job here, since 0/0 evaluates to NaN while every nonzero x gives back x.
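For completeness, an equivalent approach using np.where (not part of either original answer; the small array stands in for the real image data):
import numpy as np
tex_data = np.array([[0.0, 3.2], [7.5, 0.0]])         # stand-in for the (256, 256) image
tex_data = np.where(tex_data == 0, np.nan, tex_data)  # new float array with 0 replaced by NaN
print(tex_data)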

Comparison of two lists with NaN python

I believed this to be a simple question and looked for related topics, but I didn't find the right thing. Here is the problem:
I have two NumPy arrays for which I need to do some statistical analysis by calculating some criteria, for example the correlation coefficient and the Nash criterion (for those who are familiar with Nash). Since the first array holds observation data (the second holds simulation results), it contains some NaNs. I would like my program to calculate the criteria while ignoring the value pairs where the value in the first array is NaN.
I tried the mask method. It worked well when I only needed to deal with the first array (for calculating its average, for example), but it didn't work for value-by-value comparisons of the two arrays.
Could anyone give some help? Thanks!
Just answered a similar question, Numpy only on finite entries. You can replace the NaN values in your array using NumPy's isnan function, which is a common way to deal with NaN values.
import numpy as np
replace_NaN = np.isnan(array_name)  # boolean mask marking the NaN positions
array_name[replace_NaN] = 0         # overwrite the NaNs with 0
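If the goal is to drop the pairs where the observation is NaN rather than zero-filling them, a minimal sketch along those lines (the arrays and the use of np.corrcoef are illustrative assumptions, not part of the original answer):
import numpy as np
obs = np.array([1.2, np.nan, 3.4, 2.8, np.nan])    # observations, may contain NaN
sim = np.array([1.0, 2.0, 3.0, 3.1, 4.0])          # simulation results
valid = ~np.isnan(obs)                             # keep only pairs with a real observation
corr = np.corrcoef(obs[valid], sim[valid])[0, 1]   # correlation over the kept pairs
print(corr)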
