So I have a 2D numpy array of shape (256, 256), containing values between 0 and 10, which is essentially an image. I need to remove the 0 values and set them to NaN so that I can plot the array using a specific library (APLpy). However, whenever I try to change all of the 0 values, some of the other values get altered, in some cases to 100 times their original value (no idea why).
The code I'm using is:
for index, value in np.ndenumerate(tex_data):
    if value == 0:
        tex_data[index] = 'NaN'
where tex_data is the data array from which I need to remove the zeros. Unfortunately I can't just use a mask for the values I don't need, as APLpy won't accept masked arrays as far as I can tell.
Is there any way I can set the 0 values to NaN without changing the other values in the array?
Use fancy-indexing. Like this:
tex_data[tex_data==0] = np.nan
I don't know why your original code was failing. It looks correct to me, although terribly inefficient.
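One thing worth checking (the question doesn't show the dtype, so this is an assumption on my part): NaN can only be stored in a floating-point array, so if tex_data has an integer dtype it needs converting first. A minimal sketch with a stand-in array:

import numpy as np
tex_data = np.arange(9).reshape(3, 3)   # stand-in integer image
tex_data = tex_data.astype(float)       # NaN requires a float dtype
tex_data[tex_data == 0] = np.nan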
Relying on float rules,
tex_data / tex_data * tex_data
does the job here as well: 0/0 evaluates to NaN, so every zero becomes NaN while the non-zero values stay unchanged.
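For illustration, a small sketch of that trick on a toy float array (the np.errstate context just silences the 0/0 invalid-value warning):

import numpy as np
tex_data = np.array([[0.0, 2.5],
                     [7.0, 0.0]])
# 0/0 evaluates to NaN under IEEE float rules, while x/x*x == x for non-zero x
with np.errstate(invalid='ignore'):
    tex_data = tex_data / tex_data * tex_data
print(tex_data)   # [[nan 2.5], [7. nan]]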
I've got a dataframe that looks like this:
[index, Data]
[1, [5,3,6,8,4,5,7etc]]
The data in my "Data" column is stored in an array. Each array needs to contain at least 75 values. The dataframe is 438 rows long.
I need to make a filter so that all the arrays that contain fewer than 75 values are replaced by NaN.
I thought of something like this:
for i in range(len(df_window)):
    if len(df_window['Data'][i][0]) < 75:
I don't know if this is right or how to continue. The dataframe is called df_window.
Can someone help me quickly, please?
You can use lengths = df_window['Data'].apply(len) to get the series of array lengths. Then, by using df_window.loc[(lengths < 75), 'Data'] = np.nan, you should get what you want.
EDIT: Corrected first line.
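Putting it together, a minimal sketch with a made-up df_window (the column name and the 75-value threshold are taken from the question):

import numpy as np
import pandas as pd

df_window = pd.DataFrame({'Data': [np.arange(100), np.arange(50), np.arange(80)]})

lengths = df_window['Data'].apply(len)          # length of each array
df_window.loc[lengths < 75, 'Data'] = np.nan    # arrays shorter than 75 become NaN

Here the second row (length 50) becomes NaN while the other rows keep their arrays.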
I am trying to find the maximum value in a 1D array using the max function in Python. However, these arrays may contain NaN as a consequence of missing data (flagged astronomical data). Every time I try to find the max value in the array, it gives me NaN as the maximum value. I was wondering if there is a way to find the maximum real number in the array.
I don't believe the two functions 'min', 'max' in Python are affected by 'nan' values. Something is wrong with your code logic. As tested with both Python 2 and 3, min/max functions give correct output values.
There's no code in your question, but my guess is that you may be confusing NaN (the not-a-number float value) with "NAN" as a string constant. Here's a possible case where the max function gives "NAN" as its output.
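A sketch of such a case, assuming the values were read in from a file as strings rather than floats:

data = ['5.1', 'NAN', '3.7']   # numeric values accidentally kept as strings
print(max(data))               # -> 'NAN', because strings compare lexicographically and 'N' > '5'

If the array really is a float array containing NaN (as with flagged astronomical data), np.nanmax(arr) ignores the NaNs and returns the largest real number.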
There are good solutions for imputing a pandas DataFrame. But since I am working mainly with numpy arrays, I have to create a new pandas DataFrame object, impute it, and then convert back to a numpy array, as follows:
import pandas as pd

nomDF=pd.DataFrame(x_nominal) #Convert np.array to pd.DataFrame
nomDF=nomDF.apply(lambda x:x.fillna(x.value_counts().index[0])) #replace NaN with most frequent in each column
x_nominal=nomDF.values #convert back pd.DataFrame to np.array
Is there a way to directly impute in numpy array?
We could use SciPy's mode to get the most frequent value in each column. The leftover work would be to get the NaN indices and replace those entries in the input array with the mode values by indexing.
So, the implementation would look something like this -
import numpy as np
from scipy.stats import mode
R,C = np.where(np.isnan(x_nominal))
vals = mode(x_nominal,axis=0)[0].ravel()
x_nominal[R,C] = vals[C]
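For reference, a minimal sketch of the whole imputation on a toy float array; on newer SciPy versions you may need to pass nan_policy='omit' so the NaNs themselves are not counted when computing the mode:

import numpy as np
from scipy.stats import mode

x_nominal = np.array([[1., 8.],
                      [1., np.nan],
                      [np.nan, 8.],
                      [4., 5.]])

R, C = np.where(np.isnan(x_nominal))                          # positions of the NaNs
vals = mode(x_nominal, axis=0, nan_policy='omit')[0].ravel()  # most frequent value per column
x_nominal[R, C] = vals[C]                                     # fill each NaN with its column's mode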
Please note that with pandas' value_counts we would be choosing the highest value when many categories/elements share the same highest count, i.e. in tie situations. With SciPy's mode, it would be the lowest one for such ties.
If you are dealing with such a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work -
x_nominal_U3 = x_nominal.astype('U3')
R,C = np.where(x_nominal_U3=='nan')
vals = mode(x_nominal_U3,axis=0)[0].ravel()
This throws a warning for the mode calculation: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. But since we actually want to ignore NaNs for that mode calculation, we should be okay there.
I have a largish 2d numpy array, and I want to extract the lowest 10 elements of each row as well as their indexes. Since my array is largish, I would prefer not to sort the whole array.
I heard about the argpartition() function, with which I can get the indexes of the lowest 10 elements:
top10indexes = np.argpartition(myBigArray,10)[:,:10]
Note that argpartition() partitions axis -1 by default, which is what I want. The result here has the same shape as myBigArray containing indexes into the respective rows such that the first 10 indexes point to the 10 lowest values.
How can I now extract the elements of myBigArray corresponding to those indexes?
Obvious fancy indexing like myBigArray[top10indexes] or myBigArray[:,top10indexes] does something quite different. I could also use a list comprehension, something like:
array([row[idxs] for row,idxs in zip(myBigArray,top10indexes)])
but that would incur a performance hit iterating numpy rows and converting the result back to an array.
nb: I could just use np.partition() to get the values, and they may even correspond to the indexes (or may not..), but I don't want to do the partition twice if I can avoid it.
You can avoid using the flattened copies and the need to extract all the values by doing:
num = 10
top = np.argpartition(myBigArray, num, axis=1)[:, :num]
myBigArray[np.arange(myBigArray.shape[0])[:, None], top]
For NumPy >= 1.9.0 this will be very efficient and comparable to np.take().
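Alternatively, on NumPy 1.15 or later the same gather can be written with np.take_along_axis, which handles the row-index broadcasting for you. A sketch with a stand-in array in place of myBigArray:

import numpy as np

rng = np.random.default_rng(0)
myBigArray = rng.random((1000, 500))   # stand-in for the question's array

num = 10
top = np.argpartition(myBigArray, num, axis=1)[:, :num]
lowest10 = np.take_along_axis(myBigArray, top, axis=1)   # values of the 10 lowest entries per row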
I have a dataframe where some entries in column_1 have NaN values. I want to replace these by the corresponding values in column_2. Both columns hold float64 values.
I tried the following but strangely it does not update the values.
ix = np.isnan(mydf.loc[:,'column_1'])
mydf[ix]['column_1'] = mydf[ix]['column_2']
Really strange, since I can perfectly see that:
mydf[ix]['column_1']
is the series with the NaN values
and that
mydf[ix]['column_2']
has valid values.
Why isn't it working?
I can't even do:
mydf[ix]['column_1'] = 45
This is an example of chained indexing. For getting values this is generally OK; however, for setting values it may or may not work, as you may be trying to set values on a copy. It is always better to set via the indexers ix/loc for multi-dimensional setting.
In this example, use mydf.loc[ix,'column_1'] = 45
See here for a more complete explanation.
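For the actual replacement from the question, a minimal sketch along the same lines (column names as in the question, data made up):

import numpy as np
import pandas as pd

mydf = pd.DataFrame({'column_1': [1.0, np.nan, 3.0, np.nan],
                     'column_2': [9.0, 8.0, 7.0, 6.0]})

ix = np.isnan(mydf['column_1'])
mydf.loc[ix, 'column_1'] = mydf.loc[ix, 'column_2']   # set via .loc, not chained indexing

An equivalent one-liner is mydf['column_1'] = mydf['column_1'].fillna(mydf['column_2']).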