Comparison of two lists with NaN in Python - python

I thought this was a simple question and looked for related topics, but I didn't find the right thing. Here is the problem:
I have two NumPy arrays on which I need to run a statistical analysis by calculating some criteria, for example the correlation coefficient and the Nash criterion (for those who are familiar with Nash). The first array holds observation data (the second holds simulation results), so it contains some NaNs. I would like my program to calculate the criteria while ignoring the value pairs where the value in the first array is NaN.
I tried the mask method. It works well when I only need to deal with the first array (calculating its average, for example), but it doesn't work for comparing the two arrays value by value.
Could anyone help? Thanks!

I just answered a similar question, "Numpy only on finite entries". You can locate the NaN values in your array with NumPy's isnan function and then replace them, which is a common way to deal with NaN values.
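import numpy as np
# Boolean mask that is True wherever array_name holds NaN
replace_NaN = np.isnan(array_name)
# Overwrite those entries with 0 (modifies array_name in place)
array_name[replace_NaN] = 0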
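Replacing NaN with 0 changes the statistics, so for the question as asked it may be preferable to drop the affected pairs from both arrays with one shared mask. The following is a minimal sketch under stated assumptions: the function name paired_stats and the sample values are made up for illustration, and "Nash criterion" is assumed to mean the Nash-Sutcliffe efficiency, 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2).

```python
import numpy as np

def paired_stats(obs, sim):
    """Correlation and Nash-Sutcliffe efficiency over pairs where obs is not NaN.

    Hypothetical helper; assumes obs and sim have the same length.
    """
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    valid = ~np.isnan(obs)            # True where the observation exists
    o, s = obs[valid], sim[valid]     # same mask applied to both arrays
    r = np.corrcoef(o, s)[0, 1]       # Pearson correlation coefficient
    nse = 1.0 - np.sum((o - s) ** 2) / np.sum((o - o.mean()) ** 2)
    return r, nse

# Made-up sample data: observations with gaps, simulation without gaps
obs = np.array([1.0, np.nan, 3.0, 4.0, np.nan, 6.0])
sim = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 5.8])
r, nse = paired_stats(obs, sim)
print(r, nse)
```

Because the same boolean mask indexes both arrays, the remaining values stay aligned pair by pair.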

Related

How to get p-value for each row of two columns in pandas DataFrame?

I would like to ask for suggestions on how to calculate a p-value for each row of my pandas DataFrame. My dataframe looks like this: there are columns with the means of Data1 and Data2, and also columns with the standard errors of those means. Each row represents one atom. Thus I need to calculate a p-value for each row (that is, for example, compare the mean of atom 1 from Data1 with the mean of atom 1 from Data2).
   SEM-DATA1  MEAN-DATA1  SEM-DATA2  MEAN-DATA2
0   0.001216    0.145842   0.000959    0.143103
1   0.002687    0.255069   0.001368    0.250505
2   0.005267    0.321345   0.003722    0.305767
3   0.027265    0.906731   0.033637    0.731638
4   0.029974    0.773725   0.150025    0.960804
I found here on Stack that many people recommend using scipy, but I don't know how to apply it in the way I need.
Is it possible?
Thank you.
You are comparing two samples, df['MEAN...1'] and df['MEAN...2'], so you should do this:
from scipy import stats
stats.ttest_ind(df['MEAN-DATA1'],df['MEAN-DATA2'])
which returns:
Ttest_indResult(statistic=0.01001479441863673, pvalue=0.9922547232600507)
or, if you only want the p-value:
a = stats.ttest_ind(df['MEAN-DATA1'],df['MEAN-DATA2'])
a[1]
which gives
0.9922547232600507
EDIT
A clarification is in order here. A t-test (or the acquisition of a "p-value") is aimed at finding out whether two distributions come from the same population (or sample). Testing two single values will give NaN.
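If a per-row comparison is what is needed, one possible approach uses the standard errors that the table already contains. The sketch below is an assumption, not part of the answer above: it treats each row's two means as independent and approximately normal, and computes a two-sided z-test p-value from z = (mean1 - mean2) / sqrt(sem1^2 + sem2^2). A proper t-test would also need the sample sizes behind each mean, which the question does not give; with those, scipy.stats.ttest_ind_from_stats could be used instead.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Data copied from the question
df = pd.DataFrame({
    "SEM-DATA1":  [0.001216, 0.002687, 0.005267, 0.027265, 0.029974],
    "MEAN-DATA1": [0.145842, 0.255069, 0.321345, 0.906731, 0.773725],
    "SEM-DATA2":  [0.000959, 0.001368, 0.003722, 0.033637, 0.150025],
    "MEAN-DATA2": [0.143103, 0.250505, 0.305767, 0.731638, 0.960804],
})

# Standard error of the difference of two independent means
se_diff = np.sqrt(df["SEM-DATA1"] ** 2 + df["SEM-DATA2"] ** 2)
z = (df["MEAN-DATA1"] - df["MEAN-DATA2"]) / se_diff

# Two-sided p-value under the normal (large-sample) approximation
df["p-value"] = 2 * stats.norm.sf(np.abs(z))
print(df["p-value"])
```

Everything is computed column-wise, so each row gets its own p-value without an explicit loop.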

Speed up iteration over DataFrame items

I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
def calculate_dfA(df_t,xout):
    df_A = df_t.copy()
    vector_x = xout.T
    for index_col, column in tqdm(df_A.iteritems()):
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row,index_col] = df_A.iloc[index_row,index_col]/vector_x.iloc[0,index_col]
    return(df_A)
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm, the processing speed is roughly 4.5 s/it. Accordingly, the calculation will require approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values/vector_x
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, this takes advantage of the contiguity and homogeneity of the data in memory, which allows for a massive speedup.
Edit: Coming back to this answer today, I noticed that converting to a NumPy array first speeds up the computation. Locally I get a 10x speedup for an array of a size similar to the one in the question above. I have edited my answer accordingly.
I'm on mobile now, but you should try to avoid every for loop in Python; there's always a better way.
For one, I know you can multiply one pandas column (Series) by another column to get your desired result.
I think that to multiply every column by the matching column of another DataFrame you would still need to iterate (but with only one for loop, which gives a performance boost).
I would strongly recommend that you temporarily convert to NumPy ndarrays and work with those.
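Both answers point toward doing the division on NumPy arrays. A minimal sketch of what that could look like follows, assuming (as the original loop implies through xout.T and vector_x.iloc[0, index_col]) that xout is a single-column DataFrame whose j-th entry is the divisor for column j of df_t. The function name calculate_dfA_vectorized and the small random test data are made up for illustration.

```python
import numpy as np
import pandas as pd

def calculate_dfA_vectorized(df_t, xout):
    """Divide column j of df_t by the j-th entry of xout (hypothetical helper)."""
    divisors = xout.to_numpy().ravel()        # shape (n_columns,)
    # Broadcasting: each row of df_t is divided element-wise by divisors
    values = df_t.to_numpy() / divisors
    return pd.DataFrame(values, index=df_t.index, columns=df_t.columns)

# Small made-up example instead of the 14839 x 14839 frame
rng = np.random.default_rng(0)
df_t = pd.DataFrame(rng.random((4, 4)))
xout = pd.DataFrame(rng.random((4, 1)) + 1.0)   # divisors, kept away from zero
fast = calculate_dfA_vectorized(df_t, xout)
print(fast)
```

Flattening the divisors to a one-dimensional array keeps the broadcasting explicit and avoids any index alignment between the two DataFrames.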

Find max values of 1D arrays that may contain NAN

I am trying to find the maximum value in a 1D array using the max function in Python. However, these arrays may contain NaN as a consequence of missing data (flagged astronomical data). Every time I try to find the max value in the array, it gives me NaN as the maximum value. I was wondering if there is a way to find the maximum real number in the array.
I don't believe Python's 'min' and 'max' functions are affected by 'nan' values. Something is wrong with your code logic. As tested with both Python 2 and 3, the min/max functions give correct output values.
There's no code in your question, but my guess is that you may be confusing NaN (the not-a-number value) with "NAN", a string constant. Here's a possible case in which the 'max' function returns "NAN":
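Separately from the string case, if the array holds real floating-point NaN values, the result depends on which max is used. The sketch below uses made-up sample data: NumPy's np.max propagates NaN, np.nanmax ignores it, and the built-in max depends on where the NaN sits, because every comparison with NaN evaluates to False.

```python
import numpy as np

# Made-up 1D data with missing values flagged as NaN
data = np.array([3.2, np.nan, 7.5, 1.0, np.nan])

plain_max = np.max(data)       # NaN propagates through np.max
real_max = np.nanmax(data)     # np.nanmax skips NaN entries
print(plain_max)   # nan
print(real_max)    # 7.5

# The built-in max is order-dependent when NaN is present
nan_first = max([float("nan"), 1.0, 2.0])   # nan: nothing compares greater than NaN
nan_later = max([1.0, float("nan"), 2.0])   # 2.0: NaN never replaces the running max
print(nan_first, nan_later)
```

If every entry is NaN, np.nanmax emits a RuntimeWarning and returns NaN, so that case may need separate handling.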

How to impute each categorical column in numpy array

There are good solutions for imputing a pandas DataFrame. But since I am working mainly with NumPy arrays, I have to create a new pandas DataFrame object, impute, and then convert back to a NumPy array as follows:
nomDF=pd.DataFrame(x_nominal) #Convert np.array to pd.DataFrame
nomDF=nomDF.apply(lambda x:x.fillna(x.value_counts().index[0])) #replace NaN with most frequent in each column
x_nominal=nomDF.values #convert back pd.DataFrame to np.array
Is there a way to impute directly in a NumPy array?
We could use SciPy's mode to get the most frequent value in each column. The remaining work is to get the NaN indices and replace those entries in the input array with the mode values by indexing.
So, the implementation would look something like this:
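import numpy as np
from scipy.stats import mode
R,C = np.where(np.isnan(x_nominal))       # row/column indices of the NaNs
vals = mode(x_nominal,axis=0)[0].ravel()  # per-column mode
x_nominal[R,C] = vals[C]                  # fill each NaN with its column's mode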
Please note that pandas, with value_counts, would choose the highest value when several categories/elements share the same highest count, i.e. in tie situations. With SciPy's mode, it would be the lowest one in such ties.
If you are dealing with a mixed dtype of strings and NaNs, I would suggest a few modifications, keeping the last step unchanged, to make it work:
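How scipy.stats.mode treats NaN has varied between SciPy releases, so a NumPy-only variant may be easier to reason about. The following is a sketch under the assumption that the categories are encoded as floats with NaN marking missing entries; the helper name impute_mode and the sample array are hypothetical. Ties resolve to the smallest value, because np.unique returns sorted values and np.argmax picks the first maximum.

```python
import numpy as np

def impute_mode(x):
    """Return a copy of x with NaNs replaced by each column's most frequent value."""
    x = np.array(x, dtype=float)              # work on a copy
    for j in range(x.shape[1]):
        col = x[:, j]                          # view into x, so writes go through
        missing = np.isnan(col)
        if missing.all() or not missing.any():
            continue                           # nothing to fill, or nothing to fill with
        values, counts = np.unique(col[~missing], return_counts=True)
        col[missing] = values[np.argmax(counts)]
    return x

# Made-up float-coded categorical data
x_nominal = np.array([[1.0, 0.0],
                      [np.nan, 2.0],
                      [1.0, 2.0],
                      [3.0, np.nan]])
imputed = impute_mode(x_nominal)
print(imputed)
```

The loop runs once per column rather than once per cell, so the cost stays modest even for tall arrays.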
x_nominal_U3 = x_nominal.astype('U3')
R,C = np.where(x_nominal_U3=='nan')
vals = mode(x_nominal_U3,axis=0)[0].ravel()
This throws a warning for the mode calculation: RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored. But since we actually want to ignore NaNs for that mode calculation, we should be okay there.

pandas: check whether an element is in dataframe or given column leads to strange results

I am doing some data handling based on a DataFrame with the shape (135150, 12), so double-checking my results manually is not feasible anymore.
I encountered some 'strange' behavior when I tried to check if an element is part of the dataframe or a given column.
This behavior is reproducible with even smaller dataframes as follows:
import numpy as np
import pandas as pd
start = 1e-3
end = 2e-3
step = 0.01e-3
arr = np.arange(start, end+step, step)
val = 0.0019
df = pd.DataFrame(arr, columns=['example_value'])
print(val in df) # prints `False`
print(val in df['example_value']) # prints `True`
print(val in df.values) # prints `False`
print(val in df['example_value'].values) # prints `False`
print(df['example_value'].isin([val]).any()) # prints `False`
Since I am a complete beginner in data analysis, I am not able to explain this behavior.
I know that I am using different approaches involving different data types (like pd.Series, np.ndarray or np.array) to check whether the given value exists in the dataframe. I am also aware that machine precision comes into play when using np.array or np.ndarray.
However, in the end, I need to implement several functions to filter the dataframe and count the occurrences of some values, which I have done successfully several times before using boolean columns combined with operations like > and <.
But in this case I need to filter by the exact value and count its occurrences, which is what led me to the issue described above.
So could anyone explain what's going on here?
The underlying issue, as Divakar suggested, is floating point precision. Because DataFrames/Series are built on top of NumPy, there isn't really a penalty for using NumPy methods, so you can just do something like:
df['example_value'].apply(lambda x: np.isclose(x, val)).any()
or
np.isclose(df['example_value'], val).any()
both of which correctly return True.
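The same tolerance-based comparison can also cover the filtering and counting mentioned in the question. This is a minimal sketch reusing the question's data; the explicit tolerance (rtol=0.0 and atol set to one hundredth of the step) is an assumption chosen so that neighbouring grid values cannot match, and it should be adapted to the scale of the real data.

```python
import numpy as np
import pandas as pd

start, end, step = 1e-3, 2e-3, 0.01e-3
df = pd.DataFrame(np.arange(start, end + step, step), columns=["example_value"])
val = 0.0019

# Boolean mask: True where the column is within atol of val
mask = np.isclose(df["example_value"], val, rtol=0.0, atol=step / 100)

matches = df[mask]          # rows whose value equals val up to the tolerance
count = int(mask.sum())     # number of occurrences
print(count)
print(matches)
```

The boolean mask behaves like the > and < filters mentioned in the question, so it can be combined with them using & and |.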
