pandas - vectorized formula computation with nans - python

I have a DataFrame (Called signal) that is a simple timeseries with 5 columns. This is what its .describe() looks like:
ES NK NQ YM
count 5294.000000 6673.000000 4798.000000 3415.000000
mean -0.000340 0.000074 -0.000075 -0.000420
std 0.016726 0.018401 0.023868 0.015399
min -0.118724 -0.156342 -0.144667 -0.103101
25% -0.008862 -0.010297 -0.011481 -0.008162
50% -0.001422 -0.000590 -0.001747 -0.001324
75% 0.007069 0.009163 0.009841 0.006304
max 0.156365 0.192686 0.181245 0.132630
I want to apply a simple function on every single row, and receive back a matrix with the same dimensions:
weights = -2*signal.subtract( signal.mean(axis=1), axis=0).divide( signal.sub( signal.mean(axis=1), axis=0).abs().sum(axis=1), axis=0 )
However, when I run this line, the program gets stuck. I believe this issue comes from the difference in length/presence of nans. Dropping the nans/filling it is not an option, for any given row that has a nan I want that nan to simply be excluded from the computation. A temporary solution would be to do this iteratively using .iterrows(), but this is not an efficient solution.
Are there any smart solutions to this problem?

The thing is, the pandas mean and sum methods already exclude NaN values by default (see the description of the skipna keyword in the linked docs). Additionally, subtract and divide allow for the use of a fill_value keyword arg:
fill_value : None or float value, default None
Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing
So you may be able to get what you want by setting fill_value=0 in the calls to subtract, and fill_value=1 in the calls to divide.
However, I suspect that the default behavior (NaN is ignored in mean and sum, NaN - anything = NaN, NaN\anything = NaN) is what you actually want. In that case, your problem isn't directly related to NaNs, and you're going to have to clarify your statement "when I run this line, the program gets stuck" in order to get a useful answer.

Related

'ignore nan' in numpy functions

The elementary functions in numpy, like mean() and std() returns np.nan when encounter np.nan. Can I make them ignore it?
The "normal" functions like np.mean and np.std evalutates the NaN i.e the result you've provided evaluates to NaN.
If you want to avoid that, use np.nanmean and np.nanstd. Note that since you have only one non-nan element the std is 0, thus you are dividing by zero.

Find max values of 1D arrays that may contain NAN

I am trying to find the maximum value in a 1D array using the max function in python. However, these arrays may contain NAN as consequence of missing data (flagged astronomical data). Every time I try to find the max value in the array, it gives me NAN as the maximum value. I was wondering if there is a way to find the maximum real number in the array.
I don't believe the two functions 'min', 'max' in Python are affected by 'nan' values. Something is wrong with your code logic. As tested with both Python 2 and 3, min/max functions give correct output values.
There's no code in your question but I can guess out you may misconcept between NAN (not a number value), and "NAN" a string constant. Here's a possible case that 'max' function gives output result as "NAN":

AttributeError: 'float' object has no attribute 'split'

I am calling this line:
lang_modifiers = [keyw.strip() for keyw in row["language_modifiers"].split("|") if not isinstance(row["language_modifiers"], float)]
This seems to work where row["language_modifiers"] is a word (atlas method, central), but not when it comes up as nan.
I thought my if not isinstance(row["language_modifiers"], float) could catch the time when things come up as nan but not the case.
Background: row["language_modifiers"] is a cell in a tsv file, and comes up as nan when that cell was empty in the tsv being parsed.
You are right, such errors mostly caused by NaN representing empty cells.
It is common to filter out such data, before applying your further operations, using this idiom on your dataframe df:
df_new = df[df['ColumnName'].notnull()]
Alternatively, it may be more handy to use fillna() method to impute (to replace) null values with something default.
E.g. all null or NaN's can be replaced with the average value for its column
housing['LotArea'] = housing['LotArea'].fillna(housing.mean()['LotArea'])
or can be replaced with a value like empty string "" or another default value
housing['GarageCond']=housing['GarageCond'].fillna("")
You might also use df = df.dropna(thresh=n) where n is the tolerance. Meaning, it requires n Non-NA values to not drop the row
Mind you, this approach will remove the row
For example: If you have a dataframe with 5 columns, df.dropna(thresh=5) would drop any row that does not have 5 valid, or non-Na values.
In your case you might only want to keep valid rows; if so, you can set the threshold to the number of columns you have.
pandas documentation on dropna

How to impute each categorical column in numpy array

There are good solutions to impute panda dataframe. But since I am working mainly with numpy arrays, I have to create new panda DataFrame object, impute and then convert back to numpy array as follows:
nomDF=pd.DataFrame(x_nominal) #Convert np.array to pd.DataFrame
nomDF=nomDF.apply(lambda x:x.fillna(x.value_counts().index[0])) #replace NaN with most frequent in each column
x_nominal=nomDF.values #convert back pd.DataFrame to np.array
Is there a way to directly impute in numpy array?
We could use Scipy's mode to get the highest value in each column. Leftover work would be to get the NaN indices and replace those in input array with the mode values by indexing.
So, the implementation would look something like this -
from scipy.stats import mode
R,C = np.where(np.isnan(x_nominal))
vals = mode(x_nominal,axis=0)[0].ravel()
x_nominal[R,C] = vals[C]
Please note that for pandas, with value_counts, we would be choosing the highest value in case of many categories/elements with the same highest count. i.e. in tie situations. With Scipy's mode, it would be lowest one for such tie cases.
If you are dealing with such mixed dtype of strings and NaNs, I would suggest few modifications, keeping the last step unchanged to make it work -
x_nominal_U3 = x_nominal.astype('U3')
R,C = np.where(x_nominal_U3=='nan')
vals = mode(x_nominal_U3,axis=0)[0].ravel()
This throws a warning for the mode calculation : RuntimeWarning: The input array could not be properly checked for nan values. nan values will be ignored.
"values. nan values will be ignored.", RuntimeWarning). But since, we actually want to ignore NaNs for that mode calculation, we should be okay there.

How to optimize code that iterates on a big dataframe in Python

I have a big pandas dataframe. It has thousands of columns and over a million rows. I want to calculate the difference between the max value and the min value row-wise. Keep in mind that there are many NaN values and some rows are all NaN values (but I still want to keep them!).
I wrote the following code. It works but it's time consuming:
totTime = []
for index, row in date.iterrows():
myRow = row.dropna()
if len(myRow):
tt = max(myRow) - min(myRow)
else:
tt = None
totTime.append(tt)
Is there any way to optimize it? I tried with the following code but I get an error when it encounters all NaN rows:
tt = lambda x: max(x.dropna()) - min(x.dropna())
totTime = date.apply(tt, axis=1)
Any suggestions will be appreciated!
It is usually a bad idea to use a python for loop to iterate over a large pandas.DataFrame or a numpy.ndarray. You should rather use the available build in functions on them as they are optimized and in many cases actually not written in python but in a compiled language. In your case you should use the methods pandas.DataFrame.max and pandas.DataFrame.min that both give you an option skipna to skip nan values in your DataFrame without the need to actually drop them manually. Furthermore, you can choose a axis to minimize along. So you can specifiy axis=1 to get the minimum along columns.
This will add up to something similar as what #EdChum just mentioned in the comments:
data.max(axis=1, skipna=True) - data.min(axis=1, skipna=True)
I have the same problem about iterating. 2 points:
Why don't you replace NaN values with 0? You can do it with this df.replace(['inf','nan'],[0,0]). It replaces inf and nan values.
Take a look at this This. Maybe you can understand, I have a similar question about how to optimize the loop to calculate de difference between actual row with the previous one.

Categories

Resources