How to not impute NaN values with pandas cut function? - python

I'm trying to use the cut function to convert numeric data into categories. My input data may have NaN values, which I would like to stay NaN after the cut. From what I understand reading the documentation, this is the default behavior and the following code should work:
intervals = [(i, i+1) for i in range(101)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan,0.5,10]),bins)
However, the output I get is:
(49, 50]
(0, 1]
(9, 10]
Notice that the NaN value is converted to the middle interval.
One strange thing is that once the number of intervals is 100 or fewer, I get the desired output:
intervals = [(i, i+1) for i in range(100)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan,0.5,10]),bins)
output:
NaN
(0, 1]
(9, 10]
Is there a way to specify that I don't want NaN values to be imputed?

This seems like a bug that originates from numpy.searchsorted():
pandas-dev/pandas#31586 - pd.cut returning incorrect output in some cases
numpy/numpy#15499 - BUG: searchsorted with object arrays containing nan
As a workaround, you could replace np.nan with some other value that is guaranteed not to fall into any bin, e.g. .replace(np.nan,'foo'):
intervals = [(i, i+1) for i in range(101)]
bins = pd.IntervalIndex.from_tuples(intervals)
pd.cut(pd.Series([np.nan,0.5,10]).replace(np.nan,'foo'),bins)
0 NaN
1 (0.0, 1.0]
2 (9.0, 10.0]
dtype: category
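An alternative workaround (a minimal sketch, assuming the same bins as above) is to cut only the non-missing values and align the result back on the original index, which leaves the NaN rows as NaN:
s = pd.Series([np.nan, 0.5, 10])
# cut only the non-NaN values, then reindex so missing rows come back as NaN
result = pd.cut(s.dropna(), bins).reindex(s.index)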

Related

Pandas: calculating ratio between values of dataset for some subset

I have a dataset which looks like
value
34
45
3
-3
I want to calculate ratios for this dataset, i.e. the ratio of each value to the next value:
34/45 , 45/3, 3/-3
I can do it via myDataset["value"]/myDataset["value"].shift(-1)
The next step is more complex, and that is where I am struggling. I need to calculate the same ratio, but for a selected set of values. The selection criterion is that a value should be greater than the previous one, i.e. this time the resulting dataset should contain 45/3 only.
I started with
myDataset.loc[(myDataset["value"] > myDataset["value"].shift(1)), "value"] / myDataset.loc[(myDataset["value"] > myDataset["value"].shift(1)), "value"].shift(-1)
But that's not what I want, because myDataset.loc filters the dataset first, so the "next" value it finds is not really the next row but the next row that satisfies the condition in (). I need the actual next value from the original dataset.
How can I do it?
UPDATE
It looks like my description was a bit misleading. If I have a list of
a,b,c,d then I want to return
c/b if b>a
d/c if c>b
I don't want to return c/b if c>b; that part is pretty straightforward.
Just make a new shifted column, and work with that:
df['shifted'] = df.value.shift(-1)
print(df.value/df.shifted)
print()
print(df.apply(lambda x: x.value/x.shifted if x.value > x.shifted else None, axis=1))
Output:
0 0.755556
1 15.000000
2 -1.000000
3 NaN
dtype: float64
0 NaN
1 15.0
2 -1.0
3 NaN
dtype: float64
Code
value = df.value
diff = value.diff()
div = value.shift(-1) / value
div[diff > 0]
Output
1 0.066667
Explanation
diff() is equivalent to value - value.shift()
div holds the ratio c/b, i.e. the next value divided by the current one
Following the updated question, you can zip three items instead of two.
You can do this with a list comprehension (this will return only ratios that meet the condition):
[b/c for a, b, c in zip(df["value"].shift(2), df["value"].shift(1), df["value"]) if a>b]
#Out:
#[-1.0]
The only triple that satisfies the condition a > b is (45, 3, -3), so the only ratio returned is 3/-3 = -1.0.
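Reading the update literally (c/b only if b>a), the same thing can also be expressed with shifted Series instead of zip. A minimal sketch, assuming the column is named "value" as above:
ratio = df["value"] / df["value"].shift(1)           # c/b, d/c, ...
cond = df["value"].shift(1) > df["value"].shift(2)   # b > a, c > b, ...
print(ratio[cond])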

Trying to compare two values in a pandas dataframe for max value

I've got a pandas DataFrame, and I'm trying to fill a new column which, for each row, takes the maximum of two values from another column of the DataFrame, iteratively. I'm trying to build a loop to do this and save computation time, as I realise I could probably do it with more lines of code.
for x in ((jac_input.index)):
    jac_output['Max Load'][x] = jac_input[['load'][x], ['load'][x+1]].max()
However, I keep getting this error during the comparison
IndexError: list index out of range
Any ideas as to where I'm going wrong here? Any help would be appreciated!
Many things are wrong with your current code.
When you do ['abc'][x], x can only take the value 0, and the expression returns 'abc' because you are indexing a one-element Python list. Not at all what you expect it to do (I imagine you wanted to index the Series).
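A quick illustration of the pitfall (a hypothetical one-element list, just to show what the expression evaluates to):
['abc'][0]      # -> 'abc'  (indexing a one-element Python list)
# ['abc'][1]    # -> IndexError: list index out of range (the error from the question)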
For your code to be valid, you should do something like:
jac_input = pd.DataFrame({'load': [1,0,3,2,5,4]})
for x in jac_input.index:
    print(jac_input['load'].loc[x:x+1].max())
output:
1
3
3
5
5
4
Also, when assigning, if you use jac_output['Max Load'][x] = ... you will likely encounter a SettingWithCopyWarning. You should rather use loc: jac_output.loc[x, 'Max Load'] = ...
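For instance, combined with the corrected loop above, the assignment could look like this (a sketch; jac_output is assumed to share jac_input's index):
for x in jac_input.index:
    jac_output.loc[x, 'Max Load'] = jac_input['load'].loc[x:x+1].max()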
But you do not need all that; use vectorized code instead!
You can perform a rolling max on the reversed Series:
jac_output['Max Load'] = jac_input['load'][::-1].rolling(2, min_periods=1).max()[::-1]
Or using concat:
jac_output['Max Load'] = pd.concat([jac_input['load'], jac_input['load'].shift(-1)], axis=1).max(1)
output (without assignment):
0 1.0
1 3.0
2 3.0
3 5.0
4 5.0
5 4.0
dtype: float64

How do I fill NaN values with different random numbers on Python?

I want to replace the missing values from a column with people's ages (which also contains numerical values, not only NaN values) but everything I've tried so far either doesn't work how I want it to or it doesn't work at all.
I wish to apply a random variable generator which follows a normal distribution using the mean and standard deviation obtained with that column.
I have tried the following:
Replacing with numpy, replaces NaN values but with the same number for all of them
df_travel['Age'] = df_travel['Age'].replace(np.nan, round(rd.normalvariate(age_mean, age_std),0))
Fillna with pandas, also replaces NaN values but with the same number for all of them
df_travel['Age'] = df_travel['Age'].fillna(round(rd.normalvariate(age_mean, age_std),0))
Applying a function on the dataframe with pandas, replaces NaN values but also changes all existing numerical values (I only wish to fill the NaN values)
df_travel['Age'] = df_travel['Age'].where(df_travel['Age'].isnull() == True).apply(lambda v: round(rd.normalvariate(age_mean, age_std),0))
Any ideas would be appreciated. Thanks in advance.
Series.fillna can accept a Series, so generate a random array of size len(df_travel):
rng = np.random.default_rng(0)
mu = df_travel['Age'].mean()
sd = df_travel['Age'].std()
filler = pd.Series(rng.normal(loc=mu, scale=sd, size=len(df_travel)))
df_travel['Age'] = df_travel['Age'].fillna(filler)
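One caveat (a hedged note, since the index of df_travel is not shown): Series.fillna aligns on the index, so if df_travel does not have a default RangeIndex, build the filler with the same index:
filler = pd.Series(
    rng.normal(loc=mu, scale=sd, size=len(df_travel)),
    index=df_travel.index,   # align with df_travel so fillna matches rows
)
df_travel['Age'] = df_travel['Age'].fillna(filler)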
I would go about it the following way:
# compute mean and std of `Age`
age_mean = df['Age'].mean()
age_std = df['Age'].std()
# number of NaN in `Age` column
num_na = df['Age'].isna().sum()
# generate `num_na` samples from N(age_mean, age_std**2) distribution
rand_vals = age_mean + age_std * np.random.randn(num_na)
# replace missing values with `rand_vals`
df.loc[df['Age'].isna(), 'Age'] = rand_vals

Pandas - change cell value based on conditions from cell and from column

I have a DataFrame with a lot of "bad" cells. Let's say they all have -99.99 as their value, and I want to remove them (set them to NaN).
This works fine:
df[df == -99.99] = None
But actually I want to delete all these cells ONLY if another cell in the same row is marked as 1 (e.g. in the column "Error").
I want to delete all -99.99 cells, but only if df["Error"] == 1.
The most straightforward solution, I think, is something like
df[(df == -99.99) & (df["Error"] == 1)] = None
but it gives me the error:
ValueError: cannot reindex from a duplicate axis
I have tried every solution given on the internet but I can't get it to work! :(
Since my DataFrame is big I don't want to iterate over it (which, of course, would work, but would take a lot of time).
Any hint?
Try using broadcasting while passing numpy values:
# sample data, special value is -99
df = pd.DataFrame([[-99, -99, 1], [2, -99, 2],
                   [1, 1, 1], [-99, 0, 1]],
                  columns=['a', 'b', 'Errors'])
# note the double square brackets
df[(df==-99) & (df[['Errors']]==1).values] = np.nan
Output:
a b Errors
0 NaN NaN 1
1 2.0 -99.0 2
2 1.0 1.0 1
3 NaN 0.0 1
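If you prefer not to modify df in place, the same broadcasting idea works with DataFrame.mask (a sketch using the sample data above, before the in-place assignment; column name 'Errors' as in the sample):
cleaned = df.mask((df == -99) & (df[['Errors']] == 1).values)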
At least, this is working (but with column iteration):
for i in df.columns:
    df.loc[df[i].isin([-99.99]) & df["Error"].isin([1]), i] = None

Keras fitting ignoring nan values

I am training a neural network to do regression (1 input and 1 output). Let x and y be the usual input and output datasets, respectively.
My problem is that the y dataset (not the x) has some values set to nan, so the loss becomes nan during fitting. I wonder if there is an option to ignore the nan values in the fitting, in a similar way to numpy functions such as np.nanmean that calculate the mean ignoring nans.
If that option does not exist I suppose I would have to find the nan values and erase them manually, and at the same time erase the values in x corresponding to the nan position in y.
x y
2 4
3 2
4 np.nan
5 7
6 np.nan
7 np.nan
In this simple example the nan values in the y column should be removed and at the same time the corresponding values in the x column (4, 6, 7).
Thank you.
EDIT: Ok, I have a problem filtering the nans. I do:
for index, x in np.ndenumerate(a):
    if x == np.nan:
        print(index, x)
and it doesn't print anything, and I am sure there are nan values...
EDIT (SELF ANSWER): Ok, I have found a way to locate the nans:
for index, x in np.ndenumerate(a):
    if x != x:
        print(index, x)
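As a side note, the x != x trick works because NaN is the only value that is not equal to itself (which is also why the x == np.nan check above matches nothing); numpy provides np.isnan for the same test. A small sketch with a made-up array a:
import numpy as np

a = np.array([1.0, np.nan, 3.0])
print(np.argwhere(np.isnan(a)))   # indices of the NaN entries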
As said in the comments, simply remove the nan as a preprocessing step:
import numpy as np
x = list(range(2, 8))
y = [4, 2, np.nan, 7, np.nan, np.nan]
# iterate over a copy of the pairs so removing items doesn't disturb the loop
for a, b in list(zip(x, y)):
    if str(b) == 'nan':
        x.remove(a)
        y.remove(b)
print(x, y)
produces [2, 3, 5] [4, 2, 7].
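An equivalent vectorized sketch with numpy arrays (x and y here stand in for the arrays that would be fed to Keras):
import numpy as np

x = np.arange(2, 8)
y = np.array([4, 2, np.nan, 7, np.nan, np.nan])
mask = ~np.isnan(y)                  # keep only rows where the label is present
x_clean, y_clean = x[mask], y[mask]
print(x_clean, y_clean)              # [2 3 5] [4. 2. 7.]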
If you're using some tool to preprocess the data which gives you the np.nan, check whether the API allows you to disable this behavior and take a minute to think whether this is really the behavior you want (or if you e.g. want to map this to constants because you find your input to be valuable even though they have no labels).
