Hi everyone, I'm trying to define a set of variables and I want to format their names dynamically.
The setup is:
features = ['Gender', 'Age', 'Rank'] + other11columns  # selected columns of my data
In [1]:data['Gender'].unique()
Out[1]: array([0, 1], dtype=int64)
In [2]:data['Age'].unique()
Out[2]: array([10, 20, 30, 40, 50], dtype=int64)
In [3]:data['Rank'].unique()
Out[3]: array([0, 1, 2, 3, 4, 5, 6], dtype=int64)
.....
First, I want to set up an empty DataFrame for each feature. I want something like these:
report_Gender
Out[3]:
Prediction Actual
0 NaN NaN
1 NaN NaN
report_Age
Out[5]:
Prediction Actual
10 NaN NaN
20 NaN NaN
30 NaN NaN
40 NaN NaN
50 NaN NaN
report_Rank
Out[6]:
Prediction Actual
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
5 NaN NaN
6 NaN NaN
.......
The following code doesn't work, but it shows what I'm trying to do:
for i in range(len(features)):
    report_features[i] = pd.DataFrame(index=data[features[i]].unique(), columns=['Prediction', 'Actual'])
I tried playing with string formatting using the %s operator, but I couldn't figure out how to turn the result into a variable name... any help is appreciated :)
Dynamically creating global variables can get hairy. It is much easier if you keep them in a smaller scope, i.e. inside some container object such as a dictionary. You can achieve what you want like this:
my_dictionary = dict()
for f in features:
    my_dictionary['report_{}'.format(f)] = pd.DataFrame(index=data[f].unique(), columns=['Prediction', 'Actual'])
You can then access each DataFrame like my_dictionary['report_Gender'], for example.
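For example, once the dictionary is built you can fill in results per group; the values below are made up purely for illustration:
report = my_dictionary['report_Gender']
report.loc[0, 'Prediction'] = 0.72  # hypothetical predicted value for Gender == 0
report.loc[0, 'Actual'] = 0.68      # hypothetical observed value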
Another way would be to create a class:
class Reports:
    pass

for f in features:
    setattr(Reports, 'report_{}'.format(f),
            pd.DataFrame(index=data[f].unique(), columns=['Prediction', 'Actual']))
Then access as Reports.report_Gender etc...
You can use the built-in setattr function if you really want to do it, but I'd suggest following Ravi Patel's advice:
for i in range(len(features)):
    setattr(object_or_module_your_variable_belongs_to,
            name_for_your_variable,
            pd.DataFrame(index=data[features[i]].unique(), columns=['Prediction', 'Actual']))
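If you prefer not to define a class, a plain namespace object works the same way. A minimal sketch, assuming data and features exist as in the question:
import types
import pandas as pd

reports = types.SimpleNamespace()
for f in features:
    setattr(reports, 'report_{}'.format(f),
            pd.DataFrame(index=data[f].unique(),
                         columns=['Prediction', 'Actual']))
# access as reports.report_Gender, reports.report_Age, ...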
Let's say I have this Pandas series:
num = pd.Series([1,2,3,4,5,6,5,6,4,2,1,3])
What I want to do is take a value, say 5, and return the index where it previously occurred. For example, for the 5 at index 6 I should get 4, since 5 previously appeared at index 4. Now I want to do this for all elements of the series, which can easily be done with a for loop:
result = []
for idx, x in enumerate(num):
    idx_prev = num[num == x].idxmax()   # index of the first occurrence of x
    result.append(idx_prev if idx_prev < idx else np.nan)
However, this process consumes too much time for longer series lengths due to the looping. Is there a way to implement the same thing but in a vectorized form? The output should be something like this:
[NaN,NaN,NaN,NaN,NaN,NaN,4,5,3,1,0,2]
You can use groupby to shift the index:
num.index.to_series().groupby(num).shift()
Output:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 4.0
7 5.0
8 3.0
9 1.0
10 0.0
11 2.0
dtype: float64
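To see why this works, here is the same expression broken into steps (using the num series from the question; the intermediate names are only for illustration):
import pandas as pd

num = pd.Series([1, 2, 3, 4, 5, 6, 5, 6, 4, 2, 1, 3])

positions = num.index.to_series()   # Series whose values are the positions 0..11
grouped = positions.groupby(num)    # group those positions by the value at each position
prev_occurrence = grouped.shift()   # within each group, take the previous position
print(prev_occurrence)              # NaN for first occurrences, previous index otherwise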
It's possible to keep working in numpy.
Equivalent of [num[num == x].idxmax() for idx,x in enumerate(num)] using numpy is:
_, out = np.unique(num.values, return_inverse=True)
which assigns
array([0, 1, 2, 3, 4, 5, 4, 5, 3, 1, 0, 2], dtype=int64)
to out. Now you can set the positions that have no previous occurrence to NaN like this:
out_series = pd.Series(out)
out_series[out >= np.arange(len(out))] = np.nan
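Putting the pieces together, a self-contained sketch of this approach on the example series (dtype=float is used so the NaN assignment doesn't upcast an integer Series):
import numpy as np
import pandas as pd

num = pd.Series([1, 2, 3, 4, 5, 6, 5, 6, 4, 2, 1, 3])

# position of each value within the sorted unique values
_, out = np.unique(num.values, return_inverse=True)

# positions with no earlier occurrence of the same value become NaN
out_series = pd.Series(out, dtype=float)
out_series[out >= np.arange(len(out))] = np.nan
print(out_series.tolist())
# [nan, nan, nan, nan, nan, nan, 4.0, 5.0, 3.0, 1.0, 0.0, 2.0]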
s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
The output of s is:
49 NaN
48 NaN
47 NaN
46 NaN
45 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
s.loc[[False,True]]
It gives this output:
48 NaN
.loc is documented as 'Access a group of rows and columns by label(s)'. I have passed a list of False and True whose length is not equal to the length of the axis being sliced.
My doubt is: if we give a boolean list to .loc, does it slice the DataFrame/Series by position instead of by label?
I do get an error when I run this:
import pandas as pd
import numpy as np
s = pd.Series(np.nan, index=[49,48,47,46,45, 1, 2, 3, 4, 5])
s.loc[[False,True]]
the error (as expected) is:
IndexError: Item wrong length 2 instead of 10.
Maybe your problem is specific to a certain pandas version? Maybe an old one? I used pandas 0.25.3.
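For comparison, a boolean mask passed to .loc has to cover the whole axis; a minimal sketch using the same series:
import numpy as np
import pandas as pd

s = pd.Series(np.nan, index=[49, 48, 47, 46, 45, 1, 2, 3, 4, 5])

mask = [False, True] * 5   # one boolean per row, 10 in total
print(s.loc[mask])         # selects the labels 48, 46, 1, 3, 5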
So I have two ndarrays: one contains NDVI values, the other contains temperature.
The condition is that for every pixel whose temperature is above the 25th percentile of all temperatures, that pixel's NDVI value has to be changed to np.nan.
So I am currently using:
temp[temp > T_25] = np.nan (which only sets temp itself to NaN)
Do I just need to find the indices from the above and apply them to ndvi?
I tried to flatten it and use np.where(temp[temp > T_25]), but it seems to just give me an empty array.
This is what temp looks like after changing 75% of the pixels, before flattening:
[[ nan nan nan ... nan nan nan]
[ nan nan 229.3249 ... nan nan nan]
[229.35771 229.32663 229.28688 ... nan nan nan]
...
[229.09474 229.14499 229.17618 ... nan nan nan]
[229.1779 229.27306 229.27135 ... nan nan nan]
[229.30244 nan 229.33873 ... nan nan nan]]
I want those NaNs to end up in ndvi instead... the shape is (600, 400).
Thanks for reading this.
Any help will be much appreciated.
Your line temp[temp > T_25] = np.nan is almost correct. You just have to change the array that you're indexing to be ndvi:
ndvi[temp > T_25] = np.nan
Should do what you want.
You can also fold the calculation of T_25 into the same line (assuming that T_25 is the 25th percentile of the values in temp) like so:
ndvi[temp > np.percentile(temp, 25)] = np.nan
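A minimal self-contained sketch of the whole operation, with random arrays standing in for the real rasters (T_25 is taken to be the 25th percentile of temp, as assumed above):
import numpy as np

rng = np.random.default_rng(0)
temp = rng.uniform(220, 240, size=(600, 400))   # stand-in temperature raster
ndvi = rng.uniform(-1, 1, size=(600, 400))      # stand-in NDVI raster

T_25 = np.percentile(temp, 25)
ndvi[temp > T_25] = np.nan      # mask NDVI wherever temperature exceeds T_25

print(np.isnan(ndvi).mean())    # roughly 0.75 of the pixels are now NaN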
If I calculate the mean of a groupby object and one of the groups contains NaN(s), the NaNs are ignored. Even when applying np.mean it still returns just the mean of all valid numbers. I would expect the behaviour of returning NaN as soon as one NaN is within the group. Here is a simplified example of the behaviour:
import pandas as pd
import numpy as np
c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
c.groupby('b').mean()
a
b
1 1.5
2 3.0
c.groupby('b').agg(np.mean)
a
b
1 1.5
2 3.0
I want to receive the following result:
a
b
1 1.5
2 NaN
I am aware that I can replace NaNs beforehand and that I probably could write my own aggregation function to return NaN as soon as a NaN is within the group. Such a function wouldn't be optimized, though.
Do you know of an argument to achieve the desired behaviour with the optimized functions?
Btw, I think the desired behaviour was implemented in a previous version of pandas.
By default, pandas skips NaN values. You can make it propagate NaN by specifying skipna=False:
In [215]: c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Out[215]:
a
b
1 1.5
2 NaN
There is mean(skipna=False), but it's not working
GroupBy aggregation methods (min, max, mean, median, etc.) have the skipna parameter, which is meant for this exact task, but it seems that currently (May 2020) there is a bug (issue opened in March 2020) which prevents it from working correctly.
Quick workaround
Complete working example based on the comments by Serge Ballesta and RoelAdriaans:
>>> import pandas as pd
>>> import numpy as np
>>> c = pd.DataFrame({'a':[1,np.nan,2,3],'b':[1,2,1,2]})
>>> c.fillna(np.inf).groupby('b').mean().replace(np.inf, np.nan)
a
b
1 1.5
2 NaN
For additional information and updates follow the link above.
Use the skipna option:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Another approach would be to use a value that is not ignored by default, for example np.inf:
>>> c = pd.DataFrame({'a':[1,np.inf,2,3],'b':[1,2,1,2]})
>>> c.groupby('b').mean()
a
b
1 1.500000
2 inf
There are three different methods for it:
Slowest:
c.groupby('b').apply(lambda g: g.mean(skipna=False))
Faster than apply, but slower than the default aggregation:
c.groupby('b').agg({'a': lambda x: x.mean(skipna=False)})
Fastest, but needs more code:
method3 = c.groupby('b').mean()
nan_groups = c.loc[c['a'].isna(), 'b'].unique()
method3.loc[method3.index.isin(nan_groups)] = np.nan
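Run on the c from the question, a sketch of that third approach gives the desired result:
import numpy as np
import pandas as pd

c = pd.DataFrame({'a': [1, np.nan, 2, 3], 'b': [1, 2, 1, 2]})

method3 = c.groupby('b').mean()                        # fast default aggregation, NaNs skipped
nan_groups = c.loc[c['a'].isna(), 'b'].unique()        # keys of groups that contain a NaN
method3.loc[method3.index.isin(nan_groups)] = np.nan   # propagate NaN to those groups

print(method3)
#      a
# b
# 1  1.5
# 2  NaN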
I landed here in search of a fast (vectorized) way of doing this, but did not find it. Also, in the case of complex numbers, groupby behaves a bit strangely: it doesn't like mean(), and with sum() it will convert groups where all values are NaN into 0+0j.
So, here is what I came up with:
Setup:
df = pd.DataFrame({
'a': [1, 2, 1, 2],
'b': [1, np.nan, 2, 3],
'c': [1, np.nan, 2, np.nan],
'd': np.array([np.nan, np.nan, 2, np.nan]) * 1j,
})
gb = df.groupby('a')
Default behavior:
gb.sum()
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 0.0 0.000000+0.000000j
A single NaN kills the group:
cnt = gb.count()
siz = gb.size()
mask = siz.values[:, None] == cnt.values
gb.sum().where(mask)
Out[]:
b c d
a
1 3.0 3.0 NaN
2 NaN NaN NaN
Only NaN if all values in group are NaN:
cnt = gb.count()
gb.sum() * (cnt / cnt)
Out[]:
b c d
a
1 3.0 3.0 0.000000+2.000000j
2 3.0 NaN NaN
Corollary: mean of complex:
cnt = gb.count()
gb.sum() / cnt
Out[]:
b c d
a
1 1.5 1.5 0.000000+2.000000j
2 3.0 NaN NaN
I have a dataframe df with NaN values and I want to dynamically replace them with the average of the previous and next non-missing values.
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 NaN -2.027325 1.533582
4 NaN NaN 0.461821
5 -0.788073 NaN NaN
6 -0.916080 -0.612343 NaN
7 -0.887858 1.033826 NaN
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
For example, A[3] is NaN so its value should be (-0.120211-0.788073)/2 = -0.454142. A[4] then should be (-0.454142-0.788073)/2 = -0.621108.
Therefore, the result dataframe should look like:
In [27]: df
Out[27]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621108 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260202
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Is this a good way to deal with the missing values? I can't simply replace them by the average values of each column because my data is time-series and tends to increase over time. (The initial value may be $0 and final value might be $100000, so the average is $50000 which can be much bigger/smaller than the NaN values).
If you work out the logic behind this kind of averaging, it is a geometric progression:
s=df.isnull().cumsum()
t1=df[(s==1).shift(-1).fillna(False)].stack().reset_index(level=0,drop=True)
t2=df.lookup(s.idxmax()+1,s.idxmax().index)
df.fillna(t1/(2**s)+t2*(1-0.5**s)*2/2)
Out[212]:
A B C
0 -0.166919 0.979728 -0.632955
1 -0.297953 -0.912674 -1.365463
2 -0.120211 -0.540679 -0.680481
3 -0.454142 -2.027325 1.533582
4 -0.621107 -1.319834 0.461821
5 -0.788073 -0.966089 -1.260201
6 -0.916080 -0.612343 -2.121213
7 -0.887858 1.033826 -2.551718
8 1.948430 1.025011 -2.982224
9 0.019698 -0.795876 -0.046431
Explanation (x is the last value before the gap, y is the next non-missing value after it):
1st NaN: x/2 + y/2
2nd NaN: (1st)/2 + y/2
3rd NaN: (2nd)/2 + y/2
In general the nth NaN is x/2**n + (y/2)*(1 - (1/2)**n)/(1 - 1/2) = x/2**n + y*(1 - (1/2)**n), and this is the key.
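A quick numeric check of that closed form against the recursion, using column A of the example (x is the value before the gap, y the next non-missing value after it):
x, y = -0.120211, -0.788073

# recursion: each NaN is the average of the previously filled value and y
filled, prev = [], x
for _ in range(2):
    prev = (prev + y) / 2
    filled.append(prev)

# closed form for the nth NaN: x / 2**n + y * (1 - 0.5**n)
closed = [x / 2 ** n + y * (1 - 0.5 ** n) for n in (1, 2)]

print(filled)   # [-0.454142, -0.6211075]
print(closed)   # the same values (up to float rounding)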
I had a similar problem.
The following code worked for me:
def fill_nan_with_mean_from_prev_and_next(df):
    # positions of rows that contain at least one NaN
    nan_rows = np.flatnonzero(df.isnull().any(axis=1))
    null_df = df.isnull()
    for row in nan_rows:
        for col in range(df.shape[1]):
            if null_df.iloc[row, col]:
                # average of the previous and next row in the same column
                df.iloc[row, col] = (df.iloc[row - 1, col] + df.iloc[row + 1, col]) / 2
    return df
Maybe it helps someone too.
As Ben.T has mentioned above, if you have another group of NaNs in the same column, you can consider this lazy solution :)
for column in df:
    for ind, row in df[[column]].iterrows():
        if not np.isnan(row[column]):
            previous = row[column]
        else:
            # walk forward to the next non-missing value in this column
            indx = ind + 1
            while np.isnan(df.loc[indx, column]):
                indx += 1
            nxt = df.loc[indx, column]
            previous = df.loc[ind, column] = (previous + nxt) / 2