I want to fill the missing gender values in a data set in proportion.
I used a boolean index with the head/tail functions to select the top rows I wanted, but when I used the fillna function it had no effect; it only works without the boolean index. In the example below, how can I select the first 3 NaN values and fill them with 0?
a = pd.DataFrame(np.random.randn(50).reshape((10,5)))
a[0][[1,3,4,6,9]] = np.nan
a[0][a[0].isnull()].head(3).fillna(value = '0', inplace = True)
The DataFrame's NaN values were not filled.
You should use loc; otherwise you never actually assign the value, because chained indexing operates on a copy. You also need the index labels of the null entries, not the first labels of the whole index. Here is what you could do:
a.loc[a[0][a[0].isnull()].index[:3], 0] = 0
In [1]: a
Out[1]:
0 1 2 3 4
0 0.786182 -0.474953 -0.285881 -0.285502 -0.541957
1 0.000000 0.648042 1.104871 1.237512 -0.156453
2 -1.327987 1.851947 -0.522366 0.631213 -0.960167
3 0.000000 0.561111 -0.945439 -1.414678 0.433246
4 0.000000 -1.463828 0.141122 1.468288 0.649452
5 1.554890 -0.411142 -1.162577 -0.186640 0.774959
6 NaN -0.061410 -0.312673 -1.324719 1.763257
7 0.587035 0.265302 -0.793746 -0.148613 0.059330
8 0.909685 1.169786 -1.289559 -0.090185 -0.024272
9 NaN 0.606329 -0.806034 1.102597 0.820976
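For reference, the original attempt fails because head(3) returns a new Series, so the inplace fillna fills that temporary copy and the original DataFrame is never touched. A minimal sketch of the effect:
subset = a[0][a[0].isnull()].head(3)  # independent copy, not a view into a
subset.fillna(0, inplace=True)        # fills only the copy
a[0].isnull().sum()                   # the original still has all 5 NaNs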
Starting with data:
a = pd.DataFrame(np.random.randn(50).reshape((10,5)))
a[0][[1,3,4,6,9]] = np.nan
gives
0 1 2 3 4
0 -0.388759 -0.660923 0.385984 0.933920 0.164083
1 NaN -0.996237 -0.384492 0.191026 -1.168100
2 -0.773971 0.453441 -0.543590 0.768267 -1.127085
3 NaN -1.051186 -2.251681 -0.575438 1.642082
4 NaN 0.123432 1.063412 -1.556765 0.839855
5 -1.678960 -1.617817 -1.344757 -1.469698 0.276604
6 NaN -0.813213 -0.077575 -0.064179 1.960611
7 1.256771 -0.541197 -1.577126 -1.723853 0.028666
8 0.236197 0.868503 -1.304098 -1.578005 -0.632721
9 NaN -0.227659 -0.857427 0.010257 -1.884986
Now you want to work on column zero, so we use fillna with a limit of 3 and replace that column in place. With no fill method given, limit caps the total number of NaNs filled along the column, so only the first three NaNs are replaced:
a[0].fillna(0, inplace=True, limit=3)
gives
0 1 2 3 4
0 -0.388759 -0.660923 0.385984 0.933920 0.164083
1 0.000000 -0.996237 -0.384492 0.191026 -1.168100
2 -0.773971 0.453441 -0.543590 0.768267 -1.127085
3 0.000000 -1.051186 -2.251681 -0.575438 1.642082
4 0.000000 0.123432 1.063412 -1.556765 0.839855
5 -1.678960 -1.617817 -1.344757 -1.469698 0.276604
6 NaN -0.813213 -0.077575 -0.064179 1.960611
7 1.256771 -0.541197 -1.577126 -1.723853 0.028666
8 0.236197 0.868503 -1.304098 -1.578005 -0.632721
9 NaN -0.227659 -0.857427 0.010257 -1.884986
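One caveat: under copy-on-write (optional in pandas 2.x and the default behaviour in pandas 3), a chained inplace call like the one above may no longer write back into a. Assigning the result avoids that; a sketch:
a[0] = a[0].fillna(0, limit=3)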
May I know how to ignore NaN when performing a rolling mean on a df?
For example, given a df, perform a rolling mean on column a, but ignore the NaN. This requirement should produce something like:
a avg
0 6772.0 7508.00
1 7182.0 8400.50
2 8570.0 9049.60
3 11078.0 10380.40
4 11646.0 11180.00
5 13426.0 12050.00
6 NaN NaN
7 17514.0 19350.00
8 18408.0 20142.50
9 22128.0 20142.50
10 22520.0 21018.67
11 NaN NaN
12 26164.0 27796.67
13 26590.0 21627.25
14 30636.0 23735.00
15 3119.0 25457.00
16 32166.0 25173.75
17 34774.0 23353.00
However, I don't know which part of this line should be tweaked to get the above expected output:
df['a'].rolling(2 * w + 1, center=True, min_periods=1).mean()
Currently, the following code:
import numpy as np
import pandas as pd
arr=[[6772],[7182],[8570],[11078],[11646],[13426],[np.nan],[17514],[18408],
[22128],[22520],[np.nan],[26164],[26590],[30636],[3119],[32166],[34774]]
df=pd.DataFrame(arr,columns=['a'])
w = 2
df['avg'] = df['a'].rolling(2 * w + 1, center=True, min_periods=1).mean()
produces the following:
a avg
0 6772.0 7508.00
1 7182.0 8400.50
2 8570.0 9049.60
3 11078.0 10380.40
4 11646.0 11180.00
5 13426.0 13416.00 <<<
6 NaN 15248.50 <<<
7 17514.0 17869.00 <<<
8 18408.0 20142.50
9 22128.0 20142.50
10 22520.0 22305.00 <<<
11 NaN 24350.50 <<<
12 26164.0 26477.50 <<<
13 26590.0 21627.25
14 30636.0 23735.00
15 3119.0 25457.00
16 32166.0 25173.75
17 34774.0 23353.00
<<< indicate where the values are different than the expected output
Update:
Adding fillna:
df['avg'] = df['a'].fillna(value=0).rolling(2 * w + 1, center=True, min_periods=1).mean()
does not produce the expected output:
a avg
0 6772.0 7508.00
1 7182.0 8400.50
2 8570.0 9049.60
3 11078.0 10380.40
4 11646.0 8944.00
5 13426.0 10732.80
6 NaN 12198.80
7 17514.0 14295.20
8 18408.0 16114.00
9 22128.0 16114.00
10 22520.0 17844.00
11 NaN 19480.40
12 26164.0 21182.00
13 26590.0 17301.80
14 30636.0 23735.00
15 3119.0 25457.00
16 32166.0 25173.75
17 34774.0 23353.00
12050 = (11078 + 11646 + 13426) / 3
IIUC, you want to restart the rolling when a NaN is met. One way would be to use pandas.DataFrame.groupby:
m = df.isna().any(axis=1)
df["avg"] = (df["a"].groupby(m.cumsum())
                    .rolling(2 * w + 1, center=True, min_periods=1).mean()
                    .reset_index(level=0, drop=True))
df["avg"] = df["avg"][~m]
Output:
a avg
0 6772.0 7508.000000
1 7182.0 8400.500000
2 8570.0 9049.600000
3 11078.0 10380.400000
4 11646.0 11180.000000
5 13426.0 12050.000000
6 NaN NaN
7 17514.0 19350.000000
8 18408.0 20142.500000
9 22128.0 20142.500000
10 22520.0 21018.666667
11 NaN NaN
12 26164.0 27796.666667
13 26590.0 21627.250000
14 30636.0 23735.000000
15 3119.0 25457.000000
16 32166.0 25173.750000
17 34774.0 23353.000000
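The trick is that the cumulative sum of the NaN mask gives each NaN-free segment its own group id, so the rolling window never crosses a NaN; the NaN rows themselves are then blanked out again by the ~m mask. A minimal sketch of the grouping:
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])
m = s.isna()
print(m.cumsum().tolist())  # [0, 0, 1, 1, 1]: rows 0-1 form group 0, rows 2-4 form group 1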
My Problem
I have a loop that creates a value for x in time period t based on x in time period t-1. The loop is really slow, so I wanted to try to vectorize it. I tried to use np.where with shift() but had no joy. Any idea how I might get around this problem?
Thanks!
My Code
import numpy as np
import pandas as pd
csv1 = pd.read_csv('y_list.csv', delimiter = ',')
df = pd.DataFrame(csv1)
df.loc[df.index[0], 'var'] = 0
for x in range(1, len(df.index)):
    if df["LAST"].iloc[x] > 0:
        df["var"].iloc[x] = ((df["var"].iloc[x - 1] * 2) + df["LAST"].iloc[x]) / 3
    else:
        df["var"].iloc[x] = (df["var"].iloc[x - 1] * 2) / 3
df
Input Data
Dates,LAST
03/09/2018,-7
04/09/2018,5
05/09/2018,-4
06/09/2018,5
07/09/2018,-6
10/09/2018,6
11/09/2018,-7
12/09/2018,7
13/09/2018,-9
Output
Dates,LAST,var
03/09/2018,-7,0.000000
04/09/2018,5,1.666667
05/09/2018,-4,1.111111
06/09/2018,5,2.407407
07/09/2018,-6,1.604938
10/09/2018,6,3.069959
11/09/2018,-7,2.046639
12/09/2018,7,3.697759
13/09/2018,-9,2.465173
You are looking for ewm:
arg = df.LAST.clip(lower=0)
arg.iloc[0] = 0
arg.ewm(alpha=1/3, adjust=False).mean()
Output:
0 0.000000
1 1.666667
2 1.111111
3 2.407407
4 1.604938
5 3.069959
6 2.046639
7 3.697759
8 2.465173
Name: LAST, dtype: float64
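With adjust=False, ewm applies the recurrence y[t] = (1 - alpha) * y[t-1] + alpha * x[t], so alpha = 1/3 gives exactly (2 * y[t-1] + x[t]) / 3, which is the loop's update. A quick self-contained check (a sketch that rebuilds the question's input inline):
import numpy as np
import pandas as pd

# Rebuild the LAST column from the question
last = pd.Series([-7, 5, -4, 5, -6, 6, -7, 7, -9], dtype=float)
arg = last.clip(lower=0)
arg.iloc[0] = 0
ewm_result = arg.ewm(alpha=1/3, adjust=False).mean()

# Reference recurrence from the question's loop
var = np.zeros(len(last))
for i in range(1, len(last)):
    var[i] = (2 * var[i - 1] + max(last.iloc[i], 0)) / 3

assert np.allclose(ewm_result.to_numpy(), var)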
You can use df.shift to shift the dataframe by a default of 1 row, and convert the if-else block into a vectorized np.where. Note that this reads each row's previous var value from the already-computed column, so it reproduces one step of the recurrence per row (the small differences below come from the var column being rounded to one decimal place):
In [36]: df
Out[36]:
Dates LAST var
0 03/09/2018 -7 0.0
1 04/09/2018 5 1.7
2 05/09/2018 -4 1.1
3 06/09/2018 5 2.4
4 07/09/2018 -6 1.6
5 10/09/2018 6 3.1
6 11/09/2018 -7 2.0
7 12/09/2018 7 3.7
8 13/09/2018 -9 2.5
In [37]: (df.shift(1)['var']*2 + np.where(df['LAST']>0, df['LAST'], 0)) / 3
Out[37]:
0 NaN
1 1.666667
2 1.133333
3 2.400000
4 1.600000
5 3.066667
6 2.066667
7 3.666667
8 2.466667
Name: var, dtype: float64
How can I count, for each row, the values greater than a specific value in pandas?
For example, I have a Pandas DataFrame dff. I want to count the row values greater than 0.
dff = pd.DataFrame(np.random.randn(9,3),columns=['a','b','c'])
dff
a b c
0 -0.047753 -1.172751 0.428752
1 -0.763297 -0.539290 1.004502
2 -0.845018 1.780180 1.354705
3 -0.044451 0.271344 0.166762
4 -0.230092 -0.684156 -0.448916
5 -0.137938 1.403581 0.570804
6 -0.259851 0.589898 0.099670
7 0.642413 -0.762344 -0.167562
8 1.940560 -1.276856 0.361775
I am using the inefficient approach below. How can this be done more efficiently?
dff['count'] = 0
for m in range(len(dff)):
    og = 0
    for i in dff.columns:
        if dff[i][m] > 0:
            og += 1
    dff['count'][m] = og
dff
a b c count
0 -0.047753 -1.172751 0.428752 1
1 -0.763297 -0.539290 1.004502 1
2 -0.845018 1.780180 1.354705 2
3 -0.044451 0.271344 0.166762 2
4 -0.230092 -0.684156 -0.448916 0
5 -0.137938 1.403581 0.570804 2
6 -0.259851 0.589898 0.099670 2
7 0.642413 -0.762344 -0.167562 1
8 1.940560 -1.276856 0.361775 2
You can create a boolean mask of your DataFrame, that is True wherever a value is greater than your threshold (in this case 0), and then use sum along the first axis.
dff.gt(0).sum(axis=1)
0 1
1 1
2 2
3 2
4 0
5 2
6 2
7 1
8 2
dtype: int64
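To attach this as the count column from the question, assignment works the same way; a sketch (selecting the value columns explicitly so an existing count column is not included in the comparison):
dff['count'] = dff[['a', 'b', 'c']].gt(0).sum(axis=1)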
I have two data frames as below:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.161456 0.033139 0.991840 2.111023 0.846197
1 1 10 0.636140 1.024235 36.333741 16.074662 3.142135
2 1 13 0.605840 0.034337 2.085061 2.125908 0.069698
3 1 14 0.038481 0.152382 4.608259 4.960007 0.162162
4 1 5 0.035628 0.087637 1.397457 0.768467 0.052605
5 1 6 0.114375 0.020196 0.220193 7.662065 0.077727
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 1 1 0.305224 0.542488 66.428382 73.615079 10.342252
1 1 10 0.814696 1.246165 73.802644 58.064363 11.179206
2 1 13 0.556437 0.517383 50.555948 51.913547 9.412299
3 1 14 0.314058 1.148754 56.165767 61.261950 9.142128
4 1 5 0.499129 0.460813 40.182454 41.770906 8.263437
5 1 6 0.300203 0.784065 47.359506 52.841821 9.833513
I want to divide the numerical values in the selected cells of the first by the second and I am using the following code:
df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
However, this way I lose the information from the column "Sample_name".
C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
0 0.528977 0.061088 0.014931 0.028677 0.081819
1 0.780831 0.821909 0.492309 0.276842 0.281070
2 1.088785 0.066367 0.041243 0.040951 0.007405
3 0.122529 0.132650 0.082047 0.080964 0.017738
4 0.071381 0.190178 0.034778 0.018397 0.006366
5 0.380993 0.025759 0.004649 0.145000 0.007904
How can I perform the division while keeping the column "Sample_name" in the resulting dataframe?
You can selectively overwrite using loc, the same way that you're already performing the division:
df1_int.loc[:,'C14-Cer':] = df1_int.loc[:,'C14-Cer':].div(df2.loc[:,'C14-Cer':])
This preserves the Sample_name column:
In [12]:
df.loc[:,'C14-Cer':] = df.loc[:,'C14-Cer':].div(df1.loc[:,'C14-Cer':])
df
Out[12]:
Sample_name C14-Cer C16-Cer C18-Cer C18:1-Cer C20-Cer
index
0 1 1 0.528975 0.061087 0.014931 0.028677 0.081819
1 1 10 0.780831 0.821910 0.492309 0.276842 0.281070
2 1 13 1.088785 0.066367 0.041243 0.040951 0.007405
3 1 14 0.122528 0.132650 0.082047 0.080964 0.017738
4 1 5 0.071380 0.190179 0.034778 0.018397 0.006366
5 1 6 0.380992 0.025758 0.004649 0.145000 0.007904
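If you would rather leave df1_int untouched, the same idea works on a copy (a sketch using the frames above):
result = df1_int.copy()
num_cols = result.loc[:, 'C14-Cer':].columns
result[num_cols] = result[num_cols].div(df2[num_cols])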
Let's say I want to construct a dummy variable that is true if a number is between 1 and 10. I can do:
df['numdum'] = df['number'].isin(range(1,11))
Is there a way to do that for a continuous interval? So, create a dummy variable that is true if a number is in a range, allowing for non-integers.
Series objects (including dataframe columns) have a between method:
>>> s = pd.Series(np.linspace(0, 20, 8))
>>> s
0 0.000000
1 2.857143
2 5.714286
3 8.571429
4 11.428571
5 14.285714
6 17.142857
7 20.000000
dtype: float64
>>> s.between(1, 14.5)
0 False
1 True
2 True
3 True
4 True
5 True
6 False
7 False
dtype: bool
This works (and is equivalent to df.number.between(1, 10), since between is inclusive on both ends by default):
df['numdum'] = (df.number >= 1) & (df.number <= 10)
You could also do the same thing with cut(). No real advantage if there are just two categories:
>>> df['numdum'] = pd.cut(df['number'], [-99, 10, 99], labels=[1, 0])
number numdum
0 8 1
1 9 1
2 10 1
3 11 0
4 12 0
5 13 0
6 14 0
But it's nice if you have multiple categories:
>>> df['numdum'] = pd.cut(df['number'], [-99, 8, 10, 99], labels=[1, 2, 3])
number numdum
0 8 1
1 9 2
2 10 2
3 11 3
4 12 3
5 13 3
6 14 3
The labels can be True and False if that is preferred, or you can omit labels entirely, in which case each entry is labeled with its interval's cutoff points.
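For example, with the same bins and no labels (a sketch; the default bins are right-closed intervals):
>>> pd.cut(df['number'], [-99, 8, 10, 99])
0    (-99, 8]
1     (8, 10]
2     (8, 10]
3    (10, 99]
4    (10, 99]
5    (10, 99]
6    (10, 99]
Name: number, dtype: category
Categories (3, interval[int64, right]): [(-99, 8] < (8, 10] < (10, 99]]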