I am filling the missing values in a dataframe with another column from the same dataframe, but I can't understand the behaviour.
0 1 2
0 NaN 0.076733 0.378676
1 NaN 0.223911 NaN
2 NaN 0.173071 0.534397
3 NaN 0.991686 0.381196
4 0.088309 0.237683 0.003508
5 0.751860 0.494204 0.757413
6 0.630420 0.192947 0.538492
I am filling the NaNs in columns 0 and 2 with the column 1 series.
df.fillna(df[1])
I would expect it to align on the row index, but instead it fills each column with a single scalar taken from the series rather than with the series itself.
0 1 2
0 0.076733 0.076733 0.378676
1 0.076733 0.223911 0.173071
2 0.076733 0.173071 0.534397
3 0.076733 0.991686 0.381196
4 0.088309 0.237683 0.003508
5 0.751860 0.494204 0.757413
6 0.630420 0.192947 0.538492
Edit:
I would expect the output to look like this:
0 1 2
0 0.076733 0.076733 0.378676
1 0.223911 0.223911 0.223911
2 0.173071 0.173071 0.534397
3 0.991686 0.991686 0.381196
4 0.088309 0.237683 0.003508
5 0.751860 0.494204 0.757413
6 0.630420 0.192947 0.538492
Can somebody please help explain what's going on here?
Re-edit:
I found a way to make pandas do what I want: passing a dictionary keyed by column, which seems quite verbose.
df.fillna({0:df[1],2:df[1]})
It's filling your NA's according to the first column (column 0, a.k.a. df[0]): when you pass a Series to DataFrame.fillna, the Series' index is matched against the DataFrame's column labels, so every NaN in a given column is filled with the single value the Series holds at that column's label. In the result of df.fillna(df[0]) below, the starred values in column 0 are the fill values used for columns 0, 1 and 2 respectively.
0 1 2
0 *0.895575 0.522721 0.012833
1 **0.522721 0.522721 0.012833
2 ***0.012833 0.522721 0.558843
3 0.258442 0.522721 0.772859
4 0.900045 0.026117 0.720966
5 0.913345 0.677905 0.501755
6 0.907725 0.080543 0.881279
So if you had NA's in the first column, they would be replaced by the value in the first row of df[0]. In this example, that value would be 0.895575 (*).
For NA's in your second column (df[1]), it uses the second row of the column you specified (df[0], the first column), so all of them are filled with 0.522721 (**).
For NA's in your third column (df[2]), it uses the third row of the column you specified (df[0], the first column), so all of them are filled with 0.012833 (***).
Hope this helps!
Edit: I suspect geekay's solution will accomplish what you had intended:
df.fillna(method='ffill', axis=1)
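If the goal is the row-aligned fill from the question (every column's NaNs taken from df[1] at the same row), a less verbose sketch than the dictionary is to fill column by column, since Series.fillna, unlike DataFrame.fillna, aligns on the row index. This is only a sketch of mine, not part of the original answers:
# apply passes each column in as a Series, and Series.fillna(df[1])
# aligns df[1] on the row index rather than on the column labels
out = df.apply(lambda col: col.fillna(df[1]))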
Specify method and axis. This will do it:
import pandas as pd

df = pd.DataFrame({0: [1, 2, 3, 4, 5, 6],
                   1: [None, None, None, None, 5, 6],
                   2: [None, None, 3, 4, 5, 6]})
# fillna returns a new frame, so assign it back before printing
df = df.fillna(method='ffill', axis=1)
print(df)
0 1 2
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
3 4.0 4.0 4.0
4 5.0 5.0 5.0
5 6.0 6.0 6.0
For arbitrary columns:
import numpy as np

df[1] = np.where(df[1].isnull(), df[0], df[1])
df[2] = np.where(df[2].isnull(), df[0], df[2])
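Equivalently, without numpy, since Series.fillna aligns on the row index (a sketch):
df[1] = df[1].fillna(df[0])  # fill column 1's NaNs from column 0, row by row
df[2] = df[2].fillna(df[0])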
I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill, as the rule is dynamic: take the last non-NaN value, divide it by the number of consecutive NaNs plus one, and write that result into both the original row and the NaN rows. For example, rows 3 and 4 should both become 12 (24/2), and rows 6, 7 and 8 should become 5 (15/3). All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out):
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 20.0
5 6 5.0
6 7 5.0
7 8 5.0
8 9 2.0
Explanation and intermediate values:
Basically, look for the rows whose own value is not NaN but whose next or previous value is NaN. Those rows form the first row of each group.
So the m in the above code looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows shaped like [True, <all False>], because those are the runs I want to average over. m.cumsum() assigns each such run its own group number.
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what the groups are.
Now, for each group, take the mean if the group has any NaN value (checked with x.isna().any()); otherwise keep the group as is. Filling the NaNs with 0 first means fillna(0).mean() is simply the group's sum divided by its size, which is exactly the question's rule of value / (number of consecutive NaNs + 1). That is what the lambda does:
lambda x: x.fillna(0).mean() if x.isna().any() else x
Why not use interpolate? It has a method argument that would probably fit your needs.
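For example, a minimal sketch on the edited data from the question, using the default linear method:
import pandas as pd
import numpy as np

s = pd.Series([10, 12, 24, np.nan, 20, 15, np.nan, np.nan, 2])
print(s.interpolate())  # the NaN between 24 and 20 becomes 22.0, and the
                        # gap between 15 and 2 is filled in even steps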
However, if you really want to do exactly as you described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job.)
import pandas as pd
import numpy as np

df = pd.DataFrame([10,
                   12,
                   24,
                   np.nan,
                   15,
                   np.nan,
                   np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        # walk forward over the run of NaNs that follows this row
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if local_idx - idx > 0:
            # split this row's value evenly across itself and the NaNs
            fillvalue = df.at[idx, col] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.at[fillidx, col] = fillvalue

print(df)
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0
df1 = pd.DataFrame(np.arange(15).reshape(5,3))
df1.iloc[:4,1] = np.nan
df1.iloc[:2,2] = np.nan
df1.dropna(thresh=1, axis=1)
It seems that nothing has been dropped:
0 1 2
0 0 NaN NaN
1 3 NaN NaN
2 6 NaN 8.0
3 9 NaN 11.0
4 12 13.0 14.0
If I run
df1.dropna(thresh=2, axis=1)
why does it give the following?
0 2
0 0 NaN
1 3 NaN
2 6 8.0
3 9 11.0
4 12 14.0
I just don't understand what thresh is doing here. If a column has more than one NaN value, shouldn't the column be dropped?
thresh=N requires that a column have at least N non-NaNs to survive. In the first example, every column has at least one non-NaN, so all three survive. In the second example, column 1 has only one non-NaN (13.0), so it is dropped; columns 0 and 2 each have at least two non-NaNs and survive.
Try setting thresh to 4 to get a better sense of what's happening.
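For instance, with the df1 above (column 0 has five non-NaNs, column 1 has one, column 2 has three), a threshold of 4 keeps only column 0; the output should look something like:
df1.dropna(thresh=4, axis=1)
#     0
# 0   0
# 1   3
# 2   6
# 3   9
# 4  12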
The thresh parameter sets the minimum number of non-NaN values an axis label needs in order not to be dropped. By default (axis=0) it counts per row; with axis=1 it counts per column.
This searches along each column and keeps it if it has at least 1 non-NaN value:
df1.dropna(thresh=1, axis=1)
Column 1 has only one non-NaN value (13.0), but thresh=2 needs at least 2, so that column fails the test and is dropped:
df1.dropna(thresh=2, axis=1)
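As a further illustration, without axis=1 the same threshold logic runs per row; here only the last row has at least three non-NaNs, so the output should be something like:
df1.dropna(thresh=3)
#     0     1     2
# 4  12  13.0  14.0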
wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]}
I want to delete the row with 'hhh', because all the other data in 'a' are numbers.
The original data size is huge. Thank you very much.
Option 1
Convert a using pd.to_numeric
df.a = pd.to_numeric(df.a, errors='coerce')
df
a b
0 NaN 1.0
1 2.0 2.0
2 3.0 NaN
3 4.0 NaN
4 5.0 5.0
Non-numeric values are coerced to NaN. You can then drop those rows -
df.dropna(subset=['a'])
a b
1 2.0 2.0
2 3.0 NaN
3 4.0 NaN
4 5.0 5.0
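If you would rather not overwrite column a, a sketch that uses the coerced series only as a boolean mask:
# keep rows where 'a' parses as a number, leaving 'a' itself untouched
df[pd.to_numeric(df.a, errors='coerce').notnull()]

#    a    b
# 1  2  2.0
# 2  3  NaN
# 3  4  NaN
# 4  5  5.0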
Option 2
Another alternative is using str.isdigit -
df.a.str.isdigit()
0 False
1 NaN
2 NaN
3 NaN
4 NaN
Name: a, dtype: object
Filter as such -
df[df.a.str.isdigit().isnull()]
a b
1 2 2.0
2 3 NaN
3 4 NaN
4 5 5.0
Notes -
This won't work for float columns
If the numbers are themselves stored as strings, then drop the isnull bit -
df[df.a.str.isdigit()]
import pandas as pd
import numpy as np
wu=pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})
#wu = wu[wu.a.str.contains(r'\d+', na=False)]
#wu = wu[wu.a.apply(lambda x: x.isnumeric())]
wu = wu[wu.a.apply(lambda x: isinstance(x, (int, np.int64)))]
print(wu)
Note that you missed a closing parenthesis when creating your DataFrame.
I tried three ways, but only the third one worked. You can always try the other ones (commented out) if they work for you. Do let me know if it works on the larger dataset.
df = pd.DataFrame({'a':['hhh',2,3,4,5],'b':[1,2,np.nan,np.nan,5]})
df.drop(df[df['a'].apply(type) != int].index, inplace=True)
If you just want to view the rows that would be dropped:
df.loc[df['a'].apply(type) != int, :]
I have a dataframe like the one given below: the survey question is on the top row, the element names are below it, and the respondents' rankings are underneath. I am trying to count how many times each rank appears under each element and then transpose the data, so that the ranks become the column headers and the counts sit underneath each rank. I have tried multiple pandas approaches, such as
df.eq('1').sum(axis=1)
df2=df.transpose
but I am not getting the desired output.
how would you rank these items on a scale of 1-5
X Y Z
1 2 1
2 1 3
3 1 1
1 3 2
1 1 2
2 5 3
4 1 2
1 4 4
3 3 5
The desired output is something like:
1 2 3 4 5
X (count of 1s)(count of 2s).....so on
Y (count of 1s)(count of 2s).......
Z (count of 1s)(count of 2s)............
Any help would really mean a lot.
You can apply pd.value_counts to all columns, which counts the values in each column, and then transpose the result:
df.apply(pd.value_counts).fillna(0).T
# 1 2 3 4 5
#X 4.0 2.0 2.0 1.0 0.0
#Y 4.0 1.0 2.0 1.0 1.0
#Z 2.0 3.0 2.0 1.0 1.0
Option 0
pd.concat
pd.concat({c: s.value_counts() for c, s in df.iteritems()}).unstack(fill_value=0)
Option 1
stack preserves int dtype
df.stack().groupby(level=1).apply(
pd.value_counts
).unstack(fill_value=0)
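For the sample data, this should give an integer-valued result along the lines of:
#    1  2  3  4  5
# X  4  2  2  1  0
# Y  4  1  2  1  1
# Z  2  3  2  1  1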
In my data sets (train, test), max_floor values are null for some records. I am trying to fill the null values with the mode of the max_floor values of apartments that share the same apartment name:
for t in full.apartment_name.unique():
for df in frames:
df['max_floor'].fillna((df.loc[df["apartment_name"]==t,
'max_floor']).mode, inplace=True)
where full is train.append(test)
and frames is [train, test].
The above code runs without errors, but it fills all the null max_floor values with the text below:
<bound method Series.mode of 0        NaN
1084     NaN
23278    9.0
Name: max_floor, dtype: float64>
I just wanted to replace the above text with just the max_floor values. Any help would be appreciated.
mode() is a function and you've referred to it but not invoked it.
Change mode to mode()
You need to access the first value from the mode() result. For example:
A B
0 1 3.0
1 2 NaN
2 2 NaN
3 3 NaN
Fill missing values with the mode of column A:
df.fillna(df['A'].mode()[0])
Output:
A B
0 1 3.0
1 2 2.0
2 2 2.0
3 3 2.0
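Putting the two together for the per-apartment fill in the original question, a groupby/transform sketch (mine, assuming the train and test frames from the question; the notna guard skips apartments whose max_floor values are all NaN, since their mode is empty):
for df in [train, test]:
    # within each apartment_name group, replace NaNs with that group's mode
    df['max_floor'] = df.groupby('apartment_name')['max_floor'].transform(
        lambda s: s.fillna(s.mode()[0]) if s.notna().any() else s
    )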