Hi, I have a dataframe with some missing values, for example:
The black numbers 40 and 50 are the values already entered, and the red ones should be auto-filled from the previous values. Row 2 stays blank because there is no previous number to fill from.
Any idea how I can do this efficiently? I was trying loops, but maybe there is a better way.
It can be done easily with the ffill method of pandas fillna.
To illustrate how it works, consider the following sample dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame()
df['Vals'] = [1, 2, 3, np.nan, np.nan, 6, 7, np.nan, 8]
Vals
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 6.0
6 7.0
7 NaN
8 8.0
To fill the missing values, do this:
df['Vals'] = df['Vals'].fillna(method='ffill')
Vals
0 1.0
1 2.0
2 3.0
3 3.0
4 3.0
5 6.0
6 7.0
7 7.0
8 8.0
There is a direct synonym function for this, pandas.DataFrame.ffill:
df['Vals'] = df['Vals'].ffill()
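As a self-contained sketch using this answer's sample column, ffill also takes a limit argument that caps how many consecutive NaNs are filled:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Vals': [1, 2, 3, np.nan, np.nan, 6, 7, np.nan, 8]})

# Plain forward fill: each NaN takes the last valid value above it.
filled = df['Vals'].ffill()

# limit=1 fills at most one NaN per gap, leaving longer gaps partly unfilled.
capped = df['Vals'].ffill(limit=1)

print(filled.tolist())  # [1.0, 2.0, 3.0, 3.0, 3.0, 6.0, 7.0, 7.0, 8.0]
print(capped.tolist())  # [1.0, 2.0, 3.0, 3.0, nan, 6.0, 7.0, 7.0, 8.0]
```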
Related
I am relatively new to Python and I am wondering how I can merge these two tables while preserving both their values.
Consider these two tables:
df = pd.DataFrame([[1, 3], [2, 4],[2.5,1],[5,6],[7,8]], columns=['A', 'B'])
A B
1 3
2 4
2.5 1
5 6
7 8
df2 = pd.DataFrame([[1],[2],[3],[4],[5],[6],[7],[8]], columns=['A'])
A
1
2
...
8
I want to obtain the following result:
A B
1 3
2 4
2.5 1
3 NaN
4 NaN
5 6
6 NaN
7 8
8 NaN
You can see that column A includes all values from both the first and second dataframe in an ordered manner.
I have attempted:
pd.merge(df,df2,how='outer')
pd.merge(df,df2,how='right')
But the former does not result in an ordered dataframe and the latter does not include rows that are unique to df.
Let us do concat, then drop_duplicates:
out = pd.concat([df2,df]).drop_duplicates('A',keep='last').sort_values('A')
Out[96]:
A B
0 1.0 3.0
1 2.0 4.0
2 2.5 1.0
2 3.0 NaN
3 4.0 NaN
3 5.0 6.0
5 6.0 NaN
4 7.0 8.0
7 8.0 NaN
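For what it's worth, the outer merge the question already tried also gets there; it only needs a sort step afterwards. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [2, 4], [2.5, 1], [5, 6], [7, 8]], columns=['A', 'B'])
df2 = pd.DataFrame([[i] for i in range(1, 9)], columns=['A'])

# An outer merge keeps rows unique to either frame; sorting restores the order.
out = pd.merge(df, df2, how='outer').sort_values('A').reset_index(drop=True)
print(out['A'].tolist())  # [1.0, 2.0, 2.5, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```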
This is what I have:
df=pd.DataFrame({'A':[1,2,3,4,5],'B':[6,np.nan,np.nan,3,np.nan]})
A B
0 1 6.0
1 2 NaN
2 3 NaN
3 4 3.0
4 5 NaN
I would like to extend non-missing values of B down into the missing values of B underneath them, so I get:
A B C
0 1 6.0 6.0
1 2 NaN 6.0
2 3 NaN 6.0
3 4 3.0 3.0
4 5 NaN 3.0
I tried something like this, and it worked last night:
for i in df.index:
    df['C'][i] = np.where(pd.isnull(df['B'].iloc[i]), df['C'][i-1], df['B'].iloc[i])
But when I ran it again this morning it complained that it didn't recognize 'C'. I couldn't identify the conditions under which it worked and didn't work.
Thanks!
You could use the pandas ffill() method to forward fill the missing values with the last non-null value (fillna(method='ffill') does the same thing, but the method= argument is deprecated in recent pandas). See the pandas documentation for more details.
import pandas as pd
import numpy as np
df = pd.DataFrame({
'A': [1, 2, 3, 4, 5],
'B': [6, np.nan, np.nan, 3, np.nan]
})
df['C'] = df['B'].ffill()  # same as fillna(method='ffill')
df
# A B C
# 0 1 6.0 6.0
# 1 2 NaN 6.0
# 2 3 NaN 6.0
# 3 4 3.0 3.0
# 4 5 NaN 3.0
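For completeness: the loop in the question fails on a fresh dataframe because it reads df['C'] before that column exists (it presumably "worked last night" only because a previous run had already created 'C'). If a loop is wanted anyway, a sketch that creates the column first:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [6, np.nan, np.nan, 3, np.nan]})

# Create 'C' up front so it can be read on the first iteration.
df['C'] = df['B']
for i in range(1, len(df)):
    if pd.isnull(df.loc[i, 'C']):
        df.loc[i, 'C'] = df.loc[i - 1, 'C']

print(df['C'].tolist())  # [6.0, 6.0, 6.0, 3.0, 3.0]
```

The vectorized ffill above is still the better choice; the loop is only shown to explain the failure.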
Let's say I have data like this:
>>> df = pd.DataFrame({'values': [5, np.nan, 2, 2, 2, 5, np.nan, 4, 5]})
>>> print(df)
values
0 5.0
1 NaN
2 2.0
3 2.0
4 2.0
5 5.0
6 NaN
7 4.0
8 5.0
I know that I can use fillna() with arguments such as fillna(method='ffill') to fill missing values with the previous value. Is there a way of writing a custom method for fillna? Let's say I want every NaN value to be replaced by the arithmetic mean of the previous 2 values and the next 2 values; how would I do that? (I am not saying that is a good method of filling the values, but I want to know if it can be done.)
Example for what the output would have to look like:
0 5.0
1 3.0
2 2.0
3 2.0
4 2.0
5 5.0
6 4.0
7 4.0
8 5.0
One way that reproduces the requested output is a centered rolling mean over a window of five rows (the two values before, the value itself, and the two values after; NaNs inside the window are ignored), used only to fill the gaps:
df['values'] = df['values'].fillna(df['values'].rolling(5, center=True, min_periods=1).mean())
print(df)
values
0 5.0
1 3.0
2 2.0
3 2.0
4 2.0
5 5.0
6 4.0
7 4.0
8 5.0
Note that averaging ffill() and bfill() is not quite the same thing: it looks only at the single nearest value on each side, which would give 3.5 and 4.5 here instead of 3.0 and 4.0.
To apply this over the whole dataframe, just replace df['values'] with df on both sides!
I have a pandas dataframe like
a b c
0 0.5 10 7
1 1.0 6 6
2 2.0 1 7
3 2.5 6 -5
4 3.5 9 7
and I would like to fill in the missing rows with respect to column 'a' on the basis of a certain step. In this case, given a step of 0.5, I would like to fill column 'a' with the missing values, that is 1.5 and 3.0, and set the other columns to NaN, in order to obtain the following result:
a b c
0 0.5 10.0 7.0
1 1.0 6.0 6.0
2 1.5 NaN NaN
3 2.0 1.0 7.0
4 2.5 6.0 -5.0
5 3.0 NaN NaN
6 3.5 9.0 7.0
Which is the cleanest way to do this with pandas or other libraries like numpy or scipy?
Thanks!
Create the array of wanted values with numpy.arange, make column a the index with set_index, then reindex with it, and finally reset_index:
step = 0.5
idx = np.arange(df['a'].min(), df['a'].max() + step, step)
df = df.set_index('a').reindex(idx).reset_index()
print(df)
a b c
0 0.5 10.0 7.0
1 1.0 6.0 6.0
2 1.5 NaN NaN
3 2.0 1.0 7.0
4 2.5 6.0 -5.0
5 3.0 NaN NaN
6 3.5 9.0 7.0
One simple way to achieve this is to first create the index you want and then merge the rest of the information onto it:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [0.5, 1, 2, 2.5, 3.5],
'b': [10, 6, 1, 6, 9],
'c': [7, 6, 7, -5, 7]})
ls = np.arange(df.a.min(), df.a.max() + 0.5, 0.5)  # + 0.5 because np.arange excludes the stop value
new_df = pd.DataFrame({'a':ls})
new_df = new_df.merge(df, on='a', how='left')
I tried several methods to replace the NaNs in one row with values from another row, but none of them worked as expected. Here is my dataframe:
test = pd.DataFrame(
{
"a": [1, 2, 3, 4, 5],
"b": [4, 5, 6, np.nan, np.nan],
"c": [7, 8, 9, np.nan, np.nan],
"d": [7, 8, 9, np.nan, np.nan]
}
)
a b c d
0 1 4.0 7.0 7.0
1 2 5.0 8.0 8.0
2 3 6.0 9.0 9.0
3 4 NaN NaN NaN
4 5 NaN NaN NaN
I need to replace the NaNs in the 4th row with the values from the first row, i.e.,
a b c d
0 1 **4.0 7.0 7.0**
1 2 5.0 8.0 8.0
2 3 6.0 9.0 9.0
3 4 **4.0 7.0 7.0**
4 5 NaN NaN NaN
And the second question: how can I multiply some of the values in a row by a number? For example, I need to double the values in the second row for the columns ['b', 'c', 'd'], so that the result is:
a b c d
0 1 4.0 7.0 7.0
1 2 **10.0 16.0 16.0**
2 3 6.0 9.0 9.0
3 4 NaN NaN NaN
4 5 NaN NaN NaN
First of all, I suggest you do some reading on Indexing and selecting data in pandas.
Regarding the first question, you can use .loc together with isnull() to perform boolean indexing on the column values:
mask_nans = test.loc[3,:].isnull()
test.loc[3, mask_nans] = test.loc[0, mask_nans]
And to double the values you can directly multiply the sliced dataframe by 2, also using .loc:
test.loc[1,'b':] *= 2
a b c d
0 1 4.0 7.0 7.0
1 2 10.0 16.0 16.0
2 3 6.0 9.0 9.0
3 4 4.0 7.0 7.0
4 5 NaN NaN NaN
Indexing with labels
If you wish to filter by a, and a values are unique, consider making it your index to simplify your logic and make it more efficient:
test = test.set_index('a')
test.loc[4] = test.loc[4].fillna(test.loc[1])
test.loc[2] *= 2
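A quick, self-contained check of the label-based version against the question's test frame (this assumes the values of a are unique):

```python
import pandas as pd
import numpy as np

test = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [4, 5, 6, np.nan, np.nan],
    "c": [7, 8, 9, np.nan, np.nan],
    "d": [7, 8, 9, np.nan, np.nan],
}).set_index('a')

# Fill the NaNs in the row labelled 4 from the row labelled 1.
test.loc[4] = test.loc[4].fillna(test.loc[1])

# Double the row labelled 2; 'a' is the index, so it is left untouched.
test.loc[2] *= 2

print(test.loc[4].tolist())  # [4.0, 7.0, 7.0]
print(test.loc[2].tolist())  # [10.0, 16.0, 16.0]
```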
Boolean masks
If a is not unique and Boolean masks are required, you can still use fillna with an additional step:
mask = test['a'].eq(4)
test.loc[mask] = test.loc[mask].fillna(test.loc[test['a'].eq(1).idxmax()])
test.loc[test['a'].eq(2), 'b':] *= 2
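And a self-contained sketch of the Boolean-mask variant (idxmax picks the label of the first row where a == 1, and the column slice 'b': keeps a itself from being doubled):

```python
import pandas as pd
import numpy as np

test = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [4, 5, 6, np.nan, np.nan],
    "c": [7, 8, 9, np.nan, np.nan],
    "d": [7, 8, 9, np.nan, np.nan],
})

# Fill the NaNs in rows where a == 4 from the first row where a == 1;
# fillna with a Series aligns its index against the frame's columns.
mask = test['a'].eq(4)
test.loc[mask] = test.loc[mask].fillna(test.loc[test['a'].eq(1).idxmax()])

# Double columns 'b' onward in rows where a == 2.
test.loc[test['a'].eq(2), 'b':] *= 2

print(test.loc[test['a'].eq(4), 'b':].values.tolist())  # [[4.0, 7.0, 7.0]]
print(test.loc[test['a'].eq(2), 'b':].values.tolist())  # [[10.0, 16.0, 16.0]]
```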