I have a large dataframe that looks similar to this:
As you can tell, there are plenty of blanks. I want to propagate non-null values forward (so, for example, in the first row 1029 goes to the 1963.02.12 column, between 1029 and 1043), but only up to the last entry; that is, propagation should stop at the last non-null value in each row (for D that would be the 1992.03.23 column, but for A it'd be 1963.09.21, just outside the screenshot).
Is there a quicker way to achieve this without fiddling around with df.fillna(method='ffill', limit=x)? My original idea was to remember the date of the last entry, propagate values to the end of the row, and then fill the row with nulls after the saved date. I've been wondering if there is a cleverer method to achieve the same result.
This might not be very performant; I couldn't come up with a pure-pandas solution (not that one would guarantee performance anyway!).
>>> df
a b c d e
0 0.0 NaN NaN 1.0 NaN
1 0.0 1.0 NaN 2.0 3.0
2 NaN 1.0 2.0 NaN 4.0
What happens if we just ffill everything?
>>> df.ffill(axis=1)
a b c d e
0 0.0 0.0 0.0 1.0 1.0
1 0.0 1.0 1.0 2.0 3.0
2 NaN 1.0 2.0 2.0 4.0
We need to go back and add NaNs for the last null column in each row:
>>> new_data = []
>>> for _, row in df.iterrows():
...     new_row = row.ffill()
...     null_columns = [col for col, is_null in zip(row.index, row.isnull().values) if is_null]
...     # replace the value in the last null column with NaN
...     if null_columns:
...         last_null_column = null_columns[-1]
...         new_row.loc[last_null_column] = np.nan  # .loc instead of the removed .ix
...     new_data.append(new_row.to_dict())
...
>>> new_df = pd.DataFrame.from_records(new_data)
>>> new_df
a b c d e
0 0.0 0.0 0.0 1.0 NaN
1 0.0 1.0 NaN 2.0 3.0
2 NaN 1.0 2.0 NaN 4.0
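For what it's worth, here's a vectorized sketch of the same idea: ffill everything, then use bfill to find the cells past each row's last non-null value and mask them back out. Unlike the loop above, this fills every interior gap and leaves only the trailing NaNs empty, which I believe is what the question actually asks for:
>>> filled = df.ffill(axis=1)
>>> trailing = df.bfill(axis=1).isnull()  # True only past the last non-null value
>>> filled.mask(trailing)
     a    b    c    d    e
0  0.0  0.0  0.0  1.0  NaN
1  0.0  1.0  1.0  2.0  3.0
2  NaN  1.0  2.0  2.0  4.0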
I am trying to set values to a DataFrame for a specific subset of a multi-index and instead of the values being set I am just getting NaN values.
Here is an example:
df_test = pd.DataFrame(np.ones((10,2)),index = pd.MultiIndex.from_product([['even','odd'],[0,1,2,3,4]],names = ['parity','mod5']))
df_test.loc[('even',),1] = pd.DataFrame(np.arange(5)+5,index = np.arange(5))
df_test
0 1
parity mod5
even 0 1.0 NaN
1 1.0 NaN
2 1.0 NaN
3 1.0 NaN
4 1.0 NaN
odd 0 1.0 1.0
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
4 1.0 1.0
whereas I expected the following output:
0 1
parity mod5
even 0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 1.0 8.0
4 1.0 9.0
odd 0 1.0 1.0
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
4 1.0 1.0
What do I need to do differently to get the expected result? I have tried a few other things like df_test.loc['even']['1'] but that doesn't even affect the DataFrame at all.
In this example, your indices are specially ordered. If you need to do something like this when index matching matters but the ordering of your DataFrame indices is not guaranteed, then this may be accomplished via DataFrame.update like this:
index = np.arange(5)
np.random.shuffle(index)
df_other = pd.DataFrame(np.arange(5) + 5, index=index).squeeze()
df_test.loc[('even',), 1].update(df_other)
The .squeeze() is needed to convert the DataFrame into a Series (whose shape and indices match those of df_test.loc[('even',), 1]).
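On some pandas versions, df_test.loc[('even',), 1] returns a copy, so updating it may not write back into df_test. Here is a sketch of a variant that updates df_test directly, building the other frame with the full MultiIndex (index names taken from the question's df_test):
df_other = pd.DataFrame(
    {1: np.arange(5) + 5},
    index=pd.MultiIndex.from_product([['even'], np.arange(5)],
                                     names=['parity', 'mod5']))
df_test.update(df_other)  # aligns on both index levels and the column label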
You have:
df_test.loc[('even',),1] = pd.DataFrame(np.arange(5)+5,index = np.arange(5))
This assignment causes NaNs in df_test.loc[('even',),1] for 2 reasons. First, you are trying to assign a pd.DataFrame() to a single column. For this to work, you need the same index, as well as the same column name (which defaults to 0 below, but we need 1). It would be easier to use pd.Series(), in which case we don't need to worry about the name. Second, even with the pd.Series, you need to match the index (and index = np.arange(5) does not).
Try as follows:
df_test.loc[('even',),1] = pd.Series(np.arange(5,10), index = pd.MultiIndex.from_product([['even'],[0,1,2,3,4]]))
# or: pd.DataFrame(np.arange(5,10),columns=[1],
# index = pd.MultiIndex.from_product([['even'],[0,1,2,3,4]]))
# or, if you don't want to bother with the correct index,
# you could of course simply do:
# df_test.loc[('even',),1] = np.arange(5,10)
print(df_test)
0 1
parity mod5
even 0 1.0 5.0
1 1.0 6.0
2 1.0 7.0
3 1.0 8.0
4 1.0 9.0
odd 0 1.0 1.0
1 1.0 1.0
2 1.0 1.0
3 1.0 1.0
4 1.0 1.0
I'm very confused by the output of the pct_change function when data with NaN values are involved. The first several rows of output in the right column are correct - it gives the percentage change in decimal form of the cell to the left in Column A relative to the cell in Column A two rows prior. But as soon as it reaches the NaN values in Column A, the output of the pct_change function makes no sense.
For example:
Row 8: NaN is 50% greater than 2?
Row 9: NaN is 0% greater than 3?
Row 11: 4 is 33% greater than NaN?
Row 12: 2 is 33% less than NaN?
Based on the above math, it seems like pct_change is assigning NaN a value of "3". Is that because pct_change effectively fills forward the last non-NaN value? Could someone please explain the logic here and why this happens?
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [2,1,3,1,4,5,2,3,np.nan,np.nan,np.nan,4,2,1,0,4]})
x = 2
df['pctchg_A'] = df['A'].pct_change(periods = x)
print(df.to_string())
Here's the output:
      A  pctchg_A
0   2.0       NaN
1   1.0       NaN
2   3.0  0.500000
3   1.0  0.000000
4   4.0  0.333333
5   5.0  4.000000
6   2.0 -0.500000
7   3.0 -0.400000
8   NaN  0.500000
9   NaN  0.000000
10  NaN  0.000000
11  4.0  0.333333
12  2.0 -0.333333
13  1.0 -0.750000
14  0.0 -1.000000
15  4.0  3.000000
The behaviour is as expected. You need to carefully read the df.pct_change docs.
As per docs:
fill_method: str, default ‘pad’
How to handle NAs before computing percent changes.
Here, the pad method means it will forward-fill the NaN values with the most recent non-NaN value before computing the change.
So, if you ffill or pad your NaN values, you will understand what's exactly happening. Check this out:
In [3201]: df['padded_A'] = df['A'].fillna(method='pad')
In [3203]: df['pctchg_A'] = df['A'].pct_change(periods = x)
In [3204]: df
Out[3204]:
A padded_A pctchg_A
0 2.0 2.0 NaN
1 1.0 1.0 NaN
2 3.0 3.0 0.500000
3 1.0 1.0 0.000000
4 4.0 4.0 0.333333
5 5.0 5.0 4.000000
6 2.0 2.0 -0.500000
7 3.0 3.0 -0.400000
8 NaN 3.0 0.500000
9 NaN 3.0 0.000000
10 NaN 3.0 0.000000
11 4.0 4.0 0.333333
12 2.0 2.0 -0.333333
13 1.0 1.0 -0.750000
14 0.0 0.0 -1.000000
15 4.0 4.0 3.000000
Now you can compare padded_A values with pctchg_A and see that it works as expected.
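If you want to check this yourself, padding the column first and then computing the change with filling disabled reproduces the default output exactly (a sketch, assuming a pandas version where pct_change still pads by default and accepts fill_method):
manual = df['A'].ffill().pct_change(periods=x, fill_method=None)
auto = df['A'].pct_change(periods=x)
print(manual.equals(auto))  # True: identical, including the NaN positions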
Basically, I'm trying to do something like this but for a fillna instead of a sum.
I have a list of df's, each with same colunms/indexes, ordered over time:
import numpy as np
import pandas as pd
np.random.seed(0)
df_list = []
for index in range(3):
    a = pd.DataFrame(np.random.randint(3, size=(5,3)), columns=list('abc'))
    mask = np.random.choice([True, False], size=a.shape)
    df_list.append(a.mask(mask))
Now, I want to replace the numpy.nan cells of the i-th DataFrame in df_list with the value of the same cell in the (i-1)-th DataFrame in df_list.
so if the first DataFrame is:
a b c
0 NaN 1.0 0.0
1 1.0 1.0 NaN
2 0.0 NaN 0.0
3 NaN 0.0 2.0
4 NaN 2.0 2.0
and the 2nd is:
a b c
0 0.0 NaN NaN
1 NaN NaN NaN
2 0.0 1.0 NaN
3 NaN NaN 2.0
4 0.0 NaN 2.0
Then the output, output_list, should be a list of the same length as df_list, also with DataFrames as elements.
The first entry of output_list is the same as the first entry of df_list.
The second entry of output_list is:
a b c
0 0.0 1.0 0.0
1 1.0 1.0 NaN
2 0.0 1.0 0.0
3 NaN 0.0 2.0
4 0.0 2.0 2.0
I believe the update functionality is very good for this, see the docs: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
It is a method that specifically allows you to update a DataFrame, in your case only the NaN-elements of it.
In particular, you could use it like this:
new_df_list = df_list[:1]
for df_new, df_old in zip(df_list[1:], df_list[:-1]):
    df_new.update(df_old, overwrite=False)
    new_df_list.append(df_new)
This will give you the desired output.
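As a side note, update mutates the frames inside df_list in place. If you'd rather leave df_list untouched, combine_first does the same NaN-filling without mutation (a sketch, assuming fills should carry forward across the whole list, which is what the in-place loop effectively does):
output_list = [df_list[0]]
for df in df_list[1:]:
    # keep df's values where present, fall back to the previous (already filled) frame
    output_list.append(df.combine_first(output_list[-1]))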
df.groupby([df.index.month, df.index.day])[vars_rs].transform(lambda y: y.fillna(y.median()))
I am filling missing values in a dataframe with median values from climatology. The days range from Jan 1 2010 to Dec 31st 2016. However, I only want to fill in missing values for days before current date (say Oct 1st 2016). How do I modify the statement?
The algorithm would be:
Get a part of the data frame which contains only rows filtered by date with a boolean mask
Perform required replacements on it
Append the rest of the initial data frame to the end of the resulting data frame.
Dummy data:
df = pd.DataFrame(np.zeros((5, 2)),columns=['A', 'B'],index=pd.date_range('2000',periods=5,freq='M'))
A B
2000-01-31 0.0 0.0
2000-02-29 0.0 0.0
2000-03-31 0.0 0.0
2000-04-30 0.0 0.0
2000-05-31 0.0 0.0
The code
vars_rs = ['A', 'B']
mask = df.index < '2000-03-31'
early = df[mask]
early = early.groupby([early.index.month, early.index.day])[vars_rs].transform(lambda y: y.replace(0.0, 1)) # replace with your code
result = pd.concat([early, df[~mask]])  # DataFrame.append was removed in pandas 2.0
So the result is
A B
2000-01-31 1.0 1.0
2000-02-29 1.0 1.0
2000-03-31 0.0 0.0
2000-04-30 0.0 0.0
2000-05-31 0.0 0.0
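Applied to the original median-climatology statement, with Oct 1st 2016 as the cutoff from the question, the same pattern might look like:
mask = df.index < '2016-10-01'
early = df[mask]
early = early.groupby([early.index.month, early.index.day])[vars_rs].transform(lambda y: y.fillna(y.median()))
result = pd.concat([early, df[~mask]])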
Use np.where, example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':['a','a','b','b','c','c'],'B':[1,2,3,4,5,6],'C':[1,np.nan,np.nan,np.nan,np.nan,np.nan]})
df.loc[:,'C'] = np.where((df.A != 'c') & (df.B < 4) & (pd.isnull(df.C)), -99, df.loc[:,'C'])
This way you can directly modify the desired column using boolean expressions over any of the columns (.loc replaces the long-removed .ix accessor).
Original dataframe:
A B C
0 a 1 1.0
1 a 2 NaN
2 b 3 NaN
3 b 4 NaN
4 c 5 NaN
5 c 6 NaN
Modified dataframe:
A B C
0 a 1 1.0
1 a 2 -99.0
2 b 3 -99.0
3 b 4 NaN
4 c 5 NaN
5 c 6 NaN
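An equivalent without np.where is to assign through .loc with the same boolean mask, so only the rows where the condition holds are touched:
cond = (df.A != 'c') & (df.B < 4) & df.C.isnull()
df.loc[cond, 'C'] = -99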
I am trying to fill none values in a Pandas dataframe with 0's for only some subset of columns.
When I do:
import pandas as pd
df = pd.DataFrame(data={'a':[1,2,3,None],'b':[4,5,None,6],'c':[None,None,7,8]})
print(df)
df.fillna(value=0, inplace=True)
print(df)
The output:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 NaN 7.0
3 NaN 6.0 8.0
a b c
0 1.0 4.0 0.0
1 2.0 5.0 0.0
2 3.0 0.0 7.0
3 0.0 6.0 8.0
It replaces every None with 0. What I want is to replace the Nones only in columns a and b, but not in c.
What is the best way of doing this?
You can select your desired columns and do it by assignment:
df[['a', 'b']] = df[['a','b']].fillna(value=0)
The resulting output is as expected:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
You can use a dict to fillna with a different value for each column:
df.fillna({'a':0,'b':0})
Out[829]:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
Then assign it back:
df=df.fillna({'a':0,'b':0})
df
Out[831]:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
You can avoid making a copy of the object using Wen's solution and inplace=True:
df.fillna({'a':0, 'b':0}, inplace=True)
print(df)
Which yields:
a b c
0 1.0 4.0 NaN
1 2.0 5.0 NaN
2 3.0 0.0 7.0
3 0.0 6.0 8.0
Using the top answer produces a warning about making changes to a copy of a df slice. Assuming that you have other columns, a better way to do this is to pass a dictionary:
df.fillna({'A': 'NA', 'B': 'NA'}, inplace=True)
This should work and without copywarning
df[['a', 'b']] = df.loc[:,['a', 'b']].fillna(value=0)
Here's how you can do it all in one line:
df[['a', 'b']].fillna(value=0, inplace=True)
Breakdown: df[['a', 'b']] selects the columns you want to fill NaN values for, value=0 tells it to fill NaNs with zero, and inplace=True applies the change without making a copy of the object. Be aware, though, that df[['a', 'b']] can return a copy rather than a view, in which case the original df is left unmodified (see the answer below reporting exactly that).
Or something like:
df.loc[df['a'].isnull(),'a']=0
df.loc[df['b'].isnull(),'b']=0
and if there is more:
for i in your_list:
    df.loc[df[i].isnull(), i] = 0
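If the list is long, the same loop collapses into a single fillna call with a dict comprehension (in the spirit of the dict answer above):
df.fillna({col: 0 for col in your_list}, inplace=True)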
For some odd reason this DID NOT work (using Pandas: '0.25.1')
df[['col1', 'col2']].fillna(value=0, inplace=True)
Another solution:
subset_cols = ['col1','col2']
[df[col].fillna(0, inplace=True) for col in subset_cols]
Example:
df = pd.DataFrame(data={'col1':[1,2,np.nan,], 'col2':[1,np.nan,3], 'col3':[np.nan,2,3]})
output:
col1 col2 col3
0 1.00 1.00 nan
1 2.00 nan 2.00
2 nan 3.00 3.00
Apply the list comprehension to fill the NaN values:
subset_cols = ['col1','col2']
[df[col].fillna(0, inplace=True) for col in subset_cols]
Output:
col1 col2 col3
0 1.00 1.00 nan
1 2.00 0.00 2.00
2 0.00 3.00 3.00
Sometimes this syntax won't work:
df[['col1','col2']] = df[['col1','col2']].fillna(0)
Use per-column assignment instead:
df['col1'] = df['col1'].fillna(0)
df['col2'] = df['col2'].fillna(0)