Very new to coding, so please excuse my lack of knowledge.
I currently have a dataframe that looks like this:
date        A    B    C
2006-11-01  NaN  1    NaN
2016-11-02  NaN  NaN  1
2016-11-03  1    NaN  NaN
2016-11-04  NaN  1    NaN
2016-11-05  NaN  1    NaN
2016-11-06  NaN  NaN  NaN
2016-11-07  NaN  1    NaN
What I want to do, for example, is:
replace all NaNs in column A with 0 for all dates after 2016-11-03, and be able to do the same thing for each column but with a different corresponding date.
I have tried:
for col in df:
    if col == 'A' & 'date' > '2016-11-03':
        value_1 = {'A': 0}
        df = df.fillna(value=value_1)
but I received this error: TypeError: unsupported operand type(s) for &: 'str' and 'str'.
I'm sure this has to do with my lack of knowledge, but I'm not sure how to proceed.
EDIT: What I am looking for is something like this:
date        A    B    C
2006-11-01  NaN  1    NaN
2016-11-02  NaN  NaN  1
2016-11-03  1    NaN  NaN
2016-11-04  0    1    NaN
2016-11-05  0    1    NaN
2016-11-06  0    NaN  NaN
2016-11-07  0    1    NaN
The condition consists of two parts: being after a certain date and being a NaN.
condition = (df['date'] > '2016-11-03') & df['A'].isnull()
Now select the rows that match the condition and set the corresponding items in column A to 0:
df.loc[condition, 'A'] = 0
Or, as a one-liner:
df.loc[(df.date > '2016-11-03') & (df['A'].isnull()), 'A'] = 0
Output:
     A    B    C        date
0  NaN  1.0  NaN  2006-11-01
1  NaN  NaN  1.0  2016-11-02
2  1.0  NaN  NaN  2016-11-03
3  0.0  1.0  NaN  2016-11-04
4  0.0  1.0  NaN  2016-11-05
5  0.0  NaN  NaN  2016-11-06
6  0.0  1.0  NaN  2016-11-07
You can also use fillna, applied only to the rows after the cutoff date:
mask = df['date'] > '2016-11-03'
df.loc[mask, 'A'] = df.loc[mask, 'A'].fillna(0)
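Since the question asks to do the same thing for each column with a different date, one way is to loop over a mapping from column to cutoff date. This is a minimal sketch; the cutoff dates shown for B and C are hypothetical placeholders:
cutoffs = {'A': '2016-11-03', 'B': '2016-11-05', 'C': '2016-11-02'}  # B and C dates are made up
for col, cutoff in cutoffs.items():
    # fill NaN with 0 only in rows after this column's cutoff date
    mask = (df['date'] > cutoff) & df[col].isnull()
    df.loc[mask, col] = 0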
Related
I have a dataframe where the column looks like:
NaN
859.0
NaN
NaN
0.0
NaN
and I would like to replace the zero with the previous non-NaN value, without changing the other NaNs, so I'd get this:
NaN
859.0
NaN
NaN
859.0
NaN
I've tried replace with ffill, but I can't manage to get the right output.
Any help welcome!
.ffill().shift() will propagate the last non-null value forward, shifted down one row, and then you can assign it to any rows whose value is 0:
In [42]: s.ffill().shift()
Out[42]:
0 NaN
1 NaN
2 859.0
3 859.0
4 859.0
5 0.0
dtype: float64
In [43]: s[s==0] = s.ffill().shift()
In [44]: s
Out[44]:
0 NaN
1 859.0
2 NaN
3 NaN
4 859.0
5 NaN
dtype: float64
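For completeness, the series used in the session above can be constructed from the question's data like this (a sketch):
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 859.0, np.nan, np.nan, 0.0, np.nan])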
First replace 0 with a missing value, forward-fill the missing values with ffill, and then restore the original missing values with Series.mask:
df['col'] = df['col'].mask(df['col'].eq(0)).ffill().mask(df['col'].isna())
print(df)
col
0 NaN
1 859.0
2 NaN
3 NaN
4 859.0
5 NaN
You could also do this with last_valid_index. Say your column is in df['col']:
for i, _ in df.iterrows():
    if df.loc[i, 'col'] == 0:
        # replace the 0 with the last non-NaN value strictly before row i
        # (assumes the default RangeIndex)
        df.at[i, 'col'] = df.loc[df.loc[:i-1, 'col'].last_valid_index(), 'col']
Output:
col
0 NaN
1 859.0
2 NaN
3 NaN
4 859.0
5 NaN
How can I use this to drop columns from a pandas DataFrame whose missing rate is over 90%?
This line shows every column and its missing rate:
percentage = (LoanStats_securev1_2018Q1.isnull().sum() / LoanStats_securev1_2018Q1.isnull().count() * 100).sort_values(ascending=False)
Someone familiar with pandas, please kindly help.
You can use dropna with a threshold:
newdf = df.dropna(axis=1, thresh=int(len(df) * 0.1))
axis=1 indicates columns, and thresh is the minimum number of non-NA values required to keep a column. A column with a missing rate over 90% has fewer than 10% non-NA values, hence the threshold of 10% of the row count.
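A quick check of that threshold on a toy frame (a sketch):
import numpy as np
import pandas as pd

# 'b' is 95% missing and should be dropped; 'a' is fully populated.
df = pd.DataFrame({'a': range(20), 'b': [np.nan] * 19 + [1.0]})
newdf = df.dropna(axis=1, thresh=int(len(df) * 0.1))
print(newdf.columns.tolist())  # ['a']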
You can also use boolean indexing with the mean of the boolean mask:
df = df.loc[:, df.isnull().mean() < .9]
Sample:
np.random.seed(2018)
df = pd.DataFrame(np.random.randn(20,3), columns=list('ABC'))
df.iloc[3:8,0] = np.nan
df.iloc[:-1,1] = np.nan
df.iloc[1:,2] = np.nan
print(df)
A B C
0 -0.276768 NaN 2.148399
1 -1.279487 NaN NaN
2 -0.142790 NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN NaN
8 -0.172797 NaN NaN
9 -1.604543 NaN NaN
10 -0.276501 NaN NaN
11 0.704780 NaN NaN
12 0.138125 NaN NaN
13 1.072796 NaN NaN
14 -0.803375 NaN NaN
15 0.047084 NaN NaN
16 -0.013434 NaN NaN
17 -1.580231 NaN NaN
18 -0.851835 NaN NaN
19 -0.148534 0.133759 NaN
print(df.isnull().mean())
A 0.25
B 0.95
C 0.95
dtype: float64
df = df.loc[:, df.isnull().mean() < .9]
print(df)
A
0 -0.276768
1 -1.279487
2 -0.142790
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 -0.172797
9 -1.604543
10 -0.276501
11 0.704780
12 0.138125
13 1.072796
14 -0.803375
15 0.047084
16 -0.013434
17 -1.580231
18 -0.851835
19 -0.148534
The issue below was produced in Python 2.7.11 with pandas 0.17.1.
When grouping a categorical column with both a period and date column, unexpected rows appear in the grouping. Is this a Pandas bug, or could it be something else?
df = pd.DataFrame({'date': pd.date_range('2015-12-29', '2016-1-3'),
                   'val1': [1] * 6,
                   'val2': range(6),
                   'cat1': ['a', 'b', 'c'] * 2,
                   'cat2': ['A', 'B', 'C'] * 2})
df['cat1'] = df.cat1.astype('category')
df['month'] = [d.to_period('M') for d in df.date]
>>> df
cat1 cat2 date val1 val2 month
0 a A 2015-12-29 1 0 2015-12
1 b B 2015-12-30 1 1 2015-12
2 c C 2015-12-31 1 2 2015-12
3 a A 2016-01-01 1 3 2016-01
4 b B 2016-01-02 1 4 2016-01
5 c C 2016-01-03 1 5 2016-01
Grouping the month and date with a regular series (e.g. cat2) works as expected:
>>> df.groupby(['month', 'date', 'cat2']).sum().unstack()
val1 val2
cat2 A B C A B C
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
But grouping on a categorical produces unexpected results. You'll notice in the index that the extra dates do not correspond to the grouped month.
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-02 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-03 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01 2015-12-29 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-30 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2015-12-31 NaN NaN NaN NaN NaN NaN # <<< Extraneous row.
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
Grouping the categorical by month periods or dates works fine, but not when both are combined as in the example above.
>>> df.groupby(['month', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month
2015-12 1 1 1 0 1 2
2016-01 1 1 1 3 4 5
>>> df.groupby(['date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
date
2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
EDIT
This behavior originated in the 0.15.0 update. Prior to that, this was the output:
>>> df.groupby(['month', 'date', 'cat1']).sum().unstack()
val1 val2
cat1 a b c a b c
month date
2015-12 2015-12-29 1 NaN NaN 0 NaN NaN
2015-12-30 NaN 1 NaN NaN 1 NaN
2015-12-31 NaN NaN 1 NaN NaN 2
2016-01 2016-01-01 1 NaN NaN 3 NaN NaN
2016-01-02 NaN 1 NaN NaN 4 NaN
2016-01-03 NaN NaN 1 NaN NaN 5
As defined in pandas, grouping on a categorical will always include the full set of categories, even if there is no data for a given category (see the groupby documentation on categorical data).
You can either not use a categorical, or add a .dropna(how='all') after your grouping step.
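A sketch of that workaround applied to the example above (note it would also drop any group whose aggregated values are genuinely all NaN):
result = df.groupby(['month', 'date', 'cat1']).sum().dropna(how='all').unstack()
In later pandas versions, groupby also accepts observed=True, which restricts the result to the category combinations actually observed in the data.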
I have got a dataframe:
df = pd.DataFrame({'index': range(8),
                   'variable1': ["A", "A", "B", "B", "A", "B", "B", "A"],
                   'variable2': ["a", "b", "a", "b", "a", "b", "a", "b"],
                   'variable3': ["x", "x", "x", "y", "y", "y", "x", "y"],
                   'result': [1, 0, 0, 1, 1, 0, 0, 1]})
# pivot_table's old rows=/cols= arguments are index=/columns= in current pandas
df2 = df.pivot_table(values='result', index='index',
                     columns=['variable1', 'variable2', 'variable3'])
df2.loc[4, ('A', 'a', 'x')] = 1
df2.loc[3, ('B', 'a', 'x')] = 1
variable1    A                   B
variable2    a         b         a    b
variable3    x    y    x    y    x    y
index
0            1  NaN  NaN  NaN  NaN  NaN
1          NaN  NaN    0  NaN  NaN  NaN
2          NaN  NaN  NaN  NaN    0  NaN
3          NaN  NaN  NaN  NaN    1    1
4            1    1  NaN  NaN  NaN  NaN
5          NaN  NaN  NaN  NaN  NaN    0
6          NaN  NaN  NaN  NaN    0  NaN
7          NaN  NaN  NaN    1  NaN  NaN
Now I want to check for simultaneous occurrences of x == 1 and y == 1, but only within each subgroup, defined by variable1 and variable2. So, for the dataframe shown above, the condition is met for index == 4 (group A-a), but not for index == 3 (groups B-a and B-b).
I suppose some groupby() magic would be needed, but I cannot find the right way. I have also tried experimenting with a stacked dataframe (using df.stack()), but this did not get me any closer...
You can use groupby on the first two levels, variable1 and variable2, to get the sum of the x and y columns within each subgroup:
r = df2.groupby(level=[0,1], axis=1).sum()
r
Out[50]:
variable1 A B
variable2 a b a b
index
0 1 NaN NaN NaN
1 NaN 0 NaN NaN
2 NaN NaN 0 NaN
3 NaN NaN 1 1
4 2 NaN NaN NaN
5 NaN NaN NaN 0
6 NaN NaN 0 NaN
7 NaN 1 NaN NaN
Consequently, the rows you are searching for are the ones that contain the value 2:
r[r==2].dropna(how='all')
Out[53]:
variable1 A B
variable2 a b a b
index
4 2 NaN NaN NaN
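This works because the values are only 0 and 1, so a sum of 2 can only mean that both x and y are 1 in the same subgroup. If other values were possible, a variant that counts cells equal to 1 (a sketch, using the same axis=1 groupby as above) would keep, say, a lone x == 2 from masquerading as a match:
hits = df2.eq(1).groupby(level=[0, 1], axis=1).sum()
hits[hits == 2].dropna(how='all')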
I have a python pandas DataFrame that looks like this:
A B C ... ZZ
2008-01-01 00 NaN NaN NaN ... 1
2008-01-02 00 NaN NaN NaN ... NaN
2008-01-03 00 NaN NaN 1 ... NaN
... ... ... ... ... ...
2012-12-31 00 NaN 1 NaN ... NaN
and I can't figure out how to get the subset of the DataFrame that has at least one 1 in it, so that the final df should be something like this:
B C ... ZZ
2008-01-01 00 NaN NaN ... 1
2008-01-03 00 NaN 1 ... NaN
... ... ... ... ...
2012-12-31 00 1 NaN ... NaN
That is, removing all rows and columns that do not contain a 1.
I tried this, which seems to remove the rows with no 1:
df_filtered = df[df.sum(1) > 0]
and then tried to remove the columns with:
df_filtered = df_filtered[df.sum(0) > 0]
but got this error after the second line:
IndexingError('Unalignable boolean Series key provided')
Do it with loc:
In [90]: df
Out[90]:
0 1 2 3 4 5
0 1 NaN NaN 1 1 NaN
1 NaN NaN NaN NaN NaN NaN
2 1 1 NaN NaN 1 NaN
3 1 NaN 1 1 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [91]: df.loc[df.sum(1) > 0, df.sum(0) > 0]
Out[91]:
0 1 2 3 4
0 1 NaN NaN 1 1
2 1 1 NaN NaN 1
3 1 NaN 1 1 NaN
Here's why you get that error:
Let's say I have the following frame, df (similar to yours):
In [112]: df
Out[112]:
a b c d e
0 0 1 1 NaN 1
1 NaN NaN NaN NaN NaN
2 0 0 0 NaN 0
3 0 0 1 NaN 1
4 1 1 1 NaN 1
5 0 0 0 NaN 0
6 1 0 1 NaN 0
When I sum each column (the default axis=0) and threshold at 0, I get:
In [113]: col_sum = df.sum()
In [114]: col_sum > 0
Out[114]:
a True
b True
c True
d False
e True
dtype: bool
Since the index of col_sum is the columns of df, it doesn't make sense in this case to try to use the values of col_sum > 0 to fancy-index into the rows of df, since their row indices are not aligned and cannot be aligned.
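The OP's two-step attempt works once the column mask goes through .loc instead of plain row indexing (a sketch):
df_filtered = df[df.sum(1) > 0]                  # keep rows with at least one 1
df_filtered = df_filtered.loc[:, df.sum(0) > 0]  # keep columns with at least one 1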
Alternatively, to remove all-NaN rows and columns you can use .any():
In [1680]: df
Out[1680]:
0 1 2 3 4 5
0 1.0 NaN NaN 1.0 1.0 NaN
1 NaN NaN NaN NaN NaN NaN
2 1.0 1.0 NaN NaN 1.0 NaN
3 1.0 NaN 1.0 1.0 NaN NaN
4 NaN NaN NaN NaN NaN NaN
In [1681]: df.loc[df.any(axis=1), df.any(axis=0)]
Out[1681]:
0 1 2 3 4
0 1.0 NaN NaN 1.0 1.0
2 1.0 1.0 NaN NaN 1.0
3 1.0 NaN 1.0 1.0 NaN