Identifying consecutive NaN's with pandas part 2 - python

I have a question related to the earlier question: Identifying consecutive NaN's with pandas
I am new on Stack Overflow, so I cannot add a comment, but I would like to know how I can partly keep the original index of the DataFrame when counting the number of consecutive NaNs.
So instead of:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10, np.nan, np.nan, 13, 14]})
df
Out[38]:
a
0 1
1 2
2 NaN
3 NaN
4 NaN
5 6
6 7
7 8
8 9
9 10
10 NaN
11 NaN
12 13
13 14
I would like to obtain the following:
Out[41]:
a
0 0
1 0
2 3
5 0
6 0
7 0
8 0
9 0
10 2
12 0
13 0

I have found a workaround. It is quite ugly, but it does the trick. I hope you don't have massive data, because it might not perform very well:
df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10, np.nan, np.nan, 13, 14]})
df1 = df.a.isnull().astype(int).groupby(df.a.notnull().astype(int).cumsum()).sum()
# Determine the different groups of NaNs; we only want to keep the first of each.
# The 0's are non-NaN values, the 1's are the first in a group of NaNs.
b = df.isna()
df2 = b.cumsum() - b.cumsum().where(~b).ffill().fillna(0).astype(int)
df2 = df2.loc[df2['a'] <= 1]
# Move the non-zero NaN counts onto the index of the first NaN in each group
df3 = df1.loc[df1 != 0]
df3.index = df2.loc[df2['a'] == 1].index
# Update df2 with the values from df3 (which has the right values and the right index)
df2.update(df3)
The NaN-group trick is inspired by this answer.
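For reference, here is a more compact sketch of the same idea (my own variant, not from the linked answer): label each run of consecutive NaN/non-NaN values, broadcast the run length, and keep only the non-NaN rows plus the first row of each NaN run.

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10,
                         np.nan, np.nan, 13, 14]})

isna = df['a'].isna()
run_id = isna.ne(isna.shift()).cumsum()          # label each run of NaN / non-NaN values
run_len = isna.groupby(run_id).transform('sum')  # NaN-run length, 0 on non-NaN rows
first_of_run = isna & ~isna.shift(fill_value=False)
out = run_len.where(~isna | first_of_run).dropna().astype(int)

out then matches the desired output above, including the partly kept index.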

Related

Dynamically Fill NaN Values in Dataframe

I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill, as the rule is dynamic: take the value from the row preceding the NaNs and divide it by the number of consecutive NaNs plus one; that result replaces both the original value and the NaNs. For example, rows 3 and 4 should be replaced with 12 (24/2), and rows 6, 7 and 8 should be replaced with 5 (15/3). All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out):
Index Column 1
0 1 10.0
1 2 12.0
2 3 12.0
3 4 12.0
4 5 20.0
5 6 5.0
6 7 5.0
7 8 5.0
8 9 2.0
Explanation and intermediate values:
Basically, look for the rows where the next value or the previous value is NaN but the value itself is not NaN. Those rows form the first row of each such group.
So the m in above code looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows of the form [True, <all False>], because those are the groups I want to take the average of. For that, use cumsum.
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
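In code, that is just the following line (added here for clarity; m is the mask defined above):

df.groupby(m.cumsum()).ngroup()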
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what the groups are.
Now, for each group, you can take the mean of the group if the group contains any NaN value; this is checked with x.isna().any().
If the group has any NaN value, assign the mean after filling NaNs with 0; otherwise just keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
Why not use interpolate? It has a method argument that would probably fit your needs.
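For example, a minimal sketch (note that linear interpolation produces different numbers than the divide-evenly rule described in the question, so check whether that is acceptable):

df["Column 1"] = df["Column 1"].interpolate(method="linear")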
However, if you really want to do it as you described above, you can do something like this (note that iterating over rows in pandas is considered bad practice, but it does the job):
import pandas as pd
import numpy as np

df = pd.DataFrame([10,
                   12,
                   24,
                   np.nan,
                   15,
                   np.nan,
                   np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if (local_idx - idx) > 0:
            fillvalue = df.loc[idx] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.loc[fillidx] = fillvalue
df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0

What is the best way to create running total columns in pandas

What is the most pandastic way to create running total columns at various levels (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,'X','X','X','X',np.nan,'X','X','X','X','X','X',np.nan,np.nan,'X','X'
df['desired_output_level_1'] = np.nan,np.nan,'1','1','1','1',np.nan,'2','2','2','2','2','2',np.nan,np.nan,'3','3'
df['desired_output_level_2'] = np.nan,np.nan,'1','2','3','4',np.nan,'1','2','3','4','5','6',np.nan,np.nan,'1','2'
output:
test desired_output_level_1 desired_output_level_2
0 NaN NaN NaN
1 NaN NaN NaN
2 X 1 1
3 X 1 2
4 X 1 3
5 X 1 4
6 NaN NaN NaN
7 X 2 1
8 X 2 2
9 X 2 3
10 X 2 4
11 X 2 5
12 X 2 6
13 NaN NaN NaN
14 NaN NaN NaN
15 X 3 1
16 X 3 2
The test column can only contain X's or NaNs.
The number of consecutive X's is random.
In the 'desired_output_level_1' column, I am trying to count up the series of X's (which run each row belongs to).
In the 'desired_output_level_2' column, I am trying to find the duration of each series (the position within the run).
Can anyone help? Thanks in advance.
Perhaps not the most pandastic way, but it seems to yield what you are after.
Three key points:
We are operating only on rows that are not NaN, so let's create a mask:
mask = df['test'].notna()
For the level 1 computation, it's easy to detect a change from NaN to not-NaN by shifting the rows by one:
df.loc[mask, "level_1"] = (df["test"].isna() & df["test"].shift(-1).notna()).cumsum()
For level 2 computation, it's a bit trickier. One way to do it is to run the computation for each level_1 group and do .transform to preserve the indexing:
df.loc[mask, "level_2"] = (
df.loc[mask, ["level_1"]]
.assign(level_2=1)
.groupby("level_1")["level_2"]
.transform("cumsum")
)
The last step (if needed) is to convert the columns to strings:
df['level_1'] = df['level_1'].astype('Int64').astype('str')
df['level_2'] = df['level_2'].astype('Int64').astype('str')
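For comparison, here is a minimal alternative sketch (my own variant, assuming the test column only contains X's or NaNs as stated) that numbers each run from its starting row, so it also works when a run begins on the very first row:

import numpy as np
import pandas as pd

df = pd.DataFrame()
df['test'] = [np.nan, np.nan, 'X', 'X', 'X', 'X', np.nan, 'X', 'X', 'X',
              'X', 'X', 'X', np.nan, np.nan, 'X', 'X']

mask = df['test'].notna()
starts = mask & ~mask.shift(fill_value=False)  # an X whose previous row is not an X
run_id = starts.cumsum()                       # run label; the gap after a run keeps its label
df['level_1'] = run_id.where(mask)             # run number, NaN outside runs
df['level_2'] = df.groupby(run_id).cumcount().add(1).where(mask)  # position within the run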

Checking multiple columns condition in pandas

I want to create a new column in my dataframe that contains the name of a column if that column alone has the value 8 in the respective row; otherwise the new column's value for the row should be "NONE". For the dataframe df below, the new column would be df["New_Column"] = ["NONE","NONE","A","NONE"].
df = pd.DataFrame({"A": [1, 2,8,3], "B": [0, 2,4,8], "C": [0, 0,7,8]})
Cool problem.
Find the 8-fields in each row: df==8
Count them: (df==8).sum(axis=1)
Find the rows where the count is 1: (df==8).sum(axis=1)==1
Select just those rows from the original dataframe: df[(df==8).sum(axis=1)==1]
Find the 8-fields again: df[(df==8).sum(axis=1)==1]==8
Find the columns that hold the True values with idxmax (because True>False): (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
Fill in the gaps with "NONE"
To summarize:
df["New_Column"] = (df[(df==8).sum(axis=1)==1]==8).idxmax(axis=1)
df["New_Column"] = df["New_Column"].fillna("NONE")
# A B C New_Column
#0 1 0 0 NONE
#1 2 2 0 NONE
#2 8 4 7 A
#3 3 8 8 NONE
# I added another line as a proof of concept
#4 0 8 0 B
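(For reference, the extended frame behind that proof-of-concept row could be built like this; my reconstruction, not part of the original answer:)

df = pd.DataFrame({"A": [1, 2, 8, 3, 0], "B": [0, 2, 4, 8, 8], "C": [0, 0, 7, 8, 0]})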
You can accomplish this using idxmax and a mask:
out = (df==8).idxmax(axis=1)
m = ~(df==8).any(axis=1) | ((df==8).sum(axis=1) > 1)
df.assign(col=out.mask(m))
A B C col
0 1 0 0 NaN
1 2 2 0 NaN
2 8 4 7 A
3 3 8 8 NaN
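To get the literal "NONE" sentinel the question asks for, a fillna can be chained onto the masked result (my addition, not part of the answer above):

df["New_Column"] = out.mask(m).fillna("NONE")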
Or do:
df2 = df[df == 8]
df['New_Column'] = (df2[(df2 != df2.dropna(thresh=2).values[0]).all(axis=1)]
                    .dropna(how='all')
                    .idxmax(axis=1))
df['New_Column'] = df['New_Column'].fillna('NONE')
print(df)
dropna + dropna again + idxmax + fillna: that's all you need for this.
Output:
A B C New_Column
0 1 0 0 NONE
1 2 2 0 NONE
2 8 4 7 A
3 3 8 8 NONE

Modifying multiple columns in a subset of rows in pandas DataFrame

I've got a pandas DataFrame in which I want to modify several columns of some rows. First I set the default values and select the exceptional rows:
df[['finalA', 'finalB']] = df[['A', 'B']]
exceptions = df.loc[df.normal == False]
This works like a charm, but now I want to set the exceptions:
df.loc[exceptions.index, ['finalA', 'finalB']] = \
df.loc[exceptions.index, ['A_except', 'B_except']]
That doesn't work. So I tried using .ix, following this answer.
df.ix[exceptions.index, ['finalA', 'finalB']] = \
df.ix[exceptions.index, ['A_except', 'B_except']]
That doesn't work either: both methods give me NaN in finalA and finalB for the exceptional rows.
The only way that seems to work is doing it one column at a time:
df.ix[exceptions.index, 'finalA'] = \
df.ix[exceptions.index, 'A_except']
df.ix[exceptions.index, 'finalB'] = \
df.ix[exceptions.index, 'B_except']
What's going on here in pandas? How do I avoid setting the values to the copy that is apparently made by selecting multiple columns? Is there a way to avoid this kind of code repetition?
Some more musings: it doesn't actually set the values to a copy of the DataFrame; it sets the values to NaN. That is, it actually overwrites them with a new value.
Sample dataframe:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'normal': [True, True, False, False],
                   'A_except': [0, 0, 9, 9],
                   'B_except': [0, 0, 10, 10]})
Result:
A A_except B B_except normal finalA finalB
0 1 0 5 0 True 1.0 5.0
1 2 0 6 0 True 2.0 6.0
2 3 9 7 10 False NaN NaN
3 4 9 8 10 False NaN NaN
Expected result:
A A_except B B_except normal finalA finalB
0 1 0 5 0 True 1 5
1 2 0 6 0 True 2 6
2 3 9 7 10 False 9 10
3 4 9 8 10 False 9 10
You can rename the columns so they align:
d = {'A_except':'finalA', 'B_except':'finalB'}
df.loc[exceptions.index, ['finalA', 'finalB']] = \
df.loc[exceptions.index, ['A_except', 'B_except']].rename(columns=d)
print (df)
A A_except B B_except normal finalA finalB
0 1 0 5 0 True 1 5
1 2 0 6 0 True 2 6
2 3 9 7 10 False 9 10
3 4 9 8 10 False 9 10
Another solution is to convert the output to a NumPy array, so that the columns are not aligned by label:
df.loc[exceptions.index, ['finalA', 'finalB']] = \
df.loc[exceptions.index, ['A_except', 'B_except']].values
print (df)
A A_except B B_except normal finalA finalB
0 1 0 5 0 True 1 5
1 2 0 6 0 True 2 6
2 3 9 7 10 False 9 10
3 4 9 8 10 False 9 10
If you look at both sides of the assignment, you will notice that the columns differ. Pandas takes the column labels into account, and since they don't match, it won't insert the values.
It works for a single column because then you are extracting a Series, and the column label no longer applies.
A quick solution is to simply strip the right-hand side down to a bare array; then both the loc and ix methods work:
df.loc[exceptions.index, ['finalA', 'finalB']] = \
    df.loc[exceptions.index, ['A_except', 'B_except']].values
But keep in mind that doing this eliminates pandas' attempt to match column and index labels; it's basically a 'hard' insert. So that makes you, as the user, responsible for the proper alignment, which in this case is not a problem, but is something to be aware of in general.
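As a side note, .ix has since been removed from pandas entirely; in modern versions the same 'hard' insert can be written with .loc and .to_numpy() (a sketch built on the question's sample frame):

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [5, 6, 7, 8],
                   'normal': [True, True, False, False],
                   'A_except': [0, 0, 9, 9],
                   'B_except': [0, 0, 10, 10]})

df[['finalA', 'finalB']] = df[['A', 'B']]
mask = ~df['normal']
# .to_numpy() strips the labels, so pandas inserts the values positionally
df.loc[mask, ['finalA', 'finalB']] = df.loc[mask, ['A_except', 'B_except']].to_numpy()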

Merging and Filling in Pandas DataFrames

I have two dataframes in Pandas. The columns are named the same and they have the same dimensions, but they have different (and missing) values.
I would like to merge based on one key column and take the max or non-missing data for each equivalent row.
import datetime
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'key': [1, 3, 5, 7],
                    'a': [np.nan, 0, 5, 1],
                    'b': [datetime.datetime.today() - datetime.timedelta(days=x) for x in range(0, 4)]})
df1
a b key
0 NaN 2014-08-01 10:37:23.828683 1
1 0 2014-07-31 10:37:23.828726 3
2 5 2014-07-30 10:37:23.828736 5
3 1 2014-07-29 10:37:23.828744 7
df2 = pd.DataFrame({'key': [1, 3, 5, 7],
                    'a': [2, 0, np.nan, 3],
                    'b': [datetime.datetime.today() - datetime.timedelta(days=x) for x in range(2, 6)]})
df2.loc[2, 'b'] = np.nan  # .ix has been removed from pandas; .loc does the same here
df2
a b key
0 2 2014-07-30 10:38:13.857203 1
1 0 2014-07-29 10:38:13.857253 3
2 NaN NaT 5
3 3 2014-07-27 10:38:13.857272 7
The end result would look like:
df_together
a b key
0 2 2014-07-30 10:38:13.857203 1
1 0 2014-07-29 10:38:13.857253 3
2 5 2014-07-30 10:37:23.828736 5
3 3 2014-07-27 10:38:13.857272 7
I hope my example covers all cases. If both dataframes have NaN (or NaT) values, the result should also have NaN (or NaT) values. Try as I might, I can't get the pd.merge function to give what I want.
Often it is easiest in these circumstances to do:
df_together = pd.concat([df1, df2]).groupby('key').max()
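Worth noting: on the sample data, max() picks df1's later timestamps for keys 1 and 7, whereas the desired output shows df2's values. If the actual rule is 'prefer df2 and fall back to df1 where missing', combine_first reproduces the example exactly (a sketch, assuming key is unique in both frames):

df_together = (df2.set_index('key')
                  .combine_first(df1.set_index('key'))
                  .reset_index())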
