What is the best way to create running total columns in pandas (Python)?

What is the most pandastic way to create running total columns at various levels (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,'X','X','X','X',np.nan,'X','X','X','X','X','X',np.nan,np.nan,'X','X'
df['desired_output_level_1'] = np.nan,np.nan,'1','1','1','1',np.nan,'2','2','2','2','2','2',np.nan,np.nan,'3','3'
df['desired_output_level_2'] = np.nan,np.nan,'1','2','3','4',np.nan,'1','2','3','4','5','6',np.nan,np.nan,'1','2'
output:
test desired_output_level_1 desired_output_level_2
0 NaN NaN NaN
1 NaN NaN NaN
2 X 1 1
3 X 1 2
4 X 1 3
5 X 1 4
6 NaN NaN NaN
7 X 2 1
8 X 2 2
9 X 2 3
10 X 2 4
11 X 2 5
12 X 2 6
13 NaN NaN NaN
14 NaN NaN NaN
15 X 3 1
16 X 3 2
The test column can only contain X's or NaNs.
The number of consecutive X's is random.
In the 'desired_output_level_1' column, I am trying to number each consecutive series of X's.
In the 'desired_output_level_2' column, I am trying to produce a running count within each series (i.e. its duration so far).
Can anyone help? Thanks in advance.

Perhaps not the most pandastic way, but it seems to yield what you are after.
Three key points:
We are operating only on rows that are not NaN, so let's create a mask:
mask = df['test'].notna()
For the level 1 computation, it's easy to detect a change from NaN to not NaN by shifting the column by one row and comparing:
df.loc[mask, "level_1"] = (df["test"].isna() & df["test"].shift(-1).notna()).cumsum()
For the level 2 computation, it's a bit trickier. One way to do it is to run the computation for each level_1 group and use .transform to preserve the indexing:
df.loc[mask, "level_2"] = (
df.loc[mask, ["level_1"]]
.assign(level_2=1)
.groupby("level_1")["level_2"]
.transform("cumsum")
)
The last step (if needed) is to convert the columns to strings:
df['level_1'] = df['level_1'].astype('Int64').astype('str')
df['level_2'] = df['level_2'].astype('Int64').astype('str')
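For what it's worth, here is a slightly shorter sketch of the same idea (my variation, not the answer above, assuming the frame from the question): mark the first X of each run, cumsum for level 1, and groupby/cumcount for level 2. It also numbers the first run correctly if the data happens to start with an X.
import numpy as np
import pandas as pd

df = pd.DataFrame({'test': [np.nan, np.nan, 'X', 'X', 'X', 'X', np.nan, 'X', 'X',
                            'X', 'X', 'X', 'X', np.nan, np.nan, 'X', 'X']})

mask = df['test'].notna()
# A run starts where the current row is not NaN but the previous row is.
run_id = (mask & ~mask.shift(fill_value=False)).cumsum()

df.loc[mask, 'level_1'] = run_id[mask]
df.loc[mask, 'level_2'] = df[mask].groupby(run_id[mask]).cumcount() + 1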

Related

Dynamically Fill NaN Values in Dataframe

I have a dataframe with a series of numbers. For example:
Index Column 1
1 10
2 12
3 24
4 NaN
5 20
6 15
7 NaN
8 NaN
9 2
I can't use bfill or ffill as the rule is dynamic: take the value from the previous non-NaN row and divide by the number of consecutive NaNs + 1. For example, rows 3 and 4 should be replaced with 12 (24/2), and rows 6, 7 and 8 should be replaced with 5. All other numbers should remain unchanged.
How should I do that?
Note: Edited the dataframe to be more general by inserting a new row between rows 4 and 5 and another row at the end.
You can do:
m = (df["Column 1"].notna()) & (
(df["Column 1"].shift(-1).isna()) | (df["Column 1"].shift().isna())
)
out = df.groupby(m.cumsum()).transform(
lambda x: x.fillna(0).mean() if x.isna().any() else x
)
print(out):
   Index  Column 1
0      1      10.0
1      2      12.0
2      3      12.0
3      4      12.0
4      5      20.0
5      6       5.0
6      7       5.0
7      8       5.0
8      9       2.0
Explanation and intermediate values:
Basically, look for rows whose own value is not NaN but whose next or previous value is NaN. Those rows form the first row of each such group.
So m in the above code looks like:
0 True
1 False
2 True
3 False
4 True
5 True
6 False
7 False
8 True
Now I want to form groups of rows of the shape [True, <all False>], because those are the groups I want to take the average of. For that, use cumsum.
If you want to take a look at those groups, you can use ngroup() after groupby on m.cumsum():
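A one-liner sketch of that (my addition, assuming m from above):
df.groupby(m.cumsum()).ngroup()
which produces: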
0 0
1 0
2 1
3 1
4 2
5 3
6 3
7 3
8 4
The above is only to show what the groups are.
Now, for each group, you can take the mean of the group if the group has any NaN value. This is accomplished by checking for NaNs using x.isna().any().
If the group has any NaN value, assign the mean after filling NaNs with 0; otherwise just keep the group as is. This is accomplished by the lambda:
lambda x: x.fillna(0).mean() if x.isna().any() else x
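An equivalent vectorized sketch (my variation, assuming the same rule as above): forward-fill the column and divide by the size of each "value plus trailing NaNs" group.
# Each non-NaN value starts a group; the NaNs that follow it belong to that group.
grp = df["Column 1"].notna().cumsum()
sizes = df.groupby(grp)["Column 1"].transform("size")
out = df["Column 1"].ffill() / sizes
Groups without trailing NaNs have size 1, so their values stay unchanged.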
Why not use interpolate? Its method parameter would probably fit your needs.
However, if you really want to do it as described above, you can do something like this. (Note that iterating over rows in pandas is considered bad practice, but it does the job.)
import pandas as pd
import numpy as np

df = pd.DataFrame([10,
                   12,
                   24,
                   np.nan,
                   15,
                   np.nan,
                   np.nan])

for col in df:
    for idx in df.index:  # (iterating over rows is considered bad practice)
        local_idx = idx
        while local_idx + 1 < len(df) and np.isnan(df.at[local_idx + 1, col]):
            local_idx += 1
        if (local_idx - idx) > 0:
            fillvalue = df.loc[idx] / (local_idx - idx + 1)
            for fillidx in range(idx, local_idx + 1):
                df.loc[fillidx] = fillvalue

df
Output:
0
0 10.0
1 12.0
2 12.0
3 12.0
4 5.0
5 5.0
6 5.0

How to assign/change values to top N values in dataframe using nlargest?

So using .nlargest I can get top N values from my dataframe.
Now if I run the following code:
df.nlargest(25, 'Change')['TopN']='TOP 25'
I expect all affected values in the TopN column to become TOP 25. But somehow this assignment does not work and those rows remain unaffected. What am I doing wrong?
df.nlargest(25, 'Change') returns a new DataFrame (a copy), so assigning a column on that copy never modifies df. Assuming you really want the TOP N (limited to N values, as nlargest would do), use the index from df.nlargest(25, 'Change') and loc:
df.loc[df.nlargest(25, 'Change').index, 'TopN'] = 'TOP 25'
Note the difference with the other approach that will give you all matching values:
df.loc[df['Change'].isin(df['Change'].nlargest(25)), 'TopN'] = 'TOP 25'
Highlighting the difference:
df = pd.DataFrame({'Change': [1,2,3,4,5,1,2,3,4,5,1,2,3,4,5]})
df.loc[df.nlargest(4, 'Change').index, 'TOP4 (A)'] = 'X'
df.loc[df['Change'].isin(df['Change'].nlargest(4)), 'TOP4 (B)'] = 'X'
output:
Change TOP4 (A) TOP4 (B)
0 1 NaN NaN
1 2 NaN NaN
2 3 NaN NaN
3 4 X X
4 5 X X
5 1 NaN NaN
6 2 NaN NaN
7 3 NaN NaN
8 4 NaN X
9 5 X X
10 1 NaN NaN
11 2 NaN NaN
12 3 NaN NaN
13 4 NaN X
14 5 X X
One thing to be aware of is that nlargest does not return ties by default: if, say, 5 rows share the 25th-ranked value of Change, nlargest returns only 25 rows rather than 29, unless you pass the keep parameter as 'all'.
Using this parameter, the top 25 can be identified as:
df.loc[df.nlargest(25, 'Change', keep='all').index, 'TopN'] = 'Top 25'
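A quick illustration of the effect of keep, using the small frame from above:
df = pd.DataFrame({'Change': [1, 2, 3, 4, 5] * 3})
len(df.nlargest(4, 'Change'))              # 4 rows: ties at the cutoff are dropped
len(df.nlargest(4, 'Change', keep='all'))  # 6 rows: all three 4s at the cutoff are kept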
A solution that compares the top-25 values against all values of the column is:
df.loc[df['Change'].isin(df['Change'].nlargest(25)), 'TopN'] = 'TOP 25'

Find the number of previous consecutive occurences of value different than current row value in pandas dataframe

Assume that we have the following pandas dataframe:
df = pd.DataFrame({'x':[0,0,1,0,0,0,0],'y':[1,1,1,1,1,1,0],'z':[0,1,1,1,0,0,1]})
x y z
0 0 1 0
1 0 1 1
2 1 1 1
3 0 1 1
4 0 1 0
5 0 1 0
6 0 0 1
The whole dataframe is filled with either 1 or 0. Looking at each column separately, if the current row value is different from the previous value, I need to count the number of previous consecutive occurrences of that previous value:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
I tried to write a lambda function and apply it to the entire dataframe, but I failed. Any ideas?
Let's try this:
def f(col):
    x = (col != col.shift().bfill())
    s = x.cumsum()
    return s.groupby(s).transform('count').shift().where(x)

df.apply(f).fillna('')
Output:
   x  y  z
0
1        1
2  2
3  1
4        3
5
6     6  2
Details:
Use apply to apply a custom function on each column of the dataframe.
Find the change spots in the column, then use cumsum to create groups of consecutive values, then groupby and transform to get the count for each group, then mask the values with where so that only the change spots keep their counts.
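For instance, on column x the intermediates look like this (a quick trace of the function above, assuming the frame from the question):
col = df['x']                   # 0, 0, 1, 0, 0, 0, 0
x = col != col.shift().bfill()  # F, F, T, T, F, F, F  -> change points
s = x.cumsum()                  # 0, 0, 1, 2, 2, 2, 2  -> run labels
# run lengths per row: 2, 2, 1, 4, 4, 4, 4; shifted and masked -> 2 at row 2, 1 at row 3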
You can try the following, where you identify the "runs" first and get the run lengths. An entry is only made where the value switches, so what gets written is the lengths of the runs except the last one.
import pandas as pd
import numpy as np

def func(x, missing=np.nan):
    runs = np.cumsum(np.append(0, np.diff(x) != 0))
    switches = np.where(np.diff(x) != 0)[0] + 1
    out = np.repeat(missing, len(x))
    out[switches] = np.bincount(runs)[:-1]
    # thanks to Scott, see comments below
    ## out[switches] = pd.value_counts(runs, sort=False)[:-1]
    return out

df.apply(func)
x y z
0 NaN NaN NaN
1 NaN NaN 1.0
2 2.0 NaN NaN
3 1.0 NaN NaN
4 NaN NaN 3.0
5 NaN NaN NaN
6 NaN 6.0 2.0
It might be faster with a good implementation of run-length encoding, but I am not too familiar with that in Python.

Pandas: Sum multiple columns, but write NaN if any column in that row is NaN or 0

I am trying to create a new column in a pandas dataframe that sums the total of other columns. However, if any of the source columns are blank (NaN or 0), I need the new column to also be written as blank (NaN):
a b c d sum
3 5 7 4 19
2 6 0 2 NaN (note the 0 in column c)
4 NaN 3 7 NaN
I am currently using the DataFrame sum method, formatted like this:
df['sum'] = df[['a','b','c','d']].sum(axis=1, numeric_only=True)
which ignores the NaNs, but does not write NaN to the sum column.
Thanks in advance for any advice
Replace your 0s with np.nan, then pass skipna=False:
df.replace(0, np.nan).sum(axis=1, skipna=False)
0 19.0
1 NaN
2 NaN
dtype: float64
df['sum'] = df.replace(0, np.nan).sum(axis=1, skipna=False)
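If the frame has columns other than a through d, the same idea restricted to just those columns would be (a small variation, not part of the answer above):
df['sum'] = df[['a', 'b', 'c', 'd']].replace(0, np.nan).sum(axis=1, skipna=False)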

Python: Create New Column Equal to the Sum of all Columns Starting from Column Number 9

I want to create a new column called 'test' in my dataframe that is equal to the sum of all the columns starting from column number 9 to the end of the dataframe. These columns are all datatype float.
Below is the code I tried, but it didn't work; it gives me back all NaN values in the 'test' column:
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum()
If I'm understanding your question, you want the row-wise sum starting at column 9. I believe you want .sum(axis=1). See an example below using column 2 instead of 9 for readability.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
df.iloc[0:3, 0:4] = np.nan  # throw in some NaN values
df.loc[:, 'test'] = df.iloc[:, 2:].sum(axis=1); print(df)
0 1 2 3 4 test
0 NaN NaN NaN NaN 0.73046 0.73046
1 NaN NaN NaN NaN 0.79060 0.79060
2 NaN NaN NaN NaN 0.53859 0.53859
3 0.97469 0.60224 0.90022 0.45015 0.52246 1.87283
4 0.84111 0.52958 0.71513 0.17180 0.34494 1.23187
5 0.21991 0.10479 0.60755 0.79287 0.11051 1.51094
6 0.64966 0.53332 0.76289 0.38522 0.92313 2.07124
7 0.40139 0.41158 0.30072 0.09303 0.37026 0.76401
8 0.59258 0.06255 0.43663 0.52148 0.62933 1.58744
9 0.12762 0.01651 0.09622 0.30517 0.78018 1.18156
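Applied to the frame from the question, that would be (assuming the same column slice):
df_UBSrepscomp['test'] = df_UBSrepscomp.iloc[:, 9:].sum(axis=1)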
