Pandas nested sort and NaN - python

I'm trying to understand the expected behavior of DataFrame.sort on columns with NaN values.
Given this DataFrame:
In [36]: df
Out[36]:
a b
0 1 9
1 2 NaN
2 NaN 5
3 1 2
4 6 5
5 8 4
6 4 5
Sorting using one column puts the NaN at the end, as expected:
In [37]: df.sort(columns="a")
Out[37]:
a b
0 1 9
3 1 2
1 2 NaN
6 4 5
4 6 5
5 8 4
2 NaN 5
But nested sort doesn't behave as I would expect, leaving the NaN unsorted:
In [38]: df.sort(columns=["a","b"])
Out[38]:
a b
3 1 2
0 1 9
1 2 NaN
2 NaN 5
6 4 5
4 6 5
5 8 4
Is there a way to make sure the NaNs in nested sort will appear at the end, per column?

Until this is fixed in pandas, here is what I'm using for my needs; it implements a subset of the functionality of the original DataFrame.sort function and works for numerical values only:
import numpy as np

def dataframe_sort(df, columns, ascending=True):
    a = np.array(df[columns])
    # ascending/descending array: -1 if descending, 1 if ascending
    if isinstance(ascending, bool):
        ascending = len(columns) * [ascending]
    ascending = [1 if asc else -1 for asc in ascending]
    # lexsort keys go last-to-first; NaN sorts to the end of each key
    ind = np.lexsort([ascending[i] * a[:, i] for i in reversed(range(len(columns)))])
    return df.iloc[ind]
Usage example:
In [4]: df
Out[4]:
a b c
10 1 9 7
11 NaN NaN 1
12 2 NaN 6
13 NaN 5 6
14 1 2 6
15 6 5 NaN
16 8 4 4
17 4 5 3
In [5]: dataframe_sort(df, ['a', 'c'], False)
Out[5]:
a b c
16 8 4 4
15 6 5 NaN
17 4 5 3
12 2 NaN 6
10 1 9 7
14 1 2 6
13 NaN 5 6
11 NaN NaN 1
In [6]: dataframe_sort(df, ['b', 'a'], [False, True])
Out[6]:
a b c
10 1 9 7
17 4 5 3
15 6 5 NaN
13 NaN 5 6
16 8 4 4
14 1 2 6
12 2 NaN 6
11 NaN NaN 1
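For reference: in pandas 0.17+, DataFrame.sort was replaced by DataFrame.sort_values, whose na_position parameter (default "last") sends NaN keys to the end per column, even in multi-column sorts. A minimal sketch on the question's data:

```python
import numpy as np
import pandas as pd

# The question's frame
df = pd.DataFrame({"a": [1, 2, np.nan, 1, 6, 8, 4],
                   "b": [9, np.nan, 5, 2, 5, 4, 5]})

# Multi-column sort; na_position="last" pushes NaN keys to the end
out = df.sort_values(["a", "b"], na_position="last")
```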

Related

Pandas Dataframe aggregating function to count also nan values

I have the following dataframe
print(A)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
And I have the following Code (Pandas Dataframe count occurrences that only happen immediately), which counts the occurrences of values that happen immediately one after another.
ser = A["1or0"].ne(A["1or0"].shift().bfill()).cumsum()
B = (
    A.groupby(ser, as_index=False)
     .agg({"Index": ["first", "last", "count"],
           "1or0": "unique"})
     .set_axis(["StartNum", "EndNum", "Size", "Value"], axis=1)
     .assign(Value=lambda d: d["Value"].astype(str).str.strip("[]"))
)
print(B)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
The issue is that when NaN values occur, the code doesn't group them into one interval; each NaN is counted as its own one-sized interval instead of, e.g., one interval of size 3:
print(A2)
Index 1or0
0 1 0
1 2 0
2 3 0
3 4 1
4 5 1
5 6 1
6 7 1
7 8 0
8 9 1
9 10 1
10 11 NaN
11 12 NaN
12 13 NaN
print(B2)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 11 1 NaN
5 12 12 1 NaN
6 13 13 1 NaN
But I want B2 to be the following
print(B2Wanted)
StartNum EndNum Size Value
0 1 3 3 0
1 4 7 4 1
2 8 8 1 0
3 9 10 2 1
4 11 13 3 NaN
What do I need to change so that it works also with NaN?
First fillna with a value that cannot occur in the data (here -1) before creating your grouper:
group = A['1or0'].fillna(-1).diff().ne(0).cumsum()
# or
# s = A['1or0'].fillna(-1)
# group = s.ne(s.shift()).cumsum()
B = (A.groupby(group, as_index=False)
      .agg(**{'StartNum': ('Index', 'first'),
              'EndNum': ('Index', 'last'),
              'Size': ('1or0', 'size'),
              'Value': ('1or0', 'first')})
)
Output:
StartNum EndNum Size Value
0 1 3 3 0.0
1 4 7 4 1.0
2 8 8 1 0.0
3 9 10 2 1.0
4 11 13 3 NaN
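As a self-contained check of this approach, reconstructing A2 from the question:

```python
import numpy as np
import pandas as pd

# Reconstruct A2 from the question
A2 = pd.DataFrame({"Index": range(1, 14),
                   "1or0": [0, 0, 0, 1, 1, 1, 1, 0, 1, 1,
                            np.nan, np.nan, np.nan]})

# fillna(-1) makes consecutive NaNs compare equal, so they share one group
group = A2["1or0"].fillna(-1).diff().ne(0).cumsum()
B2 = (A2.groupby(group, as_index=False)
        .agg(StartNum=("Index", "first"),
             EndNum=("Index", "last"),
             Size=("1or0", "size"),
             Value=("1or0", "first")))
```

Note that "first" skips NaN inside a group, so an all-NaN run still reports Value = NaN, while "size" counts every row of the run.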

Count preceding non NaN values in pandas

I have a DataFrame that looks like the following:
a b c
0 NaN 8 NaN
1 NaN 7 NaN
2 NaN 5 NaN
3 7.0 3 NaN
4 3.0 5 NaN
5 5.0 4 NaN
6 7.0 1 NaN
7 8.0 9 3.0
8 NaN 5 5.0
9 NaN 6 4.0
What I want to create is a new DataFrame where each cell contains the count of non-NaN values in the same column, up to and including that row. The resulting DataFrame would look like this:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
I have achieved it with the following code:
out = df.copy()
for i in range(len(df)):
    out.iloc[i] = df.iloc[0 : i + 1].notna().sum()
However, this loops row by row. My real DataFrame contains thousands of columns and rows, so iterating like this is impossible due to the low processing speed. What can I do? Maybe it should be something related to using the pandas .apply() function.
There's no need for apply. It can be done much more efficiently using notna + cumsum (notna for the non-NaN values and cumsum for the counts):
out = df.notna().cumsum()
Output:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
Check with notna combined with cumsum:
out = df.notna().cumsum()
Out[220]:
a b c
0 0 1 0
1 0 2 0
2 0 3 0
3 1 4 0
4 2 5 0
5 3 6 0
6 4 7 0
7 5 8 1
8 5 9 2
9 5 10 3
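Both answers reduce to the same two vectorized steps; a self-contained sketch reconstructing the question's frame:

```python
import numpy as np
import pandas as pd

# The question's DataFrame
df = pd.DataFrame({"a": [np.nan, np.nan, np.nan, 7.0, 3.0,
                         5.0, 7.0, 8.0, np.nan, np.nan],
                   "b": [8, 7, 5, 3, 5, 4, 1, 9, 5, 6],
                   "c": [np.nan] * 7 + [3.0, 5.0, 4.0]})

# notna() marks non-NaN cells True; cumsum() turns the marks into running counts
out = df.notna().cumsum()
```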

how to dynamically add values of some of the columns in a dataframe?

dataframe in the image
year= 2020 (MAX COLUMN)
lastFifthYear = year - 4
input = '2001509-00'
I want to add all the values between year (2020) and lastFifthYear (2016) where INPUT_PARTNO equals input,
so for this input value I should get 4 + 6 + 2 + 3 + 2 (2016 + 2017 + 2018 + 2019 + 2020), i.e. 17.
please give me some code
Here is some code that should work but you definitely need to improve on the way you ask questions here :-)
Considering df is the table you pasted as image above.
>>> year = 2016
>>> df_new=df.query('INPUT_PARTNO == "2001509-00"').melt(['ACTUAL_GI_YEAR', 'INPUT_PARTNO'], var_name='year', value_name='number')
>>> df_new.year=df_new.year.astype(int)
>>> df_new[df_new.year >= year].groupby(['ACTUAL_GI_YEAR','INPUT_PARTNO']).agg({'number' : sum})
number
ACTUAL_GI_YEAR INPUT_PARTNO
0 2001509-00 17
Example Setup
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, (10, 10)),
                  columns=list('ab') + list(range(2, 10)))
Solved
# sum, for rows where a == 9, the columns whose names fall between 3 and 6
df['number'] = df.loc[df['a'].eq(9),
                      pd.to_numeric(df.columns, errors='coerce')
                        .to_series()
                        .between(3, 6)
                        .values].sum(axis=1)
print(df)
a b 2 3 4 5 6 7 8 9 number
0 1 9 9 2 6 0 6 1 4 2 NaN
1 2 3 4 8 7 2 4 0 0 6 NaN
2 2 2 7 4 9 6 7 1 0 0 NaN
3 0 3 5 3 0 4 2 7 2 6 NaN
4 7 7 1 4 7 7 9 7 4 2 NaN
5 9 9 9 0 3 3 3 8 7 7 9.0
6 9 0 5 5 7 9 6 6 5 7 27.0
7 2 1 9 1 9 3 3 4 4 9 NaN
8 4 0 5 9 6 7 3 9 1 6 NaN
9 5 5 0 8 6 4 5 4 7 4 NaN
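Since the original table was only posted as an image, here is a hypothetical reconstruction of the relevant row (the column layout is an assumption), showing the melt-based answer end to end:

```python
import pandas as pd

# Hypothetical reconstruction: one part row with one column per year
df = pd.DataFrame({"ACTUAL_GI_YEAR": [0],
                   "INPUT_PARTNO": ["2001509-00"],
                   2016: [4], 2017: [6], 2018: [2], 2019: [3], 2020: [2]})

# Wide-to-long: one row per (part, year) pair
df_new = (df.query('INPUT_PARTNO == "2001509-00"')
            .melt(["ACTUAL_GI_YEAR", "INPUT_PARTNO"],
                  var_name="year", value_name="number"))
df_new["year"] = df_new["year"].astype(int)

# Sum the values from lastFifthYear (2016) onward
total = df_new.loc[df_new["year"] >= 2016, "number"].sum()
```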

How to loop through each row in pandas dataframe and set values equal to nan after a threshold is surpassed?

If I have a pandas dataframe like this:
0 1 2 3 4 5
A 5 5 10 9 4 5
B 10 10 10 8 1 1
C 8 8 0 9 6 3
D 10 10 11 4 2 9
E 0 9 1 5 8 3
If I set a threshold of 7, how do I loop through each row and set the values after the threshold is no longer met equal to np.nan such that I get a data frame like this:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2 9
E 0 9 1 5 8 NaN
Where everything after the last number greater than 7 is set equal to np.nan.
Let's try this:
df.where(df.where(df > 7).bfill(axis=1).notna())
Output:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
Create a mask m by using df.where on df.gt(7), then bfill along axis 1 and notna. Finally, index df with m:
m = df.where(df.gt(7)).bfill(axis=1).notna()
df[m]
Out[24]:
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
A very nice question. Reverse the column order, then cumsum; positions where the cumulative count is still 0 become NaN:
df.where(df.iloc[:,::-1].gt(7).cumsum(1).ne(0))
0 1 2 3 4 5
A 5 5 10 9 NaN NaN
B 10 10 10 8 NaN NaN
C 8 8 0 9 NaN NaN
D 10 10 11 4 2.0 9.0
E 0 9 1 5 8.0 NaN
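For reference, the first answer's one-liner run end to end on the question's data:

```python
import numpy as np
import pandas as pd

# The question's frame
df = pd.DataFrame([[5, 5, 10, 9, 4, 5],
                   [10, 10, 10, 8, 1, 1],
                   [8, 8, 0, 9, 6, 3],
                   [10, 10, 11, 4, 2, 9],
                   [0, 9, 1, 5, 8, 3]],
                  index=list("ABCDE"))

# Keep each row's values only up to (and including) its last cell > 7
out = df.where(df.where(df > 7).bfill(axis=1).notna())
```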

Add new column with one value

I have the following dataframe:
a = pd.DataFrame([[1,2,3], [4,5,6], [7,8,9], [10, 11, 12]], columns=['a','b','c'])
a
Out[234]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
3 10 11 12
I want to add a column with only the last row as the mean of the last 2 values of column c. Something like:
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 mean(9,12)
I tried this but the first part gives an error:
a['d'].iloc[-1] = a.c.iloc[-2:].values.mean()
You can use .at to assign at a single row/column label pair:
ix = a.shape[0]
a.at[ix-1,'d'] = a.loc[ix-2:ix, 'c'].values.mean()
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
Also note that chained indexing (what you're doing with a.c.iloc[-2:]) is explicitly discouraged in the docs, given that pandas sees these operations as separate events, namely two separate calls to __getitem__, rather than a single call using a nested tuple of slices.
You may set the d column beforehand (to ensure the assignment target exists):
In [100]: a['d'] = np.nan
In [101]: a['d'].iloc[-1] = a.c.iloc[-2:].mean()
In [102]: a
Out[102]:
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
We can use .loc, .iloc & np.mean
a.loc[a.index.max(), 'd'] = np.mean(a.iloc[-2:, 2])
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
Or just using .loc and np.mean:
a.loc[a.index.max(), 'd'] = np.mean(a.loc[a.index.max()-1:, 'c'])
a b c d
0 1 2 3 NaN
1 4 5 6 NaN
2 7 8 9 NaN
3 10 11 12 10.5
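Combining the advice above, a single .loc assignment through the real last index label avoids chained assignment entirely; a minimal sketch:

```python
import pandas as pd

a = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
                 columns=['a', 'b', 'c'])

# One .loc call: the last row's label, new column 'd'
a.loc[a.index[-1], 'd'] = a['c'].iloc[-2:].mean()
```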
