Repeat .shift() if it extracts a NaN value - python

I would like to understand what shift() does when it returns a NaN value.
Example df:
index a b
1 2 3
2 3 3
3 nan nan
4 8 7
If I type:
for i in range(2, 5):
    df["a"] = np.where((df.index == i), df.b * df.a.shift(1), df.a)
I guess (but am not sure) it will return:
index a b
1 2 3
2 6 3
3 nan nan
4 8 7
Is there a simple way that it will return:
index a b
1 2 3
2 6 3
3 nan nan
4 42 7
with the 42 in column "a" row 4 calculated as 6*7 (column "a" row 2 multiplied by column "b" row 4)
What I want is this: if the value extracted with shift(1) is NaN, the function should take the value one row above, as if I had typed shift(2). If that cell is NaN again, it should take the value in the cell above that, as if I had typed shift(3), and so on.
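One way to get this behaviour (a sketch of my own, not from the original thread): forward-fill column a before shifting, so that shift(1) always picks up the last non-NaN value above the current row:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [2, 3, np.nan, 8], "b": [3, 3, np.nan, 7]}, index=[1, 2, 3, 4])

for i in range(2, 5):
    # ffill() replaces each NaN with the last valid value above it,
    # so shift(1) effectively skips over NaN rows
    df["a"] = np.where(df.index == i, df.b * df.a.ffill().shift(1), df.a)
After the loop, row 4 of column a holds 6 * 7 = 42 and row 3 stays NaN (its b value is NaN), matching the desired output above.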

Ignore NaN elements in a list using loc pandas

I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save, in tdf['b'], the values from df2 at each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2['a'].iloc[2]. I expect the final result to be like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this with the .loc indexer, like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, the indices get messed up (for example, index 0 of tdf['b'] should be NaN, but it gets a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
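For completeness, a self-contained version (a sketch built from the data above); map() returns NaN for the NaN lookback rows, so no dropna() is needed and row alignment is preserved:
import pandas as pd

df1 = pd.DataFrame({'a': [10, 2, 3, 1, 7, 6]})
df2 = pd.DataFrame({'a': [1, 2, 4, 3, 20, 5]})

lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
# look each index found by idxmax() up in df2['a']; NaN simply maps to NaN
tdf = pd.DataFrame({'b': s.map(df2['a'])})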

Pandas filter rows based on certain number of certain columns being NaN

I have a data set like this:
seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
A NaN 4 5 NaN 4 7 NaN 6 8
B 7 2 9 2 1 9 2 1 1
C NaN 4 4 2 4 NaN 2 6 8
D 5 NaN NaN 2 5 9 NaN 1 1
I want to remove the rows where at least three of the columns marked 'T01' are NaN
So the output would be:
seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B 7 2 9 2 1 9 2 1 1
C NaN 4 4 2 4 NaN 2 6 8
D 5 NaN NaN 2 5 9 NaN 1 1
Row A has NaN in S01-T01, S02-T01, and S03-T01. Row D also has three NaNs, but it is kept because I only want to remove rows that have >=3 NaNs specifically in the columns whose names contain T01.
I know this should be simple to do. I wrote:
import sys
import pandas as pd
df = pd.read_csv('data.csv',sep=',')
print(df.columns.str.contains['T01'])
To first get all of the columns with T01 in their names, and then I was going to count the NaNs in them.
I got the error:
print(df.columns.str.contains['T01'])
TypeError: 'method' object is not subscriptable
Then I thought about iterating through the rows and counting instead e.g.:
for index, row in df.iterrows():
    if 'T01' in row:
        print(row)
This runs without error but prints nothing to screen. Could someone demonstrate a better way to do this?
If you select only the 'T01' columns, you can take the row-wise sum of nulls and keep only the rows where it is less than 3.
df.loc[df[[x for x in df if 'T01' in x]].isnull().sum(1).lt(3)]
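As an aside, the str.contains attempt in the question fails only because contains is a method, so it needs parentheses rather than square brackets. The same filter can be written with it (a sketch, assuming df is loaded as in the question):
# boolean mask over the column names, True where the name contains 'T01'
t01_cols = df.columns[df.columns.str.contains('T01')]
# keep rows with fewer than three NaNs across those columns
df = df[df[t01_cols].isnull().sum(axis=1) < 3]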

How to access common values from two or more columns?

I need to count how often a value appears in one column with respect to (grouped by) another column.
For example:
There are two columns X , Y.
X:
a
b
c
a
d
a
b
b
a
a
Y:
NaN
2
4
NaN
NaN
6
4
NaN
5
4
So how do I count the NaN values in Y, grouped by the values a, b, c, d in X?
For example:
a has 2 NaN values.
b has 1 NaN value.
Per my comment, I have transposed your dataframe with df.set_index(0).T to get the following starting point.
In[1]:
0 X Y
1 a NaN
2 b 2
3 c 4
4 a NaN
5 d NaN
6 a 6
7 b 4
8 b NaN
9 a 5
10 a 4
From there, you can filter for null values with .isnull(). Then, you can use .groupby('X').size() to return the count of null values per group:
df[df['Y'].isnull()].groupby('X').size()
X
a 2
b 1
d 1
dtype: int64
Or, you could use value_counts() to achieve the same thing:
df[df['Y'].isnull()]['X'].value_counts()
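For reference, a minimal runnable version of both counts (a sketch, with the transposed data inlined):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': ['a', 'b', 'c', 'a', 'd', 'a', 'b', 'b', 'a', 'a'],
    'Y': [np.nan, 2, 4, np.nan, np.nan, 6, 4, np.nan, 5, 4],
})

# both print a 2, b 1, d 1; value_counts() orders by frequency instead of by label
print(df[df['Y'].isnull()].groupby('X').size())
print(df[df['Y'].isnull()]['X'].value_counts())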

Division of multiple dimension data in pandas using groupby

Since pandas can't work with more than two dimensions, I usually stack the data row-wise and use a dummy key column to mark the dimensions. Now, I need to divide one dimension by another.
For example, given this dataframe where key define the dimensions
index key value
0 a 10
1 b 12
2 a 20
3 b 15
4 a 8
5 b 9
I want to achieve this:
index key value ratio_a_b
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
Is there a way to do it using groupby?
You don't really need (and should not use) groupby for this:
# interpolate the b values
s = df['value'].where(df['key'].eq('b')).bfill()
# mask the a values and divide
# change to df['key'].ne('b') if you have many values of a
df['ratio'] = df['value'].where(df['key'].eq('a')).div(s)
Output:
index key value ratio
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
Using eq, cumsum and GroupBy.apply with shift.
We use .eq to get a boolean mask that is True where key equals 'a', then cumsum to build a unique identifier for each a, b pair.
Then we group by that identifier and divide each value by the value one row below it, using shift(-1):
s = df['key'].eq('a').cumsum()
df['ratio_a_b'] = df.groupby(s)['value'].apply(lambda x: x.div(x.shift(-1)))
Output
key value ratio_a_b
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
This is what s returns, our unique identifier for each a,b pair:
print(s)
0 1
1 1
2 2
3 2
4 3
5 3
Name: key, dtype: int32
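The same division can also be written without apply, since shift works directly on a GroupBy object (an equivalent sketch, not part of the original answer):
s = df['key'].eq('a').cumsum()
# within each a, b pair, shift(-1) pulls the b value up next to its a value
df['ratio_a_b'] = df['value'].div(df.groupby(s)['value'].shift(-1))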

Backfilling columns by groups in Pandas

I have a csv like
A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
In row 1 and row 4 C value is missing (NaN). I want to take their value from row 2 and 5 respectively. (First occurrence of same A,B value).
If no matching row is found, just put 0 (like in last line)
Expected op:
A,B,C,D
1,2,30,
1,2,30,100
1,2,40,100
4,5,60,
4,5,60,200
4,5,70,200
8,9,0,
Looking at fillna, I found bfill ("use NEXT valid observation to fill gap"), but the NEXT observation has to be chosen logically (by matching the A, B values), not just be the next value in column C.
You'll have to call df.groupby on A and B first and then apply the bfill function:
In [501]: df.C = df.groupby(['A', 'B']).apply(lambda x: x.C.bfill()).reset_index(drop=True)
In [502]: df
Out[502]:
A B C D
0 1 2 30.0 NaN
1 1 2 30.0 100.0
2 1 2 40.0 100.0
3 4 5 60.0 NaN
4 4 5 60.0 200.0
5 4 5 70.0 200.0
6 8 9 NaN NaN
You can also group and then call GroupBy.bfill directly (I think this would be faster):
In [508]: df.C = df.groupby(['A', 'B']).C.bfill().fillna(0).astype(int); df
Out[508]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
If you wish to get rid of NaNs in D, you could do:
df.D.fillna('', inplace=True)
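Putting the second approach together as a runnable script (a sketch, with the CSV from the question inlined):
import io
import pandas as pd

data = """A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,"""

df = pd.read_csv(io.StringIO(data))
# backfill C within each (A, B) group, then use 0 where no later value exists
df['C'] = df.groupby(['A', 'B'])['C'].bfill().fillna(0).astype(int)
df['D'] = df['D'].fillna('')
print(df)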
