Repeat .shift() if it extracts a NaN value - python

I would like to understand what shift() does when it returns a NaN value.
Example df:
index a b
1 2 3
2 3 3
3 nan nan
4 8 7
If I type:
for i in range(2, 5):
    df["a"] = np.where((df.index == i), df.b * df.a.shift(1), df.a)
I guess (but am not sure) it will return:
index a b
1 2 3
2 6 3
3 nan nan
4 8 7
Is there a simple way that it will return:
index a b
1 2 3
2 6 3
3 nan nan
4 42 7
with the 42 in column "a" row 4 calculated as 6*7 (column "a" row 2 multiplied by column "b" row 4)
What I want is this: if the value extracted with shift(1) is NaN, the function should take the value one row above, as if I had typed shift(2). If that cell is NaN again, it should take the value in the cell above that, as if I had typed shift(3), and so on.
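One way to get this behaviour (a sketch of my own, not from the original thread): forward-fill column a before shifting, so that shift(1) always picks up the last non-NaN value above the current row:
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [2, 3, np.nan, 8], "b": [3, 3, np.nan, 7]}, index=[1, 2, 3, 4])

for i in range(2, 5):
    # ffill() replaces each NaN with the last valid value above it,
    # so shift(1) effectively skips over NaN rows
    df["a"] = np.where(df.index == i, df.b * df.a.ffill().shift(1), df.a)
After the loop, row 4 of column a holds 6 * 7 = 42 and row 3 stays NaN (its b value is NaN), matching the desired output above.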

Ignore NaN elements in a list using loc pandas

I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save, in tdf['b'], the values from df2 at each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2['a'].iloc[2]. I expect the final result to be like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this with the .loc indexer, like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, the indices get messed up (for example, index 0 of tdf['b'] should be NaN, but it gets a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
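For completeness, a self-contained version (a sketch built from the data above); map() returns NaN for the NaN lookback rows, so no dropna() is needed and row alignment is preserved:
import pandas as pd

df1 = pd.DataFrame({'a': [10, 2, 3, 1, 7, 6]})
df2 = pd.DataFrame({'a': [1, 2, 4, 3, 20, 5]})

lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
# look each index found by idxmax() up in df2['a']; NaN simply maps to NaN
tdf = pd.DataFrame({'b': s.map(df2['a'])})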

Pandas filter rows based on certain number of certain columns being NaN

I have a data set like this:
seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
A NaN 4 5 NaN 4 7 NaN 6 8
B 7 2 9 2 1 9 2 1 1
C NaN 4 4 2 4 NaN 2 6 8
D 5 NaN NaN 2 5 9 NaN 1 1
I want to remove the rows where at least three of the columns marked 'T01' are NaN
So the output would be:
seq S01-T01 S01-T02 S01-T03 S02-T01 S02-T02 S02-T03 S03-T01 S03-T02 S03-T03
B 7 2 9 2 1 9 2 1 1
C NaN 4 4 2 4 NaN 2 6 8
D 5 NaN NaN 2 5 9 NaN 1 1
Row A has NaN in S01-T01, S02-T01, and S03-T01. Row D also has three NaNs, but it is kept because I only want to remove rows that have >=3 NaNs specifically in the columns whose names contain T01.
I know this should be simple to do. I wrote:
import sys
import pandas as pd
df = pd.read_csv('data.csv',sep=',')
print(df.columns.str.contains['T01'])
To first get all of the columns with T01 in their names, and then I was going to count the NaNs in them.
I got the error:
print(df.columns.str.contains['T01'])
TypeError: 'method' object is not subscriptable
Then I thought about iterating through the rows and counting instead e.g.:
for index, row in df.iterrows():
    if 'T01' in row:
        print(row)
This runs without error but prints nothing to screen. Could someone demonstrate a better way to do this?
If you select only the 'T01' columns, you can take the row-wise sum of nulls and keep only the rows where it is less than 3.
df.loc[df[[x for x in df if 'T01' in x]].isnull().sum(1).lt(3)]
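As an aside, the str.contains attempt in the question fails only because contains is a method, so it needs parentheses rather than square brackets. The same filter can be written with it (a sketch, assuming df is loaded as in the question):
# boolean mask over the column names, True where the name contains 'T01'
t01_cols = df.columns[df.columns.str.contains('T01')]
# keep rows with fewer than three NaNs across those columns
df = df[df[t01_cols].isnull().sum(axis=1) < 3]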

How to access common values from two or more columns?

I need to count how often a value appears in one column with respect to (grouped by) another column.
For example:
There are two columns X , Y.
X:
a
b
c
a
d
a
b
b
a
a
Y:
NaN
2
4
NaN
NaN
6
4
NaN
5
4
So how do I count the NaN values in Y, grouped by the values a, b, c, d in X?
For example:
a has 2 NaN values.
b has 1 NaN value.
Per my comment, I have transposed your dataframe with df.set_index(0).T to get the following starting point.
In[1]:
0 X Y
1 a NaN
2 b 2
3 c 4
4 a NaN
5 d NaN
6 a 6
7 b 4
8 b NaN
9 a 5
10 a 4
From there, you can filter for null values with .isnull(). Then, you can use .groupby('X').size() to return the count of null values per group:
df[df['Y'].isnull()].groupby('X').size()
X
a 2
b 1
d 1
dtype: int64
Or, you could use value_counts() to achieve the same thing:
df[df['Y'].isnull()]['X'].value_counts()
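For reference, a minimal runnable version of both counts (a sketch, with the transposed data inlined):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'X': ['a', 'b', 'c', 'a', 'd', 'a', 'b', 'b', 'a', 'a'],
    'Y': [np.nan, 2, 4, np.nan, np.nan, 6, 4, np.nan, 5, 4],
})

# both print a 2, b 1, d 1; value_counts() orders by frequency instead of by label
print(df[df['Y'].isnull()].groupby('X').size())
print(df[df['Y'].isnull()]['X'].value_counts())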

Division of multiple dimension data in pandas using groupby

Since pandas can't work with more than two dimensions, I usually stack the data row-wise and use a dummy key column to mark the dimensions. Now, I need to divide one dimension by another.
For example, given this dataframe where key define the dimensions
index key value
0 a 10
1 b 12
2 a 20
3 b 15
4 a 8
5 b 9
I want to achieve this:
index key value ratio_a_b
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
Is there a way to do it using groupby?
You don't really need (and should not use) groupby for this:
# interpolate the b values
s = df['value'].where(df['key'].eq('b')).bfill()
# mask the a values and divide
# change to df['key'].ne('b') if you have many values of a
df['ratio'] = df['value'].where(df['key'].eq('a')).div(s)
Output:
index key value ratio
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
Using eq, cumsum and GroupBy.apply with shift.
We use .eq to get a boolean mask that is True where key equals 'a', then cumsum to build a unique identifier for each a, b pair.
Then we group by that identifier and divide each value by the value one row below it, using shift(-1):
s = df['key'].eq('a').cumsum()
df['ratio_a_b'] = df.groupby(s)['value'].apply(lambda x: x.div(x.shift(-1)))
Output
key value ratio_a_b
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
This is what s returns, our unique identifier for each a,b pair:
print(s)
0 1
1 1
2 2
3 2
4 3
5 3
Name: key, dtype: int32
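The same division can also be written without apply, since shift works directly on a GroupBy object (an equivalent sketch, not part of the original answer):
s = df['key'].eq('a').cumsum()
# within each a, b pair, shift(-1) pulls the b value up next to its a value
df['ratio_a_b'] = df['value'].div(df.groupby(s)['value'].shift(-1))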

Backfilling columns by groups in Pandas

I have a csv like
A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,
In row 1 and row 4 C value is missing (NaN). I want to take their value from row 2 and 5 respectively. (First occurrence of same A,B value).
If no matching row is found, just put 0 (like in last line)
Expected op:
A,B,C,D
1,2,30,
1,2,30,100
1,2,40,100
4,5,60,
4,5,60,200
4,5,70,200
8,9,0,
Looking at fillna, I found bfill ("use NEXT valid observation to fill gap"), but the NEXT observation has to be chosen logically (by matching the A, B values), not just be the next value in column C.
You'll have to call df.groupby on A and B first and then apply the bfill function:
In [501]: df.C = df.groupby(['A', 'B']).apply(lambda x: x.C.bfill()).reset_index(drop=True)
In [502]: df
Out[502]:
A B C D
0 1 2 30.0 NaN
1 1 2 30.0 100.0
2 1 2 40.0 100.0
3 4 5 60.0 NaN
4 4 5 60.0 200.0
5 4 5 70.0 200.0
6 8 9 NaN NaN
You can also group and then call GroupBy.bfill directly (I think this would be faster):
In [508]: df.C = df.groupby(['A', 'B']).C.bfill().fillna(0).astype(int); df
Out[508]:
A B C D
0 1 2 30 NaN
1 1 2 30 100.0
2 1 2 40 100.0
3 4 5 60 NaN
4 4 5 60 200.0
5 4 5 70 200.0
6 8 9 0 NaN
If you wish to get rid of NaNs in D, you could do:
df.D.fillna('', inplace=True)
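Putting the second approach together as a runnable script (a sketch, with the CSV from the question inlined):
import io
import pandas as pd

data = """A,B,C,D
1,2,,
1,2,30,100
1,2,40,100
4,5,,
4,5,60,200
4,5,70,200
8,9,,"""

df = pd.read_csv(io.StringIO(data))
# backfill C within each (A, B) group, then use 0 where no later value exists
df['C'] = df.groupby(['A', 'B'])['C'].bfill().fillna(0).astype(int)
df['D'] = df['D'].fillna('')
print(df)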
