I need to count values in one column grouped by the values of another column.
For example:
There are two columns X , Y.
X:
a
b
c
a
d
a
b
b
a
a
Y:
NaN
2
4
NaN
NaN
6
4
NaN
5
4
So how do I count values like NaN in Y grouped by a, b, c, d in X?
For example:
a has 2 NaN values.
b has 1 NaN value.
Per my comment, I have transposed your dataframe with df.set_index(0).T to get the following starting point.
In[1]:
0 X Y
1 a NaN
2 b 2
3 c 4
4 a NaN
5 d NaN
6 a 6
7 b 4
8 b NaN
9 a 5
10 a 4
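If you want to reproduce that starting point directly, a minimal construction (values copied from the listing above) is:
import numpy as np
import pandas as pd

# X holds the grouping labels, Y holds the values with NaNs
df = pd.DataFrame({
    'X': ['a', 'b', 'c', 'a', 'd', 'a', 'b', 'b', 'a', 'a'],
    'Y': [np.nan, 2, 4, np.nan, np.nan, 6, 4, np.nan, 5, 4],
}, index=range(1, 11))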
From there, you can filter for null values with .isnull(). Then, you can use .groupby('X').size() to return the count of null values per group:
df[df['Y'].isnull()].groupby('X').size()
X
a 2
b 1
d 1
dtype: int64
Or, you could use value_counts() to achieve the same thing:
df[df['Y'].isnull()]['X'].value_counts()
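If you also want groups without any missing values (like c here) to appear with a count of 0, a small variation that sums a boolean mask per group should do it:
# Count NaNs in Y per value of X; groups with no NaNs show up as 0
df['Y'].isna().groupby(df['X']).sum()
# expected: a 2, b 1, c 0, d 1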
Related
I have 2 different dataframes: df1, df2
df1:
index a
0 10
1 2
2 3
3 1
4 7
5 6
df2:
index a
0 1
1 2
2 4
3 3
4 20
5 5
I want to find the index of maximum values with a specific lookback in df1 (let's consider lookback=3 in this example). To do this, I use the following code:
tdf['a'] = df1.rolling(lookback).apply(lambda x: x.idxmax())
And the result would be:
id a
0 nan
1 nan
2 0
3 2
4 4
5 4
Now I need to save, in tdf['b'], the value from df2 for each index found by idxmax().
So if tdf['a'].iloc[3] == 2, I want tdf['b'].iloc[3] == df2['a'].iloc[2]. I expect the final result to be like this:
id b
0 nan
1 nan
2 1
3 4
4 20
5 20
I'm guessing that I can do this using the .loc indexer, like this:
tdf['b'] = df2.loc[tdf['a']]
But it throws an exception because there are NaN values in tdf['a']. If I use dropna() before passing tdf['a'] to .loc, then the indices get messed up (for example, in tdf['b'] index 0 has to be NaN, but it will have a value after dropna()).
Is there any way to get what I want?
Simply use a map:
lookback = 3
s = df1['a'].rolling(lookback).apply(lambda x: x.idxmax())
s.map(df2['a'])
Output:
0 NaN
1 NaN
2 1.0
3 4.0
4 20.0
5 20.0
Name: a, dtype: float64
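To put this back into the frame from the question, the mapped series can simply be assigned (a sketch assuming tdf shares df1's index):
# s carries df1's index, so the assignment aligns row by row
tdf['b'] = s.map(df2['a'])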
I would like to understand what shift() does when it returns a NaN value.
Example df:
index a b
1 2 3
2 3 3
3 nan nan
4 8 7
If I type:
for i in range(2, 5):
    df["a"] = np.where(df.index == i, df.b * df.a.shift(1), df.a)
I guess (but am not sure) it will return:
index a b
1 2 3
2 6 3
3 nan nan
4 8 7
Is there a simple way that it will return:
index a b
1 2 3
2 6 3
3 nan nan
4 42 7
with the 42 in column "a" row 4 calculated as 6*7 (column "a" row 2 multiplied by column "b" row 4)
What I want is: if the value extracted with shift(1) is NaN, take the value one row above instead, as if I had typed shift(2). If that cell is NaN again, take the value in the cell above that, as if I had typed shift(3), and so on.
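One possible sketch, assuming the loop from above is kept, is to forward-fill column a before shifting, so that a NaN in the shifted position falls back to the last valid value above it:
import numpy as np

for i in range(2, 5):
    # ffill() replaces a NaN with the last non-NaN value above it, so the
    # shift(1) effectively behaves like shift(2), shift(3), ... across NaN rows
    df["a"] = np.where(df.index == i, df.b * df.a.ffill().shift(1), df.a)
# for the example frame this yields a = [2, 6, NaN, 42]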
I have a pandas DataFrame containing columns with missing values. I want to remove observations (rows) with missing values, but only with respect to specific columns. For example:
A B C D E
2 1 NaN 7 9
1 3 6 NaN 10
NaN 3 11 0 8
And let's say I want to remove observations with missing value for column D. So I want result like this:
A B C D E
2 1 NaN 7 9
NaN 3 11 0 8
Thank you for all suggestions.
Let's try a boolean mask with pd.Series.notna():
df[df.D.notna()]
A B C D E
0 2.0 1 NaN 7.0 9
2 NaN 3 11.0 0.0 8
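An equivalent option is dropna with the subset argument, which drops rows only when the listed columns are missing:
# Drop rows where D is NaN, keeping NaNs in other columns
df.dropna(subset=['D'])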
Since pandas can't hold more than two dimensions, I usually stack the data row-wise and use a dummy column to mark the dimensions. Now I need to divide one dimension by another.
For example, given this dataframe where key define the dimensions
index key value
0 a 10
1 b 12
2 a 20
3 b 15
4 a 8
5 b 9
I want to achieve this:
index key value ratio_a_b
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
Is there a way to do it using groupby?
You don't really need (and should not use) groupby for this:
# interpolate the b values
s = df['value'].where(df['key'].eq('b')).bfill()
# mask the a values and divide
# change to df['key'].ne('b') if you have many values of a
df['ratio'] = df['value'].where(df['key'].eq('a')).div(s)
Output:
index key value ratio
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
Using eq, cumsum and GroupBy.apply with shift:
We use .eq to get a boolean mask where the value is a, then we use cumsum to make a unique identifier for each a, b pair.
Then we use groupby and divide each value by the value one row below with shift(-1).
# mark rows where key == 'a'; the running sum labels each consecutive a, b pair
s = df['key'].eq('a').cumsum()
# within each pair, divide the a value by the b value one row below
df['ratio_a_b'] = df.groupby(s)['value'].apply(lambda x: x.div(x.shift(-1)))
Output
key value ratio_a_b
0 a 10 0.833333
1 b 12 NaN
2 a 20 1.333333
3 b 15 NaN
4 a 8 0.888889
5 b 9 NaN
This is what s returns, our unique identifier for each a,b pair:
print(s)
0 1
1 1
2 2
3 2
4 3
5 3
Name: key, dtype: int32
df
index group1 group2 a b c d
-
0 a b 1 2 NaN NaN
1 b c NaN 5 1 NaN
2 c d NaN NaN 6 9
4 b a 1 7 NaN NaN
5 d a 6 NaN NaN 5
Expected df:
index group1 group2 one two
-
0 a b 1 2
1 b c 5 1
2 c d 6 9
4 b a 7 1
5 d a 5 6
I want to match values based on columns ['group1', 'group2'] and append them to columns ['one', 'two'] in order. For example, for row index 5: group1 is 'd', so it takes the value 5 from column 'd' first, and then it does the same for group2.
I am trying to use the lookup function, df['one'] = df.lookup(df.index, df['group1']); it works on small data, but with big data with lots of columns the values get mixed up.
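Note that DataFrame.lookup is deprecated in recent pandas versions. A sketch of the same row-wise lookup with plain NumPy indexing, assuming the value columns are exactly ['a', 'b', 'c', 'd']:
import numpy as np
import pandas as pd

value_cols = ['a', 'b', 'c', 'd']   # assumed lookup columns
vals = df[value_cols].to_numpy()
rows = np.arange(len(df))

# For every row, pick the cell in the column named by group1 / group2
df['one'] = vals[rows, pd.Index(value_cols).get_indexer(df['group1'])]
df['two'] = vals[rows, pd.Index(value_cols).get_indexer(df['group2'])]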