I am trying to convert a dataframe with repeating key rows into columns, as follows:
INPUT
Key | Value
A | 1
B | 2
C | 3
A | 4
B | 5
C | 6
EXPECTED OUTPUT
A | B | C
1 | 2 | 3
4 | 5 | 6
There are a lot of options like pivot(), unstack(), groupby(), etc., but I was unsure how to use them with just the 2 columns shown in the input.
It's not a straightforward pivot. You can do this with df.pivot combined with df.apply and Series.dropna:
In [747]: x = df.pivot(index=None, columns='Key', values='Value').apply(lambda x: pd.Series(x.dropna().to_numpy()))
In [748]: x
Out[748]:
Key A B C
0 1.0 2.0 3.0
1 4.0 5.0 6.0
Explanation:
Let's break it down:
First you pivot your df like this:
In [751]: y = df.pivot(index=None, columns='Key', values='Value')
In [752]: y
Out[752]:
Key A B C
0 1.0 NaN NaN
1 NaN 2.0 NaN
2 NaN NaN 3.0
3 4.0 NaN NaN
4 NaN 5.0 NaN
5 NaN NaN 6.0
Now we are close to your expected output, but we need to remove the NaN values and collapse the 6 rows into 2 rows.
For that, we dropna() each column and rebuild it as a fresh pd.Series, so the remaining values move up to positions 0 and 1:
In [753]: y.apply(lambda x: pd.Series(x.dropna().to_numpy()))
Out[753]:
Key A B C
0 1.0 2.0 3.0
1 4.0 5.0 6.0
This is your final output.
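For reference, a hedged alternative sketch that builds an explicit row number with groupby().cumcount() before pivoting (assuming each key repeats the same number of times, as in the sample); this avoids the intermediate NaNs, so the values keep their integer dtype:

import pandas as pd

df = pd.DataFrame({'Key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'Value': [1, 2, 3, 4, 5, 6]})

# number the occurrences of each key (0, 1, ...) and use that as the row index
df['row'] = df.groupby('Key').cumcount()
out = df.pivot(index='row', columns='Key', values='Value')
print(out)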
I have a DataFrame with around 1000 columns, some columns have 0 NaNs, some have 3, some have 400.
What I want to do is remove all columns containing a run of consecutive NaNs longer than some threshold N; the rest I will impute by taking the mean of the nearest neighbors.
df
ColA | ColB | ColC | ColD | ColE
NaN 5 3 NaN NaN
NaN 6 NaN 4 4
NaN 7 4 4 4
NaN 5 5 NaN NaN
NaN 5 4 NaN 4
NaN 3 3 NaN 3
threshold = 2
remove_consecutive_nan(df,threshold)
Which would return
ColB | ColC | ColE
5 3 NaN
6 NaN 4
7 4 4
5 5 NaN
5 4 4
3 3 3
How would I write the remove_consecutive_nan function?
You can create groups of consecutive missing values within each column, count the size of each group per column, and finally filter out the columns where any group is larger than the threshold:
def remove_consecutive_nan(df, threshold):
    m = df.notna()
    mask = m.cumsum().mask(m).apply(pd.Series.value_counts).gt(threshold)
    return df.loc[:, ~mask.any(axis=0)]
print (remove_consecutive_nan(df, 2))
ColB ColC ColE
0 5 3.0 NaN
1 6 NaN 4.0
2 7 4.0 4.0
3 5 5.0 NaN
4 5 4.0 4.0
5 3 3.0 3.0
An alternative that counts the missing values in each run:
def remove_consecutive_nan(df, threshold):
    m = df.isna()
    b = m.cumsum()
    mask = b.sub(b.mask(m).ffill().fillna(0)).gt(threshold)
    return df.loc[:, ~mask.any(axis=0)]
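As a quick check, a minimal sketch that rebuilds the sample frame from the question (values copied from the table above) and calls either version:

import numpy as np
import pandas as pd

df = pd.DataFrame({'ColA': [np.nan] * 6,
                   'ColB': [5, 6, 7, 5, 5, 3],
                   'ColC': [3, np.nan, 4, 5, 4, 3],
                   'ColD': [np.nan, 4, 4, np.nan, np.nan, np.nan],
                   'ColE': [np.nan, 4, 4, np.nan, 4, 3]})

# ColA (6 consecutive NaN) and ColD (3 consecutive NaN) exceed the threshold of 2
print(remove_consecutive_nan(df, 2))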
I have data in the form:
'cat' 'value'
a 1
a,b 2
a,b,c 3
b,c 2
b 1
which I would like to convert using a pivot table:
'a'  'b'  'c'
 1
 2    2
 3    3    3
      2    2
      1
How do I perform this? If I use the pivot command:
df.pivot(columns= 'cat', values = 'value')
which yields this result
'a'  'a,b'  'a,b,c'  'b,c'  'b'
 1
       2
              3
                       2
                              1
You can use .explode() after transforming the string into a list, and then pivot it normally:
df['cat'] = df['cat'].str.split(',')
exploded = df.explode('cat')
df = exploded.pivot_table(index=exploded.index, columns='cat', values='value')
This outputs:
cat a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN
You can then rename the axis (or reset the index) if you don't want the cat label to appear.
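For example, a minimal sketch of one way to drop that label:

df.columns.name = None          # remove the 'cat' columns-axis name
df = df.reset_index(drop=True)  # optional: plain 0..n-1 row index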
Try str.get_dummies and multiply by the value column (then replace 0 with NaN if necessary):
df['cat'].str.get_dummies(",").mul(df['value'],axis=0).replace(0,np.nan)
a b c
0 1.0 NaN NaN
1 2.0 2.0 NaN
2 3.0 3.0 3.0
3 NaN 2.0 2.0
4 NaN 1.0 NaN
I have the following two pandas dataframes:
df1
A B C
0 1 2 1
1 7 3 6
2 3 10 11
df2
A B C
0 2 0 2
1 8 4 7
Where A,B and C are column headings of both dataframes.
I am trying to compare columns of df1 to columns of df2 such that the first row in df2 is the lower bound and the second row is the upper bound. Any values in df1 outside the lower and upper bound (column wise) needs to be replaced with NaN.
So in this example the output should be:
A B C
0 nan 2 nan
1 7 3 6
2 3 nan nan
As a basic attempt I tried df1[df1 < df2] = np.nan, but this does not work. I have also tried .where(), without any success.
Would appreciate some help here, thanks.
IIUC
df = df1.where(df1.ge(df2.iloc[0]) & df1.lt(df2.iloc[1]))
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
You could do something like:
lower = df1 < df2.iloc[0, :]
upper = df1 > df2.iloc[1, :]
df1[lower | upper] = np.nan
print(df1)
Output
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
Here is one with df.clip and mask:
df1.mask(df1.ne(df1.clip(lower=df2.loc[0], upper=df2.loc[1], axis=1)))
A B C
0 NaN 2.0 NaN
1 7.0 3.0 6.0
2 3.0 NaN NaN
A slightly different approach using between:
df1.apply(lambda x: x.where(x.between(*df2.values, False)), axis=1)
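For anyone who wants to reproduce the answers above, a small sketch that rebuilds both frames from the tables in the question (inclusive bounds assumed here):

import pandas as pd

df1 = pd.DataFrame({'A': [1, 7, 3], 'B': [2, 3, 10], 'C': [1, 6, 11]})
df2 = pd.DataFrame({'A': [2, 8], 'B': [0, 4], 'C': [2, 7]})

# keep values inside the per-column bounds, everything else becomes NaN
print(df1.where(df1.ge(df2.iloc[0]) & df1.le(df2.iloc[1])))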
I have a column in a pandas dataframe where some of the rows have NaN values.
I would like to select the rows that satisfy these conditions:
- they are NaN values;
- they are not directly preceded or followed by another NaN value (i.e. they are isolated NaNs)
For example, I would like to select the row containing this NaN value:
input:
index | Col
...
1 | 1344
2 | NaN
3 | 532
...
desired output:
2 | NaN
But I don't want to select these NaN values (as they are followed by a NaN value or come right after another NaN value):
index | Col
...
1 | 1344
2 | NaN
3 | NaN
4 | 532
...
Any help would be much appreciated
Thank you!
Below I show how to do it with an example. Series.notna + Series.cumsum is used inside groupby to put each run of consecutive NaN values into its own group (together with the non-null value that starts it). Using transform('size') you get a Boolean Series that is False for the groups containing more than one NaN. The AND of this Boolean Series with df['col2'].isna() is the mask we are looking for: it performs the Boolean indexing and selects the rows where there is a NaN that is not part of a consecutive run.
df=pd.DataFrame({'col1':[1,2,3,4,5,6,7,8,9,10],'col2':[np.nan,2,3,np.nan,np.nan,6,np.nan,8,9,np.nan]})
print(df)
col1 col2
0 1 NaN
1 2 2.0
2 3 3.0
3 4 NaN
4 5 NaN
5 6 6.0
6 7 NaN
7 8 8.0
8 9 9.0
9 10 NaN
mask_repeat_NaN=df.groupby(df['col2'].notna().cumsum())['col2'].transform('size').le(2)
mask=mask_repeat_NaN&df['col2'].isna()
df_filtered=df[mask]
print(df_filtered)
col1 col2
0 1 NaN
6 7 NaN
9 10 NaN
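As an aside, a hedged alternative sketch on the same df that checks the two neighbours directly with shift instead of groupby (off-the-edge neighbours are treated as non-NaN, which matches the output above):

is_na = df['col2'].isna()
# keep a NaN only if neither the previous nor the next value is also NaN
mask = is_na & ~is_na.shift(1, fill_value=False) & ~is_na.shift(-1, fill_value=False)
print(df[mask])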
I am an R user learning how to use Python's dfply, the Python equivalent to R's dplyr. My problem: in dfply, I am unable to mask on multiple conditions in a pipe. I seek a solution involving dfply pipes rather than multiple lines of subsetting.
My code:
# Import
import pandas as pd
import numpy as np
from dfply import *
# Create data frame and mask it
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
       mask((X.a.isnull()) | ~(X.b.isnull())))
print(df)
print(df2)
Here is the original data frame, df:
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
4 5.0 NaN 1
And here is the result of the piped mask, df2:
a b c
0 NaN 6.0 5
4 5.0 NaN 1
However, I expect this instead:
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
Why don't the "|" and "~" operators result in rows in which column "a" is either NaN or column "b" is not NaN?
By the way, I also tried np.logical_or():
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >>
       mask(np.logical_or(X.a.isnull(), ~X.b.isnull())))
print(df)
print(df2)
But this resulted in an error:
mask(np.logical_or(X.a.isnull(),~X.b.isnull())))
ValueError: invalid __array_struct__
Edit: Tweak the second conditional to X.b.notnull(). No idea why the tilde is ignored after the pipe.
df = pd.DataFrame({'a':[np.nan,2,3,4,5],'b':[6,7,8,9,np.nan],'c':[5,4,3,2,1]})
df2 = (df >> mask((X.a.isnull()) | (X.b.notnull())))
print(df2)
a b c
0 NaN 6.0 5
1 2.0 7.0 4
2 3.0 8.0 3
3 4.0 9.0 2
How about filter_by?
df >> filter_by((X.a.isnull()) | (X.b.notnull()))
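For comparison, a sketch of the same filter in plain pandas (no dfply), where | and ~ behave as the question expects:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 2, 3, 4, 5],
                   'b': [6, 7, 8, 9, np.nan],
                   'c': [5, 4, 3, 2, 1]})

# rows where 'a' is NaN or 'b' is not NaN
print(df[df['a'].isnull() | ~df['b'].isnull()])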