Computing correlation of a matrix with its transpose - python

I am trying to compute the correlation of a matrix (here, the rows of a DataFrame) with its transpose using apply.
The following is the code:
import pandas as pd
from pprint import pprint
d = {'A': [1,0,3,0], 'B':[2,0,1,0], 'C':[0,0,8,0], 'D':[1,0,0,1]}
df = pd.DataFrame(data=d)
df_T = df.T
corr = df.apply(lambda s: df_T.corrwith(s))
Every column of the resulting corr variable contains only NaN entries. I'd
like to understand why the NaNs occur.
Could someone explain?

I think you need DataFrame.corr:
print (df.corr())
A B C D
A 1.000000 0.492366 0.942809 -0.408248
B 0.492366 1.000000 0.174078 0.301511
C 0.942809 0.174078 1.000000 -0.577350
D -0.408248 0.301511 -0.577350 1.000000
For your own approach to work, the index and the columns must have the same labels, because corrwith aligns on labels before correlating:
df = pd.DataFrame(data=d).set_index(df.columns)
print (df)
A B C D
A 1 2 0 1
B 0 0 0 0
C 3 1 8 0
D 0 0 0 1
df_T = df.T
corr = df.apply(lambda s: df_T.corrwith(s))
print (corr)
A B C D
A -0.866025 -0.426401 -0.816497 0.000000
B NaN NaN NaN NaN
C 0.993399 0.489116 0.936586 -0.486664
D -0.471405 -0.522233 -0.333333 0.577350
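To see where the NaNs in the question come from, here is a minimal sketch using the question's data. The names shared, all_nan, df_aligned and corr_aligned are only illustrative. It assumes that corrwith aligns the two objects on their index labels, so with no labels in common there is nothing to correlate:

```python
import pandas as pd

d = {'A': [1, 0, 3, 0], 'B': [2, 0, 1, 0], 'C': [0, 0, 8, 0], 'D': [1, 0, 0, 1]}
df = pd.DataFrame(data=d)
df_T = df.T

# Each column of df is indexed 0..3, while df_T is indexed 'A'..'D'.
shared = df['A'].index.intersection(df_T.index)
print(len(shared))  # number of labels the two indexes have in common

# With no common labels, every pairwise correlation comes out as NaN.
corr = df.apply(lambda s: df_T.corrwith(s))
all_nan = bool(corr.isna().all().all())
print(all_nan)

# Relabelling the rows with the column names gives the indexes something to align on.
df_aligned = df.set_axis(df.columns, axis=0)
corr_aligned = df_aligned.apply(lambda s: df_aligned.T.corrwith(s))
print(corr_aligned)
```

Row B stays NaN even after relabelling because that row is constant (all zeros), so its correlation is undefined.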

Related

How to shift a dataframe element-wise to fill NaNs?

I have a DataFrame like this:
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E',np.nan,np.nan,'F']})
a b
0 A E
1 B NaN
2 C NaN
3 D F
I am trying to fill NaN with values of the previous column in the next row and dropping this second row. In other words, I want to combine the two rows with NaNs to form a single row without NaNs like this:
a b
0 A E
1 B C
2 D F
I have tried various flavors of df.fillna(method="<bfill/ffill>") but this didn't give me the expected output.
I have found only one other question about this problem. Note that this DataFrame was built from a list of DataFrames with .concat(), which you may also notice from the indexes. I mention it because this may be easier to do on a single row than on multiple rows.
I have found some suggestions to use shift or combine_first, but none of them worked for me. You may try these too.
I also found a whole article about filling NaN values, but it does not cover a problem like mine.
OK, I misunderstood what you wanted to do the first time. The dummy example was a bit ambiguous.
Here is another:
>>> df = pd.DataFrame({'a': list('ABCD'), 'b': ['E',np.nan,np.nan,'F']})
a b
0 A E
1 B NaN
2 C NaN
3 D F
To my knowledge, pandas has no built-in operation for this, so we will use numpy to do the work.
First convert the dataframe to a numpy array and flatten it to one dimension. Then drop the NaNs using pandas.isna, which works on a wider range of types than numpy.isnan, and reshape the array to the original number of columns before converting it back to a dataframe:
array = df.to_numpy().flatten()
pd.DataFrame(array[~pd.isna(array)].reshape(-1,df.shape[1]), columns=df.columns)
output:
a b
0 A E
1 B C
2 D F
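The same steps can be wrapped in a small helper. This is only a sketch: the name compact_rows and the divisibility check are my additions, not part of the answer above. The check is there because the reshape fails when the number of remaining values is not a multiple of the column count:

```python
import numpy as np
import pandas as pd

def compact_rows(df):
    """Flatten row by row, drop NaNs, and refill rows of the original width."""
    flat = df.to_numpy().ravel()
    kept = flat[~pd.isna(flat)]
    n_cols = df.shape[1]
    if kept.size % n_cols != 0:
        # Hypothetical guard: reshape would raise a less helpful error here.
        raise ValueError("non-NaN values do not fill whole rows")
    return pd.DataFrame(kept.reshape(-1, n_cols), columns=df.columns)

df = pd.DataFrame({'a': list('ABCD'), 'b': ['E', np.nan, np.nan, 'F']})
result = compact_rows(df)
print(result)
```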
It also works for more complex examples, as long as the NaN pattern is the same across the columns that contain NaNs:
In:
a b c d
0 A H A2 H2
1 B NaN B2 NaN
2 C NaN C2 NaN
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
Out:
a b c d
0 A H A2 H2
1 B B2 C C2
2 D I D2 I2
3 E E2 F F2
4 G J G2 J2
In:
a b c
0 A F H
1 B NaN NaN
2 C NaN NaN
3 D NaN NaN
4 E G I
Out:
a b c
0 A F H
1 B C D
2 E G I
If the columns with NaNs do not share the same pattern, for example:
a b c d
0 A H A2 NaN
1 B NaN B2 NaN
2 C NaN C2 H2
3 D I D2 I2
4 E NaN E2 NaN
5 F NaN F2 NaN
6 G J G2 J2
You can apply the operation per group of two columns:
def elementwise_shift(df):
    array = df.to_numpy().flatten()
    return pd.DataFrame(array[~pd.isna(array)].reshape(-1,df.shape[1]), columns=df.columns)
(df.groupby(np.repeat(np.arange(df.shape[1]/2), 2), axis=1)
   .apply(elementwise_shift)
)
output:
a b c d
0 A H A2 B2
1 B C C2 H2
2 D I D2 I2
3 E F E2 F2
4 G J G2 J2
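As far as I know, groupby(..., axis=1) is deprecated in recent pandas releases, so here is a sketch of the same per-pair idea without it. It slices the columns two at a time and glues the pieces back with concat. The names pieces and result are illustrative, and it assumes every pair ends up with the same number of rows:

```python
import numpy as np
import pandas as pd

def elementwise_shift(block):
    # Flatten row by row, drop NaNs, and rebuild rows of the block's width.
    flat = block.to_numpy().ravel()
    kept = flat[~pd.isna(flat)]
    return pd.DataFrame(kept.reshape(-1, block.shape[1]), columns=block.columns)

df = pd.DataFrame({
    'a': list('ABCDEFG'),
    'b': ['H', np.nan, np.nan, 'I', np.nan, np.nan, 'J'],
    'c': ['A2', 'B2', 'C2', 'D2', 'E2', 'F2', 'G2'],
    'd': [np.nan, np.nan, 'H2', 'I2', np.nan, np.nan, 'J2'],
})

# Process columns (a, b) and (c, d) independently, then place them side by side.
pieces = [elementwise_shift(df.iloc[:, start:start + 2])
          for start in range(0, df.shape[1], 2)]
result = pd.concat(pieces, axis=1)
print(result)
```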
You can do this in two steps with a placeholder column. First fill every NaN in column b with the value of a from the next row. Then filter out the rows that are no longer needed. In this example I use ffill with a limit of 1 so that, in each run of NaNs, only the first row survives the filter; there's probably a better method.
import pandas as pd
import numpy as np
df=pd.DataFrame({"a":[1,2,3,3,4],"b":[1,2,np.nan,np.nan,4]})
# Fill all nans:
df['new_b'] = df['b'].fillna(df['a'].shift(-1))
df = df[df['b'].ffill(limit=1).notna()].copy() # .copy() so later edits do not trigger SettingWithCopyWarning
df = df.drop('b', axis=1).rename(columns={'new_b': 'b'})
print(df)
# output:
# a b
# 0 1 1.0
# 1 2 2.0
# 2 3 3.0
# 4 4 4.0

Moving rows from one column to another along with respective values in pandas DataFrame

This is the dataframe I have with three rows and three columns.
a d aa
b e bb
c f cc
What I want is to remove the second column and append its values as new rows in the first column, each paired with its respective value from the third column.
This is the expected result:
a aa
b bb
c cc
d aa
e bb
f cc
Firstly concat the columns:
df1 = pd.concat([df[df.columns[[0,2]]], df[df.columns[[1,2]]]])
Then what you obtain is:
0 1 2
0 a NaN aa
1 b NaN bb
2 c NaN cc
0 NaN d aa
1 NaN e bb
2 NaN f cc
Now, just replace the NaN values in [0] with the corresponding values from [1].
df1[0] = df1[0].fillna(df1[1])
Output:
0 1 2
0 a NaN aa
1 b NaN bb
2 c NaN cc
0 d d aa
1 e e bb
2 f f cc
Here, you may only need [0] and [2] columns.
df1[[0,2]]
Final Output:
0 2
0 a aa
1 b bb
2 c cc
0 d aa
1 e bb
2 f cc
Here are 4 steps: split into 2 dataframes; make the column names the same; concatenate; reindex.
import pandas as pd
df = pd.DataFrame({'col1':['a','b','c'],'col2':['d','e','f'],'col3':['aa','bb','cc']})
df2 = df[['col1','col3']] # split into 2 dataframes
df3 = df[['col2','col3']].copy() # .copy() so renaming the columns does not touch df
df3.columns = df2.columns # make column names the same
df_final = pd.concat([df2, df3]) # concatenate (DataFrame.append was removed in pandas 2.0)
df_final.index = range(len(df_final.index)) # reindex
print(df_final)
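Or as a single expression, renaming the second slice so that the columns line up:
pd.concat([df[df.columns[[0, 2]]], df[df.columns[[1, 2]]].set_axis(df.columns[[0, 2]], axis=1)], ignore_index=True)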
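Another option, not mentioned in the answers above, is DataFrame.melt, which stacks the chosen columns into one while repeating the id column. A minimal sketch, assuming the column names col1, col2 and col3 and the illustrative output name value:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'b', 'c'],
    'col2': ['d', 'e', 'f'],
    'col3': ['aa', 'bb', 'cc'],
})

# melt stacks col1 and col2 into a single 'value' column and repeats col3 for each.
long = df.melt(id_vars='col3', value_vars=['col1', 'col2'], value_name='value')

# Drop the helper 'variable' column and restore the original column order.
result = long[['value', 'col3']]
print(result)
```

melt already returns a fresh 0..n-1 index, so no separate reindexing step is needed.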

replacing NaN values in dataframe with pandas

I want to create a function that takes a dataframe and replaces NaN with the mode in categorical columns, and replaces NaN in numerical columns with the mean of that column. If there is more than one mode in a categorical column, it should use the first mode.
I have managed to do it with following code:
def exercise4(df):
    df1 = df.select_dtypes(np.number)
    df2 = df.select_dtypes(exclude = 'float')
    mode = df2.mode()
    df3 = df1.fillna(df.mean())
    df4 = df2.fillna(mode.iloc[0,:])
    new_df = [df3,df4]
    df5 = pd.concat(new_df,axis=1)
    new_cols = list(df.columns)
    df6 = df5[new_cols]
    return df6
But I am sure there is a far easier way to do this?
You can use:
df = pd.DataFrame({
'A':list('abcdec'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':list('bbcdeb'),
})
df.iloc[[1,3], [1,2,0,4]] = np.nan
print (df)
A B C D E
0 a 4.0 7.0 1 b
1 NaN NaN NaN 3 NaN
2 c 4.0 9.0 5 c
3 NaN NaN NaN 7 NaN
4 e 5.0 2.0 1 e
5 c 4.0 3.0 0 b
The idea is to use DataFrame.select_dtypes to pick the non-numeric columns, take DataFrame.mode, and select its first row by position with DataFrame.iloc. Then compute the means of the numeric columns (pass numeric_only=True, since pandas 2.0 and later no longer exclude non-numeric columns by default). Finally, join both into one Series holding every replacement value with concat (Series.append was removed in pandas 2.0) and pass it to DataFrame.fillna:
modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
means = df.mean(numeric_only=True)
both = pd.concat([modes, means])
print (both)
A c
E b
B 4.25
C 5.25
D 2.83333
dtype: object
df.fillna(both, inplace=True)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Wrapped in a function and called with DataFrame.pipe:
def exercise4(df):
    modes = df.select_dtypes(exclude=np.number).mode().iloc[0]
    means = df.mean(numeric_only=True)
    both = pd.concat([modes, means])
    df.fillna(both, inplace=True)
    return df
df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Another idea is to use DataFrame.apply with the result_type='expand' parameter, testing each column's dtype with pandas.api.types.is_numeric_dtype:
from pandas.api.types import is_numeric_dtype
f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
df.fillna(df.apply(f, result_type='expand'), inplace=True)
print (df)
A B C D E
0 a 4.00 7.00 1 b
1 c 4.25 5.25 3 b
2 c 4.00 9.00 5 c
3 c 4.25 5.25 7 b
4 e 5.00 2.00 1 e
5 c 4.00 3.00 0 b
Passed to function:
from pandas.api.types import is_numeric_dtype
def exercise4(df):
    f = lambda x: x.mean() if is_numeric_dtype(x.dtype) else x.mode()[0]
    df.fillna(df.apply(f, result_type='expand'), inplace=True)
    return df
df = df.pipe(exercise4)
#alternative
#df = exercise4(df)
print (df)
You can use the _get_numeric_data() method to get the numeric columns (and consequently the categorical ones). Note that it is a private pandas method; df.select_dtypes('number') is the public equivalent:
numerical_col = df._get_numeric_data().columns
At this point you only need one line of code using an apply function that runs through the columns:
fixed_df = df.apply(lambda col: col.fillna(col.mean()) if col.name in numerical_col else col.fillna(col.mode()[0]), axis=0)
Actually you have all the ingredients already there! Some of your steps can be chained, though, which makes some of the others obsolete.
Looking at these two lines for example:
mode = df2.mode()
df4 = df2.fillna(mode.iloc[0,:])
You could just replace them with df4 = df2.fillna(df2.mode().iloc[0,:]). Then, instead of constantly reassigning new (sub)dataframes to variables, altering them and concatenating them, you can write the filled columns straight back into the dataframe in question. (Calling fillna(..., inplace=True) on the result of select_dtypes would not help, because select_dtypes returns a new dataframe.) Lastly, exclude='float' might work in your particular (example) case, but what if there are even more datatypes in the dataframe? A string column maybe?
My suggestion:
def mean_mode(df):
    num = df.select_dtypes(np.number).columns
    cat = df.select_dtypes(exclude=np.number).columns
    df[num] = df[num].fillna(df[num].mean())
    df[cat] = df[cat].fillna(df[cat].mode().iloc[0])
    return df
You can proceed as follows:
df = df.apply(lambda x: x.fillna(x.mean()) if pd.api.types.is_numeric_dtype(x) else x.fillna(x.mode()[0]))
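One edge case the snippets above do not cover is a non-numeric column that is entirely NaN: mode() then returns an empty Series and taking its first element fails. Here is a sketch of a non-mutating variant with a guard for that case. The helper names fill_value and fill_missing are hypothetical, as is the sample data:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def fill_value(col):
    """Mean for numeric columns, first mode otherwise (NaN if there is no mode)."""
    if is_numeric_dtype(col):
        return col.mean()
    modes = col.mode()
    return modes.iloc[0] if not modes.empty else np.nan

def fill_missing(df):
    fills = {name: fill_value(df[name]) for name in df.columns}
    # Skip columns with no usable replacement (for example, all-NaN columns).
    fills = {name: value for name, value in fills.items() if pd.notna(value)}
    return df.fillna(value=fills)  # returns a new dataframe, df is left untouched

df = pd.DataFrame({
    'A': ['a', np.nan, 'c', np.nan, 'e', 'c'],
    'B': [4, np.nan, 4, np.nan, 5, 4],
    'E': pd.Series([np.nan] * 6, dtype=object),
})
result = fill_missing(df)
print(result)
```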

combining columns in pandas dataframe

I have the following dataframe:
df = pd.DataFrame({
'user_a':['A','B','C',np.nan],
'user_b':['A','B',np.nan,'D']
})
I would like to create a new column called user and have the resulting dataframe:
What's the best way to do this for many users?
Forward fill the missing values along each row and then select the last column with iloc:
df = pd.DataFrame({
'user_a':['A','B','C',np.nan,np.nan],
'user_b':['A','B',np.nan,'D',np.nan]
})
df['user'] = df.ffill(axis=1).iloc[:, -1]
print (df)
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
4 NaN NaN NaN
Use the .apply method (this assumes every row has at least one non-NaN value, otherwise the [0] lookup fails):
In [24]: df = pd.DataFrame({'user_a':['A','B','C',np.nan],'user_b':['A','B',np.nan,'D']})
In [25]: df
Out[25]:
user_a user_b
0 A A
1 B B
2 C NaN
3 NaN D
In [26]: df['user'] = df.apply(lambda x: [i for i in x if not pd.isna(i)][0], axis=1)
In [27]: df
Out[27]:
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
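Note that forward filling and taking the last column picks the last non-NaN value in each row, while the .apply version picks the first. If the first one is wanted without a Python-level loop, backward filling and taking the first column should do it. A small sketch, where the extra X/Y row is my own addition to show the difference:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan, np.nan, 'X'],
    'user_b': ['A', 'B', np.nan, 'D', np.nan, 'Y'],
})
user_cols = ['user_a', 'user_b']

# bfill along the row pulls the first non-NaN value into the first column.
first_seen = df[user_cols].bfill(axis=1).iloc[:, 0]
# ffill along the row pushes the last non-NaN value into the last column.
last_seen = df[user_cols].ffill(axis=1).iloc[:, -1]

print(pd.DataFrame({'first': first_seen, 'last': last_seen}))
```

Rows where every user column is NaN stay NaN in both variants.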

Custom computing function for dataframes in Pandas

I have two dataframes:
df0
a b
c 0 6
d 0 9
df1
a b
c 3 2
d 0 0
I have a custom divide function:
def cdiv(x,y):
if x == 0:
return 0
return x / y
Result I expect:
a b
c 0 3
d 0 Inf
How can I apply this function to these two dataframes?
You need div with mask:
df = df0.div(df1).mask(df0 == 0, 0)
print (df)
a b
c 0.0 3.000000
d 0.0 inf
Or, as ayhan commented, this may work if the DataFrames contain no NaN values:
(df0/df1).fillna(0)
Another solution with numpy.where:
df = pd.DataFrame(np.where(df0 == 0, 0, df0 / df1), index=df0.index, columns=df0.columns)
print (df)
a b
c 0.0 3.000000
d 0.0 inf
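For reference, a self-contained version of the div/mask idea using the question's names df0 and df1. The constructor calls are my reconstruction of the two tables shown in the question:

```python
import numpy as np
import pandas as pd

df0 = pd.DataFrame({'a': [0, 0], 'b': [6, 9]}, index=['c', 'd'])
df1 = pd.DataFrame({'a': [3, 0], 'b': [2, 0]}, index=['c', 'd'])

# Plain element-wise division gives NaN for 0/0 and inf for 9/0.
ratio = df0.div(df1)

# Wherever the numerator is 0, force the result to 0, as cdiv does.
result = ratio.mask(df0 == 0, 0)
print(result)
```

The mask is applied on the numerator rather than on the NaNs, so a NaN that was already present in the inputs would not be silently turned into 0.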
