Let's say I have a dataframe of 10 columns.
Now I want to quickly calculate the relation between each column and the column that follows it:
so the Pearson r of columns 1 and 2, of columns 2 and 3, of columns 3 and 4, and so on.
Is there a quick way for me to do that?
Thank you!
You can use pandas.DataFrame.corr for the Pearson correlation and numpy.diag to extract the values of interest. Here is a toy example with 5 columns (for simplicity):
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 10, (3, 5)))
pcorr = df.corr()    # full 5x5 Pearson correlation matrix
np.diag(pcorr, 1)    # first superdiagonal: corr of each column with the next
and you get:
df:
   0  1  2  3  4
0  7  9  0  0  9
1  9  2  9  9  0
2  2  8  5  9  2
pcorr:
          0         1         2         3         4
0  1.000000 -0.622693  0.215274 -0.240192  0.029344
1 -0.622693  1.000000 -0.898170 -0.609994  0.763857
2  0.215274 -0.898170  1.000000  0.896258 -0.969816
3 -0.240192 -0.609994  0.896258  1.000000 -0.977356
4  0.029344  0.763857 -0.969816 -0.977356  1.000000
And your values of interest:
array([-0.62269252, -0.89817029, 0.89625816, -0.97735555])
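Note that df.corr() computes every pairwise correlation even though only the adjacent pairs are needed. If the frame is wide, a direct sketch that only touches consecutive pairs (using Series.corr, which defaults to Pearson) may be cheaper:

# correlate each column only with the one that follows it
adjacent_r = [df[c1].corr(df[c2])
              for c1, c2 in zip(df.columns, df.columns[1:])]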
I am looking to update the values in a pandas series that satisfy a certain condition and take the corresponding value from another column.
Specifically, I want to look at the subcluster column and if the value equals 1, I want the record to update to the corresponding value in the cluster column.
For example:
Cluster  Subcluster
3        1
3        2
3        1
3        4
4        1
4        2

should result in this:

Cluster  Subcluster
3        3
3        2
3        3
3        4
4        4
4        2
I've been trying to use apply and a lambda function, but can't seem to get it to work properly. Any advice would be greatly appreciated. Thanks!
You can use np.where:
import numpy as np

# where Subcluster equals 1, take the value from Cluster; otherwise keep it as is
df['Subcluster'] = np.where(df['Subcluster'].eq(1), df['Cluster'], df['Subcluster'])
Output:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
In your case, try mask:
df.Subcluster.mask(lambda x: x == 1, df.Cluster, inplace=True)
df
Out[12]:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
Or
df.loc[df.Subcluster == 1, 'Subcluster'] = df['Cluster']
Really, all you need here is to use .loc with a boolean mask (you don't actually need to create the mask as a separate variable; you could apply it inline).
df = pd.DataFrame({'cluster': np.random.randint(0, 10, 10),
                   'subcluster': np.random.randint(0, 3, 10)})
df.to_clipboard(sep=',')  # copied out to show the frame below
df at this point:
,cluster,subcluster
0,8,0
1,5,2
2,6,2
3,6,1
4,8,0
5,1,1
6,0,0
7,6,0
8,1,0
9,3,1
Create and apply the mask (you could do this all in one line; see the one-liner after the output):
mask = df.subcluster == 1
df.loc[mask, 'subcluster'] = df.loc[mask, 'cluster']
df.to_clipboard(sep=',')
final output:
,cluster,subcluster
0,8,0
1,5,2
2,6,2
3,6,6
4,8,0
5,1,1
6,0,0
7,6,0
8,1,0
9,3,3
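For completeness, here is the one-line form referred to above (a minimal sketch; the right-hand side is aligned on the index, so the full cluster column can be assigned directly):

# build the boolean mask inline instead of storing it in a variable
df.loc[df.subcluster == 1, 'subcluster'] = df['cluster']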
Here's the lambda you couldn't get working. With axis=1, x corresponds to a row, so you can use it to refer to specific columns of that row.
df['Subcluster'] = df.apply(lambda x: x['Cluster'] if x['Subcluster'] == 1 else x['Subcluster'], axis=1)
And the output:
Cluster Subcluster
0 3 3
1 3 2
2 3 3
3 3 4
4 4 4
5 4 2
Suppose I have a pandas DataFrame named df with the following structure:
         Column 1  Column 2  .........  Column 104
Row 1    0.01      0.55                 3
Row 2    0.03      0.14                 1
...
Row 100  0.75      0.56                 0
What I am trying to accomplish is that for all rows which match the condition given below, I need to generate 100 more rows, with a random value between 0 and 0.05 added to each row:
is_less = df.iloc[:, -1] > 1
df_try = df[is_less]
df = df.append([df_try] * 100, ignore_index=True)
The problem is that I can simply duplicate the rows in df_try to generate 100 more rows for each case, but I want to add a random value to each row as well, such that each row is different from the others but very similar.
import random
df = df.append([df_try + random.uniform(0, 0.05)] * 100, ignore_index=True)
What this does is to simply add the fixed random value to df_try's 100 new rows, but not a unique random value to each row. I know that this is because the above syntax does not iterate over df_try, resulting in the fixed random value being added, but is there a suitable way to add the random values iteratively over the data frame in this case?
One idea is to create a 2D array of the same size as the newly appended rows, and add it to the joined copies built with concat:
import numpy as np

N = 10
is_less = df.iloc[:, -1] > 1
df_try = df[is_less]
# one row of noise for every appended row
arr = np.random.uniform(0, 0.05, size=(N * len(df_try), len(df.columns)))
df = df.append(pd.concat([df_try] * N) + arr, ignore_index=True)
print(df)
Column 1 Column 2 Column 104
0 0.010000 0.550000 3.000000
1 0.030000 0.140000 1.000000
2 0.750000 0.560000 0.000000
3 0.024738 0.561647 3.045146
4 0.035315 0.584161 3.008656
5 0.022386 0.563025 3.033091
6 0.039175 0.588785 3.004649
7 0.049465 0.594903 3.003303
8 0.027366 0.580478 3.041745
9 0.044721 0.599853 3.001736
10 0.052849 0.589775 3.042434
11 0.033957 0.582610 3.045215
12 0.044349 0.582218 3.027665
Your solution can be changed to a list comprehension if you want to add a scalar to each copy of df_try (note that every row within one copy then gets the same offset, unlike the 2D-array version above):
N = 10
is_less = df.iloc[:, -1] > 1
df_try = df[is_less]
df = df.append([df_try + random.uniform(0, 0.05) for _ in range(N)], ignore_index=True)
print(df)
Column 1 Column 2 Column 104
0 0.010000 0.550000 3.000000
1 0.030000 0.140000 1.000000
2 0.750000 0.560000 0.000000
3 0.036756 0.576756 3.026756
4 0.039357 0.579357 3.029357
5 0.048746 0.588746 3.038746
6 0.040197 0.580197 3.030197
7 0.011045 0.551045 3.001045
8 0.013942 0.553942 3.003942
9 0.054658 0.594658 3.044658
10 0.025909 0.565909 3.015909
11 0.012093 0.552093 3.002093
12 0.058463 0.598463 3.048463
You can combine the copies first and create a single array containing all the random values, add them together, and then append the result to the original:
import numpy as np
import pandas as pd

n_copies = 2
df = pd.DataFrame(np.c_[np.arange(6), np.random.randint(1, 3, size=6)])
subset = df[df.iloc[:, -1] > 1]
# one random value per appended row, added along the row axis
extra = pd.concat([subset] * n_copies).add(np.random.uniform(0, 0.05, len(subset) * n_copies), axis='rows')
result = df.append(extra, ignore_index=True)
print(result)
Output:
0 1
0 0.000000 2.000000
1 1.000000 2.000000
2 2.000000 1.000000
3 3.000000 2.000000
4 4.000000 1.000000
5 5.000000 2.000000
6 0.007723 2.007723
7 1.005718 2.005718
8 3.003063 2.003063
9 5.005238 2.005238
10 0.006509 2.006509
11 1.034742 2.034742
12 3.022345 2.022345
13 5.040911 2.040911
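As an aside, DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on recent versions the final step in either answer can be written with pd.concat instead (a sketch against the example above):

# same append, expressed with concat for pandas >= 2.0
result = pd.concat([df, extra], ignore_index=True)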
Say I have a dataframe df and group it by a few columns into dfg, taking the median of one of its columns. How could I then take those median values and expand them out, so that they sit in a new column of the original df, associated with the respective conditions? This will mean there are duplicates, but I will next be using this column for a subsequent calculation, and having the values in a column will make this possible.
Example data:
import pandas as pd
import numpy as np

data = {'idx':        [1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
        'condition1': [1,1,2,2,3,3,4,4,1,1,2,2,3,3,4,4],
        'condition2': [1,2,1,2,1,2,1,2,1,2,1,2,1,2,1,2],
        'values':     np.random.normal(0, 1, 16)}
df = pd.DataFrame(data)
dfg = df.groupby(['idx', 'condition2'], as_index=False)['values'].median()
Example of the desired result (note the duplicates corresponding to the correct conditions):
idx condition1 condition2 values medians
0 1 1 1 0.35031 0.656355
1 1 1 2 -0.291736 -0.024304
2 1 2 1 1.593545 0.656355
3 1 2 2 -1.275154 -0.024304
4 1 3 1 0.075259 0.656355
5 1 3 2 1.054481 -0.024304
6 1 4 1 0.9624 0.656355
7 1 4 2 0.243128 -0.024304
8 2 1 1 1.717391 1.155406
9 2 1 2 0.788847 1.006583
10 2 2 1 1.145891 1.155406
11 2 2 2 -0.492063 1.006583
12 2 3 1 -0.157029 1.155406
13 2 3 2 1.224319 1.006583
14 2 4 1 1.164921 1.155406
15 2 4 2 2.042239 1.006583
I believe you need GroupBy.transform with 'median' for the new column:
df['medians'] = df.groupby(['idx', 'condition2'])['values'].transform('median')
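Unlike the .median() aggregation that produced dfg, transform returns a result aligned with the original index, so it can be assigned straight back as a column. For comparison, a merge-based sketch that derives the same values from the question's dfg (apply it to a df that does not already have the medians column, or merge will suffix the duplicate name):

# alternative route: rename the aggregated column and merge it back on the group keys
out = df.merge(dfg.rename(columns={'values': 'medians'}),
               on=['idx', 'condition2'], how='left')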
I want to update values in one pandas dataframe based on the values in another dataframe, but I want to specify which column to update by (i.e., which column should be the "key" for looking up matching rows). Right now it seems to treat the first column as the key. Is there a way to pass it a specific column name?
Example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame()
df_a['x'] = range(5)
df_a['y'] = range(4, -1, -1)
df_a['z'] = np.random.rand(5)
df_b = pd.DataFrame()
df_b['x'] = range(5)
df_b['y'] = range(5)
df_b['z'] = range(5)
print('df_b:')
print(df_b.head())
print('\nold df_a:')
print(df_a.head(10))
df_a.update(df_b)
print('\nnew df_a:')
print(df_a.head())
Out:
df_b:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
old df_a:
x y z
0 0 4 0.333648
1 1 3 0.683656
2 2 2 0.605688
3 3 1 0.816556
4 4 0 0.360798
new df_a:
x y z
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
4 4 4 4
You see, what it did is replace y and z in df_a with the respective columns in df_b, based on matches of x between df_a and df_b.
What if I wanted to keep y the same? What if I want it to replace based on y and not x? Also, what if there are multiple columns on which I'd like to do the replacement (in the real problem, I have to update a dataset with a new dataset, where there is a match in two or three columns between the two, taking the values from a fourth column)?
Basically, I want to do some sort of a merge-replace action, where I specify which columns I am merging/replacing on and which column should be replaced.
Hope this makes things clearer. If this cannot be accomplished with update in pandas, I am wondering if there is another way (short of writing a separate function with for loops for it).
This is my current solution, but it seems somewhat inelegant:
df_merge = df_a.merge(df_b, on='y', how='left', suffixes=('_a', '_b'))
print(df_merge.head())
df_merge['x'] = df_merge.x_b
df_merge['z'] = df_merge.z_b
df_update = df_a.copy()
df_update.update(df_merge)
print(df_update)
Out:
x_a y z_a x_b z_b
0 0 0 0.505949 0 0
1 1 1 0.231265 1 1
2 2 2 0.241109 2 2
3 3 3 0.579765 NaN NaN
4 4 4 0.172409 NaN NaN
x y z
0 0 0 0.000000
1 1 1 1.000000
2 2 2 2.000000
3 3 3 0.579765
4 4 4 0.172409
5 5 5 0.893562
6 6 6 0.638034
7 7 7 0.940911
8 8 8 0.998453
9 9 9 0.965866
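One way to get this merge-replace with an explicit key is to exploit the fact that update always matches on the index: temporarily make the key column(s) the index, update, then reset. A minimal sketch on the example above, assuming y uniquely identifies rows:

# update aligns on the index, so make the chosen key ('y') the index first
df_a = df_a.set_index('y')
df_a.update(df_b.set_index('y'))   # only rows with matching y values are overwritten
df_a = df_a.reset_index()          # restore y as a column (it moves to the front)

For a multi-column key, the same pattern works with set_index(['col1', 'col2']).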
I'm trying to create a violin plot in seaborn. The input is a pandas DataFrame, and it looks like in order to separate the data along the x axis I need to differentiate on a single column. I currently have a DataFrame that has floating point values for several sensors:
>>> df.columns
Index(['SensorA', 'SensorB', 'SensorC', 'SensorD', 'group_id'], dtype='object')
That is, each Sensor[A-Z] column contains a bunch of numbers:
>>> df['SensorA'].head()
0 0.072706
1 0.072698
2 0.072701
3 0.072303
4 0.071951
Name: SensorA, dtype: float64
And for this problem, I'm only interested in 2 groups:
>>> df['group_id'].unique()
array(['1', '2'], dtype=object)
I want each Sensor to be a separate violin along the x axis.
I think this means I need to convert this into something of the form:
>>> df.columns
Index(['Value', 'Sensor', 'group_id'], dtype='object')
where the Sensor column in the new DataFrame contains the text "SensorA", "SensorB", etc., the Value column contains the values that were originally in each Sensor[A-Z] column, and the group information is preserved.
I could then create a violinplot using the following command:
ax = sns.violinplot(x="Sensor", y="Value", hue="group_id", data=df)
I'm thinking I kind of need to do a reverse pivot. Is there an easy way of doing this?
Use pandas' melt function:
import pandas as pd

df = pd.DataFrame({'SensorA': [1,3,4,5,6], 'SensorB': [5,2,3,6,7],
                   'SensorC': [7,4,8,1,10], 'group_id': [1,2,1,1,2]})
df = pd.melt(df, id_vars='group_id', var_name='Sensor')
print(df)
gives
group_id Sensor value
0 1 SensorA 1
1 2 SensorA 3
2 1 SensorA 4
3 1 SensorA 5
4 2 SensorA 6
5 1 SensorB 5
6 2 SensorB 2
7 1 SensorB 3
8 1 SensorB 6
9 2 SensorB 7
10 1 SensorC 7
11 2 SensorC 4
12 1 SensorC 8
13 1 SensorC 1
14 2 SensorC 10
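By default melt names the value column 'value' (lower-case); to get 'Value' as in the question, pass value_name='Value'. The plot call from the question then works directly. A sketch, where wide_df stands for the original wide sensor frame and seaborn is assumed to be installed:

import seaborn as sns

# melt the original wide frame, naming the value column 'Value' this time
long_df = pd.melt(wide_df, id_vars='group_id', var_name='Sensor', value_name='Value')
ax = sns.violinplot(x='Sensor', y='Value', hue='group_id', data=long_df)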
Maybe it's not the best way, but it works (AFAIU):
import pandas as pd

df = pd.DataFrame({'SensorA': [1,3,4,5,6], 'SensorB': [5,2,3,6,7],
                   'SensorC': [7,4,8,1,10], 'group_id': [1,2,1,1,2]})
groupedID = df.groupby('group_id')
df1 = pd.DataFrame()
for groupNum in groupedID.groups.keys():
    # stack the sensor columns of this group into a Series with a (row, sensor) MultiIndex
    dfSensors = groupedID.get_group(groupNum).filter(regex='Sen').stack()
    _, sensorNames = zip(*dfSensors.index)
    df2 = pd.DataFrame({'Sensor': sensorNames, 'Value': dfSensors.values, 'group_id': groupNum})
    df1 = pd.concat([df1, df2])
print(df1)
Output:
Sensor Value group_id
0 SensorA 1 1
1 SensorB 5 1
2 SensorC 7 1
3 SensorA 4 1
4 SensorB 3 1
5 SensorC 8 1
6 SensorA 5 1
7 SensorB 6 1
8 SensorC 1 1
0 SensorA 3 2
1 SensorB 2 2
2 SensorC 4 2
3 SensorA 6 2
4 SensorB 7 2
5 SensorC 10 2