Creating correlation matrix for multiple combinations of variables - python

I have a CSV file with 10 columns. I can use pandas to import the data into a dataframe and use the corr() function to get a correlation matrix, which I plot as a heatmap. What I want to achieve next is for the code to loop through the dataframe and find high or low correlations between combinations of columns.
For example, the simple correlation matrix looks at:
A:A, A:B, A:C, A:D etc
But I want the code to combine columns, in every conceivable way, such as:
AB:A, AB:B, AB:C, AB:D etc
ABC:A, ABC:B, ABC:D etc
And if there are any noticeable correlations between certain combinations, to highlight those.
Is this possible at all? Or are there proprietary applications that can do this?
Thanks

I assume that by "combination" you mean linear combination. You can loop over the columns (not the most elegant way) and use sklearn's linear_model:
import pandas as pd
import numpy as np
from sklearn import linear_model

df = pd.DataFrame(np.random.random([10, 10]), columns=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'])

for i, col1 in enumerate(df):
    if i > 0:
        X = df.iloc[:, 0:i]            # predictors: the first i columns
        for j, col2 in enumerate(df):
            if j >= i:
                y = df[[col2]]         # response: each column from position i onwards
                regr = linear_model.LinearRegression()
                regr.fit(X, y)
                score = regr.score(X, y)   # R^2 of the fit
                print(f'X: {X.columns} y: {y.columns} score: {score}')
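If you literally want every conceivable grouping of columns (AB, AC, ABC, ...) rather than just the leading ones, itertools.combinations can drive the same idea. A rough sketch, reusing df and linear_model from above; the 0.8 cut-off is an arbitrary placeholder for "noticeable", not a recommendation:
from itertools import combinations

for cols in combinations(df.columns, 2):      # every 2-column group: ('A','B'), ('A','C'), ...
    X = df[list(cols)]
    for target in df.columns:
        if target in cols:
            continue
        regr = linear_model.LinearRegression()
        regr.fit(X, df[[target]])
        score = regr.score(X, df[[target]])
        if score > 0.8:                       # flag only "noticeable" relationships
            print(f"{''.join(cols)}:{target} R^2 = {score:.3f}")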


Creating Python Function to Iterate over List/DataFrame (VIF)

I have a dataset and I want to select the subset of variables with VIF (Variance Inflation Factor) smaller than a certain threshold. My idea was to calculate the VIF for every variable, take out the variable with the highest value (if it is higher than a certain threshold), recalculate the VIF for every remaining variable, and repeat the process until there is no VIF higher than the threshold.
There is nothing novel in this approach, but I couldn't get past a certain point to make a function that automates this process in Python.
x is the dataset with the target variable dropped
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
x_vif = add_constant(x)
vif = pd.DataFrame([variance_inflation_factor(x_vif.values, i) for i in range(x_vif.shape[1])], index=x_vif.columns)
The vif could also be a list. So, is there any package that does this automatically, or could you give me an idea of how to create this function?
I found a R library (thinXwithVIF) that could do that automatically, but I couldn't make rpy2 work with the python version that I need to use.
Maybe what would make sense is to remove the variable with the highest VIF in each round, subset the dataframe, and stop when all variables are lower than your threshold. I don't think VIF is the be-all and end-all, and you really have to look at the data to decide what to include, etc.
import statsmodels.api as sm
import pandas as pd
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

data = sm.datasets.get_rdataset('mtcars')
x_vif = data.data.iloc[:, 1:]   # predictors (everything except mpg)
y = data.data['mpg']
thres = 10

while True:
    Cols = range(x_vif.shape[1])
    vif = np.array([variance_inflation_factor(x_vif.values, i) for i in Cols])
    if all(vif < thres):
        break
    else:
        # drop the column with the highest VIF and recompute
        Cols = np.delete(Cols, np.argmax(vif))
        x_vif = x_vif.iloc[:, Cols]
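If you want this wrapped up as the function the question asks for, the loop above can be packaged directly. A minimal sketch; the name drop_high_vif is just illustrative, not from an existing package:
def drop_high_vif(x, thres=10):
    """Repeatedly drop the column with the largest VIF until all VIFs < thres."""
    while True:
        vif = np.array([variance_inflation_factor(x.values, i) for i in range(x.shape[1])])
        if all(vif < thres):
            return x
        # remove the worst offender and recompute on the remaining columns
        x = x.drop(columns=[x.columns[np.argmax(vif)]])

x_reduced = drop_high_vif(x_vif, thres=10)
print(x_reduced.columns.tolist())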

Numpy Array for SVM model rather than a DataFrame

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
# Read the data.
data = np.asarray(pd.read_csv('data.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y.
X = data[:,0:2]
y = data[:,2]
# TODO: Create the model and assign it to the variable model.
# Find the right parameters for this model to achieve 100% accuracy on the dataset.
model = SVC()
model.fit(X,y)
2 Questions:
The data goes into a NumPy array from a pandas DataFrame (via pd.read_csv).
Is that better? Is there a good reason for that? Why not stay with the DataFrame?
I do not understand this notation:
X = data[:,0:2]
y = data[:,2]
What does it do?
Thank you.
The data consists of a CSV file with many rows like this:
0.28917,0.65643,0
It has three columns: the first two contain the coordinates of the points, and the third contains the label.
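On the slicing question: data[:, 0:2] selects all rows and the first two columns, and data[:, 2] selects all rows of the third column. A tiny illustration (the second row is made up, just to match the CSV format described above):
import numpy as np

data = np.array([[0.28917, 0.65643, 0],
                 [0.10000, 0.20000, 1]])
X = data[:, 0:2]   # all rows, columns 0 and 1 -> the two coordinates
y = data[:, 2]     # all rows, column 2        -> the labels
print(X)           # [[0.28917 0.65643]
                   #  [0.1     0.2    ]]
print(y)           # [0. 1.]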

How to perform regression in loop using multiple columns from different CSV files

I have the same number of columns and rows stored in two CSV files. I want to compute the correlation R and R-squared between matching columns of the two CSVs (c1 vs c1, c2 vs c2, ...). Here is my code, but it does not perform the task:
import numpy as np
from sklearn.metrics import r2_score
from scipy import stats
import statsmodels.api as sm
import math
df1 = np.loadtxt('data1_1981_2007_DD.csv', delimiter=',')
df2 = np.loadtxt('data2_1981_2007_DD.csv', delimiter=',')
correlation_r2 = r2_score(df1, df2)
The shape of df1 and df2 are (9861, 10).
After running the code I am getting only a single value. I want to get all 10 values for r2. Can somebody help with this?
You are passing the full (9861, 10) arrays at once, so r2_score aggregates over all columns and returns a single value. To get one R² for each of the 10 column pairs, loop over the columns instead:
correlation_r2 = [r2_score(df1[:, i], df2[:, i]) for i in range(df1.shape[1])]
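If you also need Pearson's R per column (not just R²), scipy.stats.pearsonr fits into the same per-column loop. A sketch assuming df1 and df2 are the (9861, 10) arrays loaded above:
from scipy import stats
from sklearn.metrics import r2_score

for i in range(df1.shape[1]):
    r, p = stats.pearsonr(df1[:, i], df2[:, i])   # Pearson R and its p-value
    r2 = r2_score(df1[:, i], df2[:, i])           # R-squared, treating df1 as the "true" values
    print(f'column {i}: R = {r:.3f}, R2 = {r2:.3f}, p = {p:.3g}')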

PySpark: Convert RDD to column in dataframe

I have a Spark dataframe with which I am calculating the Euclidean distance between each row and a given set of coordinates. I am recreating a structurally similar dataframe 'df_vector' here to explain better.
from pyspark.ml.feature import VectorAssembler
arr = [[1,2,3], [4,5,6]]
df_example = spark.createDataFrame(arr, ['A','B','C'])
assembler = VectorAssembler(inputCols=[x for x in df_example.columns],outputCol='features')
df_vector = assembler.transform(df_example).select('features')
>>> df_vector.show()
+-------------+
| features|
+-------------+
|[1.0,2.0,3.0]|
|[4.0,5.0,6.0]|
+-------------+
>>> df_vector.dtypes
[('features', 'vector')]
As you can see the features column is a vector. In practice, I get this vector column as the output of a StandardScaler. Anyway, since I need to calculate Euclidean distance, I do the following
rdd = df_vector.select('features').rdd.map(lambda r: np.linalg.norm(r-b))
where
b = np.asarray([0.5,1.0,1.5])
I have all the calculations I need but I need this rdd as a column in df_vector. How do I go about it?
Instead of creating a new RDD, you could use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

norm_udf = udf(lambda r: float(np.linalg.norm(r - b)), FloatType())
df_vector.withColumn("norm", norm_udf(df_vector["features"]))
Make sure numpy is available on the worker nodes.
One way to tackle performance issues might be to use mapPartitions. The idea would be, at a partition level, to convert features to an array and then calculate the norm on the whole array (thus implicitly using numpy vectorisation). Then do some housekeeping to get the form you want. For large datasets this might improve performance:
Here is the function which calculates the norm at partition level:
from pyspark.sql import Row

def getnorm(vectors):
    # convert vectors into a numpy array
    vec_array = np.vstack([v['features'] for v in vectors])
    # calculate the norm
    norm = np.linalg.norm(vec_array - b, axis=1)
    # tidy up to get norm as a column
    output = [Row(features=x, norm=y) for x, y in zip(vec_array.tolist(), norm.tolist())]
    return output
Applying this using mapPartitions gives an RDD of Rows which can then be converted to a DataFrame:
df_vector.rdd.mapPartitions(getnorm).toDF()

Adjusted R square for each predictor variable in python

I have a pandas dataframe that contains several columns. I need to perform a multivariate linear regression. Before doing that, I would like to analyse the R, R2, adjusted R2 and p-value of each independent variable with respect to the dependent variable.
For R and R2 I have no problem, since I can calculate the correlation matrix, select only the dependent variable, and look at the R coefficient between it and all the independent variables. Then I can square these values to obtain R2.
My problem is how to do the same for the adjusted R2 and the p-value.
In the end, what I want to obtain is something like this:
Variable R R2 ADJUSTED_R2 p_value
A 0.4193 0.1758 ...
B 0.2620 0.0686 ...
C 0.2535 0.0643 ...
All the values are with respect to the dependent variable let's say Y.
The following will not give you ALL the answers, but it WILL get you going using python, pandas and statsmodels for regression analyses.
Given a dataframe like this...
# Imports
import pandas as pd
import numpy as np
import statsmodels.api as sm
import itertools

# A dataframe with random numbers
np.random.seed(123)
rows = 12
listVars = ['y', 'x1', 'x2', 'x3']
rng = pd.date_range('1/1/2017', periods=rows, freq='D')
df_1 = pd.DataFrame(np.random.randint(100, 150, size=(rows, len(listVars))), columns=listVars)
df_1 = df_1.set_index(rng)
print(df_1)
...you can get any regression results using the statsmodels library and altering the result = model.rsquared part in the snippet below:
x = df_1['x1']
x = sm.add_constant(x)
model = sm.OLS(df_1['y'], x).fit()
result = model.rsquared
print(result)
Now you have r-squared. Use model.pvalues for the p-values, and use dir(model) to have a closer look at the other results the fitted model exposes; there is much more available than just these two.
Now, this should get you going to obtain your desired results.
To get desired results for ALL combinations of variables / columns, the question and answer here should get you very far.
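As a sketch of how those pieces could be combined into the per-variable table from the question (one simple OLS per predictor; rsquared, rsquared_adj and pvalues are standard attributes of the fitted statsmodels results, and the sign trick recovers Pearson's R from R² for a single-predictor fit):
rows_out = []
for col in ['x1', 'x2', 'x3']:                    # the independent variables in df_1
    x = sm.add_constant(df_1[col])
    model = sm.OLS(df_1['y'], x).fit()
    rows_out.append({'Variable': col,
                     'R': np.sign(model.params[col]) * np.sqrt(model.rsquared),
                     'R2': model.rsquared,
                     'ADJUSTED_R2': model.rsquared_adj,
                     'p_value': model.pvalues[col]})
print(pd.DataFrame(rows_out))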
Edit: You can have a closer look at some common regression results using model.summary(). Using that together with dir(model), you can see that not ALL regression results are available in the same way that the p-values are via model.pvalues. To get the Durbin-Watson statistic, for example, you'll have to use durbinwatson = sm.stats.stattools.durbin_watson(model.resid, axis=0).
This post has got more information on the issue.
