Compute correlation between features and target variable - python

What is the best way to compute the correlation between my features and the target variable? My dataframe has 1000 rows and 40,000 columns.
Example:
df = pd.DataFrame([[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]], columns=['Feature1', 'Feature2', 'Feature3', 'Target'])
This code works fine, but it is too slow on my dataframe. I need only the last column of the correlation matrix: the correlation with the target (not the pairwise feature correlations).
corr_matrix=df.corr()
corr_matrix["Target"].sort_values(ascending=False)
The np.corrcoef() function works with arrays, but can we exclude the pairwise feature correlations?

You could use pandas corr on each column:
df.drop("Target", axis=1).apply(lambda x: x.corr(df.Target))

You can simply use DataFrame.corrwith(), which computes the correlation of each column with the Series you pass in:
df.corrwith(df["Target"])

You can use scipy.stats.pearsonr on each of the feature columns like so:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
# example data
df = pd.DataFrame([[1, 2, 4, 6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2, 10]],
                  columns=['Feature1', 'Feature2', 'Feature3', 'Target'])
# Only compute Pearson product-moment correlations between feature
# columns and the target column
target_col_name = 'Target'
feature_target_corr = {}
for col in df:
    if target_col_name != col:
        feature_target_corr[col + '_' + target_col_name] = \
            pearsonr(df[col], df[target_col_name])[0]
print("Feature-Target Correlations")
print(feature_target_corr)

df = pd.DataFrame([[1, 2, 4 ,6], [1, 3, 4, 7], [4, 6, 8, 12], [5, 3, 2 ,10]], columns=['Feature1', 'Feature2','Feature3','Target'])
For correlation between your target variable and all other features:
df.corr()['Target']
This works in my case. Let me know if any corrections or updates are needed.
Note that to get conclusive results, your number of samples should be at least 10 times your number of features.
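
If even corrwith is too slow at 40,000 columns, here is a minimal vectorized sketch (my own, assuming all feature columns are numeric with nonzero variance) that computes every feature-target Pearson correlation in one pass, without building the full correlation matrix:
import numpy as np
import pandas as pd

X = df.drop(columns="Target").to_numpy(dtype=float)
y = df["Target"].to_numpy(dtype=float)

Xc = X - X.mean(axis=0)   # center each feature column
yc = y - y.mean()         # center the target

# Pearson r per column: sum(xc * yc) / sqrt(sum(xc^2) * sum(yc^2))
r = Xc.T @ yc / (np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()))
corrs = pd.Series(r, index=df.columns.drop("Target")).sort_values(ascending=False)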

Related

Is it possible to pass more than one argument to pandas converters (read_csv)?

I have a CSV file that I need to read as a DataFrame, but I'd like to apply a transformation in one of the columns using converters from pandas.read_csv.
This is what's in my file:
matrix size
"(1, 2, 3, 4)" 2
"(1, 2, 3, 4, 5, 6, 7, 8, 9)" 3
The strings in matrix need to be converted to matrices according to the corresponding size. (The actual process is more complex and the values in the data actually correspond to the lower triangle of each matrix, etc.)
So, the expected output DataFrame is:
matrix size
0 [[1, 2], [3, 4]] 2
1 [[1, 2, 3], [4, 5, 6], [7, 8, ... 3
I'm trying to use converters to convert the columns as I read them.
For example, if I wanted to read the strings in matrix as simple arrays, I could do the following:
import numpy as np
converters = {'matrix': lambda x: np.fromstring(x[1:-1], sep=',').astype('int64')}
And then read the file passing this dictionary:
import pandas as pd
df = pd.read_csv('mydata.csv', converters=converters)
The output would be:
matrix size
0 [1, 2, 3, 4] 2
1 [1, 2, 3, 4, 5, 6, 7, 8, 9] 3
In my case, I have a function to transform the strings to matrices:
def array_to_matrix(array_str, size):
    array = np.fromstring(array_str[1:-1], sep=',').astype('int64')
    return array.reshape(size, size)
But this function requires two arguments.
I can parse the matrix columns by doing this:
df['matrix'] = df.apply(lambda x: array_to_matrix(x['matrix'], x['size']), axis=1)
However, I haven't been able to find a way to parse the matrices using converters. To use converters, I could do the following:
matrix_converters = dict([('matrix', lambda x, y: array_to_matrix(x, y))])
But x will become the value in matrix (the dictionary key) and I have no way to pass y.
My use case is more complex and would benefit from being able to parse many similar columns while reading the file.
Is it possible to pass more than one column in the DataFrame to converters, or is it limited to one?
Try:
df['matrix'] = df.apply(lambda x: np.array(eval(x['matrix'])).reshape((x['size'], x['size'])), axis=1)
or, if the matrix is not square:
df['matrix'] = df.apply(lambda x: np.array(eval(x['matrix'])).reshape((x['size'], -1)), axis=1)
Output:
print(df)
matrix size
0 [[1, 2], [3, 4]] 2
1 [[1, 2, 3], [4, 5, 6], [7, 8, 9]] 3
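As a hedged sketch of my own (not from the answer above): since each converter receives only the single cell value, a common pattern for the "many similar columns" use case is to read the file first and then loop an apply over an assumed list of column names:
import numpy as np
import pandas as pd

df = pd.read_csv('mydata.csv')

# 'matrix_cols' is an assumed list of similarly formatted columns
matrix_cols = ['matrix']
for col in matrix_cols:
    df[col] = df.apply(
        lambda row: np.fromstring(row[col][1:-1], sep=',')
                      .astype('int64')
                      .reshape(row['size'], row['size']),
        axis=1,
    )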

Compare two columns and return the row numbers in Python

I have two columns, A = [11, 24, 7, 7, 0, 4, 9, 20, 3, 5] and B = [5, 8, 9, 1, 11], of different lengths. I want to find which elements of A also appear in B and return the row numbers in both A and B. For example:
A and B share the values 5, 9, and 11, so the returned matrix is C = [5, 9, 11]
The returned row numbers of A should be row_A = [9, 6, 0]
The returned row numbers of B should be row_B = [0, 2, 4]
Is there any function in Python I can use?
Thanks
This is numpy.intersect1d. Using the return_indices argument, you can also get the indices in addition to the values. This works whether A and B are NumPy arrays, pandas Series, or lists.
Note that if your objects are pandas objects, the returned indices are array positions. So if your Series don't have a RangeIndex starting from 0, you can slice their Index objects by those array locations to get the actual Series index labels.
import numpy as np

# the two columns from the question
A = [11, 24, 7, 7, 0, 4, 9, 20, 3, 5]
B = [5, 8, 9, 1, 11]

vals, idxA, idxB = np.intersect1d(A, B, return_indices=True)
vals
#array([ 5, 9, 11])
idxA
#array([9, 6, 0])
idxB
#array([0, 2, 4])
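To illustrate the note about pandas labels above, here is a small sketch of my own with hypothetical Series that have non-default index labels:
import numpy as np
import pandas as pd

A = pd.Series([11, 24, 7, 7, 0, 4, 9, 20, 3, 5], index=list('abcdefghij'))
B = pd.Series([5, 8, 9, 1, 11], index=list('vwxyz'))

vals, idxA, idxB = np.intersect1d(A, B, return_indices=True)
labels_A = A.index[idxA]   # Index(['j', 'g', 'a'])
labels_B = B.index[idxB]   # Index(['v', 'x', 'z'])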
If you have two different dataframes, copy the index into a column on each and join on the value columns:
df_A["index"] = df_A.index
df_B["index"] = df_B.index
df_A.merge(df_B, left_on="A", right_on="B", how="inner")
A full example:
df = pd.DataFrame(data = {"A":[11, 24, 7, 7, 0, 4, 9, 20, 3, 5], "B" : [5, 8, 9, 1, 11, -1,-1,-1,-1,-1]})
df_A = df[["A"]]
df_A["index"] = df.index
df_B = df[["B"]]
df_B["index"] = df.index
df_A.merge(df_B, left_on="A", right_on="B", how="inner")
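With the example above, the inner merge should produce something like:
    A  index_x   B  index_y
0  11        0  11        4
1   9        6   9        2
2   5        9   5        0
where index_x and index_y are the matching row numbers in A and B respectively.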

Python: What is the most efficient / fastest way to unpack the following dataframe to a matrix?

I have the following grid(actually a dataframe):
params = pd.DataFrame(np.array([(alpha, gamma) for alpha in np.linspace(0,1,10) for gamma in np.linspace(0,2,10)]),
                      columns=['alpha', 'gamma'])
Then I use apply:
params['res'] = params.apply(lambda row: func(x=row['alpha'], y=row['gamma']), axis=1)
How do I unpack the above into a matrix/dataframe below?
pd.DataFrame(elements of params['res'],
index = np.linspace(0,1,10),
columns = np.linspace(0,2,10))
You can first transform the res Series to a NumPy array, then use the reshape method:
result_df = pd.DataFrame(params['res'].to_numpy().reshape(10, 10),
                         index=np.linspace(0, 1, 10),
                         columns=np.linspace(0, 2, 10))
Using a smaller size example:
# Simulating res
res = np.random.randint(0,10, 9)
res
array([9, 3, 5, 9, 3, 1, 4, 0, 6])
res.reshape(3,3)
array([[9, 3, 5],
[9, 3, 1],
[4, 0, 6]])
If this is not your expected result, you can transpose it:
res.reshape(3,3).T
array([[9, 9, 4],
[3, 3, 0],
[5, 1, 6]])
You are looking for pivot:
params.pivot(index='alpha', columns='gamma', values='res')
If the question only asks for the fastest way to make a 10-by-10 NumPy 2D array/matrix from the res column, then I think this solution is the one:
params.res.values.reshape(10,10)
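Putting the pieces together, here is a small self-contained sketch of my own, using alpha * gamma as a stand-in for the unspecified func:
import numpy as np
import pandas as pd

# stand-in for the question's func
def func(x, y):
    return x * y

params = pd.DataFrame(
    [(alpha, gamma) for alpha in np.linspace(0, 1, 10) for gamma in np.linspace(0, 2, 10)],
    columns=['alpha', 'gamma'],
)
params['res'] = params.apply(lambda row: func(x=row['alpha'], y=row['gamma']), axis=1)

# alpha varied in the outer loop, so rows map to alpha and columns to gamma
result_df = pd.DataFrame(params['res'].to_numpy().reshape(10, 10),
                         index=np.linspace(0, 1, 10),
                         columns=np.linspace(0, 2, 10))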

function across 2 dataframes based on index (python)

I have two dataframes, A and B, and I am wondering how to create the dataframe shown in orange.
The value for each cell is a function of its row and column index. For example, the top-left cell would be dataframe A.A0 + dataframe A.A1 - dataframe B.0.
I tried with an empty dataframe of the orange dimensions (emptyDf):
emptyDf.applymap(lambda x: x[dfA[0]] + x[dfA[1]] - x[dfB[0]])
What you are trying to do is not really in the spirit of a Pandas dataframe; it is more a matrix-manipulation exercise, for which NumPy (the library Pandas is built on) is more appropriate. It is not hard to move between Pandas dataframes and NumPy arrays and back again, though you may need to store the indexes and column labels somewhere safe to restore when you bring the result back into Pandas. NumPy has functions for just about any manipulation you could dream up; here are a few tools that help with this application:
import pandas as pd
import numpy as np
# create your dataframes:
series = pd.Series([10,9,8,7,6], index=[0,1,2,3,4])
df1 = pd.DataFrame([series])
cols = ['A','B','C','D']
list_of_series = [pd.Series([1,2,3,4],index=cols), pd.Series([5,6,7,8],index=cols)]
df2 = pd.DataFrame(list_of_series, columns=cols)
Now convert to NumPy
A = np.array(df2)
>>> A
array([[1, 2, 3, 4],
[5, 6, 7, 8]])
B = np.array(df1)
>>> B.T
array([[10],
[ 9],
[ 8],
[ 7],
[ 6]])
Now a few NumPy operations to accomplish the task:
C = A.sum(axis=0)
D = np.tile(C,(5,1))
E = np.tile(B.T, (1,4))
F = D - E
F
array([[-4, -2, 0, 2],
[-3, -1, 1, 3],
[-2, 0, 2, 4],
[-1, 1, 3, 5],
[ 0, 2, 4, 6]])
Now convert it back to a dataframe:
pd.DataFrame(F, columns=['A','B','C','D'], index=[0,1,2,3,4])
Anyway, I wonder if this can be done directly in Pandas, but it strikes me as a matrix problem, and in terms of computation time for a large system, since this stays within NumPy I don't think it would be slow.
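As a side note (my own sketch, not part of the answer above): NumPy broadcasting can replace the np.tile calls entirely, since subtracting a (5, 1) array from a (4,) array broadcasts to a (5, 4) result:
import numpy as np

# Broadcasting version of D - E above: A.sum(axis=0) has shape (4,)
# and B.T has shape (5, 1), so the subtraction broadcasts to (5, 4).
F = A.sum(axis=0) - B.T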

How to group rows in a Numpy 2D matrix based on column values?

What would be an efficient (fast and easy) way of grouping the rows of a 2D NumPy matrix by different column conditions (e.g. grouping by the values in column 2) and running f1() and f2() on each of those groups?
Thanks
If you have an array arr of shape (rows, cols), you can get the vector of all values in column 2 as
col = arr[:, 2]
You can then construct a boolean array with your grouping condition, say group 1 is made up of those rows with have a value larger than 5 in column 2:
idx = col > 5
You can apply this boolean array directly to your original array to select rows:
group_1 = arr[idx]
group_2 = arr[~idx]
For example:
>>> arr = np.random.randint(10, size=(6,4))
>>> arr
array([[0, 8, 7, 4],
[5, 2, 6, 9],
[9, 5, 7, 5],
[6, 9, 1, 5],
[8, 0, 5, 8],
[8, 2, 0, 6]])
>>> idx = arr[:, 2] > 5
>>> arr[idx]
array([[0, 8, 7, 4],
[5, 2, 6, 9],
[9, 5, 7, 5]])
>>> arr[~idx]
array([[6, 9, 1, 5],
[8, 0, 5, 8],
[8, 2, 0, 6]])
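To go beyond a two-way split and group by every distinct value in a column, here is a minimal pure-NumPy sketch of my own, with np.mean standing in for f1:
import numpy as np

# group rows by each distinct value in column 2
keys = arr[:, 2]
uniq, inverse = np.unique(keys, return_inverse=True)
groups = {k: arr[inverse == i] for i, k in enumerate(uniq)}

# run a function on each group (np.mean stands in for f1/f2)
results = {k: np.mean(g, axis=0) for k, g in groups.items()}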
A compact solution is to use numpy_indexed (disclaimer: I am its author), which implements a fully vectorized solution to this type of problem:
The simplest way to use it is as:
import numpy_indexed as npi
npi.group_by(arr[:, col1]).mean(arr)
But this also works:
# run function f1 on each group, formed by keys which are the rows of arr[:, [col1, col2]]
npi.group_by(arr[:, [col1, col2]], arr, f1)
from operator import itemgetter
sorted(my_numpy_array, key=itemgetter(1))
or maybe something like:
from itertools import groupby
from operator import itemgetter

# groupby only groups consecutive items, so sort by the key column first
for key, group in groupby(sorted(my_numpy_array, key=itemgetter(1)), key=itemgetter(1)):
    print(key, list(group))
