Correlation matrix heatmap range on a target value

I am trying to figure out how to draw a correlation matrix heatmap with seaborn (sns) in which the heatmap values come from a target column. I want to find out whether a combination of two features has an effect on target_value.
I know I can do the following, but this shows the correlation between features, not the effect of two features on target_value:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(dataframe.corr());
I have the following CSV:
feature_1, feature_2, feature_3, feature_4, target_value
4, 8, 9, 8, 0.1
9, 7, 2, 0, 0.2
4, 4, 1, 4, 0.6
9, 7, 8, 4, 0.7
0, 9, 0, 7, 0.9
I could encode them as follows, based on a threshold that defines whether a feature is present (1) or not present (0):
feature_1, feature_2, feature_3, feature_4, target_value
0, 1, 1, 1, 0.1
1, 1, 0, 0, 0.2
0, 0, 0, 0, 0.6
1, 1, 1, 0, 0.7
0, 1, 0, 1, 0.9
I would like to know the correlation of feature_1 through feature_4 with the target value. I would also like to know if and how I can filter which features are shown on each axis. I guess I can filter the dataframe based on target_value. However, I am not sure how to show or hide features on the axes,
e.g.
feature_1 and feature_2 on X axis /
feature_3 and feature_4 on Y axis
for target value >= 0.5
e.g.
feature_1 and feature_2 on X axis /
feature_3 and feature_4 on Y axis
for target value < 0.5

Pandas' corrwith() helps to find the correlation between one column and the others. As the result is a series and seaborn expects a dataframe, the series needs to be converted to one.
To find the correlation between feature_1/feature_2 and feature_3/feature_4 for a subset of the target values:
take the desired subset of the dataframe
calculate the correlation
take some rows/columns from the correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'feature_1': [4, 9, 4, 9, 0],
'feature_2': [8, 7, 4, 7, 9],
'feature_3': [9, 2, 1, 8, 0],
'feature_4': [8, 0, 4, 4, 7],
'target_value': [0.1, 0.2, 0.6, 0.7, 0.9]})
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(16, 5))
corr_with_target = df.drop(columns='target_value').corrwith(df['target_value'])
sns.heatmap(pd.DataFrame(corr_with_target), ax=ax1)
ax1.tick_params(rotation=0)
ax1.set_xticks([])
ax1.set_title('Correlation with target value')
corr_for_large_target = df[df['target_value'] >= 0.5].corr().loc[['feature_1', 'feature_2'], ['feature_3', 'feature_4']]
sns.heatmap(pd.DataFrame(corr_for_large_target), ax=ax2)
ax2.set_title('Correlation for large target')
corr_for_small_target = df[df['target_value'] < 0.5].corr().loc[['feature_1', 'feature_2'], ['feature_3', 'feature_4']]
sns.heatmap(pd.DataFrame(corr_for_small_target), ax=ax3)
ax3.set_title('Correlation for small target')
plt.tight_layout()
plt.show()
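
The subset-then-slice pattern above can be wrapped in a small helper. This is only a sketch: the name `sub_correlation` and its parameters are assumptions made for illustration and are not part of pandas or seaborn.

```python
import pandas as pd

df = pd.DataFrame({'feature_1': [4, 9, 4, 9, 0],
                   'feature_2': [8, 7, 4, 7, 9],
                   'feature_3': [9, 2, 1, 8, 0],
                   'feature_4': [8, 0, 4, 4, 7],
                   'target_value': [0.1, 0.2, 0.6, 0.7, 0.9]})


def sub_correlation(frame, rows, cols, mask=None):
    """Hypothetical helper: correlate `rows` against `cols`, optionally on a row subset."""
    subset = frame if mask is None else frame[mask]
    # Compute the full correlation matrix, then keep only the requested rows/columns
    return subset.corr().loc[rows, cols]


large = sub_correlation(df,
                        ['feature_1', 'feature_2'],
                        ['feature_3', 'feature_4'],
                        mask=df['target_value'] >= 0.5)
print(large)
```

The result is an ordinary DataFrame, so it can be passed straight to sns.heatmap(large, annot=True). Note that when a subset has only two rows, as for target_value < 0.5 in this sample, every correlation is +1, -1 or NaN, so such a heatmap says little.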

Related

Plotting (x,y) point to point connections with python

I am trying to plot a point to point line plot in python.
My data is in a pandas dataframe as below..
df = pd.DataFrame({
'x_coordinate': [0, 0, 0, 0, 1, 1,-1,-1,-2,0],
'y_coordinate': [0, 2, 1, 3, 3, 1,1,-2,2,-1],
})
print(df)
x_coordinate y_coordinate
0 0 0
1 0 2
2 0 1
3 0 3
4 1 3
5 1 1
6 -1 1
7 -1 -2
8 -2 2
9 0 -1
When I plot this, the points are joined in the order in which they appear in the df.
df.plot('x_coordinate','y_coordinate')
But is there a way to plot an order number next to each connection? I mean the order in which the line travels: say 1 for the first connection from (0,0) to (0,2), 2 for the one from (0,2) to (0,1), and so on.

The plot is OK. To check the order in which the vertices are joined, it helps to offset the data slightly so that overlapping segments separate. Here is the modified data (only x changed):
df = pd.DataFrame({
'x_coordinate': [0.1, 0.2, 0.3, 0.4, 1.5, 1.6,-1.7,-1.8,-2.9,0.1],
'y_coordinate': [0, 2, 1, 3, 3, 1,1,-2,2,-1],
})
Edit
For your new request, the code is modified as follows (full runnable code).
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
df = pd.DataFrame({
'x_coordinate': [0.1, 0.2, 0.3, 0.4, 1.5, 1.6,-1.7,-1.8,-2.9,0.1],
'y_coordinate': [0, 2, 1, 3, 3, 1,1,-2,2,-1],
})
fig = plt.figure(figsize=(6,5))
ax1 = fig.add_subplot(1, 1, 1)
df.plot('x_coordinate','y_coordinate', legend=False, ax=ax1)
# Label each vertex P0, P1, ... in the order it appears in the dataframe
for ea in zip(np.arange(len(df)), df.x_coordinate.values, df.y_coordinate.values):
    text, x, y = "P" + str(ea[0]), ea[1], ea[2]
    ax1.annotate(text, (x, y))
plt.show()

I found an easier way to do it and thought I would share it.
fig, ax = plt.subplots()
df.plot('x_coordinate','y_coordinate',ax=ax)
for k, v in df[['x_coordinate','y_coordinate']].iterrows():
    ax.annotate('p'+str(k+1), v)
plt.show()
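
Both answers label the points, while the question asks for a number on each connection. One possible way to do that, sketched here, is to place the label at the midpoint of every segment. The helper names `segment_midpoints` and `number_segments` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'x_coordinate': [0, 0, 0, 0, 1, 1, -1, -1, -2, 0],
    'y_coordinate': [0, 2, 1, 3, 3, 1, 1, -2, 2, -1],
})


def segment_midpoints(frame):
    """Midpoint of every consecutive pair of rows (hypothetical helper)."""
    # shift(-1) aligns each row with its successor; the last row has none and is dropped
    return ((frame + frame.shift(-1)) / 2).dropna().reset_index(drop=True)


def number_segments(ax, frame):
    """Write 1, 2, 3, ... at the midpoint of each connection on a matplotlib Axes."""
    mids = segment_midpoints(frame)
    pairs = zip(mids['x_coordinate'], mids['y_coordinate'])
    for i, (x, y) in enumerate(pairs, start=1):
        ax.annotate(str(i), (x, y))


mids = segment_midpoints(df)
print(mids.head(3))
```

After fig, ax = plt.subplots() and df.plot('x_coordinate', 'y_coordinate', ax=ax), calling number_segments(ax, df) adds the labels. Several segments in this data lie on the same line, so some numbers will crowd each other unless the points are offset first.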

points to be plotted are not in order

A few data points were obtained from an experiment, but they are not in order, so the lines between the points are not drawn correctly.
I need to plot them in, say, increasing order along the x-axis.
C=[0.5,4,2,1,3,8,6,10]
D=[20,2,2,10,0.3,2.5,0.8,1]
%matplotlib inline
import matplotlib.pyplot as plt
#plot obtained from given data points
plt.plot(C,D)
## required plot
A=[0.5, 1, 2, 3, 4, 6, 8, 10]
B=[20, 10, 2, 0.3, 2, 0.8, 2.5, 1]
plt.plot(A,B)

Solution using pandas. I recommend using DataFrames in future for plotting tasks.
from matplotlib import pyplot as plt
import pandas as pd
C= [0.5, 4, 2, 1, 3, 8, 6, 10]
D= [20, 2, 2, 10, 0.3, 2.5, 0.8, 1]
xy = pd.DataFrame({'x': C, 'y': D})
xy.sort_values('x', inplace=True)
plt.plot(xy['x'], xy['y'])
plt.show()

Your C is not sorted, so the points, which are joined by a continuous line in the order given, look like a mess in the output of plot(C, D). I would personally use np.argsort to get the indices that sort C and use them to reorder both C and D, as follows (showing only the relevant added lines):
import numpy as np
C = np.array([0.5,4,2,1,3,8,6,10])
D = np.array([20,2,2,10,0.3,2.5,0.8,1])
plt.plot(sorted(C), D[np.argsort(C)], 'b')
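
If neither pandas nor NumPy is needed otherwise, the same reordering can be sketched with the standard library alone by sorting the (x, y) pairs together. The names `pairs`, `xs` and `ys` are only illustrative.

```python
C = [0.5, 4, 2, 1, 3, 8, 6, 10]
D = [20, 2, 2, 10, 0.3, 2.5, 0.8, 1]

# Sort the (x, y) pairs by x; tuples compare element-wise, so ties in x fall back to y
pairs = sorted(zip(C, D))

# Unzip the sorted pairs back into two lists that stay aligned
xs, ys = map(list, zip(*pairs))
print(xs)
print(ys)
```

xs and ys can then be passed to plt.plot(xs, ys).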

Histogram Bars not Centred over xticks in pyplot.hist

I guess I just didn't use the right keywords, because this has probably been asked before, but I couldn't find a solution. Anyway, I have a problem where the bars of a histogram do not line up with the xticks. I want the bars to be centred over the xticks they correspond to, but they are placed between the ticks so that they fill the space in between evenly.
import matplotlib.pyplot as plt
data = [1, 1, 1, 1.5, 2, 4, 4, 4, 4, 4.5, 5, 6, 6.5, 7, 9,9, 9.5]
bins = [x+n for n in range(1, 10) for x in [0.0, 0.5]]+[10.0]
plt.hist(data, bins, rwidth = .3)
plt.xticks(bins)
plt.show()

Note that what you are plotting here is not a histogram. A histogram would be
import matplotlib.pyplot as plt
data = [1, 1, 1, 1.5, 2, 4, 4, 4, 4, 4.5, 5, 6, 6.5, 7, 9,9, 9.5]
bins = [x+n for n in range(1, 10) for x in [0.0, 0.5]]+[10.0]
plt.hist(data, bins, edgecolor="k", alpha=1)
plt.xticks(bins)
plt.show()
Here, the bars range between the bins as expected. E.g. you have 3 values in the interval 1 <= x < 1.5.
Conceptually what you want to do here is get a bar plot of the counts of data values. This would not require any bins at all and could be done as follows:
import numpy as np
import matplotlib.pyplot as plt
data = [1, 1, 1, 1.5, 2, 4, 4, 4, 4, 4.5, 5, 6, 6.5, 7, 9,9, 9.5]
u, inv = np.unique(data, return_inverse=True)
counts = np.bincount(inv)
plt.bar(u, counts, width=0.3)
plt.xticks(np.arange(1,10,0.5))
plt.show()
Of course, you can "misuse" a histogram plot to get a similar result. This requires moving the center of each bar to the left bin edge with plt.hist(.., align="left").
import matplotlib.pyplot as plt
data = [1, 1, 1, 1.5, 2, 4, 4, 4, 4, 4.5, 5, 6, 6.5, 7, 9,9, 9.5]
bins = [x+n for n in range(1, 10) for x in [0.0, 0.5]]+[10.0]
plt.hist(data, bins, align="left", rwidth = .6)
plt.xticks(bins)
plt.show()
This results in the same plot as above.
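
The two approaches agree for this data because every value coincides with a left bin edge. A quick numerical check, as a sketch without any plotting, compares the per-bin counts from np.histogram with the per-value counts from np.unique:

```python
import numpy as np

data = [1, 1, 1, 1.5, 2, 4, 4, 4, 4, 4.5, 5, 6, 6.5, 7, 9, 9, 9.5]
bins = [x + n for n in range(1, 10) for x in [0.0, 0.5]] + [10.0]

# Histogram view: one count per interval [left, right); only the last bin includes its right edge
hist_counts, edges = np.histogram(data, bins=bins)

# Bar-plot view: one count per distinct value
values, value_counts = np.unique(data, return_counts=True)

print(hist_counts[0])  # number of values with 1 <= x < 1.5
print(dict(zip(values.tolist(), value_counts.tolist())))
```

With data that does not sit exactly on the bin edges, the two views would differ, and only the bar plot of unique values would keep each bar at its true position.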

Calculating percentile of bins from numpy digitize?

I have a set of data, and a set of thresholds for creating bins:
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
thresholds = np.array([0,5,10])
bins = np.digitize(data, thresholds, right=True)
For each of the elements in bins, I want to know the base percentile. For example, in bins, the smallest bin should start at the 0th percentile. Then the next bin, for example, the 20th percentile. So that if a value in data falls between the 0th and 20th percentile of data, it belongs in the first bin.
I've looked into pandas rank(pct=True) but can't seem to get this done correctly.
Suggestions?

You can calculate the percentile for each element in your data array as described in a previous StackOverflow question (Map each list value to its corresponding percentile).
import numpy as np
from scipy import stats
data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])
Method 1: Using scipy.stats.percentileofscore :
data_percentile = np.array([stats.percentileofscore(data, a) for a in data])
data_percentile
Out[1]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
36.36363636, 59.09090909, 59.09090909, 95.45454545,
95.45454545, 72.72727273, 81.81818182])
Method 2: Using scipy.stats.rankdata and normalising to 100 (faster) :
ranked = stats.rankdata(data)
data_percentile = ranked/len(data)*100
data_percentile
Out[2]:
array([ 9.09090909, 18.18181818, 36.36363636, 36.36363636,
36.36363636, 59.09090909, 59.09090909, 95.45454545,
95.45454545, 72.72727273, 81.81818182])
Now that you have a list of percentiles, you can bin them as before using numpy.digitize :
bins_percentile = [0,20,40,60,80,100]
data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)
data_binned_indices
Out[3]:
array([1, 1, 2, 2, 2, 3, 3, 5, 5, 4, 5], dtype=int64)
This gives you the data binned according to the indices of your chosen list of percentiles. If desired, you could also return the actual (upper) percentiles using numpy.take :
data_binned_percentiles = np.take(bins_percentile, data_binned_indices)
data_binned_percentiles
Out[4]:
array([ 20, 20, 40, 40, 40, 60, 60, 100, 100, 80, 100])
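
Since the question mentions pandas rank(pct=True), here is a sketch of the same computation without SciPy. With the default method (average ranks), rank(pct=True) divides each rank by the number of values, which should match Method 2 above.

```python
import numpy as np
import pandas as pd

data = np.array([0.01, 0.02, 1, 1, 1, 2, 2, 8, 8, 4.5, 6.6])

# Average rank of each value divided by the number of values, scaled to 0..100
data_percentile = pd.Series(data).rank(pct=True).to_numpy() * 100

bins_percentile = [0, 20, 40, 60, 80, 100]
# right=True puts a value equal to a bin edge into the lower bin
data_binned_indices = np.digitize(data_percentile, bins_percentile, right=True)
print(data_binned_indices)
```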

matplotlib, pyplot : custom color for a specific data value

I am generating a heat map for my data.
Everything works fine, but I have a small problem. My data (numbers) range from 0 to 10,000.
0 means nothing (no data), and at the moment a field with 0 just takes the lowest color of my color scale. How can I give the data points with value 0 a completely different color (e.g. black or white)?
My code (snippet) looks like this:
matplotlib.pyplot.imshow(results, interpolation='none')
matplotlib.pyplot.colorbar();
matplotlib.pyplot.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8], [10, 15, 20, 25, 30, 35, 40, 45, 50]);
matplotlib.pyplot.xlabel('Population')
matplotlib.pyplot.yticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 'serial']);
matplotlib.pyplot.ylabel('Communication Step');
axis.xaxis.tick_top();
matplotlib.pyplot.savefig('./results_' + optimisationProblem + '_dim' + str(numberOfDimensions) + '_' + statisticType + '.png');
matplotlib.pyplot.close();

If you are not interested in a smooth transition between the values 0 and 0.0001, you can set every value that equals 0 to NaN. NaN cells are left uncolored, so they appear white on a default white background, whereas 0.0001 is still drawn in the lowest color of the colormap.
The following code shows an example with randomly generated data. I select a single element of the array and set it to NaN, which makes that cell white. I also included a commented-out line that sets every data point equal to 0 to NaN.
import numpy
import matplotlib.pyplot as plt
#Random data
data = numpy.random.random((10, 10))
#Set all data points equal to zero to NaN
#data[data == 0.] = float("NaN")
#Set single data value to nan
data[2][2] = float("NaN")
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.imshow(data, interpolation = "nearest")
plt.show()
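
One caveat when applying this to the data from the question: NaN only exists for floating-point arrays, so an integer array has to be converted first. The sketch below uses a small made-up `results` array as a stand-in for the real data and also shows a masked array as an alternative.

```python
import numpy as np

# Stand-in for the real `results` array (assumed here to hold integer values)
results = np.array([[0, 5, 10000],
                    [3, 0, 42]])

# NaN cannot be stored in an integer array, so convert to float before assigning it
as_float = results.astype(float)
as_float[as_float == 0] = np.nan

# Alternative: a masked array keeps the integer dtype and simply hides the zeros
masked = np.ma.masked_equal(results, 0)

print(as_float)
print(masked)
```

Either array can be passed to imshow. To get a specific color such as black for the hidden cells, the colormap's set_bad('black') method sets the color used for NaN or masked values.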
