points to be plotted are not in order - python

A few data points were obtained from an experiment, but they are not in order, so the lines drawn between the points are not correct.
I need to plot them in increasing order along the x-axis.
C=[0.5,4,2,1,3,8,6,10]
D=[20,2,2,10,0.3,2.5,0.8,1]
%matplotlib inline
import matplotlib.pyplot as plt
#plot obtained from given data points
plt.plot(C,D)
## required plot
A=[0.5, 1, 2, 3, 4, 6, 8, 10]
B=[20, 10, 2, 0.5, 2, 0.8, 2.5, 1]
plt.plot(A,B)

Solution using pandas. I recommend using DataFrames in future for plotting tasks.
from matplotlib import pyplot as plt
import pandas as pd
C= [0.5, 4, 2, 1, 3, 8, 6, 10]
D= [20, 2, 2, 10, 0.3, 2.5, 0.8, 1]
xy = pd.DataFrame({'x': C, 'y': D})
xy.sort_values('x', inplace=True)
plt.plot(xy['x'], xy['y'])
plt.show()

Your C is not sorted, so the continuous line joins the points in the order they are given, which makes the output of plot(C, D) look like a mess. I would personally use the np.argsort function to get the indices that sort C and use them to plot C and D in sorted order, as follows (showing only the relevant added lines):
import numpy as np
C = np.array([0.5,4,2,1,3,8,6,10])
D = np.array([20,2,2,10,0.3,2.5,0.8,1])
plt.plot(sorted(C), D[np.argsort(C)], 'b')
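If you would rather not depend on pandas or NumPy, the same reordering can be sketched in plain Python by sorting the (x, y) pairs together. This is only an illustrative alternative, not something from the answers above; the names pairs, xs and ys are made up for the example.

```python
# Original, unordered measurements from the question
C = [0.5, 4, 2, 1, 3, 8, 6, 10]
D = [20, 2, 2, 10, 0.3, 2.5, 0.8, 1]

# Pair each x with its y, then sort the pairs by x only.
# sorted() is stable, so points with equal x keep their original order.
pairs = sorted(zip(C, D), key=lambda p: p[0])

# Unzip back into two sequences that can be passed to plt.plot(xs, ys)
xs, ys = zip(*pairs)
```

The key point is that x and y are reordered together, so every y value stays attached to its own x value.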


Correlation matrix heatmap range on a target value

I am trying to figure out how to draw a correlation matrix heatmap with seaborn (sns) whose values come from a target column. I want to find out whether a combination of two features has an effect on target_value.
I know I can do the following, but this shows the correlation between features, not the correlation of two features with target_value:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(dataframe.corr());
I have following CSV
feature_1, feature_2, feature_3, feature_4, target_value
4, 8, 9, 8, 0.1
9, 7, 2, 0, 0.2
4, 4, 1, 4, 0.6
9, 7, 8, 4, 0.7
0, 9, 0, 7, 0.9
I could encode them as follows, based on a threshold that defines whether a feature is present (1) or not present (0).
feature_1, feature_2, feature_3, feature_4, target_value
0, 1, 1, 1, 0.1
1, 1, 0, 0, 0.2
0, 0, 0, 0, 0.6
1, 1, 1, 0, 0.7
0, 1, 0, 1, 0.9
I would like to know the correlation of feature_1:4 with the target value. I would also like to know if and how I could filter the features shown on the axes. For this I guess I can filter the dataframe based on target_value. However, I am not sure how to show or hide features on the axes,
e.g.
feature_1 and feature_2 on X axis /
feature_3 and feature_4 on Y axis
for target value >= 0.5
e.g.
feature_1 and feature_2 on X axis /
feature_3 and feature_4 on Y axis
for target value < 0.5
Pandas' corrwith() helps to find the correlation between one column and the others. As the result is a series and seaborn expects a dataframe, the series needs to be converted to one.
To find the correlation between feature_1/feature_2 and feature_3/feature_4 for a subset of the target values:
take the desired subset of the dataframe
calculate the correlation
take some rows/columns from the correlation matrix
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({'feature_1': [4, 9, 4, 9, 0],
'feature_2': [8, 7, 4, 7, 9],
'feature_3': [9, 2, 1, 8, 0],
'feature_4': [8, 0, 4, 4, 7],
'target_value': [0.1, 0.2, 0.6, 0.7, 0.9]})
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, figsize=(16, 5))
corr_with_target = df.drop(columns='target_value').corrwith(df['target_value'])
sns.heatmap(pd.DataFrame(corr_with_target), ax=ax1)
ax1.tick_params(rotation=0)
ax1.set_xticks([])
ax1.set_title('Correlation with target value')
corr_for_large_target = df[df['target_value'] >= 0.5].corr().loc[['feature_1', 'feature_2'], ['feature_3', 'feature_4']]
sns.heatmap(pd.DataFrame(corr_for_large_target), ax=ax2)
ax2.set_title('Correlation for large target')
corr_for_small_target = df[df['target_value'] < 0.5].corr().loc[['feature_1', 'feature_2'], ['feature_3', 'feature_4']]
sns.heatmap(pd.DataFrame(corr_for_small_target), ax=ax3)
ax3.set_title('Correlation for small target')
plt.tight_layout()
plt.show()
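The three steps (subset, correlate, pick rows and columns) can also be wrapped in a small helper so the same logic is reused for both target ranges. This is a minimal sketch without the plotting part; cross_corr is a hypothetical helper name introduced here, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({'feature_1': [4, 9, 4, 9, 0],
                   'feature_2': [8, 7, 4, 7, 9],
                   'feature_3': [9, 2, 1, 8, 0],
                   'feature_4': [8, 0, 4, 4, 7],
                   'target_value': [0.1, 0.2, 0.6, 0.7, 0.9]})

def cross_corr(frame, rows, cols):
    """Correlate the `rows` features with the `cols` features within `frame`."""
    # Full correlation matrix of the selected columns, then keep one block of it
    return frame[rows + cols].corr().loc[rows, cols]

x_features = ['feature_1', 'feature_2']
y_features = ['feature_3', 'feature_4']

# Step 1 (subset) happens in the boolean mask, steps 2 and 3 inside the helper
large = cross_corr(df[df['target_value'] >= 0.5], x_features, y_features)
small = cross_corr(df[df['target_value'] < 0.5], x_features, y_features)
```

Each result is a 2x2 dataframe that can be passed directly to sns.heatmap. Note that with this tiny example the "small" subset has only two rows, so every correlation in it is exactly +1 or -1; meaningful values need more data.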

Plot a histogram where the bars are coloured based on a second list of values

I can plot a histogram in Python for example with matplotlib:
from matplotlib import pyplot as plt
x = [3,5,12,7,8,6,4,6]
plt.hist(x)
However I have a second array y = [4,6,8,2,4,5,8,7] where each value corresponds to the value at the same position of x. Now I would like to create a histogram where each bar's height is defined by x, but each bar's color is defined by the values in y that belong to its x values. You could also say I have tuples as in list(zip(x,y)) where the first value should be used for the histogram itself and the mean value of the second tuple value in each bin should determine the color.
np.unique(x, return_counts=True) returns an array with the unique values of x and their count.
Converting everything to numpy arrays, y[x == val] selects the subset of y at each position where x is equal to val. y[x == val].mean() gets the mean of those values. Calling cmap(norm(...)) gives the color corresponding to that value. The cmap and norm can be used to create a colorbar.
Here is some example code, including embellishments to change ticks, margins and spines:
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator
from matplotlib.cm import ScalarMappable
import numpy as np
x = np.array([3, 5, 12, 7, 8, 6, 4, 6])
y = np.array([4, 6, 8, 2, 4, 5, 8, 7])
values, counts = np.unique(x, return_counts=True)
cmap = plt.get_cmap('inferno')
norm = plt.Normalize(0, y.max()) # or plt.Normalize(y.min(), y.max())
colors = [cmap(norm(y[x == val].mean())) for val in values]
fig, ax = plt.subplots()
ax.bar(values, counts, color=colors, edgecolor='black')
ax.yaxis.set_major_locator(MultipleLocator(1))
ax.xaxis.set_major_locator(MultipleLocator(1))
ax.set_ylabel('Count')
ax.margins(x=0.02, y=0)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.colorbar(ScalarMappable(cmap=cmap, norm=norm), pad=0.02, ax=ax)
plt.show()
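To check which numbers end up driving the bar colors, the counting and averaging step can be run on its own, without any plotting. This is a small sketch of that step only; mean_y_per_x is a hypothetical helper name used for illustration.

```python
import numpy as np

def mean_y_per_x(x, y):
    """Return unique x values, their counts, and the mean of y for each of them."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=float)
    values, counts = np.unique(x, return_counts=True)
    # y[x == v] selects the y entries whose x equals v
    means = np.array([y[x == v].mean() for v in values])
    return values, counts, means

x = [3, 5, 12, 7, 8, 6, 4, 6]
y = [4, 6, 8, 2, 4, 5, 8, 7]
values, counts, means = mean_y_per_x(x, y)

for v, c, m in zip(values, counts, means):
    print(f"x={v}: count={c}, mean y={m}")
```

Only x=6 occurs twice here, so its color comes from the mean of 5 and 7; every other bar takes the color of a single y value.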
Here is another example, using the tips dataset from seaborn, with the rounded total_bill on the x-axis, the count on the y-axis and colored via the tip amount.
import seaborn as sns
tips = sns.load_dataset('tips')
x = np.round(tips['total_bill'])
y = np.array(tips['tip'])
values, counts = np.unique(x, return_counts=True)
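cmap = plt.get_cmap('turbo')
# The snippet is cut off here; the remaining steps are assumed to mirror the first example
norm = plt.Normalize(y.min(), y.max())
colors = [cmap(norm(y[x == val].mean())) for val in values]
plt.bar(values, counts, color=colors)
plt.colorbar(ScalarMappable(cmap=cmap, norm=norm), ax=plt.gca())
plt.show()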
PS: As mentioned in @Arne's answer, seaborn can be used to replace the norm and color assignment with seaborn's hue. Without embellishments, the code would look like:
import numpy as np
import seaborn as sns
x = np.array([3, 5, 12, 7, 8, 6, 4, 6])
y = np.array([4, 6, 8, 2, 4, 5, 8, 7])
values, counts = np.unique(x, return_counts=True)
sns.set_style('darkgrid')
ax = sns.barplot(x=values, y=counts, hue=[y[x == val].mean() for val in values],
palette='inferno', dodge=False)
The seaborn library is very useful to visualize multi-dimensional data like these. You could store x and y in a pandas dataframe and then add the bin numbers and the average y values per bin:
import numpy as np
import pandas as pd
import seaborn as sns
x = [3, 5, 12, 7, 8, 6, 4, 6]
y = [4, 6, 8, 2, 4, 5, 8, 7]
n_bins = 4 # number of bins for the histogram
df = pd.DataFrame({'x': x, 'y': y})
_, bin_edges = np.histogram(x, bins=n_bins)
df['bin'] = pd.cut(x, bins=bin_edges, labels=False, include_lowest=True)
color = df.groupby('bin').mean()['y']
df['color'] = df.bin.apply(lambda k: color[k])
df
    x  y  bin     color
0   3  4    0  6.000000
1   5  6    0  6.000000
2  12  8    3  8.000000
3   7  2    1  4.666667
4   8  4    2  4.000000
5   6  5    1  4.666667
6   4  8    0  6.000000
7   6  7    1  4.666667
Then drawing the colored histogram is easy:
sns.histplot(data=df, x='x', bins=bin_edges, hue='color');

boxplot for all data in dataframe: error "'numpy.ndarray' object has no attribute 'boxplot'"

I am trying to display, in one figure with subplots, the boxplots for each column of my dataframe df.
I have read this question:
Subplot for seaborn boxplot
and tried to implement the given solution:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30], 'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
df = pd.DataFrame(data=d)
names = list(df.columns)
f, axes = plt.subplots(round(len(names)/3), 3)
y = 0
for name in names:
    sns.boxplot(x=df[name], ax=axes[y])
    y = y + 1
Unfortunately I get an error
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-111-489a538377fc> in <module>
3 y = 0;
4 for name in names:
----> 5 sns.boxplot(x= df[name], ax=axes[y])
6 y = y + 1
AttributeError: 'numpy.ndarray' object has no attribute 'boxplot'
I understand there is a problem with df[name] but I can't see how to fix it.
Would someone be able to point me in the right direction?
Thank you very much.
The problem comes from passing ax=axes[y] to boxplot. axes is a 2-d numpy array with shape (2, 3) that contains the grid of Matplotlib axes that you requested. So axes[y] is a 1-d numpy array that contains three Matplotlib AxesSubplot objects. I suspect boxplot is attempting to dispatch to this argument, and it expects it to be an object with a boxplot method. You can fix this by indexing axes with the row and column that you want to use.
Here's your script, with a small change to do that:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
d = {'col1': [1, 2, 5.5, 100], 'col2': [3, 4, 0.2, 3], 'col3': [1, 4, 6, 30], 'col4': [2, 24, 0.2, 13], 'col5': [9, 84, 0.9, 3]}
df = pd.DataFrame(data=d)
names = list(df.columns)
f, axes = plt.subplots(round(len(names)/3), 3)
y = 0
for name in names:
    i, j = divmod(y, 3)
    sns.boxplot(x=df[name], ax=axes[i, j])
    y = y + 1
plt.tight_layout()
plt.show()
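One caveat with the script above: round(len(names)/3) only happens to give enough rows for five columns. With four columns it would give round(1.33) = 1 row, which has room for only three plots. A minimal sketch of a more general layout calculation, using math.ceil and divmod, is shown here; grid_positions is a hypothetical helper name.

```python
import math

def grid_positions(n_items, ncols=3):
    """Return the number of rows needed and the (row, col) cell for each item."""
    nrows = math.ceil(n_items / ncols)  # always enough rows, unlike round()
    cells = [divmod(k, ncols) for k in range(n_items)]
    return nrows, cells

nrows, cells = grid_positions(5)
print(nrows, cells)

# With 4 items, round(4 / 3) would give 1 row, but 2 rows are needed
print(round(4 / 3), grid_positions(4)[0])
```

When the grid can end up with a single row, plt.subplots returns a 1-d array by default, so axes[i, j] would fail; passing squeeze=False to plt.subplots keeps the array 2-d in every case.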

Histogram Bars not Centred over xticks in pyplot.hist

I guess I just didn't use the right keywords, because this has probably been asked before, but I didn't find a solution. Anyway, I have a problem where the bars of a histogram do not line up with the xticks. I want the bars to be centred over the xticks they correspond to, but they get placed between ticks to fill the space in between evenly.
import matplotlib.pyplot as plt
data = [1, 1, 1, 1.5, 2, 4, 4, 4, 4, 4.5, 5, 6, 6.5, 7, 9,9, 9.5]
bins = [x+n for n in range(1, 10) for x in [0.0, 0.5]]+[10.0]
plt.hist(data, bins, rwidth = .3)
plt.xticks(bins)
plt.show()
Note that what you are plotting here is not a histogram. A histogram would be
import matplotlib.pyplot as plt
data = [1, 1, 1, 1.5, 2, 4, 4, 4, 4, 4.5, 5, 6, 6.5, 7, 9,9, 9.5]
bins = [x+n for n in range(1, 10) for x in [0.0, 0.5]]+[10.0]
plt.hist(data, bins, edgecolor="k", alpha=1)
plt.xticks(bins)
plt.show()
Here, the bars range between the bins as expected. E.g. you have 3 values in the interval 1 <= x < 1.5.
Conceptually what you want to do here is get a bar plot of the counts of data values. This would not require any bins at all and could be done as follows:
import numpy as np
import matplotlib.pyplot as plt
data = [1, 1, 1, 1.5, 2, 4, 4, 4, 4, 4.5, 5, 6, 6.5, 7, 9,9, 9.5]
u, inv = np.unique(data, return_inverse=True)
counts = np.bincount(inv)
plt.bar(u, counts, width=0.3)
plt.xticks(np.arange(1,10,0.5))
plt.show()
Of course you can "misuse" a histogram plot to get a similar result. This requires moving the center of each bar to the left bin edge, plt.hist(.., align="left").
import matplotlib.pyplot as plt
data = [1, 1, 1, 1.5, 2, 4, 4, 4, 4, 4.5, 5, 6, 6.5, 7, 9,9, 9.5]
bins = [x+n for n in range(1, 10) for x in [0.0, 0.5]]+[10.0]
plt.hist(data, bins, align="left", rwidth = .6)
plt.xticks(bins)
plt.show()
This results in the same plot as above.
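A third option, not taken from the answer above, is to keep a real histogram but choose the bin edges so that every tick sits in the middle of its bin. The sketch below only computes the edges and counts with NumPy; the names centers, half_width and edges are made up for the example.

```python
import numpy as np

data = [1, 1, 1, 1.5, 2, 4, 4, 4, 4, 4.5, 5, 6, 6.5, 7, 9, 9, 9.5]

# Desired tick positions: 1.0, 1.5, ..., 9.5
centers = np.arange(1, 10, 0.5)
half_width = 0.25  # half of the spacing between ticks

# One edge to the left of every center, plus a final edge right of the last center
edges = np.append(centers - half_width, centers[-1] + half_width)

counts, _ = np.histogram(data, bins=edges)
for c, n in zip(centers, counts):
    print(c, n)
```

The same edges can then be passed to plt.hist(data, bins=edges) together with plt.xticks(centers), so the bars are centred on the ticks without needing align="left".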

matplotlib, pyplot : custom color for a specific data value

I am generating a heat map for my data.
Everything works fine, but I have a small problem. My data (numbers) range from 0 to 10.000.
0 means nothing (no data), and at the moment a field with 0 just takes the lowest color of my color scale. How can I give the fields with 0 a totally different color (e.g. black or white)?
See the picture to better understand what I mean:
My code (snippet) looks like this:
matplotlib.pyplot.imshow(results, interpolation='none')
matplotlib.pyplot.colorbar();
matplotlib.pyplot.xticks([0, 1, 2, 3, 4, 5, 6, 7, 8], [10, 15, 20, 25, 30, 35, 40, 45, 50]);
matplotlib.pyplot.xlabel('Population')
matplotlib.pyplot.yticks([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10], [1, 2, 3, 4, 5, 10, 15, 20, 25, 30, 'serial']);
matplotlib.pyplot.ylabel('Communication Step');
axis.xaxis.tick_top();
matplotlib.pyplot.savefig('./results_' + optimisationProblem + '_dim' + str(numberOfDimensions) + '_' + statisticType + '.png');
matplotlib.pyplot.close();
If you are not interested in a smooth transition between the values 0 and 0.0001, you can just set every value that equals 0 to NaN. These cells are then drawn in white, whereas 0.0001 will still be deep blue-ish.
The following code is an example. The data is generated randomly, so I select a single element of the array and set it to NaN, which shows up as white. I also included a commented-out line that sets every data point equal to 0 to NaN.
import numpy
import matplotlib.pyplot as plt
#Random data
data = numpy.random.random((10, 10))
#Set all data points equal to zero to NaN
#data[data == 0.] = float("NaN")
#Set single data value to nan
data[2][2] = float("NaN")
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.imshow(data, interpolation = "nearest")
plt.show()
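If the "no data" cells should get a specific color such as black rather than white, one possible variation is to mask the zeros and set the colormap's "bad" color. This is a sketch under the assumption that 0 is the only marker for missing data; the example array and the choice of the viridis colormap are made up for illustration.

```python
import copy

import numpy as np
import matplotlib.pyplot as plt

# Small made-up data set in which 0 means "no data"
rng = np.random.default_rng(0)
data = rng.integers(1, 10000, size=(6, 6)).astype(float)
data[0, 0] = 0
data[3, 4] = 0

# Mask every cell that equals 0 so imshow treats it as missing
masked = np.ma.masked_equal(data, 0)

# Work on a copy so the globally registered colormap is left untouched
cmap = copy.copy(plt.get_cmap("viridis"))
cmap.set_bad("black")  # color used for masked (missing) cells

fig, ax = plt.subplots()
im = ax.imshow(masked, interpolation="nearest", cmap=cmap)
fig.colorbar(im, ax=ax)
```

Call plt.show() or fig.savefig(...) afterwards to view the result. Because the zeros are masked, they also no longer influence the automatic color range, so the colorbar starts at the smallest real value.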
