How do I make my histogram of unequal bins show properly? - python

My data consists of the following:
Most of the numbers are below 60, with a few outliers in the 2000s.
I want to display it in a histogram with the following bin ranges:
0-1, 1-2, 2-3, 3-4, ..., 59-60, 60-max
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as axes
b = list(range(61)) + [2000] # will make [0, 1, ..., 60, 2000]
plt.hist(b, bins=b, edgecolor='black')
plt.xticks(b)
plt.show()
This shows the following:
Essentially what you see is all the numbers 0 .. 60 squished together on the left, and the 2000 on the right. This is not what I want.
So I remove the [2000] and get something like what I am looking for:
As you can see now it is better, but I still have the following problems:
How do I fix this such that the graph doesn't have any white space around (there's a big gap before 0 and after 60).
How do I fix this such that after 60, there is a 2000 tick that shows at the very end, while still keeping roughly the same spacing (not like the first?)

Here is one hacky solution using some random data. I still don't quite understand your second question, but I tried to do something based on your wording.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.axes as axes
fig, ax = plt.subplots(figsize=(12, 6))
data= np.random.normal(10, 5, 5000)
upper = 31
outlier = 2000
data = np.append(data, 100*[upper])
b = list(range(upper)) + [upper]
plt.hist(data, bins=b, edgecolor='black')
plt.xticks(b)
b[-1] = outlier
ax.set_xticklabels(b)
plt.xlim(0, upper)
plt.show()
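A variation on the same idea, as a minimal sketch rather than a definitive fix: clip the outliers into one extra unit-width bin, then relabel only the last tick with the real maximum. The data, the `upper` cutoff and the `clipped` name are assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: mostly below 60, plus a few outliers in the 2000s
data = np.concatenate([rng.uniform(0, 60, 500), [2000.0, 2100.0, 2500.0]])

upper = 60
edges = np.arange(upper + 2)             # 0, 1, ..., 60, 61
clipped = np.clip(data, 0, upper + 0.5)  # every outlier lands in the [60, 61] bin

fig, ax = plt.subplots(figsize=(12, 4))
counts, _, _ = ax.hist(clipped, bins=edges, edgecolor="black")

# Keep the even spacing, but label the last edge with the true maximum
labels = [str(e) for e in edges[:-1]] + [str(int(data.max()))]
ax.set_xticks(edges)
ax.set_xticklabels(labels, rotation=90)
ax.set_xlim(edges[0], edges[-1])         # no white space before 0 or after the last bin
```

The last bar then has the same width as the others even though it stands for the whole 60-to-max range, so it is worth saying so in the axis label or caption.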

Related

How can you make a python histogram percentage sum to 100%?

I am struggling to make a histogram plot where the total percentage of events sums to 100%. Instead, for this particular example, it sums to approximately 3%. Will anyone be able to show me how I make the percentages of my events sum to 100% for any array used?
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
plt.gca().yaxis.set_major_formatter(PercentFormatter(1))
data = np.array([0,9,78,6,44,23,88,77,12,29])
length_of_data = len(data) # Length of data
bins = int(np.sqrt(length_of_data)) # Choose number of bins
y = data
plt.title('Histogram')
plt.ylabel('Percentage Of Events')
plt.xlabel('bins')
plt.hist(y,bins=bins, density = True)
plt.show()
print(bins)
One way of doing it is to get the bin heights that plt.hist returns, then re-set the patch heights to the normalized height you want. It's not that involved if you know what to do, but not that ideal. Here's your case:
import matplotlib.pyplot as plt
from matplotlib.ticker import PercentFormatter
import numpy as np
plt.gca().yaxis.set_major_formatter(PercentFormatter(100)) # <-- changed here
data = np.array([0,9,78,6,44,23,88,77,12,29])
length_of_data = len(data) # Length of data
bins = int(np.sqrt(length_of_data)) # Choose number of bins
y = data
plt.title('Histogram')
plt.ylabel('Percentage Of Events')
plt.xlabel('bins')
#### Setting new heights
n, bins, patches = plt.hist(y, bins=bins, density = True, edgecolor='k')
scaled_n = n / n.sum() * 100
for new_height, patch in zip(scaled_n, patches):
    patch.set_height(new_height)
####
# Setting cumulative sum as verification
plt.plot((bins[1:] + bins[:-1])/2, scaled_n.cumsum())
# If you want the cumsum to start from 0, uncomment the line below
#plt.plot(np.concatenate([[0], (bins[1:] + bins[:-1])/2]), np.concatenate([[0], scaled_n.cumsum()]))
plt.ylim(top=110)
plt.show()
This is the resulting picture:
As others said, you can use seaborn. Here's how to reproduce my code above. You'd still need to add all the labels and styling you want.
import seaborn as sns
sns.histplot(data, bins=int(np.sqrt(length_of_data)), stat='percent')
sns.histplot(data, bins=int(np.sqrt(length_of_data)), stat='percent', cumulative=True, element='poly', fill=False, color='C1')
This is the resulting picture:
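The original plot sums to about 3% because `density=True` normalizes the area of the bars (height times bin width) to 1, not the sum of the heights. Another way to get percentages, sketched here as an alternative to re-setting the patch heights, is to give every observation a weight of 100/N through the `weights` argument of `hist`. The variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

data = np.array([0, 9, 78, 6, 44, 23, 88, 77, 12, 29])
n_bins = int(np.sqrt(len(data)))

# Each observation contributes 100/N percent to its bin
weights = np.full(len(data), 100.0 / len(data))

fig, ax = plt.subplots()
heights, edges, _ = ax.hist(data, bins=n_bins, weights=weights, edgecolor="k")
ax.set_title("Histogram")
ax.set_xlabel("bins")
ax.set_ylabel("Percentage Of Events")
```

Because the bar heights are already in percent, no `PercentFormatter` rescaling is needed here.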

Plot point on time series line graph

I have this dataframe and I want to draw it as a line plot, which I have already done.
The graph is:
The code to generate it is:
fig, ax = plt.subplots(figsize=(15, 5))
date_time = pd.to_datetime(df.Date)
df = df.set_index(date_time)
plt.xticks(rotation=90)
pd.DataFrame(df, columns=df.columns).plot.line( ax=ax,
xticks=pd.to_datetime(frame.Date))
I want a marker on the open and close lines wherever innovationScore is not 0, labelled with its value. The markers should show where the InnovationScore changes.
You have to address two problems - marking the corresponding spots on your curves and using the dates on the x-axis. The first problem can be solved by identifying the dates, where the score is not zero, then plotting markers on top of the curve at these dates. The second problem is more of a structural nature - pandas often interferes with matplotlib when it comes to date time objects. Using pandas standard plotting functions is good because it addresses common problems. But mixing pandas with matplotlib plotting (and to set the markers, you have to revert to matplotlib afaik) is usually a bad idea because they do not necessarily present the date time in the same format.
import pandas as pd
from matplotlib import pyplot as plt
#fake data generation, the following code block is just for illustration
import numpy as np
np.random.seed(1234)
n = 50
date_range = pd.date_range("20180101", periods=n, freq="D")
choice = np.zeros(10)
choice[0] = 3
df = pd.DataFrame({"Date": date_range,
"Open": np.random.randint(100, 150, n),
"Close": np.random.randint(100, 150, n),
"Innovation Score": np.random.choice(choice, n)})
fig, ax = plt.subplots()
#plot the three curves
l = ax.plot(df["Date"], df[["Open", "Close", "Innovation Score"]])
ax.legend(iter(l), ["Open", "Close", "Innovation Score"])
#filter dataset for score not zero
IS = df[df["Innovation Score"] > 0]
#plot markers on these positions
ax.plot(IS["Date"], IS[["Open", "Close"]], "ro")
#and/or set vertical lines to indicate the position
ax.vlines(IS["Date"], 0, max(df[["Open", "Close"]].max()), ls="--")
#label x-axis score not zero
ax.set_xticks(IS["Date"])
#beautify the output
ax.set_xlabel("Month")
ax.set_ylabel("Artificial score people take seriously")
fig.autofmt_xdate()
plt.show()
Sample output:
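The question also asks for the score value to be shown next to each marker. A minimal sketch of that, assuming the same column names as above and a small hand-made dataframe, is to call `ax.annotate` once per non-zero row:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data using the same column names as the answer above
df = pd.DataFrame({
    "Date": pd.date_range("2018-01-01", periods=10, freq="D"),
    "Open": np.arange(100, 110),
    "Close": np.arange(105, 115),
    "Innovation Score": [0, 0, 3, 0, 0, 0, 2, 0, 0, 0],
})

fig, ax = plt.subplots()
ax.plot(df["Date"], df["Open"], label="Open")
ax.plot(df["Date"], df["Close"], label="Close")

# Rows where the score is non-zero get a marker and a text label
changed = df[df["Innovation Score"] != 0]
ax.plot(changed["Date"], changed["Open"], "ro")
annotations = []
for date, y, score in zip(changed["Date"], changed["Open"], changed["Innovation Score"]):
    annotations.append(
        ax.annotate(str(score), xy=(date, y), xytext=(0, 8),
                    textcoords="offset points", ha="center")
    )

ax.legend()
fig.autofmt_xdate()
```

The label offset of 8 points is an arbitrary choice; the same loop can be repeated for the "Close" column.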
You can achieve it like this:
import matplotlib.pyplot as plt
plt.plot([1, 2, 3], "ro-") # r is red, o is a circle marker, - is a solid line
plt.plot([3, 2, 1], "b.-") # b is blue, . is a point marker, - is a solid line
plt.show()
Also check out the example here for another approach:
https://matplotlib.org/3.3.3/gallery/lines_bars_and_markers/markevery_prop_cycle.html
I very often get inspiration from this list of examples:
https://matplotlib.org/3.3.3/gallery/index.html
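Along the lines of the markevery example linked above, here is a small sketch that draws markers only at selected points of a line. The `close` and `scores` arrays are made-up stand-ins for the asker's columns.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical series: a line to plot and a score that is mostly zero
close = np.arange(105, 115)
scores = np.array([0, 0, 3, 0, 0, 0, 2, 0, 0, 0])

# Indices where the score is non-zero
marked = np.flatnonzero(scores).tolist()

fig, ax = plt.subplots()
# markevery accepts a list of point indices; markers appear only there
(line,) = ax.plot(close, "b-o", markevery=marked)
```

This keeps the line and its markers in a single artist, so they share one legend entry.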

Plotting colored lines connecting individual data points of two swarmplots

I have:
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
# Generate random data
set1 = np.random.randint(0, 40, 24)
set2 = np.random.randint(0, 100, 24)
# Put into dataframe and plot
df = pd.DataFrame({'set1': set1, 'set2': set2})
data = pd.melt(df)
sb.swarmplot(data=data, x='variable', y='value')
The two random distributions plotted with seaborn's swarmplot function:
I want the individual points of both distributions to be connected with a colored line such that the first data point of set 1 in the dataframe is connected with the first data point of set 2.
I realize that this would probably be relatively simple without seaborn but I want to keep the feature that the individual data points do not overlap.
Is there any way to access the individual point coordinates in the seaborn swarmplot function?
EDIT: Thanks to @Mead, who pointed out a bug in my post prior to 2021-08-23 (I forgot to sort the locations in the prior version).
I gave the nice answer by Paul Brodersen a try, and despite him saying that
Madness lies this way
... I actually think it's pretty straightforward and yields nice results:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
# Generate random data
rng = np.random.default_rng(42)
set1 = rng.integers(0, 40, 5)
set2 = rng.integers(0, 100, 5)
# Put into dataframe
df = pd.DataFrame({"set1": set1, "set2": set2})
print(df)
data = pd.melt(df)
# Plot
fig, ax = plt.subplots()
sns.swarmplot(data=data, x="variable", y="value", ax=ax)
# Now connect the dots
# Find idx0 and idx1 by inspecting the elements return from ax.get_children()
# ... or find a way to automate it
idx0 = 0
idx1 = 1
locs1 = ax.get_children()[idx0].get_offsets()
locs2 = ax.get_children()[idx1].get_offsets()
# before plotting, we need to sort so that the data points
# correspond to each other as they did in "set1" and "set2"
sort_idxs1 = np.argsort(set1)
sort_idxs2 = np.argsort(set2)
# revert "ascending sort" through sort_idxs2.argsort(),
# and then sort into order corresponding with set1
locs2_sorted = locs2[sort_idxs2.argsort()][sort_idxs1]
for i in range(locs1.shape[0]):
    x = [locs1[i, 0], locs2_sorted[i, 0]]
    y = [locs1[i, 1], locs2_sorted[i, 1]]
    ax.plot(x, y, color="black", alpha=0.1)
It prints:
   set1  set2
0     3    85
1    30     8
2    26    69
3    17    20
4    17     9
And you can see that the data is linked correspondingly in the plot.
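The double-argsort step is the least obvious part, so here is a NumPy-only sketch of it. It assumes, as the code above does, that each set of points comes back sorted in ascending order of value; `sorted1` and `sorted2` stand in for the y-values of `locs1` and `locs2`.

```python
import numpy as np

# The values printed above
set1 = np.array([3, 30, 26, 17, 17])
set2 = np.array([85, 8, 69, 20, 9])

# Stand-ins for the point locations, assumed sorted by value
sorted1 = np.sort(set1)
sorted2 = np.sort(set2)

sort_idxs1 = np.argsort(set1)
sort_idxs2 = np.argsort(set2)

# argsort of an argsort gives ranks, so sorted2[ranks2] is set2 in its original row order
ranks2 = sort_idxs2.argsort()
restored2 = sorted2[ranks2]

# Reorder into the order of sorted1 so that element k of both arrays is the same row
paired2 = restored2[sort_idxs1]

pairs = sorted(zip(sorted1.tolist(), paired2.tolist()))
rows = sorted(zip(set1.tolist(), set2.tolist()))
```

Here `pairs` and `rows` contain the same (set1, set2) tuples, which is what makes each line connect points from the same dataframe row.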
Sure, it's possible (but you really don't want to).
seaborn.swarmplot returns the axis instance (here: ax). You can call ax.get_children() to get all plot elements. You will see that for each set of points there is an element of type PathCollection. You can determine the x, y coordinates by using the PathCollection.get_offsets() method.
I do not suggest you do this! Madness lies this way.
I suggest you have a look at the source code (found here), and derive your own _PairedSwarmPlotter from _SwarmPlotter and change the draw_swarmplot method to your needs.

Heat map for a very large matrix, including NaNs

I am trying to see if NaNs are concentrated somewhere, or if there is any pattern for their distribution.
The idea is to use python to plot a heatMap of the matrix (which is 200K rows and 1k columns) and set a special color for NaN values (the rest of the values can be represented by the same color, this doesn't matter)
An example of a possible display:
Thank you all in advance
A 1:200 aspect ratio is pretty bad and, since you could run into memory issues, you should probably break it up into several Nx1k pieces.
That being said, here is my solution (inspired by your example image):
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import cm
from mpl_toolkits.axes_grid1 import AxesGrid
# generate random matrix
xDim = 2000
yDim = 4000
# number of nans
nNans = int(xDim*yDim*.1)  # size arguments must be integers
rands = np.random.rand(yDim, xDim)
# create a skewed distribution for the nans
x = np.clip(np.random.gamma(2, yDim*.125, size=nNans).astype(int), 0, yDim-1)
y = np.random.randint(0,xDim,size=nNans)
rands[x,y] = np.nan
# find the nans:
isNan = np.isnan(rands)
fig = plt.figure()
# make axesgrid so we can put a histogram-like plot next to the data
grid = AxesGrid(fig, 111, nrows_ncols=(1, 2), axes_pad=0.05)
# plot the data using binary colormap
grid[0].imshow(isNan, cmap=cm.binary)
# plot the histogram
grid[1].plot(np.sum(isNan,axis=1), range(isNan.shape[0]))
# set ticks and limits, so the figure looks nice
grid[0].set_xticks([0,250,500,750,1000,1250,1500,1750])
grid[1].set_xticks([0,250,500,750])
grid[1].set_xlim([0,750])
grid.axes_llc.set_ylim([0, yDim])
plt.show()
Here is what it looks like:
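For a matrix as tall as the one in the question, one option in the spirit of the advice about splitting it into pieces is to aggregate rows into blocks and plot the fraction of NaNs per block. This is a sketch under stated assumptions: the matrix is a smaller made-up one, and `block` is assumed to divide the number of rows evenly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical matrix, smaller than the 200K x 1k one in the question
A = rng.random((20_000, 100))
# Put NaNs only in the first 5000 rows so that a pattern exists
A[rng.integers(0, 5_000, 3_000), rng.integers(0, 100, 3_000)] = np.nan

block = 200  # rows per block; must divide the row count in this sketch
is_nan = np.isnan(A)
# Shape (n_blocks, block, n_cols) -> mean over the rows inside each block
nan_fraction = is_nan.reshape(-1, block, A.shape[1]).mean(axis=1)

fig, ax = plt.subplots()
im = ax.imshow(nan_fraction, aspect="auto", cmap="binary", interpolation="nearest")
fig.colorbar(im, ax=ax, label="fraction of NaNs per block")
ax.set_xlabel("column")
ax.set_ylabel(f"row block ({block} rows each)")
```

The aggregated image is only 100 rows tall here, so concentrations of NaNs show up as dark bands without drawing every row.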
# Learn about API authentication here: https://plot.ly/python/getting-started
# Find your api_key here: https://plot.ly/settings/api
import plotly.plotly as py
import plotly.graph_objs as go
data = [
go.Heatmap(
z=[[1, 20, 30],
[20, 1, 60],
[30, 60, 1]]
)
]
plot_url = py.plot(data, filename='basic-heatm
source: https://plot.ly/python/heatmaps/
What you could do is use a scatter plot:
import matplotlib.pyplot as plt
import numpy as np
# create a matrix with random numbers
A = np.random.rand(2000,10)
# make some NaNs in it:
for _ in range(1000):
    i = np.random.randint(0,2000)
    j = np.random.randint(0,10)
    A[i,j] = np.nan
# get a matrix to plot with only the NaNs:
B = np.isnan(A)
# if NaN plot a point.
for i in range(2000):
    for j in range(10):
        if B[i,j]: plt.scatter(i,j)
plt.show()
When using Python 2.6 or 2.7, consider using xrange instead of range for a speedup.
Note: it could be faster to replace the double loop with a single scatter call:
C = np.where(B)
plt.scatter(C[0],C[1])

Changing the length of axis lines in matplotlib

I am trying to change the displayed length of the axis lines of a matplotlib plot. This is my current code:
import matplotlib.pyplot as plt
import numpy as np
linewidth = 2
outward = 10
ticklength = 4
tickwidth = 1
fig, ax = plt.subplots()
ax.plot(np.arange(100))
ax.tick_params(right=False, top=False, length=ticklength, width=tickwidth, direction="out")  # booleans; current matplotlib treats the string "off" as true
ax.spines["top"].set_visible(False), ax.spines["right"].set_visible(False)
for line in ["left","bottom"]:
    ax.spines[line].set_linewidth(linewidth)
    ax.spines[line].set_position(("outward",outward))
Which generates the following plot:
I would like my plot to look like the following with axis line shortened:
I wasn't able to find this in ax[axis].spines method. I also wasn't able to plot this nicely using ax.axhline method.
You could add these lines to the end of your code:
ax.spines['left'].set_bounds(20, 80)
ax.spines['bottom'].set_bounds(20, 80)
for i in [0, -1]:
    ax.get_yticklabels()[i].set_visible(False)
    ax.get_xticklabels()[i].set_visible(False)
for i in [0, -2]:
    ax.get_yticklines()[i].set_visible(False)
    ax.get_xticklines()[i].set_visible(False)
To get this:
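Hiding tick labels and tick lines by index depends on which ticks matplotlib happens to generate. A possible alternative, sketched here with the question's values hard-coded and illustrative names `low`, `high` and `ticks`, is to set the ticks explicitly so that they fall only inside the spine bounds:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.arange(100))

ax.tick_params(right=False, top=False, length=4, width=1, direction="out")
for side in ["top", "right"]:
    ax.spines[side].set_visible(False)

low, high = 20, 80
ticks = np.arange(low, high + 1, 20)  # 20, 40, 60, 80

for side, set_ticks in [("left", ax.set_yticks), ("bottom", ax.set_xticks)]:
    ax.spines[side].set_linewidth(2)
    ax.spines[side].set_position(("outward", 10))
    ax.spines[side].set_bounds(low, high)  # draw the axis line only between low and high
    set_ticks(ticks)                       # ticks only where the line is drawn
```

With explicit ticks there is nothing left to hide, so the result does not change if the data range or figure size changes.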
