Plotting a cumulative histogram with exported data in Python

Plotting a cumulative histogram with exported data in Python - python

I am trying to plot a cumulative histogram similar to the one shown below. It shows the number of occurrences (y-axis) of the French pronoun “vous” in a text corpus (x-axis) represented from word 0 to 92,633. It’s been created using a corpus analysis application named TXM. TXM’s plots, however, are not adapted to the specific requirements of my publisher. I would like to produce my own plots exporting the data to python. The problem is that the data exported by TXM is a bit puzzling, and I am wondering how I it can be used to make plots:
it’s a one-column txt file with integers.
Each one of them indicates the position of “vous” in the text corpus. Word 2620 is one “vous,”
3376, another one, etc. One of my attempts with Matplotlib :
from matplotlib import pyplot as plt
pos = [2620,3367,3756,4522,4546,9914,9972,9979,9987,10013,10047,10087,10114,13635,13645,13646,13758,13771,13783,13796,23410,23420,28179,28265,28274,28297,28344,34579,34590,34612,40280,40449,40570,40932,40938,40969,40983,41006,41040,41069,41096,41120,41214,41474,41478,42524,42533,42534,45569,45587,45598,56450,57574,57587]
plt.bar(pos, 1)
plt.show()
But this doesn't come close.
What steps should I follow to complete the plot?
Desired plot:

With matplotlib, you could create the step plot as follows. where='post' means the value changes at every x-position and stays so until the next x-position.
The x-values are the positions in the text, a zero is prepended to let the graph start with zero occurrences. The text-length is appended at the end. The y-values are the numbers 0, 1, 2, ..., where the last value is repeated to draw the last step in full.
from matplotlib import pyplot as plt
from matplotlib.ticker import MultipleLocator, StrMethodFormatter
import numpy as np
pos = [2620,3367,3756,4522,4546,9914,9972,9979,9987,10013,10047,10087,10114,13635,13645,13646,13758,13771,13783,13796,23410,23420,28179,28265,28274,28297,28344,34579,34590,34612,40280,40449,40570,40932,40938,40969,40983,41006,41040,41069,41096,41120,41214,41474,41478,42524,42533,42534,45569,45587,45598,56450,57574,57587]
text_len = 92633
cum = np.arange(0, len(pos) + 1)
fig, ax = plt.subplots(figsize=(12, 3))
ax.step([0] + pos + [text_len], np.pad(cum, (0, 1), 'edge'), where='post', label=f'vous {len(pos)}')
ax.xaxis.set_major_locator(MultipleLocator(5000)) # x-ticks every 5000
ax.xaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}')) # use the thousands separator
ax.yaxis.set_major_locator(MultipleLocator(5)) # have a y-tick every 5
ax.grid(b=True, ls=':') # show a grid with dotted lines
ax.autoscale(enable=True, axis='x', tight=True) # disable padding x-direction
ax.set_xlabel(f'T={text_len:,d}')
ax.set_ylabel('Occurrences')
ax.set_title("Progression of 'vous' in TCN")
plt.legend() # add a legend (uses the label of ax.step)
plt.tight_layout()
plt.show()

Related

Set log xticks in matplotlib for a linear plot

Consider
xdata=np.random.normal(5e5,2e5,int(1e4))
plt.hist(np.log10(xdata), bins=100)
plt.show()
plt.semilogy(xdata)
plt.show()
is there any way to display xticks of the first plot (plt.hist) as in the second plot's yticks? For good reasons I want to histogram the np.log10(xdata) of xdata but I'd like to set minor ticks to display as usual in a log scale (even considering that the exponent is linear...)
In other words, I want the x_axis of this plot:
to be like the y_axis
of the 2nd plot, without changing the spacing between major ticks (e.g., adding log marks between 5.5 and 6.0, without altering these values)

Proper histogram plot with logarithmic x-axis:
Explanation:
Cut off negative values
The randomly generated example data likely contains still some negative values
activate the commented code lines at the beginning to see the effect
logarithmic function isn't defined for values <= 0
while the 2nd plot just deals with y-axis log scaling (negative values are just out of range), the 1st plot doesn't work with negative values in the BINs range
probably real world working data won't be <= 0, otherwise keep that in mind
BINs should be aligned to log scale as well
otherwise the 'BINs widths' distribution looks off
switch # on the plt.hist( statements in the 1st plot section to see the effect)
xdata (not np.log10(xdata)) to be plotted in the histogram
that 'workaround' with plotting np.log10(xdata) probably was the root cause for the misunderstanding in the comments
Code:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42) # just to have repeatable results for the answer
xdata=np.random.normal(5e5,2e5,int(1e4))
# MIN_xdata, MAX_xdata = np.min(xdata), np.max(xdata)
# print(f"{MIN_xdata}, {MAX_xdata}") # note the negative values
# cut off potential negative values (log function isn't defined for <= 0 )
xdata = np.ma.masked_less_equal(xdata, 0)
MIN_xdata, MAX_xdata = np.min(xdata), np.max(xdata)
# print(f"{MIN_xdata}, {MAX_xdata}")
# align the bins to fit a log scale
bins = 100
bins_log_aligned = np.logspace(np.log10(MIN_xdata), np.log10(MAX_xdata), bins)
# 1st plot
plt.hist(xdata, bins = bins_log_aligned) # note: xdata (not np.log10(xdata) )
# plt.hist(xdata, bins = 100)
plt.xscale('log')
plt.show()
# 2nd plot
plt.semilogy(xdata)
plt.show()

Just kept for now for clarification purpose. Will be deleted when the question is revised.
Disclaimer:
As Lucas M. Uriarte already mentioned that isn't an expected way of changing axis ticks.
x axis ticks and labels don't represent the plotted data
You should at least always provide that information along with such a plot.
The plot
From seeing the result I kinda understand where that special plot idea is coming from - still there should be a preferred way (e.g. conversion of the data in advance) to do such a plot instead of 'faking' the axis.
Explanation how that special axis transfer plot is done:
original x-axis is hidden
a twiny axis is added
note that its y-axis is hidden by default, so that doesn't need handling
twiny x-axis is set to log and the 2nd plot y-axis limits are transferred
subplots used to directly transfer the 2nd plot y-axis limits
use variables if you need to stick with your two plots
twiny x-axis is moved from top (twiny default position) to bottom (where the original x-axis was)
Code:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42) # just to have repeatable results for the answer
xdata=np.random.normal(5e5,2e5,int(1e4))
plt.figure()
fig, axs = plt.subplots(2, figsize=(7,10), facecolor=(1, 1, 1))
# 1st plot
axs[0].hist(np.log10(xdata), bins=100) # plot the data on the normal x axis
axs[0].axes.xaxis.set_visible(False) # hide the normal x axis
# 2nd plot
axs[1].semilogy(xdata)
# 1st plot - twin axis
axs0_y_twin = axs[0].twiny() # set a twiny axis, note twiny y axis is hidden by default
axs0_y_twin.set(xscale="log")
# transfer the limits from the 2nd plot y axis to the twin axis
axs0_y_twin.set_xlim(axs[1].get_ylim()[0],
axs[1].get_ylim()[1])
# move the twin x axis from top to bottom
axs0_y_twin.tick_params(axis="x", which="both", bottom=True, top=False,
labelbottom=True, labeltop=False)
# Disclaimer
disclaimer_text = "Disclaimer: x axis ticks and labels don't represent the plotted data"
axs[0].text(0.5,-0.09, disclaimer_text, size=12, ha="center", color="red",
transform=axs[0].transAxes)
plt.tight_layout()
plt.subplots_adjust(hspace=0.2)
plt.show()

How to plot for frequency only?

Question
How can I plot the following scenario, just like shown in the attached image? This is for the purpose of visualising frequency allocation in a network
Scenario
I have a range of frequency values in a list-tuple like so, where the 1st value is the centre frequency, 2nd is total width, 3rd is guard band:
frequencies = [('195.71250000', '59.00000000', '2.50000000'), ('195.78750000', '59.00000000', '2.50000000'), ('195.86250000', '59.00000000', '2.50000000')]
and the range of these values are:
range = [('191.32500000', '196.12500000')]
Note: These are dummy values, the actual data is much larger but follows the same general structure

There are several ways to create this plot. One way is to use ax.vlines to plot the dashed lines for the frequencies and to use ax.bar for the rectangles representing the frequency ranges.
Here is an example where the frequencies are occupied at regular intervals within the range you have given (boundaries included) but with widths of randomly varying size. No guards are computed seeing as they should be automatically apparent thanks to the position of the frequencies and the widths, as far as I understand.
Also, the widths are much smaller compared to the sample data you have provided, else the bars will be very wide and will all overlap with one another, which would look very different from the image you have shared.
import numpy as np # v 1.19.2
import matplotlib.pyplot as plt # v 3.3.2
# Create sample dataset
rng = np.random.default_rng(seed=1) # random number generator
frequencies = np.arange(191.325, 196.125, step=0.3)
widths = rng.uniform(0.05, 0.25, size=frequencies.size)
# Create figure with single Axes and loop through frequencies and widths to plot
# vertical dashed lines for the frequencies and bars for the widths
fig, ax = plt.subplots(figsize=(10,3))
for freq, width in zip(frequencies, widths):
ax.vlines(x=freq, ymin=0, ymax=10, colors='tab:blue', linestyle='--', zorder=1)
ax.bar(x=freq, height=6, width=width, color='tab:blue', zorder=2)
# Additional formatting
ax.set_xlabel('Frequency (THZ)', labelpad=15, size=12)
ax.set_xticks(frequencies[::2])
ax.yaxis.set_visible(False)
for spine in ['top', 'left', 'right']:
ax.spines[spine].set_visible(False)
plt.show()

Change colour scheme label to log scale without changing the axis in matplotlib

I am quite new to python programming. I have a script with me that plots out a heat map using matplotlib. Range of X-axis value = (-180 to +180) and Y-axis value =(0 to 180). The 2D heatmap colours areas in Rainbow according to the number of points occuring in a specified area in the x-y graph (defined by the 'bin' (see below)).
In this case, x = values_Rot and y = values_Tilt (see below for code).
As of now, this script colours the 2D-heatmap in the linear scale. How do I change this script such that it colours the heatmap in the log scale? Please note that I only want to change the heatmap colouring scheme to log-scale, i.e. only the number of points in a specified area. The x and y-axis stay the same in linear scale (not in logscale).
A portion of the code is here.
rot_number = get_header_number(headers, AngleRot)
tilt_number = get_header_number(headers, AngleTilt)
psi_number = get_header_number(headers, AnglePsi)
values_Rot = []
values_Tilt = []
values_Psi = []
for line in data:
try:
values_Rot.append(float(line.split()[rot_number]))
values_Tilt.append(float(line.split()[tilt_number]))
values_Psi.append(float(line.split()[psi_number]))
except:
print ('This line didnt work, it may just be a blank space. The line is:' + line)
# Change the values here if you want to plot something else, such as psi.
# You can also change how the data is binned here.
plt.hist2d(values_Rot, values_Tilt, bins=25,)
plt.colorbar()
plt.show()
plt.savefig('name_of_output.png')

You can use a LogNorm for the colors, using plt.hist2d(...., norm=LogNorm()). Here is a comparison.
To have the ticks in base 2, the developers suggest adding the base to the LogLocator and the LogFormatter. As in this case the LogFormatter seems to write the numbers with one decimal (.0), a StrMethodFormatter can be used to show the number without decimals. Depending on the range of numbers, sometimes the minor ticks (shorter marker lines) also get a string, which can be suppressed assigning a NullFormatter for the minor colorbar ticks.
Note that base 2 and base 10 define exactly the same color transformation. The position and the labels of the ticks are different. The example below creates two colorbars to demonstrate the different look.
import matplotlib.pyplot as plt
from matplotlib.ticker import NullFormatter, StrMethodFormatter, LogLocator
from matplotlib.colors import LogNorm
import numpy as np
from copy import copy
# create some toy data for a standalone example
values_Rot = np.random.randn(100, 10).cumsum(axis=1).ravel()
values_Tilt = np.random.randn(100, 10).cumsum(axis=1).ravel()
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15, 4))
cmap = copy(plt.get_cmap('hot'))
cmap.set_bad(cmap(0))
_, _, _, img1 = ax1.hist2d(values_Rot, values_Tilt, bins=40, cmap='hot')
ax1.set_title('Linear norm for the colors')
fig.colorbar(img1, ax=ax1)
_, _, _, img2 = ax2.hist2d(values_Rot, values_Tilt, bins=40, cmap=cmap, norm=LogNorm())
ax2.set_title('Logarithmic norm for the colors')
fig.colorbar(img2, ax=ax2) # default log 10 colorbar
cbar2 = fig.colorbar(img2, ax=ax2) # log 2 colorbar
cbar2.ax.yaxis.set_major_locator(LogLocator(base=2))
cbar2.ax.yaxis.set_major_formatter(StrMethodFormatter('{x:.0f}'))
cbar2.ax.yaxis.set_minor_formatter(NullFormatter())
plt.show()
Note that log(0) is minus infinity. Therefore, the zero values in the left plot (darkest color) are left empty (white background) on the plot with the logarithmic color values. If you just want to use the lowest color for these zeros, you need to set a 'bad' color. In order not the change a standard colormap, the latest matplotlib versions wants you to first make a copy of the colormap.
PS: When calling plt.savefig() it is important to call it before plt.show() because plt.show() clears the plot.
Also, try to avoid the 'jet' colormap, as it has a bright yellow region which is not at the extreme. It may look nice, but can be very misleading. This blog article contains a thorough explanation. The matplotlib documentation contains an overview of available colormaps.
Note that to compare two plots, plt.subplots() needs to be used, and instead of plt.hist2d, ax.hist2d is needed (see this post). Also, with two colorbars, the elements on which the colorbars are based need to be given as parameter. A minimal change to your code would look like:
from matplotlib.ticker import NullFormatter, StrMethodFormatter, LogLocator
from matplotlib.colors import LogNorm
from matplotlib import pyplot as plt
from copy import copy
# ...
# reading the data as before
cmap = copy(plt.get_cmap('magma'))
cmap.set_bad(cmap(0))
plt.hist2d(values_Rot, values_Tilt, bins=25, cmap=cmap, norm=LogNorm())
cbar = plt.colorbar()
cbar.ax.yaxis.set_major_locator(LogLocator(base=2))
cbar.ax.yaxis.set_major_formatter(StrMethodFormatter('{x:.0f}'))
cbar.ax.yaxis.set_minor_formatter(NullFormatter())
plt.savefig('name_of_output.png') # needs to be called prior to plt.show()
plt.show()

How to plot discrete lines instead of bars?

I have plotted a histogram. But I want to plot discrete lines instead of three bars. Is there any way to do that?
import matplotlib.pyplot as plt
w1 = [-2,-2,-2,-2,0,0,0,1,1,1,1,1,1]
n,bins,patches = plt.hist(w1,bins=10)
plt.xlabel("bins")
plt.ylabel("counts")
plt.show()

If you just want to plot bars with smaller width
Use the argument rwidth, for relative width of each histogram bar compared to the bin size. Experiment different values for different visual results. Example:
w1=[-2,-2,-2,-2,0,0,0,1,1,1,1,1,1]
n,bins,patches=plt.hist(w1,bins=10, rwidth=0.1)
plt.xlabel("bins")
plt.ylabel("counts")
plt.show()
If you actually want to plot lines instead of bars
Loop over each value inside w1 and call plt.plot on a line from XY (value, 0) to XY (value, number of times value appears in w1). Example:
for value in w1:
plt.plot([value, value], [0, w1.count(value)], color='b')
plt.show()
Note that I've used the argument color='b' so that matplotlib wouldn't make different colors for each line. Also, by default matplotlib adds some whitespace to surrounding lines when we call plt.plot, so you may want to call plt.ylim(bottom=0), so that the bars do not appear to "float" above the plot.

Inside of the plt.hist(...) ad a variable: rwidth (relative width) with a value bellow 1, that way you will get bars with lower width.
Read more about that here: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html#matplotlib.pyplot.hist

I would propose to use a stem plot with the counts of the unique elements of the data. Of course this only makes sense for discrete data.
import numpy as np
import matplotlib.pyplot as plt
w1 = [-2,-2,-2,-2,0,0,0,1,1,1,1,1,1]
u, c = np.unique(w1, return_counts=True)
plt.stem(u,c, use_line_collection=True, basefmt="none")
plt.ylim(0,None)
plt.xlabel("bins")
plt.ylabel("counts")
plt.show()

Add padding between bars and Y-Axis

I am building a bar chart using matplotlib using the code below. When my first or last column of data is 0, my first column is wedged against the Y-axis.
An example of this. Note that the first column is ON the x=0 point.
If I have data in this column, I get a huge padding between the Y-Axis and the first column as seen here. Note the additional bar, now at X=0. This effect is repeated if I have data in my last column as well.
My code is as follows:
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import MultipleLocator
binVals = [0,5531608,6475325,1311915,223000,609638,291151,449434,1398731,2516755,3035532,2976924,2695079,1822865,1347155,304911,3562,157,5,0,0,0,0,0,0,0,0]
binTot = sum(binVals)
binNorm = []
for v in range(len(binVals)):
binNorm.append(float(binVals[v])/binTot)
fig = plt.figure(figsize=(6,4))
ax1 = fig.add_subplot(1,1,1)
ax1.bar(range(len(binNorm)),binNorm,align='center', label='Values')
plt.legend(loc=1)
plt.title("Demo Histogram")
plt.xlabel("Value")
plt.xticks(range(len(binLabels)),binLabels,rotation='vertical')
plt.grid(b=True, which='major', color='grey', linestyle='--', alpha=0.35)
ax1.xaxis.grid(False)
plt.ylabel("% of Count")
plt.subplots_adjust(bottom=0.15)
plt.tight_layout()
plt.show()
How can I set a constant margin between the Y-axis and my first/last bar?
Additionally, I realize it's labeled "Demo Histogram", that is a because I missed it when correcting problems discussed here.

I can't run the code snippet you gave, and even with some modification I couldn't replicate the big space. Aside from that, if you need to enforce a border to matplotlib, you ca do somthing like this:
ax.set_xlim( min(your_data) - 10, None )
The first term tells the axis to put the border at 10 units of distance from the minimum of your data, the None parameter teels it to keep the present value.
to put it into contest:
from collections import Counter
from pylab import *
data = randint(20,size=1000)
res = Counter(data)
vals = arange(20)
ax = gca()
ax.bar(vals-0.4, [ res[i] for i in vals ], width=0.8)
ax.set_xlim( min(data)-1, None )
show()
searching around stackoverflow I just learned a new trick: you can call
ax.margins( margin_you_desire )
to let automatically let matplotlib put that amount of space around your plot. It can also be configured differently between x and y.
In your case the best solution would be something like
ax.margins(0.01, None)
The little catch is that the unit is in axes unit, referred to the size of you plot, so a margin of 1 will put space around your plot at both sizes big as your present plot

The problem is align='center'. Remove it.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Plotting a cumulative histogram with exported data in Python - python

Related

Set log xticks in matplotlib for a linear plot

How to plot for frequency only?

Change colour scheme label to log scale without changing the axis in matplotlib

How to plot discrete lines instead of bars?

Add padding between bars and Y-Axis

Categories

Resources