For an assignment in class, I made a number generator that writes the output into a file. I need to read this file and make a histogram showing how many values are in specific ranges like 0-9, 10-19, 20-29, etc. The problem I'm running into is that I can't change the x-values on the bottom of the graph. I want it to be the same as 0-9, 10-19, etc but I can't find anything online that'll let me change it to that.
This is my code
import os
from matplotlib import pyplot as plt
os.chdir("/Users/elian/Desktop/School/Scripting/Week 7")
with open("Random.txt", "r") as file:
    values = file.read()
# The with block closes the file automatically, so no file.close() is needed.
contents = values.split()
mapped_contents = map(int, contents)
contents = list(mapped_contents)
# histogram = plt.hist(contents, bins=binset, )
plt.hist(contents, bins=10)
plt.ylabel("Amount")
plt.xlabel("Range")
plt.title("Random Number Generator Histogram")
plt.show()
This is my first time working with matplotlib so I apologize if I'm not explaining everything right. I just want the histogram to show the amount of numbers that are in a specific range. I've tried range and xlim but I still run into the problem where the x-axis increments in 20s.
The command for placing values on the axis is plt.xticks(ticks, labels). However, in case of a histogram, the values are automatically arranged as per the data and bin size. So if you change the bin size, you would be able to see the values accordingly.
You can specify the bin edges directly by passing a list of values to the bins parameter.
Example:
plt.hist(data, bins=[1, 2, 3, 4, 5])
The histogram would look as shown in the linked image (histogram).
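For the 0-9, 10-19, ... ranges from the question, a minimal sketch could look like the following. It assumes the values are integers between 0 and 99; the sample data, the `edges`, `centers` and `labels` names, and the choice to put one tick at the middle of each bin are illustrative, not taken from the original code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view the plot with plt.show()
import matplotlib.pyplot as plt

# Made-up sample data standing in for the numbers read from Random.txt
values = [3, 7, 12, 15, 19, 24, 31, 38, 45, 99]

# Bin edges 0, 10, ..., 100 give the bins 0-9, 10-19, ..., 90-99 for integer data
edges = list(range(0, 101, 10))
counts, _, _ = plt.hist(values, bins=edges, edgecolor="black")

# One tick in the middle of each bin, labelled with the range it covers
centers = [lo + 5 for lo in edges[:-1]]
labels = [f"{lo}-{lo + 9}" for lo in edges[:-1]]
plt.xticks(centers, labels, rotation=45)

plt.ylabel("Amount")
plt.xlabel("Range")
plt.title("Random Number Generator Histogram")
```

Passing explicit edges fixes where the bars start and end, and plt.xticks(ticks, labels) then replaces the automatic 0, 20, 40, ... ticks with the range labels.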
This is a bit of an odd problem I've encountered. I'm trying to read data from a CSV file in Python, and have the two resulting lines be inside of the same box, with different scales so they're both clear to read.
The CSV file looks like this:
Date,difference,current
11/19/20, 0, 606771
11/20/20, 14612, 621383
and the code looks like this:
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')
time = data['Date']  # the CSV column is named 'Date', not 'Time'
ycurr = data['current']
ydif = data['difference']
fig, ax = plt.subplots()
line1, = ax.plot(time, ycurr, label='Current total')
line1.set_dashes([2, 2, 10, 2]) # 2pt line, 2pt break, 10pt line, 2pt break
line2, = ax.twinx().plot(time, ydif, dashes=[6, 2], label='Difference')
ax.legend()
plt.show()
I can display the graphs with the X-axis having Date values and Y-axis having difference values or current values just fine.
However, when I attempt to use subplots() and use the twinx() attribute with the second line, I can only see one of two lines.
I initially thought this might be a formatting issue in my code, so I changed the second line to use ax2 = ax1.twinx() and plotted through ax2, but the result stayed the same. I suspect this might be an issue with reading in the CSV data. I tried plotting x = np.linspace(0, 10, 500), y = np.sin(x), y2 = np.sin(x - 0.05) instead, and that worked:
Everything is working as expected but probably not how you want it to work!
Each line consists of only two data points, which gives you a straight line. Both lines share the same x-coordinates, while the y-axis is scaled separately for each plot. And here comes the problem: each axis is scaled to fit its own data in the same way, so the two lines lie exactly on top of each other. It is difficult to see because both lines are dashed.
You can see what is going on by changing the colors of the line. For example add color='C1' to one of the curves.
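A small sketch of that idea, using the two rows from the question typed in by hand instead of read from the CSV. The `ax1`/`ax2` names and the combined legend are my additions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view the plot with plt.show()
import matplotlib.pyplot as plt

# The two rows from the question's CSV, entered directly
dates = ["11/19/20", "11/20/20"]
current = [606771, 621383]
difference = [0, 14612]

fig, ax1 = plt.subplots()
ax2 = ax1.twinx()  # second y-axis sharing the same x-axis

# Different colors make the two overlapping lines distinguishable
line1, = ax1.plot(dates, current, color="C0", dashes=[2, 2, 10, 2], label="Current total")
line2, = ax2.plot(dates, difference, color="C1", dashes=[6, 2], label="Difference")

# ax1.legend() alone would only list line1, so pass both handles explicitly
ax1.legend(handles=[line1, line2])
```

With distinct colors you should see both dashed patterns drawn over the same diagonal, which confirms that nothing is missing from the plot.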
By the way, what do you want to show with your plot? A curve consisting of two data points mostly doesn't show much, and you are better off just showing the values directly instead.
I have a time series with many columns, stored in a dataframe called cum_returns. I am currently plotting all of the columns using cum_returns.plot().
Let's say I want to make the lines for columns A, C and F darker (or rather increase the line width for those 3 time series). Is there an easy way to do that?
If you are using cum_returns.plot() to plot everything, you can access each of the lines on the plot using:
import matplotlib.pyplot as plt
lines = plt.gca().lines
Then find out which line you want to edit, i.e. column 'A' is lines[0], 'B' is lines[1], etc.
Then change the line width using:
lines[x].set_linewidth(width) #x is the index of the line you want to edit, width is the new width
You can also call dir(lines[x]) to get a full list of the things you can do with it.
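If you would rather not count indices, a possible variation is to pick the lines by their label, since pandas labels each line with its column name. This is only a sketch: the random data, the six column names A to F and the width of 3.0 are assumptions made for the example.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view the plot with plt.show()
import numpy as np
import pandas as pd

# Invented stand-in for the cum_returns dataframe from the question
rng = np.random.default_rng(0)
cum_returns = pd.DataFrame(rng.normal(size=(50, 6)).cumsum(axis=0),
                           columns=list("ABCDEF"))

ax = cum_returns.plot()  # DataFrame.plot returns the Axes it drew on

# Thicken only the lines whose label matches one of the chosen columns
for line in ax.get_lines():
    if line.get_label() in {"A", "C", "F"}:
        line.set_linewidth(3.0)

ax.legend()  # rebuild the legend so it picks up the new line widths
```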
I have the following simple python script.
import matplotlib.pyplot as plt
import random
size_of_numbers_to_be_plotted = 200
my_array = []
i = 0
while i < size_of_numbers_to_be_plotted:
    my_array.append(random.randint(1, 10))
    i += 1
plt.plot(my_array)
plt.ylabel('Random numbers')
plt.show()
It simply plots an array of numbers, each a random value between 1 and 10.
The plot looks okay with a few numbers in the array. For example, if size_of_numbers_to_be_plotted is set to 200, the plot is spaced out enough to be readable.
But it becomes pretty cramped if size_of_numbers_to_be_plotted is set to 2000.
Question:
How can I make the plot leave some space between the plotted numbers instead of cramming them all together?
I guess that might require the plot to be scrollable, since 2000 numbers would not all fit in a horizontal length covering my whole screen?
If you create the figure yourself before calling plt.plot(), you can pass a figsize parameter that increases the size of the resulting figure.
Example:
import matplotlib.pyplot as plt
fig = plt.figure(num=None, figsize=(8, 6), facecolor='w', edgecolor='k')
You can play with these settings until you get the size you like.
Usually, however, when dealing with so many data points, you are not actually interested in the individual points, but rather in the pattern. Of course, I don't know your specific use case, but that may be something to think about. I hope this is what you were looking for.
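Applied to the script from the question, a rough sketch might look like this. The 40 by 4 inch size is an arbitrary choice for illustration; figsize is given in inches, so at 100 dpi this yields an image about 4000 pixels wide, roughly 2 pixels per plotted number.

```python
import random
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view the plot with plt.show()
import matplotlib.pyplot as plt

size_of_numbers_to_be_plotted = 2000
my_array = [random.randint(1, 10) for _ in range(size_of_numbers_to_be_plotted)]

# A very wide, short figure gives each point more horizontal room
fig, ax = plt.subplots(figsize=(40, 4))
ax.plot(my_array)
ax.set_ylabel('Random numbers')
```

Saving the result with fig.savefig(...) and opening it in an image viewer lets you scroll or zoom through the wide image, which is one way around the interactive window being limited to the screen width.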
I have a CDF plot of wifi usage data in MB. To make it easier to read, I would like the x-axis to run from KB up to TB. How can I set specific x-axis ticks to replace the ones produced by plt.plot(), for example [1KB 10KB 1MB 10MB 1TB 10TB], even if the spacing between the ticks does not reflect the real values?
My code for now:
wifi = np.sort(matrix[matrix['wifi_total_mb']>0]['wifi_total_mb'].values)
g = sns.distplot(wifi, kde_kws=dict(cumulative=True))
plt.show()
Thanks
EDIT 1
I know that I can use plt.xticks and I already tried it: plt.xticks([0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]). These are values in MB that represent the sample range I specified above. But the plot is still wrong.
The expected result
In Excel it is pretty easy to do what I want. Look at the image: with the same range I get the plot I wanted.
Thanks
It may be better to calculate the data to plot manually, instead of relying on some seaborn helper function like distplot. This also makes it easier to understand the underlying issue of histogramming with very unequal bin sizes.
Calculating histogram
The histogram of the data can be calculated by using np.histogram(). It can take the desired bins as argument.
In order to get the cumulative histogram, np.cumsum does the job.
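As a tiny worked example of these two steps, with made-up numbers and simple integer bin edges so the counts are easy to check by hand:

```python
import numpy as np

# Six invented data points and four bins: [0,1), [1,2), [2,3), [3,4]
data = np.array([0.5, 1.5, 1.7, 2.2, 3.9, 3.1])
hist, edges = np.histogram(data, bins=[0, 1, 2, 3, 4])

# Running total of the counts: how many points fall at or below each bin
cum_hist = np.cumsum(hist)

# Divide by the total so the last value is 1, as in a CDF
norm_cum_hist = cum_hist / cum_hist[-1]

print(hist)      # [1 2 1 2]
print(cum_hist)  # [1 3 4 6]
```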
Now there are two options here: (a) plotting the real data or (b) plotting the data enumerated by bin.
(a) Plotting the real data:
Because the bin sizes are pretty unequal, a logarithmic scaling seems adequate, which can be done by semilogx(x,y). The bin edges can be shown as xticks using set_xticks (and since the semilogx plot will not automatically set the labels correctly, we also need to set them to the bin edges' values).
(b) Plotting data enumerated by bin:
The second option is to plot the histogram values bin by bin, independent of the actual bin size. It is very close to the Excel solution from the question. In this case the x values of the plot are simply values from 0 to the number of bins, and the xticklabels are the bin edges.
Here is the complete code:
import numpy as np
import matplotlib.pyplot as plt
# use the bins from the question
bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]
# invent some data
data = np.random.lognormal(2,4,10000)
# calculate histogram of the data into the given bins
hist, _bins = np.histogram(data, bins=bins)
# make histogram cumulative
cum_hist=np.cumsum(hist)
# normalize data to 1
norm_cum_hist = cum_hist/float(cum_hist.max())
fig, (ax, ax2) = plt.subplots(nrows=2)
plt.subplots_adjust(hspace=0.5, bottom=0.17)
# First option plots the actual data, i.e. the bin width is reflected
# by the spacing between values on x-axis.
ax.set_title("Plotting actual data")
ax.semilogx(bins[1:],norm_cum_hist, marker="s")
ax.set_xticks(bins[1:])
ax.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
# Second option plots the data bin by bin, i.e. every bin has the same width,
# independent of its actual value.
ax2.set_title("Plotting bin by bin")
ax2.plot(range(len(bins[1:])),norm_cum_hist, marker="s")
ax2.set_xticks(range(len(bins[1:])))
ax2.set_xticklabels(bins[1:] ,rotation=45, horizontalalignment="right")
for axes in [ax, ax2]:
    axes.set_ylim([0, 1.05])
plt.show()
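The code above labels the ticks with the raw MB values. To get labels like 1KB or 10TB as asked in the question, one option is a small formatting helper. `format_mb` is a hypothetical name of my own, and it assumes 1024-based units and rounds to three significant digits, so 0.00098 MB is shown as 1KB.

```python
def format_mb(value_mb):
    """Format a size given in MB as KB, MB, GB or TB (1024-based units)."""
    # Units from largest to smallest, each with its size expressed in MB
    units = [("TB", 1024 ** 2), ("GB", 1024), ("MB", 1), ("KB", 1 / 1024)]
    for suffix, factor in units:
        if value_mb >= factor:
            return f"{value_mb / factor:.3g}{suffix}"
    # Anything smaller than 1 KB is still reported in KB
    return f"{value_mb * 1024:.3g}KB"


bins = [0, 0.00098, 0.00977, 1, 10, 1024, 10240, 1048576, 10485760, 24117248]
labels = [format_mb(b) for b in bins[1:]]
print(labels)
```

The resulting list can be passed in place of bins[1:] in the set_xticklabels calls, e.g. ax2.set_xticklabels(labels, rotation=45, horizontalalignment="right").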
I tried searching for something similar, and the closest thing I could find was this which helped me to extract and manipulate the data, but now I can't figure out how to re-plot the histogram. I have some array of voltages, and I have first plotted a histogram of occurrences of those voltages. I want to instead make a histogram of events per hour ( so the y-axis of a normal histogram divided by the number of hours I took data ) and then re-plot the histogram with the manipulated y data.
I have an array which contains the number of events per hour ( composed of the original y axis from pyplot.hist divided by the number of hours data was taken ), and the bins from the histogram. I have composed that array using the following code ( taken from the answer linked above ):
import numpy
import matplotlib.pyplot as pyplot
mydata = numpy.random.normal(-15, 1, 500) # this seems to have to be 'uneven' on either side of 0, otherwise the code looks fine. FYI, my actual data is all positive
pyplot.figure(1)
hist1 = pyplot.hist(mydata, bins=50, alpha=0.5, label='set 1', color='red')
hist1_flux = [hist1[0]/5.0, 0.5*(hist1[1][1:]+hist1[1][:-1])]
pyplot.figure(2)
pyplot.bar(hist1_flux[1], hist1_flux[0])
This code doesn't exactly match what's going on in my code; my data is composed of 1000 arrays of 1000 data points each ( voltages ). I have made histograms of that, which gives me number of occurrences of a given voltage range ( or bin width ). All I want to do is re-plot a histogram of the number of events per hour (so yaxis of the histogram / 5 hours) with the same original bin width, but when I divide hist1[0]/5 and replot in the above way, the 'bin width' is all wrong.
I feel like there must be an easier way to do this, rather than manually replotting my own histograms.
Thanks in advance, and I'm really sorry if I've missed something obvious.
The problem, illustrated in the output of my sample code AND my original data is as follows:
Upper plots: code snippet output.
Lower plots: My actual data.
It's because the bar function takes a width argument, which defaults to 0.8 (plt.bar(x, height, width=0.8, bottom=None, **kwargs); older matplotlib versions call the first parameter left), so you need to set it to the bin width, i.e. the distance between two adjacent bar centers:
pyplot.bar(hist1_flux[1], hist1_flux[0],
           width=hist1_flux[1][1] - hist1_flux[1][0])
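If the goal is only to rescale the y-axis, another possibility is to skip the manual re-plot and let hist do the division through its weights parameter. This is a sketch under the question's assumption of 5 hours of data; the `hours` and `weights` names are mine.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to view the plot with pyplot.show()
import numpy
import matplotlib.pyplot as pyplot

hours = 5.0  # assumed measurement time, as in the question
mydata = numpy.random.normal(-15, 1, 500)

# Each event contributes 1/hours instead of 1, so bar heights are events per hour
weights = numpy.full(mydata.shape, 1.0 / hours)
rates, edges, _ = pyplot.hist(mydata, bins=50, weights=weights,
                              alpha=0.5, label='set 1', color='red')
```

Because hist draws the bars itself, the bin widths stay exactly as in the original histogram and only the heights change.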