I have a dataset from my simulations where I combine the results from each simulation seed into a bigger list using bl.extend(df['column'].tolist()).
I'm also running several simulation scenarios, so I append each scenario to a list of lists.
Finally, I compute the probability mass function (PMF) of each list as follows (adapted from "How to plot a PMF of a sample?"):
for idx, sublist in enumerate(pmf_list):
    val, cnt = np.unique(sublist, return_counts=True)
    pmf = cnt / float(len(sublist))
    plot_pmf.append(np.column_stack((val, pmf)))
The issue is that I end up with a list of numpy arrays which I don't know how to plot. The minimum code to reproduce the problem is the following:
import numpy as np
list1 = np.empty([2, 2])
list2 = np.empty([2, 2])
list3 = np.empty([2, 2])
bl = [] # big list
bl.append(list1)
bl.append(list2)
bl.append(list3)
print(bl)
I can plot using plt.hist(bl[0]) but it doesn't give me the right results. See plot attached for the following list.
<type 'numpy.ndarray'>
[[0.00000000e+00 1.91734780e-01]
[1.00000000e+00 2.94277080e-02]
[2.00000000e+00 3.28276369e-01]
[3.00000000e+00 4.43357154e-01]
[4.00000000e+00 3.54294582e-03]
[5.00000000e+00 1.57306794e-03]
[6.00000000e+00 2.00530733e-03]
[7.00000000e+00 2.95245485e-05]
[8.00000000e+00 2.24386568e-05]
[9.00000000e+00 2.83435665e-05]
[1.00000000e+01 1.18098194e-06]
[1.20000000e+01 1.18098194e-06]]
Formatting the y-values I get:
0.1944084241
0.0415880165
0.3480178394
0.4031723062
0.0050902199
0.0033411939
0.0040175705
0.0001480127
0.0001031961
0.0001008373
0.0000058969
0.0000011794
0.0000047175
0.0000005897
These are very different from the y-values on the histogram plot.
Does the following graph look right?
import matplotlib.pyplot as plt
import numpy as np
X = np.array([[0.00000000e+00, 1.91734780e-01],
[1.00000000e+00, 2.94277080e-02],
[2.00000000e+00, 3.28276369e-01],
[3.00000000e+00, 4.43357154e-01],
[4.00000000e+00, 3.54294582e-03],
[5.00000000e+00, 1.57306794e-03],
[6.00000000e+00, 2.00530733e-03],
[7.00000000e+00, 2.95245485e-05],
[8.00000000e+00, 2.24386568e-05],
[9.00000000e+00, 2.83435665e-05],
[1.00000000e+01, 1.18098194e-06],
[1.20000000e+01, 1.18098194e-06],])
plt.bar(x=X[:, 0], height=X[:, 1])
plt.show()
If you already have the first column as the possible values of the random variable, and the second column as the corresponding probability values, you could use a bar plot to visualize the PMF.
The histogram plot function plt.hist is for a vector of observed values. For example,
import matplotlib.pyplot as plt
import numpy as np
# %matplotlib inline  (IPython magic; uncomment in a Jupyter notebook)
np.random.seed(0)
plt.hist(np.random.normal(size=1000))
plt.show()
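For the original setup (one list of raw samples per scenario), the PMF computation and the bar plot can be combined. The following is a minimal sketch, not the asker's actual pipeline: the `scenarios` data and the helper name `pmf_of` are made up for illustration, and the bars are offset side by side only so that the scenarios stay distinguishable.

```python
import numpy as np
import matplotlib.pyplot as plt

def pmf_of(samples):
    """Return (values, probabilities) for a 1-D sequence of discrete samples."""
    values, counts = np.unique(samples, return_counts=True)
    return values, counts / counts.sum()

# Hypothetical data: one list of raw samples per scenario
scenarios = [[0, 0, 1, 2, 2, 2, 3], [1, 1, 2, 3, 3, 3, 3, 4]]

fig, ax = plt.subplots()
width = 0.8 / len(scenarios)  # share each unit slot between the scenarios
pmfs = []
for idx, samples in enumerate(scenarios):
    values, probs = pmf_of(samples)
    pmfs.append(np.column_stack((values, probs)))  # same layout as in the question
    ax.bar(values + idx * width, probs, width=width, label=f"scenario {idx}")
ax.set_xlabel("value")
ax.set_ylabel("probability")
ax.legend()
```

Call `plt.show()` afterwards to display the figure. Each entry of `pmfs` is a two-column array (value, probability), so `plt.bar(arr[:, 0], arr[:, 1])` works on any of them individually.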
Update: I removed the screenshot. Below is the code it contained:
import numpy as np
import matplotlib.pyplot as plt
x = np.arange(5)
y = np.array([[1,2],[1,2,3],[1,2,3,4],[1,2,3,4,5],[1,2,5,7,9]])
plt.plot(x,y) #gives ValueError: setting an array element with a sequence.
# A realistic example
age = [20,30,40,50,60]
salary = np.array([[200,350,414],[300,500,612,700],[500,819],[900,1012],[812,712]])
plt.plot(age,salary) #gives ValueError: setting an array element with a sequence.
I have two arrays, each of size 5. The elements of y are themselves arrays, and I want to plot all of them against the corresponding x. For example, at x = 0 I want to plot all the points from y[0]. Is there a way to do this?
Update: I added another example above to show a realistic case, where I need to plot the salaries of people of different ages; each age group can have more than one salary.
List comprehension to the rescue!
import numpy as np
import matplotlib.pyplot as plt
age = [20,30,40,50,60]
# dtype=object is needed for ragged rows in recent NumPy versions
salary = np.array([[200,350,414],[300,500,612,700],[500,819],[900,1012],[812,712]], dtype=object)
#creating x-y tuples
xy = [(k, j) for i, k in enumerate(age) for j in salary[i]]
#unpacking the tuples with zip
plt.scatter(*zip(*xy))
plt.show()
However, irregular (ragged) NumPy arrays are best avoided, and this example works perfectly well with a plain list of lists.
As of now I am using the following workaround, but looking for a simpler solution:
indx = -1
for a in age:
    indx+=1
    for s in salary[indx]:
        plt.plot(a,s,'o')
plt.show()
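The nested loop above can also be replaced by flattening the data once and making a single scatter call. This is a sketch of one possible approach using the same `age` and `salary` values, with `salary` kept as a plain list of lists; the names `x` and `y` are only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

age = [20, 30, 40, 50, 60]
salary = [[200, 350, 414], [300, 500, 612, 700], [500, 819], [900, 1012], [812, 712]]

# Repeat each age once per salary in its group, then flatten the salaries
x = np.repeat(age, [len(group) for group in salary])
y = np.concatenate(salary)

fig, ax = plt.subplots()
ax.scatter(x, y)  # one call instead of one plot call per point
```

Call `plt.show()` to display the result. Because `x` and `y` are flat arrays of equal length, they can be passed to `plt.plot(x, y, 'o')` as well.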
I have 136 numbers that form an overlapping mixture of 8 Gaussian distributions. I want to find the mean and variance of each Gaussian. Can you find any mistakes in my code?
file = open("1.txt", 'r')  # 1.txt holds comma-separated values such as 0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194...
y = [int(i) for i in file.read().split(',')]  # parse the file into a list of integers
x = list(range(1, len(y) + 1))  # x values 1..len(y)
z = list(zip(x, y))  # pairs (1, 0), (2, 0), ...
This gives a list z of 136 points (x, y) in the xy plane, where the y values are the data from the file.
Now I want to obtain the mean and variance of each Gaussian distribution. The basic assumption is that the given data consists of 8 overlapping Gaussian distributions.
import numpy as np
from sklearn.mixture import GaussianMixture
data = np.array(z).reshape(-1,1)
model = GaussianMixture(n_components=8).fit(data)
print(model.means_)
file.close()
Actually, I don't know how to write the code so that it prints the 8 means and variances. Can anyone help me?
You can use the following. I have written some sample code that also visualizes the result:
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.stats  # import the stats submodule explicitly
import matplotlib.pyplot as plt
# %matplotlib inline  (IPython magic; uncomment in a Jupyter notebook)
#Sample data
x = [0,0,0,0,0,0,1,0,0,1,4,4,6,14,25,43,71,93,123,194]
num_components = 2
#Fit a model onto the data
data = np.array(x).reshape(-1,1)
model = GaussianMixture(n_components=num_components).fit(data)
#Get list of means and variances
mu = np.abs(model.means_.flatten())
sd = np.sqrt(np.abs(model.covariances_.flatten()))
#Plotting
extend_window = 50 # padding around the data range; larger values zoom the graph out
x_values = np.arange(data.min()-extend_window, data.max()+extend_window, 0.1) #For plotting smooth graphs
plt.plot(data, np.zeros(data.shape), linestyle='None', markersize = 10.0, marker='o') #plot the data on x axis
#plot the different distributions (in this case 2 of them)
for i in range(num_components):
    y_values = scipy.stats.norm(mu[i], sd[i])
    plt.plot(x_values, y_values.pdf(x_values))
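To print the fitted parameters themselves, `model.means_` and `model.covariances_` hold one entry per component. The sketch below also rests on an assumption the question does not confirm: that the values in the file are counts per position (a histogram) rather than raw observations. If so, the counts would first need to be expanded into samples, for example with `np.repeat`. The synthetic data and the two-component setup are made up for illustration; for the question's data `n_components` would be 8.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for the file: counts y[i] observed at positions x[i]
rng = np.random.default_rng(0)
raw = np.concatenate([rng.normal(20, 2, 2000), rng.normal(40, 3, 2000)])
x = np.arange(1, 61)
y, _ = np.histogram(raw, bins=np.arange(0.5, 61.5, 1.0))  # unit bins centred on x

# Expand the counts into one sample per observation (assumes y are counts)
samples = np.repeat(x, y).reshape(-1, 1)

model = GaussianMixture(n_components=2, random_state=0).fit(samples)
means = model.means_.ravel()
variances = model.covariances_.ravel()  # 1-D data: one variance per component

# Print the components ordered by mean
order = np.argsort(means)
for k in order:
    print(f"component {k}: mean={means[k]:.2f}, variance={variances[k]:.2f}")
```

Fitting on `np.array(z).reshape(-1, 1)` instead would mix the x positions and the y values into a single column, which is probably not what is intended.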
I have a 2D histogram with 10 bins. I would like to know whether there is a NumPy function (or any faster method) to find which points lie in each bin of the 2D grid. Is there a way to access the bin elements?
I hope this solves your problem. I am new to Python, so others can probably improve my code.
Create a histogram with matplotlib:
import numpy as np
import matplotlib.pyplot as plt
rng = np.random.RandomState(10) # deterministic random data
a = np.hstack((rng.normal(size=100), rng.normal(loc=5, scale=2, size=1000)))
n ,bins ,patches = plt.hist(a, bins=10) # arguments are passed to np.histogram
plt.title("Histogram with '10' bins")
plt.show()
Reshape the arrays and compute the bin index of each value:
newbin = np.repeat(np.reshape(bins,(-1, len(bins))), a.shape, axis=0)
newa = np.repeat(np.reshape(a,(len(a),-1)),len(bins),axis=1)
#index_bin = (np.where(newbin[:,0] >np.reshape(a,(1,-1))[:,0] ) )[0][0]
index_bin = (newbin>newa).argmax(axis=1).T
Test:
print(a[0] , bins)
print(index_bin[0])
Output
1.331586504129518 [-2.13171211 -0.88255884 0.36659444 1.61574771 2.86490098 4.11405425
5.36320753 6.6123608 7.86151407 9.11066734 10.35982062]
3
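For the 2D case asked about in the question, one possible approach is to reuse the bin edges returned by `np.histogram2d` and look up each point's bin with `np.digitize`. This is a sketch under assumptions: the random `pts` array is made up, and the dictionary `members` (bin index pair to list of point indices) is just one convenient way to store the result.

```python
import numpy as np

rng = np.random.default_rng(10)
pts = rng.normal(size=(1000, 2))  # hypothetical 2-D points

counts, xedges, yedges = np.histogram2d(pts[:, 0], pts[:, 1], bins=10)

# np.digitize returns 1-based bin numbers; shift to 0-based and clip so that
# a point sitting exactly on the last edge falls into the last bin
ix = np.clip(np.digitize(pts[:, 0], xedges) - 1, 0, len(xedges) - 2)
iy = np.clip(np.digitize(pts[:, 1], yedges) - 1, 0, len(yedges) - 2)

# Map each (x-bin, y-bin) pair to the indices of the points inside it
members = {}
for k, (i, j) in enumerate(zip(ix, iy)):
    members.setdefault((int(i), int(j)), []).append(k)
```

The points in a given bin are then `pts[members[(i, j)]]`, and bins without any points simply have no key.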
I use the fuzzy-c-means clustering implementation, and I would like the data X to form the number of clusters I define in the algorithm (I believe that is how it works). But the behavior is confusing.
cm = FCM(n_clusters=6)
cm.fit(X)
This code generates a plot with 4 labels - [0,2,4,6]
cm = FCM(n_clusters=4)
cm.fit(X)
This code generates a plot with 4 labels - [0,1,2,3]
I expect labels [0,1,2,3,4,5] when I initialize the number of clusters to 6.
code:
from fcmeans import FCM
from matplotlib import pyplot as plt
from seaborn import scatterplot as scatter
# fit the fuzzy-c-means
fcm = FCM(n_clusters=6)
fcm.fit(X)
# outputs
fcm_centers = fcm.centers
fcm_labels = fcm.u.argmax(axis=1)
# plot result
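# %matplotlib inline  (IPython magic; uncomment in a Jupyter notebook)
f, axes = plt.subplots(1, 2, figsize=(11,5))
# recent seaborn versions require x and y to be passed as keywords
scatter(x=X[:,0], y=X[:,1], ax=axes[0])
scatter(x=X[:,0], y=X[:,1], ax=axes[1], hue=fcm_labels)
scatter(x=fcm_centers[:,0], y=fcm_centers[:,1], ax=axes[1],marker="s",s=200)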
plt.show()
Fuzzy c-means is a fuzzy clustering algorithm.
The labels are only an approximation to the fuzzy assignment.
Most likely two clusters are pretty weak, and hence never win the argmax operation used to produce the labels. That doesn't mean these clusters have not been used, you are just not using the full fuzzy result.
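The effect described above can be illustrated without the library. The membership matrix `u` below is invented for illustration and is not output from fcmeans; it only shows how a cluster can hold a substantial share of the total membership and still never win the argmax.

```python
import numpy as np

# Hypothetical fuzzy membership matrix: rows are points, columns are clusters
u = np.array([[0.40, 0.35, 0.25],
              [0.30, 0.45, 0.25],
              [0.45, 0.30, 0.25],
              [0.25, 0.45, 0.30]])

labels = u.argmax(axis=1)     # hard labels, as computed in the question
used = np.unique(labels)      # clusters that win at least one point
total = u.sum(axis=0)         # total membership mass per cluster

print("hard labels:", labels)
print("clusters appearing in labels:", used)
print("membership mass per cluster:", total)
```

Here the third cluster never appears in `labels` even though it carries a nonzero share of every point. Checking `np.unique(fcm_labels)` and the column sums of `fcm.u` on the real data would show whether the same thing is happening there.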
I'm using fuzzy-c-means version 1.7.0:
>>> import fcmeans
>>> fcmeans.__version__
'1.7.0'
Using the Iris dataset:
>>> from sklearn.datasets import load_iris
>>> iris = load_iris().data
>>> model = fcmeans.FCM(n_clusters = 2)
>>> model.fit(iris)
>>> pred = model.predict(iris)
>>> from collections import Counter
>>> Counter(pred)
Counter({0: 97, 1: 53})
So n_clusters is applied correctly.
I read about it, and it looks like once the algorithm reaches the knee point (the maximum number of clusters it can form with the data), it won't create any more than that. So in my question, 4 was the maximum number of clusters the algorithm could form with the given dataset.
I need to create a histogram of a very large data set in python 3. However, I cannot use a list to create a histogram because the list would be too large given my data. I need a way to update a histogram as each data point is created. That way my computer is only ever dealing with a single point and updating the plot.
I've been using matplotlib. Tried plt.draw() but couldn't get it to work. (See code below)
#Proof of concept code
l = [1, 2, 3, 2, 3, 2]
n = 0
p = False
for x in range(0,6):
    n = l[x]
    if p == False:
        fig = plt.hist(n)
        p = True
    else:
        plt.draw()
I need a plot that looks like plt.hist(l), but I only ever get the first point plotted.
Are you familiar with Numpy? Numpy handles large arrays pretty well.
Here's an example using a random integer set from 1 to 3 (inclusive).
import matplotlib.pyplot as plt
import numpy as np
arr_random = np.random.randint(1,4,10000)
plt.hist(arr_random)
plt.show()
It's very efficient to compute plt.hist() with Numpy arrays.
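If the data really cannot be held in memory at once, another option is to keep only the bin counts and update them as values arrive. The sketch below assumes the bin edges can be fixed in advance; the generator `stream` is a hypothetical stand-in for whatever produces the data points.

```python
import numpy as np

# Bin edges must be chosen up front; these give unit bins centred on 1, 2, 3
edges = np.arange(0.5, 4.5, 1.0)
counts = np.zeros(len(edges) - 1, dtype=np.int64)

def stream():
    """Hypothetical data source yielding one value at a time."""
    yield from [1, 2, 3, 2, 3, 2]

for value in stream():
    # Histogram the new value with the fixed edges and accumulate the counts
    new_counts, _ = np.histogram([value], bins=edges)
    counts += new_counts

print(counts)
```

Only `counts` and `edges` stay in memory. The accumulated histogram can be drawn at any point with `plt.bar(edges[:-1], counts, width=np.diff(edges), align='edge')`. Passing larger chunks to `np.histogram` instead of single values works the same way and is usually much faster.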