normalization of values in python np array gone wrong?

normalization of values in python np array gone wrong? - python

I have a matrix of floats shaped (3000, 9).
Across 1 line, there is 1 ''simulation''.
Across columns, for a fixed line, there's the contents of the ''simulation''.
I want that for each simulation, the first 8 columns to be normalized to the sum of the 8 first columns.
That is, the first column's entry (for one fixed line) to become what was before, over the sum of the first 8 columns (for that same fixed line).
A trivial task, but I get from a nice, correct, graph (non-normalized), something totally unphysical when plotting with plt.scatter.
The last column of each line is what we are going to use for the x-axis to plot the first 8 columns (the y values).
So one line will represent 8 datapoints for 1 fixed value of x.
The non-normalized graph:
https://ibb.co/Msr8RVB
The normalized graph:
https://ibb.co/tJp7bZn
The datasets:
non-normalized: https://easyupload.io/oat9kq
My code:
import numpy as np
from matplotlib import pyplot as plt
non_norm = np.loadtxt("integration_results_3000samples_10_20_10_25_Wcm2_BenSimulationFromSlack.txt")
plt.figure()
for i in range(non_norm.shape[1]-1):
plt.scatter(non_norm[:, -1], non_norm[:, i], label="c_{}".format(i+47))
plt.xscale("log")
plt.savefig("non-norm_Ben3000samples.pdf", bbox_inches='tight')
norm = np.empty( (non_norm.shape[0], non_norm.shape[1]) )
norm[:, -1] = non_norm[:, -1]
for i in range(norm.shape[1]-1):
for j in range(norm.shape[0]):
norm[j, i] = np.true_divide(non_norm[j, i] , np.sum(non_norm[j, :-1]))
plt.figure()
for i in range(norm.shape[1]-1):
plt.scatter(norm[:, -1], norm[:, i], label="c_{}".format(i+47))
plt.xscale("log")
plt.savefig("norm_Ben3000samples.pdf", bbox_inches='tight')
Do you see what went wrong?
Thank you

When you're normalising a row that has just one value and 7 zeroes, the value becomes 1 and the rest of the row is 0? This is likely why your plot is messing up.
For example, the plot for the first column looks like this before and after normalization:

Related

Create a scatter plot from an ndarray using the position in the row as the x-axis value and the value in the array for the y-axis

much like the title says I am trying to create a graph that shows 1-6 on the x-axis (the values position in the row of the array) and its value on the y-axis. A snippet from the array is shown below, with each column representing a coefficient number from 1-6.
[0.99105 0.96213 0.96864 0.96833 0.96698 0.97381]
[0.99957 0.99709 0.9957 0.9927 0.98492 0.98864]
[0.9967 0.98796 0.9887 0.98613 0.98592 0.99125]
[0.9982 0.99347 0.98943 0.96873 0.91424 0.83831]
[0.9985 0.99585 0.99209 0.98399 0.97253 0.97942]
It's already set up as a numpy array. I think it's relatively straightforward, just drawing a complete mental blank.
Any ideas?

Do you want something like this?
a = np.array([[0.99105, 0.96213, 0.96864, 0.96833, 0.96698, 0.97381],
[0.99957, 0.99709, 0.9957, 0.9927, 0.98492, 0.98864],
[0.9967, 0.98796, 0.9887, 0.98613, 0.98592, 0.99125],
[0.9982, 0.99347, 0.98943, 0.96873, 0.91424, 0.83831],
[0.9985, 0.99585, 0.99209, 0.98399, 0.97253, 0.97942]])
import matplotlib.pyplot as plt
plt.scatter(x=np.tile(np.arange(a.shape[1]), a.shape[0])+1, y=a)
output:
Note that you can emulate the same with groups using:
plt.plot(a.T, marker='o', ls='')
x = np.arange(a.shape[0]+1)
plt.xticks(x, x+1)
output:

How to plot a line between points taken from different rows of 2D array in python?

I have a 2D array of size 2xN call it A and another 1xN call it B. I want to draw a vertical line between elements of the first row and second row (i.e., draw a line between A[0,0] and A[1,0] and the line is on the horizontal axis with value B[0] or say A[0,4] and A[1,4] with horizontal axis value of B[4], etc.) with their value on the horizontal axis being B[corresponding column].

I'll preface by saying I'm not 100% sure what you are asking for. With what I believe you are asking however, this does the trick.
from matplotlib import pyplot as plt
from random import randint
A = []
B = []
n = randint(1,10)
for i in range(n):
A.append([randint(1,10),randint(1,10)])
B.append(randint(1,10))
for i in range(n):
plt.plot(B[i], A[i][0], "ro")
plt.plot(B[i], A[i][1], "ro")
plt.plot((B[i], B[i]), (A[i][0], A[i][1]), "c-")
plt.show()

Plot missing points for complicated 3D list of points - Python

Hi I have a 3D list (I realise this may not be the best representation of my data so any advice here is appreciated) as such:
y_data = [
[[a,0],[b,1],[c,None],[d,6],[e,7]],
[[a,5],[b,2],[c,1],[d,None],[e,1]],
[[a,3],[b,None],[c,4],[d,9],[e,None]],
]
The y-axis data is such that each sublist is a list of values for one hour. The hours are the x-axis data. Each sublist of this has the following format:
[label,value]
So essentially:
line a is [0,5,3] on the y-axis
line b is [1,2,None] on the y-axis etc.
My x-data is:
x_data = [0,1,2,3,4]
Now when I plot this list directly i.e.
for i in range(0,5):
ax.plot(x_data, [row[i][1] for row in y_data], label=y_data[0][i][0])
I get a line graph however where the value is None the point is not drawn and the line not connected.
What I would like to do is to have a graph which will plot my data in it's current format, but ignore missing points and draw a line between the point before the missing data and the point after (i.e. interpolating the missing point).
I tried doing it like this https://stackoverflow.com/a/14399830/1800665 but I couldn't work out how to do this for a 3D list.
Thanks for any help!

The general approach that you linked to will work fine here ; it looks like the question you're asking is how to apply that approach to your data. I'd like to suggest that by factoring out the data you're plotting, you'll see more clearly how to do it.
import numpy as np
y_data = [
[[a,0],[b,1],[c,None],[d,6],[e,7]],
[[a,5],[b,2],[c,1],[d,None],[e,1]],
[[a,3],[b,None],[c,4],[d,9],[e,None]],
]
x_data = [0, 1, 2, 3, 4]
for i in range(5):
xv = []
yv = []
for j, v in enumerate(row[i][1] for row in y_data):
if v is not None:
xv.append(j)
yv.append(v)
ax.plot(xv, yv, label=y_data[0][i][0])
Here instead of using a mask like in the linked question/answer, I've explicitly built up the lists of valid data points that are to be plotted.

plotting high precision data

I have an array which contains error values as a function of two different quantities (alpha and eigRange).
I fill my array like this :
for j in range(n):
for i in range(alphaLen):
alpha = alpha_list[i]
c = train.eig(xt_, yt_,m-j, m,alpha, "cpu")
costListTrain[j, i] = cost.err(xt_, xt_, yt_, c)
normedValues=costListTrain/np.max(costListTrain.ravel())
where
n = 20
alpha_list = [0.0001,0.0003,0.0008,0.001,0.003,0.006,0.01,0.03,0.05]
My costListTrain array contains some values that have very small differences, e.g.:
2.809458902485728 2.809458905776425 2.809458913576337 2.809459011062461
2.030326752376704 2.030329906064879 2.030337351188699 2.030428976282031
1.919840839066182 1.919846470077076 1.919859731440199 1.920021453630778
1.858436351617677 1.858444223016128 1.858462730482461 1.858687054377165
1.475871326997542 1.475901926855846 1.475973476249240 1.476822830933632
1.475775410801635 1.475806023102173 1.475877601316863 1.476727286424228
1.475774284270633 1.475804896751524 1.475876475382906 1.476726165223209
1.463578292548192 1.463611627166494 1.463689466240788 1.464609083309240
1.462859608038034 1.462893157900139 1.462971489632478 1.463896516033939
1.461912706143012 1.461954067956570 1.462047793798572 1.463079574605320
1.450581041157659 1.452770209885761 1.454835202839513 1.459676311335618
1.450581041157643 1.452770209885764 1.454835202839484 1.459676311335624
1.450581041157651 1.452770209885735 1.454835202839484 1.459676311335610
1.450581041157597 1.452770209885784 1.454835202839503 1.459676311335620
1.450581041157575 1.452770209885757 1.454835202839496 1.459676311335619
1.450581041157716 1.452770209885711 1.454835202839499 1.459676311335613
1.450581041157667 1.452770209885744 1.454835202839509 1.459676311335625
1.450581041157649 1.452770209885750 1.454835202839476 1.459676311335617
1.450581041157655 1.452770209885708 1.454835202839442 1.459676311335622
1.450581041157571 1.452770209885700 1.454835202839498 1.459676311335622
as you can here the value are very very close together!
I am trying to plotting this data in a way where I have the two quantities in the x, y axes and the error value is represented by the dot color.
This is how I'm plotting my data:
alpha_list = np.log(alpha_list)
eigenvalues, alphaa = np.meshgrid(eigRange, alpha_list)
vMin = np.min(costListTrain)
vMax = np.max(costListTrain)
plt.scatter(x, y, s=70, c=normedValues, vmin=vMin, vmax=vMax, alpha=0.50)
but the result is not correct.
I tried to normalize my error value by dividing all values by the max, but it didn't work !
The only way that I could make it work (which is incorrect) is to normalize my data in two different ways. One is base on each column (which means factor1 is constant, factor 2 changing), and the other one based on row (means factor 2 is constant and factor one changing). But it doesn't really make sense because I need a single plot to show the tradeoff between the two quantities on the error values.
UPDATE
this is what I mean by last paragraph.
normalizing values base on max on each rows which correspond to eigenvalues:
maxsEigBasedTrain= np.amax(costListTrain.T,1)[:,np.newaxis]
maxsEigBasedTest= np.amax(costListTest.T,1)[:,np.newaxis]
normEigCostTrain=costListTrain.T/maxsEigBasedTrain
normEigCostTest=costListTest.T/maxsEigBasedTest
normalizing values base on max on each column which correspond to alphas:
maxsAlphaBasedTrain= np.amax(costListTrain,1)[:,np.newaxis]
maxsAlphaBasedTest= np.amax(costListTest,1)[:,np.newaxis]
normAlphaCostTrain=costListTrain/maxsAlphaBasedTrain
normAlphaCostTest=costListTest/maxsAlphaBasedTest
plot 1:
where no. eigenvalue = 10 and alpha changes (should correspond to column 10 of plot 1) :
where alpha = 0.0001 and eigenvalues change (should correspond to first row of plot1)
but as you can see the results are different from plot 1!
UPDATE:
just to clarify more stuff this is how I read my data:
from sklearn.datasets.samples_generator import make_regression
rng = np.random.RandomState(0)
diabetes = datasets.load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target
X_diabetes=np.c_[np.ones(len(X_diabetes)),X_diabetes]
ind = np.arange(X_diabetes.shape[0])
rng.shuffle(ind)
#===============================================================================
# Split Data
#===============================================================================
import math
cross= math.ceil(0.7*len(X_diabetes))
ind_train = ind[:cross]
X_train, y_train = X_diabetes[ind_train], y_diabetes[ind_train]
ind_val=ind[cross:]
X_val,y_val= X_diabetes[ind_val], y_diabetes[ind_val]
I also uploaded .csv files HERE
log.csv contain the original value before normalization for plot 1
normalizedLog.csv for plot 1
eigenConst.csv for plot 2
alphaConst.csv for plot 3

I think I found the answer. First of all there was one problem in my code. I was expecting the "No. of eigenvalue" correspond to rows but in my for loop they fill the columns. The currect answer is this :
for i in range(alphaLen):
for j in range(n):
alpha=alpha_list[i]
c=train.eig(xt_, yt_,m-j,m,alpha,"cpu")
costListTrain[i,j]=cost.err(xt_,xt_,yt_,c)
costListTest[i,j]=cost.err(xt_,xv_,yv_,c)
After asking questions from friends and colleagues I got this answer :
I would assume on default imshow and other plotting commands you
might want to use, do equally sized intervals on the values you are
plotting. if you can set that to logarithmic you should be fine.
Ideally, equally "populated bins" would proof most effective, i guess.
for plotting I just subtract the min value from the error and the add a small number and at the end take the log.
temp=costListTrain- costListTrain.min()
temp+=0.00000001
extent = [0, 20,alpha_list[0], alpha_list[-1]]
plt.imshow(np.log(temp),interpolation="nearest",cmap=plt.get_cmap('spectral'), extent = extent, origin="lower")
plt.colorbar()
and result is :

making binned boxplot in matplotlib with numpy and scipy in Python

I have a 2-d array containing pairs of values and I'd like to make a boxplot of the y-values by different bins of the x-values. I.e. if the array is:
my_array = array([[1, 40.5], [4.5, 60], ...]])
then I'd like to bin my_array[:, 0] and then for each of the bins, produce a boxplot of the corresponding my_array[:, 1] values that fall into each box. So in the end I want the plot to contain number of bins-many box plots.
I tried the following:
min_x = min(my_array[:, 0])
max_x = max(my_array[:, 1])
num_bins = 3
bins = linspace(min_x, max_x, num_bins)
elts_to_bins = digitize(my_array[:, 0], bins)
However, this gives me values in elts_to_bins that range from 1 to 3. I thought I should get 0-based indices for the bins, and I only wanted 3 bins. I'm assuming this is due to some trickyness with how bins are represented in linspace vs. digitize.
What is the easiest way to achieve this? I want num_bins-many equally spaced bins, with the first bin containing the lower half of the data and the upper bin containing the upper half... i.e., I want each data point to fall into some bin, so that I can make a boxplot.
thanks.

You're getting the 3rd bin for the maximum value in the array (I'm assuming you have a typo there, and max_x should be "max(my_array[:,0])" instead of "max(my_array[:,1])"). You can avoid this by adding 1 (or any positive number) to the last bin.
Also, if I'm understanding you correctly, you want to bin one variable by another, so my example below shows that. If you're using recarrays (which are much slower) there are also several functions in matplotlib.mlab (e.g. mlab.rec_groupby, etc) that do this sort of thing.
Anyway, in the end, you might have something like this (to bin x by the values in y, assuming x and y are the same length)
def bin_by(x, y, nbins=30):
"""
Bin x by y.
Returns the binned "x" values and the left edges of the bins
"""
bins = np.linspace(y.min(), y.max(), nbins+1)
# To avoid extra bin for the max value
bins[-1] += 1
indicies = np.digitize(y, bins)
output = []
for i in xrange(1, len(bins)):
output.append(x[indicies==i])
# Just return the left edges of the bins
bins = bins[:-1]
return output, bins
As a quick example:
In [3]: x = np.random.random((100, 2))
In [4]: binned_values, bins = bin_by(x[:,0], x[:,1], 2)
In [5]: binned_values
Out[5]:
[array([ 0.59649575, 0.07082605, 0.7191498 , 0.4026375 , 0.06611863,
0.01473529, 0.45487203, 0.39942696, 0.02342408, 0.04669615,
0.58294003, 0.59510434, 0.76255006, 0.76685052, 0.26108928,
0.7640156 , 0.01771553, 0.38212975, 0.74417014, 0.38217517,
0.73909022, 0.21068663, 0.9103707 , 0.83556636, 0.34277006,
0.38007865, 0.18697416, 0.64370535, 0.68292336, 0.26142583,
0.50457354, 0.63071319, 0.87525221, 0.86509534, 0.96382375,
0.57556343, 0.55860405, 0.36392931, 0.93638048, 0.66889756,
0.46140831, 0.01675165, 0.15401495, 0.10813141, 0.03876953,
0.65967335, 0.86803192, 0.94835281, 0.44950182]),
array([ 0.9249993 , 0.02682873, 0.89439141, 0.26415792, 0.42771144,
0.12292614, 0.44790357, 0.64692616, 0.14871052, 0.55611472,
0.72340179, 0.55335053, 0.07967047, 0.95725514, 0.49737279,
0.99213794, 0.7604765 , 0.56719713, 0.77828727, 0.77046566,
0.15060196, 0.39199123, 0.78904624, 0.59974575, 0.6965413 ,
0.52664095, 0.28629324, 0.21838664, 0.47305751, 0.3544522 ,
0.57704906, 0.1023201 , 0.76861237, 0.88862359, 0.29310836,
0.22079126, 0.84966201, 0.9376939 , 0.95449215, 0.10856864,
0.86655289, 0.57835533, 0.32831162, 0.1673871 , 0.55742108,
0.02436965, 0.45261232, 0.31552715, 0.56666458, 0.24757898,
0.8674747 ])]
Hope that helps a bit!

Numpy has a dedicated function for creating histograms the way you need to:
histogram(a, bins=10, range=None, normed=False, weights=None, new=None)
which you can use like:
(hist_data, bin_edges) = histogram(my_array[:,0], weights=my_array[:,1])
The key point here is to use the weights argument: each value a[i] will contribute weights[i] to the histogram. Example:
a = [0, 1]
weights = [10, 2]
describes 10 points at x = 0 and 2 points at x = 1.
You can set the number of bins, or the bin limits, with the bins argument (see the official documentation for more details).
The histogram can then be plotted with something like:
bar(bin_edges[:-1], hist_data)
If you only need to do a histogram plot, the similar hist() function can directly plot the histogram:
hist(my_array[:,0], weights=my_array[:,1])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

normalization of values in python np array gone wrong? - python

When you're normalising a row that has just one value and 7 zeroes, the value becomes 1 and the rest of the row is 0? This is likely why your plot is messing up. For example, the plot for the first column looks like this before and after normalization:

Related

Create a scatter plot from an ndarray using the position in the row as the x-axis value and the value in the array for the y-axis

How to plot a line between points taken from different rows of 2D array in python?

Plot missing points for complicated 3D list of points - Python

plotting high precision data

making binned boxplot in matplotlib with numpy and scipy in Python

Categories

Resources