I am benchmarking a program that runs until it finds a solution, and I am trying to create charts that show how it tends to reach that solution. The program sometimes takes 500 attempts and sometimes 2000, but in both cases I can show that it steadily produces better and better answers until it hits the target. I have hundreds of runs to examine, so I would like to see how the average of all runs moves over time; however, numpy does not let me average data of different lengths. How can I get it to average only the data points that are available at each test number?
EX: trial1 = [33.4853, 32.3958, 30.2859, 33.2958, 30.1049, 29.3209]
trial2 = [45.2937, 44.2983, 42.2839, 42.1394, 41.2938, 39.2936, 38.1826, 36.2483, 39.2632, 37.1827, 35.9936, 32.4837, 31.5599, 29.3209]
BE = numpy.array([trial1, trial2])
BEave = numpy.average(BE, axis=0)
I would like to get back: BEave = [39.3895, 38.34705, 36.2849, 37.7176, 35.69935, 34.30725, 38.1826, 36.2483, 39.2632, 37.1827, 35.9936, 32.4837, 31.5599, 29.3209]
You can create a large array of NaNs and fill each row with a trial, up to that trial's length; the rest of the row stays NaN. Then take the mean along the vertical axis with numpy.nanmean, which ignores the NaNs.
import numpy as np
import matplotlib.pyplot as plt
trial1 = [33.4853, 32.3958, 30.2859, 33.2958, 30.1049, 29.3209]
trial2 = [45.2937, 44.2983, 42.2839, 42.1394, 41.2938, 39.2936, 38.1826,
36.2483, 39.2632, 37.1827, 35.9936, 32.4837, 31.5599, 29.3209]
m = max(len(trial1), len(trial2))
# create an array of NaNs, one row per trial
BE = np.full((2, m), np.nan)
# fill each row with its trial's values
BE[0, :len(trial1)] = trial1
BE[1, :len(trial2)] = trial2
# nanmean = take mean, ignoring the NaN padding
BEave = np.nanmean(BE, axis=0)
plt.plot(trial1, label="trial1", color="mediumpurple")
plt.plot(trial2, label="trial2", color="violet")
plt.plot(BEave, color="crimson", label="avg")
plt.legend()
plt.show()
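Since you have hundreds of runs, the same idea generalizes to a list of trials of arbitrary lengths. Here is a minimal sketch of that generalization (the trials list is a stand-in for your actual runs):

import numpy as np

trials = [trial1, trial2]  # replace with the list of all your runs
m = max(len(t) for t in trials)
# one row of NaNs per run
BE = np.full((len(trials), m), np.nan)
for row, t in enumerate(trials):
    BE[row, :len(t)] = t
# at each attempt number, average only the runs that are still going
BEave = np.nanmean(BE, axis=0)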
I have a number of spectra: wavelength/counts at a given temperature. The wavelength range is the same for each spectrum.
I would like to interpolate between temperature and counts to create a large grid of spectra (temperature and counts over the given wavelength range).
The code below is my current progress. When I try to get a spectrum for a given temperature, I only get a single value of counts, when I need a whole range of counts representing the spectrum (I already know the wavelengths).
I think I am confused about arrays and interpolation. What am I doing wrong?
import pandas as pd
import numpy as np
from scipy import interpolate
image_template_one = pd.read_excel("mr_image_one.xlsx")
counts = np.array(image_template_one['counts'])
temp = np.array(image_template_one['temp'])
inter = interpolate.interp1d(temp, counts, kind='linear')
temp_new = np.arange(30, 50.5, 0.5)  # note: np.linspace's third argument is a count of points, not a step
counts_new = inter(temp_new)
I now think that I have two arrays: [wavelength, counts] and [wavelength, temperature]. Is this correct, and do I need to interpolate between the arrays?
Example data
I think what you want to achieve can be done with interp2d:
import numpy as np
import pandas as pd
from scipy import interpolate

# dummy data
data = pd.DataFrame({
    'temp': [30]*6 + [40]*6 + [50]*6,
    'wave': 3 * [a for a in range(400, 460, 10)],
    'counts': np.random.uniform(.93, .95, 18),
})
# make the interpolator
inter = interpolate.interp2d(data['temp'], data['wave'], data['counts'])
# scipy's interpolators return functions,
# which you need to call with the values you want interpolated.
new_x, new_y = np.linspace(30,50,100), np.linspace(400,450,100)
interpolated_values = inter(new_x, new_y)
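Since the goal is a whole spectrum at one temperature, note that the returned interpolator can be called with a scalar temperature together with an array of wavelengths, giving one counts value per wavelength. A hedged usage sketch (the temperature 35 and the wavelength grid are illustrative values, not from the original post):

wavelengths = np.arange(400, 460, 10)              # the wavelengths you already know
# ravel() flattens the result in case it comes back as a 2-D column
spectrum_at_35 = inter(35.0, wavelengths).ravel()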
I have a ton of ranges. They all consist of numbers. Each range has a maximum and a minimum which cannot be exceeded. When two ranges overlap, so that the maximum of one reaches above the minimum of the other, there is a small area covered by both of them, and you can write one range that includes the other.
I want to see which ranges overlap, or whether I can find ranges that cover most of the others. The goal is to see if I can simplify them by using one smaller range that fits inside the others. For example, 7.8 - 9.6 and 7.9 - 9.6 can be covered with one range.
You can see my attempt to visualize them below. But when I use my entire dataset, consisting of hundreds of ranges, the graph is no longer useful.
I know that I can detect recurring ranges using Python, but I don't want to know how often a range occurs. I want to know how many ranges lie within the same numerical boundaries, and whether a couple of ranges can cover all of them. Finally, my goal is to have the master ranges sorted into categories: range 1 covering 50 other ranges, range 2 covering 25 ranges, and so on.
My current program shows the overlap of the ranges graphically, but I also want that as printed output with the exact numbers.
It would be nice if you could share some ideas to solve this problem, or any suggestions on tools within Python 3.7.
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]
]
for interval in intervals:
    plt.plot(interval, [0, 0], 'b', alpha=0.2, linewidth=100)
plt.show()
Here is an idea: you make a pandas DataFrame from the array and subtract column 1 from column 2 (column 1 is x and column 2 is y), which gives each range's width. After that you create a histogram of those widths against their frequency.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
intervals = [[3.6,4.5],
[3.6,4.5],
[7.8,9.6],
[7.9,9.6],
[7.8,9.6],
[3.4,4.1],
[2.8,3.4],
[8.25,9.83],
[3.62,3.96],
[8.25,9.83],
[0.62,0.68],
[2.15,2.49],
[0.8,1.0],
[0.8,1.0],
[3.1,3.9],
[6.7,8.3],
[1,1.5],
[1,1.2],
[1.5,1.8],
[1.8,2.5],
[3,4.0],
[6.5,8.0],
[1.129,1.35],
[2.82,3.38],
[1.69,3.38],
[3.38,6.21],
[2.25,2.82],
[5.649,6.214],
[1.920,6.214]]
intervals_ar = np.array(intervals)
df = pd.DataFrame({'Column1': intervals_ar[:, 0], 'Column2': intervals_ar[:, 1]})
df['Ranges'] = df['Column2'] - df['Column1']
print(df)
frequency_range = df['Ranges'].value_counts().sort_index()
print(frequency_range)
df['Ranges'].plot(kind='hist', bins=5)
plt.title("Histogram: frequency vs. range width (column 2 - column 1)")
plt.show()
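The histogram shows how the widths are distributed, but it does not say which ranges can stand in for others. As a complementary sketch (an addition, not part of the original answer), merging overlapping intervals and counting how many of the originals fall inside each merged "master range" gives the printed breakdown asked for:

# sort by lower bound, then sweep once, extending the current master range
# whenever the next interval overlaps it; count how many originals it absorbs
merged = []  # list of [low, high, count]
for low, high in sorted(intervals):
    if merged and low <= merged[-1][1]:
        merged[-1][1] = max(merged[-1][1], high)
        merged[-1][2] += 1
    else:
        merged.append([low, high, 1])

for low, high, count in sorted(merged, key=lambda m: -m[2]):
    print(f"{low} - {high} covers {count} ranges")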
I am working on filling in missing data in a large (4 GB) netCDF data file (three dimensions: time, longitude and latitude). The method is to fill in the masked values in data1 either with:
1) previous values from data1, or
2) data from another (also masked) dataset, data2, if the value found from data1 < the value found from data2.
So far I have tried a couple of things. One was a very complex script with long for loops that had still not finished after 24 hours of running. I have tried to reduce it, but I think it is still much too complicated. I believe there is a much simpler way to do it, I just can't see how.
I have made a script where the masked data is first replaced with zeros in order to use np.where to get the indices of my masked data (I did not find a function that returns the coordinates of masked data, so this is my workaround). My problem is that my code is very long and, I think, time-consuming for large datasets. I believe there is a simpler way of doing it, but I haven't found another workaround.
Here is what I have so far (the first part just generates some matrices that are easy to work with):
if __name__ == '__main__':
    import numpy as np
    import numpy.ma as ma
    from sortdata_helpers import decision_tree

    # Generate some (easy) test data to try the algorithm on:
    # data1
    rand1 = np.random.randint(10, size=(10, 10, 10))
    rand1 = ma.masked_where(rand1 > 5, rand1)
    rand1 = ma.filled(rand1, fill_value=0)
    rand1[0, :, :] = 1

    # data2
    rand2 = np.random.randint(10, size=(10, 10, 10))
    rand2[0, :, :] = 1

    # locations of the zeros (the formerly masked points) in data1
    coordinates1 = np.asarray(np.where(rand1 == 0))
    filled_data = decision_tree(rand1, rand2, coordinates1)
    print(filled_data)
The functions that I defined to be called in the main script are these, in the same order as they are used:
def decision_tree(data1, data2, coordinates):
    # This is the main function, where the choice between data1 and data2 is made.
    import numpy as np
    from sortdata_helpers import generate_vector
    from sortdata_helpers import find_value

    for i in range(coordinates.shape[1]):
        coordinate = [coordinates[0, i], coordinates[1, i], coordinates[2, i]]
        AET_vec = generate_vector(data1, coordinate)  # vector going back in time
        AET_value = find_value(AET_vec)               # closest earlier day with data
        PET_vec = generate_vector(data2, coordinate)
        PET_value = find_value(PET_vec)
        if PET_value > AET_value:
            data1[coordinate[0], coordinate[1], coordinate[2]] = AET_value
        else:
            data1[coordinate[0], coordinate[1], coordinate[2]] = PET_value
    return data1

def generate_vector(data, coordinate):
    # Returns the time series at this grid point, up to (but not including) the given time step.
    vector = data[0:coordinate[0], coordinate[1], coordinate[2]]
    return vector

def find_value(vector):
    # The most recent value in the vector that is not zero is chosen as "value".
    from itertools import dropwhile
    value = list(dropwhile(lambda x: x == 0, reversed(vector)))[0]
    return value
I hope someone has a good idea or suggestions on how to improve my code. I am still struggling with understanding indexing in Python, and I think this can definitely be done more smoothly than I have done here.
Thanks for any suggestions or comments.
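One possible simplification, sketched here as an assumption rather than a tested solution: the coordinates of masked points can be read straight from the mask with np.where(np.ma.getmaskarray(arr)), so the fill-with-zeros step is not strictly needed, and the per-cell backward search can be replaced by a single forward fill along the time axis using np.maximum.accumulate on the indices of the valid entries. Reusing rand1 and rand2 from the test script above:

import numpy as np

def ffill_time(arr, invalid):
    # For each cell, take the most recent valid value at or before that time step.
    # Assumes the first time slice is valid (as in the test data, where index 0 is set to 1).
    t = np.arange(arr.shape[0])[:, None, None]
    last_valid = np.maximum.accumulate(np.where(~invalid, t, 0), axis=0)
    j = np.arange(arr.shape[1])[None, :, None]
    k = np.arange(arr.shape[2])[None, None, :]
    return arr[last_valid, j, k]

# candidate fills: most recent non-zero value at or before each time step
aet_fill = ffill_time(rand1, rand1 == 0)
pet_fill = ffill_time(rand2, rand2 == 0)
# same decision as decision_tree: keep the smaller of the two candidates
# (note: for data2 this may use the value at the current step, whereas
# find_value looks strictly earlier; shift pet_fill by one step if that matters)
filled_data = np.where(rand1 == 0, np.minimum(aet_fill, pet_fill), rand1)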
I am trying to find the period of a sin curve and can find the right periods for sin(t).
However, for sin(k*t) the detected frequency shifts, and I do not know how it shifts.
I can adjust the value of interd below to get the right signal only if I already know the dataset is sin(0.6*t).
Why do I only get the right result for sin(t)?
Can anyone detect the right signal based just on my code, or with just a small change?
The figure below is the power spectral density of sin(0.6*t).
The dataset is like:
1,sin(1*0.6)
2,sin(2*0.6)
3,sin(3*0.6)
.........
2000,sin(2000*0.6)
And my code:
import numpy as np
import matplotlib.pyplot as pl

timepoints = np.loadtxt('dataset', usecols=(0,), unpack=True, delimiter=",")
intensity = np.loadtxt('dataset', usecols=(1,), unpack=True, delimiter=",")
binshu = 300
lastime = 2000
interd = 2000.0/300
sp = np.fft.fft(intensity)
freq = np.fft.fftfreq(len(intensity),d=interd)
freqnum = np.fft.fftfreq(len(intensity),d=interd).argsort()
pl.xlabel("frequency(Hz)")
pl.plot(freq[freqnum]*6.28, np.sqrt(sp.real**2+sp.imag**2)[freqnum])
I think you're making it too complicated. If you consider timepoints to be in seconds then interd is 1 (difference between values in timepoints). This works fine for me:
import numpy as np
import matplotlib.pyplot as pl
# you can do this in one line, that's what 'unpack' is for:
timepoints, intensity = np.loadtxt('dataset', usecols=(0,1), unpack=True, delimiter=",")
interd = timepoints[1] - timepoints[0] # if this is 1, it can be ignored
sp = np.fft.fft(intensity)
freq = np.fft.fftfreq(len(intensity), d=interd)
pl.plot(np.fft.fftshift(freq), np.fft.fftshift(np.abs(sp)))
pl.xlabel("frequency(Hz)")
pl.show()
You'll also note that I didn't sort the frequencies, that's what fftshift is for.
Also, don't do np.sqrt(sp.imag**2 + sp.real**2), that's what np.abs is for :)
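To check the fix concretely, here is a small self-contained sketch (the synthetic dataset mirrors the sin(0.6*t) series described in the question, and the peak is read off with argmax):

import numpy as np

t = np.arange(1, 2001)
intensity = np.sin(0.6 * t)

interd = t[1] - t[0]  # 1, since the points are one unit apart
sp = np.fft.fft(intensity)
freq = np.fft.fftfreq(len(intensity), d=interd)

# keep only the positive frequencies and find the strongest one
pos = freq > 0
peak_freq = freq[pos][np.argmax(np.abs(sp)[pos])]
print(2 * np.pi * peak_freq)  # approximately 0.6, i.e. the angular frequency k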
If you're not sampling densely enough (the signal frequency is above the Nyquist limit, i.e., k > pi/interd), then there's no way for fft to know how much data you're missing, so it assumes you're not missing any. You can't expect it to know a priori. This is the data you're giving it:
I have an array which contains error values as a function of two different quantities (alpha and eigRange).
I fill my array like this :
for j in range(n):
    for i in range(alphaLen):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m - j, m, alpha, "cpu")
        costListTrain[j, i] = cost.err(xt_, xt_, yt_, c)

normedValues = costListTrain / np.max(costListTrain.ravel())
where
n = 20
alpha_list = [0.0001,0.0003,0.0008,0.001,0.003,0.006,0.01,0.03,0.05]
My costListTrain array contains some values that have very small differences, e.g.:
2.809458902485728 2.809458905776425 2.809458913576337 2.809459011062461
2.030326752376704 2.030329906064879 2.030337351188699 2.030428976282031
1.919840839066182 1.919846470077076 1.919859731440199 1.920021453630778
1.858436351617677 1.858444223016128 1.858462730482461 1.858687054377165
1.475871326997542 1.475901926855846 1.475973476249240 1.476822830933632
1.475775410801635 1.475806023102173 1.475877601316863 1.476727286424228
1.475774284270633 1.475804896751524 1.475876475382906 1.476726165223209
1.463578292548192 1.463611627166494 1.463689466240788 1.464609083309240
1.462859608038034 1.462893157900139 1.462971489632478 1.463896516033939
1.461912706143012 1.461954067956570 1.462047793798572 1.463079574605320
1.450581041157659 1.452770209885761 1.454835202839513 1.459676311335618
1.450581041157643 1.452770209885764 1.454835202839484 1.459676311335624
1.450581041157651 1.452770209885735 1.454835202839484 1.459676311335610
1.450581041157597 1.452770209885784 1.454835202839503 1.459676311335620
1.450581041157575 1.452770209885757 1.454835202839496 1.459676311335619
1.450581041157716 1.452770209885711 1.454835202839499 1.459676311335613
1.450581041157667 1.452770209885744 1.454835202839509 1.459676311335625
1.450581041157649 1.452770209885750 1.454835202839476 1.459676311335617
1.450581041157655 1.452770209885708 1.454835202839442 1.459676311335622
1.450581041157571 1.452770209885700 1.454835202839498 1.459676311335622
As you can see here, the values are very, very close together!
I am trying to plot this data so that the two quantities are on the x and y axes and the error value is represented by the dot color.
This is how I'm plotting my data:
alpha_list = np.log(alpha_list)
eigenvalues, alphaa = np.meshgrid(eigRange, alpha_list)
vMin = np.min(costListTrain)
vMax = np.max(costListTrain)
plt.scatter(x, y, s=70, c=normedValues, vmin=vMin, vmax=vMax, alpha=0.50)
but the result is not correct.
I tried to normalize my error values by dividing all of them by the max, but it didn't work!
The only way that I could make it work (which is incorrect) is to normalize my data in two different ways. One is based on each column (factor 1 constant, factor 2 changing), and the other is based on each row (factor 2 constant, factor 1 changing). But that doesn't really make sense, because I need a single plot to show the tradeoff between the two quantities and their effect on the error values.
UPDATE
This is what I mean by the last paragraph.
Normalizing values based on the max of each row, which corresponds to the eigenvalues:
maxsEigBasedTrain = np.amax(costListTrain.T, 1)[:, np.newaxis]
maxsEigBasedTest = np.amax(costListTest.T, 1)[:, np.newaxis]
normEigCostTrain = costListTrain.T / maxsEigBasedTrain
normEigCostTest = costListTest.T / maxsEigBasedTest
Normalizing values based on the max of each column, which corresponds to the alphas:
maxsAlphaBasedTrain = np.amax(costListTrain, 1)[:, np.newaxis]
maxsAlphaBasedTest = np.amax(costListTest, 1)[:, np.newaxis]
normAlphaCostTrain = costListTrain / maxsAlphaBasedTrain
normAlphaCostTest = costListTest / maxsAlphaBasedTest
plot 1:
Plot 2, where no. of eigenvalues = 10 and alpha changes (should correspond to column 10 of plot 1):
Plot 3, where alpha = 0.0001 and the eigenvalues change (should correspond to the first row of plot 1):
But as you can see, the results are different from plot 1!
UPDATE:
Just to clarify things further, this is how I read my data:
from sklearn import datasets
from sklearn.datasets.samples_generator import make_regression
import numpy as np

rng = np.random.RandomState(0)
diabetes = datasets.load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target
X_diabetes = np.c_[np.ones(len(X_diabetes)), X_diabetes]
ind = np.arange(X_diabetes.shape[0])
rng.shuffle(ind)
#===============================================================================
# Split Data
#===============================================================================
import math
cross = math.ceil(0.7 * len(X_diabetes))
ind_train = ind[:cross]
X_train, y_train = X_diabetes[ind_train], y_diabetes[ind_train]
ind_val = ind[cross:]
X_val, y_val = X_diabetes[ind_val], y_diabetes[ind_val]
I also uploaded .csv files HERE
log.csv contains the original values before normalization for plot 1
normalizedLog.csv for plot 1
eigenConst.csv for plot 2
alphaConst.csv for plot 3
I think I found the answer. First of all, there was one problem in my code: I was expecting the "no. of eigenvalues" to correspond to rows, but in my for loop they fill the columns. The correct loop is this:
for i in range(alphaLen):
    for j in range(n):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m - j, m, alpha, "cpu")
        costListTrain[i, j] = cost.err(xt_, xt_, yt_, c)
        costListTest[i, j] = cost.err(xt_, xv_, yv_, c)
After asking friends and colleagues, I got this answer:
I would assume that, by default, imshow and the other plotting commands you might want to use place equally sized intervals on the values you are plotting. If you can set that to logarithmic you should be fine. Ideally, equally "populated" bins would prove most effective, I guess.
For plotting, I just subtract the minimum value from the errors, then add a small number, and at the end take the log.
temp = costListTrain - costListTrain.min()
temp += 0.00000001
extent = [0, 20, alpha_list[0], alpha_list[-1]]
plt.imshow(np.log(temp), interpolation="nearest", cmap=plt.get_cmap('spectral'),
           extent=extent, origin="lower")
plt.colorbar()
and the result is:
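As an aside (an assumption on my part, not from the original post), matplotlib can apply the logarithmic scaling through a color normalization, which avoids the manual shift-and-log step. Reusing costListTrain and alpha_list from above:

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

shifted = costListTrain - costListTrain.min() + 1e-8  # LogNorm needs strictly positive values
plt.imshow(shifted, interpolation="nearest", norm=LogNorm(),
           extent=[0, 20, alpha_list[0], alpha_list[-1]], origin="lower")
plt.colorbar()
plt.show()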