Python - plot numpy array with gaps in the data

I need to plot some spectral data as a 2D image, where each data point corresponds to a spectrum taken at a specific date/time. I want to plot all spectra as follows:
- x-axis - corresponds to the wavelength
- y-axis - corresponds to the date/time
- intensity - corresponds to the flux
If my data points were continuous/sequential in time I would just use matplotlib's imshow. However, not only are the points not all continuous/sequential in time, but there are large time gaps between some of them.
Here is some simulated data that mimics what I have:
import numpy as np
import matplotlib.pyplot as mplt

sampleSize = 100
data = {}
for time in np.arange(0, 5):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(14, 20):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(30, 40):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(25.5, 35.5):
    data[time] = np.random.sample(sampleSize)
for time in np.arange(80, 120):
    data[time] = np.random.sample(sampleSize)
If I needed to plot only one of the subsets of data above, I would do:
mplt.imshow([data[time] for time in np.arange(0, 5)],
            cmap='Greys', aspect='auto', origin='lower',
            interpolation="none", extent=[-50, 50, 0, 5])
mplt.show()
However, I have no idea how I can plot all the data in the same figure while showing the gaps and keeping the y-axis as the time. Any ideas?
thanks,
Jorge

Or you can use pandas to help you with sorting the keys, then reindex:
import pandas as pd

df = pd.DataFrame(data).T
plt.imshow(df.reindex(np.arange(df.index.max())),
           cmap='Greys',
           aspect='auto',
           origin='lower',
           interpolation="none",
           extent=[-50, 50, 0, 5])
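Reindexing over the full integer time range inserts all-NaN rows at the missing times, and imshow leaves NaN pixels blank, so the gaps show up automatically. (One caveat: non-integer keys such as 25.5 will not match the integer grid and are dropped by the reindex.)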
Output:

In the end I went with a different approach:
1) re-index the times in my data so that no two arrays share the same index and I avoid non-integer indexes; the loop below finds the smallest integer factor that keeps the scaled times distinct:
nTimes = 1
timeIndexes = [int(float(index)) for index in data.keys()]
while len(timeIndexes) != len(set(timeIndexes)):
    nTimes += 1
    timeIndexes = [int(nTimes * float(index)) for index in data.keys()]
timeIndexesDict = {str(int(nTimes * float(index))): data[index] for index in data.keys()}
lenData2Plot = max([int(key) for key in timeIndexesDict.keys()])
2) create an array of zeros with the same number of columns as my data and a number of rows corresponding to my maximum re-indexed time:
data2Plot = np.zeros((int(lenData2Plot) + 1, sampleSize))
3) replace the rows in my array of zeros corresponding to my re-indexed times:
for index in timeIndexesDict.keys():
    data2Plot[int(index)][:] = timeIndexesDict[str(index)]
4) plot as I normally would plot an array with no gaps:
mplt.imshow(data2Plot,
            cmap='Greys', aspect='auto', origin='lower', interpolation="none",
            extent=[-50, 50, 0, 120])
mplt.show()
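A variant along the same lines (a sketch, not what I originally ran): fill the gap rows with NaN instead of zeros, so gaps render as blanks rather than as zero-intensity rows:

# NaN needs a float array; imshow leaves NaN pixels blank
data2Plot = np.full((int(lenData2Plot) + 1, sampleSize), np.nan)
for index in timeIndexesDict:
    data2Plot[int(index), :] = timeIndexesDict[index]
mplt.imshow(data2Plot, cmap='Greys', aspect='auto', origin='lower',
            interpolation="none", extent=[-50, 50, 0, 120])
mplt.show()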

Related

Interpolate: spectra (wavelength, counts) at a given temperature, to create grid of temperature and counts

I have a number of spectra: wavelength/counts at a given temperature. The wavelength range is the same for each spectrum.
I would like to interpolate between temperature and counts to create a large grid of spectra (temperature and counts, at the given wavelength range).
The code below is my current progress. When I try to get a spectrum for a given temperature I only get one value of counts, when I need a range of counts representing the spectrum (I already know the wavelengths).
I think I am confused about arrays and interpolation. What am I doing wrong?
import pandas as pd
import numpy as np
from scipy import interpolate
image_template_one = pd.read_excel("mr_image_one.xlsx")
counts = np.array(image_template_one['counts'])
temp = np.array(image_template_one['temp'])
inter = interpolate.interp1d(temp, counts, kind='linear')
temp_new = np.linspace(30,50,0.5)
counts_new = inter(temp_new)
I now think that I have two arrays, [wavelength, counts] and [wavelength, temperature]. Is this correct, and do I need to interpolate between the arrays?
Example data
I think what you want to achieve can be done with interp2d:
import numpy as np
import pandas as pd
from scipy import interpolate

# dummy data
data = pd.DataFrame({
    'temp': [30] * 6 + [40] * 6 + [50] * 6,
    'wave': 3 * [a for a in range(400, 460, 10)],
    'counts': np.random.uniform(.93, .95, 18),
})
# make the interpolator
inter = interpolate.interp2d(data['temp'], data['wave'], data['counts'])
# scipy's interpolators return functions,
# which you need to call with the values you want interpolated.
new_x, new_y = np.linspace(30, 50, 100), np.linspace(400, 450, 100)
interpolated_values = inter(new_x, new_y)
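Note that interp2d has since been deprecated and removed in recent SciPy releases. On a regular temperature/wavelength grid, RegularGridInterpolator is the usual replacement; a sketch with the same kind of dummy grid (counts reshaped to (temps, waves)):

import numpy as np
from scipy.interpolate import RegularGridInterpolator

temps = np.array([30, 40, 50])
waves = np.arange(400, 460, 10)
counts = np.random.uniform(.93, .95, (3, 6))  # shape: (len(temps), len(waves))

inter = RegularGridInterpolator((temps, waves), counts)
# query with (temp, wave) pairs
print(inter(np.array([[35, 425], [45, 440]])))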

Pandas: removing everything in a column after first value above threshold

I'm interested in the first time a random process crosses a threshold. I am storing the results from observing the process in a dataframe, and have plotted how many times several realisations of that process cross 0.9 by the time I observe it at the end of 14 rounds.
This image was created with this code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('ggplot')
fin = pd.DataFrame(data=np.random.uniform(size=(100, 13))).T
pos = (fin > 0.9).astype(float)
ax = fin.loc[:, pos.loc[12, :] != 1.0].plot(figsize=(12, 6), color='silver', legend=False)
fin.loc[:, pos.loc[12, :] == 1.0].plot(figsize=(12, 6), color='indianred', legend=False, ax=ax)
where fin contains the random numbers, and pos is 1 every time the process crosses 0.9.
I would now like to plot the first time the process in fin crosses 0.9 for each realisation (columns represent realisations, rows represent observation times).
I can find the first occurrence of a value above 0.9 with idxmax(), but I'm stumped about how to remove everything in the dataframe after that point in each column.
import numpy as np
import pandas as pd
df = pd.DataFrame(data=np.random.uniform(size=(100, 10)))
maxes = df.idxmax()
It's just that I'm having real difficulty thinking through this.
If I understand correctly, you can use:
df = df[df.index < maxes[0]]
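Note that this uses one cutoff (the first column's) for every realisation; a per-column variant might look like this sketch (maxes from the question, one index per column):

# hypothetical per-column truncation: cut each column at its own index
trimmed = {col: df[col][df.index < maxes[col]] for col in df.columns}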
IIUC, we can use a boolean matrix with cumprod:
df.where((df < .9).cumprod().astype(bool)).plot()
Output:
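To see what the cumprod trick does, here is a small sketch on a single series: (s < .9) is True until the first crossing, cumprod zeroes everything from that point on, and where masks the zeroed positions:

import pandas as pd

s = pd.Series([0.2, 0.5, 0.95, 0.3, 0.99])
mask = (s < 0.9).cumprod().astype(bool)  # True, True, False, False, False
print(s.where(mask))  # the first crossing and everything after it become NaN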

Using python to take 32x32 matrices, append many of them to a single array, then add a timestamp index to each matrix

I am very new to coding Python and I am working with a .CSV file that gives me a 32x32 matrix in a 1024-column row with a time stamp. I reshaped the data to give me 32x32 arrays and looped through each row, appending the matrices to a numpy array.
i = 0
while i < len(df_array):
    if i == 0:
        spec = np.reshape(df_array[i][np.arange(1, 1025)], (32, 32))
        spectrum_matrix = spec
    else:
        spec = np.reshape(df_array[i][np.arange(1, 1025)], (32, 32))
        spectrum_matrix = np.concatenate((spectrum_matrix, spec), axis=0)
    i = i + 1
print("job done")
What I would like to do is add the time stamp from the original data file to each of the matrices, allowing me to resample the data over a 5-minute average. I also would like to plot the bins to get a plot similar to this Drop size distribution
As a reference, I am reading in the .CSV data with pandas; here is an example of a portion of the raw data:
01.06.2017;18:22:20;0.122;0.00;51;7.401;10375;18745;57;27;0.00;23.6;0.110;0;
<SPECTRUM>;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
The ;-separated values after <SPECTRUM> are the 32x32 matrix.
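For concreteness, here is a minimal sketch of how such a file could be parsed (assumptions: each record is a header line carrying the timestamp, followed by a <SPECTRUM> line whose ;-separated fields are the 1024 values; the filename is hypothetical):

import numpy as np
import pandas as pd

times, spectra = [], []
with open('disdrometer.csv') as fh:
    for line in fh:
        fields = line.rstrip('\n').split(';')
        if fields[0] == '<SPECTRUM>':
            # empty fields count as zero
            vals = [float(v) if v else 0.0 for v in fields[1:1025]]
            spectra.append(np.array(vals).reshape(32, 32))
        else:
            times.append(pd.to_datetime(fields[0] + ' ' + fields[1],
                                        format='%d.%m.%Y %H:%M:%S'))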
Thanks in advance for any help!
Python and associated packages can do many things without loops
From my understanding of your data you have a (8640 x 32 x 32) Data Structure (time x size x velocity).
Pandas works very well with 2D data structures, however for higher dimensional data I would recommend you get familiar with xarray. With this package along with pandas you can create and manipulate your data without having to resort to loops.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import xarray as xr
import seaborn as sns
%matplotlib inline

# create random data
data = (np.random.binomial(n=5, p=0.2, size=(8640, 32, 32)) * 1000).astype(int)
# create labels for data
sizes = np.linspace(1, 5, 32)
velocities = np.linspace(1, 1000, num=32)
# make time range of 24 hours with 10sec intervals
ind = pd.date_range(start='2014-01-01', periods=8640, freq='10s')
# convert data to xarray 3D data structure
df = xr.DataArray(data, coords=[ind, sizes, velocities],
                  dims=['time', 'size', 'speed'])
# make a 5 min average of the data
min_average = df.resample('300s', dim='time', how='mean')
# plot sample of data and 5 min average
my1d = min_average.isel(size=5, speed=10)
my1d.plot(label='5 min avg')
plt.gca()
df.isel(size=5, speed=10).plot(alpha=0.3, c='r', label='raw_data')
plt.legend()
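Note: the resample call above uses an old xarray signature; in current xarray the dim=/how= keywords are gone and the equivalent is:

# modern xarray resample API
min_average = df.resample(time='300s').mean()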
As for making a distribution plot like the one you linked, things get a bit trickier, but it is possible:
# transform your data to only have the mean speed for each time and size,
# and convert to a pandas dataframe
mean_speed = min_average.mean(dim=['speed'])
# xarray makes you name the new column when you convert to a pandas
# dataframe; the list comprehension then pulls the numeric size values
# out of the resulting MultiIndex
df = mean_speed.to_dataframe('').unstack().T
df.index = np.array([np.array(i)[1].astype(float) for i in df.index])
# make a contour plot of your new data
plt.contourf(df.columns, df.index, df.values, cmap='PuBu_r')
plt.title('mean speed')
plt.ylabel('size')
plt.xlabel('time')
plt.colorbar()

plotting high precision data

I have an array which contains error values as a function of two different quantities (alpha and eigRange).
I fill my array like this:
for j in range(n):
    for i in range(alphaLen):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m - j, m, alpha, "cpu")
        costListTrain[j, i] = cost.err(xt_, xt_, yt_, c)
normedValues = costListTrain / np.max(costListTrain.ravel())
where
n = 20
alpha_list = [0.0001, 0.0003, 0.0008, 0.001, 0.003, 0.006, 0.01, 0.03, 0.05]
My costListTrain array contains some values that have very small differences, e.g.:
2.809458902485728 2.809458905776425 2.809458913576337 2.809459011062461
2.030326752376704 2.030329906064879 2.030337351188699 2.030428976282031
1.919840839066182 1.919846470077076 1.919859731440199 1.920021453630778
1.858436351617677 1.858444223016128 1.858462730482461 1.858687054377165
1.475871326997542 1.475901926855846 1.475973476249240 1.476822830933632
1.475775410801635 1.475806023102173 1.475877601316863 1.476727286424228
1.475774284270633 1.475804896751524 1.475876475382906 1.476726165223209
1.463578292548192 1.463611627166494 1.463689466240788 1.464609083309240
1.462859608038034 1.462893157900139 1.462971489632478 1.463896516033939
1.461912706143012 1.461954067956570 1.462047793798572 1.463079574605320
1.450581041157659 1.452770209885761 1.454835202839513 1.459676311335618
1.450581041157643 1.452770209885764 1.454835202839484 1.459676311335624
1.450581041157651 1.452770209885735 1.454835202839484 1.459676311335610
1.450581041157597 1.452770209885784 1.454835202839503 1.459676311335620
1.450581041157575 1.452770209885757 1.454835202839496 1.459676311335619
1.450581041157716 1.452770209885711 1.454835202839499 1.459676311335613
1.450581041157667 1.452770209885744 1.454835202839509 1.459676311335625
1.450581041157649 1.452770209885750 1.454835202839476 1.459676311335617
1.450581041157655 1.452770209885708 1.454835202839442 1.459676311335622
1.450581041157571 1.452770209885700 1.454835202839498 1.459676311335622
As you can see here, the values are very close together!
I am trying to plot this data in a way where I have the two quantities on the x and y axes and the error value is represented by the dot color.
This is how I'm plotting my data:
alpha_list = np.log(alpha_list)
eigenvalues, alphaa = np.meshgrid(eigRange, alpha_list)
vMin = np.min(costListTrain)
vMax = np.max(costListTrain)
plt.scatter(eigenvalues, alphaa, s=70, c=normedValues, vmin=vMin, vmax=vMax, alpha=0.50)
but the result is not correct.
I tried to normalize my error values by dividing all of them by the max, but it didn't work!
The only way that I could make it work (which is incorrect) is to normalize my data in two different ways. One is based on each column (which means factor 1 is constant and factor 2 changes), and the other is based on each row (factor 2 constant, factor 1 changing). But it doesn't really make sense, because I need a single plot showing the tradeoff between the two quantities and their effect on the error values.
UPDATE
This is what I mean by the last paragraph.
Normalizing values based on the max of each row, which corresponds to eigenvalues:
maxsEigBasedTrain = np.amax(costListTrain.T, 1)[:, np.newaxis]
maxsEigBasedTest = np.amax(costListTest.T, 1)[:, np.newaxis]
normEigCostTrain = costListTrain.T / maxsEigBasedTrain
normEigCostTest = costListTest.T / maxsEigBasedTest
Normalizing values based on the max of each column, which corresponds to alphas:
maxsAlphaBasedTrain = np.amax(costListTrain, 1)[:, np.newaxis]
maxsAlphaBasedTest = np.amax(costListTest, 1)[:, np.newaxis]
normAlphaCostTrain = costListTrain / maxsAlphaBasedTrain
normAlphaCostTest = costListTest / maxsAlphaBasedTest
plot 1 (the full 2D map):
plot 2, where the no. of eigenvalues = 10 and alpha changes (should correspond to column 10 of plot 1):
plot 3, where alpha = 0.0001 and the eigenvalues change (should correspond to the first row of plot 1):
But as you can see, the results are different from plot 1!
UPDATE:
Just to clarify things further, this is how I read my data:
import numpy as np
from sklearn import datasets

rng = np.random.RandomState(0)
diabetes = datasets.load_diabetes()
X_diabetes, y_diabetes = diabetes.data, diabetes.target
X_diabetes = np.c_[np.ones(len(X_diabetes)), X_diabetes]
ind = np.arange(X_diabetes.shape[0])
rng.shuffle(ind)
#===============================================================================
# Split Data
#===============================================================================
import math
cross = math.ceil(0.7 * len(X_diabetes))
ind_train = ind[:cross]
X_train, y_train = X_diabetes[ind_train], y_diabetes[ind_train]
ind_val = ind[cross:]
X_val, y_val = X_diabetes[ind_val], y_diabetes[ind_val]
I also uploaded the .csv files HERE.
log.csv contains the original values, before normalization, for plot 1
normalizedLog.csv for plot 1
eigenConst.csv for plot 2
alphaConst.csv for plot 3
I think I found the answer. First of all, there was one problem in my code: I was expecting the "No. of eigenvalues" to correspond to rows, but in my for loop they filled the columns. The correct version is this:
for i in range(alphaLen):
    for j in range(n):
        alpha = alpha_list[i]
        c = train.eig(xt_, yt_, m - j, m, alpha, "cpu")
        costListTrain[i, j] = cost.err(xt_, xt_, yt_, c)
        costListTest[i, j] = cost.err(xt_, xv_, yv_, c)
After asking friends and colleagues I got this answer:
I would assume imshow and the other plotting commands you might want to use do equally sized intervals on the values you are plotting by default. If you can set that to logarithmic you should be fine. Ideally, equally "populated bins" would prove most effective, I guess.
For plotting, I just subtract the minimum value from the errors, add a small number, and at the end take the log:
temp = costListTrain - costListTrain.min()
temp += 0.00000001
extent = [0, 20, alpha_list[0], alpha_list[-1]]
# note: the old 'spectral' colormap is called 'nipy_spectral' in current matplotlib
plt.imshow(np.log(temp), interpolation="nearest", cmap=plt.get_cmap('nipy_spectral'),
           extent=extent, origin="lower")
plt.colorbar()
and the result is:
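An alternative to taking the log by hand is matplotlib's LogNorm, which applies the logarithmic scaling inside the color mapping; a sketch under the same shift-above-zero assumption:

import matplotlib.pyplot as plt
from matplotlib.colors import LogNorm

temp = costListTrain - costListTrain.min() + 1e-8  # shift strictly above zero
plt.imshow(temp, interpolation="nearest", norm=LogNorm(),
           extent=extent, origin="lower")
plt.colorbar()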

how to plot on a smaller scale

I am using matplotlib and I'm running into problems when trying to plot large vectors: I sometimes get a "MemoryError".
My question is whether there is any way to reduce the scale of the values that I need to plot. In this example I'm plotting a vector of size 2647296!
Is there any way to plot the same values on a smaller scale?
It is very unlikely that you have so much resolution on your display that you can see 2.6 million data points in your plot. A simple way to plot less data is to sample e.g. every 1000th point: plot(x[::1000]). If that loses too much and it is e.g. important to see the extremal values, you could write some code to split the long vector into suitably many parts and take the minimum and maximum of each part, and plot those:
import matplotlib.pyplot as plt

tmp = x[:len(x) - len(x) % 1000]  # drop some points to make the length a multiple of 1000
tmp = tmp.reshape((-1, 1000))     # each row is one block of 1000 consecutive points
# alternative: tmp.reshape((1000, -1)) splits into 1000 equal blocks instead
plt.figure()                      # successive plot calls draw on the same axes
plt.plot(tmp.min(axis=1))         # minimum of each block
plt.plot(tmp.max(axis=1))         # maximum of each block
You can use a min/max for each block of data to subsample the signal.
Window size would have to be determined based on how accurately you want to display your signal and/or how large the window is compared to the signal length.
Example code:
from scipy.io import wavfile
import matplotlib.pyplot as plt

def value_for_window_min_max(data, start, stop):
    # return whichever of the window's min/max has the larger magnitude
    mn = data[start]
    mx = data[start]
    for i in range(start, stop):
        if data[i] < mn:
            mn = data[i]
        if data[i] > mx:
            mx = data[i]
    if abs(mn) > abs(mx):
        return mn
    else:
        return mx

# This will only work properly if window_size divides evenly into len(data)
def subsample_data(data, window_size):
    print(len(data))
    print(len(data) // window_size)
    out_data = []
    for i in range(len(data) // window_size):
        out_data.append(value_for_window_min_max(data, i * window_size,
                                                 (i + 1) * window_size))
    return out_data

sample_rate, data = wavfile.read('<path_to_wav_file>')
sub_amt = 10
sub_data = subsample_data(data, sub_amt)
print(len(data))
print(len(sub_data))
fig = plt.figure(figsize=(8, 6), dpi=100)
fig.add_subplot(211)
plt.plot(data)
plt.title('Original')
plt.xlim([0, len(data)])
fig.add_subplot(212)
plt.plot(sub_data)
plt.xlim([0, len(sub_data)])
plt.title('Subsampled by %d' % sub_amt)
plt.show()
Output:
