Parsing a file and creating a histogram with Python

I have to create a histogram from a source file that I have to parse:
for line in fp:
    data = line.split('__')
    if len(data) == 3 and data[2] != '\n' and data[1] != '':
        job_info = data[0].split(';')
        [...]
        job_times_req = data[2].split(';')
        if len(job_times_req) == 6:
            cpu_req = job_times_req[3]
The parsing is correct (I have tried it), but now I would like to create a histogram of how many times each CPU has been called. For example, if the first CPU was called 10 times, the second 4 times, and so on, I would like to see a histogram of that.
I have tried something like:
a.append(cpu_req)
plt.hist(a, 100)
plt.xlabel('CPU N°', fontsize=20)
plt.ylabel('Number of calls', fontsize=20)
plt.show()
but it is not working. How can I store the data in the correct way to show it in a histogram?
Solved with a simple cast:
a.append(int(cpu_req))
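For context, a minimal end-to-end sketch (assuming the file layout described above and a hypothetical file name) would be:
import matplotlib.pyplot as plt

a = []
with open('jobs.log') as fp:  # hypothetical file name
    for line in fp:
        data = line.split('__')
        if len(data) == 3 and data[2] != '\n' and data[1] != '':
            job_times_req = data[2].split(';')
            if len(job_times_req) == 6:
                a.append(int(job_times_req[3]))  # cast to int so the values are numeric

plt.hist(a, 100)
plt.xlabel('CPU N°', fontsize=20)
plt.ylabel('Number of calls', fontsize=20)
plt.show()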

Related

Very high memory usage with simple Python loop

I have the following code, which reads in a set of (small) observations, runs a cross-correlation calculation on them, and then saves some plots:
import os
import matplotlib.pyplot as plt
import numpy as np
import astropy.units as u
from mpl_toolkits.axes_grid1 import make_axes_locatable
from sunkit_image.time_lag import cross_correlation, get_lags, max_cross_correlation, time_lag

time = np.linspace(0, 43200, num=int(43200/12))
timeu = time * u.s
for i in range(len(folders)):  # loop over all dates
    os.chdir('/Volumes/LaCie/timelags/RARs/' + folders[i])
    print(folders[i])
    for j in range(len(pairs)):  # iterates over every pair of data sets
        for x in range(36):  # sets up a sliding 2-hour window that shifts 20 min at a time
            ch_a = np.load('dc' + pairs[j][0] + '.npy', allow_pickle=True)[()][100*x:(100*x)+600, :, :]  # read in only necessary data (but entire file is only ~6 GB)
            ch_b = np.load('dc' + pairs[j][1] + '.npy', allow_pickle=True)[()][100*x:(100*x)+600, :, :]  # read in only necessary data (but entire file is only ~6 GB)
            ctime = timeu[100*x:(100*x)+600]  # sets up the correct time array
            print('ctime range:', ctime[0], ctime[-1], len(ctime))
            max_cc_map = max_cross_correlation(ch_a, ch_b, ctime)
            tl_map = time_lag(ch_a, ch_b, ctime)
            del ch_a  # trying to deal with memory issue
            del ch_b  # trying to deal with memory issue
            plt.close('all')  # making sure I don't just create endless open plots
            fig = plt.figure()
            ax = fig.add_subplot()
            im = ax.imshow(np.flip(tl_map, axis=0), cmap="cubehelix", vmin=-6000, vmax=6000)
            cax = make_axes_locatable(ax).append_axes("right", size="5%", pad="10%")
            fig.colorbar(im, cax=cax, label=r"$\tau_{AB}$ [s]")
            plt.tight_layout()
            fig.savefig('timelag_' + pairs[j][0] + '_' + pairs[j][1] + '_' + str(x) + '.png', dpi=400)
            fig = plt.figure()
            ax = fig.add_subplot()
            im = ax.imshow(np.flip(max_cc_map, axis=0), cmap="plasma", vmin=0, vmax=1)
            cax = make_axes_locatable(ax).append_axes("right", size="5%", pad="10%")
            fig.colorbar(im, cax=cax, label=r"Max Cross-correlation")
            plt.tight_layout()
            fig.savefig('maxcc_' + pairs[j][0] + '_' + pairs[j][1] + '_' + str(x) + '.png', dpi=400)
            fig = plt.figure(figsize=(10, 6))
            values_tl, bins_tl, bars = plt.hist(np.ravel(np.asarray(tl_map)),
                                                bins=np.arange(-6000, 6000, 12000/50), log=True, label='Time Lags')
            values_masked, bins_masked, bars = plt.hist(np.ravel(np.asarray(tl_map)[np.where(np.asarray(max_cc_map) > 0.25)]),
                                                        bins=np.arange(-6000, 6000, 12000/50), log=True, label='Masked CC > 0.25')
            values_masked2, bins_masked2, bars = plt.hist(np.ravel(np.asarray(tl_map)[np.where(np.asarray(max_cc_map) > 0.5)]),
                                                          bins=np.arange(-6000, 6000, 12000/50), log=True, label='Masked CC > 0.5')
            values_masked3, bins_masked3, bars = plt.hist(np.ravel(np.asarray(tl_map)[np.where(np.asarray(max_cc_map) > 0.75)]),
                                                          bins=np.arange(-6000, 6000, 12000/50), log=True, label='Masked CC > 0.75')
            plt.ylabel('Pixel Occurrence')
            plt.legend()
            fig.savefig('hist_tl_cc_' + pairs[j][0] + '_' + pairs[j][1] + '_' + str(x) + '.png', dpi=400)
As noted in the comments, I've inserted a few lines to try to drop unnecessary data between iterations. I know a 3-deep for loop isn't the most efficient way to code, but the loops over the dates and channel pairs are very short; almost all of the time and memory is spent in the innermost loop. The problem is that after a few minutes, the memory usage oscillates between 30 and 55 GB. My Mac becomes sluggish, and it's only at the beginning of the dataset. Is there something I'm missing here? Even if the entire files were being read in at the start instead of a subset, that would only be ~12 GB of data, and the code would crash if I were reading in the whole thing (i.e., it's definitely only reading in part of the raw data). I tried a with statement, but that didn't reduce the memory use. Any suggestions would be very welcome!
Per loop iteration you create three figures, but you never close them. After each fig.savefig(...), you should close the figure with plt.close(fig).
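A minimal sketch of that pattern (shown for one of the three figures, with placeholder data standing in for the real time-lag map) would be:
import matplotlib.pyplot as plt
import numpy as np

tl_map = np.random.uniform(-6000, 6000, (100, 100))  # placeholder standing in for the real time-lag map

fig = plt.figure()
ax = fig.add_subplot()
im = ax.imshow(np.flip(tl_map, axis=0), cmap="cubehelix", vmin=-6000, vmax=6000)
fig.colorbar(im, ax=ax, label=r"$\tau_{AB}$ [s]")
fig.savefig('timelag_example.png', dpi=400)
plt.close(fig)  # release the figure's memory as soon as it has been saved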

How to compute the magnitude of a data file

I have a two-component data file (called RealData) that I am able to load and plot in Python using matplotlib with the following code:
x = RealData[:, 0]
y = RealData[:, 1]
plt.plot(x, y)
The first few lines of the data are:
1431.11555,-0.02399
1430.15118,-0.02387
1429.18682,-0.02294
1428.22245,-0.02167
1427.25809,-0.02066
1426.29373,-0.02020
1425.32936,-0.02022
1424.36500,-0.02041
1423.40064,-0.02047
1422.43627,-0.02029
1421.47191,-0.01993
1420.50755,-0.01950
1419.54318,-0.01913
1418.57882,-0.01888
.........
I would like to plot the magnitude of the data so that the y component becomes positive, something like
|y| = sqrt((-0.02399)^2 + (-0.02387)^2 + ...)
I think this would involve some sort of for loop or while loop, but I am not sure how to construct it. Any help?
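No explicit loop is needed if the data is loaded with NumPy. A minimal sketch of one interpretation (taking the element-wise absolute value of y), assuming the file is comma-separated as shown and using a hypothetical file name, would be:
import numpy as np
import matplotlib.pyplot as plt

RealData = np.loadtxt('RealData.txt', delimiter=',')  # hypothetical file name
x = RealData[:, 0]
y = RealData[:, 1]

y_abs = np.abs(y)                # element-wise magnitude: every value becomes positive
y_norm = np.sqrt(np.sum(y**2))   # single Euclidean norm of the whole column, as in the formula above

plt.plot(x, y_abs)
plt.show()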

Adding multiple images to a matplotlib subplot?

I am trying to make a matplotlib plot using some image data I have in NumPy format, and I was wondering if someone could advise me on the best way to display multiple of these images within the boundaries of one subplot.
For example, using the following code...
n_samples = 10
sample_imgs, min_index = visualise_n_way(n_samples)
print(min_index)
print(sample_imgs.shape)
print(sample_imgs[0].shape)
print(x_train_w)
print(x_train_h)
img_matrix = []
for index in range(1, len(sample_imgs)):
    img_matrix.append(np.reshape(sample_imgs[index], (x_train_w, x_train_h)))
img_matrix = np.asarray(img_matrix)
img_matrix = np.vstack(img_matrix)
f, ax = plt.subplots(1, 3, figsize = (10, 12))
f.tight_layout()
ax[0].imshow(np.reshape(sample_imgs[0], (x_train_w, x_train_h)),vmin=0, vmax=1,cmap='Greys')
ax[0].set_title("Test Image")
ax[1].imshow(img_matrix ,vmin=0, vmax=1,cmap='Greys')
ax[1].set_title("Support Set")
ax[2].imshow(np.reshape(sample_imgs[min_index], (x_train_w, x_train_h)),vmin=0, vmax=1,cmap='Greys')
ax[2].set_title("Image most similar to Test Image in Support Set")
I get the following image and output
1
(11, 784)
(784,)
28
28
[Matplotlib output figure omitted]
What I would like to do, however, is to have the second subplot, the one displaying img_matrix, be the same size as the two on either side of it, creating a grid of the images, sort of like this sketch.
I am at a loss as to how to do this. I believe I may need to use something such as GridSpec, but I'm finding the documentation hard to follow for what I want to do.
Any help is greatly appreciated!
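One possible direction (not from the original thread): instead of GridSpec, tile the support images into a single 2D array and show that in the middle axes, so it keeps the same size as its neighbours. A sketch with placeholder data, assuming 28x28 images and a 2x5 grid:
import numpy as np
import matplotlib.pyplot as plt

x_train_w, x_train_h = 28, 28            # assumed image size
sample_imgs = np.random.rand(11, 784)    # placeholder standing in for the real data
rows, cols = 2, 5                        # grid layout for the 10 support images

# tile images 1..10 into one (rows*28) x (cols*28) array
grid = (sample_imgs[1:]
        .reshape(rows, cols, x_train_w, x_train_h)
        .transpose(0, 2, 1, 3)
        .reshape(rows * x_train_w, cols * x_train_h))

f, ax = plt.subplots(1, 3, figsize=(10, 4))
ax[0].imshow(sample_imgs[0].reshape(x_train_w, x_train_h), vmin=0, vmax=1, cmap='Greys')
ax[0].set_title("Test Image")
ax[1].imshow(grid, vmin=0, vmax=1, cmap='Greys')
ax[1].set_title("Support Set")
ax[2].imshow(sample_imgs[-1].reshape(x_train_w, x_train_h), vmin=0, vmax=1, cmap='Greys')
ax[2].set_title("Most similar image")
plt.show()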

Struggling to correctly utilize the str.find() function

I'm trying to use str.find() and it keeps raising an error; what am I doing wrong?
I have a matrix where the 1st column is numbers and the 2nd is an abbreviation assigned to them. The abbreviations are either ED, LI or NA. I'm trying to find the positions that correspond to those letters so that I can plot a scatter graph that is colour-coded to match those 3 groups.
# imports inferred from the code below
import scipy.io as sio
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.decomposition import PCA

mat = sio.loadmat('PBMC_extract.mat')  # loading the data file into python
data = mat['spectra']
data_name = mat['name']  # calling in variable
data_name = pd.DataFrame(data_name)  # converting into a readable matrix
pca = PCA(n_components=20)  # performs pca on data with 20 components
pca.fit(data)  # fits to data set
datatrans = pca.transform(data)  # transforms data using PCA
# plotting the graph that accounts for majority of data and noise
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('Number of components')
plt.ylabel('Cumulative explained variance')
fig = plt.figure()
ax1 = Axes3D(fig)
# str.find to find individual positions of anticoagulants
str.find(data_name, 'ED')
# renaming data for easiness
x_data = datatrans[0:35, 0]
y_data = datatrans[0:35, 1]
z_data = datatrans[0:35, 2]
x2_data = datatrans[36:82, 0]
y2_data = datatrans[36:82, 1]
z2_data = datatrans[36:82, 2]
x3_data = datatrans[83:97, 0]
y3_data = datatrans[83:97, 1]
z3_data = datatrans[83:97, 2]
# scatter plot of score of PC1,2,3
ax1.scatter(x_data, y_data, z_data, c='b', marker="^")
ax1.scatter(x2_data, y2_data, z2_data, c='r', marker="o")
ax1.scatter(x3_data, y3_data, z3_data, c='g', marker="s")
ax1.set_xlabel('PC 1')
ax1.set_ylabel('PC 2')
ax1.set_zlabel('PC 3')
plt.show()
The error that keeps showing up is the following:
File "/Users/emma/Desktop/Final year project /working example of colouring data", line 49, in <module>
str.find(data_name,'ED')
TypeError: descriptor 'find' requires a 'str' object but received a 'DataFrame'
The error occurs because the find method expects a str object, not a DataFrame object. As PiRK mentioned, the problem is that you're replacing the data_name variable here:
data_name = pd.DataFrame(data_name)
I believe it should be:
data = pd.DataFrame(data_name)
Also, although str.find(data_name, 'ED') works, the suggested way is to call find on the string and pass only the search term, like this:
data_name.find('ED')
The proper syntax would be
data_name.find('ED')
Look at the examples here:
https://www.programiz.com/python-programming/methods/string/find
EDIT 1
Though I just noticed data_name is a pandas DataFrame, so that won't work. What exactly are you trying to do?
Your broken function call isn't even assigned to a variable, so it's hard to answer your question.
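For what the question seems to be after (row positions per abbreviation so the scatter can be colour-coded), a hedged sketch using a boolean mask on a DataFrame column, with made-up stand-in data, might look like:
import numpy as np
import pandas as pd

# stand-ins for mat['name'] and the PCA scores from the question
data_name = pd.DataFrame({'label': ['ED', 'LI', 'NA', 'ED', 'LI']})
datatrans = np.random.rand(5, 3)

ed_idx = data_name.index[data_name['label'] == 'ED'].to_numpy()  # row positions labelled 'ED'
li_idx = data_name.index[data_name['label'] == 'LI'].to_numpy()
na_idx = data_name.index[data_name['label'] == 'NA'].to_numpy()

x_ed, y_ed, z_ed = datatrans[ed_idx, 0], datatrans[ed_idx, 1], datatrans[ed_idx, 2]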

Runs out of memory when plotting, Python

I'm retrieving a large amount of data from a database, which I later plot using a scatter plot. However, I run out of memory and the program aborts when I use my full data set. Just for the record, it takes >30 minutes to run this program, and the length of the data list is about 20-30 million.
import sqlite3 as lite  # 'lite' is assumed to be sqlite3
import matplotlib.pyplot as plt
from mpl_toolkits.basemap import Basemap

map = Basemap(projection='merc',
              resolution='c', area_thresh=10,
              llcrnrlon=-180, llcrnrlat=-75,
              urcrnrlon=180, urcrnrlat=82)
map.drawcoastlines(color='black')
# map.fillcontinents(color='#27ae60')

with lite.connect('database.db') as con:
    start = 1406851200
    end = 1409529600
    cur = con.cursor()
    cur.execute('SELECT latitude, longitude FROM plot WHERE unixtime >= {start} AND unixtime < {end}'.format(start=start, end=end))
    data = cur.fetchall()
    y, x = zip(*data)
    x, y = map(x, y)
    plt.scatter(x, y, s=0.05, alpha=0.7, color="#e74c3c", edgecolors='none')
    plt.savefig('Plot.pdf')
    plt.savefig('Plot.png')
I think my problem may be in the zip(*data) call, but I really have no clue. I'm interested both in how I can use less memory by rewriting my existing code and in how to split up the plotting process. My idea is to split the time period in half and do the same thing twice for the two periods before saving the figure, but I am unsure whether this will help at all. If the problem is the plotting itself, I have no idea what to do.
If you think the problem lies in the zip function, why not use a NumPy array to massage your data into the right format? Something like this:
import numpy

data = numpy.array(cur.fetchall())
lat = data[:, 0]
lon = data[:, 1]
x, y = map(lon, lat)
Also, your generated PDF will be very large and slow to render in the various PDF readers, because it is a vector format by default: all your millions of data points are stored as floats and rendered when the user opens the document. I recommend adding the rasterized=True argument to your plt.scatter() call. This will save the result as a bitmap inside your PDF (see the docs here).
If all this doesn't help, I would investigate further by commenting out lines starting from the end. That is, first comment out plt.savefig('Plot.png') and see whether the memory use goes down. If not, comment out the line before that, and so on.
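A minimal sketch of the rasterized-scatter suggestion, with placeholder points standing in for the projected database coordinates, would be:
import numpy as np
import matplotlib.pyplot as plt

# placeholder points standing in for the projected latitude/longitude data
x = np.random.uniform(0, 1e7, 1_000_000)
y = np.random.uniform(0, 1e7, 1_000_000)

plt.scatter(x, y, s=0.05, alpha=0.7, color="#e74c3c",
            edgecolors='none', rasterized=True)  # the points are stored as a bitmap inside the PDF
plt.savefig('Plot.pdf', dpi=300)                 # dpi controls the resolution of the rasterized layer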
