optimise zarr array processing - python

I have a list (mylist) of 80 5-D zarr files with the following structure (T, F, B, Az, El). Each array has shape [24x4096x2016x24x8].
I want to extract sliced data and compute a probability along some of the axes using the following function:
def GetPolarData(mylist, freq, FreqLo, FreqHi):
    """
    Take the list of zarr files (T, F, B, Az, El), open them, and use the selected
    frequency band to return a list with the Azimuth and Elevation probabilities of each file.
    """
    ChanIndx = FreqCut(FreqLo, FreqHi, freq)
    if len(ChanIndx) != 0:
        MyData = []
        for i in range(len(mylist)):
            print('Adding file {} : {}'.format(i, mylist[i][32:]))
            try:
                zarrf = xr.open_zarr(mylist[i], group='arr')
                # sum over time and baseline first, then over the selected channels
                m = zarrf.master.sum(dim=['time', 'baseline'])
                m = m[ChanIndx].sum(dim=['frequency'])
                c = zarrf.counter.sum(dim=['time', 'baseline'])
                c = c[ChanIndx].sum(dim=['frequency'])
                p = m.astype(float) / c.astype(float)
                MyData.append(p)
            except Exception as e:
                print("Something went wrong in Frequency selection")
        print("This will be contribution to selected band")
        print(f"Min {np.nanmin(MyData)*100:.3f}% ")
        print(f"Max {np.nanmax(MyData)*100:.3f}% ")
        print(f"Average {np.nanmean(MyData)*100:.3f}% ")
        return MyData
If I call the function using the following,
FreqLo = 470.
FreqHi = 854.
MyTVData =np.array(GetPolarData(AllZarrList,Freq, FreqLo, FreqHi))
I find it takes very long to run (over 3 hours) on a machine with 40 cores and 256 GB of RAM.
Is there a way to make this run faster?
Thank you

It seems like you could take advantage of parallelization here: each array is read only once, and they are all processed independently of each other.
xarray and other libraries may already do some computation in parallel, but for your application the multiprocessing library could help share the work more evenly among the cores.
The best way to get good performance is to profile first: the standard-library profilers (cProfile and profile) show the most time-consuming parts of your code. I suggest you run the profiler on a single-process version of your code, as it will be easier to use.
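As a rough illustration of both suggestions, here is a minimal sketch written with the standard library only. The function process_one is a hypothetical stand-in for the per-file work (opening one zarr store, summing, dividing), and the file names and pool size are made up; only the structure matters.

```python
import cProfile
import pstats
from multiprocessing import Pool


def process_one(path):
    """Hypothetical stand-in for the per-file work on one zarr store."""
    # Assumption: the real code would call xr.open_zarr(path, group='arr')
    # here and return the probability array for that file.
    return sum(ord(ch) for ch in path) % 97


def run_serial(paths):
    """Single-process version: the easiest one to profile."""
    return [process_one(p) for p in paths]


def run_parallel(paths, processes=4):
    """Spread the independent per-file jobs over a pool of worker processes."""
    with Pool(processes=processes) as pool:
        return pool.map(process_one, paths)


def profile_serial(paths, top=5):
    """Profile the serial version and print the `top` most expensive calls."""
    profiler = cProfile.Profile()
    results = profiler.runcall(run_serial, paths)
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(top)
    return results
```

Calling profile_serial on a handful of files should show whether the time goes into reading, summing, or something else. run_parallel is best called from under an `if __name__ == "__main__":` guard, and it only pays off if the per-file work dominates the cost of sending the results back to the parent process.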


Creating loop with particular behaviour depending on data length

In my program I have a part of the code that uses an Exponential Moving Average (EMA) 4 times, each time with a different length. The program uses one or more EMAs depending on how much data it gets.
For now the code is not looped, just copy-pasted with minor tweaks. That makes changes difficult because I have to change everything 4 times.
Can somebody help me loop the code in such a way that it won't lose its behaviour pattern? The mock-up code is presented here:
import random
import numpy as np
def SI_sma(data, zakres):
    smas = np.convolve(data, weights, 'valid')
    return smas
def SI_ema(data, zakres):
    weights_ema = np.exp(np.linspace(-1., 0., zakres))
    weights_ema /= weights_ema.sum()
    return ema
while True:
    if len(data) > zakres[0]:
        smas = SI_sma(data=data, zakres=zakres[0])
        ema = SI_ema(data=data, zakres=zakres[0])
        print(smas[-1]) #calc using smas
        print(ema[-1]) #calc using ema1
    if len(data) > zakres[1]:
        ema2 = SI_ema(data=data, zakres=zakres[1])
        print(ema2[-1]) #calc using ema2
    if len(data) > zakres[2]:
        ema3 = SI_ema(data=data, zakres=zakres[2])
        print(ema3[-1]) #calc using ema3
    if len(data) > zakres[3]:
        ema4 = SI_ema(data=data, zakres=zakres[3])
        print(ema4[-1]) #calc using ema4
    input("press a key")
A variable number of variables is usually a bad idea. As you have found, it makes maintaining code cumbersome and error-prone. Instead, you can define a dict of results and use a for loop to iterate over the scenarios, computing len(data) just once per pass.
ema = {}
while True:
    n = len(data)
    for i, val in enumerate(zakres):
        if n > val:
            # the SMA is only needed for the first (shortest) window, zakres[0]
            if i == 0:
                smas = SI_sma(data=data, zakres=val)
            ema[i] = SI_ema(data=data, zakres=val)
You can then access results via ema[0], ..., ema[3] as required.
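For reference, here is a self-contained sketch of the same idea that can be run as is. The question's code leaves out how the SMA weights and the final EMA convolution are defined, so the uniform SMA weights, the reversed EMA kernel and the function names below are assumptions made only for illustration.

```python
import numpy as np


def si_sma(data, window):
    """Simple moving average; uniform weights are an assumption of this sketch."""
    weights = np.ones(window) / window
    return np.convolve(data, weights, "valid")


def si_ema(data, window):
    """Exponentially weighted average using the weights from the question."""
    weights = np.exp(np.linspace(-1.0, 0.0, window))
    weights /= weights.sum()
    # np.convolve flips the kernel, so reverse it to put the largest weight
    # on the newest sample (this convention is assumed, not the asker's code).
    return np.convolve(data, weights[::-1], "valid")


def latest_averages(data, windows):
    """Return (sma, {index: ema}) for every window shorter than the data."""
    n = len(data)
    sma = None
    ema = {}
    for i, window in enumerate(windows):
        if n > window:
            if i == 0:
                sma = si_sma(data, window)[-1]
            ema[i] = si_ema(data, window)[-1]
    return sma, ema
```

With 12 data points and windows of 5, 10, 30 and 40, only ema[0] and ema[1] are filled in; the remaining keys appear once enough data has arrived, so adding a fifth window is a one-element change to the list.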

Implementing multiprocessing to deal with heavy input/output on HPC

I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits files (.fits is a data storage format), each file contains around 600-1000 spectra, and each spectrum has around 4500 elements (so each file returns a ~1000*4500 matrix). That means each spectrum is going to be read around 10 times (or each file around 10,000 times) if I loop over the 10 million entries. Although the same spectrum is read around 10 times, the reads are not duplicates because each time I extract a different segment of the same spectrum. With the help of @Paul Panzer, I already avoid reading the same file multiple times.
I have a catalog file which contains all the information I need, like the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information that identifies which file I am going to read (given by n1, n2) and which spectrum in that file I am going to use (given by n3).
The code I have now is:
import numpy as np
from itertools import izip
import itertools
import fitsio
x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
    for line1, line2 in izip(file_ID,file_c):
        parts1 = line1.split()
        parts2 = line2.split()
def data_analysis(n_galaxies):
    n_num = 0
    data = np.zeros((n_galaxies), dtype=[('spec','f4',(200)),('x','f8'),('y','f8'),('r','f8'),('s','f8')])
    # sort by file (n1, n2) first, then by spectrum (n3), so each file is read once
    idx = np.lexsort((n3,n2,n1))
    for kk,gg in itertools.groupby(zip(idx, n1[idx], n2[idx]), lambda x: x[1:]):
        filename = "../../data/" + str(kk[0]) + "/spPlate-" + str(kk[0]) + "-" + str(kk[1]) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        fluxx = fits_spectra[0].read()
        n_element = fluxx.shape[1]
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(n_element)
        wavegrid = np.power(10,logwave)
        for ss, plate1, mjd1 in gg:
            if n_num % 1000000 == 0:
                print n_num
            n3new = n3[ss]-1
            flux = fluxx[n3new]
            ### following is my data reduction of individual spectra, I will skip here
            ### After all my analysis, I have the data storage as below:
            data['spec'][n_num] = flux_intplt
            data['x'][n_num] = x[ss]
            data['y'][n_num] = y[ss]
            data['r'][n_num] = r[ss]
            data['s'][n_num] = s[ss]
            n_num += 1
    print n_num
    data_output = fitsio.FITS('./analyzedDATA/data_ALL.fits','rw')
I roughly understand that multiprocessing requires removing one loop and passing the index to a function instead. However, there are two loops in my function and they are tightly coupled, so I do not know how to approach this. Since the most time-consuming part of this code is reading files from disk, the multiprocessing needs to take full advantage of the cores to read multiple files at once. Could anyone shed some light on this?
Get rid of the global variables; worker processes do not share them.
Merge your multiple global variables into one container class or dict, assigning the different segments of the same spectrum to one data set.
Move your global with open(... into a def ...
Separate data_output into its own def ...
Try this concept first, without multiprocessing:
for line1, line2 in izip(file_ID,file_c):
    data_set = create data set from (line1, line2)
    result = data_analysis(data_set)
Consider using 2 processes, one for file reading and one for file writing.
Use multiprocessing.Pool(processes=n) for data_analysis.
Communicate between processes using multiprocessing.Manager().Queue()
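One possible shape for these suggestions is sketched below, written for Python 3 and the standard library only. The catalog rows are assumed to be dicts with the keys n1, n2, n3, x and y, and analyse_file is a hypothetical worker; the actual FITS reading and data reduction are left as a comment because they depend on the asker's code.

```python
from collections import defaultdict
from multiprocessing import Pool


def group_by_file(catalog):
    """Group catalog rows by the (n1, n2) pair that identifies one file."""
    groups = defaultdict(list)
    for row in catalog:
        groups[(row["n1"], row["n2"])].append(row)
    return sorted(groups.items())


def analyse_file(job):
    """Worker: handle every catalog row that points into one file."""
    (n1, n2), rows = job
    # Assumption: the real worker would read the file once here, e.g. with
    # fitsio.FITS(filename)[0].read(), then reduce spectrum row["n3"] - 1
    # for each row. This sketch only passes the catalog values through.
    return [(n1, n2, row["n3"], row["x"], row["y"]) for row in rows]


def run(catalog, processes=4):
    """One task per file, so each file is opened by exactly one worker."""
    jobs = group_by_file(catalog)
    with Pool(processes=processes) as pool:
        per_file = pool.map(analyse_file, jobs)
    return [item for chunk in per_file for item in chunk]
```

Because every task owns one whole file, the two coupled loops of the original function collapse into "one task per outer iteration", and no global lists need to be shared. Writing the combined result can then stay in the parent process, after run returns.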

Multiprocess or threading with huge data structure for RAM and speed issues. Python 2.7

I'm writing an application that runs an MST algorithm on a huge graph (around 100 to 150 million edges) in Python 2.7. The graph is set up as an adjacency list using a classic class with methods like:
def insertArcW(self, tail, head, weight):
if head in self.nodes and tail in self.nodes:
def insertNode(self, e):
newnode = Node(self.nextId, e)
self.nextId += 1
I'm also using a linked list (created with an array) and the queue from the Python standard library (version 2.7).
With this piece of code the insert is really fast (because there are far fewer nodes than edges):
n = []
for i in xrange(int(file_List[0])):
The problem comes with the insertion of the edges:
for e in xrange(len(arc_List)):
    G.insertArcW(n[arc_List[e][0]].index, n[arc_List[e][1]].index,arc_List[e][2])
    G.insertArcW(n[arc_List[e][1]].index, n[arc_List[e][0]].index,arc_List[e][2])
It works great with 1 million edges, but with more it eats all of my RAM (4 GB, 64-bit). It does not freeze, but building the graph takes a very long time. Given that CPU usage stays at 19-25% while doing this, is there a way to do it with multiprocessing or multithreading? For example, building the graph with two cores doing the same operation at the same time on different data: one core working on half of the edges and another core on the other half.
I'm practically new to this area of programming, above all in Python.
EDIT: I use this function to set up two lists, one for nodes and one for edges. I need to take the information from a ".txt" file. With insertArcW and insertNode called in there, RAM usage oscillates between 2.4 GB and 2.6 GB. Now I can say that it is stable (maybe due to the "delete" of the two huge lists of edges and nodes), but it always runs at the same speed. Code:
f = open(graph + '.txt','r')
v = f.read()
file_List = re.split('\s+',v)
arc_List = []
n = []
p = []
for x in xrange(0,int(file_List[1])):
for i in xrange(int(file_List[0])):
for weight in xrange(1,int(file_List[1])+1):
i = 0
r = 0
while r < int(file_List[1]):
for k in xrange(2,len(file_List),2):
arc_List[r][0] = int(file_List[k])
arc_List[r][1] = int(file_List[k+1])
arc_List[r][2] = float(p[i])
G.insertArcW(n[arc_List[r][0]].index, n[arc_List[r][1]].index,arc_List[r][2])
G.insertArcW(n[arc_List[r][1]].index, n[arc_List[r][0]].index,arc_List[r][2])
print r

MemoryError with large sparse matrices

For a project I have built a program that constructs large matrices.
def ExpandSparse(LNew):
SpId = ssp.csr_matrix(np.identity(MS))
Sz = MS**LNew
HNew = ssp.csr_matrix((Sz,Sz))
Bulk = dict()
for i in range(LNew-1):
for j in range(LNew-1):
if i == j:
Ha = ssp.csr_matrix((8,8))
for i in range(LNew-1):
for j in range(LNew-2):
if j < 1:
Ha = ssp.csr_matrix(ssp.kron(Bulk[(i,j)],Bulk[(i,j+1)]))
Ha = ssp.csr_matrix(ssp.kron(Ha,Bulk[(i,j+1)]))
HNew = HNew + Ha
except MemoryError:
print('The matrix you tried to build requires too much memory space.')
return HNew
This does the job, but it does not work as well as I expected. The problem is that it won't allow for really large matrices: when LNew is larger than 13 I get a MemoryError. My experience with numpy suggests that, memory-wise, I should be able to get LNew up to 18 or 19 before I get this error. Does this have to do with my code, or with the way scipy.sparse.kron() works with these matrices?
Another note that might be important is that I use Windows, not Linux.
After some more reading on how scipy.sparse.kron() works, I noticed that it takes a third parameter named format. The default is None, but when it is set to 'csr' or another supported format, the result is built directly in that sparse format, which is a lot more efficient. With that change I can now build a 2097152 x 2097152 matrix, which corresponds to LNew = 21.
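A minimal sketch of that call is shown below. It assumes MS = 2 (which matches 2**21 = 2097152) and uses a made-up 2 x 2 term in place of the real Hamiltonian pieces; the only point is passing format='csr' to every kron call.

```python
import numpy as np
import scipy.sparse as ssp

MS = 2  # assumed local dimension; 2**21 == 2097152 matches the size quoted above
SpId = ssp.identity(MS, format="csr")

# Hypothetical 2 x 2 term; any small sparse matrix works for the illustration.
term = ssp.csr_matrix(np.array([[0.0, 1.0], [1.0, 0.0]]))

H = term
for _ in range(10):
    # Ask kron for CSR output directly instead of converting afterwards.
    H = ssp.kron(H, SpId, format="csr")

print(H.shape, H.format, H.nnz)
```

Each pass doubles the matrix dimension while the number of stored entries only grows with the number of non-zeros, so memory use stays far below that of a dense array of the same shape.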

Python: numpy.corrcoef Memory Error

I was trying to calculate the correlation between a large set of data read from a text file. For an extremely large data set the program gives a memory error. Can anyone please tell me how to correct this problem? Thanks
The following is my code:
import numpy
from numpy import *
from array import *
from decimal import *
import sys
Threshold = 0.8;
TopMostData = 10;
FileName = sys.argv[1]
File = open(FileName,'r')
SignalData = numpy.empty((1, 128));
SignalData[:][:] = 0;
for line in File:
    TempLine = line.split();
    TempInt = [float(i) for i in TempLine]
    SignalData = vstack((SignalData,TempInt))
    del TempLine;
    del TempInt;
TempData = SignalData;
SignalData = SignalData[1:,:]
SignalData = SignalData[:,65:128]
print "File Read | Data Stored" + " | Total Lines: " + str(len(SignalData))
CorrelationData = numpy.corrcoef(SignalData)
The following is the error:
Traceback (most recent call last):
File "Corelation.py", line 36, in <module>
CorrelationData = numpy.corrcoef(SignalData)
File "/usr/lib/python2.7/dist-packages/numpy/lib/function_base.py", line 1824, in corrcoef
return c/sqrt(multiply.outer(d, d))
You run out of memory, as the comments show. If that happens because you are using 32-bit Python, even the method below will fail. But for 64-bit Python with limited RAM we can do a lot, because the correlations are easily calculated piecewise: you only need two rows in memory simultaneously.
So, you may split your input into, say, 1000-row chunks; the resulting 1000 x 1000 matrices are then easy to keep in memory. You can then assemble your result into the big output matrix, which does not necessarily have to be in RAM. I recommend this approach even if you have a lot of RAM, because it is much more memory-friendly. Correlation coefficient calculation is not an operation where fast random access would help a lot if the input can be kept in RAM.
Unfortunately, the numpy.corrcoef does not do this automatically, and we'll have to roll our own correlation coefficient calculation. Fortunately, that is not as hard as it sounds.
Something along these lines:
import numpy as np
# number of rows in one chunk
SPLITROWS = 1000
# the big table, which is usually bigger
bigdata = np.random.random((27000, 128))
numrows = bigdata.shape[0]
# subtract means from the input data
bigdata -= np.mean(bigdata, axis=1)[:,None]
# normalize the data
bigdata /= np.sqrt(np.sum(bigdata*bigdata, axis=1))[:,None]
# reserve the resulting table onto HDD
res = np.memmap("/tmp/mydata.dat", 'float64', mode='w+', shape=(numrows, numrows))
for r in range(0, numrows, SPLITROWS):
    for c in range(0, numrows, SPLITROWS):
        r1 = r + SPLITROWS
        c1 = c + SPLITROWS
        chunk1 = bigdata[r:r1]
        chunk2 = bigdata[c:c1]
        res[r:r1, c:c1] = np.dot(chunk1, chunk2.T)
Some notes:
the code above is tested against np.corrcoef(bigdata)
if you have complex values, you'll need to create a complex output array res and take the complex conjugate of chunk2.T
the code overwrites bigdata in place to maintain performance and minimize memory use; if you need to preserve it, make a copy
The above code takes about 85 s to run on my machine, but the data will mostly fit in RAM, and I have an SSD. The algorithm is ordered so as to avoid overly random access to the disk, i.e. the access is reasonably sequential. In comparison, the non-memmapped standard version is not significantly faster even if you have a lot of memory. (Actually, it took a lot more time in my case, but I suspect I ran out of my 16 GiB and then there was a lot of swapping going on.)
You can make the actual calculations faster by omitting half of the matrix, because res.T == res. In practice, you can omit all blocks where c > r and then mirror them later on. On the other hand, the performance is most likely limited by the disk performance, so other optimizations do not necessarily bring much more speed.
Of course, this approach is easy to make parallel, as the chunk calculations are completely independent. Also the memmapped array can be shared between threads rather easily.
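The symmetry remark can be sketched as follows. The function name chunked_corrcoef and its parameters are made up for this example, and for simplicity it mirrors each block immediately rather than in a later pass. The out argument may be a np.memmap, as in the code above, or a plain in-memory array.

```python
import numpy as np


def chunked_corrcoef(data, splitrows=1000, out=None):
    """Row-wise correlation matrix, computing only blocks with c <= r."""
    data = np.array(data, dtype=float)  # work on a copy, keep the input intact
    data -= data.mean(axis=1)[:, None]
    data /= np.sqrt((data * data).sum(axis=1))[:, None]
    n = data.shape[0]
    if out is None:
        out = np.empty((n, n))
    for r in range(0, n, splitrows):
        for c in range(0, r + 1, splitrows):  # skip blocks with c > r
            block = np.dot(data[r:r + splitrows], data[c:c + splitrows].T)
            out[r:r + splitrows, c:c + splitrows] = block
            if c != r:
                # fill the mirrored block from the one just computed
                out[c:c + splitrows, r:r + splitrows] = block.T
    return out
```

This roughly halves the number of matrix products. Whether it shortens the total run time depends on how much of it is spent writing to disk, since the mirrored blocks still have to be stored.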

