Implementing multiprocessing to deal with heavy input/output on HPC - python

I need to process over 10 million spectroscopic data sets. The data is structured like this: there are around 1000 .fits files (.fits is a data storage format); each file contains around 600-1000 spectra, and each spectrum has around 4500 elements (so each file yields a roughly 1000 x 4500 matrix). If I loop over the 10 million catalog entries, each spectrum will be read around 10 times (or each file around 10,000 times). Although the same spectrum is read around 10 times, the work is not duplicated, because each time I extract a different segment of the same spectrum. With the help of @Paul Panzer, I already avoid reading the same file multiple times.
I have a catalog file which contains all the information I need, like the coordinates x, y, the radius r, the strength s, etc. The catalog also contains the information to target which file I am going to read (identified by n1, n2) and which spectrum in that file I am going to use (identified by n3).
The code I have now is:
import numpy as np
from itertools import izip
import itertools
import fitsio

x = []
y = []
r = []
s = []
n1 = []
n2 = []
n3 = []
with open('spectra_ID.dat') as file_ID, open('catalog.txt') as file_c:
    for line1, line2 in izip(file_ID, file_c):
        parts1 = line1.split()
        parts2 = line2.split()
        n1.append(int(parts1[0]))
        n2.append(int(parts1[1]))
        n3.append(int(parts1[2]))
        x.append(float(parts2[0]))
        y.append(float(parts2[1]))
        r.append(float(parts2[2]))
        s.append(float(parts2[3]))

# convert to numpy arrays so they can be indexed with the sort order below
n1, n2, n3 = np.array(n1), np.array(n2), np.array(n3)
x, y, r, s = np.array(x), np.array(y), np.array(r), np.array(s)
def data_analysis(n_galaxies):
    n_num = 0
    data = np.zeros((n_galaxies), dtype=[('spec', 'f4', (200,)), ('x', 'f8'), ('y', 'f8'), ('r', 'f8'), ('s', 'f8')])
    idx = np.lexsort((n3, n2, n1))
    # group the sorted entries by (n1, n2) so each file is opened only once
    for kk, gg in itertools.groupby(zip(idx, n1[idx], n2[idx]), lambda x: x[1:]):
        filename = "../../data/" + str(kk[0]) + "/spPlate-" + str(kk[0]) + "-" + str(kk[1]) + ".fits"
        fits_spectra = fitsio.FITS(filename)
        fluxx = fits_spectra[0].read()
        n_element = fluxx.shape[1]
        hdu = fits_spectra[0].read_header()
        wave_start = hdu['CRVAL1']
        logwave = wave_start + 0.0001 * np.arange(n_element)
        wavegrid = np.power(10, logwave)
        for ss, plate1, mjd1 in gg:
            if n_num % 1000000 == 0:
                print n_num
            n3new = n3[ss] - 1
            flux = fluxx[n3new]
            ### following is my data reduction of individual spectra, I will skip it here
            ### After all my analysis, I store the data as below:
            data['spec'][n_num] = flux_intplt
            data['x'][n_num] = x[ss]
            data['y'][n_num] = y[ss]
            data['r'][n_num] = r[ss]
            data['s'][n_num] = s[ss]
            n_num += 1
    print n_num
    data_output = fitsio.FITS('./analyzedDATA/data_ALL.fits', 'rw')
    data_output.write(data)
I kind of understand that multiprocessing needs to remove one loop and pass the index to the function. However, there are two loops in my function and they are highly correlated, so I do not know how to approach this. Since the most time-consuming part of this code is reading files from disk, multiprocessing needs to take full advantage of the cores to read multiple files at a time. Could anyone shed some light on this?

Get rid of global variables; you can't share global variables between processes.
Merge your multiple global lists into one container class or dict, so that the different segments of the same spectrum end up in one data set.
Move your global with open(... block into a def ...
Separate data_output into its own def ...
Try this concept first, without multiprocessing:
    for line1, line2 in izip(file_ID, file_c):
        data_set = create data set from (line1, line2)
        result = data_analysis(data_set)
        data_output.write(result)
Consider using 2 processes: one for file reading and one for file writing.
Use multiprocessing.Pool(processes=n) for data_analysis, as sketched below.
Communicate between processes using multiprocessing.Manager().Queue()
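A minimal sketch of that Pool idea, under the assumption that the body of data_analysis's outer loop is wrapped into a helper (process_one_file is a hypothetical name) that takes one (n1, n2) file key plus the catalog rows pointing at it and returns the reduced rows; only the parent process writes the output file:

import multiprocessing as mp

def process_one_file(args):
    # args = (file_key, rows): one .fits file plus the catalog rows that use it.
    # Hypothetical wrapper around the body of data_analysis's outer loop:
    # open the file once, extract every requested spectrum segment, return the rows.
    file_key, rows = args
    results = []
    for row in rows:
        # ... read the spectrum indexed by the row, do the reduction ...
        results.append(row)          # placeholder for the reduced record
    return results

if __name__ == '__main__':
    # work_items would be built by grouping the catalog by (n1, n2),
    # exactly as the itertools.groupby loop does now.
    work_items = []                  # [(file_key, rows), ...]
    pool = mp.Pool(processes=4)      # roughly one worker per core doing I/O + reduction
    all_results = []
    for chunk in pool.imap_unordered(process_one_file, work_items):
        all_results.extend(chunk)    # only the parent touches the output file
    pool.close()
    pool.join()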

Related

How can I handle a file with 161 million lines?

I am trying to handle a big file, "mydata.dat", which is 3 GB in size and has 161,991,000 lines. The code calculates the distance between pairs of points using DensityPeakCluster; the number of points is 18,000.
A sample of the file looks like this:
1 2 26.23
1 3 44.49
1 4 47.17
and so on until
1 18000 23.5
then
2 3 25.2
2 4 15.2
until 2 18000 0.25, and so on until 17999 18000 0.25.
Block one of the code is:
from collections import defaultdict  # needed for the base class

class Graph(defaultdict):
    def __init__(self, input_file, sep=" ", header=False, undirect=True):
        super(Graph, self).__init__(dict)
        self.edges_num = 0
        with open(input_file) as f:
            if header:
                f.readline()
            for line in f:
                line = line.strip().split(sep)
                self[line[0]][line[1]] = float(line[2])
                self.edges_num += 1
                if undirect:
                    self[line[1]][line[0]] = float(line[2])
                    self.edges_num += 1

    def edges(self):
        edges_list = []
        for node1 in self:
            for node2 in self[node1]:
                edges_list.append((node1, node2))
        return edges_list
Block two of the code (the full code is too long to post here):
def edges_weight(self):
    weight_list = []
    for edge in self.edges():
        node1, node2 = edge
        weight_list.append([node1, node2, self[node1][node2]])
    weight_list = sorted(weight_list, key=lambda x: x[2])
    return weight_list

def get_weight(self, node1, node2):
    return self[node1][node2]

def get_weights(self):
    weights = []
    for edge in self.edges():
        weights.append(self.get_weight(edge[0], edge[1]))
    return weights

if __name__ == "__main__":
    input_file = "./data/mydata.dat"
    percent = 2.0
    output_file = "./data/results"
    G = Graph(input_file)
    position = round(G.number_of_edges() * percent / 100)
    dc = G.edges_weight()[position][2]
    print("average percentage of neighbours (hard coded): {}".format(percent))
    print("Computing Rho with gaussian kernel of radius: {}".format(dc))
    nodes = G.nodes()
    for i in range(G.number_of_nodes() - 1):
        for j in range(i + 1, G.number_of_nodes()):
            node_i = nodes[i]
            node_j = nodes[j]
            dist_ij = G.get_weight(node_i, node_j)
What happened to me:
1- The process got killed, so I tried to change the file reading to:
bigfile = open(input_file, 'r')
tmp_lines = bigfile.readlines(1024 * 1024)
for line in tmp_lines:
    line = line.strip().split(sep)
    self[line[0]][line[1]] = float(line[2])
    self.edges_num += 1
    if undirect:
        self[line[1]][line[0]] = float(line[2])
        self.edges_num += 1
2- but got:
    dist_ij = G.get_weight(node_i, node_j)
  in get_weight:
    return self[node1][node2]
KeyError: '6336'
3- I tried to use Google Colab, but it didn't work since its 12 GB of RAM was not enough for me. I asked about buying more RAM, but the real problem is that I can't manage the code well enough to make it use less memory for processing. I'm stuck on this problem and don't know what to do.
**1- My problem is how to deal with a file as big as mine. What approach should I use to handle this size?
2- If I use NumPy to load the file, would that decrease memory usage?**
The most straightforward answer is to not load the whole file at once. This can even be done one line at a time. For example, suppose you wanted the sum of the third column:
filename = 'file.dat'
lines = (float(line.split(' ')[2]) for line in open(filename))
print(sum(lines))
Here we did not load all the lines into memory. Instead we opened a file object and set up a Python generator. The generator holds the expression float(line.split(' ')[2]) and only evaluates it when a line is requested. The requests are driven by sum(), which pulls lines one at a time as needed, never holding more than one line in memory. So when we execute that last line we add up the values from the generator, keeping a running total. The point is that the code uses essentially no RAM beyond a single line (aside from interpreter overhead).
This can also be done a piece at a time. For example, to pull out only the lines that involve node '0':
filename = 'file.dat'
lines = (line.split(' ') for line in open(filename))
zeros = (line for line in lines if line[0] == '0' or line[1] == '0')
print(sum(float(c) for a, b, c in zeros))
This can of course be slower than loading some or all of the file into memory. Moreover, you have to consider how many times you want to iterate over the file like this. It is preferable to iterate over the lines only a few times, gathering all the calculations you want in each pass. You will then probably want to save those answers, because re-iterating over the file costs time.
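For instance, a single pass can collect several aggregates at once (a minimal sketch, using the same three-column layout and the placeholder filename 'file.dat' from the examples above):

filename = 'file.dat'

count = 0
total = 0.0
minimum = float('inf')
maximum = float('-inf')

# One pass over the file, one line in memory at a time, several results gathered.
for line in open(filename):
    d = float(line.split(' ')[2])
    count += 1
    total += d
    minimum = min(minimum, d)
    maximum = max(maximum, d)

print(count, total / count, minimum, maximum)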
In considering loading the file into memory, you need to double check what exactly you want to load and how. For example, do you want to load the values 1 2 in the line 1 2 26.23? If not, then strip those out to take up less memory. For example
import numpy as np
filename = 'file.dat'
values = (float(line.split(' ')[2]) for line in open(filename))
X = np.fromiter(values,dtype='float32',count=161991000)
By specifying the count we told NumPy exactly how much memory to allocate in advance (instead of having it re-grow the array every time it needs more memory). With a count of that size and a dtype of float32, we know this data will take up about 648 MB of RAM. So be careful not to write any operations that duplicate this data; if you write something that makes 5 copies of it, that will eat up RAM quickly.
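As an illustration of that last point (assuming X is the array created above), in-place operations reuse the existing buffer, while plain expressions allocate a new full-size array:

X *= 2.0          # in-place: modifies the existing ~648 MB buffer, no new allocation
Y = X * 2.0       # allocates a second full-size array alongside X
Z = X[1000:2000]  # a slice is a view, not a copy, so it costs almost nothing extra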
I think this gives you an idea of how to manage memory. :-)

Very large pandas dataframe - keeping count

Assume this is a sample of my data (screenshot of the dataframe omitted).
The entire dataframe is stored in a csv file (dataframe.csv) that is 40 GB, so I can't open all of it at once.
I am hoping to find the 25 most dominant names for each gender. My instinct is to create a for loop that runs through the file (because I can't open it all at once) and keep a Python dictionary that holds a counter for each name, which I increment as I go through the data.
To be honest, I'm confused about where to even start with this (how to create the dictionary, since to_dict() does not appear to do what I'm looking for). Also, is this even a good solution? Is there a more efficient way someone can think of?
SUMMARY -- sorry if the question is a bit long:
the csv file storing the data is very big and I can't open it at once, but I'd like to find the top 25 dominant names in the data. Any ideas on what to do and how to do it?
I'd appreciate any help I can get! :)
Thanks for your interesting task! I've implemented a pure numpy + pandas solution. It uses a sorted array to keep names and counts, so the algorithm should be around O(n * log n) complexity.
I didn't find a hash table in numpy; a hash table would definitely be faster (O(n)). Hence I used numpy's existing sorting/inserting routines.
I also used .read_csv() from pandas with the iterator = True, chunksize = 1 << 24 parameters; this reads the file in chunks and produces pandas dataframes of fixed size from each chunk.
Note! For the first runs (until the program is debugged), set limit_chunks (the number of chunks to process) in the code to a small value (like 5). This checks that the whole program runs correctly on partial data.
The program needs the one-time command python -m pip install pandas numpy to install these 2 packages if you don't have them.
Progress is printed once in a while: total megabytes done plus speed.
The result is printed to the console and also saved to the file named by fname_res; all constants configuring the script are placed at the beginning of the script. The topk constant controls how many top names are output to the file/console.
I'm curious how fast my solution is. If it is too slow, maybe I'll devote some time to writing a nice HashTable class using pure numpy.
You can also try running the following code online.
import os, math, time, sys
# Needs: python -m pip install pandas numpy
import pandas as pd, numpy as np
import pandas, numpy

fname = 'test.csv'
fname_res = 'test.res'
chunk_size = 1 << 24
limit_chunks = None # Number of chunks to process, set to None to process whole file
all_genders = ['Male', 'Female']
topk = 1000 # How many top names to output
progress_step = 1 << 23 # in bytes

fsize = os.path.getsize(fname)

#el_man = enlighten.get_manager() as el_man
#el_ctr = el_man.counter(color = 'green', total = math.ceil(fsize / 2 ** 20), unit = 'MiB', leave = False)

tables = {g : {
    'vals': np.full([1], chr(0x10FFFF), dtype = np.str_),
    'cnts': np.zeros([1], dtype = np.int64),
} for g in all_genders}

tb = time.time()

def Progress(
    done, total = min([fsize] + ([chunk_size * limit_chunks] if limit_chunks is not None else [])),
    cfg = {'progressed': 0, 'done': False},
):
    if not cfg['done'] and (done - cfg['progressed'] >= progress_step or done >= total):
        if done < total:
            while cfg['progressed'] + progress_step <= done:
                cfg['progressed'] += progress_step
        else:
            cfg['progressed'] = total
        sys.stdout.write(
            f'{str(round(cfg["progressed"] / 2 ** 20)).rjust(5)} MiB of ' +
            f'{str(round(total / 2 ** 20)).rjust(5)} MiB ' +
            f'speed {round(cfg["progressed"] / 2 ** 20 / (time.time() - tb), 4)} MiB/sec\n'
        )
        sys.stdout.flush()
        if done >= total:
            cfg['done'] = True

with open(fname, 'rb', buffering = 1 << 26) as f:
    for i, df in enumerate(pd.read_csv(f, iterator = True, chunksize = chunk_size)):
        if limit_chunks is not None and i >= limit_chunks:
            break
        if i == 0:
            name_col = df.columns.get_loc('First Name')
            gender_col = df.columns.get_loc('Gender')
        names = np.array(df.iloc[:, name_col]).astype('str')
        genders = np.array(df.iloc[:, gender_col]).astype('str')
        for g in all_genders:
            ctab = tables[g]
            gnames = names[genders == g]
            vals, cnts = np.unique(gnames, return_counts = True)
            if vals.size == 0:
                continue
            if ctab['vals'].dtype.itemsize < names.dtype.itemsize:
                ctab['vals'] = ctab['vals'].astype(names.dtype)
            poss = np.searchsorted(ctab['vals'], vals)
            exist = ctab['vals'][poss] == vals
            ctab['cnts'][poss[exist]] += cnts[exist]
            nexist = np.flatnonzero(exist == False)
            ctab['vals'] = np.insert(ctab['vals'], poss[nexist], vals[nexist])
            ctab['cnts'] = np.insert(ctab['cnts'], poss[nexist], cnts[nexist])
        Progress(f.tell())
Progress(fsize)

with open(fname_res, 'w', encoding = 'utf-8') as f:
    for g in all_genders:
        f.write(f'{g}:\n\n')
        print(g, '\n')
        order = np.flip(np.argsort(tables[g]['cnts']))[:topk]
        snames, scnts = tables[g]['vals'][order], tables[g]['cnts'][order]
        if snames.size > 0:
            for n, c in zip(np.nditer(snames), np.nditer(scnts)):
                n, c = str(n), int(c)
                if c == 0:
                    continue
                f.write(f'{c} {n}\n')
                print(c, n.encode('ascii', 'replace').decode('ascii'))
        f.write(f'\n')
        print()
import pandas as pd
df = pd.read_csv("sample_data.csv")
print(df['First Name'].value_counts())
The second line converts your csv into a pandas dataframe, and the third line prints the occurrences of each name.
https://dfrieds.com/data-analysis/value-counts-python-pandas.html
This doesn't seem to be a case where pandas is really going to be an advantage. But if you're committed to going down that route, change the read_csv chunksize parameter, then filter out the useless columns.
Perhaps consider using a different set of tooling, such as a database, or even vanilla Python using a generator to populate a dict in the form of name: count.

Can a high CPU load (from other applications) affect python performance/accuracy?

I'm working on a code to read and display the results of a Finite Element Analysis (FEA) calculation. The results are stored in several (relatively big) text files that contain a list of nodes (ID number, location in space) and lists for the physical fields of relevance (ID of node, value of the field at that point).
However, I have noticed that when I'm running an FEA case in the background and I try to run my code at the same time, it returns errors: not always the same one and not always at the same iteration, all seemingly at random and without any modification to the code or to the input files whatsoever, just by hitting the RUN button seconds apart between runs.
Examples of the errors I'm getting are:
keys[key] = np.round(np.asarray(keys[key]),7)
TypeError: can't multiply sequence by non-int of type 'float'
#-------------------------------------------------------------------------
triang = tri.Triangulation(x, y)
ValueError: x and y arrays must have a length of at least 3
#-------------------------------------------------------------------------
line = [float(n) for n in line]
ValueError: could not convert string to float: '0.1225471E'
In case you are curious, this is my code (keep in mind that it is not finished yet and that I'm a mechanical engineer, not a programmer). Any feedback on how to make it better is also appreciated:
import matplotlib.pyplot as plt
import matplotlib.tri as tri
import numpy as np
import os

triangle_max_radius = 0.003
respath = 'C:/path'
fields = ['TEMPERATURE']

# Plot figure definition --------------------------------------------------------------------------------------
fig, ax1 = plt.subplots()
fig.subplots_adjust(left=0, right=1, bottom=0.04, top=0.99)
ax1.set_aspect('equal')
# -------------------------------------------------------------------------------------------------------------

# Read outputfiles --------------------------------------------------------------------------------------------
resfiles = [f for f in os.listdir(respath) if (os.path.isfile(os.path.join(respath, f)) and f[:3] == 'csv')]
resfiles = [[f, int(f[4:])] for f in resfiles]
resfiles = sorted(resfiles, key=lambda x: (x[1]))
resfiles = [os.path.join(respath, f[:][0]).replace("\\", "/") for f in resfiles]
# -------------------------------------------------------------------------------------------------------------

# Read data inside outputfile ---------------------------------------------------------------------------------
for result_file in resfiles:
    keys = {}
    keywords = []
    with open(result_file, 'r') as res:
        for line in res:
            if line[0:2] == '##':
                if len(line) >= 5:
                    line = line[:3] + line[7:]
            line = line.replace(';', ' ')
            line = line.split()
            if line:
                if line[0] == '##':
                    if len(line) >= 3:
                        keywords.append(line[1])
                        keys[line[1]] = []
                elif line[0] in keywords:
                    curr_key = line[0]
                else:
                    line = [float(n) for n in line]
                    keys[curr_key].append(line)
    for key in keys:
        keys[key] = np.round(np.asarray(keys[key]), 7)

    for item in fields:
        gob_temp = np.empty((0, 4))
        for node in keys[item]:
            temp_coords, = np.where(node[0] == keys['COORDINATES'][:, 0])
            gob_temp_coords = [node[0], keys['COORDINATES'][temp_coords, 1], keys['COORDINATES'][temp_coords, 2], node[1]]
            gob_temp = np.append(gob_temp, [gob_temp_coords], axis=0)

        x = gob_temp[:, 1]
        y = gob_temp[:, 2]
        z = gob_temp[:, 3]

        triang = tri.Triangulation(x, y)
        triangles = triang.triangles
        xtri = x[triangles] - np.roll(x[triangles], 1, axis=1)
        ytri = y[triangles] - np.roll(y[triangles], 1, axis=1)
        maxi = np.max(np.sqrt(xtri**2 + ytri**2), axis=1)
        triang.set_mask(maxi > triangle_max_radius)

        ax1.tricontourf(triang, z, 100, cmap='plasma')
        ax1.triplot(triang, color="black", lw=0.2)

plt.show()
So back to the question: is it possible for the accuracy/performance of Python to be affected by CPU load or any other 'external' factors? Or is that not an option, and there's definitely something wrong with my code (which works well in other circumstances, by the way)?
No, other processes only affect how often your process gets time slots to execute -- i.e., from a user's perspective, how quickly it completes its job.
If you're having errors under load, this means there are errors in your program's logic -- most probably, race conditions. They basically boil down to making assumptions about your environment that are no longer true when there's other activity in it. E.g.:
Your program is multithreaded, and the logic makes assumptions about which order threads are executed in. (This includes assumptions about how long some task would take to complete.)
Your program is using shared resources (files, streams, etc.) that other processes are also using at the same time. (E.g. some other program is in the process of (over)writing a file while you're trying to read it. Or, if you're reading from a stream, not all data are available yet.) A possible mitigation is sketched below.
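One heuristic way to guard against the second case (a sketch, not part of the original answer, assuming the FEA solver is still writing the newest result file while the plotting script reads it) is to skip files that were modified very recently, reusing the resfiles list from the question's code:

import os
import time

def is_probably_complete(path, quiet_seconds=10.0):
    # Heuristic: treat a result file as safe to read only if it has not been
    # modified for a while, i.e. the solver is (probably) done writing it.
    return (time.time() - os.path.getmtime(path)) > quiet_seconds

resfiles_ready = [f for f in resfiles if is_probably_complete(f)]
# ... run the existing parsing/plotting loop over resfiles_ready only ...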

R readBin vs. Python struct

I am attempting to read a binary file using Python. Someone else has read in the data with R using the following code:
x <- readBin(webpage, numeric(), n=6e8, size = 4, endian = "little")
myPoints <- data.frame("tmax" = x[1:(length(x)/4)],
                       "nmax" = x[(length(x)/4 + 1):(2*(length(x)/4))],
                       "tmin" = x[(2*length(x)/4 + 1):(3*(length(x)/4))],
                       "nmin" = x[(3*length(x)/4 + 1):(length(x))])
With Python, I am trying the following code:
import struct

with open('file', 'rb') as f:
    val = f.read(16)
    while val != '':
        print(struct.unpack('4f', val))
        val = f.read(16)
I am getting slightly different results. For example, the first row in R returns 4 columns as -999.9, 0, -999.0, 0. Python returns -999.0 for all four columns (images below).
Python output:
R output:
I know that they are slicing by the length of the file with some of the [] code, but I do not know exactly how to do this in Python, nor do I quite understand why they do it. Basically, I want to recreate what R is doing in Python.
I can provide more of either code base if needed. I did not want to overwhelm with code that was not necessary.
Deducing from the R code, the binary file first contains a certain number of tmax's, then the same number of nmax's, then tmin's and nmin's. What the code does is read the entire file, which is then chopped up into the 4 parts (tmax's, nmax's, etc.) using slicing.
To do the same in python:
import struct
# Read entire file into memory first. This is done so we can count
# number of bytes before parsing the bytes. It is not a very memory
# efficient way, but it's the easiest. The R-code as posted wastes even
# more memory: it always takes 6e8 * 4 bytes (~ 2.2Gb) of memory no
# matter how small the file may be.
#
data = open('data.bin','rb').read()
# Calculate number of points in the file. This is
# file-size / 16, because there are 4 numeric()'s per
# point, and they are 4 bytes each.
#
num = int(len(data) / 16)
# Now we know how much there are, we take all tmax numbers first, then
# all nmax's, tmin's and lastly all nmin's.
# First generate a format string because it depends on the number points
# there are in the file. It will look like: "fffff"
#
format_string = 'f' * num
# Then, for cleaner code, calculate chunk size of the bytes we need to
# slice off each time.
#
n = num * 4 # 4-byte floats
# Note that python has different interpretation of slicing indices
# than R, so no "+1" is needed here as it is in the R code.
#
tmax = struct.unpack(format_string, data[:n])
nmax = struct.unpack(format_string, data[n:2*n])
tmin = struct.unpack(format_string, data[2*n:3*n])
nmin = struct.unpack(format_string, data[3*n:])
print("tmax", tmax)
print("nmax", nmax)
print("tmin", tmin)
print("nmin", nmin)
If the goal is to have this data structured as a list of points(?) like (tmax,nmax,tmin,nmin), then append this to the code:
print()
print("Points:")
# Combine ("zip") all 4 lists into a list of (tmax,nmax,tmin,nmin) points.
# Python has a function to do this at once: zip()
#
i = 0
for point in zip(tmax, nmax, tmin, nmin):
    print(i, ":", point)
    i += 1
Here's a less memory-hungry way to do the same. It is possibly a bit faster too (but that is difficult for me to check).
My computer did not have sufficient memory to run the first program with those huge files. This one does, but I still needed to create a list of only the tmax's first (the first 1/4 of the file), then print it, and then delete the list in order to have enough memory for the nmax's, tmin's and nmin's.
But this one too says the nmin's inside the 2018 file are all -999.0. If that doesn't make sense, could you check what the R code makes of it? I suspect that it is just what's in the file. The other possibility is, of course, that I got it all wrong (which I doubt). However, I tried the 2017 file too, and that one does not have this problem: all of tmax, nmax, tmin and nmin have around 37% -999.0's.
Anyway, here's the second code:
import os
import struct

# load_data()
# data_store : object to append() data items (floats) to
# num        : number of floats to read and store
# datafile   : opened binary file object to read float data from
#
def load_data(data_store, num, datafile):
    for i in range(num):
        data = datafile.read(4)                # process one float (=4 bytes) at a time
        item = struct.unpack("<f", data)[0]    # '<' means little endian
        data_store.append(item)

# save_list() saves a list of floats as strings to a file
#
def save_list(filename, datalist):
    output = open(filename, "wt")
    for item in datalist:
        output.write(str(item) + '\n')
    output.close()

#### MAIN ####

datafile = open('data.bin', 'rb')

# Get file size so we can calculate the number of points without reading
# the (large) file entirely into memory.
#
file_info = os.stat(datafile.fileno())

# Calculate the number of points, i.e. the number of tmax's, nmax's,
# tmin's, nmin's. A point is 4 floats of 4 bytes each, hence number
# of points = file-size / (4*4)
#
num = int(file_info.st_size / 16)

tmax_list = list()
load_data(tmax_list, num, datafile)
save_list("tmax.txt", tmax_list)
del tmax_list  # huge list, save memory

nmax_list = list()
load_data(nmax_list, num, datafile)
save_list("nmax.txt", nmax_list)
del nmax_list  # huge list, save memory

tmin_list = list()
load_data(tmin_list, num, datafile)
save_list("tmin.txt", tmin_list)
del tmin_list  # huge list, save memory

nmin_list = list()
load_data(nmin_list, num, datafile)
save_list("nmin.txt", nmin_list)
del nmin_list  # huge list, save memory

Python: slow for loop performance on reading, extracting and writing from a list of thousands of files

I am extracting 150 different cell values from 350,000 (20 kB) ascii raster files. My current code is fine for processing the 150 cell values from hundreds of the ascii files, however it is very slow when running on the full data set.
I am still learning Python, so are there any obvious inefficiencies, or suggestions to improve the code below?
I have tried closing the 'dat' file in the 2nd function; no improvement:
dat = None
First: I have a function which returns the row and column locations from a cartesian grid.
def world2Pixel(gt, x, y):
    ulX = gt[0]
    ulY = gt[3]
    xDist = gt[1]
    yDist = gt[5]
    rtnX = gt[2]
    rtnY = gt[4]
    pixel = int((x - ulX) / xDist)
    line = int((ulY - y) / xDist)
    return (pixel, line)
Second: a function to which I pass lists of 150 'id', 'x' and 'y' values in a for loop. The first function is called within it and is used to extract the cell value, which is appended to a new list. I also have a list of files, 'asc_list', and corresponding times in 'date_list'. Please ignore count / enumerate as I use it later, unless it is impeding efficiency.
def asc2series(id, x, y):
    #count = 1
    ls_id = []
    ls_p = []
    ls_d = []
    for n, (asc, date) in enumerate(zip(asc, date_list)):
        dat = gdal.Open(asc_list)
        gt = dat.GetGeoTransform()
        pixel, line = world2Pixel(gt, east, nort)
        band = dat.GetRasterBand(1)
        #dat = None
        value = band.ReadAsArray(pixel, line, 1, 1)[0, 0]
        ls_id.append(id)
        ls_p.append(value)
        ls_d.append(date)
Many thanks
In world2Pixel you are setting rtnX and rtnY, which you don't use.
You probably meant gdal.Open(asc) -- not asc_list.
You could move gt = dat.GetGeoTransform() out of the loop. (Rereading made me realize you can't really.)
You could cache calls to world2Pixel.
You're opening the dat file for each pixel -- you should probably turn the logic around to open each file only once and look up all the pixels mapped to that file (see the sketch after this list).
Benchmark; check the links in this podcast to see how: http://talkpython.fm/episodes/show/28/making-python-fast-profiling-python-code
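A minimal sketch of that reordering, under the assumption that the goal is one value per (id, x, y) location from every raster; asc_list, date_list and world2Pixel come from the question, while extract_points and points are hypothetical names introduced here:

from osgeo import gdal

def extract_points(points, asc_list, date_list):
    # points: list of (id, x, y) tuples -- the 150 cell locations
    results = []                      # (id, date, value) rows
    for asc, date in zip(asc_list, date_list):
        dat = gdal.Open(asc)          # open each raster exactly once
        gt = dat.GetGeoTransform()
        band = dat.GetRasterBand(1)
        arr = band.ReadAsArray()      # ~20 kB raster: read it whole, then index in memory
        for pid, x, y in points:
            pixel, line = world2Pixel(gt, x, y)
            results.append((pid, date, arr[line, pixel]))
        dat = None                    # let GDAL close the file
    return results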
