I am currently parsing historic delay data from a public transport network in Sweden. I have ~5700 files (one every 15 seconds) from the 27th of January containing momentary delay data for vehicles on active trips in the network. Unfortunately, there is a lot of overhead / duplicate data, so I want to parse out the relevant parts to do visualizations on them.
However, when I try to parse and filter out the relevant delay data on a trip level using the script below, it runs really slowly. It has been running for over 1.5 hours now (on my 2019 15" MacBook Pro) and isn't finished yet.
How can I optimize / improve this Python parser?
Or should I reduce the number of files, i.e. the frequency of the data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np

directory = '../data/tripu/27/'
datapoints = np.zeros((0,3), int)
read_trips = set()
# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
    try:
        # Uncompress and parse protobuff-file using gtfs_realtime_pb2
        with gzip.open(directory + filename, 'rb') as file:
            response = file.read()
            feed = gtfs_realtime_pb2.FeedMessage()
            feed.ParseFromString(response)
            print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))
            for trip in feed.entity:
                if trip.trip_update.trip.trip_id not in read_trips:
                    try:
                        if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
                            print("\t","Adding delays for",len(trip.trip_update.stop_time_update),"stops, on trip_id",trip.trip_update.trip.trip_id)
                            for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
                                # Store the delay data point (arrival difference of two ascending nodes)
                                delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay-trip.trip_update.stop_time_update[i].arrival.delay)
                                # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                                ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
                                key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
                                # Append data to numpy array
                                datapoints = np.append(datapoints, np.array([[key,ts,delay]]), axis=0)
                            read_trips.add(trip.trip_update.trip.trip_id)
                    except KeyError:
                        continue
                else:
                    continue
    except OSError:
        continue
I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
datapoints.append((key,ts,delay))
# ...
npdata = np.array(datapoints, dtype=int)
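To make the difference concrete, here is a minimal sketch using made-up (key, ts, delay) rows instead of real GTFS data. The function names are hypothetical. Both functions build the same array, but the first copies the whole array on every iteration, while the second converts the list once at the end.

```python
import numpy as np

def collect_with_np_append(rows):
    """Grow a NumPy array row by row; every call copies the whole array."""
    data = np.zeros((0, 3), dtype=int)
    for row in rows:
        data = np.append(data, np.array([row]), axis=0)
    return data

def collect_with_list(rows):
    """Accumulate tuples in a list and convert to an array once at the end."""
    buffer = []
    for row in rows:
        buffer.append(row)
    # reshape keeps the (n, 3) layout even when no rows were collected
    return np.array(buffer, dtype=int).reshape(-1, 3)

# Made-up (key, ts, delay) rows, purely for illustration.
rows = [(i, 1_000 + i, i % 7) for i in range(2_000)]
slow = collect_with_np_append(rows)
fast = collect_with_list(rows)
```

Timing the two functions on your own data (for example with timeit) should show the gap widening as the number of rows grows.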
I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And they do slow down performance somewhat.) Also, I converted your i, i+1 indexing into two iterators zipped over the list of updates, which is a slightly more advanced style of working through a list. Plus, the cur_update/next_update names helped me keep straight when you wanted to reference one vs. the other. Finally, I removed the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
    this_trip_update = trip.trip_update
    this_trip_id = this_trip_update.trip.trip_id
    if this_trip_id not in read_trips:
        try:
            if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
                print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
                      this_trip_id)
                # create two iterators to walk through the list of updates
                cur_updates = iter(this_trip_update.stop_time_update)
                nxt_updates = iter(this_trip_update.stop_time_update)
                # advance the nxt_updates iter so it is one ahead of cur_updates
                next(nxt_updates)
                for cur_update, next_update in zip(cur_updates, nxt_updates):
                    # Store the delay data point (arrival difference of two ascending nodes)
                    delay = int(next_update.arrival.delay - cur_update.arrival.delay)
                    # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                    ts = int(next_update.arrival.time)
                    key = "{}/{}".format(cur_update.stop_id, next_update.stop_id)
                    # Append data to numpy array
                    datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
                read_trips.add(this_trip_id)
        except KeyError:
            continue
This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.
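The two-iterator idiom can be tried in isolation. This is a minimal sketch on a plain list of made-up delays; consecutive_pairs is a hypothetical helper name, and it assumes the input can be iterated twice (a list or similar sequence, not a one-shot generator).

```python
def consecutive_pairs(items):
    """Return an iterator of (current, next) pairs over a re-iterable sequence."""
    current_iter = iter(items)
    next_iter = iter(items)
    # Advance the second iterator by one; the default avoids StopIteration on empty input.
    next(next_iter, None)
    # zip stops as soon as the shorter (advanced) iterator is exhausted.
    return zip(current_iter, next_iter)

# Made-up arrival delays (seconds) at four consecutive stops.
delays = [30, 45, 40, 70]
edge_delays = [nxt - cur for cur, nxt in consecutive_pairs(delays)]
```

On Python 3.10 and later, itertools.pairwise does the same pairing out of the box.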
(This probably is more appropriate for CodeReview, but I hardly ever go there.)
Related
I am working with extremely high dimensional biological count data (single cell RNA sequencing where rows are cell ID and columns are genes).
Each dataset is a separate flat file (AnnData format). Each flat file can be broken down by various metadata attributes, including by cell type (eg: muscle cell, heart cell), subtypes (eg: a lung dataset can be split into normal lung and cancerous lung), cancer stage (eg: stage 1, stage 2), etc.
The goal is to pre-compute aggregate metrics for a specific metadata column, sub-group, dataset, cell-type, gene combination and keep that readily accessible such that when a person queries my web app for a plot, I can quickly retrieve results (refer to Figure below to understand what I want to create). I have generated Python code to assemble the dictionary below and it has sped up how quickly I can create visualizations.
Only issue now is that the memory footprint of this dictionary is very high (there are ~10,000 genes per dataset). What is the best way to reduce the memory footprint of this dictionary? Or, should I consider another storage framework (briefly saw something called Redis Hashes)?
One option to reduce your memory footprint but keep fast lookup is to use an hdf5 file as a database. This will be a single large file that lives on your disk instead of memory, but is structured the same way as your nested dictionaries and allows for rapid lookups by reading in only the data you need. Writing the file will be slow, but you only have to do it once and then upload to your web-app.
To test this idea, I've created two test nested dictionaries in the format of the diagram you shared. The small one has 1e5 metadata/group/dataset/celltype/gene entries, and the other is 10 times larger.
Writing the small dict to hdf5 took ~2 minutes and resulted in a file 140 MB in size while the larger dict-dataset took ~14 minutes to write to hdf5 and is a 1.4 GB file.
Querying the small and large hdf5 files took similar amounts of time, showing that the queries scale well to more data.
Here's the code I used to create the test dict-datasets, write to hdf5, and query:
import h5py
import numpy as np
import time

def create_data_dict(level_counts):
    """
    Create test data in the same nested-dict format as the diagram you show
    The Agg_metric values are random floats between 0 and 1
    (you shouldn't need this function since you already have real data in dict format)
    """
    if not level_counts:
        return {f'Agg_metric_{i+1}':np.random.random() for i in range(num_agg_metrics)}
    level,num_groups = level_counts.popitem()
    return {f'{level}_{i+1}':create_data_dict(level_counts.copy()) for i in range(num_groups)}

def write_dict_to_hdf5(hdf5_path,d):
    """
    Write the nested dictionary to an HDF5 file to act as a database
    only have to create this file once, but can then query it any number of times
    (unless the data changes)
    """
    def _recur_write(f,d):
        for k,v in d.items():
            #check if the next level is also a dict
            sk,sv = v.popitem()
            v[sk] = sv
            if type(sv) == dict:
                #this is a 'node', move on to next level
                _recur_write(f.create_group(k),v)
            else:
                #this is a 'leaf', stop here
                leaf = f.create_group(k)
                for sk,sv in v.items():
                    leaf.attrs[sk] = sv
    with h5py.File(hdf5_path,'w') as f:
        _recur_write(f,d)

def query_hdf5(hdf5_path,search_terms):
    """
    Query the hdf5_path with a list of search terms
    The search terms must be in the order of the dict, and have a value at each level
    Output is a dict of agg stats
    """
    with h5py.File(hdf5_path,'r') as f:
        k = '/'.join(search_terms)
        try:
            f = f[k]
        except KeyError:
            print('oh no! at least one of the search terms wasnt matched')
            return {}
        return dict(f.attrs)

################
#     start    #
################
#this "small_level_counts" results in an hdf5 file of size 140 MB (took < 2 minutes to make)
#all possible nested dictionaries are made,
#so there are 40*30*10*3*3 = ~1e5 metadata/group/dataset/celltype/gene entries
num_agg_metrics = 7
small_level_counts = {
    'Gene':40,
    'Cell_Type':30,
    'Dataset':10,
    'Unique_Group':3,
    'Metadata':3,
}
#"large_level_counts" results in an hdf5 file of size 1.4 GB (took 14 mins to make)
#has 400*30*10*3*3 = ~1e6 metadata/group/dataset/celltype/gene combinations
num_agg_metrics = 7
large_level_counts = {
    'Gene':400,
    'Cell_Type':30,
    'Dataset':10,
    'Unique_Group':3,
    'Metadata':3,
}
#Determine which test dataset to use
small_test = True
if small_test:
    level_counts = small_level_counts
    hdf5_path = 'small_test.hdf5'
else:
    level_counts = large_level_counts
    hdf5_path = 'large_test.hdf5'
np.random.seed(1)
start = time.time()
data_dict = create_data_dict(level_counts)
print('created dict in {:.2f} seconds'.format(time.time()-start))
start = time.time()
write_dict_to_hdf5(hdf5_path,data_dict)
print('wrote hdf5 in {:.2f} seconds'.format(time.time()-start))
#Search terms in order of most broad to least
search_terms = ['Metadata_1','Unique_Group_3','Dataset_8','Cell_Type_15','Gene_17']
start = time.time()
query_result = query_hdf5(hdf5_path,search_terms)
print('queried in {:.2f} seconds'.format(time.time()-start))
direct_result = data_dict['Metadata_1']['Unique_Group_3']['Dataset_8']['Cell_Type_15']['Gene_17']
print(query_result == direct_result)
Although Python dictionaries themselves are fairly efficient in terms of memory usage, you are likely storing multiple copies of the strings you are using as dictionary keys. From your description of your data structure, it is likely that you have 10,000 copies of "Agg metric 1", "Agg metric 2", etc., one for every gene in your dataset, and these duplicate strings may be taking up a significant amount of memory. They can be deduplicated with sys.intern, so that although you still have as many references to the string in your dictionaries, they all point to a single copy in memory. You would only need to make a minimal adjustment to your code, changing the assignment to data[sys.intern('Agg metric 1')] = value. I would do this for all of the keys used at all levels of your dictionary hierarchy.
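A minimal sketch of the interning idea follows. The key names follow the Agg_metric_N pattern from the test data in the other answer, and build_metrics is a hypothetical helper. The keys are built at runtime on purpose: string literals written directly in source code are often interned by CPython already, so the saving mostly applies to keys that are constructed or read from files.

```python
import sys

def build_metrics(values):
    """Build one leaf dict whose keys are interned metric names."""
    return {sys.intern(f"Agg_metric_{i + 1}"): v for i, v in enumerate(values)}

# Two leaves built independently, with key strings created at runtime.
leaf_a = build_metrics([0.1, 0.2, 0.3])
leaf_b = build_metrics([0.4, 0.5, 0.6])

key_a = next(iter(leaf_a))
key_b = next(iter(leaf_b))
same_object = key_a is key_b  # True: both dicts reference one string object
```

Whether this reduces the footprint enough depends on how much of the memory is keys rather than values, so it is worth measuring on a sample of the real dictionary first.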
I'm trying to understand how I can write a DAQ in Python that manages two signals (I and Q from an IQ mixer) from an NI device. I have two questions:
What are the main differences between using h5py and using pandas? My data are not complex; I only need two matrix datasets, one for the I signal and one for the Q signal.
Is it more efficient to build the whole dataset in memory (occupying a lot of it) before storing it in an HDF5 file, or to open the HDF5 file each time to add a new row (new data) to the matrix?
@Frostman, this is primarily an opinion question. Remember, "Beauty is in the eye of the beholder."
The question about memory use is the more important consideration. The answer depends on memory required to hold your data (and if you have enough before writing to disk). Creating the whole dataset in memory is faster and easier. But, that's not an option if it doesn't fit. :-) Note: you don't want to write data 1 row at a time. That is the slowest way to work with HDF5 data. If you need to save incrementally, write "a lot" of rows at 1 time (say 1,000).
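To illustrate writing "a lot" of rows at one time, here is a minimal h5py sketch. It is built on assumptions: I and Q are stored as two columns of a single resizable dataset (the question describes two separate matrices), the block size of 1,000 rows and the file name are made up, and the blocks are synthetic stand-ins for DAQ reads.

```python
import os
import tempfile

import h5py
import numpy as np

def append_block(dataset, block):
    """Grow a resizable dataset by len(block) rows and write the block in one call."""
    start = dataset.shape[0]
    dataset.resize(start + block.shape[0], axis=0)
    dataset[start:] = block

rows_per_block = 1_000  # assumed block size, tune to your sample rate
path = os.path.join(tempfile.mkdtemp(), "iq_demo.h5")  # hypothetical file name

with h5py.File(path, "w") as f:
    # Column 0 holds I, column 1 holds Q; maxshape=(None, 2) allows row growth.
    iq = f.create_dataset("iq", shape=(0, 2), maxshape=(None, 2),
                          dtype="f8", chunks=(rows_per_block, 2))
    for block_index in range(5):
        # Stand-in for samples read from the DAQ.
        block = np.full((rows_per_block, 2), float(block_index))
        append_block(iq, block)
    final_shape = iq.shape
```

Each call to append_block performs one resize and one write, so the per-row overhead is paid once per 1,000 rows instead of once per row.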
There is a related consideration: which of these packages is fast enough (and easy enough) to keep up with the I/O requirements of data acquisition? (I have no expertise in this area, but I know high sample rates will quickly create a lot of data.)
From a technical perspective, the difference in the packages is "cosmetic" (IMHO). (FYI, you can also use PyTables to create HDF5 data.) In other words, all 3 can easily create a HDF5 file with the data described in your question. The question (for you), is which package do you want to learn? And, which package do you plan use for later post-processing? (I assume you want to open the file and "do something" with the data later.)
All else being equal, I would create the data with the same package I plan to use for downstream processing. Why? h5py and pandas use different schemas to store the data. So, reading the data will be easier if you write and read with the same package. (That said, you can manipulate HDF5 data between the packages.)
If the downstream processing requirement has not been decided, I would select h5py if either of these is true: a) you are comfortable with NumPy, or b) you need to use NumPy for other operations. h5py is pretty easy to learn if you know NumPy.
Otherwise, you might prefer pandas. Many claim it is easier to use (vs h5py and PyTables). I am not a pandas expert, so can't comment. I prefer h5py and PyTables primarily because the 2D data schema is saved in a table format that is easy to review with HDFView. (Also, I use NumPy 90+% of the time, so h5py/PyTables are natural extensions.)
If you want to compare code for each, look at the answers to this question: How to write large multiple arrays to a h5 file in layers? They show code required to store data similar to yours for all 3 packages: h5py, pytables and pandas.
Especially when acquiring long measurements, it is handy to write directly into an HDF5 file. This is my preferred way, because an interruption (power failure etc.) won't result in data loss.
This is my solution using a while-loop that, in each cycle, collects all available samples from the DAQ and stores them immediately in the HDF5 file. You could imagine some real-time display during each loop cycle, but be aware of the loop duration (set parameter["debug_output"] = 3 to see more statistics, such as the buffer size of each cycle).
Changing the boolean hdf5_write to False causes the code to store the samples in the NumPy array data, which sooner or later will fill the memory. If True, all the samples are written directly into a growing HDF5 file.
import nidaqmx
import datetime
import time
import numpy as np
import h5py

def hdf5_write_parameter(h5_file, parameter, group_name='parameter'):
    # add parameter group
    param_grp = h5_file.create_group(group_name)
    # write single item
    for key, item in parameter.items():
        try:
            if item is None:
                item = 'None'
            if isinstance(item, dict):
                # recursive write each dictionary
                hdf5_write_parameter(h5_file, item, group_name+'/'+key)
            else:
                h5_file.create_dataset("/"+group_name+"/{}".format(key), data=item)
        except:
            print("[hdf5_write_parameter]: failed to write:", key, "=", item)
    return

run_bool = True # should be controlled by GUI or caller thread
measurement_duration = 1 # in seconds
filename = 'test_acquisition'
hdf5_write = True # a hdf5 file with ending '.h5' is created, False = numpy array
# check if device is available
system = nidaqmx.system.System.local()
system.driver_version
for device in system.devices:
    print(device) # plot devices
ADC_DEVICE_NAME = device.name # 'PCI6024e'
print('ADC: init measure for', measurement_duration, 'seconds')
# Setup ADC
parameter = {
    "channels": 8, # number of AI channels
    "channel_name": ADC_DEVICE_NAME + '/ai0:7',
    "log_rate": int(20000), # Samples per second
    "adc_min_value": -5.0, # minimum ADC value in Volts
    "adc_max_value": 5.0, # maximum ADC value in Volts
    "timeout": measurement_duration + 2.0, # timeout to detect external clock on read
    "debug_output": 1,
    "measurement_duration": measurement_duration,
}
parameter["buffer_size"] = int(parameter["log_rate"]) # buffer size in samples
# must be bigger than loop duration!
parameter["requested_samples"] = parameter["log_rate"] * measurement_duration
parameter["hdf5_write"] = hdf5_write # write in array
if parameter['hdf5_write']:
    filename += '.h5'
    f = h5py.File(filename, 'w') # create a h5-file object if True
    data = f.create_dataset('data', (0, parameter["channels"]),
                            maxshape=(None, parameter["channels"]), chunks=True)
else:
    filename += '.csv'
    # pre-allocate array, we might get up to 1 buffer more than requested...
    data = np.empty((parameter["requested_samples"]+parameter["buffer_size"], parameter["channels"]), dtype=np.float64)
    data[:] = np.nan

with nidaqmx.Task() as task:
    task.ai_channels.add_ai_voltage_chan(parameter["channel_name"],
                                         terminal_config=nidaqmx.constants.TerminalConfiguration.RSE,
                                         min_val=parameter["adc_min_value"],
                                         max_val=parameter["adc_max_value"],
                                         units=nidaqmx.constants.VoltageUnits.VOLTS
                                         )
    task.timing.cfg_samp_clk_timing(rate=parameter["log_rate"],
                                    sample_mode=nidaqmx.constants.AcquisitionType.CONTINUOUS)
    # helper variables
    total_samples = 0
    i = 0
    last_display = -1
    parameter["acquisition_start"] = str(datetime.datetime.now())
    if 1:
        print("ADC: --- acquisition started:", parameter["acquisition_start"])
        print("ADC: Requested samples:", parameter["requested_samples"], "Acquisition duration:",
              measurement_duration)
    task.control(nidaqmx.constants.TaskMode.TASK_COMMIT)
    time_adc_start = time.perf_counter()
    # ############################# READING LOOP ##########################
    while run_bool and total_samples < parameter["requested_samples"] and time.perf_counter() - time_adc_start < parameter[
            "timeout"]:
        i = i + 1
        if parameter["debug_output"] >= 1:
            elapsed_time = np.floor(time.perf_counter() - time_adc_start) # in sec
            if elapsed_time != last_display:
                print("ADC: ...", round(elapsed_time), "of", measurement_duration, "sec:",
                      total_samples, "acquired ...")
                last_display = elapsed_time
        # high-lvl read function: always create a new array
        data_buff = np.asarray(
            task.read(number_of_samples_per_channel=nidaqmx.constants.READ_ALL_AVAILABLE)).T
        time_adc_end = time.perf_counter()
        samples_from_buffer = data_buff.shape[0]
        # get nr of samples and acumulate to total_samples
        total_samples = int(total_samples + samples_from_buffer)
        if parameter["debug_output"] >= 2:
            print("ADC: iter", i, "total:", total_samples, "smp from buffer", samples_from_buffer,
                  "time elapsed", time.perf_counter() - time_adc_start)
        if samples_from_buffer > 0:
            # prepair buffer and hdf5 dataset
            if parameter["hdf5_write"]: # sequential write to hdf5 file
                chunk_start = data.shape[0]
                # resize dataset in file
                data.resize(data.shape[0] + samples_from_buffer, axis=0)
            else:
                # prepair buffer to fit in pre-allocated array 'data'
                chunk_start = int(np.count_nonzero(~np.isnan(data)) / parameter["channels"])
            if parameter['channels'] == 1:
                data_buff = data_buff[:, np.newaxis]
            if parameter["debug_output"] >= 3:
                print("Non-empty data shape: (", data.shape,
                      "), buffer shape:", data_buff.shape,
                      "chunk start:", chunk_start)
            # write buffer to HDF5 file or into numpy array
            data[chunk_start:chunk_start + samples_from_buffer, :] = data_buff
    # ############################# READING LOOP #########################
    parameter["acquisition_stop"] = str(datetime.datetime.now())
    if parameter["debug_output"] >= 1:
        print("ADC: requested points: ", parameter["requested_samples"])
        print("ADC: total aqcuired points", total_samples, "in", time_adc_end - time_adc_start)
        print("ADC: data array shape:", data.shape)
        print("ADC: --- aqcuisition finished:", parameter["acquisition_stop"])
        print("ADC: sample rate:", round(1/((time_adc_end-time_adc_start)/parameter["requested_samples"])))
# prepare data nparray for return
if not parameter["hdf5_write"]:
    # shrink numpy array by all nan's (from oversize with buffer size)
    total_written = int(np.count_nonzero(~np.isnan(data)) / parameter["channels"])
    if parameter["debug_output"] >= 2:
        print("resize data array by cutting", data.shape[0] - total_written, "tailing NaN's")
    data = np.resize(data, (total_written, parameter["channels"]))
# add more parameter to wrtie into the hdf5 file
parameter["total_samples"] = total_samples
parameter["total_acquisition_time"] = time_adc_end - time_adc_start
parameter["data_shape"] = data.shape
if parameter['hdf5_write']:
    hdf5_write_parameter(f, parameter) # write parameter
    f.close()
I do have thousands of .npy files stored in my hard disk, each containing a single matrix with dimensions [128, T], where T is variable (on average T=800). Each .npy file has size around 2Mb, depending on the matrix shape.
These matrices are then passed to a generator, which yields batches of 32 to a neural network. The Python code used to pass the matrices into the generator is:
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list
which, given a list of paths of the .npy files, returns a list of the corresponding NumPy matrices.
This code takes, on average, 0.6s to return a list of 32 matrices. I am using append because this is usually a quick operation.
I am aware that the speed of the hard disk buffer does have an influence on timings but, right now, I really would like to shrink the amount of time required as much as possible by just modifying the code in a smart way.
As an alternative, I tried implementing multi-processing:
from multiprocessing import Pool

def reader(filename):
    return np.load(filename)

def load_multiprocess(path_list, n_cores=5):
    pool = Pool(n_cores)
    np_list = pool.map(reader, path_list)
    return np_list
However, the performance is much worse. I had a look around stackoverflow, and I got the idea that my specific application could not benefit from multiprocessing.
To summarize, I am looking for any kind of advice for one of these two tasks:
Improving the speed of the first code (even 0.1s less would mean a lot).
Using multiprocessing in the right way, if possible.
SOLUTION AND BENCHMARK
Of the three methods proposed here, user7138814's solution generally improves execution speed a lot. However, things seem to change when the data is loaded while training a neural network: even though memory mapping is by itself still the quickest method for loading data, the overall training time seems to increase. I have no idea where or why, as the timings using the memory-mapped load are always better.
Below, I benchmark the three methods. First, define the methods:
import numpy as np

# my initial method
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list

# Aaj Kaal's method
def load_batch1(path_list):
    return [np.load(path) for path in path_list]

# user7138814's method
def load_batch2(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path, mmap_mode='r'))
    return np_list
I defined a list of paths as follows:
batches_list = []
batch_size = 32
for n in range(0,150):
    batches_list.append(X_path_list[n*batch_size:n*batch_size+batch_size])
The list contains 150 batches of 32 paths each, which should be enough to calculate the mean.
Then, each method is executed on exactly the same data.
import time

# my initial method
timing0 = []
for l in batches_list:
    start = time.time()
    load_batch(l)
    end = time.time()
    timing0.append(end-start)
print(np.mean(timing0))

# Aaj Kaal's method
timing1 = []
for l in batches_list:
    start = time.time()
    load_batch1(l)
    end = time.time()
    timing1.append(end-start)
print(np.mean(timing1))

# user7138814's method
timing2 = []
for l in batches_list:
    start = time.time()
    load_batch2(l)
    end = time.time()
    timing2.append(end-start)
print(np.mean(timing2))
Output (mean timing in seconds over 150 executions):
0.022530150413513184
0.022546884218851725
0.009580903053283692
Results seem to be consistent when changing length of batches_list and batch_size.
Maybe memory mapping the files will be beneficial due to lazy loading. If you would use for example
np.load(filename, mmap_mode='r')
the creation of the numpy array becomes almost a no-op, but later in the pipeline you pay the price. This could provide a speedup if it results in processing the data in parallel with reading from disk.
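A small self-contained sketch of what memory mapping changes, using a synthetic [128, 800] matrix written to a temporary file (the path and contents are made up for illustration):

```python
import os
import tempfile

import numpy as np

# Write a small stand-in matrix with the [128, T] layout from the question.
path = os.path.join(tempfile.mkdtemp(), "sample.npy")  # hypothetical file
np.save(path, np.arange(128 * 800, dtype=np.float32).reshape(128, 800))

eager = np.load(path)                # reads the whole file into memory now
lazy = np.load(path, mmap_mode="r")  # maps the file; data is read on access

is_memmap = isinstance(lazy, np.memmap)
# Copying a slice forces just those rows to be read from the mapped file.
first_rows = np.array(lazy[:4])
```

This also suggests why the load benchmark can look better than the end-to-end training time: with mmap_mode the read cost is deferred to the point where the data is actually consumed, so it is not captured by timing the load call alone.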
Did you try using a list comprehension? Replace
def load_batch(path_list):
    np_list = []
    for path in path_list:
        np_list.append(np.load(path))
    return np_list
with
def load_batch(path_list):
    return [np.load(path) for path in path_list]
In fact, you can get rid of the function and use the list comprehension directly. If a function call is required, use a lambda.
I am creating a sparse matrix file by extracting the features from an input file. Each row of the input file contains one film ID, followed by some feature IDs and each feature's score.
6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707
The first number is the ID of the film; the value to the left of each colon is the feature ID and the value to the right is the score of that feature.
Each line represents one film, and the number of feature:score pairs varies from one film to another.
Here is how I construct my sparse matrix.
import sys
import os
import os.path
import time
import json  # needed for json.dump below
import numpy as np
import tables as tb  # PyTables, referenced as tb below
from Film import Film
import scipy
from scipy.sparse import coo_matrix, csr_matrix, rand

def sparseCreate(self, Debug):
    a = rand(self.total_rows, self.total_columns, format='csr')
    l, m = a.shape[0], a.shape[1]
    f = tb.open_file("sparseFile.h5", 'w')
    filters = tb.Filters(complevel=5, complib='blosc')
    data_matrix = f.create_carray(f.root, 'data', tb.Float32Atom(), shape=(l, m), filters=filters)
    index_film = 0
    input_data = open('input_file.txt', 'r')
    for line in input_data:
        my_line = np.array(line.split())
        id_film = my_line[0]
        my_line = np.core.defchararray.split(my_line[1:], ":")
        self.data_matrix_search_normal[str(id_film)] = index_film
        self.data_matrix_search_reverse[index_film] = str(id_film)
        for element in my_line:
            if int(element[0]) in self.selected_features:
                column = self.index_selected_feature[str(element[0])]
                data_matrix[index_film, column] = float(element[1])
        index_film += 1
    self.selected_matrix = data_matrix
    json.dump(self.data_matrix_search_reverse,
              open(os.path.join(self.output_path, "data_matrix_search_reverse.json"), 'wb'),
              sort_keys=True, indent=4)
    my_films = Film(
        self.selected_matrix, self.data_matrix_search_reverse, self.path_doc, self.output_path)
    x_matrix_unique = self.selected_matrix[:, :]
    r_matrix_unique = np.asarray(x_matrix_unique)
    f.close()
    return my_films
Question:
I feel that this function is too slow on big datasets, and it takes too long to calculate.
How can I improve and accelerate it? maybe using MapReduce? What is wrong in this function that makes it too slow?
IO + conversions (from str, to str, even converting the same variable to str twice, etc.) + splits + explicit loops. By the way, there is the csv Python module, which may be used to parse your input file; you can experiment with it (I suppose you use a space as the delimiter). I also see that you convert element[0] to int/str, which is bad: you create many temporary objects. If you call this function several times, you may try to reuse some internal objects (an array?). You can also try to implement it in another style, with map or a list comprehension, but experiments are needed...
The general idea of Python code optimization is to avoid explicit Python byte-code execution and to prefer native/C Python functions (for anything). And certainly try to eliminate that many conversions. Also, if the input file is yours, you can format it with fixed-length fields; this lets you avoid split/parse entirely (only string indexing).
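As one possible reading of the advice about conversions, here is a minimal sketch that parses a line with plain string methods and keeps feature IDs as strings throughout. The function name and the column_of mapping are hypothetical stand-ins for self.index_selected_feature in the question.

```python
def parse_feature_line(line, column_of):
    """Parse 'film_id f:score f:score ...' into (film_id, columns, scores).

    column_of maps a feature ID string to a matrix column; features not in
    the mapping are skipped. IDs stay strings, so no int/str round trips.
    """
    film_id, *pairs = line.split()
    columns, scores = [], []
    for pair in pairs:
        feature_id, _, score = pair.partition(":")
        column = column_of.get(feature_id)
        if column is not None:
            columns.append(column)
            scores.append(float(score))
    return film_id, columns, scores

# Hypothetical selection: keep features 4, 9 and 24 as columns 0, 1 and 2.
column_of = {"4": 0, "9": 1, "24": 2}
line = "6729792 4:0.15568 8:0.198796 9:0.279261 13:0.17829 24:0.379707"
film_id, columns, scores = parse_feature_line(line, column_of)
```

A further step, which is my suggestion rather than something stated in the answer above, would be to accumulate row, column and score lists across all lines and build the matrix once with scipy.sparse.coo_matrix((scores, (rows, columns)), shape=...) instead of assigning elements one at a time.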
My first post:
Before beginning, I should note I am relatively new to OOP, though I have done DB/stat work in SAS, R, etc., so my question may not be well posed: please let me know if I need to clarify anything.
My question:
I am attempting to import and parse large CSV files (~6MM rows and larger likely to come). The two limitations that I've run into repeatedly have been runtime and memory (32-bit implementation of Python). Below is a simplified version of my neophyte (nth) attempt at importing and parsing in reasonable time. How can I speed up this process? I am splitting the file as I import and performing interim summaries due to memory limitations and using pandas for the summarization:
Parsing and Summarization:
import pandas as pd

def ParseInts(inString):
    try:
        return int(inString)
    except:
        return None

def TextToYearMo(inString):
    # convert the year to int before scaling; 100*str would repeat the string
    try:
        return 100*int(inString[0:4])+int(inString[5:7])
    except:
        return 100*int(inString[0:4])+int(inString[5:6])

def ParseAllElements(elmValue,elmPos):
    if elmPos in [0,2,5]:
        return elmValue
    elif elmPos == 3:
        return TextToYearMo(elmValue)
    else:
        if elmPos == 18:
            return ParseInts(elmValue.strip('\n'))
        else:
            return ParseInts(elmValue)

def MakeAndSumList(inList):
    df = pd.DataFrame(inList, columns = ['x1','x2','x3','x4','x5',
                                         'x6','x7','x8','x9','x10',
                                         'x11','x12','x13','x14'])
    return df[['x1','x2','x3','x4','x5',
               'x6','x7','x8','x9','x10',
               'x11','x12','x13','x14']].groupby(
                   ['x1','x2','x3','x4','x5']).sum().reset_index()
Function Calls:
def ParsedSummary(longString,delimtr,rowNum):
    keepColumns = [0,3,2,5,10,9,11,12,13,14,15,16,17,18]
    #Do some other stuff that takes very little time
    return [pse.ParseAllElements(longString.split(delimtr)[i],i) for i in keepColumns]

def CSVToList(fileName, delimtr=','):
    with open(fileName) as f:
        enumFile = enumerate(f)
        listEnumFile = set(enumFile)
        for lineCount, l in enumFile:
            pass
    maxSplit = math.floor(lineCount / 10) + 1
    counter = 0
    Summary = pd.DataFrame({}, columns = ['x1','x2','x3','x4','x5',
                                          'x6','x7','x8','x9','x10',
                                          'x11','x12','x13','x14'])
    for counter in range(0,10):
        startRow = int(counter * maxSplit)
        endRow = int((counter + 1) * maxSplit)
        includedRows = set(range(startRow,endRow))
        listOfRows = [ParsedSummary(row,delimtr,rownum)
                      for rownum, row in listEnumFile if rownum in includedRows]
        Summary = pd.concat([Summary,pse.MakeAndSumList(listOfRows)])
        listOfRows = []
        counter += 1
    return Summary
(Again, this is my first question - so I apologize if I simplified too much or, more likely, too little, but I am at a loss as to how to expedite this.)
For runtime comparison:
Using Access I can import, parse, summarize, and merge several files in this size-range in <5 mins (though I am right at its 2GB lim). I'd hope I can get comparable results in Python - presently I'm estimating ~30 min run time for one file. Note: I threw something together in Access' miserable environment only because I didn't have admin rights readily available to install anything else.
Edit: Updated parsing code. Was able to shave off five minutes (est. runtime at 25m) by changing some conditional logic to try/except. Also - runtime estimate doesn't include pandas portion - I'd forgotten I'd commented that out while testing, but its impact seems negligible.
If you want to optimize performance, don't roll your own CSV reader in Python. There is already a standard csv module. Perhaps pandas or numpy have faster csv readers; I'm not sure.
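As a rough illustration of leaning on the standard csv module, here is a minimal sketch that streams rows and keeps only running sums per key, so memory use stays close to the number of distinct keys. The function name, column positions and sample data are made up; the real file in the question has different columns.

```python
import csv
import io
from collections import defaultdict

def summarize(lines, key_columns, value_columns):
    """Stream CSV rows and sum value_columns per unique key_columns tuple."""
    totals = defaultdict(lambda: [0] * len(value_columns))
    for row in csv.reader(lines):
        key = tuple(row[i] for i in key_columns)
        sums = totals[key]
        for slot, i in enumerate(value_columns):
            if row[i]:  # treat empty fields as missing
                sums[slot] += int(row[i])
    return dict(totals)

# Made-up data: two key columns followed by two integer columns.
sample = io.StringIO("a,2019-01,1,10\na,2019-01,2,\nb,2019-02,5,7\n")
result = summarize(sample, key_columns=(0, 1), value_columns=(2, 3))
```

With a real file, passing the object returned by open(fileName, newline='') in place of the StringIO should work the same way, since csv.reader accepts any iterable of lines.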
From https://softwarerecs.stackexchange.com/questions/7463/fastest-python-library-to-read-a-csv-file:
In short, pandas.io.parsers.read_csv beats everybody else, NumPy's loadtxt is impressively slow and NumPy's from_file and load impressively fast.