Load PostgreSQL database with data from a NetCDF file - python

I have a netCDF file with eight variables (sorry, I can't share the actual file).
Each variable has two dimensions, time and station. Time is about 14 steps and station is currently 38,000 different ids.
So for 38,000 different "locations" (actually just an id) we have 8 variables and 14 different times.
$ ncdump -h stationdata.nc
netcdf stationdata {
dimensions:
    station = 38000 ;
    name_strlen = 40 ;
    time = UNLIMITED ; // (14 currently)
variables:
    int time(time) ;
        time:long_name = "time" ;
        time:units = "seconds since 1970-01-01" ;
    char station_name(station, name_strlen) ;
        station_name:long_name = "station_name" ;
        station_name:cf_role = "timeseries_id" ;
    float var1(time, station) ;
        var1:long_name = "Variable 1" ;
        var1:units = "m3/s" ;
    float var2(time, station) ;
        var2:long_name = "Variable 2" ;
        var2:units = "m3/s" ;
...
This data needs to be loaded into a Postgres database so that it can be joined to some geometries matching the station_name for later visualization.
Currently I have done this in Python with the netCDF4 module. It works, but it takes forever!
Now I am looping like this:
times = rootgrp.variables['time']
stations = rootgrp.variables['station_name']
var1 = rootgrp.variables['var1']
for timeindex, time in enumerate(times):
    for stationindex, stationnamearr in enumerate(stations):
        var1val = var1[timeindex][stationindex]
        print "INSERT INTO ncdata (validtime, stationname, var1) " \
              "VALUES ('%s','%s', %s);" % \
              (time, stationnamearr, var1val)
This takes several minutes to run on my machine, and I have a feeling it could be done in a much cleverer way.
Does anyone have an idea how this can be done in a smarter way? Preferably in Python.

Not sure this is the right way to do it, but I found a good way to solve this and thought I should share it.
In the first version the script took about one hour to run. After a rewrite of the code it now runs in less than 30 seconds!
The big thing was to use numpy arrays and transpose the variable arrays from the NetCDF reader so they become rows, and then stack all columns into one matrix. This matrix was then loaded into the db using psycopg2's copy_from function. I got the code for that from this question:
Use binary COPY table FROM with psycopg2
Parts of my code:
import cStringIO

import numpy as np
import psycopg2
from netCDF4 import num2date

# rootgrp is the netCDF4.Dataset opened earlier;
# stationnames (a 1-D array of station name strings) is built from
# rootgrp.variables['station_name'] - that part is not shown in this snippet
dates = num2date(rootgrp.variables['time'][:], units=rootgrp.variables['time'].units)
var1 = rootgrp.variables['var1']
var2 = rootgrp.variables['var2']
cpy = cStringIO.StringIO()
for timeindex, time in enumerate(dates):
    validtimes = np.empty(var1[timeindex].size, dtype="object")
    validtimes.fill(time)
    # Transpose and stack the arrays of parameters
    # [a,a,a,a]        [[a,b,c],
    # [b,b,b,b]   =>    [a,b,c],
    # [c,c,c,c]         [a,b,c],
    #                    [a,b,c]]
    a = np.hstack((
        validtimes.reshape(validtimes.size, 1),
        stationnames.reshape(stationnames.size, 1),
        var1[timeindex].reshape(var1[timeindex].size, 1),
        var2[timeindex].reshape(var2[timeindex].size, 1)
    ))
    # Fill the cStringIO with a text representation of the created array
    for row in a:
        cpy.write(row[0].strftime("%Y-%m-%d %H:%M") + '\t' + row[1] + '\t' + '\t'.join([str(x) for x in row[2:]]) + '\n')

conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
curs = conn.cursor()
cpy.seek(0)
curs.copy_from(cpy, 'ncdata', columns=('validtime', 'stationname', 'var1', 'var2'))
conn.commit()

There are a few simple improvements you can make to speed this up. They are all independent; you can try all of them, or just a couple, to see if it's fast enough. They're in roughly ascending order of difficulty:
Use the psycopg2 database driver; it's faster.
Wrap the whole block of inserts in a transaction. If you're using psycopg2 you're already doing this: it auto-opens a transaction that you have to commit at the end.
Collect several rows' worth of values in an array and do a multi-valued INSERT every n rows.
Use more than one connection and do the inserts via helper processes - see the multiprocessing module and the sketch after this list. Threads won't work as well because of GIL (global interpreter lock) issues.
If you don't want to use one big transaction, you can set synchronous_commit = off and set a commit_delay so the connection can return before the disk flush actually completes. This won't help much if you're doing all the work in one transaction.
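A minimal sketch of the multiprocessing idea above; the ncdata table and column names are taken from the question, while the connection string, batch size, and worker count are placeholder assumptions, not a tested implementation:
from multiprocessing import Pool

import psycopg2

def insert_batch(rows):
    # each worker opens its own connection and inserts one batch of
    # (validtime, stationname, var1) tuples
    conn = psycopg2.connect("host=postgresserver dbname=nc user=user password=passwd")
    curs = conn.cursor()
    curs.executemany(
        "INSERT INTO ncdata (validtime, stationname, var1) VALUES (%s, %s, %s)",
        rows)
    conn.commit()
    conn.close()

def parallel_insert(all_rows, batch_size=1000, workers=4):
    # split the rows into batches and let a small pool of processes insert them
    batches = [all_rows[i:i + batch_size] for i in range(0, len(all_rows), batch_size)]
    pool = Pool(workers)
    pool.map(insert_batch, batches)
    pool.close()
    pool.join()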
Multi-valued inserts
Psycopg2 doesn't directly support multi-valued INSERT but you can just write:
curs.execute("""
    INSERT INTO blah(a,b) VALUES
    (%s,%s),
    (%s,%s),
    (%s,%s),
    (%s,%s),
    (%s,%s);
    """, parms)
and loop with something like:
parms = []
rownum = 0
for x in input_data:
    parms.extend([x.firstvalue, x.secondvalue])
    rownum += 1
    if rownum % 5 == 0:
        curs.execute("""INSERT ...""", tuple(parms))
        del parms[:]
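As a side note, newer psycopg2 versions (2.7 and up) ship a helper in psycopg2.extras that builds the multi-valued statement for you; a minimal sketch, assuming the same blah(a,b) table and input_data objects as above:
from psycopg2.extras import execute_values

# execute_values expands the single VALUES %s placeholder into one
# multi-valued INSERT per page_size rows
rows = [(x.firstvalue, x.secondvalue) for x in input_data]
execute_values(curs,
               "INSERT INTO blah (a, b) VALUES %s",
               rows,
               page_size=100)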

Organize your loop to access all the variables for each time step. In other words, read and write a record at a time rather than a variable at a time. This can speed things up enormously, especially if the source netCDF dataset is stored on a file system with large disk blocks, e.g. 1 MB or larger. For an explanation of why this is faster and a discussion of the resulting order-of-magnitude speedups, see this NCO speedup discussion, starting with entry 7.
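A minimal sketch of that record-oriented access pattern, using the rootgrp dataset and variable names from the question (the list of variable names and the downstream loader are assumptions):
varnames = ['var1', 'var2']  # extend with the remaining variables
times = rootgrp.variables['time'][:]
stations = rootgrp.variables['station_name'][:]

for timeindex, time in enumerate(times):
    # one contiguous read per variable for this time step,
    # instead of indexing variable-by-variable per station
    slices = [rootgrp.variables[name][timeindex, :] for name in varnames]
    for stationindex, station in enumerate(stations):
        row = [time, station] + [s[stationindex] for s in slices]
        # hand `row` to whatever bulk loader you use (e.g. the COPY buffer above)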

Related

Hierarchical dictionary (reducing memory footprint or using a database)

I am working with extremely high-dimensional biological count data (single-cell RNA sequencing, where rows are cell IDs and columns are genes).
Each dataset is a separate flat file (AnnData format). Each flat file can be broken down by various metadata attributes, including by cell type (e.g. muscle cell, heart cell), subtypes (e.g. a lung dataset can be split into normal lung and cancerous lung), cancer stage (e.g. stage 1, stage 2), etc.
The goal is to pre-compute aggregate metrics for a specific metadata column / sub-group / dataset / cell-type / gene combination and keep them readily accessible, so that when a person queries my web app for a plot I can quickly retrieve results (refer to the figure below to understand what I want to create). I have written Python code to assemble the dictionary below, and it has sped up how quickly I can create visualizations.
The only issue now is that the memory footprint of this dictionary is very high (there are ~10,000 genes per dataset). What is the best way to reduce the memory footprint of this dictionary? Or should I consider another storage framework (I briefly saw something called Redis Hashes)?
One option to reduce your memory footprint but keep fast lookup is to use an hdf5 file as a database. This will be a single large file that lives on your disk instead of memory, but is structured the same way as your nested dictionaries and allows for rapid lookups by reading in only the data you need. Writing the file will be slow, but you only have to do it once and then upload to your web-app.
To test this idea, I've created two test nested dictionaries in the format of the diagram you shared. The small one has 1e5 metadata/group/dataset/celltype/gene entries, and the other is 10 times larger.
Writing the small dict to hdf5 took ~2 minutes and resulted in a file 140 MB in size while the larger dict-dataset took ~14 minutes to write to hdf5 and is a 1.4 GB file.
Querying the small and large hdf5 files takes similar amounts of time, showing that the queries scale well to more data.
Here's the code I used to create the test dict-datasets, write to hdf5, and query
import h5py
import numpy as np
import time

def create_data_dict(level_counts):
    """
    Create test data in the same nested-dict format as the diagram you show
    The Agg_metric values are random floats between 0 and 1
    (you shouldn't need this function since you already have real data in dict format)
    """
    if not level_counts:
        return {f'Agg_metric_{i+1}': np.random.random() for i in range(num_agg_metrics)}
    level, num_groups = level_counts.popitem()
    return {f'{level}_{i+1}': create_data_dict(level_counts.copy()) for i in range(num_groups)}

def write_dict_to_hdf5(hdf5_path, d):
    """
    Write the nested dictionary to an HDF5 file to act as a database
    only have to create this file once, but can then query it any number of times
    (unless the data changes)
    """
    def _recur_write(f, d):
        for k, v in d.items():
            # check if the next level is also a dict
            sk, sv = v.popitem()
            v[sk] = sv
            if type(sv) == dict:
                # this is a 'node', move on to next level
                _recur_write(f.create_group(k), v)
            else:
                # this is a 'leaf', stop here
                leaf = f.create_group(k)
                for sk, sv in v.items():
                    leaf.attrs[sk] = sv
    with h5py.File(hdf5_path, 'w') as f:
        _recur_write(f, d)

def query_hdf5(hdf5_path, search_terms):
    """
    Query the hdf5_path with a list of search terms
    The search terms must be in the order of the dict, and have a value at each level
    Output is a dict of agg stats
    """
    with h5py.File(hdf5_path, 'r') as f:
        k = '/'.join(search_terms)
        try:
            f = f[k]
        except KeyError:
            print('oh no! at least one of the search terms wasnt matched')
            return {}
        return dict(f.attrs)

################
#    start     #
################

# this "small_level_counts" results in an hdf5 file of size 140 MB (took < 2 minutes to make)
# all possible nested dictionaries are made,
# so there are 40*30*10*3*3 = ~1e5 metadata/group/dataset/celltype/gene entries
num_agg_metrics = 7
small_level_counts = {
    'Gene': 40,
    'Cell_Type': 30,
    'Dataset': 10,
    'Unique_Group': 3,
    'Metadata': 3,
}

# "large_level_counts" results in an hdf5 file of size 1.4 GB (took 14 mins to make)
# has 400*30*10*3*3 = ~1e6 metadata/group/dataset/celltype/gene combinations
num_agg_metrics = 7
large_level_counts = {
    'Gene': 400,
    'Cell_Type': 30,
    'Dataset': 10,
    'Unique_Group': 3,
    'Metadata': 3,
}

# Determine which test dataset to use
small_test = True
if small_test:
    level_counts = small_level_counts
    hdf5_path = 'small_test.hdf5'
else:
    level_counts = large_level_counts
    hdf5_path = 'large_test.hdf5'

np.random.seed(1)

start = time.time()
data_dict = create_data_dict(level_counts)
print('created dict in {:.2f} seconds'.format(time.time() - start))

start = time.time()
write_dict_to_hdf5(hdf5_path, data_dict)
print('wrote hdf5 in {:.2f} seconds'.format(time.time() - start))

# Search terms in order of most broad to least
search_terms = ['Metadata_1', 'Unique_Group_3', 'Dataset_8', 'Cell_Type_15', 'Gene_17']

start = time.time()
query_result = query_hdf5(hdf5_path, search_terms)
print('queried in {:.2f} seconds'.format(time.time() - start))

direct_result = data_dict['Metadata_1']['Unique_Group_3']['Dataset_8']['Cell_Type_15']['Gene_17']
print(query_result == direct_result)
Although Python dictionaries themselves are fairly efficient in terms of memory usage, you are likely storing multiple copies of the strings you are using as dictionary keys. From your description of your data structure, it is likely that you have 10,000 copies of 'Agg_metric_1', 'Agg_metric_2', etc. for every gene in your dataset, and these duplicate strings are probably taking up a significant amount of memory. They can be deduplicated with sys.intern, so that although you still have just as many references to the string in your dictionary, they all point to a single copy in memory. You would only need to make a minimal adjustment to your code, simply changing the assignment to data[sys.intern('Agg_metric_1')] = value. I would do this for all of the keys used at all levels of your dictionary hierarchy.
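A minimal sketch of that adjustment; the key names are just the Agg_metric placeholders from the test code above:
import sys

def build_leaf(values):
    # interned keys: every leaf dict reuses the same string objects
    # instead of carrying its own copy of e.g. 'Agg_metric_1'
    return {sys.intern('Agg_metric_{}'.format(i + 1)): v
            for i, v in enumerate(values)}

leaf_a = build_leaf([0.1, 0.2, 0.3])
leaf_b = build_leaf([0.4, 0.5, 0.6])
print(list(leaf_a)[0] is list(leaf_b)[0])  # True - both leaves share one key object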

openpyxl performance in read-only mode

I have a question about the performance of openpyxl when reading files.
I am trying to read the same xlsx file using ProcessPoolExecutor; a single file is maybe 500,000 to 800,000 rows.
In read-only mode, calling sheet.iter_rows() without ProcessPoolExecutor and reading the entire worksheet, it takes about 1 s to process 10,000 rows of data. But when I set the max_row and min_row parameters with ProcessPoolExecutor, it is different:
totalRows: 200,000
1 ~ 10000 take 1.03s
10001 ~ 20000 take 1.73s
20001 ~ 30000 take 2.41s
30001 ~ 40000 take 3.27s
40001 ~ 50000 take 4.06s
50001 ~ 60000 take 4.85s
60001 ~ 70000 take 5.93s
70001 ~ 80000 take 6.64s
80001 ~ 90000 take 7.72s
90001 ~ 100000 take 8.18s
100001 ~ 110000 take 9.42s
110001 ~ 120000 take 10.04s
120001 ~ 130000 take 10.61s
130001 ~ 140000 take 11.17s
140001 ~ 150000 take 11.52s
150001 ~ 160000 take 12.48s
160001 ~ 170000 take 12.52s
170001 ~ 180000 take 13.01s
180001 ~ 190000 take 13.25s
190001 ~ 200000 take 13.46s
total: take 33.54s
Obviously, just looking at the results of each process, the time consumed is indeed less.
But the overall time consumption has increased.
And the further back the range, the more time each process consumes.
Reading 200,000 rows with a single process only takes about 20 s.
I'm not very clear on iterators and haven't looked closely at the source code of openpyxl.
Judging from the time consumption, even if the range is set, the iterator still seems to need to start processing from row 1. I don't know if this is the case.
I'm not a professional programmer, so if you happen to have relevant experience, please try to keep the explanation as simple as possible.
Code here:
import openpyxl
from time import perf_counter
from concurrent.futures import ProcessPoolExecutor

def read(file, minRow, maxRow):
    start = perf_counter()
    book = openpyxl.load_workbook(filename=file, read_only=True, keep_vba=False, data_only=True, keep_links=False)
    sheet = book.worksheets[0]
    val = [[cell.value for cell in row] for row in sheet.iter_rows(min_row=minRow, max_row=maxRow)]
    book.close()
    end = perf_counter()
    print(f'{minRow} ~ {maxRow}', 'take {0:.2f}s'.format(end - start))
    return val

def parallel(file: str, rowRanges: list[tuple]):
    futures = []
    with ProcessPoolExecutor(max_workers=6) as pool:
        for minRow, maxRow in rowRanges:
            futures.append(pool.submit(read, file, minRow, maxRow))
    return futures

if __name__ == '__main__':
    file = '200000.xlsx'
    start = perf_counter()
    tasks = getRowRanges(file)  # helper (not shown) that builds the (min_row, max_row) ranges
    parallel(file, tasks)
    end = perf_counter()
    print('total: take {0:.2f}s'.format(end - start))
Q :"... a question about the performance ..."... please try to be as simple as possible ...
A :Having 6 Ferrari sport racing cars ( ~ max_workers = 6 )does not provide a warranty to move 6 drivers ( ~ The Workload )
from start to the end
in 1 / 6 of the time.
That does not work,even if we have a 6-lane wide racing track ( which we have not ), as you have already reported, there is a bottleneck ( a 1-lane wide only bridge, on the way from the start to the end of the race ).
Actually,there are more performance-devastating bottlenecks ( The Bridge as the main performance blocker and a few smaller, less blocking, nevertheless performance further degrading bridges ), some avoidable, some not :
the file-I/O has been no faster than ~ 10k [rows/s] in a pure solo serial runso never expect the same speed to appear "across" the same (single, single lane) bridge ( the shared file-I/O hardware interface ) for any next, concurrently running Ferrari, competing for using the same resource, already used for the first process to read from file ( real-hardware latencies matter, a lot ... the Devil is in details )
another, avoidable, degradation comes with expensive add-on costs, paid for each and every list.append(). Here, try to choose a different object, avoiding a list-based storage at all and pre-allocate a block-storage ( one time paid RAM-allocation costs ) having an advantage of a know size of the result-storage, and keep storing data on-the-fly, best in cache-line respectful blocks than incrementally ( might be too technical, yet if performance is to get maxed-up, these details matter )
dual-iterator SLOC is nice for a workbook example, yet if performance is or focus, try to find another way, perhaps using even a simpler XLS-reader ( without as many machinery under the hood, as VBA interpreter et al ), which can export the row-wise consumed cells into a plain-text, that can get collected way way faster, than the as-is code did in a triplet-of-nested-iterators "syntax-sugared" SLOC [ [ ... for cell in row ] for row in sheet.iterator(...) ]
last comes also the process instantiation costs, that enter the revised Amdahl's Law, reformulated so that it takes into account also the overheads and atomicity of (blocks of) work. For ( technically independent ) details may see this and these - where interactive speedup-simulator calculators are often linked to test the principal ceiling any such parallelisation efforts will never be able to overcome.
Last, but by no means the least - The MEMORY: take your .xlsx file size and multiply it by ~ 50x and next by 6 workers ~ that amount of physical memory is expected to be used ( see doc: "Memory use is fairly high in comparison with other libraries and applications and is approximately 50 times the original file size, e.g. 2.5 GB for a 50 MB Excel file" credit to #Charlie Clark ) If your system does not have that much physical-RAM, the O/S starts to suffocate as truing to allocate that and goes into a RAM-swap-"thrashing" mode ( moving blocks-of-RAM to disk-swap area and back and there and back, as interleaving the 6 workers going forwards in Virtual-Memory-managed address space simulated inside a small physical-RAM at awfully high (more than 5(!) orders of magnitude longer) disk-I/O latencies, trying to cross the already blocking performance bottleneck, yeah - The Bridge ... where traffic-jam is already at max, as 6 workers try to do the very same - move some more data across the even more blocked bottleneck ) all that at awfully great latency skyrocketing jump on doing so (see URL on latencies above). A hint may, yet need not save us, plus this and this may reduce, better straight prevent further inefficiencies
I believe I have the same problem as the OP.
The puzzling part is that once min_row and max_row are set on sheet.iter_rows(), concurrent execution does not apply anymore, as if there were some sort of global lock in effect.
The following code tries to dump data from one single large sheet of an Excel file. The idea is to take advantage of min_row and max_row on sheet.iter_rows to lock down a reading window, and to use ThreadPoolExecutor for concurrent execution.
from concurrent import futures
from openpyxl import load_workbook

# _file (source .xlsx path) and _dst (output directory as a pathlib.Path) are defined elsewhere

# artificially create a set of row index ranges,
# 10,000 rows per set till 1,000,000th row
# something like [(1, 10_000), (10_001, 20_000), .....]
def _ranges():
    _i = 1
    _n = 10_000
    while _i <= 1_000_000:
        yield _i, _i + _n - 1
        _i += _n

def write_to_file(file, mn, mx):
    print(f'write to file {mn}-{mx}')
    wb = load_workbook(file, read_only=True
                       , data_only=True, keep_links=False, keep_vba=False)
    sheet = wb[wb.sheetnames[0]]
    out_file = _dst / f"{mn}-{mx}.txt"
    row_count = 1
    with out_file.open('w', encoding='utf8') as f:
        rows = sheet.iter_rows(values_only=True, min_row=mn, max_row=mx)
        for row in rows:
            print(f'section {mn}-{mx} write {row_count}')
            f.write(' '.join([str(c).replace('\n', ' ') for c in row]) + '\n')
            row_count += 1

def main():
    fut = []
    with futures.ThreadPoolExecutor() as ex:
        for mn, mx in _ranges():
            fut.append(ex.submit(write_to_file, _file, mn, mx))
        futures.wait(fut)
All the write_to_file() calls do kick off at once.
Iteration over rows, however, seems to behave in a strictly sequential fashion.
With a little change:
def write_to_file(file, mn, mx):
    print(f'write to file {mn}-{mx}')
    wb = load_workbook(file, read_only=True
                       , data_only=True, keep_links=False, keep_vba=False)
    sheet = wb[wb.sheetnames[0]]
    out_file = _dst / f"{mn}-{mx}.txt"
    row_count = 1
    with out_file.open('w', encoding='utf8') as f:
        rows = sheet.iter_rows(values_only=True)
        # ^^^^^^^^^^^^^^^^^___ min_row/max_row not set
        for row in rows:
            print(f'section {mn}-{mx} write {row_count}')
            f.write(' '.join([str(c).replace('\n', ' ') for c in row]) + '\n')
            row_count += 1
Section 20001-30000 writes first!
The chaotic effect of concurrent execution takes place.
But without min_row and max_row there is no point in having concurrent execution at all.
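Since read-only mode has to stream the sheet XML from the top anyway, one workaround is a single sequential pass that splits the output into chunk files as it goes, so earlier rows are never re-read; a minimal sketch, reusing the _dst output directory (a pathlib.Path) from the snippets above:
from openpyxl import load_workbook

def dump_in_chunks(file, dst, chunk_size=10_000):
    # one sequential pass over the sheet; start a new output file every
    # chunk_size rows instead of re-opening the workbook per range
    wb = load_workbook(file, read_only=True, data_only=True,
                       keep_links=False, keep_vba=False)
    sheet = wb[wb.sheetnames[0]]
    f = None
    for i, row in enumerate(sheet.iter_rows(values_only=True), start=1):
        if (i - 1) % chunk_size == 0:
            if f:
                f.close()
            f = (dst / f"{i}-{i + chunk_size - 1}.txt").open('w', encoding='utf8')
        f.write(' '.join(str(c).replace('\n', ' ') for c in row) + '\n')
    if f:
        f.close()
    wb.close()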

How do I improve the speed of this parser using python?

I am currently parsing historic delay data from a public transport network in Sweden. I have ~5,700 files (one from every 15 seconds) from the 27th of January containing momentary delay data for vehicles on active trips in the network. It is, unfortunately, a lot of overhead / duplicate data, so I want to parse out the relevant stuff to do visualizations on it.
However, when I try to parse and filter out the relevant delay data on a trip level using the script below, it performs really slowly. It has been running for over 1.5 hours now (on my 2019 MacBook Pro 15") and isn't finished yet.
How can I optimize / improve this Python parser?
Or should I reduce the number of files, i.e. the frequency of the data collection, for this task?
Thank you so much in advance. 💗
from google.transit import gtfs_realtime_pb2
import gzip
import os
import datetime
import csv
import numpy as np

directory = '../data/tripu/27/'
datapoints = np.zeros((0, 3), int)
read_trips = set()

# Loop through all files in directory
for filename in os.listdir(directory)[::3]:
    try:
        # Uncompress and parse protobuff-file using gtfs_realtime_pb2
        with gzip.open(directory + filename, 'rb') as file:
            response = file.read()
            feed = gtfs_realtime_pb2.FeedMessage()
            feed.ParseFromString(response)
            print("Filename: " + filename, "Total entities: " + str(len(feed.entity)))
            for trip in feed.entity:
                if trip.trip_update.trip.trip_id not in read_trips:
                    try:
                        # stopsOnTrip maps trip_id -> stops on that trip (built elsewhere)
                        if len(trip.trip_update.stop_time_update) == len(stopsOnTrip[trip.trip_update.trip.trip_id]):
                            print("\t", "Adding delays for", len(trip.trip_update.stop_time_update), "stops, on trip_id", trip.trip_update.trip.trip_id)
                            for i, stop_time_update in enumerate(trip.trip_update.stop_time_update[:-1]):
                                # Store the delay data point (arrival difference of two ascending nodes)
                                delay = int(trip.trip_update.stop_time_update[i+1].arrival.delay - trip.trip_update.stop_time_update[i].arrival.delay)
                                # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                                ts = int(trip.trip_update.stop_time_update[i+1].arrival.time)
                                key = int(str(trip.trip_update.stop_time_update[i].stop_id) + str(trip.trip_update.stop_time_update[i+1].stop_id))
                                # Append data to numpy array
                                datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
                            read_trips.add(trip.trip_update.trip.trip_id)
                    except KeyError:
                        continue
                else:
                    continue
    except OSError:
        continue
I suspect the problem here is repeatedly calling np.append to add a new row to a numpy array. Because the size of a numpy array is fixed when it is created, np.append() must create a new array, which means that it has to copy the previous array. On each loop, the array is bigger and so all these copies add a quadratic factor to your execution time. This becomes significant when the array is quite big (which apparently it is in your application).
As an alternative, you could just create an ordinary Python list of tuples, and then if necessary convert that to a complete numpy array at the end.
That is (only the modified lines):
datapoints = []
# ...
datapoints.append((key,ts,delay))
# ...
npdata = np.array(datapoints, dtype=int)
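To see the quadratic effect described above, here is a quick, self-contained comparison (the sizes are arbitrary, purely for illustration):
import time

import numpy as np

n = 20_000

start = time.perf_counter()
arr = np.zeros((0, 3), int)
for i in range(n):
    arr = np.append(arr, np.array([[i, i, i]]), axis=0)  # copies the whole array every time
print('np.append loop:  {:.2f}s'.format(time.perf_counter() - start))

start = time.perf_counter()
rows = []
for i in range(n):
    rows.append((i, i, i))  # amortised O(1) per append
npdata = np.array(rows, dtype=int)
print('list + np.array: {:.2f}s'.format(time.perf_counter() - start))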
I still think the parse routine is your bottleneck (even if it did come from Google), but all those '.'s were killing me! (And they do slow down performance somewhat.) Also, I converted your i, i+1 iteration to use two iterators zipping through the list of updates; this is a slightly more advanced style of working through a list. Plus the cur_update/nxt_update names helped me keep straight which one you wanted to reference. Finally, I removed the trailing "else: continue", since you are at the end of the for loop anyway.
for trip in feed.entity:
    this_trip_update = trip.trip_update
    this_trip_id = this_trip_update.trip.trip_id
    if this_trip_id not in read_trips:
        try:
            if len(this_trip_update.stop_time_update) == len(stopsOnTrip[this_trip_id]):
                print("\t", "Adding delays for", len(this_trip_update.stop_time_update), "stops, on trip_id",
                      this_trip_id)
                # create two iterators to walk through the list of updates
                cur_updates = iter(this_trip_update.stop_time_update)
                nxt_updates = iter(this_trip_update.stop_time_update)
                # advance the nxt_updates iter so it is one ahead of cur_updates
                next(nxt_updates)
                for cur_update, nxt_update in zip(cur_updates, nxt_updates):
                    # Store the delay data point (arrival difference of two ascending nodes)
                    delay = int(nxt_update.arrival.delay - cur_update.arrival.delay)
                    # Store contextual metadata (timestamp and edgeID) for the unique delay data point
                    ts = int(nxt_update.arrival.time)
                    key = "{}/{}".format(cur_update.stop_id, nxt_update.stop_id)
                    # Append data to numpy array
                    datapoints = np.append(datapoints, np.array([[key, ts, delay]]), axis=0)
                read_trips.add(this_trip_id)
        except KeyError:
            continue
This code should be equivalent to what you posted, and I don't really expect major performance gains either, but perhaps this will be more maintainable when you come back to look at it in 6 months.
(This probably is more appropriate for CodeReview, but I hardly ever go there.)

U-SQL Python extension: very slow performance

I'm doing something seemingly trivial that takes much longer than I would expect it to. I'm loading a 70MB file, running it through a reducer which calls a Python script that does not modify the data, and writing the data back to a new file.
It takes 42 minutes when I run it through the Python script; it takes less than one minute (including compilation) if I don't.
I'm trying to understand:
What am I doing wrong?
What is going on underneath the hood that takes so long?
I store the input and output files on Azure Data Lake Store. I'm using parallelism 1, a TSV input file of about 70MB (2000 rows, 2 columns). I'm just passing the data through. It takes 42 minutes until the job finishes.
I generated the test input data with this Python script:
import base64
import os

# create a roughly 70MB TSV file with 2000 rows and 2 columns: ID (integer) and roughly 30KB data (string)
fo = open('testinput.tsv', 'wb')
for i in range(2000):
    fo.write(str(i).encode() + b'\t' + base64.b85encode(bytearray(os.urandom(30000))) + b'\n')
fo.close()
This is the U-SQL script I use:
REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript = @"
def usqlml_main(df):
    return df
";

@step1 =
    EXTRACT
        col1 string,
        col2 string
    FROM "/test/testinput.tsv" USING Extractors.Tsv();

@step2 =
    REDUCE @step1 ON col1
    PRODUCE col1 string, col2 string
    USING new Extension.Python.Reducer(pyScript:@myScript);

OUTPUT @step2
    TO "/test/testoutput.csv"
    USING Outputters.Tsv(outputHeader: true);
I have the same issue.
I have a 116 MB CSV file I want to read in (and then do stuff with). When trying to read in the file and do nothing in the Python script, it times out after 5 hours. I even tried reducing the file to 9.28 MB; it also times out after 5 hours.
However, when reduced to 1.32 MB the job finishes after 16 minutes (with results as expected).
REFERENCE ASSEMBLY [ExtPython];

DECLARE @myScript = @"
def usqlml_main(df):
    return df
";

@train =
    EXTRACT txt string,
            casegroup string
    FROM "/test/t.csv"
    USING Extractors.Csv();

@train =
    SELECT *,
           1 AS Order
    FROM @train
    ORDER BY Order
    FETCH 10000;

@train =
    SELECT txt,
           casegroup
    FROM @train;  // 1000 rows: 16 mins, 10000 rows: times out at 5 hours.

@m =
    REDUCE @train ON txt, casegroup
    PRODUCE txt string, casegroup string
    USING new Extension.Python.Reducer(pyScript:@myScript);

OUTPUT @m
    TO "/test/t_res.csv"
    USING Outputters.Csv();
REDUCE @ROWSET ALL
If you don't reduce on ALL, it will invoke the Python function per row.
If you want to use parallelism, you could create temporary groups to reduce on.

Fetching huge data from Oracle in Python

I need to fetch huge data from Oracle (using cx_Oracle) in Python 2.6 and produce a CSV file.
The data size is about 400k records x 200 columns x 100 chars each.
Which is the best way to do that?
Now, using the following code...
ctemp = connection.cursor()
ctemp.execute(sql)
ctemp.arraysize = 256
for row in ctemp:
    file.write(row[1])
    ...
...the script remains in the loop for hours and nothing is written to the file... (is there a way to print a message for every record extracted?)
Note: I don't have any issue with Oracle, and running the query in SQL Developer is super fast.
Thank you, gian
You should use cur.fetchmany() instead.
It will fetch a chunk of rows, the size of which is defined by arraysize (256).
Python code:
def chunks(cur):  # 256
    global log, d
    while True:
        # log.info('Chunk size %s' % cur.arraysize, extra=d)
        rows = cur.fetchmany()
        if not rows:
            break
        yield rows
Then do your processing in a for loop:
for i, chunk in enumerate(chunks(cur)):
    for row in chunk:
        pass  # process your rows here
That is exactly how I do it in my TableHunter for Oracle.
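For the CSV output asked about in the question, a minimal sketch combining that fetchmany pattern with the csv module (the output path and progress messages are placeholders):
import csv

def export_to_csv(cur, path, batch_size=256):
    # stream the result set to disk in fetchmany-sized chunks,
    # printing progress so you can see the loop is alive
    cur.arraysize = batch_size
    written = 0
    with open(path, 'wb') as f:  # on Python 3 use open(path, 'w', newline='')
        writer = csv.writer(f)
        while True:
            rows = cur.fetchmany()
            if not rows:
                break
            writer.writerows(rows)
            written += len(rows)
            print('%d rows written' % written)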
Add print statements after each line.
Add a counter to your loop indicating progress after every N rows.
Look into a module like 'progressbar' for displaying a progress indicator.
I think your code is asking the database for the data one row at a time, which might explain the slowness.
Try:
ctemp = connection.cursor()
ctemp.execute(sql)
Results = ctemp.fetchall()
for row in Results:
    file.write(row[1])
