I have a Python script that reads a NetCDF file and inserts climatic data into a PostgreSQL table, one row at a time.
This of course takes forever, and now I would like to figure out how I can optimize this code.
I have been thinking about building a huge list and then using the COPY command. However, I am unsure how one would work that out. Another way might be to write to a CSV file and then copy this CSV file into the Postgres database using PostgreSQL's COPY command. I guess that would be quicker than inserting one row at a time.
If you have any suggestions on how this could be optimized, then I would really appreciate it. The netcdf file is available here (need to register though):
http://badc.nerc.ac.uk/browse/badc/cru/data/cru_ts/cru_ts_3.21/data/pre
# NetCDF to PostGreSQL database
# CRU-TS 3.21 precipitation and temperature data. From NetCDF to database table
# Requires Python2.6, Postgresql, Psycopg2, Scipy
# Tested using Vista 64bit.
# Import modules
import psycopg2, time, datetime
from scipy.io import netcdf
# Establish connection
db1 = psycopg2.connect("host=192.168.1.162 dbname=dbname user=username password=password")
cur = db1.cursor()
### Create Table
print str(time.ctime())+ " Creating precip table."
cur.execute("DROP TABLE IF EXISTS precip;")
cur.execute("CREATE TABLE precip (gid serial PRIMARY KEY not null, year int, month int, lon decimal, lat decimal, pre decimal);")
### Read netcdf file
f = netcdf.netcdf_file('/home/username/output/project_v2/inputdata/precipitation/cru_ts3.21.1901.2012.pre.dat.nc', 'r')
##
### Create lathash
print str(time.ctime())+ " Looping through lat coords."
temp = f.variables['lat'].data.tolist()
lathash = {}
for entry in temp:
    print str(entry)
    lathash[temp.index(entry)] = entry
##
### Create lonhash
print str(time.ctime())+ " Looping through long coords."
temp = f.variables['lon'].data.tolist()
lonhash = {}
for entry in temp:
    print str(entry)
    lonhash[temp.index(entry)] = entry
##
### Loop through every observation. Set timedimension and lat and long observations.
for _month in xrange(1344):
    if _month < 528:
        print(str(_month))
        print("Not yet")
    else:
        thisyear = int((_month)/12+1901)
        thismonth = ((_month) % 12)+1
        thisdate = datetime.date(thisyear, thismonth, 1)
        print(str(thisdate))
        _time = int(_month)
        for _lon in xrange(720):
            for _lat in xrange(360):
                data = [int(thisyear), int(thismonth), lonhash[_lon], lathash[_lat], f.variables[('pre')].data[_time, _lat, _lon]]
                cur.execute("INSERT INTO precip (year, month, lon, lat, pre) VALUES "+str(tuple(data))+";")
        db1.commit()
cur.execute("CREATE INDEX idx_precip ON precip USING btree(year, month, lon, lat, pre);")
cur.execute("ALTER TABLE precip ADD COLUMN geom geometry;")
cur.execute("UPDATE precip SET geom = ST_SetSRID(ST_Point(lon,lat), 4326);")
cur.execute("CREATE INDEX idx_precip_geom ON precip USING gist(geom);")
db1.commit()
cur.close()
db1.close()
print str(time.ctime())+ " Done!"
Use psycopg2's copy_from.
It expects a file-like object, but that can be your own class that reads and processes the input file and returns it on demand via the read() and readline() methods.
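As a minimal sketch of such a wrapper (not from the original answer; RowStream and generate_rows are made-up names, with generate_rows assumed to be a generator yielding one (year, month, lon, lat, pre) tuple per observation from the NetCDF file):
class RowStream(object):
    """Minimal file-like object that feeds tab-separated rows to copy_from on demand."""
    def __init__(self, rows):
        self._rows = rows     # any iterator of row tuples
        self._buf = ''

    def _next_line(self):
        try:
            row = next(self._rows)
        except StopIteration:
            return ''         # empty string signals end of data
        return '\t'.join(str(v) for v in row) + '\n'

    def readline(self):
        if self._buf:
            line, _, self._buf = self._buf.partition('\n')
            return line + '\n'
        return self._next_line()

    def read(self, size=-1):
        # accumulate whole lines until we can hand back `size` characters
        while size < 0 or len(self._buf) < size:
            line = self._next_line()
            if not line:
                break
            self._buf += line
        if size < 0:
            out, self._buf = self._buf, ''
        else:
            out, self._buf = self._buf[:size], self._buf[size:]
        return out

# Usage sketch (generate_rows is your own parsing generator):
# cur.copy_from(RowStream(generate_rows(f)), 'precip',
#               columns=('year', 'month', 'lon', 'lat', 'pre'))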
If you're not confident doing that, you could - as you said - generate a CSV tempfile and then COPY that. For very best performance you'd generate the CSV (Python's csv module is useful) then copy it to the server and use server-side COPY thetable FROM '/local/path/to/file', thus avoiding any network overhead.
Most of the time it's easier to use copy ... from stdin via something like psql's \copy or psycopg2's copy_from, and plenty fast enough. Especially if you couple it with producer/consumer feeding via Python's multiprocessing module (not as complicated as it sounds) so your code to parse the input isn't stuck waiting while the database writes rows.
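And a hedged sketch of that producer/consumer idea, under the same assumptions (Python 3, Linux-style fork so the child process inherits the open NetCDF file, the made-up generate_rows generator, and the asker's cur/db1 objects): one process parses rows into batches while the parent streams each batch through copy_from on an in-memory buffer.
import io
from multiprocessing import Process, Queue

def producer(queue, batch_size=100000):
    batch = []
    for row in generate_rows(f):             # your NetCDF parsing loop
        batch.append('\t'.join(str(v) for v in row))
        if len(batch) >= batch_size:
            queue.put('\n'.join(batch) + '\n')
            batch = []
    if batch:
        queue.put('\n'.join(batch) + '\n')
    queue.put(None)                           # sentinel: parsing is finished

def consumer(queue, cur):
    while True:
        chunk = queue.get()
        if chunk is None:
            break
        cur.copy_from(io.StringIO(chunk), 'precip',
                      columns=('year', 'month', 'lon', 'lat', 'pre'))

q = Queue(maxsize=4)                          # bounded so the parser can't run far ahead
p = Process(target=producer, args=(q,))
p.start()
consumer(q, cur)
p.join()
db1.commit()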
For some more advice on speeding up bulk loading see How to speed up insertion performance in PostgreSQL - but I can see you're already doing at least some of that right, like creating indexes at the end and batching work into transactions.
I had a similar requirement, and I rewrote the NumPy array into a PostgreSQL binary input file format. The main drawback is that all columns of the target table need to be inserted, which gets tricky if you need to encode your geometry as WKB. However, you can load the netCDF file into a temporary unlogged table and then select that data into another table with the proper geometry type.
Details here: https://stackoverflow.com/a/8150329/327026
Related
I'm looking for an efficient way to import data from a CSV file to a PostgreSQL table using Python in batches, as I have quite large files and the server I'm importing the data to is far away. I need an efficient solution, as everything I tried was either slow or just didn't work. I'm using SQLAlchemy.
I wanted to use raw SQL, but it's hard to parameterize and I would need multiple loops to execute the query for multiple rows.
I was given the task of manipulating & migrating some data from CSV files into a remote Postgres Instance.
I decided to use the Python script below:
import csv
import uuid
import psycopg2
import psycopg2.extras
import time
#Instant Time at the start of the Script
start = time.time()
psycopg2.extras.register_uuid()
#List of CSV Files that I want to manipulate & migrate.
file_list=["Address.csv"]
conn = psycopg2.connect("host=localhost dbname=address user=postgres password=docker")
cur = conn.cursor()
i = 1
for f in file_list:
    f = open(f)
    csv_f = csv.reader(f)
    next(csv_f)
    for row in csv_f:
        # Some simple manipulations on each row
        # Inserting a uuid4 into the first column
        row.pop(0)
        row.insert(0, uuid.uuid4())
        row.pop(10)
        row.insert(10, False)
        row.pop(13)
        # Tracking the number of rows inserted
        print(i)
        i = i + 1
        # INSERT QUERY
        postgres_insert_query = """ INSERT INTO "public"."address"("address_id","address_line_1","locality_area_street","address_name","app_version","channel_type","city","country","created_at","first_name","is_default","landmark","last_name","mobile","pincode","territory","updated_at","user_id") VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
        record_to_insert = row
        cur.execute(postgres_insert_query, record_to_insert)
    f.close()
conn.commit()
conn.close()
conn.close()
print(time.time()-start)
The script worked quite well and promptly when testing it locally. But connecting to a remote Database Server added a lot more latency.
As a workaround, I migrated the manipulated data into my local postgres instance.
I then generated a .sql file of the migrated data & manually imported the .sql file on the remote server.
Alternatively, you can also use Python's multithreading features to launch multiple concurrent connections to the remote server, dedicate an isolated batch of rows to each connection, and flush the data.
This should make your migration considerably faster.
I have personally not tried the multithreading approach, as it wasn't required in my case, but it seems darn efficient.
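For what it's worth, here is a rough, untested sketch of how that could look with psycopg2: the DSN, batch size, and worker count are placeholders, each thread gets its own connection (psycopg2 connections shouldn't be shared between threads), and each row is assumed to already hold values for all 18 columns in table order.
import threading
import psycopg2
import psycopg2.extras

DSN = "host=remote-host dbname=address user=postgres password=docker"  # placeholder

def insert_batch(batch):
    # one connection per thread
    conn = psycopg2.connect(DSN)
    with conn:
        with conn.cursor() as cur:
            psycopg2.extras.execute_values(
                cur, 'INSERT INTO "public"."address" VALUES %s', batch)
    conn.close()

def migrate(rows, batch_size=5000, max_workers=4):
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    # run up to max_workers batches concurrently
    for start in range(0, len(batches), max_workers):
        threads = [threading.Thread(target=insert_batch, args=(b,))
                   for b in batches[start:start + max_workers]]
        for t in threads:
            t.start()
        for t in threads:
            t.join()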
Hope this helped ! :)
Use the copy_from command; it copies all the rows to the table.
path = open('file.csv', 'r')
next(path)  # skip the header row
cur.copy_from(path, 'table_name', sep=',', columns=('id', 'name', 'email'))
As I'm new to sqlite databases, I highly appreciate every useful comment, answer or reference to interesting threads and websites. Here's my situation:
I have a directory with 400 txt files, each ~7 GB in size. The relevant information in these files is written into a SQLite database, resulting in a 17,000,000 x 4 table, which takes approximately 1 day. Later on, the database will be queried only by me to further analyze the data.
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel. For instance, I could run several processes in parallel, each process taking only one of the 400 txt files as input and writing the results to the database. So is it possible to let several processes write to a database in parallel?
EDIT1: Answer w.r.t. W4t3randWinds' comment: It is possible (and faster) to process 1 file per core, write the results into a database and merge all databases after that. However, writing into 1 database using multithreading is not possible.
Furthermore, I was wondering whether it would be more efficient to create several databases instead of one big database? For instance, does it make sense to create a database per txt file, resulting in 400 databases each consisting of a 17,000,000/400 x 4 table?
At last, I'm storing the database as a file on my machine. However, I also read about the possibility to set up a server. So when does it make sense to use a server and more specifically, would it make sense to use a server in my case?
Please see below my code for the creation of the database.
### SET UP
# imports needed by the snippet below
import os
import io
import sqlite3
# set up database
db = sqlite3.connect("mydatabase.db")
cur = db.cursor()
cur.execute("CREATE TABLE t (sentence, ngram, word, probability);")
# set up variable to store db rows
to_db = []
# set input directory (expand ~ so os.listdir works)
indir = os.path.expanduser('~/data/')
### PARSE FILES
# loop through filenames in indir
for filename in os.listdir(indir):
    if filename.endswith(".txt"):
        filename = os.path.join(indir, filename)
        # open txt files in dir
        with io.open(filename, mode='r', encoding='utf-8') as mytxt:
            ### EXTRACT RELEVANT INFORMATION
            # for every line in txt file
            for i, line in enumerate(mytxt):
                # strip linebreak
                line = line.strip()
                # read line where the sentence is stated
                if i == 0 or i % 9 == 0:
                    sentence = line
                    ngram = " ".join(line.split(" ")[:-1])
                    word = line.split(" ")[-1]
                # read line where the result is stated
                if (i-4) == 0 or (i-4) % 9 == 0:
                    result = line.split(r'= ')[1].split(r' [')[0]
                    # make a tuple representing a new row of db
                    db_row = (sentence, ngram, word, result)
                    to_db.append(db_row)
### WRITE TO DATABASE
# add new rows to db (column name matches the CREATE TABLE above)
cur.executemany("INSERT INTO t (sentence, ngram, word, probability) VALUES (?, ?, ?, ?);", to_db)
db.commit()
db.close()
The whole process of creating the database could be significantly accelerated, if it is possible to write to a database in parallel
I am not sure of that. You only do a little processing, so the whole process is likely to be I/O bound. SQLite is a very nice tool, but it only supports a single writer.
Possible improvements:
use x threads to read and process the text files, a single one to write to the database in large chunks, and a queue between them (see the sketch below). As the process is I/O bound, the Python Global Interpreter Lock should not be a problem
use a full featured database like PostgreSQL or MariaDB on a separate machine and multiple processes on the client machine each processing its own set of input files
In either case, I am unsure of the benefit...
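A rough sketch of the first option (several reader threads, one writer thread, a queue of row batches). It reuses the table from the question; parse_file and txt_files are illustrative stand-ins for the question's per-file parsing loop and file list:
import sqlite3
import threading
from queue import Queue

row_queue = Queue(maxsize=100)   # bounded queue of row batches
SENTINEL = None

def reader(filename):
    batch = []
    for db_row in parse_file(filename):   # yields (sentence, ngram, word, result) tuples
        batch.append(db_row)
        if len(batch) >= 10000:
            row_queue.put(batch)
            batch = []
    if batch:
        row_queue.put(batch)

def writer():
    db = sqlite3.connect("mydatabase.db")
    cur = db.cursor()
    while True:
        batch = row_queue.get()
        if batch is SENTINEL:
            break
        cur.executemany(
            "INSERT INTO t (sentence, ngram, word, probability) VALUES (?, ?, ?, ?);",
            batch)
        db.commit()
    db.close()

writer_thread = threading.Thread(target=writer)
writer_thread.start()

reader_threads = [threading.Thread(target=reader, args=(fn,)) for fn in txt_files]
for t in reader_threads:
    t.start()
for t in reader_threads:
    t.join()

row_queue.put(SENTINEL)          # no more batches are coming
writer_thread.join()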
I do daily updates to an SQLite database using Python multithreading. It works beautifully. Two different tables have nearly 20,000,000 records, one with 8 fields, the other with 10. This is on my laptop, which is 4 years old.
If you are having performance issues I recommend looking into how your tables are constructed (a proper primary key and indexes) and your equipment. If you are still using an HDD you will gain amazing performance by upgrading to an SSD.
I have a very big database and I want to send part of that database (1/1000) to someone I am collaborating with to perform test runs. How can I (a) select 1/1000 of the total rows (or something similar) and (b) save the selection as a new .db file?
This is my current code, but I am stuck.
import sqlite3
import json
from pprint import pprint
conn = sqlite3.connect('C:/data/responses.db')
c = conn.cursor()
c.execute("SELECT * FROM responses;")
Create another database with a similar table structure as the original db, sample records from the original database, and insert them into the new database.
import sqlite3
conn = sqlite3.connect("responses.db")
sample_conn = sqlite3.connect("responses_sample.db")
c = conn.cursor()
c_sample = sample_conn.cursor()
rows = c.execute("select no, nm from responses")
sample_rows = [r for i, r in enumerate(rows) if i % 1000 == 0]  # keep every 1,000th row
# create sample table with similar structure
c_sample.execute("create table responses(no int, nm varchar(100))")
for r in sample_rows:
    c_sample.execute("insert into responses (no, nm) values (?, ?)", r)
sample_conn.commit()
sample_conn.close()
conn.close()
The simplest way to do this would be:
Copy the database file in your filesystem just as you would any other file (e.g. Ctrl+C then Ctrl+V in Windows to make responses-partial.db or something).
Then open this new copy in an SQLite editor such as http://sqlitebrowser.org/ and run a delete query to remove however many rows you want to. Then you might want to run Compact Database from the File menu.
Close the SQLite editor and confirm the file size is smaller.
Email the copy.
Unless you need to create a repeatable system I wouldn't bother doing this in Python. But you could perform similar steps in Python (copy the file, open it, run the delete query, etc.) if you need to.
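If it does need to be repeatable, a minimal sketch of those same steps in Python might look like this (it uses the question's responses table and a random 1/1000 sampling DELETE):
import shutil
import sqlite3

# work on a copy so the original database is untouched
shutil.copyfile("responses.db", "responses-partial.db")

conn = sqlite3.connect("responses-partial.db")
conn.execute("""
    DELETE FROM responses
    WHERE rowid NOT IN (SELECT rowid
                        FROM responses
                        ORDER BY random()
                        LIMIT (SELECT count(*)/1000 FROM responses))
""")
conn.commit()
conn.execute("VACUUM")   # shrink the file after deleting the rows
conn.close()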
The easiest way to do this is to
make a copy of the database file;
delete 999/1000th of the data, either by keeping the first few rows:
DELETE FROM responses WHERE SomeID > 1000;
or, if you want really random samples:
DELETE FROM responses
WHERE rowid NOT IN (SELECT rowid
FROM responses
ORDER BY random()
LIMIT (SELECT count(*)/1000 FROM responses));
run VACUUM to reduce the file size.
I have a dataset in a CSV file consisting of 2500 lines. The file is structured in the following (simplified) way:
id_run; run_name; receptor1; receptor2; receptor3_value; [...]; receptor50_value
Each receptor of the file is already in a table and have a unique id.
I need to upload each line to a table with this format:
id_run; id_receptor; receptor_value
1; 1; 2.5
1; 2; 3.2
1; 3; 2.1
[...]
2500; 1; 2.4
2500; 2; 3.0
2500; 3; 1.1
Currently, I'm writing all the data I need to upload to a .txt file and using the COPY command from PostgreSQL to transfer the file to the destination table.
For 2500 runs (so 2500 lines in the CSV file) and 50 receptors, my Python program generates ~110000 records in the text file to be uploaded.
I'm dropping the foreign keys of the destination table and restoring them after the upload.
Using this method, it currently takes ~8 seconds to generate the text file and 1 second to copy the file to the table.
Is there a way, method, library or anything else I could use to accelerate the preparation of the data for the upload so that 90% of the time required isn't for the writing of the text file?
Edit:
Here is my (updated) code. I'm now using bulk writing to the text file. It looks like it's faster (it uploaded 110,000 lines in 3.8 seconds).
# Bulk write to file
lines = []
for line_i, line in enumerate(run_specs):
    # the run_specs variable consists of the attributes defining a run
    # (id_run, run_name, etc.). So basically a line in the CSV file without the
    # receptors data
    sc_uid = get_uid(db, table_name)  # function to get the unique ID of the run
    for rec_i, rec in enumerate(rec_uids):
        # the rec_uids variable is the unique IDs in the database for the
        # receptors in the CSV file
        line_to_write = '%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i])
        lines.append(line_to_write)
# write to file
fn = r"data\tmp_data_bulk.txt"
with open(fn, 'w') as tmp_data:
    tmp_data.writelines(lines)
# get foreign keys of receptor_results
rr_fks = DB.get_fks(conn, 'receptor_results')  # function to get foreign keys
# drop the foreign keys
for key in rr_fks:
    DB.drop_fk(conn, 'receptor_results', key[0])  # function to drop FKs
# upload data with custom function using the COPY SQL command
DB.copy_from(conn, fn, 'receptor_results', ['sc_uid', 'rec_uid', 'value'],\
             " ", False)
# restore foreign keys
for key in rr_fks:
    DB.create_fk(conn, 'receptor_results', key[0], key[1], key[2])
# commit to database
conn.commit()
Edit #2:
Using the cStringIO library, I replaced the creation of a temporary text file with a file-like object, but the speed gain is very, very small.
Code changed:
outf = cStringIO.StringIO()
for rec_i, rec in enumerate(rec_uids):
    outf.write('%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i]))
outf.seek(0)  # rewind so copy_from reads from the start
cur.copy_from(outf, 'receptor_results')
Yes, there is something you can do to speed up writing the data to the file in advance: don't bother!
You already fit the data into memory, so that isn't an issue. So, instead of writing the lines to a list of strings, write them to a slightly different object - a StringIO instance. Then the data can stay in memory and serve as the parameter to psycopg2's copy_from function.
filelike = StringIO.StringIO('\n'.join(['1\tA', '2\tB', '3\tC']))
cursor.copy_from(filelike, 'your-table-name')
Notice that the StringIO must contain the newlines, the field separators and so on - just as the file would have.
I'm writing all the data I need to upload in a .txt file and I'm using the COPY command from postgreSQL to transfer the file to the destination table.
It is a heavy and unnecessary round-trip for all your data. Since you already have it in memory, you should just translate it into a multi-row insert directly:
INSERT INTO table(col1, col2) VALUES (val1, val2), (val3, val4), ...
i.e. concatenate your data into such a query and execute it as is.
In your case you would probably generate and execute 50 such inserts, with 2500 rows in each, according to your requirements.
It will be the best-performing solution ;)
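If you would rather not build that SQL string by hand, psycopg2 (2.7+) ships psycopg2.extras.execute_values, which generates exactly this kind of multi-row VALUES statement from a list of tuples. A sketch using the variable names from the question:
from psycopg2.extras import execute_values

# collect the rows in memory instead of writing them to a text file
rows = []
for line_i, line in enumerate(run_specs):
    sc_uid = get_uid(db, table_name)
    for rec_i, rec in enumerate(rec_uids):
        rows.append((sc_uid, rec, rec_values[line_i][rec_i]))

# one multi-row INSERT per page_size rows
execute_values(cur,
               "INSERT INTO receptor_results (sc_uid, rec_uid, value) VALUES %s",
               rows, page_size=2500)
conn.commit()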
I'm quite new to Python, so any help will be appreciated. I am trying to extract and sort data from 2000 .mdb files using mdbtools on Linux. So far I was able to just take the .mdb file and dump all the tables into .csv. It creates a huge mess, since there are lots of files that need to be processed.
What I need is to extract particular sorted data from a particular table. For example, I need the table called "Voltage". The table consists of numerous cycles, and each cycle has several rows. The cycles usually go in chronological order, but in some cases the time stamps get recorded with a delay, so the rows within a cycle are not necessarily in time order. I need to extract the latest row of a cycle based on time, for the first or last five cycles. For example, in the table below, I will need the second row.
Cycle# Time Data
1 100.59 34
1 101.34 54
1 98.78 45
2
2
2 ...........
Here is the script I use. I am using the command python extract.py table_files.mdb, but I would like the script to just be invoked with ./extract.py. The paths to the files should be in the script itself.
import sys, subprocess, os
DATABASE = sys.argv[1]
subprocess.call(["mdb-schema", DATABASE, "mysql"])
# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", DATABASE],
stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()
print "BEGIN;" # start a transaction, speeds things up when importing
sys.stdout.flush()
# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
    if table != '':
        filename = table.replace(" ","_") + ".csv"
        file = open(filename, 'w')
        print("Dumping " + table)
        contents = subprocess.Popen(["mdb-export", DATABASE, table],
                                    stdout=subprocess.PIPE).communicate()[0]
        file.write(contents)
        file.close()
Personally, I wouldn't spend a whole lot of time fussing around trying to get mdbtools, unixODBC and pyodbc to work together. As Pedro suggested in his comment, if you can get mdb-export to dump the tables to CSV files then you'll probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e., something that will be more robust than using mdbtools on the Linux platform.
A few suggestions:
Given the sheer number of .mdb files (and hence .csv files) involved, you'll probably want to import the CSV data into one big table with an additional column to indicate the source filename. That will be much easier to manage than ~2000 separate tables.
When creating your target table in the new database you'll probably want to use a decimal (as opposed to float) data type for the [Time] column.
At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.
Finally, to select the "last" reading (largest [Time] value) for a given [SourceFile] and [Cycle] you can use a query something like this:
SELECT
    v1.SourceFile,
    v1.Cycle,
    v1.Time,
    v1.Data
FROM
    Voltage v1
    INNER JOIN
    (
        SELECT
            SourceFile,
            Cycle,
            MAX([Time]) AS MaxTime
        FROM Voltage
        GROUP BY SourceFile, Cycle
    ) v2
        ON v1.SourceFile = v2.SourceFile
        AND v1.Cycle = v2.Cycle
        AND v1.Time = v2.MaxTime
To bring it directly into pandas in Python 3, I wrote this little snippet:
import sys, subprocess, os
from io import StringIO
import pandas as pd
VERBOSE = True
def mdb_to_pandas(database_path):
    subprocess.call(["mdb-schema", database_path, "mysql"])
    # Get the list of table names with "mdb-tables"
    table_names = subprocess.Popen(["mdb-tables", "-1", database_path],
                                   stdout=subprocess.PIPE).communicate()[0]
    tables = table_names.splitlines()
    sys.stdout.flush()
    # Dump each table as a StringIO using "mdb-export"
    out_tables = {}
    for rtable in tables:
        table = rtable.decode()
        if VERBOSE: print('running table:', table)
        if table != '':
            if VERBOSE: print("Dumping " + table)
            contents = subprocess.Popen(["mdb-export", database_path, table],
                                        stdout=subprocess.PIPE).communicate()[0]
            temp_io = StringIO(contents.decode())
            print(table, temp_io)
            out_tables[table] = pd.read_csv(temp_io)
    return out_tables
There's an alternative to mdbtools for Python: JayDeBeApi with the UcanAccess driver. It uses a Python -> Java bridge, which slows things down, but I've been using it with considerable success, and it comes with decent error handling.
It takes some practice setting it up, but if you have a lot of databases to wrangle, it's well worth it.
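For reference, a connection sketch, with the caveat that the jar paths and version numbers below are assumptions that depend on the UcanAccess distribution you download; check its lib/ folder for the exact filenames:
import jaydebeapi

# placeholder paths/versions; UcanAccess ships its dependencies in lib/
ucanaccess_jars = [
    "/opt/ucanaccess/ucanaccess-5.0.1.jar",
    "/opt/ucanaccess/lib/jackcess-3.0.1.jar",
    "/opt/ucanaccess/lib/hsqldb-2.5.0.jar",
    "/opt/ucanaccess/lib/commons-lang3-3.8.1.jar",
    "/opt/ucanaccess/lib/commons-logging-1.2.jar",
]

conn = jaydebeapi.connect(
    "net.ucanaccess.jdbc.UcanaccessDriver",
    "jdbc:ucanaccess:///data/table_files.mdb",
    ["", ""],               # user / password, usually empty for .mdb files
    ucanaccess_jars)

cur = conn.cursor()
cur.execute("SELECT * FROM Voltage")
rows = cur.fetchall()
conn.close()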