I have a block of data, currently as a list of n-tuples but the format is pretty flexible, that I'd like to append to a Postgres table - in this case, each n-tuple corresponds to a row in the DB.
What I had been doing up to this point is writing these all to a CSV file and then using postgres' COPY to bulk load all of this into the database. This works, but is suboptimal, I'd prefer to be able to do this all directly from python. Is there a method from within python to replicate the COPY type bulk load in Postgres?
If you're using the psycopg2 driver, the cursors provide a copy_to and copy_from function that can read from any file-like object (including a StringIO buffer).
There are examples in the files examples/copy_from.py and examples/copy_to.py that come with the psycopg2 source distribution.
This excerpt is from the copy_from.py example:
conn = psycopg2.connect(DSN)
curs = conn.cursor()
curs.execute("CREATE TABLE test_copy (fld1 text, fld2 text, fld3 int4)")
# anything can be used as a file if it has .read() and .readline() methods
data = StringIO.StringIO()
data.write('\n'.join(['Tom\tJenkins\t37',
'Madonna\t\N\t45',
'Federico\tDi Gregorio\t\N']))
data.seek(0)
curs.copy_from(data, 'test_copy')
Related
I am using Streamlit, where one function is to upload a .csv file. The uploader returns a io.StringIO or io.BytesIO object.
I need to upload this object to my postgres database. There I have a column, holding an array of bytea:
CREATE TABLE files (
id int4 NOT NULL,
docs_array bytea[] NOT NULL
...
)
;
Usually, I would use the SQL query like
UPDATE files SET docs_array = array_append(docs_array, pg_read_binary_file('/Users/XXX//Testdata/test.csv')::bytea) WHERE id = '1';
however, since I have a stringIO object, this does not work.
I have tried
sql = UPDATE files SET docs_array = array_append(docs_array, %s::bytea) WHERE id = '%s';
cursor.execute(sql, (file, testID,) )
and cursor.execute(sql, (psycopg.Binary(file), testID,) )
yet I always get one of the following errors
can't adapt type '_io.BytesIO
can't escape _io.BytesIO to binary
can't adapt type '_io.StringIO
can't escape _io.StringIO to binary
How can I load the object?
UPDATE:
*thanks Mike Organek for the suggestion!
example file.read() looks like b'"Datum/Uhrzeit","Durchschnittsverbrauch Strom (kWh/100km)","Durchschnittsverbrauch Verbrenner (l/100km)","Fahrstrecke (km)","Fahrzeit (h)","Durchschnittsgeschwindigkeit (km/h)"\r\n"2015-11-28T11:44:06Z","7,6","8,5","1.792","14:01","128"\r\n"2015-11-28T12:28:45Z","7,7","8,5","1.473","14:21","103"\r\n"2015-12-24T06:04:43Z","5,5","8,3","4.848","48:01","101"\r\n"2015-12-24T12:15:25Z","27,2","8,0","290","3:20","88"\r\n'
yet if I try to execute
cursor.execute(sql, (file.read(), testID,) )
only "\x" is loaded to the db. The whole data is lost for whatever reason
Screenshot
Yet, if I define file as b'"Datum/Uhrzeit","Durchschnittsverbrauch Strom (kWh/100km)"..."8,5"\r\n' - everything works. So my only guess is the problem lies somewhere with io object and .read()?
Mike Organek pointed out the solution:
file.read() in cursor.execute(sql, (file.read(), testID,)) to get the binary data while not reading the file previously did the trick.
Question 1 of 2
I'm trying to import data from CSV file to Vertica using Python, using Uber's vertica-python package. The problem is that whitespace-only data elements are being loaded into Vertica as NULLs; I want only empty data elements to be loaded in as NULLs, and non-empty whitespace data elements to be loaded in as whitespace instead.
For example, the following two rows of a CSV file are both loaded into the database as ('1','abc',NULL,NULL), whereas I want the second one to be loaded as ('1','abc',' ',NULL).
1,abc,,^M
1,abc, ,^M
Here is the code:
# import vertica-python package by Uber
# source: https://github.com/uber/vertica-python
import vertica_python
# write CSV file
filename = 'temp.csv'
data = <list of lists, e.g. [[1,'abc',None,'def'],[2,'b','c','d']]>
with open(filename, 'w', newline='', encoding='utf-8') as f:
writer = csv.writer(f, escapechar='\\', doublequote=False)
writer.writerows(data)
# define query
q = "copy <table_name> (<column_names>) from stdin "\
"delimiter ',' "\
"enclosed by '\"' "\
"record terminator E'\\r' "
# copy data
conn = vertica_python.connect( host=<host>,
port=<port>,
user=<user>,
password=<password>,
database=<database>,
charset='utf8' )
cur = conn.cursor()
with open(filename, 'rb') as f:
cur.copy(q, f)
conn.close()
Question 2 of 2
Are there any other issues (e.g. character encoding) I have to watch out for using this method of loading data into Vertica? Are there any other mistakes in the code? I'm not 100% convinced it will work on all platforms (currently running on Linux; there may be record terminator issues on other platforms, for example). Any recommendations to make this code more robust would be greatly appreciated.
In addition, are there alternative methods of bulk inserting data into Vertica from Python, such as loading objects directly from Python instead of having to write them to CSV files first, without sacrificing speed? The data volume is large and the insert job as is takes a couple of hours to run.
Thank you in advance for any help you can provide!
The copy statement you have should perform the way you want with regards to the spaces. I tested it using a very similar COPY.
Edit: I missed what you were really asking with the copy, I'll leave this part in because it might still be useful for some people:
To fix the whitespace, you can change your copy statement:
copy <table_name> (FIELD1, FIELD2, MYFIELD3 AS FILLER VARCHAR(50), FIELD4, FIELD3 AS NVL(MYFIELD3,'') ) from stdin
By using filler, it will parse that into something like a variable which you can then assign to your actual table field using AS later in the copy.
As for any gotchas... I do what you have on Solaris often. The only one thing I noticed is you are setting the record terminator, not sure if this is really something you need to do depending on environment or not. I've never had to do it switching between linux, windows and solaris.
Also, one hint, this will return a resultset that will tell you how many rows were loaded. Do a fetchone() and print it out and you'll see it.
The only other thing I can recommend might be to use reject tables in case any rows reject.
You mentioned that it is a large job. You may need to increase your read timeout by adding 'read_timeout': 7200, to your connection or more. I'm not sure if None would disable the read timeout or not.
As for a faster way... if the file is accessible directly on the vertica node itself, you could just reference the file directly in the copy instead of doing a copy from stdin and have the daemon load it directly. It's much faster and has a number of optimizations that you can do. You could then use apportioned load, and if you have multiple files to load you can just reference them all together in a list of files.
It's kind of a long topic, though. If you have any specific questions let me know.
In Python, is there a more or less hacky way to open a compressed SQLite database without having to write a temporary file somewhere?
Something like:
import bz2
import sqlite3
dbfile = bz2.BZ2File("/path/to/file.bz2", "wb")
dbconn = sqlite3.connect(dbfile)
cursor = dbconn.cursor()
...
This of course raises:
ValueError: database parameter must be string or APSW Connection object
The underlying C-library directly uses the filename string. Thus there is no way to transparently work on it from Python.
See the code on Github
Depending on your OS, you might be able to use a RAM-disk to work on the file. If your sqlite-file is bigger than that, it might be time to switch to another DB-system, like Postgres.
I have a script built using procedural programming that uses a sqlite database file. The script processes a CSV file, then uses a standard cursor to pass its particulars to a single SQLite DB.
After this, the script extracts from the DB to produce a number of spreadsheets in Excel via xlwt.
The problem with this is that the script only handles one input file at a time, whereas I will need to be iterating through about 70-90 of these files on any given day.
I've been trying to rewrite the script as object-oriented, but I'm having trouble with sharing the cursor.
The original input file comes in a zip archive that I would extract via Linux or Mac OS X command lines. Previously this was done manually; now I've managed to write classes and loop through totally ad-hoc numbers of multiple input files via the multi version of tkfiledialog.
Furthermore the original input file (ie one of the 70-90) is a text file in csv format with a DRF extension (obv., not really important) that gets picked by a simple tkfiledialog box:
FILENAMER = tkFileDialog.askopenfilename(title="Open file", \
filetypes=[("txt file",".DRF"),("txt file", ".txt"),\
("All files",".*")])
The DRF file itself is issued daily by location and date, ie 'BHP0123.DRF' is for the BHP location, issued for 23 January. To keep everything as straightforward as possible the procedural script further decomposes the DRF to just the BHP0123 part or prefix, then uses it to build a SQLite DB.
FBASENAME = os.path.basename(FILENAMER)
FBROOT = os.path.splitext(FBNAMED)[0]
OUTPUTDATABASE = 'sqlite_' + FBROOT + '.db'
Basically with the program as a procedural script I just had to create one DB, one connection and one cursor, which could be shared by all the functions in the script:
conn = sqlite3.connect(OUTPUTDATABASE) # <-- originally :: Is this the core problem?
curs = conn.cursor()
conn.text_factory = sqlite3.OptimizedUnicode
In the procedural version these variables above are global.
Procedurally I have
1) one function to handle formatting, and
2) another to handle the calculations needed. The DRF is indexed with about 2500 fields per row; I discard the majority and only use about 400-500 of these per row.
The formatting function parses out the CSV via a for-loop (discards junk characters, incomplete data, etc), then passes the formatted data for the calculator to process and chew on. The core problem seems to be that on the one hand I need the DB connection to be constant for each input DRF file, but on the other that connection can only be shared by the formatter and calculator, and 'regenerated' for each DRF.
Crucially, I've tried to rewrite as little of the formatter and calculator as possible.
I've tried to create a separate dbcxn class, then create an instance to share, but I'm confused as to how to handle the output DB situation with the cursor (and pass it intact to both formatter and calculator):
class DBcxn(object):
def __init__(self, OUTPUTDATABASE):
OUTPUTDATABASE = ?????
self.OUTPUTDATABASE = OUTPUTDATABASE
def init_db_cxn(self, OUTPUTDATABASE):
conn = sqlite3.connect(OUTPUTDATABASE) # < ????
self.conn = conn
curs = conn.cursor()
self.curs = curs
conn.text_factory = sqlite3.OptimizedUnicode
dbtest = DBcxn( ???? )
If anyone might suggest a way of untangling this I'd be very grateful. Please let me know if you need more information.
Cheers
Massimo Savino
I have a program in Python composed of 4 source files. One of them is the main file which imports the other 3. As I work with a small Sqlite database, I am creating tables in one of the "secondary" source files, but when I access the database again from the main source file, the tables just populated before are empty.
Can I save the tables' content in a more consistent way? I am quite surprised with what is happening.
So in the main file I typed:
conn = sqlite3.connect("bayes.db")
cur = conn.cursor()
cur.execute("select count(*) from TableA")
print cur.fetchone()
The result is 0 (rows).
Just before, in another source file I do the same thing and get size=8 of TableA.
You must call the commit function in order to save your changes in the database. You can see the full documentation here: http://docs.python.org/2/library/sqlite3.html#sqlite3.Connection.commit