I am using Streamlit, and one of its features lets the user upload a .csv file. The uploader returns an io.StringIO or io.BytesIO object.
I need to upload this object to my Postgres database, where I have a column holding an array of bytea:
CREATE TABLE files (
id int4 NOT NULL,
docs_array bytea[] NOT NULL
...
)
;
Usually, I would use the SQL query like
UPDATE files SET docs_array = array_append(docs_array, pg_read_binary_file('/Users/XXX//Testdata/test.csv')::bytea) WHERE id = '1';
However, since I have a StringIO object rather than a file on the database server, this does not work.
I have tried
sql = "UPDATE files SET docs_array = array_append(docs_array, %s::bytea) WHERE id = %s;"
cursor.execute(sql, (file, testID,) )
and cursor.execute(sql, (psycopg.Binary(file), testID,) )
yet I always get one of the following errors:
can't adapt type '_io.BytesIO'
can't escape _io.BytesIO to binary
can't adapt type '_io.StringIO'
can't escape _io.StringIO to binary
How can I load the object?
UPDATE:
Thanks, Mike Organek, for the suggestion!
An example file.read() looks like
b'"Datum/Uhrzeit","Durchschnittsverbrauch Strom (kWh/100km)","Durchschnittsverbrauch Verbrenner (l/100km)","Fahrstrecke (km)","Fahrzeit (h)","Durchschnittsgeschwindigkeit (km/h)"\r\n"2015-11-28T11:44:06Z","7,6","8,5","1.792","14:01","128"\r\n"2015-11-28T12:28:45Z","7,7","8,5","1.473","14:21","103"\r\n"2015-12-24T06:04:43Z","5,5","8,3","4.848","48:01","101"\r\n"2015-12-24T12:15:25Z","27,2","8,0","290","3:20","88"\r\n'
Yet if I try to execute
cursor.execute(sql, (file.read(), testID,))
only "\x" is loaded into the db; the rest of the data is lost for whatever reason.
Yet if I define file directly as b'"Datum/Uhrzeit","Durchschnittsverbrauch Strom (kWh/100km)"..."8,5"\r\n', everything works. So my only guess is that the problem lies somewhere with the io object and .read()?
Mike Organek pointed out the solution:
Calling file.read() inside cursor.execute(sql, (file.read(), testID,)) did the trick, as long as the file has not been read beforehand (an earlier read leaves the stream at its end, so the next .read() returns nothing).
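For completeness, a minimal sketch of the working flow, assuming psycopg2 (psycopg 3 also provides a Binary wrapper); file and testID are the names from my snippets above:
import psycopg2

sql = "UPDATE files SET docs_array = array_append(docs_array, %s::bytea) WHERE id = %s;"

file.seek(0)                  # rewind in case the uploaded stream was read earlier
payload = file.read()         # bytes from the uploader's BytesIO object
cursor.execute(sql, (psycopg2.Binary(payload), testID))
conn.commit()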
Related
So, I'm trying to save an image into my PostgreSQL table in Python using psycopg2
INSERT Query (Insert.py)
# Folder/Picture is my path; ID is generated; IDnum is an increment
picopen = open("Folder/Picture."+str(ID)+"."+str(IDnum)+".jpg", 'rb').read()
filename = ("Picture."+str(ID)+"."+str(IDnum)+".jpg")
# A sample file name is Picture.1116578795.7.jpg
# upload_name is where I want the file name, upload_content is where I store the image
SQL = "INSERT INTO tbl_upload (upload_name, upload_content) VALUES (%s, %s)"
data = (filename, psycopg2.Binary(picopen))
cur.execute(SQL, data)
conn.commit()
Now, to recover the saved image, I perform this query (recovery.py):
cur.execute("SELECT upload_content, upload_name from tbl_upload")
for row in cur:
    mypic = cur.fetchone()
    open('Folder/'+row[1], 'wb').write(str(mypic[0]))
Now what happens is that when I execute recovery.py, it does generate a ".jpg" file, but I can't view or open it.
If it helps, I'm doing this with Python 2.7 on CentOS 7.
For additional information, this is what the image viewer shows when I open the generated file:
Error interpreting JPEG image file (Not a JPEG file: starts with 0x5c 0x78)
I also tried other formats (.png, .bmp).
I double-checked my database and apparently the upload_content datatype is text; it is supposed to be bytea. I thought I had already set it to bytea when I created the db. Problem solved.
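For reference, once upload_content actually is bytea, a minimal read-back sketch on Python 2.7 with psycopg2 might look like this (folder and table names are the ones from the question):
cur.execute("SELECT upload_content, upload_name FROM tbl_upload")
for upload_content, upload_name in cur.fetchall():
    # psycopg2 returns bytea as a buffer object on Python 2; str() yields its raw bytes
    with open('Folder/' + upload_name, 'wb') as out:
        out.write(str(upload_content))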
Question 1 of 2
I'm trying to import data from CSV file to Vertica using Python, using Uber's vertica-python package. The problem is that whitespace-only data elements are being loaded into Vertica as NULLs; I want only empty data elements to be loaded in as NULLs, and non-empty whitespace data elements to be loaded in as whitespace instead.
For example, the following two rows of a CSV file are both loaded into the database as ('1','abc',NULL,NULL), whereas I want the second one to be loaded as ('1','abc',' ',NULL).
1,abc,,^M
1,abc, ,^M
Here is the code:
# import vertica-python package by Uber
# source: https://github.com/uber/vertica-python
import vertica_python
import csv

# write CSV file
filename = 'temp.csv'
data = <list of lists, e.g. [[1,'abc',None,'def'],[2,'b','c','d']]>
with open(filename, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, escapechar='\\', doublequote=False)
    writer.writerows(data)
# define query
q = "copy <table_name> (<column_names>) from stdin "\
    "delimiter ',' "\
    "enclosed by '\"' "\
    "record terminator E'\\r' "

# copy data
conn = vertica_python.connect(host=<host>,
                              port=<port>,
                              user=<user>,
                              password=<password>,
                              database=<database>,
                              charset='utf8')
cur = conn.cursor()
with open(filename, 'rb') as f:
    cur.copy(q, f)
conn.close()
Question 2 of 2
Are there any other issues (e.g. character encoding) I have to watch out for using this method of loading data into Vertica? Are there any other mistakes in the code? I'm not 100% convinced it will work on all platforms (currently running on Linux; there may be record terminator issues on other platforms, for example). Any recommendations to make this code more robust would be greatly appreciated.
In addition, are there alternative methods of bulk inserting data into Vertica from Python, such as loading objects directly from Python instead of having to write them to CSV files first, without sacrificing speed? The data volume is large and the insert job as is takes a couple of hours to run.
Thank you in advance for any help you can provide!
The copy statement you have should perform the way you want with regard to the spaces. I tested it using a very similar COPY.
Edit: I missed what you were really asking about the copy. I'll leave this part in because it might still be useful for some people:
To fix the whitespace, you can change your copy statement:
copy <table_name> (FIELD1, FIELD2, MYFIELD3 FILLER VARCHAR(50), FIELD4, FIELD3 AS NVL(MYFIELD3,'')) from stdin
By using FILLER, the parsed value goes into something like a variable, which you can then assign to your actual table column with AS later in the COPY.
As for any gotchas... I do what you have on Solaris often. The one thing I noticed is that you are setting the record terminator; I'm not sure this is really something you need to do, depending on the environment. I've never had to do it when switching between Linux, Windows and Solaris.
Also, one hint: this will return a result set that tells you how many rows were loaded. Do a fetchone() and print it out and you'll see it.
The only other thing I can recommend might be to use reject tables in case any rows are rejected.
You mentioned that it is a large job. You may need to increase your read timeout by adding 'read_timeout': 7200 (or more) to your connection options. I'm not sure whether None would disable the read timeout or not.
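For example, a sketch of the connection options with a longer read timeout might look like this (host, port and credentials are placeholders, as in the question):
conn_info = {'host': <host>,
             'port': <port>,
             'user': <user>,
             'password': <password>,
             'database': <database>,
             'charset': 'utf8',
             'read_timeout': 7200}   # seconds; raise further for very long loads
conn = vertica_python.connect(**conn_info)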
As for a faster way... if the file is accessible directly on the Vertica node itself, you could just reference the file directly in the COPY instead of doing a COPY FROM STDIN, and have the daemon load it directly. It's much faster and there are a number of optimizations you can do. You could then use apportioned load, and if you have multiple files to load you can just reference them all together in a list of files.
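As a rough sketch of that direct-load variant (the file paths and the reject-table name here are hypothetical), the COPY can list files on the node and collect rejects, and is then run with a plain execute:
q_direct = ("copy <table_name> (<column_names>) "
            "from '/data/temp1.csv', '/data/temp2.csv' "   # multiple files in one COPY
            "delimiter ',' "
            "enclosed by '\"' "
            "rejected data as table temp_rejects")
cur = conn.cursor()
cur.execute(q_direct)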
It's kind of a long topic, though. If you have any specific questions let me know.
I have a dataset in a CSV file consisting of 2500 lines. The file is structured that (simplified) way:
id_run; run_name; receptor1; receptor2; receptor3_value; [...]; receptor50_value
Each receptor in the file is already in a table and has a unique id.
I need to upload each line to a table with this format:
id_run; id_receptor; receptor_value
1; 1; 2.5
1; 2; 3.2
1; 3; 2.1
[...]
2500; 1; 2.4
2500; 2; 3.0
2500; 3; 1.1
Currently, I'm writing all the data I need to upload to a .txt file and using PostgreSQL's COPY command to transfer the file to the destination table.
For 2500 runs (so 2500 lines in the CSV file) and 50 receptors, my Python program generates ~110000 records in the text file to be uploaded.
I'm dropping the foreign keys of the destination table and restoring them after the upload.
Using this method, it currently takes ~8 seconds to generate the text file and 1 second to copy the file to the table.
Is there a way, method, library or anything else I could use to accelerate the preparation of the data for the upload, so that 90% of the required time isn't spent writing the text file?
Edit:
Here is my (updated) code. I'm now using a bulk write to the text file. It looks like it's faster (110,000 lines uploaded in 3.8 seconds).
# Bulk write to file
lines = []
for line_i, line in enumerate(run_specs):
    # the run_specs variable consists of the attributes defining a run
    # (id_run, run_name, etc.), so basically a line in the CSV file without the
    # receptors data
    sc_uid = get_uid(db, table_name)  # function to get the unique ID of the run
    for rec_i, rec in enumerate(rec_uids):
        # the rec_uids variable holds the unique IDs in the database for the
        # receptors in the CSV file
        line_to_write = '%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i])
        lines.append(line_to_write)

# write to file
fn = r"data\tmp_data_bulk.txt"
with open(fn, 'w') as tmp_data:
    tmp_data.writelines(lines)

# get foreign keys of receptor_results
rr_fks = DB.get_fks(conn, 'receptor_results')  # function to get foreign keys

# drop the foreign keys
for key in rr_fks:
    DB.drop_fk(conn, 'receptor_results', key[0])  # function to drop FKs

# upload data with custom function using the COPY SQL command
DB.copy_from(conn, fn, 'receptor_results', ['sc_uid', 'rec_uid', 'value'],
             " ", False)

# restore foreign keys
for key in rr_fks:
    DB.create_fk(conn, 'receptor_results', key[0], key[1], key[2])

# commit to database
conn.commit()
Edit #2:
Using the cStringIO library, I replaced the creation of a temporary text file with a file-like object, but the speed gain is very, very small.
Code changed:
outf = cStringIO.StringIO()
for rec_i, rec in enumerate(rec_uids):
    outf.write('%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i]))
cur.copy_from(outf, 'receptor_results')
Yes, there is something you can do to speed up writing the data to the file in advance: don't bother!
You already fit the data into memory, so that isn't an issue. So, instead of writing the lines to a list of strings, write them to a slightly different object - a StringIO instance. Then the data can stay in memory and serve as the parameter to psycopg2's copy_from function.
filelike = StringIO.StringIO('\n'.join(['1\tA', '2\tB', '3\tC']))
cursor.copy_from(filelike, 'your-table-name')
Notice that the StringIO must contain the newlines, the field separators and so on - just as the file would have.
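Applied to the loop from the question, a sketch could look like this (the column names and the space separator come from the question's own copy_from call; note the seek(0) before handing the buffer to COPY):
import StringIO  # Python 2; use io.StringIO on Python 3

outf = StringIO.StringIO()
for line_i, line in enumerate(run_specs):
    sc_uid = get_uid(db, table_name)
    for rec_i, rec in enumerate(rec_uids):
        outf.write('%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i]))

outf.seek(0)  # rewind so copy_from starts reading at the beginning
cur.copy_from(outf, 'receptor_results', sep=' ',
              columns=('sc_uid', 'rec_uid', 'value'))
conn.commit()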
I'm writing all the data I need to upload in a .txt file and I'm using the COPY command from postgreSQL to transfer the file to the destination table.
It is a heavy and unnecessary round-trip for all your data. Since you already have it in memory, you should just translate it into a multi-row insert directly:
INSERT INTO table(col1, col2) VALUES (val1, val2), (val3, val4), ...
i.e. concatenate your data into such a query and execute it as is.
In your case you would probably generate and execute 50 such inserts, with 2500 rows in each, according to your requirements.
It will be the best-performing solution ;)
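If the driver is psycopg2, one way to build such a multi-row insert without concatenating SQL by hand is the execute_values helper; here is a sketch using the question's table and column names (everything else is an assumption):
from psycopg2.extras import execute_values

rows = []  # (sc_uid, rec_uid, value) tuples built in memory, as in the question's loop
for line_i, line in enumerate(run_specs):
    sc_uid = get_uid(db, table_name)
    for rec_i, rec in enumerate(rec_uids):
        rows.append((sc_uid, rec, rec_values[line_i][rec_i]))

execute_values(cur,
               "INSERT INTO receptor_results (sc_uid, rec_uid, value) VALUES %s",
               rows, page_size=2500)  # rows per generated INSERT statement
conn.commit()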
Craig Ringer, I can't work with the large object functions.
My database looks like this; this is my table:
-- Table: files
-- DROP TABLE files;
CREATE TABLE files
(
  id serial NOT NULL,
  orig_filename text NOT NULL,
  file_data bytea NOT NULL,
  CONSTRAINT files_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);
ALTER TABLE files
I want to save a .pdf in my database. I saw what you did in the last answer, but using Python 2.7 (read the file and convert it to a buffer object, or use the large object functions).
My code looks like this:
path="D:/me/A/Res.pdf"
listaderuta = path.split("/")
longitud=len(listaderuta)
f = open(path,'rb')
f.read().__str__()
cursor = con.cursor()
cursor.execute("INSERT INTO files(id, orig_filename, file_data) VALUES (DEFAULT,%s,%s) RETURNING id", (listaderuta[longitud-1], f.read()))
but when I'm downloading, i.e. saving:
fula = open("D:/INSTALL/pepe.pdf",'wb')
cursor.execute("SELECT file_data, orig_filename FROM files WHERE id = %s", (int(17),))
(file_data, orig_filename) = cursor.fetchone()
fula.write(file_data)
fula.close()
but when I download the file, it cannot be opened; it is damaged.
I repeat: I cannot work with the large object functions.
I tried this and that is what I got. Can you help?
I updated my previous answer to indicate usage for Python 2.7. The general idea is to read the manual and follow the instructions there.
Here's the relevant part:
In Python 2, you can't just pass a str directly, as psycopg2 will think it's an encoded text string, not raw bytes. You must either use psycopg2.Binary to wrap it, or load the data into a bytearray object.
So either:
filedata = psycopg2.Binary( f.read() )
or
filedata = buffer( f.read() )
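Putting that together for the .pdf case, a minimal sketch might look like this; note the file is read only once (in the question's code the second f.read() returns an empty string because the file pointer is already at the end):
import psycopg2

path = "D:/me/A/Res.pdf"
filename = path.split("/")[-1]

with open(path, 'rb') as f:
    filedata = psycopg2.Binary(f.read())   # read once, wrap the bytes for bytea

cursor = con.cursor()
cursor.execute("INSERT INTO files (id, orig_filename, file_data) "
               "VALUES (DEFAULT, %s, %s) RETURNING id",
               (filename, filedata))
con.commit()

# read it back
cursor.execute("SELECT file_data, orig_filename FROM files WHERE id = %s", (17,))
file_data, orig_filename = cursor.fetchone()
with open("D:/INSTALL/pepe.pdf", 'wb') as out:
    out.write(str(file_data))   # bytea arrives as a buffer on Python 2; str() gives its raw bytes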
I have a block of data, currently as a list of n-tuples but the format is pretty flexible, that I'd like to append to a Postgres table - in this case, each n-tuple corresponds to a row in the DB.
What I had been doing up to this point is writing these all to a CSV file and then using Postgres' COPY to bulk load all of this into the database. This works, but it is suboptimal; I'd prefer to be able to do this all directly from Python. Is there a method from within Python to replicate the COPY-style bulk load in Postgres?
If you're using the psycopg2 driver, the cursors provide a copy_to and copy_from function that can read from any file-like object (including a StringIO buffer).
There are examples in the files examples/copy_from.py and examples/copy_to.py that come with the psycopg2 source distribution.
This excerpt is from the copy_from.py example:
conn = psycopg2.connect(DSN)
curs = conn.cursor()
curs.execute("CREATE TABLE test_copy (fld1 text, fld2 text, fld3 int4)")
# anything can be used as a file if it has .read() and .readline() methods
data = StringIO.StringIO()
data.write('\n'.join(['Tom\tJenkins\t37',
                      'Madonna\t\N\t45',
                      'Federico\tDi Gregorio\t\N']))
data.seek(0)
curs.copy_from(data, 'test_copy')
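For the list-of-n-tuples case from the question, a sketch along the same lines might be (the table name, the sample rows and the assumption that every field can simply be str()-ed are mine):
rows = [(1, 'A', 3.5), (2, 'B', 4.0)]   # stand-in for the existing list of n-tuples

buf = StringIO.StringIO()
buf.write('\n'.join('\t'.join(str(field) for field in row) for row in rows))
buf.seek(0)

curs.copy_from(buf, 'your_table')   # tab is copy_from's default separator
conn.commit()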