Loading large dataframe to Vertica - python

I have a rather large dataframe (500k+ rows) that I'm trying to load to Vertica. I have the following code working, but it is extremely slow.
#convert df to list format
lists = output_final.values.tolist()
#make insert string
insert_qry = " INSERT INTO SCHEMA.TABLE(DATE,ID, SCORE) VALUES (%s,%s,%s) "
# load into database
for i in range(len(lists)):
    cur.execute(insert_qry, lists[i])
    conn_info.commit()
I have seen a few posts talking about using COPY rather than EXECUTE to do this large of a load, but haven't found a good working example.

After a lot of trial and error... I found that the following worked for me.
# copy statement
copy_str = "COPY SCHEMA.TABLE(DATE, ID, SCORE) FROM STDIN DELIMITER ','"
# turn the df into a csv-like object
stream = io.StringIO()
output_final.to_csv(stream, sep=",", index=False, header=False)
# reset the position of the stream variable
stream.seek(0)
# load the data
with conn_info.cursor() as cursor:
    cursor.copy(copy_str, stream.getvalue())
conn_info.commit()
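As a side note, the vertica_python cursor's copy method also accepts a file-like object as the data argument (per the client's documentation), so a variant like the one below may avoid materializing the whole CSV string in memory. This is only a sketch based on that documented behaviour, not code tested against the table above:

with conn_info.cursor() as cursor:
    stream.seek(0)
    # pass the buffer itself instead of stream.getvalue()
    cursor.copy(copy_str, stream)
conn_info.commit()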

Related

Populating a postgres table from a fixed width file using psycopg2 and pandas [duplicate]

I am loading about 2 - 2.5 million records into a Postgres database every day.
I then read this data with pd.read_sql to turn it into a dataframe and then I do some column manipulation and some minor merging. I am saving this modified data as a separate table for other people to use.
When I do pd.to_sql it takes forever. If I save a csv file and use COPY FROM in Postgres, the whole thing only takes a few minutes but the server is on a separate machine and it is a pain to transfer files there.
Using psycopg2, it looks like I can use copy_expert to benefit from the bulk copying, but still use python. I want to, if possible, avoid writing an actual csv file. Can I do this in memory with a pandas dataframe?
Here is an example of my pandas code. I would like to add the copy_expert or something to make saving this data much faster if possible.
for date in required_date_range:
    df = pd.read_sql(sql=query, con=pg_engine, params={'x' : date})
    ...
    do stuff to the columns
    ...
    df.to_sql('table_name', pg_engine, index=False, if_exists='append', dtype=final_table_dtypes)
Can someone help me with example code? I would prefer to use pandas still and it would be nice to do it in memory. If not, I will just write a csv temporary file and do it that way.
Edit: here is my final code, which works. It only takes a couple of hundred seconds per date (millions of rows) instead of a couple of hours.
to_sql = """COPY %s FROM STDIN WITH CSV HEADER"""

def process_file(conn, table_name, file_object):
    fake_conn = cms_dtypes.pg_engine.raw_connection()
    fake_cur = fake_conn.cursor()
    fake_cur.copy_expert(sql=to_sql % table_name, file=file_object)
    fake_conn.commit()
    fake_cur.close()

#after doing stuff to the dataframe
s_buf = io.StringIO()
df.to_csv(s_buf)
process_file(cms_dtypes.pg_engine, 'fact_cms_employee', s_buf)
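One caveat worth flagging, which comes up in the answers below: if the copy appears to insert nothing, the buffer position probably needs to be rewound before the buffer is handed to copy_expert:

s_buf.seek(0)  # rewind so copy_expert reads from the start of the buffer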
The Python module io (docs) has the necessary tools for file-like objects.
import io
# text buffer
s_buf = io.StringIO()
# saving a data frame to a buffer (same as with a regular file):
df.to_csv(s_buf)
Edit.
(I forgot) In order to read from the buffer afterwards, its position should be set to the beginning:
s_buf.seek(0)
I'm not familiar with psycopg2 but according to docs both copy_expert and copy_from can be used, for example:
cur.copy_from(s_buf, table)
(For Python 2, see StringIO.)
I had problems implementing the solution from ptrj.
I think the issue stems from pandas setting the pos of the buffer to the end.
See as follows:
from StringIO import StringIO
df = pd.DataFrame({"name":['foo','bar'],"id":[1,2]})
s_buf = StringIO()
df.to_csv(s_buf)
s_buf.__dict__
# Output
# {'softspace': 0, 'buflist': ['foo,1\n', 'bar,2\n'], 'pos': 12, 'len': 12, 'closed': False, 'buf': ''}
Notice that pos is at 12. I had to set pos to 0 in order for the subsequent copy_from command to work:
s_buf.pos = 0
cur = conn.cursor()
cur.copy_from(s_buf, tablename, sep=',')
conn.commit()
pandas.DataFrame.to_csv (since v1.0) will return the output as a string if no file object is specified. For example:
df = pd.DataFrame([{'x': 1, 'y': 1}, {'x': 2, 'y': 4}, {'x': 3, 'y': 9}])
# outputs to a string
csv_as_string = df.to_csv(index=False)
print(repr(csv_as_string)) # prints 'x,y\r\n1,1\r\n2,4\r\n3,9\r\n' (on windows)
# outputs to a file
with open('example.csv', 'w', newline='') as f:
    df.to_csv(f, index=False) # writes to file, returns None
From the current (v1.4.3) docs:
path_or_buf : str, path object, file-like object, or None, default None
String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string. If a non-binary file object is passed, it should be opened with newline='', disabling universal newlines. If a binary file object is passed, mode might need to contain a 'b'.

extract table from sqlite3 database

I have some large (~500MB) .db3 files with tables in them that I'd like to extract, perform some checks on and write to a .csv file. The following code shows how I open a .db3 file given its filepath, select data from MY_TABLE and return the output list:
import sqlite3

def extract_db3_table(filepath):
    output = []
    with sqlite3.connect(filepath) as connection:
        cursor = connection.cursor()
        connection.text_factory = bytes
        cursor.execute('''SELECT * FROM MY_TABLE''')
        for row_index, row in enumerate(cursor):
            output.append(row)
    return output
However, the following error happens at the final iteration of the cursor:
File "C:/Users/.../Documents/Python/extract_test.py", line 10, in extract_db3_table
for row_index, row in enumerate(cursor):
DatabaseError: database disk image is malformed
The code runs for every iteration up until the last, when this error is thrown. I have tried different .db3 files but the problem persists. The same error is also thrown if I use the cursor.fetchall() method instead of the cursor for loop in my example above.
Do I have to use a try...except block within the loop somehow and break the loop once this error is thrown, or is there another way to solve this? Any insight is very much appreciated!
The reason I use bytes as the connection.text_factory is that some TEXT values in my table cannot be converted to str, so in my full code I use a try...except block when decoding the TEXT values to UTF-8 and ignore the rows where this does not work.
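For reference, a minimal sketch of the try/except idea mentioned above, which salvages rows one at a time until the error surfaces, could look like this. It works around the symptom rather than repairing the corrupted file:

import sqlite3

def extract_db3_table_partial(filepath):
    # Sketch: keep as many rows as possible before the malformed page is hit.
    output = []
    with sqlite3.connect(filepath) as connection:
        connection.text_factory = bytes
        cursor = connection.cursor()
        cursor.execute('SELECT * FROM MY_TABLE')
        while True:
            try:
                row = cursor.fetchone()
            except sqlite3.DatabaseError:
                # "database disk image is malformed" lands here; keep what we have
                break
            if row is None:
                break
            output.append(row)
    return output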

What's a Fail-safe Way to Update Python CSV

This is my function for building a record of a user's performed actions in a Python CSV file. It gets the username from a global and adds the increment given in the amount parameter to the specific cell of the CSV that matches the user's row and the current date column.
In brief, the function reads the CSV into a list, makes any modifications to the data, and then rewrites the whole list back to the CSV file.
The first item of each row is the username, and the header row holds the dates.
Accs\Dates,12/25/2016,12/26/2016,12/27/2016
user1,217,338,653
user2,261,0,34
user3,0,140,455
However, I'm not sure why the header sometimes gets pushed down to the second row, and why the data gets wiped entirely when it crashes.
Also, I should point out that there may be multiple scripts running this function and writing to the same file; I'm not sure if that is causing the issue.
I'm thinking I could write the stats separately for each user and combine them later, eliminating the possible clash of writes, although it would be great if I could just improve what I have here and keep reading/writing everything in one file.
Any fail-safe way to do what I'm trying to do here?
# Search current user in first rows and updating the count on the column (today's date)
# 'amount' will be added to the respective position
def dailyStats(self, amount, code=None):
    def initStats():
        # prepping table
        with open(self.stats, 'r') as f:
            reader = csv.reader(f)
            for row in reader:
                if row:
                    self.statsTable.append(row)
                    self.statsNames.append(row[0])

    def getIndex(list, match):
        # get the index of the matched date or user
        for i, j in enumerate(list):
            if j == match:
                return i

    self.statsTable = []
    self.statsNames = []
    self.statsDates = None
    initStats()

    today = datetime.datetime.now().strftime('%m/%d/%Y')
    user_index = None
    today_index = None

    # append header if the csv is empty
    if len(self.statsTable) == 0:
        self.statsTable.append([r'Accs\Dates'])
        # rebuild updated table
        initStats()

    # add new user/date if not found in first row/column
    self.statsDates = self.statsTable[0]
    if getIndex(self.statsNames, self.username) is None:
        self.statsTable.append([self.username])
    if getIndex(self.statsDates, today) is None:
        self.statsDates.append(today)

    # rebuild statsNames after table appended
    self.statsNames = []
    for row in self.statsTable:
        self.statsNames.append(row[0])

    # getting the index of user (row) and date (column)
    user_index = getIndex(self.statsNames, self.username)
    today_index = getIndex(self.statsDates, today)

    # the row where user is matched, if there are previous dates than today which
    # has no data, append 0 (e.g. user1,0,0,0,) until the column where today's date is matched
    if len(self.statsTable[user_index]) < today_index + 1:
        for i in range(0, today_index + 1 - len(self.statsTable[user_index])):
            self.statsTable[user_index].append(0)

    # insert pv or tb code if found
    if code is None:
        self.statsTable[user_index][today_index] = amount + int(re.match(r'\b\d+?\b', str(self.statsTable[user_index][today_index])).group(0))
    else:
        self.statsTable[user_index][today_index] = str(re.match(r'\b\d+?\b', str(self.statsTable[user_index][today_index])).group(0)) + ' - ' + code

    # Writing final table
    with open(self.stats, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerows(self.statsTable)

    # return the summation of the user's total count
    total_follow = 0
    for i in range(1, len(self.statsTable[user_index])):
        total_follow += int(re.match(r'\b\d+?\b', str(self.statsTable[user_index][i])).group(0))
    return total_follow
As David Z says, concurrency is most likely the cause of your problem.
I will add that the CSV format is not suitable for database-style storing, indexing, and sorting, because it is plain text and sequential.
You could handle this by using an RDBMS to store and update your data and periodically process your stats; the CSV format then becomes just an import/export format.
Python offers an SQLite binding in its standard library. If you build a connector that imports/updates the CSV content into an SQLite schema and then dumps the results back out as CSV, you can handle concurrency and keep your native format without worrying about installing a database server or new Python packages.
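As a rough illustration of that idea, the standard-library sqlite3 module can hold the per-user, per-date counts and let SQLite handle the locking; the table and column names below are made up for the sketch:

import sqlite3

def add_count(db_path, username, date, amount):
    # Sketch only: the 'stats' table and its columns are hypothetical names.
    with sqlite3.connect(db_path, timeout=30) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS stats "
                     "(user TEXT, date TEXT, count INTEGER, PRIMARY KEY (user, date))")
        conn.execute("INSERT OR IGNORE INTO stats (user, date, count) VALUES (?, ?, 0)",
                     (username, date))
        conn.execute("UPDATE stats SET count = count + ? WHERE user = ? AND date = ?",
                     (amount, username, date))
    # exporting back to the original CSV layout is then just a SELECT plus a pivot step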
Also, I should point out that there may be multiple scripts running this function and writing to the same file; I'm not sure if that is causing the issue.
More likely than not that is exactly your issue. When two things are trying to write to the same file at the same time, the outputs from the two sources can easily get mixed up together, resulting in a file full of gibberish.
An easy way to fix this is just what you mentioned in the question, have each different process (or thread) write to its own file and then have separate code to combine all those files in the end. That's what I would probably do.
If you don't want to do that, what you can do is have different processes/threads send their information to an "aggregator process", which puts everything together and writes it to the file - the key is that only the aggregator ever writes to the file. Of course, doing that requires you to build in some method of interprocess communication (IPC), and that in turn can be tricky, depending on how you do it. Actually, one of the best ways to implement IPC for simple programs is by using temporary files, which is just the same thing as in the previous paragraph.
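A bare-bones version of the "one file per process, combine later" idea might look like the following; the file naming scheme is just an example:

import csv
import glob
import os
from collections import defaultdict

def write_own_file(username, date, amount):
    # each process appends only to its own file, so writers never collide
    with open('stats_%d.csv' % os.getpid(), 'a', newline='') as f:
        csv.writer(f).writerow([username, date, amount])

def combine(pattern='stats_*.csv'):
    totals = defaultdict(int)          # (user, date) -> summed amount
    for path in glob.glob(pattern):
        with open(path, newline='') as f:
            for user, date, amount in csv.reader(f):
                totals[(user, date)] += int(amount)
    return totals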

Optimize Netcdf to Python code

I have a Python-to-SQL script that reads a netCDF file and inserts climatic data into a PostgreSQL table, one row at a time.
This of course takes forever, and I would now like to figure out how to optimize this code.
I have been thinking about making a huge list and then using the copy command. However, I am unsure how one would go about that. Another option might be to write to a csv file and then copy this csv file to the Postgres database using the COPY command in PostgreSQL. I guess that would be quicker than inserting one row at a time.
If you have any suggestions on how this could be optimized, then I would really appreciate it. The netcdf file is available here (need to register though):
http://badc.nerc.ac.uk/browse/badc/cru/data/cru_ts/cru_ts_3.21/data/pre
# NetCDF to PostGreSQL database
# CRU-TS 3.21 precipitation and temperature data. From NetCDF to database table
# Requires Python2.6, Postgresql, Psycopg2, Scipy
# Tested using Vista 64bit.

# Import modules
import psycopg2, time, datetime
from scipy.io import netcdf

# Establish connection
db1 = psycopg2.connect("host=192.168.1.162 dbname=dbname user=username password=password")
cur = db1.cursor()

### Create Table
print str(time.ctime())+ " Creating precip table."
cur.execute("DROP TABLE IF EXISTS precip;")
cur.execute("CREATE TABLE precip (gid serial PRIMARY KEY not null, year int, month int, lon decimal, lat decimal, pre decimal);")

### Read netcdf file
f = netcdf.netcdf_file('/home/username/output/project_v2/inputdata/precipitation/cru_ts3.21.1901.2012.pre.dat.nc', 'r')

### Create lathash
print str(time.ctime())+ " Looping through lat coords."
temp = f.variables['lat'].data.tolist()
lathash = {}
for entry in temp:
    print str(entry)
    lathash[temp.index(entry)] = entry

### Create lonhash
print str(time.ctime())+ " Looping through long coords."
temp = f.variables['lon'].data.tolist()
lonhash = {}
for entry in temp:
    print str(entry)
    lonhash[temp.index(entry)] = entry

### Loop through every observation. Set timedimension and lat and long observations.
for _month in xrange(1344):
    if _month < 528:
        print(str(_month))
        print("Not yet")
    else:
        thisyear = int((_month)/12+1901)
        thismonth = ((_month) % 12)+1
        thisdate = datetime.date(thisyear,thismonth, 1)
        print(str(thisdate))
        _time = int(_month)
        for _lon in xrange(720):
            for _lat in xrange(360):
                data = [int(thisyear), int(thismonth), lonhash[_lon], lathash[_lat], f.variables[('pre')].data[_time, _lat, _lon]]
                cur.execute("INSERT INTO precip (year, month, lon, lat, pre) VALUES "+str(tuple(data))+";")

db1.commit()
cur.execute("CREATE INDEX idx_precip ON precip USING btree(year, month, lon, lat, pre);")
cur.execute("ALTER TABLE precip ADD COLUMN geom geometry;")
cur.execute("UPDATE precip SET geom = ST_SetSRID(ST_Point(lon,lat), 4326);")
cur.execute("CREATE INDEX idx_precip_geom ON precip USING gist(geom);")
db1.commit()

cur.close()
db1.close()

print str(time.ctime())+ " Done!"
Use psycopg2's copy_from.
It expects a file-like object, but that can be your own class that reads and processes the input file and returns it on demand via the read() and readlines() methods.
If you're not confident doing that, you could - as you said - generate a CSV tempfile and then COPY that. For very best performance you'd generate the CSV (Python's csv module is useful) then copy it to the server and use server-side COPY thetable FROM '/local/path/to/file', thus avoiding any network overhead.
Most of the time it's easier to use copy ... from stdin via something like psql's \copy or psycopg2's copy_from, and plenty fast enough. Especially if you couple it with producer/consumer feeding via Python's multiprocessing module (not as complicated as it sounds) so your code to parse the input isn't stuck waiting while the database writes rows.
For some more advice on speeding up bulk loading see How to speed up insertion performance in PostgreSQL - but I can see you're already doing at least some of that right, like creating indexes at the end and batching work into transactions.
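To make the in-memory CSV route concrete, here is a rough sketch using psycopg2's copy_from and reusing the names from the question's script. It is written in Python 3 style (range instead of xrange) and is an outline rather than tested code:

import io

# build one month's worth of rows as CSV text in memory, then copy in one round trip
buf = io.StringIO()
for _lon in range(720):
    for _lat in range(360):
        pre = f.variables['pre'].data[_time, _lat, _lon]
        buf.write('%d,%d,%s,%s,%s\n' % (thisyear, thismonth,
                                        lonhash[_lon], lathash[_lat], pre))
buf.seek(0)
cur.copy_from(buf, 'precip', sep=',',
              columns=('year', 'month', 'lon', 'lat', 'pre'))
db1.commit()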
I had a similar need, and I rewrote the NumPy array into PostgreSQL's binary input file format. The main drawback is that all columns of the target table need to be inserted, which gets tricky if you need to encode your geometry as WKB; however, you can use a temporary unlogged table to load the netCDF file into, then select that data into another table with the proper geometry type.
Details here: https://stackoverflow.com/a/8150329/327026

Extract and sort data from .mdb file using mdbtools in Python

I'm quite new to Python, so any help will be appreciated. I am trying to extract and sort data from 2000 .mdb files using mdbtools on Linux. So far I have only been able to take an .mdb file and dump all of its tables into .csv files, which creates a huge mess since there are lots of files to process.
What I need is to extract particular sorted data from a particular table, for example the table called "Voltage". The table consists of numerous cycles, and each cycle has several rows. The cycles usually go in chronological order, but in some cases the time stamps get recorded with a delay, so the rows within a cycle are not always in time order. I need to extract the latest row of each cycle based on time, for the first or last five cycles. For example, in the table below, I would need the second row.
Cycle#   Time     Data
1        100.59   34
1        101.34   54
1        98.78    45
2
2
2        ...........
Here is the script I use. I currently invoke it with python extract.py table_files.mdb, but I would like the script to be invoked with just ./extract.py, with the paths to the files kept in the script itself.
import sys, subprocess, os

DATABASE = sys.argv[1]
subprocess.call(["mdb-schema", DATABASE, "mysql"])

# Get the list of table names with "mdb-tables"
table_names = subprocess.Popen(["mdb-tables", "-1", DATABASE],
                               stdout=subprocess.PIPE).communicate()[0]
tables = table_names.splitlines()

print "BEGIN;" # start a transaction, speeds things up when importing
sys.stdout.flush()

# Dump each table as a CSV file using "mdb-export",
# converting " " in table names to "_" for the CSV filenames.
for table in tables:
    if table != '':
        filename = table.replace(" ","_") + ".csv"
        file = open(filename, 'w')
        print("Dumping " + table)
        contents = subprocess.Popen(["mdb-export", DATABASE, table],
                                    stdout=subprocess.PIPE).communicate()[0]
        file.write(contents)
        file.close()
Personally, I wouldn't spend a whole lot of time fussing around trying to get mdbtools, unixODBC and pyodbc to work together. As Pedro suggested in his comment, if you can get mdb-export to dump the tables to CSV files then you'll probably save a fair bit of time by just importing those CSV files into SQLite or MySQL, i.e., something that will be more robust than using mdbtools on the Linux platform.
A few suggestions:
Given the sheer number of .mdb files (and hence .csv files) involved, you'll probably want to import the CSV data into one big table with an additional column to indicate the source filename. That will be much easier to manage than ~2000 separate tables.
When creating your target table in the new database you'll probably want to use a decimal (as opposed to float) data type for the [Time] column.
At the same time, rename the [Cycle#] column to just [Cycle]. "Funny characters" in column names can be a real nuisance.
Finally, to select the "last" reading (largest [Time] value) for a given [SourceFile] and [Cycle] you can use a query something like this:
SELECT
    v1.SourceFile,
    v1.Cycle,
    v1.Time,
    v1.Data
FROM
    Voltage v1
    INNER JOIN
    (
        SELECT
            SourceFile,
            Cycle,
            MAX([Time]) AS MaxTime
        FROM Voltage
        GROUP BY SourceFile, Cycle
    ) v2
        ON v1.SourceFile = v2.SourceFile
        AND v1.Cycle = v2.Cycle
        AND v1.Time = v2.MaxTime
To bring it directly into pandas in Python 3, I wrote this little snippet:
import sys, subprocess, os
from io import StringIO
import pandas as pd

VERBOSE = True

def mdb_to_pandas(database_path):
    subprocess.call(["mdb-schema", database_path, "mysql"])
    # Get the list of table names with "mdb-tables"
    table_names = subprocess.Popen(["mdb-tables", "-1", database_path],
                                   stdout=subprocess.PIPE).communicate()[0]
    tables = table_names.splitlines()
    sys.stdout.flush()
    # Dump each table as a StringIO using "mdb-export"
    out_tables = {}
    for rtable in tables:
        table = rtable.decode()
        if VERBOSE: print('running table:', table)
        if table != '':
            if VERBOSE: print("Dumping " + table)
            contents = subprocess.Popen(["mdb-export", database_path, table],
                                        stdout=subprocess.PIPE).communicate()[0]
            temp_io = StringIO(contents.decode())
            print(table, temp_io)
            out_tables[table] = pd.read_csv(temp_io)
    return out_tables
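A possible call site, assuming an Access file named data.mdb in the working directory (both names are placeholders):

out_tables = mdb_to_pandas('data.mdb')
voltage_df = out_tables['Voltage']   # the 'Voltage' table discussed in the question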
There's an alternative to mdbtools for Python: JayDeBeApi with the UcanAccess driver. It uses a Python-to-Java bridge, which slows things down, but I've been using it with considerable success and it comes with decent error handling.
It takes some practice setting it up, but if you have a lot of databases to wrangle, it's well worth it.
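For reference, a connection sketch with JayDeBeApi and UcanAccess might look roughly like the following; every path and version number here is a placeholder, and the exact jar set depends on the UcanAccess distribution you unpack:

import jaydebeapi

# Hypothetical locations: UcanAccess ships its own jar plus bundled dependencies
# (jackcess, hsqldb, commons-lang, commons-logging) that all need to be on the classpath.
ucanaccess_jars = [
    '/opt/ucanaccess/ucanaccess-5.0.1.jar',
    '/opt/ucanaccess/lib/jackcess-3.0.1.jar',
    '/opt/ucanaccess/lib/hsqldb-2.5.0.jar',
    '/opt/ucanaccess/lib/commons-lang3-3.8.1.jar',
    '/opt/ucanaccess/lib/commons-logging-1.2.jar',
]

conn = jaydebeapi.connect(
    'net.ucanaccess.jdbc.UcanaccessDriver',
    'jdbc:ucanaccess:///path/to/database.mdb',   # placeholder path to one of the .mdb files
    ['', ''],                                    # no user/password for a plain .mdb file
    ucanaccess_jars)

cursor = conn.cursor()
cursor.execute('SELECT * FROM Voltage')
rows = cursor.fetchall()
cursor.close()
conn.close()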
