I've just set up a delta-loading data flow between multiple MySQL DBs and a Postgres DB. It's only copying tens of MB every 15 minutes.
Yet, I'd like to set up a process to fully load the data between them in case of emergency.
Python keeps crashing and does not seem to be fast enough when using SQLAlchemy etc.
I've read that the best approach might be to dump everything from MySQL into CSV and then use file_fdw to load the entire tables into Postgres.
Has anyone faced a similar issue? If yes, how did you proceed?
Long story short, ORM overhead is killing your performance.
When you're not manipulating the objects involved, it's better to use SQLAlchemy Core expressions ("SQL Expressions"), which are almost as fast as pure SQL.
Solution:
Of course, I'm presuming your MySQL and Postgres models have been meticulously synchronized (i.e. values from an object in the MySQL model are not a problem for creating an object in the Postgres model and vice versa).
Overview:
get Table objects out of declarative classes
select (SQLAlchemy Expression) from one database
convert rows to dicts
insert into the other database
More or less:
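from sqlalchemy import select, and_

# get tables
m_table = ItemMySQL.__table__
pg_table = ItemPG.__table__
# SQL Expression that gets a range of rows quickly
pg_q = select([pg_table]).where(
    and_(
        pg_table.c.id >= id_start,
        pg_table.c.id <= id_end,
    ))
# get PG DB rows
eng_pg = DBSessionPG.get_bind()
conn_pg = eng_pg.connect()
result = conn_pg.execute(pg_q)
rows_pg = result.fetchall()
# connection to MySQL (DBSessionMySQL is assumed to exist alongside DBSessionPG)
conn_m = DBSessionMySQL.get_bind().connect()
for row_pg in rows_pg:
    # convert PG row object into dict
    value_d = dict(row_pg)
    # insert into MySQL; the statement has to be executed to take effect
    conn_m.execute(m_table.insert().values(**value_d))
# close row proxy object and connections, else suffer leaks
# (swap the roles of the two databases to copy in the other direction)
result.close()
conn_pg.close()
conn_m.close()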
Background on performance, see accepted answer (by SQA principal author himself):
Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?
Since you seem to have Python crashing, perhaps you're using too much memory? Hence I suggest reading and writing rows in batches.
A further improvement could be using .values to insert a number of rows in one call, see here: http://docs.sqlalchemy.org/en/latest/core/tutorial.html#inserts-and-updates
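As a rough sketch of the batching idea above, the loop below copies rows in fixed-size chunks using only the DB-API methods fetchmany and executemany. Two in-memory sqlite3 databases stand in for MySQL and Postgres, and the table name item, its columns, and the helper copy_in_batches are made up for illustration. With real MySQL or Postgres drivers the placeholder style would typically be %s rather than ?.

```python
import sqlite3


def copy_in_batches(src, dst, batch_size=1000):
    """Copy all rows of the hypothetical `item` table from src to dst in chunks."""
    cur = src.execute("SELECT id, name FROM item ORDER BY id")
    copied = 0
    while True:
        rows = cur.fetchmany(batch_size)  # at most batch_size rows held in memory
        if not rows:
            break
        dst.executemany("INSERT INTO item (id, name) VALUES (?, ?)", rows)
        dst.commit()  # commit per batch so a crash loses at most one chunk
        copied += len(rows)
    return copied


# Stand-ins for the real source and target databases.
src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
for db in (src, dst):
    db.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, name TEXT)")
src.executemany(
    "INSERT INTO item VALUES (?, ?)", [(i, f"row {i}") for i in range(2500)]
)
src.commit()

total = copy_in_batches(src, dst, batch_size=1000)
```

The batch size is a trade-off: larger batches mean fewer round trips, smaller ones keep peak memory low.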
I'm migrating data from SQL Server 2017 to Postgres 10.5, i.e., all the tables, stored procedures etc.
I want to compare the data consistency between SQL Server and Postgres databases after the data migration.
All I can think of now is using Python Pandas and loading the tables into data frames from SQL Server and also Postgres and compare the data frames.
But the data is around 6 GB, which takes a long time to load into a data frame, and it is hosted on a server that is not local to where I'm running the Python script. Is there any way to efficiently compare the data consistency across SQL Server and Postgres?
Yes, you can order the data by primary key, and then write the data to a json or xml file.
Then you can run diff over the two files.
You can also run this chunked by primary-key, that way you don't have to work with a huge file.
Log any diff that doesn't show as equal.
If it doesn't matter what the difference is, you could also just run MD5/SHA1 on the two file chunks: if the hashes match, there is no difference; if they don't, there is.
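A minimal sketch of the chunked hash comparison, assuming both sides can be queried by primary-key range. Two in-memory sqlite3 databases stand in for SQL Server and Postgres, and the table t, its columns, and the helpers chunk_hash and make_db are hypothetical. Using repr() as the serialization is a shortcut: across two different database engines you would need to agree on a canonical text form for dates, decimals, and binary values first.

```python
import hashlib
import sqlite3


def chunk_hash(conn, lo, hi):
    """MD5 over all rows of the hypothetical table `t` with lo <= id < hi."""
    digest = hashlib.md5()
    query = "SELECT id, val FROM t WHERE id >= ? AND id < ? ORDER BY id"
    for row in conn.execute(query, (lo, hi)):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def make_db(rows):
    """Build an in-memory stand-in database holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return conn


source = make_db([(i, f"v{i}") for i in range(1, 11)])
target = make_db([(i, "changed" if i == 7 else f"v{i}") for i in range(1, 11)])

chunk = 5
mismatched = [
    (lo, lo + chunk)
    for lo in range(1, 11, chunk)
    if chunk_hash(source, lo, lo + chunk) != chunk_hash(target, lo, lo + chunk)
]
```

Only the id ranges listed in mismatched need a row-by-row diff afterwards.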
Speaking from experience with nhibernate, what you need to watch out for is:
bit fields
text, ntext, varchar(MAX), nvarchar(MAX) fields (they map to varchar with no length, by the way - encoding UTF8)
varbinary, varbinary(MAX), image (bytea[] vs. LOB)
xml
that every primary key's serial (sequence) generator is reset after you have inserted all data in pgsql.
Another thing to watch out for is which time zone CURRENT_TIMESTAMP uses.
Note:
I'd actually run System.Data.DataRowComparer directly, without writing data to a file:
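// requires: using System.Collections.Generic; using System.Data; using System.Linq;
static void Main(string[] args)
{
    // LoadTable1/LoadTable2 are placeholders for however you fill the two tables
    DataTable dt1 = LoadTable1();
    DataTable dt2 = LoadTable2();
    IEnumerable<DataRow> idr1 = dt1.Select();
    IEnumerable<DataRow> idr2 = dt2.Select();
    // MyDataRowComparer MyComparer = new MyDataRowComparer();
    // IEnumerable<DataRow> Results = idr1.Except(idr2, MyComparer);
    // Compare rows by value; without a comparer, Except() compares references
    IEnumerable<DataRow> results = idr1.Except(idr2, DataRowComparer.Default);
}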
Then you write all non-matching DataRows into a logfile, for each table one directory (if there are differences).
Don't know what Python uses in place of System.Data.DataRowComparer, though.
Since this would be a one-time task, you could also opt to not do it in Python, and use C# instead (see above code sample).
Also, if you had large tables, you could use DataReader with sequential access to do the comparison. But if the other way cuts it, it reduces the required work considerably.
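On the Python side, a plain set difference over row tuples behaves much like Except with a value-based comparer. The sketch below is only an illustration under assumptions: sqlite3 in-memory databases stand in for the two servers, and the table t and the helpers make_db and fetch_rows are invented names. It loads both tables fully into memory, so it only suits tables that fit there.

```python
import sqlite3


def make_db(rows):
    """Build an in-memory stand-in database holding the given rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, val TEXT)")
    conn.executemany("INSERT INTO t VALUES (?, ?)", rows)
    return conn


def fetch_rows(conn):
    """Return all rows of the hypothetical table `t` as a set of tuples."""
    return set(conn.execute("SELECT id, val FROM t"))


source = make_db([(1, "a"), (2, "b"), (3, "c")])
target = make_db([(1, "a"), (2, "B"), (4, "d")])

# Rows present on one side but not (identically) on the other.
only_in_source = sorted(fetch_rows(source) - fetch_rows(target))
only_in_target = sorted(fetch_rows(target) - fetch_rows(source))
```

A changed row shows up in both lists, once with each side's values, which makes it easy to log per table.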
Have you considered making your SQL Server data visible within your Postgres with a Foreign Data Wrapper (FDW)?
https://github.com/tds-fdw/tds_fdw
I haven't used this FDW tool but, overall, the basic FDW setup process is simple. An FDW acts like a proxy/alias, allowing you to access remote data as though it were housed in Postgres. The tool linked above doesn't support joins, so you would have to perform your comparisons iteratively, etc. Depending on your setup, you would have to check if performance is adequate.
Please report back!
Situation: Need to deal with large amounts of data (a ~260 MB CSV data file with about 50 bytes of data per line)
Problem: If I just read from the file every time I need to deal with it, it will take a long time. So I decided to push everything into a database. I need a fast database infrastructure to handle the data as I need to do a lot of reading and writing.
Question: What are the faster choices for database in Python?
Additional information 1: The data consists of 3 columns and I do not see myself needing any more than that. Would this mean that a NoSQL database is preferred?
Additional information 2: However, if in the future I do need more than one database working together, would it be better to go for a SQL database?
Additional information 3: I think it would help to mention that I am looking at a few different DBs (MongoDB, SQLite, tinydb), but do suggest other DBs that you know are faster.
I've experienced this same situation many times and I wanted something faster than a typical relational database. Redis is a very fast and scalable key/value database. You can get started quickly using the Popoto ORM.
Here is an example:
import popoto

class City(popoto.Model):
    id = popoto.UniqueKeyField()
    name = popoto.KeyField()
    description = popoto.Field()

for line in open("cities.csv"):
    csv_row = line.split('\t')
    City.create(
        id=csv_row[0],
        name=csv_row[1],
        description=csv_row[2]
    )

new_york = City.query.get(name="New York")
This is the absolute fastest way to store and retrieve data without having to learn the nuances of a new database system.
Keep in mind that if your database grows beyond 5 GB, an in-memory database like Redis can start to become expensive compared to slower disk-based databases like Postgres or MySQL.
Full disclosure: I help maintain the open source Popoto project.
I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table by chunks (typically 10'000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to a Cursor or DictCursor at the Python level. This ensures that the database is free again as soon as possible and the client can do whatever it wants with the data.
But it is not suitable for large amounts of data, as the whole result set is cached in the client process. This can consume a lot of memory.
In this case, it may be better to use mysql_use_result() (C) or SSCursor / SSDictCursor (Python). This requires you to read the whole result set and do nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
I don't know exactly what query you used because you have not given it here, but I suppose you're specifying a LIMIT and an OFFSET. Such queries are quick at the beginning of the data but become very slow as the offset grows.
If you have a unique column such as ID, you can fetch only the first N rows each time, but modify the WHERE clause:
WHERE ID > (last_id)
This would use the index and would be acceptably fast.
However, it should generally be faster to simply run
SELECT * FROM table
and open a cursor for that query with a reasonably large fetch size.
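The WHERE ID > (last_id) approach can be sketched as follows. This is an illustration under assumptions, not the asker's actual query: sqlite3 stands in for MySQL, the table big, its columns, and the helper iter_keyset are hypothetical, and ids are assumed to be positive integers.

```python
import sqlite3


def iter_keyset(conn, page_size):
    """Yield rows of the hypothetical table `big`, seeking past the last id seen."""
    last_id = 0  # assumes ids are positive integers
    while True:
        page = conn.execute(
            "SELECT id, payload FROM big WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, page_size),
        ).fetchall()
        if not page:
            return
        yield from page
        last_id = page[-1][0]  # the next page starts right after this id


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE big (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO big VALUES (?, ?)", [(i, f"p{i}") for i in range(1, 1001)]
)

exported = [row_id for row_id, _ in iter_keyset(conn, page_size=100)]
```

Unlike LIMIT with a growing OFFSET, each page here starts with an index seek, so the cost per page stays roughly constant as the export progresses.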
I am using SQLAlchemy. I want to efficiently delete all the records present in the database, but I don't want to drop the tables or the database.
I tried with the following code:
con = engine.connect()
trans = con.begin()
con.execute(table.delete())
trans.commit()
This does not seem very efficient, since I am iterating over all the tables present in the database.
Can someone suggest a better and more efficient way of doing this?
If your models rely on the existing DB schema (usually via autoload=True), you cannot avoid deleting data in each table. MetaData.sorted_tables comes in handy:
for tbl in reversed(meta.sorted_tables):
    engine.execute(tbl.delete())
If your models do define the complete schema, there is nothing simpler than drop_all/create_all (as already pointed out by #jadkik94).
Further, TRUNCATE would not work anyway on tables that are referenced by foreign keys, which limits its usefulness significantly.
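For newer SQLAlchemy versions, where Engine.execute() is no longer available (it was removed in 2.0), the same reverse-order delete can be run on a connection inside a single transaction. The sketch below assumes SQLAlchemy is installed and uses an in-memory SQLite engine with two invented tables, parent and child, purely as an example schema.

```python
from sqlalchemy import (
    Column, ForeignKey, Integer, MetaData, Table, create_engine, text,
)

meta = MetaData()
parent = Table("parent", meta, Column("id", Integer, primary_key=True))
child = Table(
    "child", meta,
    Column("id", Integer, primary_key=True),
    Column("parent_id", Integer, ForeignKey("parent.id")),
)

engine = create_engine("sqlite://")  # in-memory stand-in for the real database
meta.create_all(engine)

with engine.begin() as conn:
    conn.execute(parent.insert(), [{"id": 1}, {"id": 2}])
    conn.execute(child.insert(), [{"id": 1, "parent_id": 1}])

# One transaction; children are emptied before the parents they reference.
with engine.begin() as conn:
    for tbl in reversed(meta.sorted_tables):
        conn.execute(tbl.delete())

# Count what is left in each table.
with engine.connect() as conn:
    remaining = {
        tbl.name: conn.execute(text(f"SELECT COUNT(*) FROM {tbl.name}")).scalar()
        for tbl in meta.sorted_tables
    }
```

Running all deletes in one transaction also means that a failure halfway through leaves the data untouched.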
For me putting tbl.drop(engine) worked, but not engine.execute(tbl.delete())
SQLAlchemy 0.8.0b2 and
Python 2.7.3
I need to iterate over a large collection (3 * 10^6 elements) in Django to do some kind of analysis that can't be done using a single SQL statement.
Is it possible to turn off collection caching in Django? (Caching all the data is not acceptable; the data is around 0.5 GB.)
Is it possible to make Django fetch the collection in chunks? It seems that it tries to prefetch the whole collection into memory and then iterate over it. I conclude this from observing the speed of execution:
iter(Coll.objects.all()).next() - this takes forever
iter(Coll.objects.all()[:10000]).next() - this takes less than a second
Use QuerySet.iterator() to walk over the results instead of loading them all first.
It seems that the problem was caused by the database backend (sqlite), which doesn't support reading in chunks.
I've used sqlite as the database will be trashed after I do all the computations but it seems that sqlite isn't good even for that.
Here is what I've found in django source code of sqlite backend:
class DatabaseFeatures(BaseDatabaseFeatures):
    # SQLite cannot handle us only partially reading from a cursor's result set
    # and then writing the same rows to the database in another cursor. This
    # setting ensures we always read result sets fully into memory all in one
    # go.
    can_use_chunked_reads = False