Migrating large tables using Airflow

Migrating large tables using Airflow - python

I'm new to using Airflow (and newish to Python.)
I need to migrate some very large MySQL tables to s3 files using Airflow. All of the relevant hooks and operators in Airflow seem geared to using Pandas dataframes to load the full SQL output into memory and then transform/export to the desired file format.
This is causing obvious problems for the large tables which cannot fully fit into memory and are failing. I see no way to have Airflow read the query results and save them off to a local file instead of tanking it all up into memory.
I see ways to bulk_dump to output results to a file on the MySQL server using the MySqlHook, but no clear way to transfer that file to s3 (or to Airflow local storage then to s3).
I'm scratching my head a bit because I've worked in Pentaho which would easily handle this problem, but cannot see any apparent solution.
I can try to slice the tables up into small enough chunks that Airflow/Pandas can handle them, but that's a lot of work, a lot of query executions, and there are a lot of tables.
What would be some strategies for moving very large tables from a MySQL server to s3?

You don't have to use Airflow transfer operators if they don't fit to your scale. You can (and probably should) create your very own CustomMySqlToS3Operator with the logic that fits to your process.
Few options:
Don't transfer all the data in one task. slice the data based on dates/number of rows/other. You can use several tasks of CustomMySqlToS3Operator in your workflow. This is not alot of work as you mentioned. This is simply the matter of providing the proper WHERE conditions to the SQL queries that you generate. Depends on the process that you build You can define that every run process the data of a single day thus your WHERE condition is simple date_column between execution_date and next_execution_date (you can read about it in https://stackoverflow.com/a/65123416/14624409 ) . Then use catchup=True to backfill runs.
Use Spark as part of your operator.
As you pointed you can dump the data to local disk and then upload it to S3 using load_file method of S3Hook. This can be done as part of the logic of your CustomMySqlToS3Operator or if you prefer as Python callable from PythonOperator.

Related

Extracting data continuously from RDS MySQL schemas in parallel

I have got a requirement to extract data from Amazon Aurora RDS instance and load it to S3 to make it a data lake for analytics purposes. There are multiple schemas/databases in one instance and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in real-time capturing the DML operations periodically.
There may arise the question of using dedicated services like Data Migration or Copy activity provided by AWS. But I can't use them since the plan is to make the solution cloud platform independent as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I got to know it doesn't support JDBC as a source in Structured streaming. I read about multi-threading and multiprocessing techniques in Python for this but have to assess if they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background and they run continuously in defined cycles, say every 5 minutes). The data synchronization between RDS tables and S3 is also a crucial aspect to consider.
To talk more about the data in the source tables, they have an auto-increment ID field but are not sequential and might be missing a few numbers in between as a result of the removal of those rows due to the inactivity of the corresponding entity, say customers. It is not needed to pull the entire data of a record, only a few are pulled which would be been predefined in the configuration. The solution must be reliable, sustainable, and automatable.
Now, I'm quite confused to decide which approach to use and how to implement the solution once decided. Hence, I seek the help of people who dealt with or know of any solution to this problem statement. I'm happy to provide more info in case it is required to get to the right solution. Any help on this would be greatly appreciated.

in-memory sqlite in production with python

I am creating a python system that needs to handle many files. Each of the file has more than 10 thousand lines of text data.
Because DB (like mysql) can not be used in that environment, when file is uploaded by a user, I think I will save all the data of the uploaded file in in-memory-SQLite so that I can use SQL to fetch specific data from there.
Then, when all operations by program are finished, save the processed data in a file. This is the file users will receive from the system.
But some websites say SQLite shouldn't be used in production. But in my case, I just save them temporarily in memory to use SQL for the data. Is there any problem for using SQLite in production even in this scenario?
Edit:
The data in in-memory-DB doesn't need to be shared between processes. It just creates tables, process data, then discard all data and tables after saving the processed data in file. I just think saving everything in list makes search difficult and slow. So using SQLite is still a problem?

SQLite shouldn't be used in production is not a one-for-all rule, it's more of a rule of thumb. Of course there are appliances where one could think of reasonable use of SQLite even in production environments.
However your case doesn't seem to be one of them. While SQLite supports multi-threaded and multi-process environments, it will lock all tables when it opens a write transaction. You need to ask yourself whether this is a problem for your particular case, but if you're uncertain go for "yes, it's a problem for me".
You'd be probably okay with in-memory structures alone, unless there are some details you haven't uncovered.

I'm not familiar with the specific context of your system, but if what you're looking for is a SQL database that is
light
Access is from a single process and a single thread.
If the system crashes in the middle, you have a good way to recover from it (either backing up the last stable version of the database or just create it from scratch).
If you meet all these criteria, using SQLite is production is fine. OSX, for example, uses sqlite for a few purposes (e.g. ./var/db/auth.db).

Pandas as fast data storage for Flask application

I'm impressed by the speed of running transformations, loading data and ease of use of Pandas and want to leverage all these nice properties (amongst others) to model some large-ish data sets (~100-200k rows, <20 columns). The aim is to work with the data on some computing nodes, but also to provide a view of the data sets in a browser via Flask.
I'm currently using a Postgres database to store the data, but the import (coming from csv files) of the data is slow, tedious and error prone and getting the data out of the database and processing it is not much easier. The data is never going to be changed once imported (no CRUD operations), so I thought it's ideal to store it as several pandas DataFrame (stored in hdf5 format and loaded via pytables).
The question is:
(1) Is this a good idea and what are the things to watch out for? (For instance I don't expect concurrency problems as DataFrames are (should?) be stateless and immutable (taken care of from application-side)). What else needs to be watched out for?
(2) How would I go about caching the data once it's loaded from the hdf5 file into a DataFrame, so it doesn't need to be loaded for every client request (at least the most recent/frequent dataframes). Flask (or werkzeug) has a SimpleCaching class, but, internally, it pickles the data and unpickles the cached data on access. I wonder if this is necessary in my specific case (assuming the cached object is immutable). Also, is such a simple caching method usable when the system gets deployed with Gunicorn (is it possible to have static data (the cache) and can concurrent (different process?) requests access the same cache?).
I realise these are many questions, but before I invest more time and build a proof-of-concept, I thought I get some feedback here. Any thoughts are welcome.

Answers to some aspects of what you're asking for:
It's not quite clear from your description whether you have the tables in your SQL database only, stored as HDF5 files or both. Something to look out for here is that if you use Python 2.x and create the files via pandas' HDFStore class, any strings will be pickled leading to fairly large files. You can also generate pandas DataFrame's directly from SQL queries using read_sql, for example.
If you don't need any relational operations then I would say ditch the postgre server, if it's already set up and you might need that in future keep using the SQL server. The nice thing about the server is that even if you don't expect concurrency issues, it will be handled automatically for you using (Flask-)SQLAlchemy causing you less headache. In general, if you ever expect to add more tables (files), it's less of an issue to have one central database server than maintaining multiple files lying around.
Whichever way you go, Flask-Cache will be your friend, using either a memcached or a redis backend. You can then cache/memoize the function that returns a prepared DataFrame from either SQL or HDF5 file. Importantly, it also let's you cache templates which may play a role in displaying large tables.
You could, of course, also generate a global variable, for example, where you create the Flask app and just import that wherever it's needed. I have not tried this and would thus not recommend it. It might cause all sorts of concurrency issues.

A good blobstore / memcache solution

Setting up a data warehousing mining project on a Linux cloud server. The primary language is Python .
Would like to use this pattern for querying on data and storing data:
SQL Database - SQL database is used to query on data. However, the SQL database stores only fields that need to be searched on, it does NOT store the "blob" of data itself. Instead it stores a key that references that full "blob" of data in the a key-value Blobstore.
Blobstore - A key-value Blobstore is used to store actual "documents" or "blobs" of data.
The issue that we are having is that we would like more frequently accessed blobs of data to be automatically stored in RAM. We were planning to use Redis for this. However, we would like a solution that automatically tries to get the data out of RAM first, if it can't find it there, then it goes to the blobstore.
Is there a good library or ready-made solution for this that we can use without rolling our own? Also, any comments and criticisms about the proposed architecture would also be appreciated.
Thanks so much!

Rather than using Redis or Memcached for caching, plus a "blobstore" package to store things on disk, I would suggest to have a look at Couchbase Server which does exactly what you want (i.e. serving hot blobs from memory, but still storing them to disk).
In the company I work for, we commonly use the pattern you described (i.e. indexing in a relational database, plus blob storage) for our archiving servers (terabytes of data). It works well when the I/O done to write the blobs are kept sequential. The blobs are never rewritten, but simply appended at the end of a file (it is fine for an archiving application).
The same approach has been also used by others. For instance:
Bitcask (used in Riak): http://downloads.basho.com/papers/bitcask-intro.pdf
Eblob (used in Elliptics project): http://doc.ioremap.net/eblob:eblob

Any SQL database will work for the first part. The Blobstore could also be obtained, essentially, "off the shelf" by using cbfs. This is a new project, built on top of couchbase 2.0, but it seems to be in pretty active development.
CouchBase already tries to serve results out of RAM cache before checking disk, and is fully distributed to support large data sets.
CBFS puts a filesystem on top of that, and already there is a FUSE module written for it.
Since fileststems are effectively the lowest-common-denominator, it should be really easy for you to access it from python, and would reduce the amount of custom code you need to write.
Blog post:
http://dustin.github.com/2012/09/27/cbfs.html
Project Repository:
https://github.com/couchbaselabs/cbfs

Django with huge mysql database

What would be the best way to import multi-million record csv files into django.
Currently using python csv module, it takes 2-4 days for it process 1 million record file. It does some checking if the record already exists, and few others.
Can this process be achieved to execute in few hours.
Can memcache be used somehow.
Update: There are django ManyToManyField fields that get processed as well. How will these used with direct load.

I'm not sure about your case, but we had similar scenario with Django where ~30 million records took more than one day to import.
Since our customer was totally unsatisfied (with the danger of losing the project), after several failed optimization attempts with Python, we took a radical strategy change and did the import(only) with Java and JDBC (+ some mysql tuning), and got the import time down to ~45 minutes (with Java it was very easy to optimize because of the very good IDE and profiler support).

I would suggest using the MySQL Python driver directly. Also, you might want to take some multi-threading options into consideration.

Depending upon the data format (you said CSV) and the database, you'll probably be better off loading the data directly into the database (either directly into the Django-managed tables, or into temp tables). As an example, Oracle and SQL Server provide custom tools for loading large amounts of data. In the case of MySQL, there are a lot of tricks that you can do. As an example, you can write a perl/python script to read the CSV file and create a SQL script with insert statements, and then feed the SQL script directly to MySQL.
As others have said, always drop your indexes and triggers before loading large amounts of data, and then add them back afterwards -- rebuilding indexes after every insert is a major processing hit.
If you're using transactions, either turn them off or batch your inserts to keep the transactions from being too large (the definition of too large varies, but if you're doing 1 million rows of data, breaking that into 1 thousand transactions is probably about right).
And most importantly, BACKUP UP YOUR DATABASE FIRST! The only thing worse than having to restore your database from a backup because of an import screwup is not having a current backup to restore from.

As mentioned you want to bypass the ORM and go directly to the database. Depending on what type of database you're using you'll probably find good options for loading the CSV data directly. With Oracle you can use External Tables for very high speed data loading, and for mysql you can use the LOAD command. I'm sure there's something similar for Postgres as well.
Loading several million records shouldn't take anywhere near 2-4 days; I routinely load a database with several million rows into mysql running on a very load end machine in minutes using mysqldump.

Like Craig said, you'd better fill the db directly first.
It implies creating django models that just fits the CSV cells (you can then create better models and scripts to move the data)
Then, db feedding : a tool of choice for doing this is Navicat, you can grab a functional 30 days demo on their site. It allows you to import CSV in MySQL, save the importation profile in XML...
Then I would launch the data control scripts from within Django, and when you're done, migrating your model with South to get what you want or , like I said earlier, create another set of models within your project and use scripts to convert/copy the data.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.