Can a MariaDB be used with Zarr or migrated to Zarr in a lossless fashion, if so please provide some guidance on how this can be achieved?
I have searched the Zarr docs and MariaDB docs and did not find enough information on this topic. I don't want to lose or modify any of the data and I would like to be able to decompress or restore the data to it's original MariaDB state. I receive output in the form of a 4TB MariaDB (10.2) containing multiple tables of various dimensions and multiple variable types. I am using python (+3.6) and would like to take advantage of Zarr for the purpose of being able to perform exploratory data analysis on the data contained across multiple tables in the MariaDB while it is compressed in an effort to save local disk space. The storage of the data and processing of the data is all done locally and there is no plan to utilize cloud services.
I have considered converting the MariaDB to a sqlite database with Python but stopped looking into that route as I understand this could lead to a loss/corruption of data.
Thank you advance,
Brian
Related
I'm new to using Airflow (and newish to Python.)
I need to migrate some very large MySQL tables to s3 files using Airflow. All of the relevant hooks and operators in Airflow seem geared to using Pandas dataframes to load the full SQL output into memory and then transform/export to the desired file format.
This is causing obvious problems for the large tables which cannot fully fit into memory and are failing. I see no way to have Airflow read the query results and save them off to a local file instead of tanking it all up into memory.
I see ways to bulk_dump to output results to a file on the MySQL server using the MySqlHook, but no clear way to transfer that file to s3 (or to Airflow local storage then to s3).
I'm scratching my head a bit because I've worked in Pentaho which would easily handle this problem, but cannot see any apparent solution.
I can try to slice the tables up into small enough chunks that Airflow/Pandas can handle them, but that's a lot of work, a lot of query executions, and there are a lot of tables.
What would be some strategies for moving very large tables from a MySQL server to s3?
You don't have to use Airflow transfer operators if they don't fit to your scale. You can (and probably should) create your very own CustomMySqlToS3Operator with the logic that fits to your process.
Few options:
Don't transfer all the data in one task. slice the data based on dates/number of rows/other. You can use several tasks of CustomMySqlToS3Operator in your workflow. This is not alot of work as you mentioned. This is simply the matter of providing the proper WHERE conditions to the SQL queries that you generate. Depends on the process that you build You can define that every run process the data of a single day thus your WHERE condition is simple date_column between execution_date and next_execution_date (you can read about it in https://stackoverflow.com/a/65123416/14624409 ) . Then use catchup=True to backfill runs.
Use Spark as part of your operator.
As you pointed you can dump the data to local disk and then upload it to S3 using load_file method of S3Hook. This can be done as part of the logic of your CustomMySqlToS3Operator or if you prefer as Python callable from PythonOperator.
I have got a requirement to extract data from Amazon Aurora RDS instance and load it to S3 to make it a data lake for analytics purposes. There are multiple schemas/databases in one instance and each schema has a similar set of tables. I need to pull selective columns from these tables for all schemas in parallel. This should happen in real-time capturing the DML operations periodically.
There may arise the question of using dedicated services like Data Migration or Copy activity provided by AWS. But I can't use them since the plan is to make the solution cloud platform independent as it could be hosted on Azure down the line.
I was thinking Apache Spark could be used for this, but I got to know it doesn't support JDBC as a source in Structured streaming. I read about multi-threading and multiprocessing techniques in Python for this but have to assess if they are suitable (the idea is to run the code as daemon threads, each thread fetching data from the tables of a single schema in the background and they run continuously in defined cycles, say every 5 minutes). The data synchronization between RDS tables and S3 is also a crucial aspect to consider.
To talk more about the data in the source tables, they have an auto-increment ID field but are not sequential and might be missing a few numbers in between as a result of the removal of those rows due to the inactivity of the corresponding entity, say customers. It is not needed to pull the entire data of a record, only a few are pulled which would be been predefined in the configuration. The solution must be reliable, sustainable, and automatable.
Now, I'm quite confused to decide which approach to use and how to implement the solution once decided. Hence, I seek the help of people who dealt with or know of any solution to this problem statement. I'm happy to provide more info in case it is required to get to the right solution. Any help on this would be greatly appreciated.
I have collected a large Twitter dataset (>150GB) that is stored in some text files. Currently I retrieve and manipulate the data using custom Python scripts, but I am wondering whether it would make sense to use a database technology to store and query this dataset, especially given its size. If anybody has experience handling twitter datasets of this size, please share your experiences, especially if you have any suggestions as to what database technology to use and how long the import might take. Thank you
I recommend using a database schema for this, especially considering it's size. (this is without knowing anything about what the dataset holds) That being said, I suggest now or for future questions of this nature using the software suggestions website for this plus adding more about what the dataset would look like.
As for suggesting a certain database in specific, I recommend doing some research about what each do but for something that just holds data with no relations any will do and could show great query improvement vs just txt files as query's can be cached and data is faster to retrieve due to how databases store and lookup files weather it just be hashed values or whatever they use.
Some popular databases:
MYSQL, PostgreSQL - Relational Databases (simple and fast and easy to use/setup but need some knowledge of SQL)
MongoDB - NoSQL Database (also easy to use and setup and no SQL needed, it relies more on dicts to access DB through the API. Also memory mapped so can be faster than Relational but need to have enough RAM for the Indexes.)
ZODB - Full Python NoSQL Database (Kind of like MongoDB but written in Python)
These are very light and brief explanations of each DB, be sure to do your research before using them, they each have their pros and cons. Also, remember this is just a couple of many popular and highly used Databases, there's also TinyDB, SQLite (comes with Python), and PickleDB that are full Python but are generally for small applications.
My experience is mainly with PostgreSQL, TinyDB, and MongoDB, my favorite being MongoDB and PGSQL. For you, I'd look at either of those but don't limit yourself there's a slue of them plus many drivers that help you write easier/less code if that's what you want. Remember google is your friend! And welcome to Stack Overflow!
Edit
If your dataset is and will remain fairly simple but just large and you want to keep with using txt files, consider pandas and maybe a JSON or a csv format and library. It can greatly help and increase efficiency when querying/managing data like this from txt files plus less memory usage as it won't always or ever need the entire dataset in memory.
you can try using any NOSql DB. Mongo DB would be a good place to start
In this illuminating answer a technique is presented to get the data sizes from statically coded datasets that live in a Google Big Query Project.
To further automate this, I am operating in a Python Jupyter notebook environment using the BQ wrapper to make BQ Queries inside the Python. I would like to construct a Big Query that will fetch all the datasets that live in my Project (they have the form of an 8- or 9-digit identifier) as a first step to feed into a Python data structure, then use the referenced answer programmatically to get table sizes for all the identified datasets. Is there a way to do this?
Setting up a data warehousing mining project on a Linux cloud server. The primary language is Python .
Would like to use this pattern for querying on data and storing data:
SQL Database - SQL database is used to query on data. However, the SQL database stores only fields that need to be searched on, it does NOT store the "blob" of data itself. Instead it stores a key that references that full "blob" of data in the a key-value Blobstore.
Blobstore - A key-value Blobstore is used to store actual "documents" or "blobs" of data.
The issue that we are having is that we would like more frequently accessed blobs of data to be automatically stored in RAM. We were planning to use Redis for this. However, we would like a solution that automatically tries to get the data out of RAM first, if it can't find it there, then it goes to the blobstore.
Is there a good library or ready-made solution for this that we can use without rolling our own? Also, any comments and criticisms about the proposed architecture would also be appreciated.
Thanks so much!
Rather than using Redis or Memcached for caching, plus a "blobstore" package to store things on disk, I would suggest to have a look at Couchbase Server which does exactly what you want (i.e. serving hot blobs from memory, but still storing them to disk).
In the company I work for, we commonly use the pattern you described (i.e. indexing in a relational database, plus blob storage) for our archiving servers (terabytes of data). It works well when the I/O done to write the blobs are kept sequential. The blobs are never rewritten, but simply appended at the end of a file (it is fine for an archiving application).
The same approach has been also used by others. For instance:
Bitcask (used in Riak): http://downloads.basho.com/papers/bitcask-intro.pdf
Eblob (used in Elliptics project): http://doc.ioremap.net/eblob:eblob
Any SQL database will work for the first part. The Blobstore could also be obtained, essentially, "off the shelf" by using cbfs. This is a new project, built on top of couchbase 2.0, but it seems to be in pretty active development.
CouchBase already tries to serve results out of RAM cache before checking disk, and is fully distributed to support large data sets.
CBFS puts a filesystem on top of that, and already there is a FUSE module written for it.
Since fileststems are effectively the lowest-common-denominator, it should be really easy for you to access it from python, and would reduce the amount of custom code you need to write.
Blog post:
http://dustin.github.com/2012/09/27/cbfs.html
Project Repository:
https://github.com/couchbaselabs/cbfs