Given a Parquet file, how can I create the table associated with it in my Redshift database? Note that the Parquet file is Snappy-compressed.
If you're dealing with multiple files, especially over the long term, then I think the best solution is to upload them to an S3 bucket and run a Glue crawler.
In addition to populating the Glue data catalog, you can also use this information to configure external tables for Redshift Spectrum, and create your on-cluster tables using CREATE TABLE AS SELECT.
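A minimal sketch of the Spectrum route, assuming the Glue crawler has already catalogued the Parquet files into a Glue database (here called my_glue_db) and that an IAM role with Spectrum access exists; all names and connection details below are placeholders:

# Sketch: map the Glue database into Redshift and materialize an on-cluster copy.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dev", user="admin", password="...",
)
conn.autocommit = True
cur = conn.cursor()

# One-time setup: expose the Glue database to Redshift as an external schema.
cur.execute("""
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG DATABASE 'my_glue_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';
""")

# Materialize an on-cluster copy of the crawled parquet table with CTAS.
cur.execute("CREATE TABLE my_table AS SELECT * FROM spectrum.my_parquet_table;")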
If this is just a one-off task, then I've used parquet-tools in the past. The version that I've used is a Java library, but I see that there's also a version on PyPI.
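If you only need the CREATE TABLE statement once, a rough alternative to parquet-tools is to read the schema with pyarrow (which reads Snappy-compressed Parquet transparently) and translate it to Redshift column types. The type mapping below is a simplified assumption and should be adjusted to your data:

# Sketch: derive a Redshift CREATE TABLE statement from a parquet file's schema.
import pyarrow.parquet as pq

# Simplified, illustrative type mapping -- extend for the types in your file.
TYPE_MAP = {
    "int32": "INTEGER",
    "int64": "BIGINT",
    "float": "REAL",
    "double": "DOUBLE PRECISION",
    "string": "VARCHAR(256)",
    "bool": "BOOLEAN",
    "timestamp[ms]": "TIMESTAMP",
}

def create_table_ddl(parquet_path, table_name):
    schema = pq.read_schema(parquet_path)
    columns = ",\n    ".join(
        f"{field.name} {TYPE_MAP.get(str(field.type), 'VARCHAR(256)')}"
        for field in schema
    )
    return f"CREATE TABLE {table_name} (\n    {columns}\n);"

print(create_table_ddl("data.snappy.parquet", "my_table"))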
I am building an Airflow DAG that takes CSV files from GCS and inserts them into a PostgreSQL table in Cloud SQL. I have several options:
Use SQLAlchemy to insert the rows.
Use pandas.
Explore the PostgreSQL Airflow operators (I don't know how to connect them with GCS).
Which is the most Pythonic way to do so?
You should go with COPY.
See https://www.postgresql.org/docs/current/sql-copy.html
COPY moves data between PostgreSQL tables and standard file-system files. COPY TO copies the contents of a table to a file, while COPY FROM copies data from a file to a table (appending the data to whatever is in the table already).
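A minimal sketch of how that could look inside an Airflow task, assuming the Google and Postgres provider packages are installed; the connection ids, bucket, and table names are placeholders:

from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.postgres.hooks.postgres import PostgresHook

def gcs_csv_to_postgres(bucket, object_name, table, **_):
    local_path = "/tmp/data.csv"
    # Pull the CSV from GCS onto the worker's local disk.
    GCSHook().download(bucket_name=bucket, object_name=object_name, filename=local_path)
    # Stream it into Postgres with COPY -- far faster than row-by-row inserts.
    PostgresHook(postgres_conn_id="postgres_default").copy_expert(
        sql=f"COPY {table} FROM STDIN WITH CSV HEADER",
        filename=local_path,
    )

You would call this from a PythonOperator (or a @task-decorated function) and pass the bucket, object, and table through op_kwargs.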
I'm new to using Airflow (and newish to Python).
I need to migrate some very large MySQL tables to S3 files using Airflow. All of the relevant hooks and operators in Airflow seem geared toward using pandas DataFrames to load the full SQL output into memory and then transform/export it to the desired file format.
This is causing obvious problems for the large tables, which cannot fully fit into memory and are failing. I see no way to have Airflow read the query results and save them off to a local file instead of pulling it all into memory.
I see ways to bulk_dump results to a file on the MySQL server using the MySqlHook, but no clear way to transfer that file to S3 (or to Airflow local storage and then to S3).
I'm scratching my head a bit because I've worked in Pentaho, which would easily handle this problem, but I cannot see any apparent solution here.
I can try to slice the tables up into small enough chunks that Airflow/Pandas can handle them, but that's a lot of work, a lot of query executions, and there are a lot of tables.
What would be some strategies for moving very large tables from a MySQL server to S3?
You don't have to use Airflow transfer operators if they don't fit your scale. You can (and probably should) create your very own CustomMySqlToS3Operator with the logic that fits your process.
A few options:
Don't transfer all the data in one task. Slice the data based on dates, number of rows, or some other key. You can use several tasks of CustomMySqlToS3Operator in your workflow. This is not as much work as you fear; it is simply a matter of providing the proper WHERE conditions to the SQL queries that you generate. Depending on the process you build, you can define that every run processes the data of a single day, so your WHERE condition is simply date_column between execution_date and next_execution_date (you can read about these in https://stackoverflow.com/a/65123416/14624409). Then use catchup=True to backfill runs.
Use Spark as part of your operator.
As you pointed out, you can dump the data to local disk and then upload it to S3 using the load_file method of S3Hook. This can be done as part of the logic of your CustomMySqlToS3Operator or, if you prefer, as a Python callable from a PythonOperator (see the sketch below).
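A rough sketch of such a custom operator, streaming the query result to local disk in bounded chunks and then handing the file to S3Hook.load_file; the class name, connection ids, and CSV layout are assumptions to adapt to your process:

import csv

from airflow.models import BaseOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.mysql.hooks.mysql import MySqlHook
# SSCursor assumes the hook's default mysqlclient driver; it streams rows from
# the server instead of buffering the whole result set client-side.
from MySQLdb.cursors import SSCursor


class CustomMySqlToS3Operator(BaseOperator):
    def __init__(self, sql, s3_bucket, s3_key,
                 mysql_conn_id="mysql_default", aws_conn_id="aws_default", **kwargs):
        super().__init__(**kwargs)
        self.sql = sql
        self.s3_bucket = s3_bucket
        self.s3_key = s3_key
        self.mysql_conn_id = mysql_conn_id
        self.aws_conn_id = aws_conn_id

    def execute(self, context):
        local_path = "/tmp/export.csv"
        conn = MySqlHook(mysql_conn_id=self.mysql_conn_id).get_conn()
        cursor = conn.cursor(SSCursor)
        cursor.execute(self.sql)
        with open(local_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(col[0] for col in cursor.description)
            # fetchmany keeps memory usage bounded no matter how big the table is.
            while True:
                rows = cursor.fetchmany(10000)
                if not rows:
                    break
                writer.writerows(rows)
        cursor.close()
        conn.close()
        S3Hook(aws_conn_id=self.aws_conn_id).load_file(
            filename=local_path, key=self.s3_key,
            bucket_name=self.s3_bucket, replace=True,
        )

The sql you pass in can carry the templated WHERE clause from the first option, so the same operator serves both the chunked export and the per-day backfill approach.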
I want to load Bigtable data into BigQuery in a direct way.
Until now I have been loading Bigtable data into a CSV file using Python and then loading the CSV file into BigQuery.
But I don't want to use a CSV file in between Bigtable and BigQuery. Is there any direct way?
To add to Mikhail's recommendation, I'd suggest creating a permanent table in BigQuery using the external table. You'll define the schema for the columns you want and then query the rows you're interested in. Once that data is saved into BigQuery, it won't have any impact on your Bigtable performance. If you want to get the latest data, you can create a new permanent table with the same query.
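A minimal sketch of that approach with the BigQuery Python client, assuming a permanent external table my_dataset.bigtable_external has already been defined over the Bigtable instance; the project, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Write the query result into a native BigQuery table so later reads no longer
# touch Bigtable at all.
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.bigtable_snapshot",
    write_disposition="WRITE_TRUNCATE",  # replace the snapshot on each run
)
query = """
    SELECT *  -- in practice, select the specific column-family fields you defined
    FROM `my-project.my_dataset.bigtable_external`
"""
client.query(query, job_config=job_config).result()  # blocks until the copy is done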
If you're looking to have the data copied over and stored in BigQuery, Querying Cloud Bigtable data using permanent external tables is not what you're looking for. It explicitly mentions that "The data is not stored in the BigQuery table". My understanding is that the permanent table is more for persistent access controls, but still queries Bigtable directly.
This may be overkill, but you could set up an Apache Beam pipeline that runs in Dataflow, with a BigtableIO source and a BigQueryIO sink. You'd have to write a little bit of transformation logic, but overall it should be a pretty simple pipeline. The only catch here is that the BigtableIO connector is only for the Beam Java SDK, so you'd have to write this pipeline in Java.
I am trying to create a database for a bunch of HDF5 files on my computer (more than a few hundred of them). This database will be used by many people, and I then need to create an API for accessing it.
So, any idea how I can create a database that can contain the HDF5 files in a single place?
I need to use Python to create the database.
I have user access to a Hadoop server/cluster containing data that is stored solely in partitioned tables/files in Hive (Avro). I was wondering if I can perform MapReduce using Python mrjob on these tables? So far I have been testing mrjob locally on text files stored on CDH5, and I am impressed by the ease of development.
After some research I discovered there is a library called HCatalog, but as far as I know it's not available for Python (only Java). Unfortunately, I do not have much time to learn Java and I would like to stick to Python.
Do you know any way to run mrjob on Hive-stored data?
If this is impossible, is there a way to stream Python-written MapReduce code to Hive? (I would rather not upload MapReduce Python files to Hive.)
As Alex stated, mrjob currently does not work with Avro-formatted files. However, there is a way to run Python code on Hive tables directly (no mrjob needed, unfortunately with some loss of flexibility). Eventually, I managed to add a Python file as a resource to Hive by executing "ADD FILE mapper.py" and running a SELECT clause with TRANSFORM ... USING ..., storing the results of the mapper in a separate table. Example Hive query:
INSERT OVERWRITE TABLE u_data_new
SELECT
TRANSFORM (userid, movieid, rating, unixtime)
USING 'python weekday_mapper.py'
AS (userid, movieid, rating, weekday)
FROM u_data;
A full example is available here (at the bottom): link
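For completeness, the script named in the USING clause is just a plain program that reads tab-separated rows from stdin and writes transformed rows to stdout. A minimal weekday_mapper.py along the lines of the Hive tutorial might look like this:

import sys
from datetime import datetime

for line in sys.stdin:
    # Hive streams each input row as tab-separated fields on stdin.
    userid, movieid, rating, unixtime = line.strip().split("\t")
    weekday = datetime.fromtimestamp(float(unixtime)).isoweekday()
    # Whatever is printed becomes the columns declared in the AS (...) clause.
    print("\t".join([userid, movieid, rating, str(weekday)]))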