I have an application which uses Cassandra as its database. I need to create some reports from the Cassandra data, but the data is not modelled to match the report queries, so one report may need data scattered across multiple tables. As Cassandra doesn't allow joins like an RDBMS, this is not simple to do. So I am thinking of a solution that copies the required tables' data into some other DB (RDBMS or Mongo) in real time and then generates the report from there. Is there any standard way to get data from Cassandra into another DB (Mongo or RDBMS) in real time, i.e. whenever an insert/update/delete happens in Cassandra the same change is applied in the destination DB? Any example program or code would be very helpful.
You would be better off using Spark together with the Spark Cassandra Connector for this task. With Spark you can do the joins in memory and write the result back to Cassandra or to plain files.
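A minimal sketch of that approach in PySpark, assuming the Spark Cassandra Connector package is available; the connector version, host and keyspace/table names below are illustrative placeholders:

from pyspark.sql import SparkSession

# The connector coordinates must match your Spark/Scala version.
spark = (SparkSession.builder
         .appName('cassandra-report')
         .config('spark.jars.packages',
                 'com.datastax.spark:spark-cassandra-connector_2.12:3.4.1')
         .config('spark.cassandra.connection.host', '127.0.0.1')
         .getOrCreate())

def read_table(keyspace, table):
    return (spark.read
            .format('org.apache.spark.sql.cassandra')
            .options(keyspace=keyspace, table=table)
            .load())

# Illustrative keyspace/table names.
orders = read_table('shop', 'orders')
customers = read_table('shop', 'customers')

# The join Cassandra itself cannot do happens in Spark memory.
report = orders.join(customers, 'customer_id')

# Write the report out as files, or back to a pre-created Cassandra table.
report.write.csv('/tmp/report', header=True, mode='overwrite')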
I'm working on an ETL with Apache Beam and Dataflow using Python, and I'm using BigQuery as the database/data warehouse.
The ETL basically performs some processing, then updates data that is already in BigQuery. Since there is no update transform in Apache Beam, I had to use the BigQuery SDK to write my own UPDATE query and map it to each row.
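For context, the per-row UPDATE pattern described above looks roughly like this (a minimal sketch, assuming the google-cloud-bigquery client; the table, column and parameter names are placeholders, not the real pipeline):

import apache_beam as beam
from google.cloud import bigquery

class UpdateRow(beam.DoFn):
    # Runs one UPDATE statement per element; with several workers this issues
    # concurrent DML against the same table, which can hit the error below.
    def setup(self):
        self.client = bigquery.Client()

    def process(self, row):
        # Table and column names are hypothetical placeholders.
        query = '''
            UPDATE `my_project.my_dataset.my_table`
            SET value = @value
            WHERE id = @id
        '''
        job_config = bigquery.QueryJobConfig(query_parameters=[
            bigquery.ScalarQueryParameter('id', 'STRING', row['id']),
            bigquery.ScalarQueryParameter('value', 'STRING', row['value']),
        ])
        self.client.query(query, job_config=job_config).result()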
The queries work fine when done sequentially, but when I use multiple workers, I get the following error:
{'reason': 'invalidQuery', 'message': 'Could not serialize access to table my_table due to concurrent update'}
I made sure that the same row is never accessed/updated concurrently (each row is keyed by a unique id). I've also tried running the same code in a simple Python script without Beam/Dataflow, and I still got the same error once I started using multiple threads instead of one.
Has anyone run into the same problem using the BigQuery SDK? Do you have any suggestions for avoiding it?
I think it's better for your Beam/Dataflow job to append the data.
BigQuery is append oriented, and BigQueryIO in Beam is designed for append operations.
If you have an orchestrator like Cloud Composer/Airflow or Cloud Workflows, you can deduplicate the data in batch mode with the following steps:
Create a staging table and a final table
Your orchestrator truncates the staging table
Your orchestrator runs your Dataflow job
The Dataflow job reads your data
The Dataflow job writes the result in append mode to the staging table in BigQuery (see the sketch after the merge example below)
Your orchestrator runs a task with a merge query in BigQuery between the staging and final tables. The merge query inserts each row into the final table, or updates it if the element already exists.
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax?hl=en#merge_statement
Example of a merge query:
MERGE dataset.Inventory T
USING dataset.NewArrivals S
ON T.product = S.product
WHEN MATCHED THEN
UPDATE SET quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN
INSERT (product, quantity) VALUES(product, quantity)
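For the Dataflow append step above, a minimal sketch with Beam's WriteToBigQuery could look like this; the bucket, table and schema names are illustrative assumptions:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input.csv', skip_header_lines=1)
     | 'Parse' >> beam.Map(lambda line: dict(zip(['id', 'value'], line.split(','))))
     # Append-only write into the staging table; the MERGE above deduplicates later.
     | 'WriteStaging' >> beam.io.WriteToBigQuery(
         'my-project:my_dataset.staging_table',
         schema='id:STRING,value:STRING',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))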
I had a use case where I had a BQ table containing around 150K records, and I needed to update its content monthly (which meant around 100K UPDATEs and a couple of thousand APPENDs).
When I designed my Beam/Dataflow job to update the records with the BQ Python API library, I ran into quota issues (there is a limit on the number of updates) as well as the concurrency issue.
I had to change my pipeline's approach: instead of reading the BQ table and updating records in place, it now processes the BQ table, updates what needs to be updated, appends what's new, and saves everything to a new BQ table.
Once the job finishes successfully with no errors, you can replace the old table with the newly created one.
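One way to do that final swap is a BigQuery copy job that overwrites the old table; a minimal sketch with the Python client, where the table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

# Overwrite the old table with the freshly built one.
job_config = bigquery.CopyJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

client.copy_table(
    'my_project.my_dataset.my_table_new',  # source: the table the job just wrote
    'my_project.my_dataset.my_table',      # destination: the table readers query
    job_config=job_config,
).result()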
GCP mentions:
Running two mutating DML statements concurrently against a table will succeed as long as the two statements don’t modify data in the same partition. Two jobs that try to mutate the same partition may sometimes experience concurrent update failures.
And then:
BigQuery now handles such failures automatically. To do this, BigQuery will restart the job.
Can this retry mechanism be a solution at all? Could anyone elaborate on this?
Source: https://cloud.google.com/blog/products/data-analytics/dml-without-limits-now-in-bigquery
I want to load Bigtable data into BigQuery directly.
So far I have been exporting Bigtable data to a CSV file using Python and then loading the CSV file into BigQuery.
But I don't want a CSV file sitting between Bigtable and BigQuery. Is there a direct way?
To add to Mikhail's recommendation, I'd suggest creating a permanent table in BigQuery using the external table. You'll define the schema for the columns you want and then query the rows you're interested in. Once that data is saved into BigQuery, it won't have any impact on your Bigtable performance. If you want to get the latest data, you can create a new permanent table with the same query.
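A minimal sketch of that materialization step with the Python client, assuming a permanent external table named my_dataset.bigtable_external has already been defined over the Bigtable instance (all names here are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Snapshot the external (Bigtable-backed) table into a native BigQuery table.
# Re-running this recreates the snapshot with the latest Bigtable data.
client.query('''
    CREATE OR REPLACE TABLE `my_project.my_dataset.bigtable_snapshot` AS
    SELECT *
    FROM `my_project.my_dataset.bigtable_external`
''').result()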
If you're looking to have the data copied over and stored in BigQuery, Querying Cloud Bigtable data using permanent external tables is not what you're looking for. It explicitly mentions that "The data is not stored in the BigQuery table". My understanding is that the permanent table is more for persistent access controls, but still queries Bigtable directly.
This may be overkill, but you could set up an Apache Beam pipeline that runs on Dataflow, with a BigtableIO source and a BigQueryIO sink. You'd have to write a little bit of transformation logic, but overall it should be a pretty simple pipeline. The only catch is that the BigtableIO connector is only available for the Beam Java SDK, so you'd have to write this pipeline in Java.
I am currently working on a single-page web app that allows users to upload large CSV files (currently testing a ~7GB file) to a Flask server and then stream that dataset to a database. The upload takes about a minute and the file gets fully saved to a temporary file on the Flask server. Now I need to be able to stream this file and store it into a database.
I did some research and found that PySpark is great for streaming data, and I am choosing MySQL as the database to stream the CSV data into (but I am open to other DBs and streaming methods). I am a junior dev and new to PySpark, so I'm not sure how to go about this. The Spark Streaming guide says that data must be ingested through a source like Kafka, Flume, TCP sockets, etc., so I am wondering if I have to use one of those methods to get my CSV file into Spark. However, I came across this great example where they stream CSV data into an Azure SQL database, and it looks like they just read the file directly with Spark, without needing to ingest it through a streaming source like Kafka. The only thing that confuses me about that example is that they use an HDInsight Spark cluster to stream data into the DB, and I'm not sure how to tie that in with a Flask server.
I apologize for the lack of code, but currently I just have a Flask server file with one route doing the file upload. Any examples, tutorials, or advice would be greatly appreciated.
I am not sure about the streaming part, but Spark can handle large files efficiently, and writing to a DB table is done in parallel, so without knowing much more about your setup, and provided that the uploaded file is on your server:
If I wanted to save a big structured file like a CSV into a table, I would start like this:
# start with some basic spark configuration, e.g. we want the timezone to be UTC
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

conf = SparkConf()
conf.set('spark.sql.session.timeZone', 'UTC')
# this is important: you need the MySQL Connector/J jar matching your MySQL version,
# which you can download from https://dev.mysql.com/downloads/connector/j/
conf.set('spark.jars', '/path/to/mysql-connector-j.jar')

# instantiate a spark session: the first time it will take a few seconds
spark = SparkSession.builder \
    .config(conf=conf) \
    .appName('Huge File uploader') \
    .getOrCreate()

# read the file first as a dataframe; header=True keeps the CSV column names
df = spark.read.csv('path to the 7GB / huge csv file', header=True)

# optionally, add a filename column
df = df.withColumn('filename', F.lit('thecurrentfilename'))

# write it to the table
df.write.format('jdbc').options(
    url='jdbc:mysql://localhost:3306/thedatabase',
    driver='com.mysql.cj.jdbc.Driver',  # the JDBC driver for MySQL
    dbtable='the table name to save to',
    user='user',
    password='secret',
).mode('append').save()
Note the mode 'append' here: the catch is that Spark cannot perform updates on a table; it can either append new rows or replace what is already in the table.
So, if your csv is like this:
id, name, address....
You will end up with a table with the same fields.
This is the most basic example I could think of to get you started with Spark, without any considerations about a Spark cluster or anything else related. I would suggest you take this for a spin and figure out whether it suits your needs :)
Also, keep in mind that this might take a few seconds or more depending on your data, where the database is located, your machine and your database load, so it might be a good idea to keep things asynchronous with your API. Again, I don't know any of your other details.
Hope this helps. Good luck!
I have a table in Google BigQuery (GBQ) with almost 3 million records (rows) so far, built from data coming from a MySQL DB every day. The data is inserted into the GBQ table using a Python pandas DataFrame (.to_gbq()).
What is the optimal way to sync changes from MySQL to GBQ, in this direction, with Python?
Several different ways to import data from MySQL to BigQuery that might suit your needs are described in this article. For example Binlog replication:
This approach (sometimes referred to as change data capture - CDC) utilizes MySQL’s binlog. MySQL’s binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation, as well as Data Definition Language (DDL) data that was performed by the database. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.
Seems to be exactly what you are searching for.
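If you want to experiment with the binlog approach in plain Python rather than a managed CDC tool, a rough sketch could use the third-party mysql-replication package together with the BigQuery client. The package choice, connection settings and table names below are my own assumptions, not from the article:

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent)
from google.cloud import bigquery

bq = bigquery.Client()

# Requires binlog_format=ROW on the MySQL server and a replication-capable user.
stream = BinLogStreamReader(
    connection_settings={'host': 'localhost', 'port': 3306,
                         'user': 'repl', 'passwd': 'secret'},
    server_id=100,   # must be unique among replicas of this server
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
    blocking=True,
)

for event in stream:
    for row in event.rows:
        # Inserts/deletes expose 'values'; updates expose 'after_values'.
        change = row.get('after_values') or row.get('values')
        # Land every change in a staging table; a periodic MERGE can then
        # apply it to the reporting table.
        bq.insert_rows_json('my_project.my_dataset.mysql_changes', [change])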
Are there any good online resources to learn how writing data from Spark to Vertica works? I'm trying to understand why writing to a Vertica database is slow.
This is my basic workflow:
Create a SparkContext. I'm using the class pyspark.sql.SQLContext to create one.
From the SQLContext, use the read method to get a DataFrameReader with the 'jdbc' format:
df = self._sqlContext.read.format('jdbc').options(url=self._jdbcURL, dbtable=subquery).load()
Read entries from a Vertica database over the JDBC connection (call it dbA)
Write those entries into another Vertica database using the SparkContext from Step 1 (call it dbB); see the snippet below
Right now it's just a simple read from dbA and write to dbB. But writing 50 entries takes about 5 seconds.
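For reference, the write side of such a plain-JDBC workflow typically looks like the snippet below; the URL, table name and credentials are placeholders, and batchsize is the usual knob to tune when JDBC writes are slow:

# Illustrative JDBC write to dbB; connection details are placeholders.
(df.write
    .format('jdbc')
    .option('url', 'jdbc:vertica://dbB-host:5433/mydb')
    .option('driver', 'com.vertica.jdbc.Driver')
    .option('dbtable', 'target_table')
    .option('user', 'dbadmin')
    .option('password', 'secret')
    .option('batchsize', 10000)  # batch inserts instead of row-by-row round trips
    .mode('append')
    .save())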
Thanks!
Have you tried HPE's Big Data Marketplace, specifically the HPE Vertica Connector For Apache Spark? You'll need to create an account to download the file, but there's no cost associated with creating an account. The documentation includes a Scala example of writing a Spark data frame to a Vertica table.