I have a table in Google BigQuery (GBQ) with almost 3 million records (rows) so far, built from data coming out of a MySQL db every day. This data is inserted into the GBQ table using a Python pandas DataFrame (.to_gbq()).
What is the optimal way to sync changes from MySQL to GBQ, in this direction, with Python?
Several different ways to import data from MySQL to BigQuery that might suit your needs are described in this article. For example, binlog replication:
This approach (sometimes referred to as change data capture, or CDC) utilizes MySQL's binlog. The binlog keeps an ordered log of every DELETE, INSERT, and UPDATE operation, as well as the Data Definition Language (DDL) statements executed against the database. After an initial dump of the current state of the MySQL database, the binlog changes are continuously streamed and loaded into Google BigQuery.
Seems to be exactly what you are searching for.
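If you prefer to roll your own lightweight version of this in Python, a minimal sketch could combine the third-party python-mysql-replication package with the google-cloud-bigquery client. The package choice, connection settings and table names below are assumptions for illustration, not something the article prescribes:

from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent,
)
from google.cloud import bigquery

bq = bigquery.Client()
TABLE_ID = "my_project.my_dataset.mysql_changes"  # placeholder

# Read row-level events from the MySQL binlog (requires binlog_format=ROW).
stream = BinLogStreamReader(
    connection_settings={"host": "mysql-host", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,                     # any id not used by another replica
    only_events=[WriteRowsEvent, UpdateRowsEvent, DeleteRowsEvent],
    resume_stream=True,
    blocking=True,
)

for event in stream:
    rows = []
    for row in event.rows:
        if isinstance(event, UpdateRowsEvent):
            values, op = row["after_values"], "UPDATE"
        elif isinstance(event, DeleteRowsEvent):
            values, op = row["values"], "DELETE"
        else:
            values, op = row["values"], "INSERT"
        # Values may need converting to JSON-friendly types (e.g. datetimes to strings).
        rows.append({**values, "_op": op})
    # Stream the change records into BigQuery.
    errors = bq.insert_rows_json(TABLE_ID, rows)
    if errors:
        print("BigQuery insert errors:", errors)

The _op column is only a convention in this sketch; you would still need to apply the change records to your main table afterwards, e.g. with a view or a periodic MERGE.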
Related
I have an application which uses Cassandra as its database. I need to create some kind of reports from the Cassandra data, but the data is not modelled as per the report queries, so one report may have data scattered across multiple tables. As Cassandra doesn't allow joins like an RDBMS, this is not simple to do. So I am thinking of a solution that gets the required tables' data into some other DB (RDBMS or Mongo) in real time and then generates the report from there. Do we have any standard way to get data from Cassandra to other DBs (Mongo or RDBMS) in real time, i.e. whenever an insert/update/delete happens in Cassandra the same is applied to the destination DB? Any example program or code would be very helpful.
You would be better off using Spark together with the Spark Cassandra Connector for this task. With Spark you can do the joins in memory and write the results back to Cassandra or to any text file.
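For example, a rough PySpark sketch of that idea; the connector version, host, keyspace, table and column names below are made up for illustration:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-report")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.12:3.4.1")
         .config("spark.cassandra.connection.host", "cassandra-host")
         .getOrCreate())

def cassandra_table(keyspace, table):
    # Each Cassandra table is exposed to Spark as a DataFrame.
    return (spark.read
            .format("org.apache.spark.sql.cassandra")
            .options(keyspace=keyspace, table=table)
            .load())

orders = cassandra_table("shop", "orders")
customers = cassandra_table("shop", "customers")

# The join happens in Spark's memory, which Cassandra itself cannot do.
report = orders.join(customers, "customer_id")
report.write.mode("overwrite").csv("/tmp/orders_report")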
Presently, we send entire files to the Cloud (Google Cloud Storage) to be imported into BigQuery and do a simple drop/replace. However, as the file sizes have grown, our network team doesn't particularly like the bandwidth we are taking while other ETLs are also trying to run. As a result, we are looking into sending up changed/deleted rows only.
Trying to find the path/help docs on how to do this. Scope: I will start with a simple example. We have a large table with 300 million records. Rather than sending 300 million records every night, send over the X million that have changed or been deleted. I then need to incorporate the changed/deleted records into the BigQuery tables.
We presently use Node.js to move data from Storage to BigQuery, and Python via Composer to schedule native table updates in BigQuery.
Hope to get pointed in the right direction for how to start down this path.
Stream the full row on every update to BigQuery.
Let the table accommodate multiple rows for the same primary entity.
Write a view, e.g. table_last, that picks the most recent row.
This way all your queries run near real-time on the real data.
You can occasionally deduplicate the table by running a query that rewrites the table keeping only the latest row per entity.
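For instance, a minimal sketch of such a view with the google-cloud-bigquery client, assuming a primary key column id and a timestamp column updated_at (both placeholders, not from the question):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names: `dataset.table` holds every streamed version of a row.
client.query("""
CREATE OR REPLACE VIEW `dataset.table_last` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM `dataset.table`
)
WHERE rn = 1
""").result()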
Another approach: keep one final table and one table that you stream into, and have a MERGE statement scheduled to run every X minutes that writes the updates from the streamed table into the final table.
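A minimal sketch of that scheduled MERGE via the Python client; the table names, the columns id, name, updated_at and the op marker for deletes are all assumptions for illustration:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names: `dataset.staging` is the table you stream into,
# `dataset.final` is the table you serve queries from.
merge_sql = """
MERGE `dataset.final` T
USING (
  SELECT * EXCEPT(rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
    FROM `dataset.staging`
  )
  WHERE rn = 1
) S
ON T.id = S.id
WHEN MATCHED AND S.op = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET name = S.name, updated_at = S.updated_at
WHEN NOT MATCHED AND S.op != 'DELETE' THEN
  INSERT (id, name, updated_at) VALUES (S.id, S.name, S.updated_at)
"""
client.query(merge_sql).result()  # schedule this from Composer / cron every X minutes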
I'm trying to read from a CosmosDB collection (MachineCollection) with a large amount of data (58 GB of data; index size 9 GB). Throughput is set to 1000 RU/s. The collection is partitioned by serial number, with read locations WestEurope and NorthEurope and write location WestEurope. In parallel with my reading attempts, the MachineCollection is fed with data every 20 seconds.
The problem is that I can not query any data via Python. If I execute the query on CosmosDB Data Explorer I get results in no time. (e.g. querying for a certain serial number).
For troubleshooting purposes, I have created a new database (TestDB) and a TestCollection. In this TestCollection there are 10 datasets from MachineCollection. If I try to read from this TestCollection via Python it succeeds and I am able to save the data to CSV.
This makes me wonder why I am not able to query data from MachineCollection when configuring TestDB and TestCollection with the exact same properties.
What I have already tried for querying via Python:
options['enableCrossPartitionQuery'] = True
Querying using PartitionKey: options['partitionKey'] = 'certainSerialnumber'
Same as always. Works with TestCollection, but not with MachineCollection.
Any ideas on how to resolve this issue are highly appreciated!
Firstly, what you need to know is that Document DB imposes limits on response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?
Secondly, if you want to query large amounts of data from Document DB, you have to consider query performance; please refer to this article: Tuning query performance with Azure Cosmos DB.
By looking at the Document DB REST API, you can observe several important parameters which have a significant impact on query operations: x-ms-max-item-count and x-ms-continuation.
As far as I know, the Azure portal doesn't automatically optimize your SQL for you, so you need to handle this in the SDK or REST API.
You could set the value of Max Item Count and paginate your data using the continuation token. The Document DB SDK supports reading paginated data seamlessly. You could refer to the snippet of Python code below:
# q is a QueryIterable; maxItemCount limits each page to 10 documents
q = client.QueryDocuments(collection_link, query, {'maxItemCount': 10})
# _fetch_function is a private helper of the SDK; it returns (documents, response headers)
results_1 = q._fetch_function({'maxItemCount': 10})
# the continuation token is a string representing a JSON object
token = results_1[1]['x-ms-continuation']
# pass the token back to fetch the next page
results_2 = q._fetch_function({'maxItemCount': 10, 'continuation': token})
Another case you could refer to: How do I set continuation tokens for Cosmos DB queries sent by document_client objects in Python?
I've been able to append to / create a table from a Pandas DataFrame using the pandas-gbq package, in particular its to_gbq method. However, when I want to check the table in the BigQuery web UI I see the following message:
This table has records in the streaming buffer that may not be visible in the preview.
I'm not the only one to ask, and it seems that there's no solution to this yet.
So my questions are:
1. Is there a solution to the above problem (namely the data not being visible in the web UI)?
2. If there is no solution to (1), is there another way that I can append data to an existing table using the Python BigQuery API? (Note: the documentation says I can achieve this by running an asynchronous query with writeDisposition=WRITE_APPEND, but the link it provides doesn't explain how to use it and I can't work it out.)
That message is just a UI notice; it should not hold you back.
To check the data, run a simple query and see if it's there.
To read only the data that is still in the streaming buffer, use this query:
#standardSQL
SELECT count(1)
FROM `dataset.table` WHERE _PARTITIONTIME is null
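On question 2: with the google-cloud-bigquery client you can append to an existing table through a load job instead of a query. A minimal sketch follows, where the table id and DataFrame are placeholders; note that load jobs don't go through the streaming buffer, so the preview caveat above doesn't apply to them:

from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"             # placeholder
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})  # placeholder

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()  # wait for the load job to finish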
I know there are various ETL tools available to export data from Oracle to MongoDB, but I wish to use Python as the intermediate step to perform this. Can anyone please guide me on how to proceed with this?
Requirement:
Initially I want to add all the records from Oracle to MongoDB, and after that I want to insert only the newly inserted records from Oracle into MongoDB.
Appreciate any kind of help.
To answer your question directly:
1. Connect to Oracle
2. Fetch all the delta data by timestamp or id (first time is all records)
3. Transform the data to json
4. Write the json to mongo with pymongo
5. Save the maximum timestamp / id for next iteration
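A minimal sketch of those steps with cx_Oracle and pymongo; the table, column names and connection details are placeholders, and it tracks the delta with a numeric id high-water mark:

import cx_Oracle
from pymongo import MongoClient

# Placeholder connection details.
ora = cx_Oracle.connect("scott", "tiger", "dbhost/orclpdb1")
mongo = MongoClient("mongodb://localhost:27017")
db = mongo["reporting"]

# Step 5 (from the previous run): the high-water mark lives in a small Mongo collection.
state = db["sync_state"].find_one({"_id": "employees"}) or {"last_id": 0}

# Step 2: fetch only rows added since the last run (everything on the first run).
cur = ora.cursor()
cur.execute(
    "SELECT id, name, hire_date FROM employees WHERE id > :last_id ORDER BY id",
    last_id=state["last_id"],
)
columns = [d[0].lower() for d in cur.description]

max_id = state["last_id"]
docs = []
for row in cur:
    doc = dict(zip(columns, row))   # Step 3: a row dict is already JSON-like for Mongo
    docs.append(doc)
    max_id = max(max_id, doc["id"])

if docs:
    db["employees"].insert_many(docs)              # Step 4: write to Mongo
    db["sync_state"].replace_one({"_id": "employees"},
                                 {"_id": "employees", "last_id": max_id},
                                 upsert=True)      # Step 5: save the new high-water mark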
Keep in mind that you should think about the data model; a relational DB (like Oracle) and a document DB (like Mongo) will usually have different data models.