What are best-practice approaches for consuming messages from Kafka and generating INSERT/UPDATE/DELETE statements for relational databases using Python?
Say I have events that create, update, or delete an entity, and I want those messages transformed into the relevant SQL statements.
Is there any suggestion other than writing the serialization by hand?
There is no way around deserializing the record from Kafka and serializing it into the appropriate database query. I would not recommend writing literal SQL statements as Kafka records and running those directly against a database client.
As commented, you can instead produce data in a format that Kafka Connect supports (JSON Schema, Avro, or Protobuf being the most common and best documented), optionally using a Schema Registry, then use a Sink Connector for your database.
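For example, registering a JDBC sink connector through the Kafka Connect REST API from Python might look roughly like this (the worker URL, connector class, and all settings are illustrative and assume the Confluent JDBC sink connector is installed on the Connect worker):

import requests

# Illustrative connector definition; adjust names, credentials, and topic for your setup
connector = {
    "name": "entity-postgres-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "entities",
        "connection.url": "jdbc:postgresql://db:5432/mydb",
        "connection.user": "app",
        "connection.password": "secret",
        "insert.mode": "upsert",      # create/update events become upserts in the target table
        "pk.mode": "record_key",
        "pk.fields": "id",
        "delete.enabled": "true",     # tombstone records become DELETEs
        "auto.create": "true",
    },
}

# POST the definition to a Kafka Connect worker (assumed to be listening on localhost:8083)
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()

With this approach the connector handles the SQL generation, so no hand-written serialization code is needed on the consuming side.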
I'm developing an application in Python which uses Azure Cosmos DB as the main database. At some point in the app, I need to insert bulk data (a batch of items) into Cosmos DB. So far, I've been using Azure Cosmos DB Python SDK for SQL API for communicating with Cosmos DB; however, it doesn't provide a method for bulk data insertion.
As I understand it, these are the insertion methods provided by this SDK; both insert a single item, which can be very slow when called in a for loop:
.upsert_item()
.create_item()
Is there another way to use this SDK to insert bulk data instead of using the methods above in a for loop? If not, is there an Azure REST API that can handle bulk data insertion?
The Cosmos DB service does not provide this via its REST API. Bulk mode is implemented at the SDK layer, and unfortunately the Python SDK does not yet support it. It does, however, support asynchronous I/O. Here's an example that may help you.
from azure.cosmos.aio import CosmosClient
import asyncio
import os

URL = os.environ['ACCOUNT_URI']
KEY = os.environ['ACCOUNT_KEY']
DATABASE_NAME = 'myDatabase'
CONTAINER_NAME = 'myContainer'

async def create_products():
    # The async client lets requests run on the event loop without blocking
    async with CosmosClient(URL, credential=KEY) as client:
        database = client.get_database_client(DATABASE_NAME)
        container = database.get_container_client(CONTAINER_NAME)
        for i in range(10):
            await container.upsert_item({
                'id': 'item{0}'.format(i),
                'productName': 'Widget',
                'productModel': 'Model {0}'.format(i)
            })

asyncio.run(create_products())
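Note that awaiting each upsert inside the loop still runs the requests one after another. To actually overlap them, the coroutines can be gathered and run concurrently, e.g. with asyncio.gather (a sketch re-using the imports and constants from the example above):

async def create_products_concurrently():
    async with CosmosClient(URL, credential=KEY) as client:
        container = client.get_database_client(DATABASE_NAME) \
                          .get_container_client(CONTAINER_NAME)
        # Build all the upsert coroutines first, then let asyncio run them concurrently
        tasks = [
            container.upsert_item({
                'id': 'item{0}'.format(i),
                'productName': 'Widget',
                'productModel': 'Model {0}'.format(i)
            })
            for i in range(10)
        ]
        await asyncio.gather(*tasks)

asyncio.run(create_products_concurrently())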
Update: I remembered another way to do bulk inserts with the Cosmos DB Python SDK, and that is stored procedures. There are examples of how to write these, including samples that demonstrate passing an array, which is what you want to do. I would also take a look at bounded execution, as you will want to implement that as well. You can learn how to write them here: How to write stored procedures. Then how to register and call them here: How to use Stored Procedures. Note: these can only be used when passing a partition key value, so you can only do batches within logical partitions.
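To give a rough idea, registering and calling such a stored procedure from the Python SDK could look like this (the database, container, partition key property, and the JavaScript body are illustrative; a production version should handle bounded execution by checking the boolean returned by createDocument and resuming from the last index):

from azure.cosmos import CosmosClient
import os

URL = os.environ['ACCOUNT_URI']
KEY = os.environ['ACCOUNT_KEY']

client = CosmosClient(URL, credential=KEY)
container = client.get_database_client('myDatabase').get_container_client('myContainer')

# Minimal JavaScript bulk-insert procedure; it runs server-side within one logical partition
SPROC_BODY = """
function bulkInsert(docs) {
    var collection = getContext().getCollection();
    var count = 0;
    if (!docs) throw new Error('No documents supplied');
    docs.forEach(function (doc) {
        collection.createDocument(collection.getSelfLink(), doc, function (err) {
            if (err) throw err;
            count++;
            getContext().getResponse().setBody(count);
        });
    });
}
"""

container.scripts.create_stored_procedure(body={'id': 'bulkInsert', 'body': SPROC_BODY})

items = [{'id': 'item{0}'.format(i), 'pk': 'partition1', 'productName': 'Widget'} for i in range(10)]
container.scripts.execute_stored_procedure(
    sproc='bulkInsert',
    partition_key='partition1',   # every document in the batch must belong to this partition
    params=[items]
)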
The approach I am trying is to write a dynamic script that generates mirror tables in SQL Server with data types similar to the ones in Oracle, and then another dynamic script to insert the records into SQL Server. The challenge I see is incompatible data types. Has anyone come across a similar situation? I am a SQL developer, but I can learn Python if someone can share similar work they have done.
Have you tried the "SQL Server Import and Export Wizard" in SSMS?
i.e. if you create an empty SQL Server database and right-click on it in SSMS, one of the "Tasks" menu options is "Import Data...", which starts the "SQL Server Import and Export Wizard". This builds a one-off SSIS package, which can be saved if you want to re-use it.
There is a data source option for "Microsoft OLE DB Provider for Oracle".
You might also have a better Oracle OLE DB provider available to try.
This will require Oracle client software to be installed.
I haven't actually tried this (Oracle to SQL Server), so I'm not sure whether it is reasonable or not.
How many tables, columns?
The Oracle DB may also have views, triggers, constraints, indexes, functions, packages, sequence generators, and synonyms.
I used a linked server and got all the table metadata from dba_tab_columns in Oracle, then wrote a script to create the tables based on that metadata. I needed an SSIS script task to save the CREATE TABLE scripts for source control. Then I wrote a SQL script to insert the data from Oracle, handling the type differences in the script.
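For anyone wanting to do the metadata-driven part in Python instead, a rough sketch of generating CREATE TABLE statements from dba_tab_columns could look like this (the type map is deliberately partial, the connection details, schema, and table names are made up, and python-oracledb is assumed as the driver):

import oracledb  # pip install oracledb

# Partial, illustrative Oracle-to-SQL-Server type map; extend it for your schema
TYPE_MAP = {
    'VARCHAR2': lambda c: 'NVARCHAR({0})'.format(c['length']),
    'CHAR':     lambda c: 'NCHAR({0})'.format(c['length']),
    'NUMBER':   lambda c: 'NUMERIC({0},{1})'.format(c['precision'] or 38, c['scale'] or 0),
    'DATE':     lambda c: 'DATETIME2',
    'CLOB':     lambda c: 'NVARCHAR(MAX)',
    'BLOB':     lambda c: 'VARBINARY(MAX)',
}

def build_create_table(conn, owner, table):
    cur = conn.cursor()
    cur.execute(
        """SELECT column_name, data_type, data_length, data_precision, data_scale
           FROM dba_tab_columns
           WHERE owner = :owner AND table_name = :tab
           ORDER BY column_id""",
        owner=owner, tab=table)
    cols = []
    for name, dtype, length, precision, scale in cur:
        c = {'length': length, 'precision': precision, 'scale': scale}
        sql_type = TYPE_MAP.get(dtype, lambda c: 'NVARCHAR(MAX)')(c)
        cols.append('    [{0}] {1}'.format(name, sql_type))
    return 'CREATE TABLE [{0}] (\n{1}\n);'.format(table, ',\n'.join(cols))

with oracledb.connect(user='scott', password='tiger', dsn='oraclehost/orclpdb') as conn:
    print(build_create_table(conn, 'MYSCHEMA', 'ORDERS'))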
I'm brand new to using the Elastic Stack so excuse my lack of knowledge on the subject. I'm running the Elastic Stack on a Windows 10, corporate work computer. I have Git Bash installed for a bash cli, and I can successfully launch the entire Elastic Stack. My task is to take log data that is stored in one of our databases and display it on a Kibana dashboard.
From what my team and I have reasoned, I don't need to use Logstash because the database that the logs are sent to is effectively our 'log stash', so using the Logstash service would be redundant. I found a nifty diagram on freeCodeCamp, and from what I gather, Logstash is just the intermediary for log retrieval from different services. So instead of using Logstash, since the log data is already in a database, I could just do something like this:
USER ---> KIBANA <---> ELASTICSEARCH <--- My Python Script <--- [DATABASE]
My Python script successfully calls our database and retrieves the data, and has a function that molds the data into a dict object (as I understand it, Elasticsearch takes data in JSON format).
Now I want to insert all of that data into Elasticsearch - I've been reading the Elastic docs, and there's a lot of talk about indexing that isn't really indexing, and I haven't found any API calls I can use to plug the data right into Elasticsearch. All of the documentation I've found so far concerns the use of Logstash, but since I'm not using Logstash, I'm kind of at a loss here.
If there's anyone who can help me out and point me in the right direction I'd appreciate it. Thanks
-Dan
You ingest data into Elasticsearch using the Index API; it is basically a request using the PUT method.
To do that with Python you can use elasticsearch-py, the official python client for elasticsearch.
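For instance, indexing a single document and then a whole batch with elasticsearch-py could look roughly like this (the cluster URL, index name, and document shape are assumptions, and parameter names vary slightly between client versions):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # adjust for your cluster

log_docs = [
    {"timestamp": "2023-01-01T00:00:00", "level": "INFO", "message": "service started"},
    {"timestamp": "2023-01-01T00:00:05", "level": "WARN", "message": "slow query"},
]

# Index a single document
es.index(index="app-logs", document=log_docs[0])

# Index many documents in one bulk request
helpers.bulk(es, ({"_index": "app-logs", "_source": doc} for doc in log_docs))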
But sometimes what you need is easier to do with Logstash, since it can extract the data from your database, format it using its many filters, and send it to Elasticsearch.
We have a number of MongoDB collections that store data generated by the application (e-commerce order updates, deliveries, cancellations, new orders, etc.). Currently we follow a traditional ETL approach: scheduled data pulls (converted to files in S3 / staging load) and a load into the DW. As data volume increases this feels inefficient, since our reports lag at least one day behind compared to real-time streaming or similar newer ETL approaches. As a streaming option, the first thing I read about was Apache Kafka, which is very popular, but the biggest challenge I'm facing is how to turn these MongoDB collections into Kafka topics.
I read this MongoDb Streaming Out Inserted Data in Real-time (or near real-time). We are not using capped collections so that recommended solution won't work for us.
Can MongoDB collection be a Kafka producer?
Is there any better way to pull data from MongoDB in real time / near real time to the target DB / S3, other than Kafka?
Note: I prefer a Python solution that can easily integrate into our current workflow, rather than Java/Scala.
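Roughly, the kind of Python-side bridge I am imagining looks like the sketch below (it assumes a replica set so change streams are available, the confluent-kafka client, and made-up database/topic names):

from pymongo import MongoClient
from confluent_kafka import Producer
import json

# Change streams require a replica set (or sharded cluster)
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client["ecommerce"]["orders"]

producer = Producer({"bootstrap.servers": "localhost:9092"})

# Forward every insert/update/delete on the collection to a Kafka topic as it happens
with orders.watch(full_document="updateLookup") as stream:
    for change in stream:
        producer.produce(
            "mongo.orders",
            key=str(change["documentKey"]["_id"]),
            value=json.dumps(change, default=str),
        )
        producer.poll(0)  # serve delivery callbacks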
Thanks
After scanning very large daily event logs using regular expressions, I have to load the results into a SQL Server database. I am not allowed to create a temporary CSV file and then use the command-line BCP utility to load it into the database.
Using Python, is it possible to use BCP streaming to load data into a SQL Server database? The reason I want to use BCP is to improve the speed of the inserts.
Thanks
The BCP API is only available through the ODBC call-level interface and the managed SqlClient .NET API via the SqlBulkCopy class. I'm not aware of a Python extension that provides BCP API access.
You can insert many rows in a single transaction to improve performance. This can be accomplished by batching individual insert statements or by passing multiple rows at once using an XML parameter (which also reduces round-trips).
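As an illustration of the batching idea, a minimal pyodbc sketch might look like this (the connection string, table, and column names are assumptions, and fast_executemany uses the driver's parameter-array binding rather than the BCP API):

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;"
    "DATABASE=mydb;Trusted_Connection=yes;",
    autocommit=False,
)
cursor = conn.cursor()
cursor.fast_executemany = True  # bind all rows at once instead of one round-trip per row

# rows parsed from the event logs, e.g. (timestamp, level, message) tuples
rows = [
    ("2023-01-01 00:00:01", "INFO", "service started"),
    ("2023-01-01 00:00:05", "WARN", "slow response"),
]

cursor.executemany(
    "INSERT INTO dbo.EventLog (EventTime, Level, Message) VALUES (?, ?, ?)",
    rows,
)
conn.commit()  # one transaction for the whole batch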