After scanning very large daily event logs with regular expressions, I have to load them into a SQL Server database. I am not allowed to create a temporary CSV file and then use the command-line BCP utility to load it into the SQL Server database.
Using Python, is it possible to stream data into the SQL Server database via BCP? The reason I want to use BCP is to improve the speed of the inserts into the SQL Server database.
Thanks
The BCP API is only available through the ODBC call-level interface and the managed .NET SqlClient API via the SqlBulkCopy class. I'm not aware of a Python extension that provides access to the BCP API.
You can insert many rows in a single transaction to improve performance. This can be accomplished by batching individual INSERT statements or by passing multiple rows at once using an XML parameter (which also reduces round trips).
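If you are using pyodbc, here is a rough sketch of batching from Python (the connection string, table and column names are placeholders): fast_executemany sends the whole parameter array per round trip instead of one row at a time, which gets much closer to bulk-load speed than row-by-row inserts.

import pyodbc

# Placeholder connection string; adjust server, database and driver to your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cursor = conn.cursor()
cursor.fast_executemany = True  # send the parameter array in bulk

rows = [
    ("2024-01-01 00:00:01", "login", "user1"),
    ("2024-01-01 00:00:02", "logout", "user2"),
]

# One round trip per batch instead of one per row.
cursor.executemany(
    "INSERT INTO dbo.EventLog (EventTime, EventType, UserName) VALUES (?, ?, ?)",
    rows,
)
conn.commit()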
I'm developing an application in Python which uses Azure Cosmos DB as the main database. At some point in the app, I need to insert bulk data (a batch of items) into Cosmos DB. So far, I've been using Azure Cosmos DB Python SDK for SQL API for communicating with Cosmos DB; however, it doesn't provide a method for bulk data insertion.
As I understand it, these are the insertion methods provided in this SDK, both of which only support single-item inserts, which can be very slow when called in a for loop:
.upsert_item()
.create_item()
Is there another way to use this SDK to insert bulk data instead of using the methods above in a for loop? If not, is there an Azure REST API that can handle bulk data insertion?
The Cosmos DB service does not provide this via its REST API. Bulk mode is implemented at the SDK layer and, unfortunately, the Python SDK does not yet support it. It does, however, support asynchronous I/O. Here's an example that may help you.
import asyncio
import os

from azure.cosmos.aio import CosmosClient

URL = os.environ['ACCOUNT_URI']
KEY = os.environ['ACCOUNT_KEY']
DATABASE_NAME = 'myDatabase'
CONTAINER_NAME = 'myContainer'

async def create_products():
    async with CosmosClient(URL, credential=KEY) as client:
        database = client.get_database_client(DATABASE_NAME)
        container = database.get_container_client(CONTAINER_NAME)
        for i in range(10):
            await container.upsert_item({
                'id': 'item{0}'.format(i),
                'productName': 'Widget',
                'productModel': 'Model {0}'.format(i)
            })

asyncio.run(create_products())
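Note that the loop above still awaits each upsert one at a time. A rough sketch of issuing the upserts concurrently with asyncio.gather (same illustrative container as above, called inside the same async with block) may get you closer to the throughput you're after; the actual speedup depends on your provisioned RUs.

async def create_products_concurrently(container):
    # Build the upsert coroutines first, then run them concurrently.
    tasks = [
        container.upsert_item({
            'id': 'item{0}'.format(i),
            'productName': 'Widget',
            'productModel': 'Model {0}'.format(i)
        })
        for i in range(10)
    ]
    await asyncio.gather(*tasks)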
Update: I remembered another way you can do bulk inserts with the Cosmos DB Python SDK, and that is using stored procedures. There are examples of how to write these, including samples that demonstrate passing an array, which is what you want to do. I would also take a look at bounded execution, as you will want to implement that as well. You can learn how to write them here, How to write stored procedures, and how to register and call them here, How to use Stored Procedures. Note: stored procedures can only be executed with a partition key value, so you can only do batches within a logical partition.
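To make that concrete, here is a rough sketch of registering and calling such a stored procedure through the SDK's scripts API (using the synchronous client for brevity; the JavaScript body is illustrative only, does not handle bounded execution, and assumes the container is partitioned on /pk):

import os
from azure.cosmos import CosmosClient

client = CosmosClient(os.environ['ACCOUNT_URI'], credential=os.environ['ACCOUNT_KEY'])
container = client.get_database_client('myDatabase').get_container_client('myContainer')

# Minimal JavaScript sproc that inserts an array of documents.
BULK_INSERT_SPROC = {
    'id': 'bulkInsert',
    'body': """
function bulkInsert(items) {
    var collection = getContext().getCollection();
    var inserted = 0;
    items.forEach(function (item) {
        collection.createDocument(collection.getSelfLink(), item, function (err) {
            if (err) throw err;
            inserted++;
        });
    });
    getContext().getResponse().setBody(inserted);
}
"""
}

container.scripts.create_stored_procedure(body=BULK_INSERT_SPROC)

# All items in one call must share the same partition key value ('widgets' here).
items = [{'id': 'item{0}'.format(i), 'productName': 'Widget', 'pk': 'widgets'}
         for i in range(10)]
result = container.scripts.execute_stored_procedure(
    sproc='bulkInsert', params=[items], partition_key='widgets')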
What are best-practice approaches for consuming messages from Kafka and generating INSERT/UPDATE/DELETE statements for relational databases using Python?
Say I have events that create, update, or delete an entity, and I want those messages to be transformed into the relevant SQL statements.
Is there any suggestion other than writing the serialization manually?
There is no way around deserializing the record from Kafka and serializing it into the appropriate database query. I would not recommend writing literal SQL statements as Kafka records and running them directly against a database client.
As commented, you can instead produce data in a supported format (JSON Schema, Avro, or Protobuf being the most common / best documented), optionally using a Schema Registry, and then use a Kafka Connect sink connector for your database.
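If you do go the manual route, here is a rough sketch of the idea (assuming JSON-encoded events with an 'op' field, kafka-python, and SQLite standing in for whatever DB-API driver you use; topic, table, and field names are illustrative): deserialize each record and map it onto a parameterized statement rather than shipping raw SQL through Kafka.

import json
import sqlite3  # stand-in for any DB-API connection (psycopg2, pyodbc, ...)
from kafka import KafkaConsumer

conn = sqlite3.connect('entities.db')
consumer = KafkaConsumer(
    'entity-events',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda v: json.loads(v.decode('utf-8')),
)

# Map each event type onto a parameterized statement; never execute raw SQL from the topic.
for msg in consumer:
    event = msg.value
    if event['op'] == 'create':
        conn.execute('INSERT INTO entity (id, name) VALUES (?, ?)',
                     (event['id'], event['name']))
    elif event['op'] == 'update':
        conn.execute('UPDATE entity SET name = ? WHERE id = ?',
                     (event['name'], event['id']))
    elif event['op'] == 'delete':
        conn.execute('DELETE FROM entity WHERE id = ?', (event['id'],))
    conn.commit()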
We can basically use Databricks as an intermediate layer, but I'm stuck on the Python script to replicate data from Blob Storage to Azure MySQL every 30 seconds. We are using CSV files here, and the script needs to store the CSVs with current timestamps.
There is no ready-made streaming option for MySQL in Spark/Databricks, as MySQL is not a streaming source/sink technology.
In Databricks you can use the writeStream .foreach() or .foreachBatch() option. Each micro-batch is exposed as a temporary DataFrame, which you can then save wherever you choose (e.g. write it to MySQL over JDBC).
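A rough foreachBatch sketch (paths, schema, and connection details are placeholders; it assumes the MySQL JDBC driver is available on the cluster):

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema for the incoming CSV columns; adjust to your files.
input_schema = StructType([
    StructField("event_time", StringType()),
    StructField("payload", StringType()),
])

# Read CSVs as a stream from the blob-backed path (mounted or abfss/wasbs).
stream_df = (spark.readStream
    .format("csv")
    .option("header", "true")
    .schema(input_schema)
    .load("/mnt/blob/input/"))

def write_to_mysql(batch_df, batch_id):
    # Each micro-batch is a normal DataFrame, so the JDBC batch writer works here.
    (batch_df.write
        .format("jdbc")
        .option("url", "jdbc:mysql://myserver.mysql.database.azure.com:3306/mydb")
        .option("dbtable", "events")
        .option("user", "myuser")
        .option("password", "mypassword")
        .mode("append")
        .save())

(stream_df.writeStream
    .foreachBatch(write_to_mysql)
    .trigger(processingTime="30 seconds")
    .start())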
Personally I would go for the simple solution. In Azure Data Factory it is enough to create two datasets (it can even be done without them), one for MySQL and one for Blob Storage, and use a pipeline with a Copy activity to transfer the data.
I'm creating a GUI in Python to manipulate stored records, and I have the MySQL script to set up the database and enter all the information. How do I get from the MySQL script to the .db file so that Python can access and manipulate it?
.db files are SQLite databases most of the time. What you are trying to do is convert a dumped MySQL database into an SQLite database. This is not trivial, as the two SQL dialects are not fully compatible. If the input is simple enough, you can try running each statement of it through an SQLite connection in your Python script. If it uses more complex features, you may want to actually connect to a (populated) MySQL database and fetch the data from there, inserting it into a local SQLite file.
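If you go the second route, a rough sketch of the copy step (assuming PyMySQL, and that an equivalent table already exists on the SQLite side; database, table, and column names are illustrative):

import sqlite3
import pymysql

# Connect to the MySQL database that was loaded from your script.
mysql_conn = pymysql.connect(host='localhost', user='root',
                             password='secret', database='records')
sqlite_conn = sqlite3.connect('records.db')

with mysql_conn.cursor() as cur:
    cur.execute('SELECT id, name, created_at FROM records')
    rows = cur.fetchall()

# Assumes the matching table was already created in the SQLite file.
sqlite_conn.executemany(
    'INSERT INTO records (id, name, created_at) VALUES (?, ?, ?)', rows)
sqlite_conn.commit()

mysql_conn.close()
sqlite_conn.close()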
The approach I am trying is to write a dynamic script that generates mirror tables in SQL Server with data types similar to the Oracle originals, and then another dynamic script to insert the records into SQL Server. The challenge I see is incompatible data types. Has anyone come across a similar situation? I am a SQL developer, but I can learn Python if someone can share similar work.
Have you tried the "SQL Server Import and Export Wizard" in SSMS?
i.e. if you create an empty SQL Server database and right-click on it in SSMS, one of the "Tasks" menu options is "Import Data...", which starts the "SQL Server Import and Export Wizard". This builds a once-off SSIS package, which can be saved if you want to re-use it.
There is a data source option for "Microsoft OLE DB Provider for Oracle".
You might also have a better Oracle OLE DB provider available to try.
This will require Oracle client software to be available.
I haven't actually tried this (Oracle to SQL Server), so I'm not sure whether it is reasonable or not.
How many tables, columns?
The Oracle DB may also have views, triggers, constraints, indexes, functions, packages, sequence generators, and synonyms.
I used a linked server and got all the table metadata from dba_tab_columns in Oracle. I wrote a script to create the tables based on that metadata, and used an SSIS Script Task to save the CREATE TABLE scripts for source control. Then I wrote SQL scripts to insert the data from Oracle, handling the type differences in the scripts.
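For anyone trying the same thing from Python, a rough sketch of the metadata-driven approach (using python-oracledb and a hand-rolled type map; the mapping, credentials, and schema name below are illustrative and incomplete):

import oracledb

# Illustrative, incomplete mapping of Oracle types to SQL Server types.
TYPE_MAP = {
    'VARCHAR2': lambda length, p, s: 'VARCHAR({0})'.format(length),
    'CHAR':     lambda length, p, s: 'CHAR({0})'.format(length),
    'NUMBER':   lambda length, p, s: 'DECIMAL({0},{1})'.format(p or 38, s or 0),
    'DATE':     lambda length, p, s: 'DATETIME2',
    'CLOB':     lambda length, p, s: 'VARCHAR(MAX)',
}

conn = oracledb.connect(user='scott', password='tiger', dsn='orahost/orcl')
cur = conn.cursor()
cur.execute("""
    SELECT table_name, column_name, data_type, data_length, data_precision, data_scale
    FROM all_tab_columns
    WHERE owner = :owner
    ORDER BY table_name, column_id
""", owner='MYSCHEMA')

tables = {}
for table, column, dtype, length, precision, scale in cur:
    sql_type = TYPE_MAP.get(dtype, lambda l, p, s: 'VARCHAR(MAX)')(length, precision, scale)
    tables.setdefault(table, []).append('[{0}] {1}'.format(column, sql_type))

# Emit one CREATE TABLE statement per Oracle table for review / source control.
for table, cols in tables.items():
    print('CREATE TABLE [dbo].[{0}] (\n    {1}\n);\n'.format(table, ',\n    '.join(cols)))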