How to insert bulk data into Cosmos DB in Python?

I'm developing an application in Python which uses Azure Cosmos DB as the main database. At some point in the app, I need to insert bulk data (a batch of items) into Cosmos DB. So far, I've been using the Azure Cosmos DB Python SDK for SQL API for communicating with Cosmos DB; however, it doesn't provide a method for bulk data insertion.
As I understand it, these are the insertion methods provided by this SDK, both of which only support single-item inserts, which can be very slow when called in a for loop:
.upsert_item()
.create_item()
Is there another way to use this SDK to insert bulk data instead of using the methods above in a for loop? If not, is there an Azure REST API that can handle bulk data insertion?

The Cosmos DB service does not provide this via its REST API. Bulk mode is implemented at the SDK layer, and unfortunately the Python SDK does not yet support bulk mode. It does, however, support asynchronous IO. Here's an example that may help you.
import asyncio
import os

from azure.cosmos.aio import CosmosClient

URL = os.environ['ACCOUNT_URI']
KEY = os.environ['ACCOUNT_KEY']
DATABASE_NAME = 'myDatabase'
CONTAINER_NAME = 'myContainer'

async def create_products():
    async with CosmosClient(URL, credential=KEY) as client:
        database = client.get_database_client(DATABASE_NAME)
        container = database.get_container_client(CONTAINER_NAME)
        for i in range(10):
            await container.upsert_item({
                'id': 'item{0}'.format(i),
                'productName': 'Widget',
                'productModel': 'Model {0}'.format(i)
            })

asyncio.run(create_products())
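Note that each upsert above is awaited one at a time, so the loop is still effectively serial. Here is a sketch of the same inserts issued concurrently with asyncio.gather, which is where async IO actually buys throughput; names are reused from the example above, and concurrency limits and error handling are left out for brevity:

import asyncio
import os

from azure.cosmos.aio import CosmosClient

URL = os.environ['ACCOUNT_URI']
KEY = os.environ['ACCOUNT_KEY']
DATABASE_NAME = 'myDatabase'
CONTAINER_NAME = 'myContainer'

async def create_products_concurrently():
    async with CosmosClient(URL, credential=KEY) as client:
        database = client.get_database_client(DATABASE_NAME)
        container = database.get_container_client(CONTAINER_NAME)
        # Build all the upsert coroutines first, then run them concurrently.
        tasks = [
            container.upsert_item({
                'id': 'item{0}'.format(i),
                'productName': 'Widget',
                'productModel': 'Model {0}'.format(i),
            })
            for i in range(10)
        ]
        await asyncio.gather(*tasks)

asyncio.run(create_products_concurrently())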
Update: I remembered another way to do bulk inserts with the Cosmos DB Python SDK, and that is using stored procedures. There are examples of how to write these, including samples that demonstrate passing an array, which is what you want to do. I would also take a look at bounded execution, as you will want to implement that as well. You can learn how to write them here, How to write stored procedures. Then how to register and call them here, How to use Stored Procedures. Note: these can only be used when passing a partition key value, so you can only do batches within logical partitions. A sketch of registering and executing one from Python follows below.
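A minimal sketch, assuming a stored procedure named 'bulkImport' (written per the linked docs) that accepts an array of documents, and a container partitioned on /productName; the account, database, container, and file names here are all illustrative:

import os

from azure.cosmos import CosmosClient

client = CosmosClient(os.environ['ACCOUNT_URI'], credential=os.environ['ACCOUNT_KEY'])
container = client.get_database_client('myDatabase').get_container_client('myContainer')

# Register the sproc; its JavaScript body comes from the linked docs.
container.scripts.create_stored_procedure(body={
    'id': 'bulkImport',
    'body': open('bulkImport.js').read(),
})

# Execute it. Every item must share the partition key value passed here,
# since a sproc runs within a single logical partition.
items = [{'id': 'item{0}'.format(i), 'productName': 'Widget'} for i in range(100)]
container.scripts.execute_stored_procedure(
    sproc='bulkImport',
    partition_key='Widget',
    params=[items],
)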

Related

MaxItemCount in Azure Functions/Cosmos DB

For the Python API for Azure Functions (serverless) with a Cosmos DB input binding: is it possible to tune maxItemCount, or is it set dynamically? Some of my queries return large results, and the bottleneck seems to be throughput between Cosmos DB and the executing HTTP-triggered function.
/MG
Assuming you are using the SQL Query input option, you should be able to use TOP, or an OFFSET ... LIMIT clause along with ORDER BY, to cap the number of records returned. You may also want to take a look at the metrics on your Cosmos account, just in case maxing out your available RUs is causing the bottleneck.
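If you can run the query through the Cosmos Python SDK instead of the binding, the page size is also directly controllable. A sketch, with account and container names as placeholders; max_item_count caps each page, while OFFSET/LIMIT caps the total result set:

import os

from azure.cosmos import CosmosClient

client = CosmosClient(os.environ['ACCOUNT_URI'], credential=os.environ['ACCOUNT_KEY'])
container = client.get_database_client('myDatabase').get_container_client('myContainer')

results = container.query_items(
    query='SELECT * FROM c ORDER BY c._ts DESC OFFSET 0 LIMIT 100',
    enable_cross_partition_query=True,
    max_item_count=100,  # page size per round-trip
)
for item in results:
    print(item['id'])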

Trying to Connect to azure cosmos client using python, Gives 104 connection aborted error

Okay, so I have an Azure Cosmos DB subscription where I have created a MongoDB resource. Now when I use the Python SDK to connect to it, I get a 104 error: connection reset by peer.
I'm not sure what the issue is.
I am using the endpoint with SSL set to true and the primary key.
Code:
endpoint = "http://XXX.mongo.cosmos.azure.com:10255/?ssl=true"
key = 'xxxxxxxxxxxxxxxx'
# <create_cosmos_client>
client = CosmosClient(endpoint, key)
When choosing the MongoDB API, you must use a native MongoDB SDK (in your case, pymongo); the wire protocol is MongoDB, and operations are performed via the same protocol as native MongoDB.
Your code is attempting to use the Cosmos DB SDK, which is specific to, and will only work with, the Core (SQL) API.
If you look in the portal blade for your MongoDB-API instance, you'll see that the examples under the Quick Start tab each use a MongoDB SDK (or the mongo shell). Same with the Connection Strings tab, which shows native MongoDB connection strings (as well as the separate parts of the connection string).
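A minimal pymongo sketch against a Cosmos DB MongoDB-API account; the host, database, and collection names are placeholders, and the full connection string should be copied from the portal's Connection Strings blade:

import pymongo

CONNECTION_STRING = (
    'mongodb://<account-name>:<primary-key>@<account-name>.mongo.cosmos.azure.com:10255/'
    '?ssl=true&replicaSet=globaldb&retrywrites=false'
)

client = pymongo.MongoClient(CONNECTION_STRING)
db = client['myDatabase']
collection = db['myCollection']
collection.insert_one({'id': 'item1', 'productName': 'Widget'})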

Elasticsearch Data Insertion with Python

I'm brand new to using the Elastic Stack, so excuse my lack of knowledge on the subject. I'm running the Elastic Stack on a Windows 10 corporate work computer. I have Git Bash installed for a bash CLI, and I can successfully launch the entire Elastic Stack. My task is to take log data that is stored in one of our databases and display it on a Kibana dashboard.
From what my team and I have reasoned, I don't need to use Logstash because the database that the logs are sent to is effectively our 'log stash', so using the Logstash service would be redundant. I found a nifty diagram on freeCodeCamp, and from what I gather, Logstash is just the intermediary for log retrieval from different services. So instead of using Logstash, since the log data is already in a database, I could just do something like this:
USER ---> KIBANA <---> ELASTICSEARCH <--- My Python Script <--- [DATABASE]
My Python script successfully calls our database and retrieves the data, and I have a function that molds the data into a dict object (as I understand it, Elasticsearch takes data in JSON format).
Now I want to insert all of that data into Elasticsearch - I've been reading the Elastic docs, and there's a lot of talk about indexing that isn't really indexing, and I haven't found any API calls I can use to plug the data right into Elasticsearch. All of the documentation I've found so far concerns the use of Logstash, but since I'm not using Logstash, I'm kind of at a loss here.
If there's anyone who can help me out and point me in the right direction I'd appreciate it. Thanks
-Dan
You ingest data into Elasticsearch using the Index API; it is basically a request using the PUT method.
To do that with Python you can use elasticsearch-py, the official Python client for Elasticsearch.
But sometimes what you need is easier to do with Logstash, since it can extract the data from your database, format it using many filters, and send it to Elasticsearch.
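A minimal elasticsearch-py sketch; the index name and documents are illustrative, and the document= keyword assumes an 8.x client (older 7.x clients use body= instead):

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch('http://localhost:9200')

# Index a single document.
es.index(index='app-logs', document={'level': 'ERROR', 'message': 'disk full'})

# For many documents, the bulk helper is much faster than one request each.
docs = [
    {'_index': 'app-logs', '_source': {'level': 'INFO', 'message': 'started'}},
    {'_index': 'app-logs', '_source': {'level': 'WARN', 'message': 'retrying'}},
]
bulk(es, docs)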

Loading Large data into SQL Server [BCP] using Python

After scanning very large daily event logs with regular expressions, I have to load them into a SQL Server database. I am not allowed to create a temporary CSV file and then use the command-line BCP utility to load them into the SQL Server database.
Using Python, is it possible to use BCP streaming to load data into a SQL Server database? The reason I want to use BCP is to improve the speed of the insert into the SQL Server database.
Thanks
The BCP API is only available using the ODBC call-level interface and the managed SqlClient .NET API using the SqlBulkCopy class. I'm not aware of a Python extension that provides BCP API access.
You can insert many rows in a single transaction to improve performance. This can be accomplished by batching individual insert statements or by passing multiple rows at once using an XML parameter (which also reduces round-trips).
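As a concrete example of the batching approach, pyodbc's fast_executemany sends parameter arrays through the ODBC driver in one round-trip per batch, which is usually far faster than row-by-row inserts. To be clear, this is batched inserts, not the BCP API; the connection string, table, and columns below are placeholders:

import pyodbc

conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;'
    'DATABASE=mydb;Trusted_Connection=yes;',
    autocommit=False,
)
cur = conn.cursor()
cur.fast_executemany = True  # available since pyodbc 4.0.19

# In practice these rows would come from your regex scan of the event logs.
rows = [
    ('2024-01-01 00:00:00', 'ERROR', 'disk full'),
    ('2024-01-01 00:00:05', 'WARN', 'retrying'),
]
cur.executemany(
    'INSERT INTO EventLog (EventTime, Level, Message) VALUES (?, ?, ?)',
    rows,
)
conn.commit()  # one transaction for the whole batch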

Python database WITHOUT using Django (for Heroku)

To my surprise, I haven't found this question asked elsewhere. Short version, I'm writing an app that I plan to deploy to the cloud (probably using Heroku), which will do various web scraping and data collection. The reason it'll be in the cloud is so that I can have it be set to run on its own every day and pull the data to its database without my computer being on, as well as so the rest of the team can access the data.
I used to use AWS's SimpleDB and DynamoDB, but I found SDB's storage limitations to be too small and DDB's poor querying ability to be a problem, so I'm looking for a database system (SQL or NoSQL) that can store arbitrary-length values (and ideally arbitrary data structures) and that can be queried on any field.
I've found many database solutions for Heroku, such as ClearDB, but all of the information I've seen shows how to set up Django to access the database. Since this is intended to be a script and not a site, I'd really prefer not to dive into Django if I don't have to.
Is there any kind of database that I can hook up to in Heroku with Python without using Django?
You can get a database provided from Heroku without requiring your app to use Django. To do so:
heroku addons:add heroku-postgresql:dev
If you need a larger, more dedicated database, you can examine the plans at Heroku Postgres.
Within your requirements.txt you'll want to add:
psycopg2
Then you can connect/interact with it similar to the following:
import os
import urllib.parse as urlparse  # Python 3; on Python 2 this was the urlparse module

import psycopg2

urlparse.uses_netloc.append('postgres')
url = urlparse.urlparse(os.environ['DATABASE_URL'])

conn = psycopg2.connect(
    dbname=url.path[1:],
    user=url.username,
    password=url.password,
    host=url.hostname,
)
cur = conn.cursor()
query = "SELECT ...."
cur.execute(query)
I'd use MongoDB. Heroku has support for it, so I think it will be really easy to start and scale out: https://addons.heroku.com/mongohq
About Python: MongoDB is a really easy database. The schema is flexible and fits really well with Python dictionaries. That's something really good.
You can use PyMongo
import datetime

from pymongo import MongoClient  # Connection was removed in pymongo 3.x

connection = MongoClient()
# Get your DB
db = connection.my_database
# Get your collection
cars = db.cars
# Create a document
car = {
    "brand": "Ford",
    "model": "Mustang",
    "date": datetime.datetime.utcnow(),
}
# Insert it (insert() was removed in favor of insert_one/insert_many)
cars.insert_one(car)
Pretty simple, huh?
Hope it helps.
EDIT:
As Endophage mentioned, another good option for interfacing with Mongo is mongoengine. If you have lots of data to store, you should take a look at that.
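A tiny mongoengine sketch for comparison; the Car document and its fields are illustrative:

import datetime

from mongoengine import DateTimeField, Document, StringField, connect

connect('my_database')  # pass host=... to point at a hosted MongoDB

class Car(Document):
    brand = StringField(required=True)
    model = StringField()
    date = DateTimeField(default=datetime.datetime.utcnow)

Car(brand='Ford', model='Mustang').save()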
I did this recently with Flask. (https://github.com/HexIce/flask-heroku-sqlalchemy).
There are a couple of gotchas:
1. If you don't use Django you may have to set up your database yourself by doing:
heroku addons:add shared-database
(Or whichever database you want to use, the others cost money.)
2. The database URL is stored in Heroku in the "DATABASE_URL" environment variable.
In Python you can get it by doing:
dburl = os.environ['DATABASE_URL']
What you do to connect to the database from there is up to you; one option is SQLAlchemy.
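A minimal SQLAlchemy sketch against DATABASE_URL. One gotcha worth knowing: Heroku sets the URL with a postgres:// scheme, which SQLAlchemy 1.4+ no longer accepts, so it has to be rewritten to postgresql:// first:

import os

from sqlalchemy import create_engine, text

db_url = os.environ['DATABASE_URL'].replace('postgres://', 'postgresql://', 1)
engine = create_engine(db_url)

with engine.connect() as conn:
    print(conn.execute(text('SELECT 1')).scalar())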
Create a standalone Heroku Postgres database. http://postgres.heroku.com
