I have a distributed MongoDB 3.2 database. The system is deployed on two nodes running the Red Hat operating system.
Using Python and the PyMongo driver (or another one), I want to enable sharding of a collection, specifying a compound shard key.
In the mongo shell this works:
> use mongotest
> sh.enableSharding("mongotest")
> db.signals.createIndex({ valueX: 1, valueY: 1 }, { unique: true })
> sh.shardCollection("mongotest.signals", { valueX: 1, valueY: 1 })
('mongotest' is the DB, and 'signals' is the collection)
I want to run the last two lines from within my code. Does anyone know if this is possible in Python? If so, how is it done?
Thank you very much, and sorry for my bad English.
A direct translation of your shell commands to Python is shown below:
from pymongo import MongoClient
client = MongoClient()
db = client.admin  # run commands against the admin database
# Pass the command name and its value first so the command document stays correctly ordered.
db.command('enableSharding', 'mongotest')
db.command('shardCollection', 'mongotest.signals', key={'valueX': 1, 'valueY': 1})
However, you may want to confirm that both enableSharding and shardCollection are exposed in your db by running
db.command('listCommands')
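The shell snippet above also created a unique compound index before sharding; if you want to do that step from Python too, the PyMongo equivalent is a small sketch like this (assuming the same database and collection names):
from pymongo import ASCENDING, MongoClient
client = MongoClient()
# Same as db.signals.createIndex({valueX: 1, valueY: 1}, {unique: true}) in the shell.
client.mongotest.signals.create_index(
    [('valueX', ASCENDING), ('valueY', ASCENDING)],
    unique=True,
)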
Correct me if I'm wrong, but my understanding of UDFs in Snowpark is that you can send the function from your IDE and it will be executed inside Snowflake. I have a database file called GeoLite2-City.mmdb staged in an S3 bucket on my Snowflake account, and I would like to use it to retrieve information about an IP address. So my strategy was to:
1. Register a UDF from my IDE (PyCharm) that would return a response string.
2. Create a main function that would simply query the database for the IP address and give me a response.
The problem is: how can the UDF and my code see the staged file at
s3://path/GeoLite2-City.mmdb
in my bucket? In my case I simply referenced it by name (with geoip2.database.Reader('GeoLite2-City.mmdb') as reader:), assuming it will eventually be found, since stage_location='@AWS_CSV_STAGE' is the same stage where the UDF will be saved. But I'm not sure I understand correctly what the stage_location option refers to exactly.
At the moment I get the following error:
"Cannot add package geoip2 because Anaconda terms must be accepted by ORGADMIN to use Anaconda 3rd party packages. Please follow the instructions at https://docs.snowflake.com/en/developer-guide/udf/python/udf-python-packages.html#using-third-party-packages-from-anaconda."
Am I importing geoip2.database correctly in order to use it with Snowpark and a UDF?
Do I import it by writing session.add_packages('geoip2')?
Thank you for clearing my doubts.
The instructions I'm following for geoip2 are here:
https://geoip2.readthedocs.io/en/latest/
My code:
from snowflake.snowpark import Session
import geoip2.database
from snowflake.snowpark.functions import col
import logging
from snowflake.snowpark.types import IntegerType, StringType
logger = logging.getLogger()
logger.setLevel(logging.INFO)
session = None
user = '*********'
password = '*********'
account = '*********'
warehouse = '*********'
database = '*********'
schema = '*********'
role = '*********'
print("Connecting")
cnn_params = {
    "account": account,
    "user": user,
    "password": password,
    "warehouse": warehouse,
    "database": database,
    "schema": schema,
    "role": role,
}
def first_udf():
    with geoip2.database.Reader('GeoLite2-City.mmdb') as reader:
        response = reader.city('203.0.113.0')
        print('response.country.iso_code')
        return response
try:
    print('session..')
    session = Session.builder.configs(cnn_params).create()
    session.add_packages('geoip2')
    session.udf.register(
        func=first_udf,
        return_type=StringType(),
        input_types=[StringType()],
        is_permanent=True,
        name='SNOWPARK_FIRST_UDF',
        replace=True,
        stage_location='@AWS_CSV_STAGE',
    )
    session.sql('SELECT SNOWPARK_FIRST_UDF').show()
except Exception as e:
    print(e)
finally:
    if session:
        session.close()
        print('connection closed..')
    print('done.')
UPDATE
I'm trying to solve it using a Java UDF, since I already have the 'geoip2-2.8.0.jar' library staged in my staging area. If I could import its methods to get the country of an IP, that would be perfect; the problem is that I don't know exactly how to do it. I'm trying to follow these instructions: https://maxmind.github.io/GeoIP2-java/.
I want to query the database, get the ISO code of the country as output, and do it from a Snowflake worksheet.
CREATE OR REPLACE FUNCTION GEO()
returns varchar not null
language java
imports = ('@AWS_CSV_STAGE/lib/geoip2-2.8.0.jar', '@AWS_CSV_STAGE/geodata/GeoLite2-City.mmdb')
handler = 'test'
as
$$
def test():
    File database = new File("geodata/GeoLite2-City.mmdb");
    DatabaseReader reader = new DatabaseReader.Builder(database).build();
    InetAddress ipAddress = InetAddress.getByName("128.101.101.101");
    CityResponse response = reader.city(ipAddress);
    Country country = response.getCountry();
    System.out.println(country.getIsoCode());
$$;
SELECT GEO();
This will be more complicated than it looks:
To use session.add_packages('geoip2') in Snowflake you need to accept the Anaconda terms. This is easy if you can ask your account admin.
But even then, you can only get the packages that Anaconda has added to Snowflake. The list is at https://repo.anaconda.com/pkgs/snowflake/, and I don't see geoip2 there yet.
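You can also check from inside Snowflake whether a package has been added to the channel; a quick sketch, reusing the Snowpark session from your question:
# Lists what the Anaconda channel currently exposes for Python; if geoip2 is
# missing here, add_packages('geoip2') cannot work even after accepting the terms.
session.sql(
    "select * from information_schema.packages "
    "where language = 'python' and package_name = 'geoip2'"
).show()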
So you will need to package your own Python code (until Anaconda sees enough requests for geoip2 in the wishlist). I described the process here: https://medium.com/snowflake/generating-all-the-holidays-in-sql-with-a-python-udtf-4397f190252b.
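In rough terms, that route can look like the sketch below for a pure-Python dependency: zip it up, ship the zip with the UDF, and import it from inside the handler. The names (my_pkg.zip, my_pkg, do_something, MY_PKG_UDF) are placeholders, not a working geoip2 setup:
from snowflake.snowpark.types import StringType

# Ship the local zip together with the UDF; it is copied to the UDF's import directory.
session.add_import('my_pkg.zip')

def my_udf(value: str) -> str:
    import os
    import sys
    # Snowflake exposes the directory where imports are copied at run time.
    import_dir = sys._xoptions["snowflake_import_directory"]
    sys.path.append(os.path.join(import_dir, 'my_pkg.zip'))  # zipimport the archive
    import my_pkg  # hypothetical pure-Python package inside the zip
    return my_pkg.do_something(value)

session.udf.register(
    func=my_udf,
    return_type=StringType(),
    input_types=[StringType()],
    name='MY_PKG_UDF',
    is_permanent=True,
    stage_location='@AWS_CSV_STAGE',
    replace=True,
)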
But wait! GeoIP2 is not pure Python, so you will also need to wait until Anaconda packages the C extension libmaxminddb. And that will be harder, since their docs don't offer a straightforward path like other pip-installable C libraries.
So this will be complicated.
There are alternative paths, like a commercial provider of this functionality (as I describe here: https://medium.com/snowflake/new-in-snowflake-marketplace-monetization-315aa90b86c).
There are other approaches to get this done without using a paid dataset, but I haven't written about those yet - someone else might before I get to it.
Btw, years ago I wrote something like this for BigQuery (https://cloud.google.com/blog/products/data-analytics/geolocation-with-bigquery-de-identify-76-million-ip-addresses-in-20-seconds), but today I was notified that Google recently deleted the tables that I had shared with the world (https://twitter.com/matthew_hensley/status/1598386009129058315).
So it's time to rebuild in Snowflake. But who (me?) and when is still a question.
Background:
Python 3.7, Mongo 4.4, Ubuntu 20.04
I want to add a sequence number (unique integer) to new documents in my application. I'm holding a counter in the DB and using it as the unique number, increasing it after each new document.
Since my application is multi-processed, I sync this counter between processes using PyMongo's find_and_modify function, which is atomic (see Atomicity and Transactions).
Here is a simplified example of the code:
from pymongo import MongoClient
client = MongoClient('mongodb://localhost:27017/')
db = client['data']
collection = db.counters
query = {'_id': 'sequence_data'}
update = {'$inc':{ 'counter':1}}
updated_doc = collection.find_and_modify(query, update)
return updated_doc['counter']
Problem:
The findAndModify function is deprecated (see the PyMongo 4.0 changelog), and following the guidelines in the PyMongo 4 Migration Guide, I tried to use the find_one_and_update function, but it is not atomic for the find part.
This can cause a race condition between 2 processes that execute find_one_and_update and both read the same counter value before incrementing it.
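A sketch of what that find_one_and_update attempt looks like (simplified the same way as the snippet above):
from pymongo import MongoClient, ReturnDocument

client = MongoClient('mongodb://localhost:27017/')
collection = client['data'].counters

# Increment the counter and get the post-increment document back.
updated_doc = collection.find_one_and_update(
    {'_id': 'sequence_data'},
    {'$inc': {'counter': 1}},
    return_document=ReturnDocument.AFTER,
)
sequence_number = updated_doc['counter']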
Question:
So my basic question is: how do I make the read part of my logic atomic across multiple processes/threads?
I have some code which updates a record in a Cosmos DB container (see the simplified snippet below). However, there are other independent processes that also update the same record from other systems. In the example below, I would like the upsert_item() to be a no-op if the same record in the container has already been updated to a particular "final" state. One way to solve this is to read the value before each update, but that is a bit too expensive. Is there a simple way to turn the upsert_item() into a no-op based on some server-side trigger? Any pointers would be appreciated.
client = CosmosClient(<end_pt>, <key>)
database_name = "cosmosdb"
container_name = "solar_system"
db_client = client.get_database_client(database_name)
db_container = db_client.get_container_client(container_name)
uid, planet, state = get_planetary_config()
# How can I make this following update a no-op depending on current state in the database?
json_data = {"id": str(uid), "planet": planet, "state": state}
db_container.upsert_item(body=json_data)
As far as I know, the Cosmos DB server-side trigger does not meet your need. It is invoked to execute a pre-function or post-function, not to judge whether the document meets some condition.
So, updates with specific conditions are not supported by Cosmos DB natively. You need to read the value and check the condition before your update operation.
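A minimal sketch of that read-then-check approach, reusing the container client from your snippet (the "final" state value and the partition key being the planet field are assumptions):
from azure.cosmos import exceptions

FINAL_STATE = "final"  # placeholder for your terminal state

try:
    # Assumes the container is partitioned on /planet.
    current = db_container.read_item(item=str(uid), partition_key=planet)
except exceptions.CosmosResourceNotFoundError:
    current = None

# Skip the write when the stored record has already reached the final state.
if current is None or current.get("state") != FINAL_STATE:
    db_container.upsert_item(body=json_data)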
I have a problem: I want to connect to 2 databases in Python at the same time. The two databases are identical. At any given time, one of the two is the primary and the other is the secondary. If the primary fails, the secondary takes over.
I have no idea where to start. Creating 2 databases is not the problem, but handling them in Python is the challenge. Thank you guys!
First off, you have to install the MySQL driver for Python. Once installed, you can create two connections this way:
import mysql.connector
con1 = mysql.connector.connect(user='scott', password='tiger', host='127.0.0.1',database='example')
con2 = mysql.connector.connect(user='scott', password='tiger', host='127.0.0.1',database='example2')
connections = [con1, con2]
Once this is done, you can try querying one of them and, if something goes wrong, query the other one, like this:
for connection in connections:
    try:
        runQuery(connection)
        break
    except:
        continue
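runQuery is just a placeholder here; a minimal version might look like this (the query itself is only an example):
def runQuery(connection):
    # Any failure (dead connection, bad query) propagates, so the loop above
    # falls through to the next connection in the list.
    cursor = connection.cursor()
    cursor.execute("SELECT 1")
    result = cursor.fetchall()
    cursor.close()
    return result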
When updating a document in MongoDB using a search-style update, is it possible to get back the _id of the document(s) updated?
For example:
import pymongo
client = pymongo.MongoClient('localhost', 27017)
db = client.test_database
col = db.test_col
col.insert({'name':'kevin', 'status':'new'})
col.insert({'name':'brian', 'status':'new'})
col.insert({'name':'matt', 'status':'new'})
col.insert({'name':'stephen', 'status':'new'})
info = col.update({'status':'new'}, {'$set':{'status':'in_progress'}}, multi=False)
print info
# {u'updatedExisting': True, u'connectionId': 1380, u'ok': 1.0, u'err': None, u'n': 1}
# I want to know the _id of the document that was updated.
I have multiple threads accessing the database collection and want to be able to mark a document as being acted upon. Getting the document first and then updating by Id is not a good answer, because two threads may "get" the same document before it is updated. The application is a simple asynchronous task queue (yes, I know we'd be better off with something like Rabbit or ZeroMQ for this, but adding to our stack isn't possible right now).
You can use pymongo.collection.find_and_modify. It is a wrapper around the MongoDB findAndModify command and can return either the original (by default) or the modified document.
info = col.find_and_modify({'status':'new'}, {'$set':{'status':'in_progress'}})
if info:
print info.get('_id')
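If you are on a newer PyMongo where find_and_modify has been removed, find_one_and_update is the equivalent and also returns the matched document; a quick sketch with the same collection:
from pymongo import ReturnDocument

# Atomically claims one 'new' document and returns it (including its _id).
doc = col.find_one_and_update(
    {'status': 'new'},
    {'$set': {'status': 'in_progress'}},
    return_document=ReturnDocument.BEFORE,  # default: the pre-update document
)
if doc:
    print(doc['_id'])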