MongoDB misses some insert triggers - Python

I am trying to build an architecture that avoids HTTPS requests, which take a minimum of about 100 ms to deliver a request to the backend server, so I decided to use Mongo triggers (change streams) instead. Delivery now takes about 20 ms, but some requests are missed. I have 2 scripts:
To test the Mongo architecture
To listen to the Mongo trigger and process the request
Architecture Stats

No of requests | Time Taken | Processed Requests | Missed Requests | Docker Swarm Replicas
1000           | 949.51     | 996                | 4               | 1
3000           | 4387.24    | 2948               | 52              | 1
5000           | 10051.78   | 4878               | 122             | 1
Worker Node Code:
from pymongo import MongoClient
import pymongo
import socket
import logging
import time
from datetime import datetime

connectionString = "mongodb://ml1:27017,ml1:27018,ml1:27019/magic_shop?replicaSet=mongodb-replicaset"
client = MongoClient(connectionString)
db = client.magic_shop  # magic_shop is my database
col = db.req            # req is my collection

machine_name = socket.gethostname()

def processing_request(request):
    '''
    Description:
        Gets the string, takes its length and runs a loop 10 times the string length,
        then the response is set to processed.
    Input:
        request (str): Random string
    Output:
        Sets the value back in MongoDB
        response: Processed
    '''
    for i in range(0, len(request) * 10):
        pass

try:
    with db.req.watch([{'$match': {'operationType': 'insert'}}]) as stream:
        for values in stream:
            request_id = values['fullDocument']['request_id']
            request = values['fullDocument']['request']
            myquery = {"request_id": request_id}
            # Checking if not processed by another replica (blocking system)
            if len(list(col.find({'request_id': request_id, 'CSR_NAME': {"$exists": False}}))):
                newvalues = {"$set": {"CSR_NAME": machine_name, "response": "Processing", "processing_time": datetime.today()}}
                # CSR responding to the client
                col.update_one(myquery, newvalues)
                print(request_id)
                print(f"Processing {request_id}")
                # Now processing the request
                processing_request(request)
                # Updating that the work is done
                myquery = {"request_id": request_id}
                newvalues = {"$set": {"response": "Request Processed Have a nice Day sir!", 'response_time': datetime.today()}}
                # CSR responding to the client
                col.update_one(myquery, newvalues)
except pymongo.errors.PyMongoError:
    # The ChangeStream encountered an unrecoverable error or the
    # resume attempt failed to recreate the cursor.
    logging.error('...')
Testing Code:
from pymongo import MongoClient
from PIL import Image
import io, random
import matplotlib.pyplot as plt
from datetime import datetime

# Same replica set as the worker script
connectionString = "mongodb://ml1:27017,ml1:27018,ml1:27019/magic_shop?replicaSet=mongodb-replicaset"
client = MongoClient(connectionString)
db = client.magic_shop  # magic_shop is my database
col = db.req            # req is my collection

# requests is a list of random request strings (full definition in the GitHub repo linked below)
requests = ["example request string"]

burst = 5000
count = 0
for i in range(0, burst):
    random_id = int(random.random() * 100000)
    image = {
        'request_id': random_id,
        'request': random.choice(requests),
        'request_time': datetime.today()
    }
    image_id = col.insert_one(image).inserted_id
    count += 1
    print(f"Request Done {count}")
You can also get the complete code on GitHub:
https://github.com/SohaibAnwaar/Docker-swarm

Related

Databricks to Cosmos DB: uploading data is very slow

I have the code below. The input file is a CSV of around 9 GB with millions of rows. The code has been executing for 5 days and is still not complete. Is there any way to speed up uploading the data to Cosmos DB?
import json
import logging
import sys
import azure.cosmos.cosmos_client as cosmos_client
import azure.cosmos.exceptions as exceptions
from azure.cosmos.partition_key import PartitionKey
from typing import Optional

configs = {
    "dev": {
        "file_location": "/FileStore/tables/docs/dpidata_pfile_20050523-20221009.csv",
        "file_type": "csv",
        "infer_schema": False,
        "first_row_is_header": True,
        "delimiter": ",",
        "cdb_url": "https://xyxxxxxxxxxxxxxxx:443/",
        "db_name": "abc",
        "container_name": "dpi",
        "partition_key": "/dpi"
    },
    "stg": {},
    "prd": {}
}

class LoadToCdb():
    # __init__ (which sets self.configs, self.log and the self.cosmos_db container client) is not shown in this snippet

    def dpi_data_load(self) -> Optional[bool]:
        try:
            # The applied options are for CSV files. For other file types, these will be ignored.
            df = spark.read.format(self.configs["file_type"]) \
                .option("inferSchema", self.configs["infer_schema"]) \
                .option("header", self.configs["first_row_is_header"]) \
                .option("sep", self.configs["delimiter"]) \
                .load(self.configs["file_location"])

            df = df.select('dpi', 'Entity Type Code')
            df = (df.withColumnRenamed("dpi", "dpi")
                    .withColumnRenamed("Entity Type Code", "entity_type_code"))

            df_json = df.toJSON()
            for row in df_json.collect():
                print(row)
                data = json.loads(row)
                data.setdefault('dpi', None)
                data["id"] = data["dpi"]
                # this method call uploads the document to Cosmos DB
                self.cosmos_db.create_items(data)
        except Exception as e:
            self.log.error("Could not load to Cosmos DB from the CSV file")
            self.log.error(e)

load_to_cdb = LoadToCdb()
load_to_cdb.dpi_data_load()

How to load Elasticsearch data in Python using scroll?

I have an index in Elasticsearch that holds a huge amount of data. I am trying to load some of its data (more than 10,000 records) into Python for further processing. According to the documentation and web searches, scroll is what should be used, but it is able to fetch only a few records. After some time this exception occurs:
errorNotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [101781]')
My code is as follows:
from elasticsearch import Elasticsearch

# elastic configuration
host = 'localhost'
port = 9200
user = ''
pasw = ''
el_index_name = 'test'

es = Elasticsearch([{'host': host, 'port': port}], http_auth=(user, pasw))
res = es.search(index=el_index_name, body={"query": {"match_all": {}}}, scroll='10m')

rows = []
while True:
    try:
        rows.append(es.scroll(scroll_id=res['_scroll_id'])['hits']['hits'])
    except Exception as esl:
        print('error{}'.format(esl))
        break

# deleting scroll
es.clear_scroll(scroll_id=res['_scroll_id'])
I have changed the value of scroll='10m', but this exception still occurs.
You need to change your scroll request line to this:
rows.append(es.scroll(scroll_id=res['_scroll_id'], body={"scroll": "10m","scroll_id": res['_scroll_id']})['hits']['hits'])
As a piece of advice, it is better to increase the number of documents retrieved per request. Retrieving just one document in each request hurts your performance and adds overhead for your cluster as well. As an example:
{
    "query": {
        "match_all": {}
    },
    "size": 100
}
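In the Python client used above, that size can simply be passed in the search body along with the query; a minimal sketch, assuming the same local cluster and test index as the question:

from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# Ask for 100 hits per page instead of the default and keep the scroll context open for 10 minutes
res = es.search(
    index='test',
    body={"query": {"match_all": {}}, "size": 100},
    scroll='10m'
)
print(len(res['hits']['hits']))  # up to 100 hits in the first page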
I have added the part below to answer the question in the comments. It is not stopping because you have put while True in your code. You need to change it to this:
res = es.search(index=el_index_name, body={"query": {"match_all": {}}}, scroll='10m')
scroll_id = res['_scroll_id']
query = {
    "scroll": "10m",
    "scroll_id": scroll_id
}

rows = []
while len(res['hits']['hits']):
    for item in res['hits']['hits']:
        rows.append(item)
    res = es.scroll(scroll_id=scroll_id, body=query)
Please let me know if there was any problem with this.

How to pass a CustomDataAsset to a DataContext to run custom expectations on a batch?

I have a CustomPandasDataset with a custom expectation
from great_expectations.data_asset import DataAsset
from great_expectations.dataset import PandasDataset
from datetime import date, datetime, timedelta

class CustomPandasDataset(PandasDataset):
    _data_asset_type = "CustomPandasDataset"

    @DataAsset.expectation(["column", "datetime_match", "datetime_diff"])
    def expect_column_max_value_to_match_datetime(self, column: str, datetime_match: datetime = None, datetime_diff: tuple = None) -> dict:
        """
        Check if data is constantly updated by matching the max datetime column to a
        datetime value or to a datetime difference.
        """
        max_datetime = self[column].max()

        if datetime_match is None:
            from datetime import date
            datetime_match = date.today()

        if datetime_diff:
            from datetime import timedelta
            success = (datetime_match - timedelta(*datetime_diff)) <= max_datetime <= datetime_match
        else:
            success = (max_datetime == datetime_match)

        result = {
            "data_max_value": max_datetime,
            "expected_max_value": str(datetime_match),
            "expected_datetime_diff": datetime_diff
        }

        return {
            "success": success,
            "result": result
        }
I want to run the expectation expect_column_max_value_to_match_datetime on a given pandas dataframe:
expectation_suite_name = "df-raw-expectations"
suite = context.create_expectation_suite(expectation_suite_name, overwrite_existing=True)
df_ge = ge.from_pandas(df, dataset_class=CustomPandasDataset)
batch_kwargs = {'dataset': df_ge, 'datasource': 'df_raw_datasource'}
# Get batch of data
batch = context.get_batch(batch_kwargs, suite)
which I get from a DataContext, now when I run expectations on this batch
datetime_diff = 4,
batch.expect_column_max_value_to_match_datetime(column='DATE', datetime_diff=datetime_diff)
I get the following error:
AttributeError: 'PandasDataset' object has no attribute 'expect_column_max_value_to_match_datetime'
According to the docs, I've specified the dataset_class=CustomPandasDataset attribute when constructing the Great Expectations dataset; indeed, running the expectations on df_ge works, but not on the batch of data.
According to the docs
To use custom expectations in a datasource or DataContext, you need to define the custom DataAsset in the datasource configuration or batch_kwargs for a specific batch.
so pass CustomPandasDataset through the data_asset_type parameter of the get_batch() function:
# Get batch of data
batch = context.get_batch(batch_kwargs, suite, data_asset_type=CustomPandasDataset)
or define it in the context configuration:
from great_expectations.data_context.types.base import DataContextConfig
from great_expectations.data_context import BaseDataContext

data_context_config = DataContextConfig(
    ...
    datasources={
        "sales_raw_datasource": {
            "data_asset_type": {
                "class_name": "CustomPandasDataset",
                "module_name": "custom_dataset",
            },
            "class_name": "PandasDatasource",
            "module_name": "great_expectations.datasource",
        }
    },
    ...
)
context = BaseDataContext(project_config=data_context_config)
where CustomPandasDataset is available from the module/script custom_dataset.py
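With either approach, the batch returned by the DataContext is built as a CustomPandasDataset, so the custom expectation from the question becomes available on it. A rough usage sketch, reusing the batch_kwargs, suite and DATE column from the question:

# Get a batch whose data asset type is the custom class
batch = context.get_batch(batch_kwargs, suite, data_asset_type=CustomPandasDataset)

# The custom expectation can now be called on the batch
datetime_diff = (4,)  # tuple that is unpacked into timedelta(*datetime_diff)
print(batch.expect_column_max_value_to_match_datetime(column='DATE', datetime_diff=datetime_diff))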

Poloniex API using requests, returnBalances works, but returnTradeHistory does not

I'm trying to access the Poloniex API using requests.
The returnBalances code works, but the returnTradeHistory code does not.
The returnTradeHistory is commented out in the example.
Data is returned for returnBalances but not for returnTradeHistory.
I know the whole APIKey and secret code is working because I am getting accurate returnBalances data.
So why is returnTradeHistory not working?
from time import time
import urllib.parse
import hashlib
import hmac
import requests
import json

APIKey = b"stuff goes in here"
secret = b"stuff goes in here"

url = "https://poloniex.com/tradingApi"

# this works and returns data
payload = {
    'command': 'returnBalances',
    'nonce': int(time() * 1000),
}

# this does not work and does not return data
# payload = {
#     'command': 'returnTradeHistory',
#     'currencyPair': 'BTC_MAID',
#     'nonce': int(time() * 1000),
# }

paybytes = urllib.parse.urlencode(payload).encode('utf8')
sign = hmac.new(secret, paybytes, hashlib.sha512).hexdigest()

headers = {
    'Content-Type': 'application/x-www-form-urlencoded',
    'Key': APIKey,
    'Sign': sign,
}

r = requests.post(url, data=paybytes, headers=headers)
fulldata = r.content
data = json.loads(fulldata)
print(data)
According to the official Poloniex API documentation:
returnTradeHistory
Returns the past 200 trades for a given market, or up to 50,000 trades between a range specified in UNIX timestamps by the "start" and "end" GET parameters [...]
so it is required to specify the start and end parameters, e.g.:
https://poloniex.com/public?command=returnTradeHistory&currencyPair=BTC_NXT&start=1410158341&end=1410499372
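Applied to the question's code, that means adding start and end UNIX timestamps to the commented-out payload; a sketch under that assumption (the time range here is just an example):

from time import time

# Example range: the last 24 hours (start/end are UNIX timestamps in seconds)
end = int(time())
start = end - 24 * 60 * 60

payload = {
    'command': 'returnTradeHistory',
    'currencyPair': 'BTC_MAID',
    'start': start,
    'end': end,
    'nonce': int(time() * 1000),
}
# ...then urlencode, sign and POST exactly as in the returnBalances example above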

PyMongo query not returning results although the same query returns results in the MongoDB shell

import pymongo

uri = 'mongodb://127.0.0.1:27017'
client = pymongo.MongoClient(uri)
db = client.TeamCity
students = db.students.find({})
for student in students:
    print(student)
Python Result:
Blank
MongoDB results:
db.students.find({})
{ "_id" : ObjectId("5788483d0e5b9ea516d4b66c"), "name" : "Jose", "mark" : 99 }
{ "_id" : ObjectId("57884cb3f7edc1fd01c3511e"), "name" : "Jordan", "mark" : 100
}
import pymongo
uri = 'mongodb://127.0.0.1:27017'
client = pymongo.MongoClient(uri)
db = client.TeamCity
students = db.students.find({})
print (students.count())
Python Result:
0
MongoDB results:
db.students.find({}).count()
2
What am I missing?
For
import pymongo
uri = 'mongodb://127.0.0.1:27017'
client = pymongo.MongoClient(uri)
db = client.TeamCity
students = db.students.find({})
print (students)
Python Result:
So I think it is able to connect to the db successfully but not returning results
Try your pymongo code like so, i.e. changing TeamCity to Teamcity (MongoDB database names are case-sensitive):
Print all students:
import pymongo

uri = 'mongodb://127.0.0.1:27017'
client = pymongo.MongoClient(uri)
db = client.Teamcity
students = db.students.find({})
for student in students:
    print(student)
Count all students:
import pymongo
uri = 'mongodb://127.0.0.1:27017'
client = pymongo.MongoClient(uri)
db = client.Teamcity
students = db.students.find({})
print (students.count())
I know this question has been answered long ago, but I ran into the same kind of problem today and it happened to have a different reason, so I'm adding an answer here.
Code working on shell:
> db.customers.find({"cust_id": 2345}, {"pending_questions": 1, _id: 0})
{ "pending_questions" : [ 1, 5, 47, 89 ] }
Code not working in PyMongo (cust_id set through a web form):
db.customers.find({"cust_id": cust_id}, {"pending_questions": 1, "_id": 0})
It turned out that the numbers in the shell were being interpreted as ints, whereas the numbers used in the Python code were being passed by PyMongo as floats, and hence returned no matches. This proves the point:
cust_id = int(request.args.get('cust_id'))
db.customers.find({"cust_id": cust_id}, {"pending_questions": 1, "_id": 0})
which produces the result:
[1.0, 5.0, 47.0, 89.0]
The simple solution was to typecast everything to int in the Python code. In conclusion, the data type inferred by the shell may differ from the data type inferred by PyMongo, and this may be one reason a find query that returns results in the shell returns nothing when run through PyMongo.
