How to improve performance of pymongo queries - python

I inherited an old Mongo database. Let's focus on the following two collections (removed most of their content for better readability):
Collection user
db.user.find_one({"email": "user@host.com"})
{'lastUpdate': datetime.datetime(2016, 9, 2, 11, 40, 13, 160000),
 'creationTime': datetime.datetime(2016, 6, 23, 7, 19, 10, 6000),
 '_id': ObjectId('576b8d6ee4b0a37270b742c7'),
 'email': 'user@host.com'}
Collection entry (one user to many entries):
db.entry.find_one({"userId": _id})
{'date_entered': datetime.datetime(2015, 2, 7, 0, 0),
 'creationTime': datetime.datetime(2015, 2, 8, 14, 41, 50, 701000),
 'lastUpdate': datetime.datetime(2015, 2, 9, 3, 28, 2, 115000),
 '_id': ObjectId('54d775aee4b035e584287a42'),
 'userId': '576b8d6ee4b0a37270b742c7',
 'data': 'test'}
As you can see, there is no DBRef between the two.
What I would like to do is to count the total number of entries, and the number of entries updated after a given date.
To do this I used Python's pymongo library. The code below gets me what I need, but it is painfully slow.
import time
from datetime import datetime

from pymongo import MongoClient

client = MongoClient('mongodb://foobar/')
db = client.userdata

# First I need to fetch all user ids. Otherwise the db cursor will time out
# after some time.
user_ids = []  # build a list of (email, id) tuples
for user in db.user.find():
    user_ids.append((user['email'], str(user['_id'])))

date = datetime(2016, 1, 1)
for email, _id in user_ids:
    t0 = time.time()
    query = {"userId": _id}
    no_of_all_entries = db.entry.find(query).count()
    query = {"userId": _id, "lastUpdate": {"$gte": date}}
    no_of_entries_this_year = db.entry.find(query).count()
    t1 = time.time()
    print("delay ", round(t1 - t0, 2))
    print(email, no_of_all_entries, no_of_entries_this_year)
It takes around 0.83 seconds to run both db.entry.find queries on my laptop, and 0.54 seconds on an AWS server (not the MongoDB server).
With ~20,000 users it takes a painful 3 hours to get all the data.
Is that the kind of latency you'd expect to see in Mongo? What can I do to improve this? Bear in mind that MongoDB is fairly new to me.

Instead of running the two counts for every user separately, you can get both aggregates for all users at once with db.collection.aggregate().
And instead of a list of (email, userId) tuples, we build a dictionary, since that makes it easier to look up the corresponding email.
user_emails = {str(user['_id']): user['email'] for user in db.user.find()}
date = datetime(2016, 1, 1)
entry_counts = db.entry.aggregate([
    {"$group": {
        "_id": "$userId",
        "count": {"$sum": 1},
        "count_this_year": {
            "$sum": {
                "$cond": [{"$gte": ["$lastUpdate", date]}, 1, 0]
            }
        }
    }}
])
for entry in entry_counts:
    print(user_emails.get(entry['_id']),
          entry['count'],
          entry['count_this_year'])
I'm pretty sure getting the user's email address into the result could be done as well, but I'm not a mongo expert either.
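For completeness, here is a rough sketch of one way that could look, using $lookup so the email comes back with each group. It assumes MongoDB 4.0 or newer, because entry.userId is stored as a string while user._id is an ObjectId, so $toString is needed for the join:

# A sketch (untested, assumes MongoDB >= 4.0 for $toString): join the email
# inside the pipeline itself instead of via the user_emails dict.
entry_counts = db.entry.aggregate([
    {"$group": {
        "_id": "$userId",
        "count": {"$sum": 1},
        "count_this_year": {
            "$sum": {"$cond": [{"$gte": ["$lastUpdate", date]}, 1, 0]}
        }
    }},
    {"$lookup": {
        "from": "user",
        "let": {"uid": "$_id"},
        "pipeline": [
            {"$match": {"$expr": {"$eq": [{"$toString": "$_id"}, "$$uid"]}}},
            {"$project": {"_id": 0, "email": 1}}
        ],
        "as": "user"
    }}
])

for entry in entry_counts:
    email = entry["user"][0]["email"] if entry["user"] else None
    print(email, entry["count"], entry["count_this_year"])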

Related

Get new item uploaded in AWS DynamoDB table streaming to Lambda using python

I created a framework that saves information in a DynamoDB table, and I need the last uploaded item. The DynamoDB table looks like this:
{'Table': {'AttributeDefinitions': [{'AttributeName': 'ID',
'AttributeType': 'S'}],
'TableName': 'Bulk_query_database',
'KeySchema': [{'AttributeName': 'ID', 'KeyType': 'HASH'}],
'TableStatus': 'ACTIVE',
'CreationDateTime': datetime.datetime(2022, 10, 6, 21, 58, 20, 293000, tzinfo=tzlocal()),
'ProvisionedThroughput': {'LastDecreaseDateTime': datetime.datetime(2022, 10, 6, 22, 8, 40, 735000, tzinfo=tzlocal()),
'NumberOfDecreasesToday': 0,
'ReadCapacityUnits': 1,
'WriteCapacityUnits': 1},
'TableSizeBytes': 59,
'ItemCount': 1,
So far I have connected a DynamoDB stream as a trigger for a Lambda function; this function should print the last element inserted in the table. The query I'm using is this:
import boto3
from boto3.dynamodb.conditions import Key

dynamodb_resource = boto3.resource('dynamodb', region_name="us-east-1")
table = dynamodb_resource.Table('Bulk_query_database')
response = table.query(KeyConditionExpression=Key('ID').eq('I CANT MAKE IT WORK UNLESS I USE THE ID STRING'))
items = response['Items']
print(items)
There is no response using this query. What can I do to get the last element loaded into the table?
Your question isn't entirely clear. From my understanding, you want to query the item stored in DynamoDB as shown above. To do this you need to set the value of ID:
response = table.query(KeyConditionExpression=Key('ID').eq('3P3AD596A'))
items = response['Items']
In DynamoDB, items are stored based on the key (in your case ID), and to retrieve an item you need to pass its key value in your request.
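Separately, since the question mentions a DynamoDB stream wired as a Lambda trigger: the newly inserted item arrives inside the Lambda event itself, so no table.query() is needed to see the last element. A minimal sketch of such a handler (assuming the stream is configured with a view type that includes new images, e.g. NEW_IMAGE):

# A sketch of the stream-triggered side: on INSERT events the new item is
# delivered in the event record itself, in DynamoDB JSON form,
# e.g. {'ID': {'S': '...'}, ...}.
def lambda_handler(event, context):
    for record in event['Records']:
        if record['eventName'] == 'INSERT':
            print(record['dynamodb']['NewImage'])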

Convert shell script to Python

aws ec2 describe-snapshots --owner-ids $AWS_ACCOUNT_ID \
  --query "Snapshots[?(StartTime<='$dtt')].[SnapshotId]" \
  --output text | tr '\t' '\n' | sort
I have this shell script which I want to convert to Python.
I tried looking at the boto3 documentation and came up with this
import os

import boto3

client = boto3.client('ec2')
client.describe_snapshots(OwnerIds=[os.environ['AWS_ACCOUNT_ID']])
But I can't figure out how to replicate that --query flag in Python.
I couldn't find it in the documentation.
What am I missing here?
You should ignore the --query portion and everything after it, and process that within Python instead.
First, store the result of the call in a variable:
ec2_client = boto3.client('ec2')
response = ec2_client.describe_snapshots(OwnerIds=['self'])
It will return something like:
{
    'NextToken': '',
    'Snapshots': [
        {
            'Description': 'This is my snapshot.',
            'OwnerId': '012345678910',
            'Progress': '100%',
            'SnapshotId': 'snap-1234567890abcdef0',
            'StartTime': datetime(2014, 2, 28, 21, 28, 32, tzinfo=tzutc()),
            'State': 'completed',
            'VolumeId': 'vol-049df61146c4d7901',
            'VolumeSize': 8,
        },
    ],
    'ResponseMetadata': {
        '...': '...',
    },
}
Therefore, you can use response['Snapshots'] to extract your desired results, for example:
from datetime import datetime, timezone

for snapshot in response['Snapshots']:
    # StartTime comes back timezone-aware, so compare against an aware datetime
    if snapshot['StartTime'] < datetime(2022, 6, 1, tzinfo=timezone.utc):
        print(snapshot['SnapshotId'])
It's really all Python at that point.
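Putting it all together, here is a sketch that mirrors the full shell pipeline (filter, extract, sort). The cutoff datetime stands in for $dtt, and a paginator is used in case the account has more snapshots than fit in one response:

# A sketch of the whole pipeline in Python; the cutoff value is a
# placeholder for $dtt from the shell script.
import os
from datetime import datetime, timezone

import boto3

ec2_client = boto3.client('ec2')
cutoff = datetime(2022, 6, 1, tzinfo=timezone.utc)

snapshot_ids = []
# describe_snapshots is paginated, so walk every page.
for page in ec2_client.get_paginator('describe_snapshots').paginate(
        OwnerIds=[os.environ['AWS_ACCOUNT_ID']]):
    snapshot_ids += [
        s['SnapshotId'] for s in page['Snapshots'] if s['StartTime'] <= cutoff
    ]

# Equivalent of `tr '\t' '\n' | sort`
for snapshot_id in sorted(snapshot_ids):
    print(snapshot_id)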

Django ORM how to get raw values grouped by a field

I have a model which is like so:
class CPUReading(models.Model):
    host = models.CharField(max_length=256)
    reading = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)
I am trying to get a result which looks like the following:
{
    "host 1": [
        {"created": DateTimeField(...), "value": 20},
        {"created": DateTimeField(...), "value": 40},
        ...
    ],
    "host 2": [
        {"created": DateTimeField(...), "value": 19},
        {"created": DateTimeField(...), "value": 10},
        ...
    ]
}
I need it grouped by host and ordered by created.
I have tried a bunch of things, including using values() and annotate() to produce a GROUP BY statement, but I think I must be missing something: to use GROUP BY it seems I need some aggregation function, which I don't really want. I need the actual values of the reading field, grouped by the host field and ordered by the created field.
This is more-or-less how any charting library needs the data.
I know I can make it happen with either Python code or raw SQL queries, but I'd much prefer to use the Django ORM, unless it explicitly disallows this sort of query.
As far as I'm aware, there's nothing in the ORM that makes this easy. If you want to do it in the ORM without raw queries, and if you're willing and able to change your data structure, you can solve this mostly in the ORM, with Python code kept to a minimum:
class Host(models.Model):
    pass


class CPUReading(models.Model):
    host = models.ForeignKey(Host, related_name="readings", on_delete=models.CASCADE)
    reading = models.IntegerField()
    created = models.DateTimeField(auto_now_add=True)
With this you can use two queries with fairly clean code:
from collections import defaultdict

results = defaultdict(list)
hosts = Host.objects.prefetch_related("readings")
for host in hosts:
    for reading in host.readings.all():
        results[host.id].append(
            {"created": reading.created, "value": reading.reading}
        )
Or you can do it a little more efficiently with one query and a single loop:
from collections import defaultdict

results = defaultdict(list)
readings = CPUReading.objects.select_related("host")
for reading in readings:
    results[reading.host.id].append(
        {"created": reading.created, "value": reading.reading}
    )
Assuming you are using PostgreSQL, you can use a combination of array_agg and json_object to achieve what you're after.
from django.contrib.postgres.aggregates import ArrayAgg
from django.contrib.postgres.fields import ArrayField, JSONField
from django.db.models import CharField
from django.db.models.expressions import Func, Value


class JSONObject(Func):
    function = 'json_object'
    output_field = JSONField()

    def __init__(self, **fields):
        fields, expressions = zip(*fields.items())
        super().__init__(
            Value(fields, output_field=ArrayField(CharField())),
            Func(*expressions, template='array[%(expressions)s]'),
        )


readings = dict(CPUReading.objects.values_list(
    'host',
    ArrayAgg(
        JSONObject(
            created='created',
            value='reading',
        ),
        ordering='created',
    ),
))
If you want to stay close to the Django ORM, just remember that this doesn't return a queryset but a dictionary, and that it is evaluated on the fly, so don't use it in declarative scope. The interface is similar to QuerySet.values(), with the additional requirement that the queryset needs to be sorted first.
import itertools

from django.db import models


class PlotQuerySet(models.QuerySet):
    def grouped_values(self, key_field, *fields, **expressions):
        if key_field not in fields:
            fields += (key_field,)
        values = self.values(*fields, **expressions)
        data = {}
        for key, gen in itertools.groupby(values, lambda x: x.pop(key_field)):
            data[key] = list(gen)
        return data


PlotManager = models.Manager.from_queryset(PlotQuerySet, class_name='PlotManager')


class CpuReading(models.Model):
    host = models.CharField(max_length=255)
    reading = models.IntegerField()
    created_at = models.DateTimeField(auto_now_add=True)

    objects = PlotManager()
Example:
CpuReading.objects.order_by(
    'host', 'created_at'
).grouped_values(
    'host', 'created_at', 'reading'
)
Out[10]:
{'a': [{'created_at': datetime.datetime(2020, 7, 13, 16, 45, 23, 215005, tzinfo=<UTC>),
        'reading': 0},
       {'created_at': datetime.datetime(2020, 7, 13, 16, 45, 23, 223080, tzinfo=<UTC>),
        'reading': 1},
       {'created_at': datetime.datetime(2020, 7, 13, 16, 45, 23, 230218, tzinfo=<UTC>),
        'reading': 2},
       ...],
 'b': [{'created_at': datetime.datetime(2020, 7, 13, 16, 45, 23, 241476, tzinfo=<UTC>),
        'reading': 0},
       {'created_at': datetime.datetime(2020, 7, 13, 16, 45, 23, 242015, tzinfo=<UTC>),
        'reading': 1},
       {'created_at': datetime.datetime(2020, 7, 13, 16, 45, 23, 242537, tzinfo=<UTC>),
        'reading': 2},
       ...]}

Django: annotate Sum Case When depending on the status of a field

In my application I need to get all transactions per day for the last 30 days.
In the transactions model I have a currency field, and I want to convert the value to euro if the chosen currency is GBP or USD.
models.py
class Transaction(TimeMixIn):
    COMPLETED = 1
    REJECTED = 2
    TRANSACTION_STATUS = (
        (COMPLETED, _('Completed')),
        (REJECTED, _('Rejected')),
    )

    user = models.ForeignKey(CustomUser)
    status = models.SmallIntegerField(choices=TRANSACTION_STATUS, default=COMPLETED)
    amount = models.DecimalField(default=0, decimal_places=2, max_digits=7)
    currency = models.CharField(max_length=3, choices=Core.CURRENCIES, default=Core.CURRENCY_EUR)
Until now this is what I've been using:
Transaction.objects.filter(
    created__gte=last_month, status=Transaction.COMPLETED
).extra(
    {"date": "date_trunc('day', created)"}
).values("date").annotate(amount=Sum("amount"))
which returns a queryset containing dictionaries with date and amount:
<QuerySet [{'date': datetime.datetime(2018, 6, 19, 0, 0, tzinfo=<UTC>), 'amount': Decimal('75.00')}]>
and this is what I tried now:
queryset = Transaction.objects.filter(
    created__gte=last_month, status=Transaction.COMPLETED
).extra(
    {"date": "date_trunc('day', created)"}
).values('date').annotate(
    amount=Sum(Case(
        When(currency=Core.CURRENCY_EUR, then='amount'),
        When(currency=Core.CURRENCY_USD, then=F('amount') * 0.8662),
        When(currency=Core.CURRENCY_GBP, then=F('amount') * 1.1413),
        default=0,
        output_field=FloatField()
    ))
)
which converts GBP or USD to euro, but it creates three dictionaries with the same day instead of summing them.
This is what it returns: <QuerySet [{'date': datetime.datetime(2018, 6, 19, 0, 0, tzinfo=<UTC>), 'amount': 21.655}, {'date': datetime.datetime(2018, 6, 19, 0, 0, tzinfo=<UTC>), 'amount': 28.5325}, {'date': datetime.datetime(2018, 6, 19, 0, 0, tzinfo=<UTC>), 'amount': 25.0}]>
and this is what I want:
<QuerySet [{'date': datetime.datetime(2018, 6, 19, 0, 0, tzinfo=<UTC>), 'amount': 75.1875}]>
The only thing that remains is an order_by. This will (yeah, I know that sounds strange) force Django to perform a GROUP BY. So it should be rewritten to:
queryset = Transaction.objects.filter(
    created__gte=last_month,
    status=Transaction.COMPLETED
).extra(
    {"date": "date_trunc('day', created)"}
).values(
    'date'
).annotate(
    amount=Sum(Case(
        When(currency=Core.CURRENCY_EUR, then='amount'),
        When(currency=Core.CURRENCY_USD, then=F('amount') * 0.8662),
        When(currency=Core.CURRENCY_GBP, then=F('amount') * 1.1413),
        default=0,
        output_field=FloatField()
    ))
).order_by('date')
(I've reformatted the query a bit here to make it more readable, especially for small screens, but ignoring spacing it is the same as in the question, except for the .order_by(..) of course.)
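As a side note, .extra() is an old API that the Django docs aim to deprecate, so here is a sketch of the same query using TruncDay from django.db.models.functions instead (available since Django 1.10); field and constant names are taken from the question:

# A sketch, assuming Django >= 1.10: TruncDay replaces the raw
# date_trunc('day', created) that .extra() injected.
from django.db.models import Case, F, FloatField, Sum, When
from django.db.models.functions import TruncDay

queryset = Transaction.objects.filter(
    created__gte=last_month,
    status=Transaction.COMPLETED
).annotate(
    date=TruncDay('created')
).values(
    'date'
).annotate(
    amount=Sum(Case(
        When(currency=Core.CURRENCY_EUR, then='amount'),
        When(currency=Core.CURRENCY_USD, then=F('amount') * 0.8662),
        When(currency=Core.CURRENCY_GBP, then=F('amount') * 1.1413),
        default=0,
        output_field=FloatField()
    ))
).order_by('date')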
We need to aggregate the queryset to accomplish what you are trying to do.
Try using aggregate():
queryset = Transaction.objects.filter(
    created__gte=last_month, status=Transaction.COMPLETED
).extra(
    {"date": "date_trunc('day', created)"}
).values('date').aggregate(
    amount=Sum(Case(
        When(currency=Core.CURRENCY_EUR, then='amount'),
        When(currency=Core.CURRENCY_USD, then=F('amount') * 0.8662),
        When(currency=Core.CURRENCY_GBP, then=F('amount') * 1.1413),
        default=0,
        output_field=FloatField()
    ))
)
For more info, see aggregate().

Use Freebusy on a calendar other than 'primary'

I am creating events on a non-primary calendar, and for this event I want to check whether the user is busy in that calendar, not in the primary one.
My query:
the_datetime = tz.localize(datetime.datetime(2016, 1, 3, 0))
the_datetime2 = tz.localize(datetime.datetime(2016, 1, 4, 8))
body = {
    "timeMin": the_datetime.isoformat(),
    "timeMax": the_datetime2.isoformat(),
    "timeZone": 'US/Central',
    "items": [{"id": 'my.email@gmail.com'}]
}
eventsResult = service.freebusy().query(body=body).execute()
It returns:
{'calendars': {'my.email@gmail.com': {'busy': []}},
 'kind': 'calendar#freeBusy',
 'timeMax': '2016-01-04T14:00:00.000Z',
 'timeMin': '2016-01-03T06:00:00.000Z'}
even though I have something created for that date in my X calendar. But when I create an event in the primary calendar I get:
{'calendars': {'my.email@gmail.com': {'busy': [{'end': '2016-01-03T07:30:00-06:00',
                                                'start': '2016-01-03T06:30:00-06:00'}]}},
 'kind': 'calendar#freeBusy',
 'timeMax': '2016-01-04T14:00:00.000Z',
 'timeMin': '2016-01-03T06:00:00.000Z'}
Is there a way to tell the API the calendar I want to check?
I found it! :D
In the items of body, put the calendar id instead of the email.
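For anyone who hits the same issue, here is a minimal sketch of what that looks like. The calendar ID below is a hypothetical placeholder; calendarList.list() prints the real IDs of the calendars your account can see:

# A sketch: look up the non-primary calendar's ID, then pass that ID in
# "items" instead of the email address. The ID string below is hypothetical.
calendar_list = service.calendarList().list().execute()
for cal in calendar_list['items']:
    print(cal['id'], cal.get('summary'))

body = {
    "timeMin": the_datetime.isoformat(),
    "timeMax": the_datetime2.isoformat(),
    "timeZone": 'US/Central',
    "items": [{"id": 'abc123def456@group.calendar.google.com'}]  # hypothetical calendar id
}
eventsResult = service.freebusy().query(body=body).execute()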
