I use the Python Elasticsearch API.
I have a dataset too large to retrieve using search().
I can retrieve it with helpers.scan(), but the data is too big to be processed quickly with pandas.
So I learnt how to do aggregations with Elasticsearch to compact the data, but using search() I still can't retrieve all the data. Do I understand correctly that the aggregation is done on the "usual" search size, even if the aggregation would give a single line?
Finally I tried aggregations + scan or scroll, but I understand that scan() and scroll() cannot be used for aggregations: those requests work on subsets of the dataset, so an aggregation over each subset would be meaningless.
What is the right way to do aggregations on a very large dataset?
I can't find any relevant solution on the web.
To be more explicit, my case is:
I have X thousand moving sensors transmitting, every hour, the last stop location and the new stop location. The move from the last stop to the new stop can take days, so for days the hourly acquisitions carry no relevant information.
As the Elasticsearch output I only need every unique line of the format:
sensor_id / last_stop / new_stop
If you are using Elasticsearch with pandas, you could try eland, a new official Elastic library written to integrate the two better.
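A minimal sketch, assuming a local cluster and an index named my_index (both assumptions, not from the question):

import eland as ed

# Wrap the index as a pandas-like DataFrame; operations are pushed down
# to Elasticsearch instead of pulling every document into memory.
df = ed.DataFrame("http://localhost:9200", es_index_pattern="my_index")
print(df[["sensor_id", "last_stop", "new_stop"]].head())

Otherwise, you can compact the data server-side with a nested terms aggregation. Try: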
from elasticsearch import Elasticsearch

es = Elasticsearch()
body = {
    "size": 0,
    "aggs": {
        "getAllTheSensorId": {
            "terms": {
                "field": "sensor_id",
                "size": 10000
            },
            "aggs": {
                "getAllTheLastStop": {
                    "terms": {
                        "field": "last_stop",
                        "size": 10000
                    },
                    "aggs": {
                        "getAllTheNewStop": {
                            "terms": {
                                "field": "new_stop",
                                "size": 10000
                            }
                        }
                    }
                }
            }
        }
    }
}

list_of_results = []
result = es.search(index="my_index", body=body)
for sensor in result["aggregations"]["getAllTheSensorId"]["buckets"]:
    for last in sensor["getAllTheLastStop"]["buckets"]:
        for new in last["getAllTheNewStop"]["buckets"]:
            record = {"sensor": sensor["key"], "last_stop": last["key"], "new_stop": new["key"]}
            list_of_results.append(record)
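A terms aggregation caps the number of buckets it returns (10,000 here), so if the unique combinations can exceed that, a composite aggregation is the usual way to page through all of them. A hedged sketch reusing the index and field names above:

body = {
    "size": 0,
    "aggs": {
        "unique_lines": {
            "composite": {
                "size": 1000,
                "sources": [
                    {"sensor_id": {"terms": {"field": "sensor_id"}}},
                    {"last_stop": {"terms": {"field": "last_stop"}}},
                    {"new_stop": {"terms": {"field": "new_stop"}}}
                ]
            }
        }
    }
}

list_of_results = []
while True:
    result = es.search(index="my_index", body=body)
    agg = result["aggregations"]["unique_lines"]
    list_of_results.extend(bucket["key"] for bucket in agg["buckets"])
    if "after_key" not in agg:
        break
    # Resume the next page from where the previous one stopped.
    body["aggs"]["unique_lines"]["composite"]["after"] = agg["after_key"]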
I have data of the form:
{
    '_id': asdf123b51234,
    'field2': 0,
    'array': [
        {
            'unique_array_elem_id': id,
            'nested_field': {
                'new_field_i_want_to_add': value
            }
        },
        ...
    ]
}
I have been trying to update like this:
for doc in update_dict:
    collection.find_one_and_update(
        {'_id': doc['_id']},
        {'$set': {
            'array.$[elem].nested_field.new_field_i_want_to_add': doc['new_field_value']
        }},
        array_filters=[{'elem.unique_array_elem_id': doc['unique_array_elem_id']}]
    )
But it is painfully slow. Updating all of my data will take several days running continuously. Is there a way to update this nested field for all array elements for a given document at once?
Thanks a lot
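For reference, a hedged sketch of one thing that often helps here: sending the same per-element updates in a single bulk_write round trip instead of one find_one_and_update call per element (update_dict and the field names are taken from the question; the speedup is an assumption, not a measurement):

from pymongo import UpdateOne

# Build one UpdateOne per element, then send them all in one batch.
requests = [
    UpdateOne(
        {'_id': doc['_id']},
        {'$set': {
            'array.$[elem].nested_field.new_field_i_want_to_add': doc['new_field_value']
        }},
        array_filters=[{'elem.unique_array_elem_id': doc['unique_array_elem_id']}]
    )
    for doc in update_dict
]
collection.bulk_write(requests, ordered=False)  # unordered: server may parallelize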
Is there any way to add values via aggregation, like db.insert_one?
x = db.aggregate([{
    "$addFields": {
        "chat_id": -10013345566,
    }
}])
I tried this, but the code returns nothing and the values are not updated.
I want to add the values via aggregation, because aggregation is much faster than the alternatives.
Sample documents:
{"_id": 123, "chat_id": 125}
{"_id": 234, "chat_id": 1325}
{"_id": 1323, "chat_id": 335}
Expected output:
alternative to db.insert_one() in mongodb aggregation
You have to make use of the $merge stage to save the output of the aggregation back to the collection.
Note: Be very careful when you use the $merge stage, as you can accidentally replace the entire document in your collection. Go through the complete documentation of this stage before using it.
db.collection.aggregate([
  {
    "$match": {
      "_id": 123
    }
  },
  {
    "$addFields": {
      "chat_id": -10013345566
    }
  },
  {
    "$merge": {
      "into": "collection",    // <- Collection Name
      "on": "_id",             // <- Merge operation match key
      "whenMatched": "merge"   // <- Operation to perform when matched
    }
  },
])
Mongo Playground Sample Execution
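Since the question is asked from Python, a hedged pymongo equivalent of the same pipeline (the collection name is assumed):

# The pipeline runs server-side; $merge writes the result back into the
# target collection as soon as the aggregate command executes.
db.collection.aggregate([
    {"$match": {"_id": 123}},
    {"$addFields": {"chat_id": -10013345566}},
    {"$merge": {
        "into": "collection",   # collection name
        "on": "_id",            # merge match key
        "whenMatched": "merge"  # operation when matched
    }}
])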
I am converting a script that was originally written in Apps Script to apply formatting to Google Sheets.
This script needs to apply to many sheets, and the number of columns is not known in advance. Before, in Apps Script, I used a basic getDataRange() without parameters, and it would select the correct number of columns and rows. How can I do the same via the API? Is there a way to set the end column index to the end of the data range?
For example, I'm using
{
  "setBasicFilter": {
    "filter": {
      "range": {
        "sheetId": SHEET_ID,
        "startRowIndex": 0
      }
    }
  }
}
To set the top row as a filter. But it applies the filter to all the empty cells outside the table with data as well, while I need it to stop at the last column.
What is the best way to do this via the API?
Solution:
You can call spreadsheets.values.get to get the values of a range, then take the length of the first row array and plug it into the setBasicFilter request.
Sample Code:
# The ID and range of a sample spreadsheet.
SAMPLE_SPREADSHEET_ID = 'enter spreadsheet ID here'
SAMPLE_RANGE_NAME = 'Sheet1!A1:1'

# ...

# Call the Sheets API
sheet = service.spreadsheets()
result = sheet.values().get(spreadsheetId=SAMPLE_SPREADSHEET_ID,
                            range=SAMPLE_RANGE_NAME).execute()
values = result.get('values', [])
length = len(values[0])

# ...
# filter parameters
{
    "setBasicFilter": {
        "filter": {
            "range": {
                "sheetId": SHEET_ID,
                "startRowIndex": 0,
                "startColumnIndex": 0,
                "endColumnIndex": length
            }
        }
    }
}
References:
Python Quickstart
Grid Range
I have to perform an aggregation on MongoDB in Python and I am unable to do so.
Below is the structure of the extracted MongoDB document:
{
  'Category': 'Male',
  'details': [
    {'name': 'Sachin', 'height': 6},
    {'name': 'Rohit', 'height': 5.6},
    {'name': 'Virat', 'height': 5}
  ]
}
I want to return the height where the name is Sachin using the aggregate function. Basically my idea is to extract the data with $match, apply the condition, and aggregate, all in one aggregate call. This could easily be done in 3 steps with if statements, but I'm looking to do it in a single aggregate function.
Please note: there is no fixed length of the 'details' array.
Let me know if any more explanation is needed.
You can use a $filter to achieve this:
db.collection.aggregate([
  {
    $project: {
      details: {
        $filter: {
          input: "$details",
          cond: {
            $eq: [
              "$$this.name",
              "Sachin"
            ]
          }
        }
      }
    }
  }
])
Working Mongo playground
If you use find instead, you need to be aware of the positional operator:
db.collection.find({
"details.name": "Sachin"
},
{
"details.$": 1
})
Working Mongo playground
If you need the result as an object rather than an array, you can simply use $arrayElemAt with $ifNull.
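A hedged pymongo sketch of that variant, falling back to an empty object when nothing matches (collection and field names as in the question):

# $filter keeps only the matching elements, $arrayElemAt takes the first
# one, and $ifNull supplies a default when the filtered array is empty.
pipeline = [
    {"$project": {
        "details": {
            "$ifNull": [
                {"$arrayElemAt": [
                    {"$filter": {
                        "input": "$details",
                        "cond": {"$eq": ["$$this.name", "Sachin"]}
                    }},
                    0
                ]},
                {}
            ]
        }
    }}
]
doc = next(db.collection.aggregate(pipeline), None)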
I have an index with the following structure:
item_id: unique item id
sale_date: date of the sale
price: price of the sale on that date
I want to create a histogram of the latest sale price per item: a terms aggregation on item_id and a histogram of the last (latest) price.
My first choice was a terms aggregation on item_id, picking the price from a top_hits with size 1 ordered by sale_date desc, then building the histogram on the Python end.
But since the data runs to tens of millions of records for one month, it is not viable to download all the sources in time to build the histogram.
Note: some items sell daily and some at other intervals, which makes it tricky to just pick the latest sale_date.
Updated:
Input: item-based sales time series data.
Output: histogram of the count of items that fall in certain price buckets, with respect to the latest information.
I have a workaround that I used in a similar case: you can use a max aggregation on the date field, and you can order an aggregation based on a nested aggregation's value, like so:
"aggs": {
"item ID": {
"terms": {
"field": "item_id",
"size": 10000
},
"aggs": {
"price": {
"terms": {
"field": "price",
"size": 1,
"order": {
"sale_date": "desc"
}
},
"aggs": {
"sale_date": {
"max": {
"field": "sale_date"
}
}
}
}
}
}
}
I hope that helps, and please let me know if it works for you.
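For completeness, a rough sketch of bucketing the per-item latest prices on the Python side (the index name and the es client are assumptions; the aggregation names match the block above):

import collections

# body is assumed to wrap the "aggs" block above together with "size": 0.
result = es.search(index="my_index", body=body)

latest_prices = []
for item in result["aggregations"]["item ID"]["buckets"]:
    price_buckets = item["price"]["buckets"]
    if price_buckets:
        # size 1 ordered by max(sale_date) desc => the latest sale price
        latest_prices.append(price_buckets[0]["key"])

histogram = collections.Counter(latest_prices)  # price -> number of items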