Custom Histogram aggregation in Elasticsearch - python

I have an index with the following structure:
item_id: unique item id
sale_date: date of the sale
price: price of the sale on that date
I want to create a histogram of the latest sale price per item: a terms aggregation on item_id combined with a histogram of each item's last (latest) price.
My first choice was to run a terms aggregation on item_id, pick the price from a top_hits sub-aggregation with size 1 ordered by sale_date desc, and build the histogram on the Python end.
However, since the data runs into tens of millions of records for a single month, it is not viable to download all the source documents in time to build the histogram.
Note: some items sell daily and others at irregular intervals, which makes it tricky to simply pick the latest sale_date.
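For reference, the query I tried looks roughly like this, written as a Python dict (the aggregation names and the terms size are just placeholders):
# Terms aggregation per item, taking the single most recent sale via top_hits.
query = {
    "size": 0,
    "aggs": {
        "items": {
            "terms": {"field": "item_id", "size": 10000},
            "aggs": {
                "latest_sale": {
                    "top_hits": {
                        "size": 1,
                        "sort": [{"sale_date": {"order": "desc"}}],
                        "_source": ["price"],
                    }
                }
            },
        }
    },
}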
Updated:
Input: item-based sales time series data.
Output: histogram of the count of items that fall into certain price buckets, based on the latest information per item.

I have a workaround that I used in a similar case. You can use a max aggregation on the date field and order the terms aggregation by that nested aggregation's value, like this:
"aggs": {
"item ID": {
"terms": {
"field": "item_id",
"size": 10000
},
"aggs": {
"price": {
"terms": {
"field": "price",
"size": 1,
"order": {
"sale_date": "desc"
}
},
"aggs": {
"sale_date": {
"max": {
"field": "sale_date"
}
}
}
}
}
}
}
I hope that helps; let me know if it works for you.
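To close the loop on the Python side, a minimal sketch of running this aggregation with the elasticsearch client and building the histogram locally might look like the following (the index name, client configuration, and bin count are assumptions):
from elasticsearch import Elasticsearch
import numpy as np

es = Elasticsearch()  # connection details are an assumption

body = {
    "size": 0,
    "aggs": {
        "item ID": {
            "terms": {"field": "item_id", "size": 10000},
            "aggs": {
                "price": {
                    "terms": {
                        "field": "price",
                        "size": 1,
                        "order": {"sale_date": "desc"},
                    },
                    "aggs": {"sale_date": {"max": {"field": "sale_date"}}},
                }
            },
        }
    },
}

response = es.search(index="sales", body=body)  # "sales" is a placeholder index name

# One bucket per item; the single price bucket inside it is the latest price.
latest_prices = [
    item["price"]["buckets"][0]["key"]
    for item in response["aggregations"]["item ID"]["buckets"]
    if item["price"]["buckets"]
]

# Build the histogram client-side from one value per item.
counts, bin_edges = np.histogram(latest_prices, bins=20)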

Related

Multi-level Python Dict to Pandas DataFrame only processes one level out of many

I'm parsing some XML data, doing some logic on it, and trying to display the results in an HTML table. The dictionary, after filling, looks like this:
{
  "general_info": {
    "name": "xxx",
    "description": "xxx",
    "language": "xxx",
    "prefix": "xxx",
    "version": "xxx"
  },
  "element_count": {
    "folders": 23,
    "conditions": 72,
    "listeners": 1,
    "outputs": 47
  },
  "external_resource_count": {
    "total": 9,
    "extensions": {
      "jar": 8,
      "json": 1
    },
    "paths": {
      "/lib": 9
    }
  },
  "complexity": {
    "over_1_transition": {
      "number": 4,
      "percentage": 30.769
    },
    "over_1_trigger": {
      "number": 2,
      "percentage": 15.385
    },
    "over_1_output": {
      "number": 4,
      "percentage": 30.769
    }
  }
}
Then I'm using pandas to convert the dictionary into a table, like so:
data_frame = pandas.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame()
The result is a table that is mostly correct:
While the first and second levels seem to render correctly, those categories with a sub-sub category get written as a string in the cell, rather than as a further column. I've also tried using stack(level=1) but it raises an error "IndexError: Too many levels: Index has only 1 level, not 2". I've also tried making it into a series with no luck. It seems like it only renders "complete" columns. Is there a way of filling up the empty spaces in the dictionary before processing?
How can I get, for example, external_resource_count -> extensions to have two daughter rows jar and json, with an additional column for the values, so that the final table looks like this:
Extra credit if anyone can tell me how to get rid of the first row with the index numbers. Thanks!
The way you load the dataframe is correct, but you should rename the 0 column to some proper column name.
import pandas as pd

# this function extracts all the keys from your nested dicts
def explode_and_filter(df, filterdict):
    return [df[col].apply(lambda x: x.get(k) if type(x) == dict else x).rename(f'{k}')
            for col, nested in filterdict.items()
            for k in nested]

data_frame = pd.DataFrame.from_dict(data=extracted_metrics, orient='index').stack().to_frame(name='somecol')

# let's separate the rows where a dict is present & explode only those rows
mask = data_frame.somecol.apply(lambda x: type(x) == dict)
expp = explode_and_filter(data_frame[mask],
                          {'somecol': ['jar', 'json', '/lib', 'number', 'percentage']})

# here we concat the exploded series to a frame
exploded_df = pd.concat(expp, axis=1).stack().to_frame(name='somecol2').reset_index(level=2)\
                .rename(columns={'level_2': 'somecol'})

# and now we concat the rows with dict elements with the rows with non-dict elements
out = pd.concat([data_frame[~mask], exploded_df])
The output dataframe looks like this
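As a side note on the extra-credit part of the question: if the unwanted numbers come from the row index when rendering the table, pandas can hide it at render time; a small hedged example, assuming out is the frame produced above:
# Render without the row index; header=False would additionally drop the column names.
html = out.to_html(index=False)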

Querying for values of embedded documents in MongoDB with PyMongo

I have a document in MongoDB that looks like this
{
    "_id": 0,
    "cash_balance": 50,
    "holdings": [
        {
            "name": "item1",
            "code": "code1",
            "quantity": 300
        },
        {
            "name": "item2",
            "code": "code2",
            "quantity": 100
        }
    ]
}
I would like to query for this particular document and get the quantity value of the object inside the holdings array whose code matches "code1". It can be assumed that there will be a match.
data = collection.find_one({"_id": 0, "holdings.code": "code1"}, {"holdings.$.quantity": 1})
Running the above code gives me this output:
{ "_id": 0, "holdings": [{"name": "item1", "code": "code1", "quantity": 300}] }
and I can get the quantity value by using:
data["holdings"][0]["quantity"]
300
However this seems to be a rather roundabout way of getting a single value. Is there a way I can query for the value of a particular key matching the code query without getting the holdings array containing the required object?
Try using the aggregate method with $unwind.
$unwind does the following:
Deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.
MongoDB documentation for $unwind
I created a playground example for you.
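A minimal PyMongo sketch of that idea (the connection and the database/collection names are assumptions) could look like this:
from pymongo import MongoClient

client = MongoClient()                      # connection details are an assumption
collection = client["mydb"]["portfolio"]    # database/collection names are placeholders

pipeline = [
    {"$match": {"_id": 0}},                            # select the document
    {"$unwind": "$holdings"},                          # one document per holdings element
    {"$match": {"holdings.code": "code1"}},            # keep only the matching holding
    {"$project": {"_id": 0, "quantity": "$holdings.quantity"}},  # return just the quantity
]

result = next(collection.aggregate(pipeline), None)
quantity = result["quantity"] if result is not None else None   # -> 300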

ElasticSearch aggregation on large dataset

I use the Python Elasticsearch API.
I have a dataset too large to retrieve using search().
I can retrieve it with helpers.scan(), but the data is too big to be processed quickly with pandas.
So I learnt how to do aggregations with Elasticsearch to compact the data, but still using search() I can't retrieve all the data. I understand that the aggregation is done on the "usual" search size, even if the aggregation itself would give only one line?
Finally I tried aggregations + scan or scroll, but I understand that scan() or scroll() cannot be used for aggregations, because those requests work on subsets of the dataset, and an aggregation computed per subset is meaningless.
What is the right way to do aggregations on a very large dataset?
I can't find any relevant solution on the web.
To be more explicit, my case is:
I have X thousand moving sensors, each transmitting every hour its last stop location and its new stop location. The move from the last stop to the new stop can take days, so for days the hourly acquisitions carry no new information.
As an Elasticsearch output I only need every unique line of the format:
sensor_id / last_stop / new_stop
If you are using Elasticsearch with pandas, you could try eland, a new official Elastic library written to better integrate them. Try:
from elasticsearch import Elasticsearch

es = Elasticsearch()

body = {
    "size": 0,
    "aggs": {
        "getAllTheSensorId": {
            "terms": {
                "field": "sensor_id",
                "size": 10000
            },
            "aggs": {
                "getAllTheLastStop": {
                    "terms": {
                        "field": "last_stop",
                        "size": 10000
                    },
                    "aggs": {
                        "getAllTheNewStop": {
                            "terms": {
                                "field": "new_stop",
                                "size": 10000
                            }
                        }
                    }
                }
            }
        }
    }
}

list_of_results = []
result = es.search(index="my_index", body=body)
for sensor in result["aggregations"]["getAllTheSensorId"]["buckets"]:
    for last in sensor["getAllTheLastStop"]["buckets"]:
        for new in last["getAllTheNewStop"]["buckets"]:
            record = {"sensor": sensor["key"], "last_stop": last["key"], "new_stop": new["key"]}
            list_of_results.append(record)
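As a follow-up, the results can be loaded straight into pandas, and eland itself (mentioned above) exposes the index as a lazily evaluated DataFrame. A rough sketch, assuming the snippet above has run and that the host URL and index name are placeholders:
import pandas as pd
import eland as ed

# Turn the aggregation output from the snippet above into a regular DataFrame.
df = pd.DataFrame(list_of_results)

# eland alternative: a DataFrame backed directly by the Elasticsearch index,
# where most operations are pushed down to Elasticsearch.
ed_df = ed.DataFrame("http://localhost:9200", es_index_pattern="my_index")
print(ed_df[["sensor_id", "last_stop", "new_stop"]].head())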

How to customize python pandas dataframe into mongodb object

I have imported a CSV dataset using Python and did some cleanup.
Download the dataset here
# importing pandas
import pandas as pd
# reading csv and assigning to 'data'
data = pd.read_csv('co-emissions-per-capita.csv')
# dropping all columns before 2016 (2016 - 2017 remains)
data.drop(data[data.Year < 2016].index, inplace=True)
# dropping rows with all null values in rows
data.dropna(how="all", inplace=True)
# dropping rows with all null values in columns
data.dropna(axis="columns", how="all", inplace=True)
# filling NA values
data["Entity"].fillna("No Country", inplace=True)
data["Code"].fillna("No Code", inplace=True)
data["Year"].fillna("No Year", inplace=True)
data["Per capita CO2 emissions (tonnes per capita)"].fillna(0, inplace=True)
# Sort by Year && Country
data.sort_values(["Year", "Entity"], inplace=True)
# renaming columns
data.rename(columns={"Entity": "Country",
                     "Per capita CO2 emissions (tonnes per capita)": "CO2 emissions (metric tons)"},
            inplace=True)
My current dataset has data for 2 years and 197 countries, which is 394 rows.
I want to insert the data into mongodb in the following format.
[
    {
        "_id": ObjectId("5dfasdc2f7c4b0174c5d01bc"),
        "year": 2016,
        "countries": [
            {
                "name": "Afghanistan",
                "code": "AFG",
                "CO2 emissions (metric tons)": 0.366302
            },
            {
                "name": "Albania",
                "code": "ALB",
                "CO2 emissions (metric tons)": 0.366302
            }
        ]
    },
    {
        "_id": ObjectId("5dfasdc2f7c4b0174c5d01bc"),
        "year": 2017,
        "countries": [
            {
                "name": "Afghanistan",
                "code": "AFG",
                "CO2 emissions (metric tons)": 0.366302
            },
            {
                "name": "Albania",
                "code": "ALB",
                "CO2 emissions (metric tons)": 0.366302
            }
        ]
    }
]
I want one object for each year.
Inside that, I want to nest all the countries and their related information.
To be precise, I want my database to have 2 (max) objects and 197 nested objects inside each main object, so each year is listed only once in the database, whereas each country appears twice, once per year.
Is there a better structure to store this data? Please specify the steps to store this data into MongoDB, and I'd really appreciate it if you could suggest a good ODM driver for Python, like 'mongoose for NodeJS'.
Use the groupby function to split the values from your dataframe into separate groups per year.
Use the to_dict function with the orient parameter set to 'records' to convert the results into JSON arrays.
Use the pymongo API to connect to the DB and insert the values.
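Putting those three steps together, a rough sketch (the connection and database/collection names are assumptions, the key renames just follow the format in the question, and data is the cleaned DataFrame from above) might look like:
from pymongo import MongoClient

client = MongoClient()                              # connection details are an assumption
collection = client["emissions"]["per_capita"]      # database/collection names are placeholders

documents = []
for year, group in data.groupby("Year"):
    countries = (
        group.drop(columns=["Year"])
             .rename(columns={"Country": "name", "Code": "code"})
             .to_dict(orient="records")             # one dict per country row
    )
    documents.append({"year": int(year), "countries": countries})

collection.insert_many(documents)                   # at most one document per year
For the ODM part of the question, MongoEngine is the library most often suggested as a Python counterpart to Mongoose.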

Filter and count entries for each hour, over the last 24 hours from a specific date

I'm currently aggregating some data. I have already done an aggregation that counts entries per day, as you can see in the code below.
I've got no idea how to even start with this new aggregation.
I assume it would involve taking the current date time, subtracting 24 hours, and then getting the count for each hour within those 24 hours (a rough sketch of what I mean is below, after the daily pipeline).
Collection = {
    "_id": ObjectId("5c125a185dea1b0252c895b2"),
    "time": ISODate("2018-12-13T15:09:42.536Z")
}
pipeline = [
    {"$unwind": "$time"},
    {"$group": {
        "_id": {"$dateToString": {"format": "%Y-%m-%d", "date": "$time", "timezone": "Africa/Johannesburg"}},
        "count": {"$sum": 1},
    }},
    {"$sort": SON([("_id", -1)])}
]
This is the code for the daily aggregation.
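Based on that assumption, something along these lines is roughly what I have in mind, though I'm not sure it's right (collection is my pymongo collection object, and the hour format string is a guess):
from datetime import datetime, timedelta

from bson.son import SON

# Keep only the last 24 hours, then count entries per hour.
since = datetime.utcnow() - timedelta(hours=24)

pipeline = [
    {"$match": {"time": {"$gte": since}}},
    {"$group": {
        "_id": {"$dateToString": {
            "format": "%Y-%m-%d %H:00",
            "date": "$time",
            "timezone": "Africa/Johannesburg",
        }},
        "count": {"$sum": 1},
    }},
    {"$sort": SON([("_id", -1)])},
]

results = list(collection.aggregate(pipeline))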
Further help would be much appreciated.
