Two JSON documents linked by a key - Python

I have a Python server listening for POST requests from an external server. I expect two JSON documents for every incident happening on the external server. One of the fields in the JSON documents is a unique_key which can be used to identify that these two documents belong together. Upon receiving the JSON documents, my Python server sticks them into Elasticsearch. The two documents related to an incident will be indexed in Elasticsearch as follows.
/my_index/doc_type/doc_1
/my_index/doc_type/doc_2
i.e. the documents belong to the same index and have the same document type. But I don't have an easy way to know that these two documents are related. I want to do some processing before inserting into Elasticsearch, where I can use the unique_key on the two documents to link them. What are your thoughts on doing some normalization across the two documents and merging them into a single JSON document? It has to be remembered that I will be receiving a large number of such documents per second. I need some temporary storage to store and process the JSON documents. Can someone give some suggestions for approaching this problem?
As an update, I am adding the basic structure of the JSON files here.
json_1
{
    "msg": "0",
    "tdxy": "1",
    "data": {
        "Metric": "true",
        "Severity": "warn",
        "Message": {
            "Session": "None",
            "TransId": "myserver.com-14d9e013794",
            "TransName": "dashboard.action",
            "Time": 0,
            "Code": 0,
            "CPUs": 8,
            "Lang": "en-GB",
            "Event": "false"
        },
        "EventTimestamp": "1433192761097"
    },
    "Timestamp": "1433732801097",
    "Host": "myserver.myspace.com",
    "Group": "UndefinedGroup"
}
json_2
{
    "Message": "Hello World",
    "Session": "4B5ABE9B135B7EHD49343865C83AD9E079",
    "TransId": "myserver.com-14d9e013794",
    "TransName": "dashboard.action",
    "points": [
        {
            "Name": "service.myserver.com:9065",
            "Host": "myserver.com",
            "Port": "9065"
        }
    ],
    "Points Operations": 1,
    "Points Exceeded": 0,
    "HEADER.connection": "Keep-Alive",
    "PARAMETER._": "1432875392706"
}
I have updated the code as per the suggestion.
if rx_buffer:
    txid = json.loads(rx_buffer)['TransId']
    if `condition_1`:
        res = es.index(index='its', doc_type='vents', id=txid, body=rx_buffer)
        print(res['created'])
    elif `condition_2`:
        res = es.update(index='its', doc_type='vents', id=txid, body={"f_vent": {"b_vent": rx_buffer}})
I get the following error.
File "/usr/lib/python2.7/site-packages/elasticsearch/transport.py", line 307, in perform_request
status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/http_urllib3.py", line 89, in perform_request
self._raise_error(response.status, raw_data)
File "/usr/lib/python2.7/site-packages/elasticsearch/connection/base.py", line 105, in _raise_error
raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
RequestError: TransportError(400, u'ActionRequestValidationException[Validation Failed: 1: script or doc is missing;]')
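The 400 above is Elasticsearch rejecting the update because its body contains neither a doc nor a script element; the partial document has to be wrapped under a top-level "doc" key. A minimal sketch of the corrected call, reusing es, txid and rx_buffer from the snippet above and keeping the f_vent/b_vent field names from it:

second_doc = json.loads(rx_buffer)  # parse so Elasticsearch stores an object, not a raw string
res = es.update(
    index='its',
    doc_type='vents',
    id=txid,
    body={"doc": {"f_vent": {"b_vent": second_doc}}}  # the "doc" wrapper is what the update API expects
)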

The code below makes the assumption you're using the official elasticsearch-py library, but it's easy to transpose the code to another library.
We'd also probably need to create a specific mapping for your assembled document of type doc_type, but it heavily depends on how you want to query it later on.
Anyway, based on our discussion above, I would then index json1 first
from elasticsearch import Elasticsearch
es_client = Elasticsearch(hosts=[{"host": "localhost", "port": 9200}])
json1 = { ...JSON of the first document you've received... }
# extract the unique ID
# note: you might want to only take 14d9e013794 and ditch "myserver.com-" if that prefix is always constant
doc_id = json1['data']['Message']['TransId']
# index the first document
es_client.index(index="my_index", doc_type="doc_type", id=doc_id, body=json1)
At this point json1 is stored in Elasticsearch. Then, when you later get your second document json2 you can proceed like this:
json2 = { ...JSON of the second document you've received... }
# extract the unique ID
# note: same remark about keeping only the second part of the id
doc_id = json2['TransId']
# make a partial update of your first document
es_client.update(index="my_index", doc_type="doc_type", id=doc_id, body={"doc": {"SecondDoc": json2}})
Note that SecondDoc can be any name of your choosing here, it's simply a nested field that will contain your second document.
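If you do decide to keep only the trailing part of the id, a minimal sketch (assuming the prefix is always separated from it by the last '-') could be:

full_id = json1['data']['Message']['TransId']   # e.g. "myserver.com-14d9e013794"
doc_id = full_id.rsplit('-', 1)[-1]             # keeps only "14d9e013794"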
At this point you should have a single document having the id 14d9e013794 and the following content:
{
    "msg": "0",
    "tdxy": "1",
    "data": {
        "Metric": "true",
        "Severity": "warn",
        "Message": {
            "Session": "None",
            "TransId": "myserver.com-14d9e013794",
            "TransName": "dashboard.action",
            "Time": 0,
            "Code": 0,
            "CPUs": 8,
            "Lang": "en-GB",
            "Event": "false"
        },
        "EventTimestamp": "1433192761097"
    },
    "Timestamp": "1433732801097",
    "Host": "myserver.myspace.com",
    "Group": "UndefinedGroup",
    "SecondDoc": {
        "Message": "Hello World",
        "Session": "4B5ABE9B135B7EHD49343865C83AD9E079",
        "TransId": "myserver.com-14d9e013794",
        "TransName": "dashboard.action",
        "points": [
            {
                "Name": "service.myserver.com:9065",
                "Host": "myserver.com",
                "Port": "9065"
            }
        ],
        "Points Operations": 1,
        "Points Exceeded": 0,
        "HEADER.connection": "Keep-Alive",
        "PARAMETER._": "1432875392706"
    }
}
Of course, you can make any processing on json1 and json2 before indexing/updating them.
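To tie both steps together, here is a minimal sketch of how the receiving handler could choose between the two calls. It assumes the first document can be recognized by the presence of its top-level data field (that heuristic is an assumption; use whatever reliably distinguishes the two), and it ignores the case where the second document arrives before the first, which would make the update fail with a 404:

import json
from elasticsearch import Elasticsearch

es_client = Elasticsearch(hosts=[{"host": "localhost", "port": 9200}])

def handle_incoming(raw_body):
    doc = json.loads(raw_body)
    if "data" in doc:  # assumption: only the first document has a "data" field
        doc_id = doc['data']['Message']['TransId']
        es_client.index(index="my_index", doc_type="doc_type", id=doc_id, body=doc)
    else:              # otherwise treat it as the second document
        doc_id = doc['TransId']
        es_client.update(index="my_index", doc_type="doc_type", id=doc_id,
                         body={"doc": {"SecondDoc": doc}})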

Related

How to filter API response data based on particular time range using python

I am using an AWS Lambda Python function to get email logs from Mailgun using the Mailgun log API.
Here is my function:
import json
import requests

resp = requests.get("https://api.eu.mailgun.net/v3/domain/events",
                    auth=("api", "key-api"))

def jprint(obj):
    # create a formatted string of the Python JSON object
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

jprint(resp.json())
This function gives formatted JSON output of the email logs fetched from the Mailgun API.
Sample response from the API:
{
    "items": [
        {
            "campaigns": [],
            "delivery-status": {
                "attempt-no": 1,
                "certificate-verified": true,
                "code": 250,
                "description": "",
                "message": "OK",
                "mx-host": "host",
                "session-seconds": 1.5093050003051758,
                "tls": true
            },
            "envelope": {
                "sender": "postmaster#domain.com",
                "sending-ip": "ip",
                "targets": "id#mail.com",
                "transport": "smtp"
            },
            "event": "delivered",
            "flags": {
                "is-authenticated": true,
                "is-routed": false,
                "is-system-test": false,
                "is-test-mode": false
            },
            "id": "id",
            "log-level": "info",
            "message": {
                "attachments": [],
                "headers": {
                    "from": "NAME <noreply#name.com>",
                    "message-id": "20220223075827.de300265fad746e9#domain.com",
                    "subject": "Client due diligence information has been submitted by one of your customers.",
                    "to": "id#mail.com"
                },
                "size": 1990
            },
            "recipient": "id#mail.com",
            "recipient-domain": "domain.com",
            "storage": {
                "key": "key",
                "url": "https://storage.eu.mailgun.net/v3/domains/domain/messages/id"
            },
            "tags": [],
            "timestamp": 1645603109.434181,
            "user-variables": {}
        },
        {
            "envelope": {
                "sender": "postmaster#domain.com",
                "targets": "id#mail.com.com",
                "transport": "smtp"
            },
            "event": "accepted",
            "flags": {
                "is-authenticated": true,
                "is-test-mode": false
            },
            "id": "id",
            "log-level": "info",
            "message": {
                "headers": {
                    "from": "NAME <noreply#name.com>",
                    "message-id": "20220223075827.de300265fad746e9#domain.com",
                    "subject": "Client due diligence information has been submitted by one of your customers.",
                    "to": "id#mail.com.com"
                },
                "size": 1990
            },
            "method": "HTTP",
            "recipient": "id#mail.com",
            "recipient-domain": "domain",
            "storage": {
                "key": "key",
                "url": "https://storage.eu.mailgun.net/v3/domains/domain/messages/key"
            },
            "tags": null,
            "timestamp": 1645603107.282775,
            "user-variables": {}
        },
Here the timestamp is not human readable.
I need to set up the AWS Lambda Python script to trigger the event that calls the Mailgun API periodically and sends the logs to CloudWatch. I am familiar with the setup but not with the script.
Now I need to filter the API data for only the last one hour, dynamically.
From my analysis, this can be achieved using the pandas library, but I couldn't find a proper answer for getting logs over a dynamic time range periodically.
I have referred to many docs about this but cannot find a proper answer, and Python is totally new to me.
Can anyone please guide me on how I can get the logs for the last N hours dynamically?
In addition to the above answers, for filtering between two date/time ranges using only Python, you could use datetime. Here, using the same list comprehension as @Amirhossein Kiani:
import datetime
start = datetime.datetime(year, month, day, hour, minute, second).timestamp()
stop = datetime.datetime(year, month, day, hour, minute, second).timestamp()
[x for x in data["items"] if start < x["timestamp"] < stop]
For the one hour difference, you could also use timedelta:
start = (datetime.datetime.now() - datetime.timedelta(hours=1)).timestamp()
stop = datetime.datetime.now().timestamp()
In the Mailgun documentation, you can specify a time range, so your result can already be filtered using the begin and end parameters.
After that, you can use pd.json_normalize to reshape your JSON response.
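A minimal sketch of that approach, assuming the begin and end parameters take Unix timestamps as described in the Mailgun Events API docs (the domain and API key placeholders are the same as in the question):

import time
import pandas as pd
import requests

now = time.time()
resp = requests.get(
    "https://api.eu.mailgun.net/v3/domain/events",
    auth=("api", "key-api"),
    params={"begin": now - 3600, "end": now, "ascending": "yes"},  # last hour, oldest first
)
df = pd.json_normalize(resp.json()["items"])  # flatten the nested event objects into columns
print(df[["event", "timestamp"]])             # these columns exist only if "items" is non-empty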
In addition to what @Corralien said about the documentation, which I personally prefer, you can use a pure Python approach to select the last hour of data using a list comprehension. In the code below, I am going to assume you named the API's response data, which should be a dictionary:
from time import time
lastHour = time() - 3600
[x for x in data["items"] if x["timestamp"] > lastHour]
This would filter the values with a timestamp greater than the last hour (time() - 3600).
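Since the question also mentions that the timestamp is not human readable, a small sketch converting the epoch values with datetime (the UTC timezone choice is an assumption; adjust as needed):

from datetime import datetime, timezone

ts = 1645603109.434181                                  # example value from the response above
readable = datetime.fromtimestamp(ts, tz=timezone.utc)  # 2022-02-23 07:58:29.434181+00:00
print(readable.isoformat())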

Generic JSON parsing/saving to Db with Python

I need to create a generic library to parse a JSON file and insert the resulting data into database tables.
For this I need to create a configuration file that specifies which JSON attributes will be saved in which tables.
The following JSON file is an example:
{
    "label": "first",
    "data": [
        {
            "id": 1,
            "name": "john",
            "eyes": "blue",
            "books": [
                {
                    "id": 999,
                    "title": "the best book",
                    "pages": 234,
                    "pictures": [
                        {
                            "id": 1,
                            "label": "unicorn",
                            "file": "unicorn1.jpg"
                        },
                        {
                            "id": 2,
                            "label": "frog",
                            "file": "frog3.jpg"
                        }
                    ]
                },
                {
                    "id": 9,
                    "title": "last book",
                    "pages": 123,
                    "pictures": [
                        {
                            "id": 5,
                            "label": "horse",
                            "file": "horse5.jpg"
                        }
                    ]
                }
            ]
        }
    ]
}
Requirements (related to this example):
for each data value, an iteration index idx should be created as an id in table t_persons.idx
data.id and data.name will be saved into table t_persons
data.eyes will be saved in table t_eyes, connected via t_eyes.p_idx = t_persons.idx
data.books attributes id, title and pages will be saved in t_books, connected via t_books.p_idx = t_persons.idx
data.books.pictures will be saved in t_pictures, connected via t_pictures.b_idx = t_books.idx
the tables are already defined and can't be modified
For a different JSON a new configuration file will be created, but the Python code should remain unmodified.
I've implemented Python code to directly parse a JSON example and I've tried to generalize it.
I've saved the JSON attributes in a pandas data frame corresponding to each table.
I've transformed the JSON to a dictionary and I've used TOML as a configuration system.
For the JSON example I've used 3 for loops, but I don't know how to specify an arbitrary number of nested loops in a generic way (the place and the number of for loops depend on the JSON structure).
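One way around hard-coding the number of loops is to walk the structure recursively, driven by the configuration. The sketch below is only illustrative: the config layout, table names and row-collection logic are assumptions, not the actual TOML format in use.

# Illustrative config: for each nested list attribute, which table its rows go to
# and which of its children are themselves nested lists.
CONFIG = {
    "table": "t_persons",
    "children": {
        "books": {
            "table": "t_books",
            "children": {
                "pictures": {"table": "t_pictures", "children": {}}
            }
        }
    }
}

def walk(items, spec, parent_idx=None, rows=None):
    # Recursively collect rows for each configured table, at any nesting depth.
    if rows is None:
        rows = {}
    for idx, item in enumerate(items):
        scalars = {k: v for k, v in item.items() if not isinstance(v, list)}
        rows.setdefault(spec["table"], []).append({"idx": idx, "p_idx": parent_idx, **scalars})
        for child_key, child_spec in spec["children"].items():
            walk(item.get(child_key, []), child_spec, parent_idx=idx, rows=rows)
    return rows

# usage: rows = walk(doc["data"], CONFIG)  ->  {"t_persons": [...], "t_books": [...], "t_pictures": [...]}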

Issue building a REST API query using $filter for Power BI Admin API

I am trying to access a Power BI Admin API, filtering for dataset refreshes whose status is Failed; however, the API filter query doesn't work.
Here is the documentation : https://learn.microsoft.com/en-us/rest/api/power-bi/admin/getrefreshables
Below is part of my code in Python for calling the GET method, which is failing:
refreshables_url = "https://api.powerbi.com/v1.0/myorg/admin/capacities/refreshables?$filter=lastRefresh/status eq 'Failed'"
header = {'Content-Type':'application/json','Authorization': f'Bearer {access_token}'}
r = requests.get(url=refreshables_url, headers=header)
The error below is thrown when I try to filter for status:
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
However, the queries below work fine; these are simple queries without inner/nested elements.
refreshables_url = "https://api.powerbi.com/v1.0/myorg/admin/capacities/refreshables?$filter=averageDuration gt 1200"
refreshables_url = "https://api.powerbi.com/v1.0/myorg/admin/capacities/refreshables?$filter=refreshesPerDay eq 15"
However, when I try to filter on a nested element like status, it fails. I must be calling it incorrectly, but I am not sure how.
What am I missing here?
Here is what the response looks like:
{
    "value": [
        {
            "id": "cfafbeb1-8037-4d0c-896e-a46fb27ff229",
            "name": "SalesMarketing",
            "kind": "Dataset",
            "startTime": "2017-06-13T09:25:43.153Z",
            "endTime": "2017-06-19T11:22:32.445Z",
            "refreshCount": 22,
            "refreshFailures": 0,
            "averageDuration": 289.3814,
            "medianDuration": 268.6245,
            "refreshesPerDay": 11,
            "lastRefresh": {
                "refreshType": "ViaApi",
                "startTime": "2017-06-13T09:25:43.153Z",
                "endTime": "2017-06-13T09:31:43.153Z",
                "status": "Completed",
                "requestId": "9399bb89-25d1-44f8-8576-136d7e9014b1"
            }
        }
    ]
}
Here is what I am expecting (it should just filter the entries whose status is "Failed" instead of the "Completed" entries above):
{
    "value": [
        {
            "id": "ewrffbeb1-6337-460c-326e-a46fb27hh234",
            "name": "SalesMarketing",
            "kind": "Dataset",
            "startTime": "2017-06-13T09:25:43.153Z",
            "endTime": "2017-06-19T11:22:32.445Z",
            "refreshCount": 2,
            "refreshFailures": 0,
            "averageDuration": 189.3814,
            "medianDuration": 168.6245,
            "refreshesPerDay": 1,
            "lastRefresh": {
                "refreshType": "ViaApi",
                "startTime": "2017-04-13T09:25:43.153Z",
                "endTime": "2017-10-13T09:31:43.153Z",
                "status": "Failed",
                "requestId": "43643bb89-25d1-77f8-8543-dsgfewre9034r3223"
            }
        }
    ]
}
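A debugging sketch only, not a confirmed fix: the JSONDecodeError means the response body was not JSON at all (typically an error page or an empty body), so it is worth printing the status code and raw text before calling r.json(), and letting requests URL-encode the $filter expression, since the spaces and quotes in lastRefresh/status eq 'Failed' are easy to get wrong when embedded directly in the URL:

import requests

refreshables_url = "https://api.powerbi.com/v1.0/myorg/admin/capacities/refreshables"
header = {'Content-Type': 'application/json', 'Authorization': f'Bearer {access_token}'}

# let requests handle the encoding of spaces and quotes in the OData filter
params = {"$filter": "lastRefresh/status eq 'Failed'"}
r = requests.get(url=refreshables_url, headers=header, params=params)

print(r.status_code)   # inspect the HTTP status before parsing
print(r.text[:500])    # the raw body usually contains the real API error message
data = r.json() if r.ok and r.text else None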

How do you parse nested JSON data for specific information?

I'm using the national weather service API and when you use a specific URL you get JSON data back. My program so far grabs everything including 155 hours of weather data.
Simply put, I'm trying to parse the data and grab the weather for the latest hour, but everything is in a nested data structure.
My code, JSON data, and more information are below. Any help is appreciated.
import requests
import json

def get_current_weather():  # This method returns json data from the api
    url = 'https://api.weather.gov/gridpoints/*office*/*any number,*any number*/forecast/hourly'
    response = requests.get(url)
    full_data = response.json()
    return full_data

def main():  # Prints the information grabbed from the API
    print(get_current_weather())

if __name__ == "__main__":
    main()
In the JSON response I get, there are 3 layers before you reach the 'shortForecast' data that I'm trying to get. The first nest is 'properties'; everything before it is irrelevant to my program. The second nest is 'periods', and each period is a new hour, 0 being the latest. Lastly, I just need to grab the 'shortForecast' in the first period, or periods[0].
{
    "@context": [
        "https://geojson.org/geojson-ld/geojson-context.jsonld",
        {
            "@version": "1.1",
            "wx": "https://api.weather.gov/ontology#",
            "geo": "http://www.opengis.net/ont/geosparql#",
            "unit": "http://codes.wmo.int/common/unit/",
            "@vocab": "https://api.weather.gov/ontology#"
        }
    ],
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        "coordinates": [
            [
                *data I'm not gonna add*
            ]
        ]
    },
    "properties": {
        "updated": "2021-02-11T05:57:24+00:00",
        "units": "us",
        "forecastGenerator": "HourlyForecastGenerator",
        "generatedAt": "2021-02-11T07:12:58+00:00",
        "updateTime": "2021-02-11T05:57:24+00:00",
        "validTimes": "2021-02-10T23:00:00+00:00/P7DT14H",
        "elevation": {
            "value": ,
            "unitCode": "unit:m"
        },
        "periods": [
            {
                "number": 1,
                "name": "",
                "startTime": "2021-02-11T02:00:00-05:00",
                "endTime": "2021-02-11T03:00:00-05:00",
                "isDaytime": false,
                "temperature": 18,
                "temperatureUnit": "F",
                "temperatureTrend": null,
                "windSpeed": "10 mph",
                "windDirection": "N",
                "icon": "https://api.weather.gov/icons/land/night/snow,40?size=small",
                "shortForecast": "Chance Light Snow",
                "detailedForecast": ""
            },
            {
                "number": 2,
                "name": "",
                "startTime": "2021-02-11T03:00:00-05:00",
                "endTime": "2021-02-11T04:00:00-05:00",
                "isDaytime": false,
                "temperature": 17,
                "temperatureUnit": "F",
                "temperatureTrend": null,
                "windSpeed": "12 mph",
                "windDirection": "N",
                "icon": "https://api.weather.gov/icons/land/night/snow,40?size=small",
                "shortForecast": "Chance Light Snow",
                "detailedForecast": ""
            },
OK, so I didn't want to edit everything again, so this is the new get_current_weather method. I was able to get to 'periods', but after that I'm still stumped.
def get_current_weather():
    url = 'https://api.weather.gov/gridpoints/ILN/82,83/forecast/hourly'
    response = requests.get(url)
    full_data = response.json()
    return full_data['properties'].get('periods')
For the dictionary object, you can access the nested elements by using indexing multiple times.
So, for your dictionary object, you can use the following to get the value for the key shortForecast for the first element in the list of dictionaries under key periods under the key properties in the main dictionary:
full_data['properties']['periods'][0]['shortForecast']
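Plugged into the function from the question, a minimal sketch (keeping the ILN/82,83 gridpoint from the update above) could look like this:

import requests

def get_latest_short_forecast():
    url = 'https://api.weather.gov/gridpoints/ILN/82,83/forecast/hourly'
    full_data = requests.get(url).json()
    # periods[0] is the latest hour; grab just its short forecast text
    return full_data['properties']['periods'][0]['shortForecast']

print(get_latest_short_forecast())  # e.g. "Chance Light Snow"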

SugarCRM response ordered dict key _hash

What is the _hash that is received with the API request?
My request URL:
url = "https://" + sugar_instance + "/rest/v10/Leads"
Is there a unique user_id for each Lead/Employee/Module in SugarCRM? And if yes, how can I obtain it using a request? I am using Python.
There are a few different questions within your question. I'll try to answer all of them.
What is _hash?
Have a look at this subset of an API response:
"modified_user_id": "e8b433d5-5d17-456c-8506-fe56452fcce8",
"modified_by_name": "Reisclef",
"modified_user_link": {
"full_name": "Administrator",
"id": "1",
"_acl": {
"fields": [],
"delete": "no",
"_hash": "8e11bf9be8f04daddee9d08d44ea891e"
}
},
"created_by": "1",
"created_by_name": "Administrator",
"created_by_link": {
"full_name": "Administrator",
"id": "1",
"_acl": {
"fields": [],
"delete": "no",
"_hash": "8e11bf9be8f04daddee9d08d44ea891e"
}
},
The "_hash" in the above response is a hash of the related acl record, representing the user's access control limits to the record in question.
We can prove this by looking further down my response. You will notice that the hash changes, but is consistent with each object with the same criteria:
"member_of": {
"name": "",
"id": "",
"_acl": {
"fields": [],
"_hash": "654d337e0e912edaa00dbb0fb3dc3c17"
}
},
"campaign_id": "",
"campaign_name": "",
"campaign_accounts": {
"name": "",
"id": "",
"_acl": {
"fields": [],
"_hash": "654d337e0e912edaa00dbb0fb3dc3c17"
}
},
What we can gather from this is that the _hash is a hash of the _acl object. You can confirm this by looking at include/MetaDataManager/MetaDataManager.php, line 1035.
Therefore, it's not a hash of the user record, it's a hash of the ACL settings of the record.
Is there a unique user_id?
Strictly speaking, no, there won't be a unique user id for every record (unless one user only ever created/edited one record).
If you refer back to my first block of JSON, you'll see there are two user relationships:
modified_user_id and created_by.
These indicate what the unique id is of the user record, which we can guarantee to be unique (as far as GUIDs are).
How can I obtain it?
It's technically already in the request, but if you just wanted to retrieve the created by user id and modified by user id, you can do the call using this:
https://{INSTANCE}/rest/v10/{MODULE}?fields=created_by,modified_user_id
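Since you mention you are using Python, a minimal sketch of that call with requests (the instance URL, module and token are placeholders; the OAuth-Token header is the usual way to authenticate against the v10 REST API, but adjust to however you already authenticate):

import requests

instance = "https://{INSTANCE}"     # placeholder, as in the URL above
token = "<oauth access token>"      # e.g. obtained from /rest/v10/oauth2/token

resp = requests.get(
    f"{instance}/rest/v10/Leads",
    headers={"OAuth-Token": token},
    params={"fields": "created_by,modified_user_id"},
)
for record in resp.json().get("records", []):
    print(record["id"], record.get("created_by"), record.get("modified_user_id"))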
