I'll try to explain the problem as succinctly as possible. I'm trying to filter some values from a log file coming from Elastic. The log outputs this JSON exactly:
{'took': 2, 'timed_out': False, '_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0}, 'hits': {'total': {'value': 2, 'relation': 'eq'}, 'max_score': None, 'hits': [{'_index': 'winlogbeat-dc-2022.10.17-000014', '_type': '_doc', '_id': 'vOCnfoQBeS2JF7giMG9q', '_score': None, '_source': {'agent': {'hostname': 'SRVDC1'}, '@timestamp': '2022-11-16T04:19:13.622Z'}, 'sort': [-9223372036854775808]}, {'_index': 'winlogbeat-dc-2022.10.17-000014', '_type': '_doc', '_id': 'veCnfoQBeS2JF7giMG9q', '_score': None, '_source': {'agent': {'hostname': 'SRVDC1'}, '@timestamp': '2022-11-16T04:19:13.630Z'}, 'sort': [-9223372036854775808]}]}}
Now, I want to extract just the _index and @timestamp keys. If I assign this JSON to a variable, I can pull out the two keys perfectly by running:
index = (data['hits']['hits'][0]['_index'])
timestamp = (data['hits']['hits'][0]['_source']['@timestamp'])
Output:
winlogbeat-dc*
2022-11-16T04:19:13.622Z
However, if I try to do the same directly from the server call, I get:
Traceback (most recent call last):
File "c:\Users\user\Desktop\PYTHON\tiny2.py", line 96, in <module>
query()
File "c:\Users\user\Desktop\PYTHON\tiny2.py", line 77, in query
index = (final_data['hits']['hits'][0]['_index'])
TypeError: string indices must be integers
Now, I understand that it's asking for integer indices instead of the strings I'm using, but if I use integers, then I get individual characters rather than a key/value pair.
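In other words, a minimal sketch of what seems to be happening (not my real data):

final_data = "{'took': 2}"   # the response as a plain string
# final_data['took']         # TypeError: string indices must be integers
print(final_data[0])         # '{' - an individual character
parsed = {'took': 2}         # the response as an actual dict
print(parsed['took'])        # 2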
What am I missing?
UPDATE:
Below is the entire code, but it won't help much. It contains Elastic's DSL query language, and a call to the server, which obviously you won't be able to connect to.
I tried your suggestions, but I either get the same error, or a new one:
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not ObjectApiResponse
Entire code as follows:
import os
import ast
import csv
import json
from elasticsearch import Elasticsearch
from datetime import datetime,timedelta
import datetime
ELASTIC_USERNAME = 'elastic'
ELASTIC_PASSWORD = "abc123"
PORT= str('9200')
HOST = str('10.20.20.131')
CERT = os.path.join(os.path.dirname(__file__),"cert.crt")
initial_time = datetime.datetime.now()
past_time = datetime.datetime.now() - (timedelta(minutes=15))
def query():
    try:  # connection to Elastic server
        es = Elasticsearch(
            "https://10.20.20.131:9200",
            ca_certs=CERT,
            verify_certs=False,
            basic_auth=(ELASTIC_USERNAME, ELASTIC_PASSWORD)
        )
    except ConnectionRefusedError as error:
        print("[-] Connection error")
    else:  # DSL Elastic query of Domain Controller logs
        query_res = es.search(
            index="winlogbeat-dc*",
            body={
                "size": 3,
                "sort": [
                    {
                        "timestamp": {
                            "order": "desc",
                            "unmapped_type": "boolean"
                        }
                    }
                ],
                "_source": [
                    "agent.hostname",
                    "@timestamp"
                ],
                "query": {
                    "bool": {
                        "must": [],
                        "filter": [
                            {
                                "range": {
                                    "@timestamp": {
                                        "format": "strict_date_optional_time",
                                        "gte": f'{initial_time}',
                                        "lte": f'{past_time}'
                                    }
                                }
                            }
                        ],
                        "should": [],
                        "must_not": []
                    }
                }
            }
        )
        if query_res:
            parse_to_json = json.loads(query_res)
            final_data = json.dumps(str(parse_to_json))
            index = ast.literal_eval(final_data)['hits']['hits'][0]['_index']
            timestamp = ast.literal_eval(final_data)['hits']['hits'][0]['_source']['@timestamp']
            columns = ['Index', 'Last Updated']
            rows = [[f'{index}', f'{timestamp}']]
            with open("final_data.csv", 'w') as csv_file:
                write_to_csv = csv.writer(csv_file)
                write_to_csv.writerow(columns)
                write_to_csv.writerows(rows)
                print("CSV file created!")
        else:
            print("Log not found")

query()
If you're really getting single quotes (') in your response, use this:
import ast
...
index = ast.literal_eval(final_data)['hits']['hits'][0]['_index']
Otherwise use this:
import json
...
index = json.loads(final_data)['hits']['hits'][0]['_index']
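The difference, as a minimal sketch: the log you pasted is Python-repr style (single quotes, False/None), which is not valid JSON, so json.loads rejects it while ast.literal_eval accepts it:

import ast, json

s = "{'took': 2, 'timed_out': False}"  # repr-style, not valid JSON
print(ast.literal_eval(s))             # {'took': 2, 'timed_out': False}
# json.loads(s)                        # raises json.decoder.JSONDecodeError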
Elasticsearch returns an ObjectApiResponse so you have to parse the _source field:
import json
final_data = json.loads(query_res["_source"])
index = final_data['hits']['hits'][0]['_index']
I'm not sure why you surround the indexing selection with parentheses.
I struggle to make sense of this:
query_res = es.search(...)
if query_res:
    parse_to_json = json.loads(query_res)
    final_data = json.dumps(str(parse_to_json))
    index = ast.literal_eval(final_data)['hits']['hits'][0]['_index']
    timestamp = ast.literal_eval(final_data)['hits']['hits'][0]['_source']['@timestamp']
query_res is an instance of ObjectApiResponse, and you can read data from it like a dictionary right away. Instead, you perform a sequence of conversions from object to string and back again, and then "stringify" it once more, with unpredictable results.
Just do it like they do in ES docs:
first_hit = query_res['hits']['hits'][0]
index = first_hit['_index']
timestamp = first_hit['_source']['@timestamp']
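If you really do need a plain dict (to json.dumps it, for example), a sketch assuming the 8.x elasticsearch-py client, which exposes the payload on the response object:

raw = query_res.body                # plain dict of the whole response
first_hit = raw['hits']['hits'][0]  # then navigate as above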
I was able to fix the problem by running a broad query first with
query_res['hits']['hits']
and then running a for loop over the results for the specific time range I needed.
Here is the code:
for query_res in query_res['hits']['hits']:
    winlogbeat_dc_timestamp = query_res['_source']['@timestamp']
Then another issue arose: I needed to parse the timestamp into a datetime and convert it back into a string:
#Convert datetime to string
pattern = '%Y-%m-%dT%H:%M:%S.%fZ'
dt = datetime.strptime(winlogbeat_dc_timestamp, pattern)
new_timestamp = str(dt + timedelta(hours=1))[:-3] + 'Z'
And finally format it to a more readable pattern:
#Format timestamp to a more readable pattern
formatted_time = (''.join(letter for letter in new_timestamp if not letter.isalpha()))[:19]
formatted_time2 = formatted_time[:10] + ' / ' + formatted_time[10:]
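For what it's worth, strftime can produce that pattern directly; a sketch assuming the same input format and the same +1 hour adjustment:

from datetime import datetime, timedelta

pattern = '%Y-%m-%dT%H:%M:%S.%fZ'
dt = datetime.strptime('2022-11-16T04:19:13.622Z', pattern)
print((dt + timedelta(hours=1)).strftime('%Y-%m-%d / %H:%M:%S'))
# 2022-11-16 / 05:19:13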
I am currently writing a script to scrape data from an API into a Python dictionary and then export the result into a JSON file. I am trying to get the file extension from a response by splitting with .rsplit('.', 1)[-1]. The only problem is that some keys have None as their value, and this throws AttributeError: 'NoneType' object has no attribute 'rsplit'. Here is my code snippet:
d = requests.get(dataset_url)
output = d.json()
output_dict = {
    'data_files': {
        # Format to get only extension
        'format': output.get('connectionParameters', {}).get('url').rsplit('.', 1)[-1],
        'url': output.get('connectionParameters', {}).get('url'),
    },
}
An example of JSON response with the required key is as follows:
"connectionParameters": {
"csv_escape_char": "\\",
"protocol": "DwC",
"automation": false,
"strip": false,
"csv_eol": "\\n",
"csv_text_enclosure": "\"",
"csv_delimiter": "\\t",
"incremental": false,
"url": "https://registry.nbnatlas.org/upload/1564481725489/London_churchyards_dwc.txt",
"termsForUniqueKey": [
"occurrenceID"
]
},
Any way to tackle this?
Try:
url = output.get("connectionParameters", {}).get("url") or "-"
f = url.rsplit(".", 1)[-1]
output_dict = {
    "data_files": {
        "format": f,
        "url": url,
    },
}
This will print:
{'data_files': {'format': '-', 'url': '-'}}
if the "url" parameter is None.
I wrote some code to get data from a web API. I was able to parse the JSON data from the API, but the result I get looks quite complex. Here is one example:
>>> my_json
{'name': 'ns1:timeSeriesResponseType', 'declaredType': 'org.cuahsi.waterml.TimeSeriesResponseType', 'scope': 'javax.xml.bind.JAXBElement$GlobalScope', 'value': {'queryInfo': {'creationTime': 1349724919000, 'queryURL': 'http://waterservices.usgs.gov/nwis/iv/', 'criteria': {'locationParam': '[ALL:103232434]', 'variableParam': '[00060, 00065]'}, 'note': [{'value': '[ALL:103232434]', 'title': 'filter:sites'}, {'value': '[mode=LATEST, modifiedSince=null]', 'title': 'filter:timeRange'}, {'value': 'sdas01', 'title': 'server'}]}}, 'nil': False, 'globalScope': True, 'typeSubstituted': False}
Looking through this data, I can see the specific data I want: the 1349724919000 value that is labelled as 'creationTime'.
How can I write code that directly gets this value?
I don't need any searching logic to find this value. I can see what I need when I look at the response; I just need to know how to translate that into specific code to extract the specific value, in a hard-coded way. I read some tutorials, so I understand that I need to use [] to access elements of the nested lists and dictionaries; but I can't figure out exactly how it works for a complex case.
More generally, how can I figure out what the "path" is to the data, and write the code for it?
For reference, let's see what the original JSON would look like, with pretty formatting:
>>> print(json.dumps(my_json, indent=4))
{
    "name": "ns1:timeSeriesResponseType",
    "declaredType": "org.cuahsi.waterml.TimeSeriesResponseType",
    "scope": "javax.xml.bind.JAXBElement$GlobalScope",
    "value": {
        "queryInfo": {
            "creationTime": 1349724919000,
            "queryURL": "http://waterservices.usgs.gov/nwis/iv/",
            "criteria": {
                "locationParam": "[ALL:103232434]",
                "variableParam": "[00060, 00065]"
            },
            "note": [
                {
                    "value": "[ALL:103232434]",
                    "title": "filter:sites"
                },
                {
                    "value": "[mode=LATEST, modifiedSince=null]",
                    "title": "filter:timeRange"
                },
                {
                    "value": "sdas01",
                    "title": "server"
                }
            ]
        }
    },
    "nil": false,
    "globalScope": true,
    "typeSubstituted": false
}
That lets us see the structure of the data more clearly.
In the specific case, first we want to look at the corresponding value under the 'value' key in our parsed data. That is another dict; we can access the value of its 'queryInfo' key in the same way, and similarly the 'creationTime' from there.
To get the desired value, we simply put those accesses one after another:
my_json['value']['queryInfo']['creationTime'] # 1349724919000
I just need to know how to translate that into specific code to extract the specific value, in a hard-coded way.
If you access the API again, the new data might not match the code's expectation. You may find it useful to add some error handling. For example, use .get() to access dictionaries in the data, rather than indexing:
name = my_json.get('name') # will return None if 'name' doesn't exist
Another way is to test for a key explicitly:
if 'name' in my_json:
    name = my_json['name']
else:
    pass
However, these approaches may fail if further accesses are required. A placeholder result of None isn't a dictionary or a list, so attempts to access it that way will fail again (with TypeError). Since "Simple is better than complex" and "it's easier to ask for forgiveness than permission", the straightforward solution is to use exception handling:
try:
    creation_time = my_json['value']['queryInfo']['creationTime']
except (TypeError, KeyError):
    print("could not read the creation time!")
    # or substitute a placeholder, or raise a new exception, etc.
Here is an example of loading a single value from simple JSON data, and converting back and forth to JSON:
import json

# a simple dict to work with
data = {"test1": "1", "test2": "2", "test3": "3"}
# serialize the dict to a JSON string
json_str = json.dumps(data)
# parse the JSON string back into a dict
resp = json.loads(json_str)
# print the parsed result
print(resp)
# extract one element from the parsed result
print(resp['test1'])
Try this.
Here, I fetch only statecode from the COVID API (a JSON array).
import requests
r = requests.get('https://api.covid19india.org/data.json')
x = r.json()['statewise']
for i in x:
    print(i['statecode'])
Try this:
from functools import reduce
import re

def deep_get_imps(data, key: str):
    split_keys = re.split("[\\[\\]]", key)
    out_data = data
    for split_key in split_keys:
        if split_key == "":
            return out_data
        elif isinstance(out_data, dict):
            out_data = out_data.get(split_key)
        elif isinstance(out_data, list):
            try:
                sub = int(split_key)
            except ValueError:
                return None
            else:
                length = len(out_data)
                out_data = out_data[sub] if -length <= sub < length else None
        else:
            return None
    return out_data

def deep_get(dictionary, keys):
    return reduce(deep_get_imps, keys.split("."), dictionary)
Then you can use it like below:
res = {
    "status": 200,
    "info": {
        "name": "Test",
        "date": "2021-06-12"
    },
    "result": [{
        "name": "test1",
        "value": 2.5
    }, {
        "name": "test2",
        "value": 1.9
    }, {
        "name": "test1",
        "value": 3.1
    }]
}
>>> deep_get(res, "info")
{'name': 'Test', 'date': '2021-06-12'}
>>> deep_get(res, "info.date")
'2021-06-12'
>>> deep_get(res, "result")
[{'name': 'test1', 'value': 2.5}, {'name': 'test2', 'value': 1.9}, {'name': 'test1', 'value': 3.1}]
>>> deep_get(res, "result[2]")
{'name': 'test1', 'value': 3.1}
>>> deep_get(res, "result[-1]")
{'name': 'test1', 'value': 3.1}
>>> deep_get(res, "result[2].name")
'test1'
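A note on misses: if any step fails (unknown key, out-of-range index, or indexing into a non-container), the whole lookup returns None instead of raising:

>>> deep_get(res, "result[5].name") is None
True
>>> deep_get(res, "info.missing") is None
True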
I am migrating my code from Java to Python, but I am still having some difficulties understanding how to fetch a specific path in JSON using Python.
This is my Java code, which returns a list of accountIds.
public static List<String> v02_JSON_counterparties(String date) {
    baseURI = "https://cdwp/cdw";
    String counterparties =
        given()
            .auth().basic(getJiraUser(), getJiraPass())
            .param("limit", "1000000")
            .param("count", "false")
            .when()
            .get("/counterparties/" + date).body().asString();
    List<String> accountId = extract_accountId(counterparties);
    return accountId;
}

public static List<String> extract_accountId(String res) {
    List<String> ids = JsonPath.read(res, "$..identifier[?(@.accountIdType == 'ACCOUNTID')].accountId");
    return ids;
}
And this is the JSON structure from which I am getting the accountId.
{
    'organisationId': {
        '#value': 'MHI'
    },
    'accountName': 'LAZARD AM DEUT AC LF1632',
    'identifiers': {
        'accountId': 'LAZDLF1632',
        'customerId': 'LAZAMDEUSG',
        'blockAccountCode': 'LAZDEUBDBL',
        'bic': 'LAMDDEF1XXX',
        'identifier': [{
            'accountId': 'MHI',
            'accountIdType': 'REVNCNTR'
        }, {
            'accountId': 'LAZDLF1632',
            'accountIdType': 'ACCOUNTID'
        }, {
            'accountId': 'LAZAMDEUSG',
            'accountIdType': 'MHICUSTID'
        }, {
            'accountId': 'LAZDEUBDBL',
            'accountIdType': 'BLOCKACCOUNT'
        }, {
            'accountId': 'LAMDDEF1XXX',
            'accountIdType': 'ACCOUNTBIC'
        }, {
            'accountId': 'LAZDLF1632',
            'accountIdType': 'GLOBEOP'
        }]
    },
    'isBlocAccount': 'N',
    'accountStatus': 'COMPLETE',
    'products': {
        'productType': [{
            'productLineName': 'CASH',
            'productTypeId': 'PRODMHI1',
            'productTypeName': 'Bond, Equity,Convertible Bond',
            'cleared': 'N',
            'bilateral': 'N',
            'limitInstructions': {
                'limitInstruction': [{
                    'limitAmount': '0',
                    'limitCurrency': 'GBP',
                    'limitType': 'PEAEXPLI',
                    'limitTypeName': 'Cash-Peak Exposure Limit'
                }]
            }
        }]
    },
    'etc': {
        'addressGeneral': 'LZFLUS33XXX',
        'addressAccount': 'LF1632',
        'tradingLevel': 'B'
    },
    'clientBroker': 'C',
    'costCentre': 'Credit Sales',
    'clientLevel': 'SUBAC',
    'accountCreationDate': '2016-10-19T00:00:00.000Z',
    'accountOpeningDate': '2016-10-19T00:00:00.000Z'
}
This is my code in Python
import json, requests, urllib.parse, re
from pandas.io.parsers import read_csv
import pandas as pd
from termcolor import colored
import numpy as np
from glob import glob
import os
# Set Up
dateinplay = "2021-09-27"
#Get accountId
cdwCounterparties = (
    f"http://cdwu/cdw/counterparties/?limit=1000000?yyyy-mm-dd={dateinplay}"
)
r = json.loads(requests.get(cdwCounterparties).text)
account_ids = [i['accountId'] for i in data['identifiers']['identifier'] if i['accountIdType'] == "ACCOUNTID"]
I am getting this error when I try to fetch the accountId:
Traceback (most recent call last):
File "h:\DESKTOP\test_check\checkCounterpartie.py", line 54, in <module>
account_ids = [i['accountId'] for i in data['identifiers']['identifier']if i['accountIdType']=="ACCOUNTID"]
TypeError: list indices must be integers or slices, not str
If I'm interpreting your question correctly, you want all ids where accountIdType is "ACCOUNTID".
This gives you that:
account_ids = [i['accountId'] for i in data['identifiers']['identifier'] if i['accountIdType'] == "ACCOUNTID"]
accs = {
    "identifiers": {
        ...

account_id_list = []
for acc in accs.get("identifiers", {}).get("identifier", []):
    account_id_list.append(acc.get("accountId", ""))
This creates a list called account_id_list, which is:
['MHI', 'DKEPBNPGIV', 'DKEPLLP SG', 'DAVKEMEQBL', '401821', 'DKEPGB21XXX', 'DKEPBNPGIV', 'DKPARTNR']
Assuming you store the dictionary (JSON structure) in variable x, getting all accountIds is something like:
account_ids = [i['accountId'] for i in x['identifiers']['identifier']]
I'd like to thank you all for your answers. It helped me a lot to find a resolution to my problem.
Below is how I solved it.
import requests
from jsonpath_ng import parse  # assumed: jsonpath-ng, which provides the parse()/find() API used below

listAccountId = []
cdwCounterparties = (
    f"http://cdwu/cdw/counterparties/?limit=100000?yyyy-mm-dd={dateinplay}"
)
r = requests.get(cdwCounterparties).json()
jsonpath_expression = parse("$..accounts.account[*].identifiers.identifier[*]")
for match in jsonpath_expression.find(r):
    # print(f'match id: {match.value}')
    thisdict = match.value
    if thisdict["accountIdType"] == "ACCOUNTID":
        # print(thisdict["accountId"])
        listAccountId.append(thisdict["accountId"])
I followed the reference code in the guide -- https://developers.google.com/sheets/api/guides/values. However, I am getting the error "Invalid JSON payload received. Unknown name "range" at 'data[0]': Proto field is not repeating, cannot start list" when I call batchUpdate().
Any suggestion on what may be wrong and how to fix it?
# Preparing data to be written back to sheet
data = [
    {
        'range': range_name,
        'values': values
    },
]
body = {
    'valueInputOption': "USER_ENTERED",
    'data': data,
}
request = service.spreadsheets().values().batchUpdate(spreadsheetId=SPREADSHEET_ID, body=body)
response = request.execute()
content of "body" =
{'valueInputOption': 'USER_ENTERED',
'data': [{'range': ['!B251:I251', '!B252:I252', '!B254:I254', '!B256:I256'],
'values': "[['2020-06-04', 2, '123456789098765421abcdefg', '', 'test1', 1, None, 1], ['2020-06-04', 1, '123456789098765421abcdefg', '', 'test2', -1, None, 1], ['2020-06-04', 2, '123456789098765421abcdefg', '', 'test1', 3, None, 1], ['2020-06-04', 1, '123456789098765421abcdefg', '', 'test9', 4, None, 1]]"}]}
From the guide,
values = [
    [
        # Cell values ...
    ],
    # Additional rows
]
data = [
    {
        'range': range_name,
        'values': values
    },
    # Additional ranges to update ...
]
body = {
    'valueInputOption': value_input_option,
    'data': data
}
result = service.spreadsheets().values().batchUpdate(
    spreadsheetId=spreadsheet_id, body=body).execute()
print('{0} cells updated.'.format(result.get('totalUpdatedCells')))
Yet, the error msg seems to indicate the param 'range' is not known?
Several things:
You are trying to pass to range an array of ranges, while the method expects a single range per item; the correct request would be:
"data": [
{
"range1": "",
"values1": []
},
{
"range2": "",
"values2": []
}
]
You use None for an empty value; instead, leave the value empty between the commas: 1,2, ,4 (or wrap None in double quotes, "None", if you want to pass it as a string).
A JSON object expects double quotes " instead of single quotes ' for values and range notations.
I recommend testing your request syntax with the Try this API feature before implementing it in Python.
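Applied to the body you printed, a sketch of a corrected payload (range strings kept as posted; they presumably still need a sheet name before the !): one {range, values} object per range, values as a list of rows rather than a string, and empty strings instead of None:

data = [
    {"range": "!B251:I251", "values": [["2020-06-04", 2, "123456789098765421abcdefg", "", "test1", 1, "", 1]]},
    {"range": "!B252:I252", "values": [["2020-06-04", 1, "123456789098765421abcdefg", "", "test2", -1, "", 1]]},
    # ... one entry per remaining range
]
body = {"valueInputOption": "USER_ENTERED", "data": data}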
I'm trying to add a timestamp to my data, have elasticsearch-py bulk index it, and then display the data with kibana.
My data is showing up in kibana, but my timestamp is not being used. When I go to the "Discovery" tab after configuring my index pattern, I get 0 results (yes, I tried adjusting the search time).
Here is what my bulk index json looks like:
{'index':
    {'_timestamp': u'2015-08-11 14:18:26',
     '_type': 'webapp_fingerprint',
     '_id': u'webapp_id_redacted_2015_08_13_12_39_34',
     '_index': 'webapp_index'
    }
}
*** JSON DATA HERE ***
This will be accepted by elasticsearch and will get imported into Kibana, but the _timestamp field will not actually be indexed (it does show up in the dropdown when configuring an index pattern under "Time-field name").
I also tried formatting the metaFields like this:
{'index': {
     '_type': 'webapp_fingerprint',
     '_id': u'webapp_id_redacted_2015_08_13_12_50_04',
     '_index': 'webapp_index'
 },
 'source': {
     '_timestamp': {
         'path': u'2015-08-11 14:18:26',
         'enabled': True,
         'format': 'YYYY-MM-DD HH:mm:ss'
     }
 }
}
This also doesn't work.
Finally, I tried including the _timestamp field within the index and applying the format, but I got an error with elasticsearch.
{'index': {
     '_timestamp': {
         'path': u'2015-08-11 14:18:26',
         'enabled': True,
         'format': 'YYYY-MM-DD HH:mm:ss'
     },
     '_type': 'webapp_fingerprint',
     '_id': u'webapp_id_redacted_2015_08_13_12_55_53',
     '_index': 'webapp_index'
}}
The error is:
elasticsearch.exceptions.TransportError: TransportError(500, u'IllegalArgumentException[Malformed action/metadata line [1], expected a simple value for field [_timestamp] but found [START_OBJECT]]')
Any help someone can provide would be greatly appreciated. I apologize if I haven't explained the issue well enough. Let me know if I need to clarify more. Thanks.
Fixed my own problem. Basically, I needed to add mappings for the timestamp when I created the index.
request_body = {
    "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 0
    },
    "mappings": {
        "_default_": {
            "_timestamp": {
                "enabled": "true",
                "store": "true",
                "path": "plugins.time_stamp.string",
                "format": "yyyy-MM-dd HH:m:ss"
            }
        }
    }
}
print("creating '%s' index..." % (index_name))
res = es.indices.create(index=index_name, body=request_body)
print(" response: '%s'" % (res))
In the latest versions of Elasticsearch, just using the PUT/POST API with ISO-format timestamp strings should work.
import datetime
import json
import requests

query = json.dumps(
    {
        "createdAt": datetime.datetime.now().replace(microsecond=0).isoformat(),
    }
)
response = requests.post("https://search-XYZ.com/your-index/log", data=query,
                         headers={'Content-Type': 'application/json'})
print(response)
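A sketch of the same idea with the elasticsearch-py client instead of raw requests (the URL and index name are placeholders; the document= keyword assumes the 8.x client):

import datetime
from elasticsearch import Elasticsearch

es = Elasticsearch("https://search-XYZ.com")
es.index(
    index="your-index",
    document={"createdAt": datetime.datetime.now().replace(microsecond=0).isoformat()},
)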