python parsing strange JSON data


How should I parse data in this "unusual" format with Python 3?
As you can see, inside the "variables" dictionary the names in capitals have no label of their own; each name is itself the key. So when I loop over the entries in "variables", all I get is the strings in capitals and nothing else. I need to get each capitalized name together with the value stored under it.
{
"variables": {
"ABSENCE_OSL_PROD": {
"value": "REZWWnBTejN5Ng=="
},
"ACTION_OSL_INT": {
"value": "S0RXSVNTbmFhNw=="
},
"ACTION_OSL_PROD": {
"value": "RUJCaDJGnmFnUg=="
},
"API_STORE_OSL_INT": {
"value": "U3lxaVhogWtIcg=="
}
},
"id": 4,
"type": "Vsts"
}

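To load the entries of "variables" as names in the local namespace (this only works reliably at module level, where locals() is the same dictionary as globals(); inside a function, writing to locals() does not create variables):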
data = {
"variables": {
"ABSENCE_OSL_PROD": {
"value": "REZWWnBTejN5Ng=="
},
"ACTION_OSL_INT": {
"value": "S0RXSVNTbmFhNw=="
},
"ACTION_OSL_PROD": {
"value": "RUJCaDJGnmFnUg=="
},
"API_STORE_OSL_INT": {
"value": "U3lxaVhogWtIcg=="
}
},
"id": 4,
"type": "Vsts"
}
for variable_name, variable_content in data['variables'].items():
    locals()[variable_name] = variable_content['value']
print(ABSENCE_OSL_PROD)
# prints REZWWnBTejN5Ng==

A dict comprehension gives you a compact and efficient way to flatten the structure:
ugly = {
"variables": {
"ABSENCE_OSL_PROD": {
"value": "REZWWnBTejN5Ng=="
},
"ACTION_OSL_INT": {
"value": "S0RXSVNTbmFhNw=="
},
"ACTION_OSL_PROD": {
"value": "RUJCaDJGnmFnUg=="
},
"API_STORE_OSL_INT": {
"value": "U3lxaVhogWtIcg=="
}
},
"id": 4,
"type": "Vsts"
}
proper = {name: entry["value"] for name, entry in ugly["variables"].items()}
print(proper)
This prints:
{'ABSENCE_OSL_PROD': 'REZWWnBTejN5Ng==', 'ACTION_OSL_INT': 'S0RXSVNTbmFhNw==', 'ACTION_OSL_PROD': 'RUJCaDJGnmFnUg==', 'API_STORE_OSL_INT': 'U3lxaVhogWtIcg=='}
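If the data arrives as JSON text rather than as a Python literal, the same idea should apply once it has been parsed. The sketch below assumes the payload is held in a string called raw (a hypothetical name, with the data shortened to two entries); iterating over .items() yields each capitalized name together with its inner dictionary.

```python
import json

# Hypothetical input: the JSON from the question, shortened to two entries.
raw = """
{
  "variables": {
    "ABSENCE_OSL_PROD": {"value": "REZWWnBTejN5Ng=="},
    "ACTION_OSL_INT": {"value": "S0RXSVNTbmFhNw=="}
  },
  "id": 4,
  "type": "Vsts"
}
"""

doc = json.loads(raw)

# .items() yields (key, value) pairs, so the capitalized name and its
# inner dictionary are both available in each iteration.
pairs = [(name, entry["value"]) for name, entry in doc["variables"].items()]

for name, value in pairs:
    print(name, value)
```

Looping over the dictionary directly (for name in doc["variables"]) yields only the keys, which is why the loop in the question shows nothing but the capitalized strings.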

Related

JSON ID extraction from Array in Python

I wrote a script to pull data from Verizon's connectivity management API with the following. Below is the section that requests the line information based on the search item, in this case, the SIM or iccid. I did not include the previous parts because they are just to connect to the API and get credentials.
header = {
'accept': 'application/json',
'VZ-M2M-Token': session_token,
'Authorization': 'Bearer' + bearer_token,
'Content-Type': 'application/json',
}
data = '{ "deviceId": { "id": ' + SIM +', "kind": "ICCID" }}'
response = requests.post('https://thingspace.verizon.com/api/m2m/v1/devices/actions/list', headers=header, data=data)
The response I get is JSON that looks like this:
{
"hasMoreData": false,
"devices": [
{
"accountName": "123456789-00001",
"billingCycleEndDate": "2020-10-31T20:00:00-04:00",
"carrierInformations": [
{
"carrierName": "Verizon Wireless",
"servicePlan": "3rrrrx48wwwwrjgjtyjtyjtyjtyj",
"state": "active"
}
],
"connected": true,
"createdAt": "2016-11-04T11:06:28-04:00",
"deviceIds": [
{
"id": "5256694405",
"kind": "mdn"
},
{
"id": "3114949302094150",
"kind": "imsi"
},
{
"id": "35922505468230",
"kind": "imei"
},
{
"id": "891480000054957290575",
"kind": "iccId"
},
{
"id": "15256694405",
"kind": "msisdn"
},
{
"id": "5256694405",
"kind": "min"
}
],
"extendedAttributes": [
{
"key": "PrimaryPlaceOfUseTitle"
},
{
"key": "PrimaryPlaceOfUseFirstName",
"value": "5256694405"
},
{
"key": "PrimaryPlaceOfUseMiddleName"
},
{
"key": "PrimaryPlaceOfUseLastName",
"value": "ESN"
},
{
"key": "PrimaryPlaceOfUseSuffix"
},
{
"key": "PrimaryPlaceOfUseAddressLine1"
},
{
"key": "PrimaryPlaceOfUseAddressLine2"
},
{
"key": "PrimaryPlaceOfUseCity"
},
{
"key": "PrimaryPlaceOfUseState"
},
{
"key": "PrimaryPlaceOfUseCountry"
},
{
"key": "PrimaryPlaceOfUseZipCode"
},
{
"key": "PrimaryPlaceOfUseZipCode4"
},
{
"key": "PrimaryPlaceOfUseCBRPhone"
},
{
"key": "PrimaryPlaceOfUseCBRPhoneType"
},
{
"key": "PrimaryPlaceOfUseEmailAddress"
},
{
"key": "AccountNumber",
"value": "5256694405-00001"
},
{
"key": "SmsrOid"
},
{
"key": "ProfileStatus"
},
{
"key": "PromoCodes",
"value": ""
},
{
"key": "PromotionStartDate",
"value": ""
},
{
"key": "PromotionScheduledEndDate",
"value": ""
},
{
"key": "LeadId",
"value": ""
},
{
"key": "CustomerName",
"value": ""
},
{
"key": "CustomerAddressLine1",
"value": ""
},
{
"key": "CustomerAddressLine2",
"value": ""
},
{
"key": "CustomerAddressCity",
"value": ""
},
{
"key": "CustomerAddressState",
"value": ""
},
{
"key": "CustomerAddressZipCode",
"value": ""
},
{
"key": "ServiceZipCode",
"value": ""
},
{
"key": "SkuNumber",
"value": "VZW080000460053"
},
{
"key": "CostCenterCode"
},
{
"key": "PreIMEI",
"value": "3592254564568445"
},
{
"key": "PreSKU",
"value": "VZW080000100037"
},
{
"key": "SIMOTADate",
"value": "4/30/2020 1:22:18 PM"
},
{
"key": "RoamingStatus",
"value": "NotRoaming"
},
{
"key": "LastRoamingStatusUpdate",
"value": "9/24/2020 5:40:26 PM"
}
],
"groupNames": [
"Default: 0220433754-00001"
],
"ipAddress": "100.100.100.100",
"lastActivationBy": "User Verizon",
"lastActivationDate": "2016-11-04T11:06:28-04:00",
"lastConnectionDate": "2020-09-24T13:40:26-04:00"
}
]
}
I added a part to my script to pull the mdn, iccid and the imei from the array with the code that is below.
def puller(line_json):
    line_data = json.loads(line_json)
    mdn = (line_data['devices'][0]['deviceIds'][0]['id'])
    iccid = (line_data['devices'][0]['deviceIds'][3]['id'])
    imei = (line_data['devices'][0]['deviceIds'][2]['id'])
    print('phone = ' ,mdn)
    print('SIM = ' , iccid)
    print('IMEI = ' , imei)
I tested this code and it works the way it should with one test ID. I then tested with another ID and learned that the array structure is not always the same; that second JSON response is below. Is there a better way to find the specific values I want without hard-coding where in the structure each item sits, as I did above?
{
"hasMoreData": false,
"devices": [
{
"accountName": "02234234234-00001",
"billingCycleEndDate": "2020-10-31T20:00:00-04:00",
"carrierInformations": [
{
"carrierName": "Verizon Wireless",
"servicePlan": "37776xdsfewsfwe576193",
"state": "active"
}
],
"connected": true,
"createdAt": "2016-05-24T15:55:06-04:00",
"deviceIds": [
{
"id": "0945437676404",
"kind": "esn"
},
{
"id": "1234565799",
"kind": "mdn"
},
{
"id": "31148454545458767",
"kind": "imsi"
},
{
"id": "01426786678211",
"kind": "imei"
},
{
"id": "89148000006456456454",
"kind": "iccId"
},
{
"id": "1234565799",
"kind": "min"
}
],
"extendedAttributes": [
{
"key": "PrimaryPlaceOfUseTitle"
},
{
"key": "PrimaryPlaceOfUseFirstName",
"value": "096114564506772"
},
{
"key": "PrimaryPlaceOfUseMiddleName"
},
{
"key": "PrimaryPlaceOfUseLastName",
"value": "096546454806772"
},
{
"key": "PrimaryPlaceOfUseSuffix"
},
{
"key": "PrimaryPlaceOfUseAddressLine1"
},
{
"key": "PrimaryPlaceOfUseAddressLine2"
},
{
"key": "PrimaryPlaceOfUseCity"
},
{
"key": "PrimaryPlaceOfUseState"
},
{
"key": "PrimaryPlaceOfUseCountry"
},
{
"key": "PrimaryPlaceOfUseZipCode"
},
{
"key": "PrimaryPlaceOfUseZipCode4"
},
{
"key": "PrimaryPlaceOfUseCBRPhone"
},
{
"key": "PrimaryPlaceOfUseCBRPhoneType"
},
{
"key": "PrimaryPlaceOfUseEmailAddress"
},
{
"key": "AccountNumber",
"value": "02242342354-00001"
},
{
"key": "SmsrOid"
},
{
"key": "ProfileStatus"
},
{
"key": "PromoCodes",
"value": ""
},
{
"key": "PromotionStartDate",
"value": ""
},
{
"key": "PromotionScheduledEndDate",
"value": ""
},
{
"key": "LeadId",
"value": ""
},
{
"key": "CustomerName",
"value": ""
},
{
"key": "CustomerAddressLine1",
"value": ""
},
{
"key": "CustomerAddressLine2",
"value": ""
},
{
"key": "CustomerAddressCity",
"value": ""
},
{
"key": "CustomerAddressState",
"value": ""
},
{
"key": "CustomerAddressZipCode",
"value": ""
},
{
"key": "ServiceZipCode",
"value": ""
},
{
"key": "SkuNumber",
"value": "VZW12000364343005"
},
{
"key": "CostCenterCode"
},
{
"key": "PreIMEI"
},
{
"key": "PreSKU",
"value": "VZW12000334340005"
},
{
"key": "SIMOTADate",
"value": "3/13/2020 10:52:07 AM"
},
{
"key": "RoamingStatus",
"value": "NotRoaming"
},
{
"key": "LastRoamingStatusUpdate",
"value": "10/20/2020 6:14:20 PM"
}
],
"groupNames": [
"Default: 02342343754-00001"
],
"ipAddress": "101.101.101.101",
"lastActivationBy": "User Verizon",
"lastActivationDate": "2016-05-24T15:55:16-04:00",
"lastConnectionDate": "2020-10-20T14:14:20-04:00"
}
]
}
After some research I tried the following block of code to find the value I was looking for, in this case the mdn. The problem is that it returns an empty pair of brackets with no information, so I have probably done something wrong.
def json_extract(obj, kind):
    """Recursively fetch values from nested JSON."""
    arr = []

    def extract(obj, arr, kind):
        """Recursively search for values of key in JSON tree."""
        if isinstance(obj, dict):
            for k, v in obj.items():
                if isinstance(v, (dict, list)):
                    extract(v, arr, kind)
                elif k == kind:
                    arr.append(v)
        elif isinstance(obj, list):
            for item in obj:
                extract(item, arr, kind)
        return arr

    values = extract(obj, arr, kind)
    return values

names = json_extract(response , 'mdn')
print(names)

As I understand it, you want the mdn, iccid and imei ids from the JSON above. Your json_extract returns an empty list because it is given the response object rather than the parsed JSON, and because it looks for a key named mdn, whereas in this data mdn is a value stored under the kind key. Instead of recursion, it is easier to let Python's built-in functions do the work.
You can use the next function for this:
import json

# load your JSON data (for example, the text of the API response)
line_data = json.loads(data)
# narrow your focus on the array in question
device_ids = line_data['devices'][0]['deviceIds']
# Get the id of the first item whose kind matches, or None if no item matches.
# The kind is lowercased before comparing because the API spells it "iccId".
mdn = next((x['id'] for x in device_ids if x['kind'].lower() == "mdn"), None)
iccid = next((x['id'] for x in device_ids if x['kind'].lower() == "iccid"), None)
imei = next((x['id'] for x in device_ids if x['kind'].lower() == "imei"), None)
You will need to handle None in case no matching element is found in the array.
Reference : Find object in list that has attribute equal to some value (that meets any condition)
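When several kinds are needed from the same device, one pass that builds a lookup dictionary may be more convenient than repeated searches. This is only a sketch: ids_by_kind is a hypothetical helper name, the device entry is trimmed from the second response in the question, and kinds are lowercased on the assumption that the spelling (for example "iccId") should not matter.

```python
def ids_by_kind(device):
    """Map each lowercased 'kind' to its 'id' for one device entry."""
    return {
        entry["kind"].lower(): entry["id"]
        for entry in device.get("deviceIds", [])
    }


# Hypothetical device entry, trimmed from the second response in the question.
device = {
    "deviceIds": [
        {"id": "0945437676404", "kind": "esn"},
        {"id": "1234565799", "kind": "mdn"},
        {"id": "01426786678211", "kind": "imei"},
        {"id": "89148000006456456454", "kind": "iccId"},
    ]
}

ids = ids_by_kind(device)

# .get() returns None when a kind is absent, so the position of an
# entry in the list no longer matters.
print("phone =", ids.get("mdn"))
print("SIM =", ids.get("iccid"))
print("IMEI =", ids.get("imei"))
print("msisdn =", ids.get("msisdn"))
```

If a device lists the same kind twice, the later entry overwrites the earlier one in this sketch, which may or may not be what you want.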

How to erase/delete curly brackets/braces from a dictionary?

My JSON response looks like this. I want to remove the curly brackets (marked with ** around them) so I can get the values under the card key. Can I do that, or will it mess up the entire dictionary? If it would, I want to put a key in front of each of those curly brackets instead.
I hope someone can help me with this, and if you can explain a bit more about dictionaries in Python I would be thrilled!
[
**{**
"board": {
"id": "5f2106f0a188d073ebf3604b",
"name": "TrAPI_test",
"shortLink": "OIeEN1vG"
},
"card": {
"id": "5f236a13a64ee90e7ef95341",
"idShort": 3,
"name": "task3",
"shortLink": "WNHiHWxh"
},
"idMember": "5e1d96663a14c86d44d0edc4",
"member": {
"id": "5e1d96663a14c86d44d0edc4",
"name": "Zorigt"
}
**}**,
{
"board": {
"id": "5f2106f0a188d073ebf3604b",
"name": "TrAPI_test",
"shortLink": "OIeEN1vG"
},
"card": {
"id": "5f236a13a64ee90e7ef95341",
"idShort": 3,
"name": "task3",
"shortLink": "WNHiHWxh"
},
"list": {
"id": "5f22161e221bea80b90d96ad",
"name": "SprintTask"
}
}
]

I was able to get it like this because it was a list of multiple dictionaries all along :))
{
"0": {
"board": {
"id": "5f2106f0a188d073ebf3604b",
"name": "TrAPI_test",
"shortLink": "OIeEN1vG"
},
"card": {
"id": "5f236a13a64ee90e7ef95341",
"idShort": 3,
"name": "task3",
"shortLink": "WNHiHWxh"
},
"idMember": "5e1d96663a14c86d44d0edc4",
"member": {
"id": "5e1d96663a14c86d44d0edc4",
"name": "Zorigt"
}
},
"1": {
"board": {
"id": "5f2106f0a188d073ebf3604b",
"name": "TrAPI_test",
"shortLink": "OIeEN1vG"
},
"card": {
"id": "5f236a13a64ee90e7ef95341",
"idShort": 3,
"name": "task3",
"shortLink": "WNHiHWxh"
},
"list": {
"id": "5f22161e221bea80b90d96ad",
"name": "SprintTask"
}
}
}
Use this:
# List is the parsed JSON response, a list of dictionaries
Dict_convert = {}
for idx, val in enumerate(List):
    Dict_convert[idx] = val
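Since the response is already a list of dictionaries, the card values can probably be read without converting or removing anything: index into the list, then into each dictionary. A minimal sketch, assuming the parsed response is held in a variable called actions (a hypothetical name) and using a trimmed copy of the data from the question:

```python
# Hypothetical, trimmed version of the parsed response: a list of dictionaries.
actions = [
    {
        "card": {"id": "5f236a13a64ee90e7ef95341", "idShort": 3, "name": "task3"},
        "member": {"id": "5e1d96663a14c86d44d0edc4", "name": "Zorigt"},
    },
    {
        "card": {"id": "5f236a13a64ee90e7ef95341", "idShort": 3, "name": "task3"},
        "list": {"id": "5f22161e221bea80b90d96ad", "name": "SprintTask"},
    },
]

# The square brackets are a list, so the first element is actions[0];
# each element is a dictionary, so its card is element["card"].
first_card_name = actions[0]["card"]["name"]

# Collect one field from every element.
card_names = [action["card"]["name"] for action in actions]

# Not every element has a "member" key, so fall back to None with .get().
member_names = [action.get("member", {}).get("name") for action in actions]

print(first_card_name, card_names, member_names)
```

The curly brackets are not something to delete: they are how a dictionary is written, and each pair in the list marks one separate dictionary.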

Reorder and return the whole of nested dictionary

I am trying to keep the whole of a nested dictionary, with only its contents reordered.
This is an example of my nested dictionaries (pardon the long example):
{
"pages": {
"rotatingTest": {
"elements": {
"apvfafwkbnjn2bjt": {
"name": "animRot_tilt40_v001",
"data": {
"description": "tilt testing",
"project": "TEST",
"created": "26/11/18 16:32",
},
"type": "AnimWidget",
"uid": "apvfafwkbnjn2bjt"
},
"p0pkje1hjcc9jukq": {
"name": "poseRot_positionD_v003",
"data": {
"description": "posing test for positionD",
"created": "10/01/18 14:16",
"project": "TEST",
},
"type": "PosedWidget",
"uid": "p0pkje1hjcc9jukq"
},
"k1gzzc5uy1ynqtnj": {
"name": "animRot_positionH_v001",
"data": {
"description": "rotational posing test for positionH",
"created": "13/06/18 14:19",
"project": "TEST",
},
"type": "AnimWidget",
"uid": "k1gzzc5uy1ynqtnj"
}
}
},
"panningTest": {
"elements": {
"7lyuri8g8u5ctwsa": {
"name": "posePan_positionZ_v001",
"data": {
"description": "panning test for posZ",
"created": "04/10/18 12:43",
"project": "TEST",
},
"type": "PosedWidget",
"uid": "7lyuri8g8u5ctwsa"
}
}
},
"zoomingTest": {
"elements": {
"prtn0i6ehudhz475": {
"name": "posZoom_positionH_v010",
"data": {
"description": "zoom test",
"created": "11/10/18 12:42",
"project": "TEST",
},
"type": "PosedWidget",
"uid": "prtn0i6ehudhz475"
}
}
}
},
"page_order": [
"rotatingTest",
"zoomingTest",
"panningTest"
]
}
and this is my code:
for k1, v1 in test_dict.get('pages', {}).items():
    return (sorted(v1.get('elements').items(), key=lambda (k2,v2): v2['data']['created']))
In the result, keys such as page_order and pages are missing.
Is there a command that lets me keep the whole of the dictionary?
Thanks in advance for any advice.

If you're using Python 3.7 or later, a dict preserves insertion order. Otherwise, you need to use an OrderedDict. Additionally, you need to convert the date string to a datetime to get the correct sort order:
from datetime import datetime

def sortedPage(d):
    # Sort each page's elements by their parsed 'created' timestamp.
    created = lambda item: datetime.strptime(item[1]['data']['created'], '%d/%m/%y %H:%M')
    return {k: {'elements': dict(sorted(v['elements'].items(), key=created))} for k, v in d.items()}

# test_dict is the nested dictionary from the question
output = {k: sortedPage(v) if k == 'pages' else v for k, v in test_dict.items()}
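To see the effect on a small input, here is a self-contained sketch of the same approach. It assumes Python 3.7 or later for dict ordering, uses a hypothetical function name (sort_elements_by_created), and trims the data from the question to one page with only the created field.

```python
from datetime import datetime


def sort_elements_by_created(doc):
    """Return a copy of doc with each page's elements ordered by creation time."""

    def created(item):
        # item is a (uid, element) pair; parse day/month/two-digit-year.
        return datetime.strptime(item[1]["data"]["created"], "%d/%m/%y %H:%M")

    pages = {
        name: {**page, "elements": dict(sorted(page["elements"].items(), key=created))}
        for name, page in doc["pages"].items()
    }
    # Every other top-level key (such as page_order) is carried over unchanged.
    return {**doc, "pages": pages}


# Trimmed, hypothetical copy of the structure in the question.
doc = {
    "pages": {
        "rotatingTest": {
            "elements": {
                "apvfafwkbnjn2bjt": {"data": {"created": "26/11/18 16:32"}},
                "p0pkje1hjcc9jukq": {"data": {"created": "10/01/18 14:16"}},
                "k1gzzc5uy1ynqtnj": {"data": {"created": "13/06/18 14:19"}},
            }
        }
    },
    "page_order": ["rotatingTest"],
}

result = sort_elements_by_created(doc)
print(list(result["pages"]["rotatingTest"]["elements"]))
# prints ['p0pkje1hjcc9jukq', 'k1gzzc5uy1ynqtnj', 'apvfafwkbnjn2bjt']
```

Parsing the timestamps matters because the strings start with the day, so comparing them as text would order by day of month before month and year.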

How to specify Stopwords in Elasticsearch mapping using python

I have this Python code where I first create an Elasticsearch mapping and then, after the data is inserted, search that data:
# Create Data mapping
data_mapping = {
"mappings": {
(doc_type): {
"properties": {
"data_id": {
"type": "string",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"data":{
"type": "array",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"resp": {
"type": "string",
"fields": {
"stemmed": {
"type": "string",
"analyzer": "english"
}
}
},
"update": {
"type": "integer",
"fields": {
"stemmed": {
"type": "integer",
"analyzer": "english"
}
}
}
}
}
}
}
#Search
data_search = {
"query": {
"function_score": {
"query": {
"match": {
'data': question
}
},
"field_value_factor": {
"field": "update",
"modifier": "log2p"
}
}
}
}
response = es.search(index=doc_type, body=data_search)
What I am unable to figure out is where and how to specify stopwords in the above code. This link gives an example of using stopwords, but I am unable to relate it to my code. Do I need to specify them in the data mapping section, the search section, or both? And how do I specify them?
Any example would be appreciated!
UPDATE: Based on some comments, the suggestion is to add either an analysis section or a settings section, but I am not sure how to add those to the mapping I have written above.
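One possible shape for this, offered as a sketch rather than a tested answer: Elasticsearch lets an index define a custom analyzer under settings.analysis, and the built-in language analyzers accept a stopwords parameter. The analyzer name english_custom_stop, the stopword list, and the build_index_body helper below are all hypothetical, and the "string" field type is kept only because the question uses it (newer Elasticsearch versions use "text"). The body would be passed when the index is created, alongside the mappings.

```python
import json

# Hypothetical stopword list, chosen only for illustration.
CUSTOM_STOPWORDS = ["a", "an", "the", "is", "of"]


def build_index_body(doc_type, stopwords):
    """Build an index-creation body with a custom analyzer and one mapped field."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    # Hypothetical analyzer name; it is the built-in english
                    # analyzer configured with a custom stopword list.
                    "english_custom_stop": {
                        "type": "english",
                        "stopwords": stopwords,
                    }
                }
            }
        },
        "mappings": {
            doc_type: {
                "properties": {
                    "data": {
                        "type": "string",
                        # Refer to the analyzer defined in settings by name.
                        "analyzer": "english_custom_stop",
                    }
                }
            }
        },
    }


index_body = build_index_body("my_doc_type", CUSTOM_STOPWORDS)
print(json.dumps(index_body, indent=2))
```

Under this arrangement the stopwords live with the index definition, and the search body would stay as it is, since a match query normally analyzes its text with the analyzer configured for the field.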

Improving elasticsearch performance

I'm using Elasticsearch in a Python web app to query news documents. There are currently 100,000 documents in the database.
The original database is MongoDB, and Elasticsearch is connected to it through the mongoriver plugin.
The problem is that the function takes ~850 ms to return the results. I'd like to reduce that as much as possible.
Here's the Python code I'm using to query the database (the limit is usually 16):
def search_news(term, limit, page, flagged_articles):
    query = {
        "query": {
            "from": page*limit,
            "size": limit,
            "multi_match" : {
                "query" : term,
                "fields" : [ "title^3" , "category^5" , "entities.name^5", "art_text^1", "summary^1"]
            }
        },
        "filter" : {
            "not" : {
                "filter" : {
                    "ids" : {
                        "values" : flagged_articles
                    }
                },
                "_cache" : True
            }
        }
    }
    es_query = json_util.dumps(query)
    uri = 'http://localhost:9200/newsidx/_search'
    r = requests.get(uri, data=es_query)
    results = json.loads( r.text )
    data = []
    for res in results['hits']['hits']:
        data.append(res['_source'])
    return data
And here's the index mapping:
{
"news": {
"properties": {
"actual_rank": {
"type": "long"
},
"added": {
"type": "date",
"format": "dateOptionalTime"
},
"api_id": {
"type": "long"
},
"art_text": {
"type": "string"
},
"category": {
"type": "string"
},
"downvotes": {
"type": "long"
},
"entities": {
"properties": {
"etype": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"flags": {
"properties": {
"a": {
"type": "long"
},
"b": {
"type": "long"
},
"bad_image": {
"type": "long"
},
"c": {
"type": "long"
},
"d": {
"type": "long"
},
"innapropiate": {
"type": "long"
},
"irrelevant_info": {
"type": "long"
},
"miscategorized": {
"type": "long"
}
}
},
"media": {
"type": "string"
},
"published": {
"type": "string"
},
"published_date": {
"type": "date",
"format": "dateOptionalTime"
},
"show": {
"type": "boolean"
},
"source": {
"type": "string"
},
"source_rank": {
"type": "double"
},
"summary": {
"type": "string"
},
"times_showed": {
"type": "long"
},
"title": {
"type": "string"
},
"top_entities": {
"properties": {
"einfo_test": {
"type": "string"
},
"etype": {
"type": "string"
},
"name": {
"type": "string"
}
}
},
"tweet_article_poster": {
"type": "string"
},
"tweet_favourites": {
"type": "long"
},
"tweet_retweets": {
"type": "long"
},
"tweet_user_rank": {
"type": "double"
},
"upvotes": {
"type": "long"
},
"url": {
"type": "string"
}
}
}
}
Edit: The response time was measured on the server, given the tornado server information output.

I've rewritten your query somewhat here, moving from and size to the outer scope, adding the filtered query clause, and changing your not filter to a bool/must_not filter, which should be cached by default:
{
"query": {
"filtered": {
"query": {
"multi_match" : {
"query" : term,
"fields" : [ "title^3" , "category^5" , "entities.name^5", "art_text^1", "summary^1"]
}
},
"filter" : {
"bool" : {
"must_not" : {
"ids" : {"values" : flagged_articles}
}
}
}
}
},
"from": page * limit,
"size": limit
}
I haven't tested this, and I haven't made sense of your mapping as it is jumbled, so there might be some improvements to be made there.
Edit: This is a great read on why to use the bool filter: http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/ - in short, bool uses 'bitsets', which are very fast on subsequent queries.
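In the Python function from the question, the rewritten body could be assembled roughly as follows. This is an untested sketch: build_news_query is a hypothetical helper name, and the filtered query belongs to the older Elasticsearch versions the question appears to target (it was removed in later releases).

```python
def build_news_query(term, limit, page, flagged_articles):
    """Assemble the search body with paging at the top level."""
    return {
        "query": {
            "filtered": {
                "query": {
                    "multi_match": {
                        "query": term,
                        "fields": [
                            "title^3",
                            "category^5",
                            "entities.name^5",
                            "art_text^1",
                            "summary^1",
                        ],
                    }
                },
                # bool/must_not replaces the original "not" filter.
                "filter": {
                    "bool": {
                        "must_not": {"ids": {"values": flagged_articles}}
                    }
                },
            }
        },
        # Paging sits beside "query", not inside it.
        "from": page * limit,
        "size": limit,
    }


body = build_news_query("election", limit=16, page=2, flagged_articles=["a1", "b2"])
print(body["from"], body["size"])
```

The rest of search_news would stay the same, serializing this dictionary and sending it to the _search endpoint.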

First of all you can add the boosts to your mapping (assuming it doesn't interfere with your other queries) like this:
"title": {
"boost": 3.0,
"type": "string"
},
"category": {
"boost": 5.0,
"type": "string"
},
etc.
Then setup a bool query with field (or term) queries like this:
"query": {
"bool" : {
"should" : [ {
"field" : {
"title" : term
}
}, {
"field" : {
"category" : term
}
} ],
"must_not" : {
"ids" : {"values" : flagged_articles}
}
}
},
"from": page * limit,
"size": limit
This should perform better, but without access to your setup I can't test it :)
