The block of code below works, but I suspect it isn't very efficient due to my limited understanding of JSON, and I can't seem to figure out a better method.
The steam_game_db is like this:
{
    "applist": {
        "apps": [
            {
                "appid": 5,
                "name": "Dedicated Server"
            },
            {
                "appid": 7,
                "name": "Steam Client"
            },
            {
                "appid": 8,
                "name": "winui2"
            },
            {
                "appid": 10,
                "name": "Counter-Strike"
            }
        ]
    }
}
and my Python code so far is
import requests

# steam_game_db holds the API URL for the app list
i = 0
x = 570
req_name_from_id = requests.get(steam_game_db)
j = req_name_from_id.json()
while j["applist"]["apps"][i]["appid"] != x:
    i += 1
returned_game = j["applist"]["apps"][i]["name"]
print(returned_game)
Instead of looping through the entire app list, is there a smarter way to search for it? Ideally the elements with 'appid' and 'name' would be indexed the same as their corresponding 'appid',
i.e.
appid 570 in the list would be Dota 2.
However, element 570 in the data structure is actually appid 5069, Red Faction.
Also, what type of data structure is this? Perhaps it has already limited my ability to search for an answer. (It seems like a dictionary with an 'appid' and a 'name' for each element?)
EDIT: Changed to a for loop as suggested
# returned_id is a string holding the appid from another query
req_name_from_id = requests.get(steam_game_db)
j_2 = req_name_from_id.json()
for app in j_2["applist"]["apps"]:
    if app["appid"] == int(returned_id):
        returned_game = app["name"]
print(returned_game)
The most convenient way to access things by a key (like the app ID here) is to use a dictionary.
You pay a little extra performance cost up-front to fill the dictionary, but after that pulling out values by ID is basically free.
However, it's a trade-off. If you only do a single look-up during the lifetime of your Python program, paying the extra cost to build the dictionary won't be beneficial compared to the simple loop you already have. But if you do multiple look-ups, it will be.
# build the dictionary once
app_by_id = {}
for app in j["applist"]["apps"]:
    app_by_id[app["appid"]] = app["name"]
# use it (the appids are integers in the JSON, so look up by int)
print(app_by_id[570])
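The same lookup table can also be built in one pass with a dict comprehension; a minimal sketch:
# one-pass equivalent of the loop above
app_by_id = {app["appid"]: app["name"] for app in j["applist"]["apps"]}
print(app_by_id.get(570, "unknown appid"))  # .get avoids a KeyError for missing ids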
Also think about caching the JSON file on disk. This will save time during your program's startup.
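For example, a minimal caching sketch; the CACHE_PATH name and the load_app_list helper are my own inventions, not part of the original code:
import json
import os

import requests

CACHE_PATH = "steam_apps.json"  # hypothetical location for the cached copy

def load_app_list(url):
    """Return the parsed app list, fetching it only if no cached copy exists."""
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, encoding="utf-8") as handle:
            return json.load(handle)
    data = requests.get(url).json()  # fetch once, then reuse the file on later runs
    with open(CACHE_PATH, "w", encoding="utf-8") as handle:
        json.dump(data, handle)
    return data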
It's better to have the JSON file on disk: you can load it directly into a dictionary and start building up your lookup table. As an example, I've tried to keep your logic while using a dict for lookups. Don't forget to encode the names when printing them; the JSON has special characters in it.
Setup:
import json

apps = {}
with open('bigJson.json', encoding="utf-8") as handle:
    dictdump = json.loads(handle.read())

for item in dictdump['applist']['apps']:
    # setdefault keeps the first name seen if an appid ever repeats
    apps.setdefault(item['appid'], item['name'])
Usage 1:
That's the way you used it:
for appid in range(0, 570):
    if appid in apps:
        print(appid, apps[appid].encode("utf-8"))
Usage 2: That's how you can query a single key. Using get instead of [] will prevent a KeyError exception if the appid isn't recorded.
print(apps.get(570, 0))
Related
I'm writing a script to handle millions of dictionaries coming from many files of 1 million lines each.
The main purpose of this script is to create a JSON file and send it to Elasticsearch's bulk endpoint.
What I'm trying to do is read lines from the "entity" files and, for each entity, find its matching addresses in the "sub" files (for sub-entities). My problem is that the function which should associate them takes REALLY too long for a single iteration... Even after trying to optimize it as much as possible, the association is still insanely slow.
So, just to be clear about the data structure:
Entities are objects like Persons: an id, a unique id, a name, a list of postal addresses, and a list of email addresses:
{id: 0, uniqueId: 'Z5ER1ZE5', name: 'John DOE', postalList: [], emailList: []}
Sub-entities are objects representing different types of addresses (email, postal, etc.): {personUniqueId: 'Z5ER1ZE5', 'Email': 'john.doe#gmail.com'}
So, I read the files' content with Pandas using pd.read_csv(filename).
To optimize as much as possible, I've decided to handle every iteration over the data using multiprocessing (which works fine, even though I haven't handled RAM usage yet):
import multiprocessing
from functools import partial

## Using a manager to be able to pass my main object and update it through processes
manager = multiprocessing.Manager()
main_obj = manager.dict({
    'dataframes': manager.dict(),
    'dicts': manager.dict(),
    'json': manager.dict()
})

## Just an example of how I use multiprocessing
pool = multiprocessing.Pool()
result = pool.map(partial(func, obj=main_obj), data_list_to_iterate)
pool.close()
pool.join()
And I have some reference dicts: identifiers maps each entity name to its uniqueId field, and sub_associations maps each sub-entity name to its related collection:
sub_associations = {
    'Persons_emlAddr': 'postalList',
    'Persons_pstlAddr': 'emailList'
}
identifiers = {
    'Animals': 'uniqueAnimalId',
    'Persons': 'uniquePersonId',
    'Persons_emlAddr': 'uniquePersonId',
    'Persons_pstlAddr': 'uniquePersonId'
}
That said, I'm now hitting a big issue in the function that fetches the sub-entities for each entity:
for key in list(main_obj['dicts'].keys()):
    main_obj['json'][key] = ''
    with multiprocessing.Pool() as stringify_pool:
        res_stringify = stringify_pool.map(partial(convert_to_json, obj=main_obj, name=key), main_obj['dicts'][key]['records'])
        stringify_pool.close()
        stringify_pool.join()
This is where I call the problematic function. I feed it the keys of main_obj['dicts'], where each key is just an entity filename (Persons, Animals, ...), and main_obj['dicts'][key] is a dict of the form {name: 'Persons', records: []}, where records is the list of entity dicts I need to iterate over.
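For concreteness, a hypothetical example of that structure (the values are made up to match the description above):
# Illustrative only: made-up values matching the structure described above
main_obj['dicts'] = {
    'Persons': {
        'name': 'Persons',
        'records': [
            {'id': 0, 'uniquePersonId': 'Z5ER1ZE5', 'name': 'John DOE'},
        ],
    },
}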
import json

def convert_to_json(item, obj, name):
    global sub_associations
    global identifiers
    dump = ''
    # find the sub-entity collections belonging to this entity type
    subs = [val for val in sub_associations.keys() if val.startswith(name)]
    if subs:
        for sub in subs:
            df = obj['dataframes'][sub]
            id_name = identifiers[name]
            # filter the sub-entity dataframe down to this item's rows
            sub_items = df[df[id_name] == item[id_name]].to_dict('records')
            if sub_items:
                item[sub_associations[sub]] = sub_items
            else:
                item[sub_associations[sub]] = []
    # build the Elasticsearch bulk action line, then the document line
    index = {
        "index": {
            "_index": name,
            "_id": item[identifiers[name]]
        }
    }
    dump += f'{json.dumps(index)}\n'
    dump += f'{json.dumps(item)}\n'
    obj['json'][name] += dump
    return 'Done'
Does anyone have an idea what the main issue could be, and how I could change it to make it faster?
If you need any additional information, or if I haven't been clear on some things, feel free to ask.
Thank you in advance! :)
I am parsing an API response in Python using responseJson = json.loads(response.text)
The API response is somewhat like this:
1. When having a single entry in books:
{
    "name": "A",
    "books": {
        "bookname": "BookA"
    }
}
or
2. When having multiple entries in books:
{
    "name": "A",
    "books": [
        {
            "bookname": "BookA"
        },
        {
            "bookname": "BookB"
        }
    ]
}
Currently I am using:
if type(responseJson['books']) is dict:
    bookName.append(responseJson['books']['bookname'])
    # do a lot more stuff
else:
    for val in responseJson['books']:
        bookName.append(val['bookname'])
        # do a lot more stuff
Since the code (# do a lot more stuff) is a bit complex, I was looking to find an optimized way to do this instead of relying on type().
Any suggestions on how to improve code quality here?
I would use isinstance instead of type, but rather than having two different branches that do a bunch of stuff, I would only check for the dictionary case and, if found, wrap it inside a list; then you only need one branch that does the stuff.
for example:
books = responseJson['books']
if isinstance(books, dict):
    books = [books]  # normalize the single-entry case to a one-element list
for val in books:
    bookName.append(val['bookname'])
    # do a lot more stuff
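If this pattern crops up in more than one place, it could be factored into a small helper; a sketch (the as_list name is my own, not from the question):
def as_list(value):
    """Normalize a JSON field that may be a single object or a list of objects."""
    return value if isinstance(value, list) else [value]

for val in as_list(responseJson['books']):
    bookName.append(val['bookname'])
    # do a lot more stuff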
I'm trying to clean AWS Cloudwatch's log data, which is delivered in JSON format when queried via boto3. Each log line is stored as an array of dictionaries. For example, one log line takes the following form:
[
    {
        "field": "field1",
        "value": "abc"
    },
    {
        "field": "field2",
        "value": "def"
    },
    {
        "field": "field3",
        "value": "ghi"
    }
]
If this were in a standard key-value format (e.g., {'field1':'abc'}), I would know exactly what to do with it. I'm just getting stuck on untangling the extra layer of hierarchy introduced by the field/value keys. The ultimate goal is to convert the entire response object into a data frame like the following:
| field1 | field2 | field3 |
|--------|--------|--------|
| abc    | def    | ghi    |
(and so on for the rest of the response object, one row per log line.)
Last bit of info: each array has the same set of fields, and there is no nesting deeper than the example I've provided here. Thank you in advance :)
I was able to do this using nested loops. Not my favorite - I always feel like there has to be a more elegant solution than crawling through every single inch of the object, but this data is simple enough that it's still very fast.
import pandas as pd

logList = []  # list of flattened dicts, one per log line
for line in logs:  # logs = response object
    line_dict = {}
    # Flatten each field/value pair into a single key-value entry
    for entry in line:
        line_dict[entry['field']] = entry['value']
    logList.append(line_dict)

df = pd.json_normalize(logList)
For anyone else working with CloudWatch logs, the actual log lines (like those I displayed above) are nested in an array called 'results' in the boto3 response object. So you'd need to extract that array first, or point the outer loop to it (i.e., for line in response['results']).
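As an aside, the inner flattening loop can be collapsed into a dict comprehension; a minimal sketch under the same assumption that every entry has exactly the field/value keys shown above:
# one flattened dict per log line
logList = [{entry['field']: entry['value'] for entry in line} for line in logs]
df = pd.json_normalize(logList)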
I have this JSON file where the number of ids sometimes changes (more ids will be added):
{
    "maps": [
        {
            "id": "blabla1",
            "iscategorical": "0"
        },
        {
            "id": "blabla2",
            "iscategorical": "0"
        },
        {
            "id": "blabla3",
            "iscategorical": "0"
        },
        {
            "id": "blabla4",
            "iscategorical": "0"
        }
    ]
}
I have this Python code that has to print all the id values:
import json
data = json.load(open('data.json'))
variable1 = data["maps"][0]["id"]
print(variable1)
variable2 = data["maps"][1]["id"]
print(variable2)
variable3 = data["maps"][2]["id"]
print(variable3)
variable4 = data["maps"][3]["id"]
print(variable4)
I have to use variables, because I want to show the values in a dropdown menu. Is it possible to save the id values in a more efficient way? And how do you know the maximum number of ids in this JSON file (in the example, 4)?
You can get the number of ids (which is the number of elements) by checking the length of data['maps']:
number_of_ids = len(data['maps'])
A clean way to get all the id values is storing them in a list.
You can achieve this in a pythonic way like this:
list_of_ids = [map['id'] for map in data['maps']]
Using this approach you don't even need to store the number of elements in the original JSON, because you essentially iterate through all of them with a foreach approach.
If the pythonic approach troubles you, you can achieve the same thing with a classic foreach loop:
list_of_ids = []
for map in data['maps']:
    list_of_ids.append(map['id'])
Or you can do it with a classic for loop, and this is where you really need the length:
number_of_ids = len(data['maps'])
list_of_ids = []
for i in range(number_of_ids):
    list_of_ids.append(data['maps'][i]['id'])
This last one is the classic way, but I suggest you try the other approaches in order to leverage the advantages Python offers!
Happy coding!
data['maps'] is a simple list, so you can iterate over it as such:
for map in data['maps']:
    print(map['id'])
To store them, you'll need to collect them in a list. Storing each in a separate variable is not a good idea, because, like you said, you don't have a way to know how many there are.
ids = []
for map in data['maps']:
    ids.append(map['id'])
Trying to create a 4-level nested JSON from a CSV based on this example:
Region,Company,Department,Expense,Cost
Gondwanaland,Bobs Bits,Operations,nuts,332
Gondwanaland,Bobs Bits,Operations,bolts,254
Gondwanaland,Maureens Melons,Operations,nuts,123
At each level I would like to sum the costs and include it in the outputted JSON at the relevant level.
The structure of the outputted JSON should look something like this:
{
    "id": "aUniqueIdentifier",
    "name": "usually a node's name",
    "data": [
        {
            "key": "some key",
            "value": "some value"
        },
        {
            "key": "some other key",
            "value": "some other value"
        }
    ],
    "children": [ /* other nodes or empty */ ]
}
(REF: http://blog.thejit.org/2008/04/27/feeding-json-tree-structures-to-the-jit/)
Thinking along the lines of a recursive function in Python, but I have not had much success with this approach so far... any suggestions for a quick and easy solution would be greatly appreciated!
UPDATE:
Gradually giving up on the idea of the summarised costs because I just can't figure it out :( (I'm not much of a Python coder yet!). Simply being able to generate the formatted JSON would be good enough, and I can plug in the numbers later if I have to.
I have been reading, googling, and reading some more in search of a solution, and along the way I have learnt a lot, but still no success in creating my nested JSON files from the CSV structure above. There must be a simple solution somewhere on the web? Maybe somebody else has had more luck with their search terms?
Here are some hints.
Parse the input to a list of lists with csv.reader:
>>> rows = list(csv.reader(source.splitlines()))
Loop over the list to build up your dictionary and summarize the costs. Depending on the structure you're looking to create, the build-up might look something like this:
>>> summary = {}
>>> for region, company, department, expense, cost in rows[1:]:
        summary.setdefault((region, company, department), []).append((expense, cost))
Write the result out with json.dump:
>>> json.dump(summary, open('dest.json', 'w'))
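One caveat: JSON object keys must be strings, so the tuple keys created by setdefault need converting before the dump actually succeeds; a minimal sketch (the '/' separator is an arbitrary choice of mine):
# flatten the (region, company, department) tuple keys into strings
serializable = {'/'.join(key): value for key, value in summary.items()}
with open('dest.json', 'w', encoding='utf-8') as handle:
    json.dump(serializable, handle, indent=4)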
Hopefully, the recursive function below will help get you started. It builds a tree from the input. Please be aware of what type you want your leaves to be, which here is the expense-to-cost mapping. You'll need to elaborate on the function to build up the exact structure you intend:
import csv, itertools, json

def cluster(rows):
    result = []
    # group consecutive rows by their first column, then recurse on the remainder
    for key, group in itertools.groupby(rows, key=lambda r: r[0]):
        group_rows = [row[1:] for row in group]
        if len(group_rows[0]) == 2:
            # only (expense, cost) pairs remain, so make them the leaf dict
            result.append({key: dict(group_rows)})
        else:
            result.append({key: cluster(group_rows)})
    return result

if __name__ == '__main__':
    s = '''\
Gondwanaland,Bobs Bits,Operations,nuts,332
Gondwanaland,Bobs Bits,Operations,bolts,254
Gondwanaland,Maureens Melons,Operations,nuts,123
'''
    rows = list(csv.reader(s.splitlines()))
    r = cluster(rows)
    print(json.dumps(r, indent=4))
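For reference, running this on the sample input should print something like the following (the costs stay strings because csv.reader yields text):
[
    {
        "Gondwanaland": [
            {
                "Bobs Bits": [
                    {
                        "Operations": {
                            "nuts": "332",
                            "bolts": "254"
                        }
                    }
                ]
            },
            {
                "Maureens Melons": [
                    {
                        "Operations": {
                            "nuts": "123"
                        }
                    }
                ]
            }
        ]
    }
]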