Python/PySpark parse JSON string with numbered attributes

I need to store JSON strings like the one below in a file format other than plain text (e.g. Parquet):
{
"vidName": "Foo",
"vidInfo.0.size.length": 10,
"vidInfo.0.size.width": 10,
"vidInfo.0.quality": "Good",
"vidInfo.1.size.length": 7,
"vidInfo.1.size.width": 3,
"vidInfo.1.quality": "Bad",
"vidInfo.2.size.length": 10,
"vidInfo.2.size.width": 2,
"vidInfo.2.quality": "Excelent"
}
There is no known upper bound on the vidInfo index (it can be 10, 20, or more). So I want either to collect the vidInfo entries into an array, or to explode such a JSON object into multiple smaller objects.
I found this question: PHP JSON parsing (number attributes?)
But it is in PHP, which I do not understand well, and I am not sure whether it covers what I need.
The intermediate data should be something like this:
{
"vidName": "Foo",
"vidInfo": [
{
"id": 0,
"size": {
"length": 10,
"width": 10
},
"quality": "Good"
},
{
"id": 1,
"size": {
"length": 7,
"width": 3
},
"quality": "Bad"
},
{
"id": 2,
"size": {
"length": 10,
"width": 2
},
"quality": "Excelent"
}
]
}
or like this:
{
"vidName": "Foo",
"vidInfo": [
{
"size": {
"length": 10,
"width": 10
},
"quality": "Good"
},
{
"size": {
"length": 7,
"width": 3
},
"quality": "Bad"
},
{
"size": {
"length": 10,
"width": 2
},
"quality": "Excelent"
}
]
}
I am stuck, and would need some hints to move on.
Could you please help?
Thanks a lot for your help.

I found this library, https://github.com/amirziai/flatten (imported as flatten_json), which does the trick.
In [154]: some_json = {
...: "vidName": "Foo",
...: "vidInfo.0.size.length": 10,
...: "vidInfo.0.size.width": 10,
...: "vidInfo.0.quality": "Good",
...: "vidInfo.1.size.length": 7,
...: "vidInfo.1.size.width": 3,
...: "vidInfo.1.quality": "Bad",
...: "vidInfo.2.size.length": 10,
...: "vidInfo.2.size.width": 2,
...: "vidInfo.2.quality": "Excelent"
...: }
In [155]: some_json
Out[155]:
{'vidName': 'Foo',
'vidInfo.0.size.length': 10,
'vidInfo.0.size.width': 10,
'vidInfo.0.quality': 'Good',
'vidInfo.1.size.length': 7,
'vidInfo.1.size.width': 3,
'vidInfo.1.quality': 'Bad',
'vidInfo.2.size.length': 10,
'vidInfo.2.size.width': 2,
'vidInfo.2.quality': 'Excelent'}
In [156]: from flatten_json import unflatten_list
...: import json
...: nested_json = unflatten_list(json.loads(json.dumps(some_json)), '.')
In [157]: nested_json
Out[157]:
{'vidInfo': [{'quality': 'Good', 'size': {'length': 10, 'width': 10}},
{'quality': 'Bad', 'size': {'length': 7, 'width': 3}},
{'quality': 'Excelent', 'size': {'length': 10, 'width': 2}}],
'vidName': 'Foo'}
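
If adding a dependency is not an option, the same reshaping can be sketched in plain Python. The helper names below (unflatten, _listify) are hypothetical and not part of flatten_json; the sketch assumes every key uses "." as the separator and that purely numeric path segments are list indices.

```python
def _listify(node):
    """Recursively turn dicts whose keys are all digits into lists."""
    if not isinstance(node, dict):
        return node
    converted = {key: _listify(value) for key, value in node.items()}
    if converted and all(key.isdigit() for key in converted):
        # Order by numeric index; gaps in the numbering are simply closed up.
        return [converted[key] for key in sorted(converted, key=int)]
    return converted


def unflatten(flat, sep="."):
    """Rebuild a nested structure from a flat dict with dotted keys."""
    nested = {}
    for key, value in flat.items():
        parts = key.split(sep)
        node = nested
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value
    return _listify(nested)


flat = {
    "vidName": "Foo",
    "vidInfo.0.size.length": 10,
    "vidInfo.0.size.width": 10,
    "vidInfo.0.quality": "Good",
    "vidInfo.1.size.length": 7,
    "vidInfo.1.size.width": 3,
    "vidInfo.1.quality": "Bad",
}
nested = unflatten(flat)
print(nested)
```

Because the indices are only used for ordering, the number of vidInfo entries does not need to be known in advance.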

Related

How would I sort a dictionary of named dictionaries by a value in Python? [duplicate]

Here is an example of the dictionary:
accounts = {
"user1": {
"password": "test",
"won": 8,
"lost": 1,
"colour": "green"
},
"user2": {
"password": "test",
"won": 12,
"lost": 4,
"colour": "blue"
},
"user3": {
"password": "test",
"won": 18,
"lost": 1,
"colour": "blue"
}}
How would I go about sorting these by 'won'?
I just cannot seem to figure it out.
Thanks!

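You can use the built-in sorted function with a custom key. Passing the sorted pairs to dict() keeps that order, since dictionaries preserve insertion order in Python 3.7+.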
var = {"user2": {
"password": "test",
"won": 12,
"lost": 4,
"colour": "blue"
},
"user1": {
"password": "test",
"won": 8,
"lost": 1,
"colour": "green"
},
"user3": {
"password": "test",
"won": 18,
"lost": 1,
"colour": "blue"
}}
result = sorted(var.items(),key=lambda x: x[1]["won"])
print(dict(result))
Output
{'user1':
{'password': 'test', 'won': 8, 'lost': 1, 'colour': 'green'},
'user2':
{'password': 'test', 'won': 12, 'lost': 4, 'colour': 'blue'},
'user3':
{'password': 'test', 'won': 18, 'lost': 1, 'colour': 'blue'}
}

How can I sort my JSON file by nested value?

So I have a JSON file that looks like this:
{
"PlayerA": {
"val": 200,
"level": 1
},
"PlayerB": {
"val": 1000,
"level": 1
},
"PlayerC": {
"val": 30,
"level": 1
}
}
And I want it to be sorted by "val," so that it looks like this:
{
"PlayerB": {
"val": 1000,
"level": 1
},
"PlayerA": {
"val": 200,
"level": 1
},
"PlayerC": {
"val": 30,
"level": 1
}
}
How would I go about doing this?

Try this:
data = {
"PlayerA": {
"val": 200,
"level": 1
},
"PlayerB": {
"val": 1000,
"level": 1
},
"PlayerC": {
"val": 30,
"level": 1
}
}
# sorted() returns a list of (key, value) tuples; dict() turns it back into a mapping
data = dict(sorted(data.items(), key=lambda x: x[1]["val"], reverse=True))
print(data)
# {'PlayerB': {'val': 1000, 'level': 1}, 'PlayerA': {'val': 200, 'level': 1}, 'PlayerC': {'val': 30, 'level': 1}}
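
Since the question is about a JSON file, here is a minimal sketch of the full round trip: load the file, sort it, and write it back. The function name sort_json_file and its parameters are assumptions made for illustration; the paths are whatever your files are called.

```python
import json


def sort_json_file(src_path, dst_path, field="val", descending=True):
    """Load a JSON object of objects, sort it by a nested field, write it out."""
    with open(src_path, encoding="utf-8") as src:
        data = json.load(src)
    ordered = dict(
        sorted(data.items(), key=lambda kv: kv[1][field], reverse=descending)
    )
    with open(dst_path, "w", encoding="utf-8") as dst:
        # json.dump writes keys in the dict's insertion order.
        json.dump(ordered, dst, indent=4)
    return ordered
```

Pass the same path for both arguments to sort the file in place.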

Group and sum list of dictionaries by parameter

I have a list of dictionaries describing my products (drinks, food, etc.); some products may appear several times. I need to group the products by product_id and sum product_cost and product_quantity within each group to get the total per product.
I'm new to Python. I understand how to group a list of dictionaries, but I can't figure out how to sum selected values.
"products_list": [
{
"product_cost": 25,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 14,
},
{
"product_cost": 176.74,
"product_id": 2,
"product_name": "Apples",
"product_quantity": 800,
},
{
"product_cost": 13,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 7,
}
]
I need to achieve something like this:
"products_list": [
{
"product_cost": 38,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 21,
},
{
"product_cost": 176.74,
"product_id": 2,
"product_name": "Apples",
"product_quantity": 800,
}
]

You can start by sorting the list of dictionaries on product_name and then grouping the items on product_name (product_id would work the same way as the key).
Then, for each group, compute the total cost and total quantity, build a dictionary for that product, and append it to a list, which becomes the value of the final dictionary.
from itertools import groupby
dct = {"products_list": [
{
"product_cost": 25,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 14,
},
{
"product_cost": 176.74,
"product_id": 2,
"product_name": "Apples",
"product_quantity": 800,
},
{
"product_cost": 13,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 7,
}
]}
result = {}
li = []
# Sort product list on product_name
sorted_prod_list = sorted(dct['products_list'], key=lambda x: x['product_name'])
# Group on product_name
for model, group in groupby(sorted_prod_list, key=lambda x: x['product_name']):
    grp = list(group)
    # Compute total cost and qty, make the dictionary and add to list
    total_cost = sum(item['product_cost'] for item in grp)
    total_qty = sum(item['product_quantity'] for item in grp)
    product_name = grp[0]['product_name']
    product_id = grp[0]['product_id']
    li.append({'product_name': product_name, 'product_id': product_id, 'product_cost': total_cost, 'product_quantity': total_qty})
# Make final dictionary
result['products_list'] = li
print(result)
The output will be
{
'products_list': [{
'product_name': 'Apples',
'product_id': 2,
'product_cost': 176.74,
'product_quantity': 800
},
{
'product_name': 'Coca-cola',
'product_id': 1,
'product_cost': 38,
'product_quantity': 21
}
]
}
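
The sort step in this answer matters because itertools.groupby only merges adjacent items that share a key. A small illustration, using just the product names from the question:

```python
from itertools import groupby

names = ["Coca-cola", "Apples", "Coca-cola"]

# Without sorting, the two "Coca-cola" entries are not adjacent,
# so groupby reports them as two separate groups.
unsorted_groups = [(key, len(list(group))) for key, group in groupby(names)]

# After sorting, equal keys sit next to each other and are merged.
sorted_groups = [(key, len(list(group))) for key, group in groupby(sorted(names))]

print(unsorted_groups)  # [('Coca-cola', 1), ('Apples', 1), ('Coca-cola', 1)]
print(sorted_groups)    # [('Apples', 1), ('Coca-cola', 2)]
```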

You can try with pandas:
d = {"products_list": [
{
"product_cost": 25,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 14,
},
{
"product_cost": 176.74,
"product_id": 2,
"product_name": "Apples",
"product_quantity": 800,
},
{
"product_cost": 13,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 7,
}
]}
import pandas as pd
df = pd.DataFrame(d["products_list"])
Pass the list of dicts to pandas and group by both product_id and product_name, so that product_id is kept as a key rather than summed.
Then convert the result back to a list of dicts with to_dict.
result = {}
result["products_list"] = (
    df.groupby(["product_id", "product_name"], as_index=False)[["product_cost", "product_quantity"]]
    .sum()
    .to_dict(orient="records")
)
Result:
{'products_list': [{'product_id': 1,
'product_name': 'Coca-cola',
'product_cost': 38.0,
'product_quantity': 21},
{'product_id': 2,
'product_name': 'Apples',
'product_cost': 176.74,
'product_quantity': 800}]}

Personally, I would reorganize the data into another dictionary keyed by a unique identifier. If you still need a list afterwards, convert dict.values() to a list. Below is a function that does that.
def get_totals(product_dict):
    totals = {}
    for product in product_dict["products_list"]:
        name = product["product_name"]
        if name not in totals:
            # copy so the input dictionaries are not modified in place
            totals[name] = dict(product)
        else:
            totals[name]["product_cost"] += product["product_cost"]
            totals[name]["product_quantity"] += product["product_quantity"]
    return list(totals.values())
The output is:
[
{
'product_cost': 38,
'product_id': 1,
'product_name': 'Coca-cola',
'product_quantity': 21
},
{
'product_cost': 176.74,
'product_id': 2,
'product_name': 'Apples',
'product_quantity': 800
}
]
If you need the result under the products_list key, reassign the list to that key. Instead of returning list(totals.values()), do:
    product_dict["products_list"] = list(totals.values())
    return product_dict
The output is a dictionary like:
{
"products_list": [
{
"product_cost": 38,
"product_id": 1,
"product_name": "Coca-cola",
"product_quantity": 21,
},
{
"product_cost": 176.74,
"product_id": 2,
"product_name": "Apples",
"product_quantity": 800,
}
]
}

Multiple relational tables to nested JSON format using Python

I'm trying to create a nested JSON object by combining several relational tables using Python/pandas. I'm a beginner in Python/pandas, so I'm looking for a bit of help here.
In the following example I'm using CSV files instead of tables, just to keep it simple.
Table1.csv
Emp_id, Gender, Age
1, M, 32
2, M, 35
3, F, 31
Table2.csv
Emp_id, Month, Incentive
1, Aug, 3000
1, Sep, 3500
1, Oct, 2000
2, Aug, 1500
3, Aug, 5000
3, Sep, 2400
I want to create a JSON object like the one below.
Required output:
{
"data": [{
"employee": 1,
"gender": M,
"age": 32,
"incentive": [{
"aug": 3000,
"sep": 3500,
"oct": 2000
}],
"employee": 2,
"gender": M,
"age": 35,
"incentive": [{
"aug": 1500
}],
"employee": 3,
"gender": F,
"age": 31,
"incentive": [{
"aug": 5000,
"sep": 2400
}]
}]
}

Use merge with a left join first, then groupby with a lambda that builds a dictionary per employee, convert the result with to_dict, and finally add the top-level key and convert to JSON (df1 and df2 are assumed to be the two tables already loaded as DataFrames):
import json
d = (df1.merge(df2, on='Emp_id', how='left')
        .groupby(['Emp_id','Gender','Age'])[['Month','Incentive']]
        .apply(lambda x: [dict(x.values)])
        .reset_index(name='Incentive')
        .to_dict(orient='records')
)
#print (d)
# use a separate name so the json module is not shadowed
json_str = json.dumps({'data': d})
print (json_str)
{
"data": [{
"Emp_id": 1,
"Gender": "M",
"Age": 32,
"Incentive": [{
"Aug": 3000,
"Sep": 3500,
"Oct": 2000
}]
}, {
"Emp_id": 2,
"Gender": "M",
"Age": 35,
"Incentive": [{
"Aug": 1500
}]
}, {
"Emp_id": 3,
"Gender": "F",
"Age": 31,
"Incentive": [{
"Aug": 5000,
"Sep": 2400
}]
}]
}
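
For comparison, the same nesting can be sketched without pandas, using only the standard library. This is an illustrative alternative rather than the answer's method: the function name nest_tables is hypothetical, and the CSV text is inlined as strings so the example is self-contained.

```python
import csv
import io
import json

table1 = """Emp_id, Gender, Age
1, M, 32
2, M, 35
3, F, 31
"""

table2 = """Emp_id, Month, Incentive
1, Aug, 3000
1, Sep, 3500
1, Oct, 2000
2, Aug, 1500
3, Aug, 5000
3, Sep, 2400
"""


def nest_tables(employees_csv, incentives_csv):
    """Attach each employee's monthly incentives as a nested dictionary."""
    # skipinitialspace=True drops the blank that follows each comma.
    incentives = {}
    for row in csv.DictReader(io.StringIO(incentives_csv), skipinitialspace=True):
        incentives.setdefault(row["Emp_id"], {})[row["Month"]] = int(row["Incentive"])

    data = []
    for row in csv.DictReader(io.StringIO(employees_csv), skipinitialspace=True):
        data.append({
            "Emp_id": int(row["Emp_id"]),
            "Gender": row["Gender"],
            "Age": int(row["Age"]),
            # Employees without incentives get an empty dictionary.
            "Incentive": [incentives.get(row["Emp_id"], {})],
        })
    return {"data": data}


nested = nest_tables(table1, table2)
print(json.dumps(nested, indent=2))
```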

What is the equivalent of array_column in python3

I have a list of dictionaries and I want to get only a specific item from each dictionary. My data looks like this:
data = [
{
"_id": "uuid",
"_index": "my_index",
"_score": 1,
"_source": {
"id" : 1,
"price": 100
}
},
{
"_id": "uuid",
"_index": "my_index",
"_score": 1,
"_source": {
"id" : 2,
"price": 150
}
},
{
"_id": "uuid",
"_index": "my_index",
"_score": 1,
"_source": {
"id" : 3,
"price": 90
}
}
]
My desired output:
formatted_data = [
{
"id": 1,
"price": 100
},
{
"id": 2,
"price": 150
},
{
"id": 3,
"price": 90
}
]
To format the data I used a for loop:
formatted_data = []
for item in data:
    formatted_data.append(item['_source'])
In PHP I could use array_column() instead of a for loop. What is the alternative to the for loop in Python 3 in my case?
Thanks in advance.

You can use a list comprehension:
In [11]: [e['_source'] for e in data]
Out[11]: [{'id': 1, 'price': 100}, {'id': 2, 'price': 150}, {'id': 3, 'price': 90}]
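
Python has no built-in named array_column, but a rough counterpart is easy to sketch. The helper below is hypothetical (the name and parameters simply mirror the PHP function); it also covers PHP's optional index key by returning a dictionary. operator.itemgetter with map is shown as another way to write the plain column extraction.

```python
from operator import itemgetter

data = [
    {"_id": "uuid", "_index": "my_index", "_score": 1, "_source": {"id": 1, "price": 100}},
    {"_id": "uuid", "_index": "my_index", "_score": 1, "_source": {"id": 2, "price": 150}},
    {"_id": "uuid", "_index": "my_index", "_score": 1, "_source": {"id": 3, "price": 90}},
]


def array_column(rows, column_key, index_key=None):
    """Collect one column from a list of dicts, skipping rows that lack it."""
    if index_key is None:
        return [row[column_key] for row in rows if column_key in row]
    return {row[index_key]: row[column_key] for row in rows if column_key in row}


# Equivalent to the list comprehension, written with map and itemgetter.
sources = list(map(itemgetter("_source"), data))

# With an index key, the result is a dict keyed by that column.
prices_by_id = array_column(sources, "price", "id")

print(sources)       # [{'id': 1, 'price': 100}, {'id': 2, 'price': 150}, {'id': 3, 'price': 90}]
print(prices_by_id)  # {1: 100, 2: 150, 3: 90}
```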
