Formatting JSON for Pandas Dataframe - python

I'm trying to wrangle some data to make a recommender system for an app. Of course, to do this I need a record of which users like which posts. I currently have that data in a JSON file that is formatted like this (numbers being post id, and letters being user ids):
{
    "-1234": {
        "abc": "abc",
        "def": "def",
        "ghi": "ghi"
    },
    "-5678": {
        "jkl": "jkl",
        "mno": "mno"
    }
}
I'm trying to figure out how to get this into a pandas dataframe that would look like this:
PostID   User Like
-1234    [abc, def, ghi]
-5678    [jkl, mno]
Out of laziness I first tried a few online JSON-to-CSV converters, which unsurprisingly didn't produce a usable format. I also tried print(json_normalize(data)), but that didn't work either: it put each like into its own column.
Any advice?

This is a solution optimized for the peculiarities in your dataset.
import pandas as pd

data = {
    "-1234": {
        "abc": "abc",
        "def": "def",
        "ghi": "ghi"
    },
    "-5678": {
        "jkl": "jkl",
        "mno": "mno"
    }
}

# One row per post: the post id and the list of user ids who liked it
formatted = [{'PostID': d, 'User Like': list(data[d].keys())} for d in data]
df = pd.DataFrame(formatted)
Output:
  PostID        User Like
0  -1234  [abc, def, ghi]
1  -5678       [jkl, mno]
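If the recommender ultimately needs one row per (post, user) pair rather than a list column, a long format may be easier to work with. A minimal sketch building on the data dict above (pairs and long_df are just illustrative names):
# Long format: one row per (PostID, UserID) like-pair (illustrative names)
pairs = [(post_id, user_id)
         for post_id, users in data.items()
         for user_id in users]
long_df = pd.DataFrame(pairs, columns=['PostID', 'UserID'])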

In my experience, for simple formats like this, writing a quick-and-dirty loop is usually faster than hunting for a ready-made solution and customizing it. An example for the data you gave:
import json

my_json = """{
    "-1234": {
        "abc": "abc",
        "def": "def",
        "ghi": "ghi"
    },
    "-5678": {
        "jkl": "jkl",
        "mno": "mno"
    }
}"""

parsed_json = json.loads(my_json)
print(parsed_json)
# result:
# {'-1234': {'abc': 'abc', 'def': 'def', 'ghi': 'ghi'},
# '-5678': {'jkl': 'jkl', 'mno': 'mno'}}
for key in parsed_json.keys():
    line = key + ' | '
    for value in parsed_json[key].values():
        line += value + ', '
    line = line[:-2]  # stripping the ', ' from the end of the line
    print(line)
# result:
# -1234 | abc, def, ghi
# -5678 | jkl, mno

Setup (thanks Zaroth):
import json

my_json = """{
    "-1234": {
        "abc": "abc",
        "def": "def",
        "ghi": "ghi"
    },
    "-5678": {
        "jkl": "jkl",
        "mno": "mno"
    }
}"""

parsed_json = json.loads(my_json)
Comprehension
pd.DataFrame(
    [(k, [*v]) for k, v in parsed_json.items()],
    columns=['PostID', 'User Like']
)
  PostID        User Like
0  -1234  [abc, def, ghi]
1  -5678       [jkl, mno]
OR
pd.DataFrame({
    'PostID': [*parsed_json],
    'User Like': [[*v] for v in parsed_json.values()]
})

import pandas as pd

data = {"-1234": {"abc": "abc", "def": "def", "ghi": "ghi"}, "-5678": {"jkl": "jkl", "mno": "mno"}}

key = []
val = []
for k, v in data.items():
    key.append(k)
    val.append(list(v.values()))

pd.DataFrame(zip(key, val), columns=['PostID', 'User Like'])

Related

Changing value of a value in a dictionary within a list within a dictionary

I have a JSON like:
pd = {
    "RP": [
        {
            "Name": "PD",
            "Value": "qwe"
        },
        {
            "Name": "qwe",
            "Value": "change"
        }
    ],
    "RFN": [
        "All"
    ],
    "RIT": [
        {
            "ID": "All",
            "IDT": "All"
        }
    ]
}
I am trying to change the value "change" to "changed". This is a dictionary within a list, which is within another dictionary. Is there a better/more efficient/more Pythonic way to do this than what I did below?
for key, value in pd.items():
    ls = pd[key]
    for d in ls:
        if type(d) == dict:
            for k, v in d.items():
                if v == 'change':
                    pd[key][ls.index(d)][k] = "changed"
This seems pretty inefficient given the number of times I am iterating over the data.
String replacement could work if you don't want to write a depth/breadth-first search.
>>> import json
>>> json.loads(json.dumps(pd).replace('"Value": "change"', '"Value": "changed"'))
{'RP': [{'Name': 'PD', 'Value': 'qwe'}, {'Name': 'qwe', 'Value': 'changed'}],
'RFN': ['All'],
'RIT': [{'ID': 'All', 'IDT': 'All'}]}
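If you'd rather avoid the string round-trip, a small recursive walk is an alternative. A sketch, where walk_replace is a hypothetical helper rather than anything from the question:
def walk_replace(obj, old, new):
    # Hypothetical helper: recursively replace dict values equal to `old`
    # with `new`, in place, descending into nested dicts and lists
    if isinstance(obj, dict):
        for k, v in obj.items():
            if v == old:
                obj[k] = new
            else:
                walk_replace(v, old, new)
    elif isinstance(obj, list):
        for item in obj:
            walk_replace(item, old, new)

walk_replace(pd, 'change', 'changed')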

Output pandas dataframe to json in a particular format

My dataframe is:
fname  lname  city    state    code
Alice  Lee    Athens  Alabama  PXY
Nor    Xi     Mesa    Arizona  ABC
The output JSON should be:
{
    "Employees": {
        "Alice Lee": {
            "code": "PXY",
            "Address": "Athens, Alabama"
        },
        "Nor Xi": {
            "code": "ABC",
            "Address": "Mesa, Arizona"
        }
    }
}
df.to_json() gives no hierarchy to the JSON. Can you please suggest what I am missing? Is there a way to combine columns and give them a 'keyname' while writing JSON in pandas?
Thank you.
Try:
names = df[["fname", "lname"]].apply(" ".join, axis=1)
addresses = df[["city", "state"]].apply(", ".join, axis=1)
codes = df["code"]

out = {"Employees": {}}
for n, a, c in zip(names, addresses, codes):
    out["Employees"][n] = {"code": c, "Address": a}

print(out)
Prints:
{
    "Employees": {
        "Alice Lee": {"code": "PXY", "Address": "Athens, Alabama"},
        "Nor Xi": {"code": "ABC", "Address": "Mesa, Arizona"},
    }
}
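Note that out is a plain Python dict; if you need actual JSON text rather than a dict, serialize it, for example:
import json

print(json.dumps(out, indent=4))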
We can populate a new dataframe whose columns are "code" and "Address" and whose index is the full name, where the latter two are built from the original dataframe's columns by string concatenation:
new_df = pd.DataFrame({"code": df["code"],
                       "Address": df["city"] + ", " + df["state"]})
new_df.index = df["fname"] + " " + df["lname"]
which gives
>>> new_df
          code          Address
Alice Lee  PXY  Athens, Alabama
Nor Xi     ABC    Mesa, Arizona
We can now call to_dict with orient="index":
>>> d = new_df.to_dict(orient="index")
>>> d
{"Alice Lee": {"code": "PXY", "Address": "Athens, Alabama"},
"Nor Xi": {"code": "ABC", "Address": "Mesa, Arizona"}}
To match your output, we wrap d in a dictionary:
>>> {"Employees": d}
{
    "Employees": {
        "Alice Lee": {
            "code": "PXY",
            "Address": "Athens, Alabama"
        },
        "Nor Xi": {
            "code": "ABC",
            "Address": "Mesa, Arizona"
        }
    }
}
import json

# Note: don't assign the result to a variable named `json`, or it will shadow the module
records = json.loads(df.to_json(orient='records'))
employees = {}
employees['Employees'] = [{obj['fname'] + ' ' + obj['lname']: {'code': obj['code'], 'Address': obj['city'] + ', ' + obj['state']}} for obj in records]
This outputs:
{
    'Employees': [
        {
            'Alice Lee': {
                'code': 'PXY',
                'Address': 'Athens, Alabama'
            }
        },
        {
            'Nor Xi': {
                'code': 'ABC',
                'Address': 'Mesa, Arizona'
            }
        }
    ]
}
You can solve this using df.iterrows():
employee_dict = {}
for row in df.iterrows():
    # row[0] is the index number, row[1] is the data respective to that index
    row_data = row[1]
    employee_name = row_data.fname + ' ' + row_data.lname
    employee_dict[employee_name] = {'code': row_data.code,
                                    'Address': row_data.city + ', ' + row_data.state}

json_data = {'Employees': employee_dict}
Result:
{'Employees': {'Alice Lee': {'code': 'PXY', 'Address': 'Athens, Alabama'},
               'Nor Xi': {'code': 'ABC', 'Address': 'Mesa, Arizona'}}}
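As a side note, iterrows is convenient but comparatively slow on large frames; itertuples is usually faster. The same loop sketched with itertuples:
employee_dict = {}
for row in df.itertuples(index=False):
    # Each row is a namedtuple, so columns are attributes
    employee_dict[row.fname + ' ' + row.lname] = {
        'code': row.code,
        'Address': row.city + ', ' + row.state,
    }
json_data = {'Employees': employee_dict}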

filter json file with python

How do I filter a JSON file to show only the information I need?
To start off I want to say I'm fairly new to python and working with JSON so sorry if this question was asked before and I overlooked it.
I have a JSON file that looks like this:
[
    {
        "Store": 417,
        "Item": 10,
        "Name": "Burger",
        "Modifiable": true,
        "Price": 8.90,
        "LastModified": "09/02/2019 21:30:00"
    },
    {
        "Store": 417,
        "Item": 15,
        "Name": "Fries",
        "Modifiable": false,
        "Price": 2.60,
        "LastModified": "10/02/2019 23:00:00"
    }
]
I need to filter this file to only show Item and Price, like:
[
    {
        "Item": 10,
        "Price": 8.90
    },
    {
        "Item": 15,
        "Price": 2.60
    }
]
I have code that looks like this:
import json

# Transform JSON input to Python objects
with open("StorePriceList.json") as input_file:
    input_dict = json.load(input_file)

# Filter Python objects with list comprehensions
output_dict = [x for x in input_dict if ]  # missing logical test here

# Transform Python objects back into JSON
output_json = json.dumps(output_dict)

# Show JSON
print(output_json)
What logical test should I be doing here to achieve that?
Using a dict comprehension, it would be:
output_dict = [{k:v for k,v in x.items() if k in ["Item", "Price"]} for x in input_dict]
You can also do it like this :)
>>> [{key: d[key] for key in ['Item', 'Price']} for d in input_dict] # you should rename it to `input_list` rather than `input_dict` :)
[{'Item': 10, 'Price': 8.9}, {'Item': 15, 'Price': 2.6}]
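If you also want to write the filtered result back to disk, json.dump completes the round trip. A minimal sketch (the output filename is just an example):
import json

with open("StorePriceList.json") as input_file:
    input_list = json.load(input_file)

filtered = [{k: d[k] for k in ("Item", "Price")} for d in input_list]

with open("FilteredPriceList.json", "w") as output_file:  # example filename
    json.dump(filtered, output_file, indent=4)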
import json

with open('data.json', 'r') as f:
    qe = json.load(f)

# The file here is a top-level list, so iterate over it directly
for item in qe:
    query = f'{item["Item"]} {item["Price"]}'
    print(query)

Using .values() with list of dictionaries?

I'm comparing JSON files between two different API endpoints to see which records need an update, which need a create, and which need a delete. So, by comparing the two JSON files, I want to end up with three JSON files, one for each operation.
The JSON at both endpoints is structured like this (but they use different keys for the same sets of values; that's a different problem):
{
    "records": [{
        "id": "id-value-here",
        "c": {
            "d": "eee"
        },
        "f": {
            "l": "last",
            "f": "first"
        },
        "g": ["100", "89", "9831", "09112", "800"]
    }, {
        …
    }]
}
So the JSON is represented as a list of dictionaries (with further nested lists and dictionaries).
If a given id value ("id":) from one endpoint's JSON (j1) exists in the other endpoint's JSON (j2), then that record should be added to j_update.
So far I have something like this, but I can see that .values() doesn't work, because it's trying to operate on the outer list instead of on the listed dictionaries(?):
j_update = {r for r in j1['records'] if r['id'] in j2.values()}
This doesn't raise an error, but it creates an empty set with my test JSON files.
It seems like this should be simple, but I think I'm tripping over the nesting of dictionaries inside a list that represents the JSON. Do I need to flatten j2, or does Python have a simpler dictionary method to achieve this?
====edit j1 and j2====
have same structure, use different keys; toy data
j1
{
    "records": [{
        "field_5": 2329309841,
        "field_12": {
            "email": "cmix@etest.com"
        },
        "field_20": {
            "last": "Mixalona",
            "first": "Clara"
        },
        "field_28": ["9002329309999", "9002329309112"],
        "field_44": ["1002329309832"]
    }, {
        "field_5": 2329309831,
        "field_12": {
            "email": "mherbitz345@test.com"
        },
        "field_20": {
            "last": "Herbitz",
            "first": "Michael"
        },
        "field_28": ["9002329309831", "9002329309112", "8002329309999"],
        "field_44": ["1002329309832"]
    }, {
        "field_5": 2329309855,
        "field_12": {
            "email": "nkatamaran@test.com"
        },
        "field_20": {
            "first": "Noriss",
            "last": "Katamaran"
        },
        "field_28": ["9002329309111", "8002329309112"],
        "field_44": ["1002329309877"]
    }]
}
j2
{
    "records": [{
        "id": 2329309831,
        "email": {
            "email": "mherbitz345@test.com"
        },
        "name_primary": {
            "last": "Herbitz",
            "first": "Michael"
        },
        "assign": ["8003329309831", "8007329309789"],
        "hr_id": ["1002329309877"]
    }, {
        "id": 2329309884,
        "email": {
            "email": "yinleeshu@test.com"
        },
        "name_primary": {
            "last": "Lee Shu",
            "first": "Yin"
        },
        "assign": ["8002329309111", "9003329309831", "9002329309111", "8002329309999", "8002329309112"],
        "hr_id": ["1002329309832"]
    }, {
        "id": 23293098338,
        "email": {
            "email": "amlouis@test.com"
        },
        "name_primary": {
            "last": "Maxwell Louis",
            "first": "Albert"
        },
        "assign": ["8002329309111", "8007329309789", "9003329309831", "8002329309999", "8002329309112"],
        "hr_id": ["1002329309877"]
    }]
}
If you read the JSON, you get a dict. You are looking for a particular key inside the lists in its values.
if 'records' in j2:
    r = j2['records'][0].get('id', [])  # default if 'id' does not exist
It is prettier to do a recursive search, but I don't know how your data is organized well enough to quickly come up with a solution.
To give an idea of a recursive search, consider this example:
def recursiveSearch(dictionary, target):
    if target in dictionary:
        return dictionary[target]
    for value in dictionary.values():
        if isinstance(value, dict):
            result = recursiveSearch(value, target)
            if result is not None:
                return result

a = {'test': 'b', 'test1': dict(x=dict(z=3), y=2)}
print(recursiveSearch(a, 'z'))  # 3
You tried:
j_update = {r for r in j1['records'] if r['id'] in j2.values()}
Aside from the r['id'] vs. r['field_5'] problem, you have:
>>> list(j2.values())
[[{'id': 2329309831, ...}, ...]]
The ids are buried inside a list and a dict, so the test r['id'] in j2.values() always returns False.
The basic solution is fairly simple.
First, create a set of j2 ids:
>>> present_in_j2 = set(record["id"] for record in j2["records"])
Then, rebuild the JSON structure of j1, dropping the records whose field_5 is not present in j2:
>>> {"records":[record for record in j1["records"] if record["field_5"] in present_in_j2]}
{'records': [{'field_5': 2329309831, 'field_12': {'email': 'mherbitz345@test.com'}, 'field_20': {'last': 'Herbitz', 'first': 'Michael'}, 'field_28': ['9002329309831', '9002329309112', '8002329309999'], 'field_44': ['1002329309832']}]}
It works, but it's not totally satisfying because of the weird keys of j1. Let's try to convert j1 to a more friendly format:
def map_keys(json_value, conversion_table):
    """Map the keys of a json value.

    This is a recursive DFS.
    """
    def map_keys_aux(json_value):
        """Capture the conversion table."""
        if isinstance(json_value, list):
            return [map_keys_aux(v) for v in json_value]
        elif isinstance(json_value, dict):
            return {conversion_table.get(k, k): map_keys_aux(v) for k, v in json_value.items()}
        else:
            return json_value
    return map_keys_aux(json_value)
The function focuses on dictionary keys: conversion_table.get(k, k) is conversion_table[k] if the key is present in the conversion table, or the key itself otherwise.
>>> j1toj2 = {"field_5":"id", "field_12":"email", "field_20":"name_primary", "field_28":"assign", "field_44":"hr_id"}
>>> mapped_j1 = map_keys(j1, j1toj2)
Now, the code is cleaner and the output may be more useful for a PUT:
>>> d1 = {record["id"]:record for record in mapped_j1["records"]}
>>> present_in_j2 = set(record["id"] for record in j2["records"])
>>> {"records":[record for record in mapped_j1["records"] if record["id"] in present_in_j2]}
{'records': [{'id': 2329309831, 'email': {'email': 'mherbitz345@test.com'}, 'name_primary': {'last': 'Herbitz', 'first': 'Michael'}, 'assign': ['9002329309831', '9002329309112', '8002329309999'], 'hr_id': ['1002329309832']}]}
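Since the original goal was three outputs (update, create, delete), the same id sets can drive all three. A sketch building on mapped_j1 and present_in_j2 from above (j1_ids is a new name, not from the question):
j1_ids = {record["id"] for record in mapped_j1["records"]}  # new helper set

j_update = {"records": [r for r in mapped_j1["records"] if r["id"] in present_in_j2]}
j_create = {"records": [r for r in mapped_j1["records"] if r["id"] not in present_in_j2]}
j_delete = {"records": [r for r in j2["records"] if r["id"] not in j1_ids]}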

Bidirectional data structure conversion in Python

Note: this is not a simple two-way map; the conversion is the important part.
I'm writing an application that will send and receive messages with a certain structure, which I must convert from and to an internal structure.
For example, the message:
{
    "Person": {
        "name": {
            "first": "John",
            "last": "Smith"
        }
    },
    "birth_date": "1997.01.12",
    "points": "330"
}
This must be converted to:
{
    "Person": {
        "firstname": "John",
        "lastname": "Smith",
        "birth": datetime.date(1997, 1, 12),
        "points": 330
    }
}
And vice-versa.
These messages have a lot of information, so I want to avoid having to manually write converters for both directions. Is there any way in Python to specify the mapping once, and use it for both cases?
In my research, I found an interesting Haskell library called JsonGrammar which allows for this (it's for JSON, but that's irrelevant for the case). But my knowledge of Haskell isn't good enough to attempt a port.
That's actually quite an interesting problem. You could define a list of transformations, for example in the form (key1, func_1to2, key2, func_2to1) or a similar format, where the keys can contain separators to indicate different levels of the dict, like "Person.name.first".
import datetime

noop = lambda x: x

relations = [("Person.name.first", noop, "Person.firstname", noop),
             ("Person.name.last", noop, "Person.lastname", noop),
             ("birth_date", lambda s: datetime.date(*map(int, s.split("."))),
              "Person.birth", lambda d: d.strftime("%Y.%m.%d")),
             ("points", int, "Person.points", str)]
Then, iterate over the elements in that list and transform the entries in the dictionary according to whether you want to go from form A to B or vice versa. You will also need some helper functions for accessing keys in nested dictionaries using those dot-separated keys.
def deep_get(d, key):
    for k in key.split("."):
        d = d[k]
    return d

def deep_set(d, key, val):
    *first, last = key.split(".")
    for k in first:
        d = d.setdefault(k, {})
    d[last] = val

def convert(d, mapping, atob):
    res = {}
    for a, x, b, y in mapping:
        a, b, f = (a, b, x) if atob else (b, a, y)
        deep_set(res, b, f(deep_get(d, a)))
    return res
Example:
>>> d1 = {"Person": {"name": {"first": "John", "last": "Smith"}},
...       "birth_date": "1997.01.12",
...       "points": "330"}
>>> print(convert(d1, relations, True))
{'Person': {'birth': datetime.date(1997, 1, 12),
            'firstname': 'John',
            'lastname': 'Smith',
            'points': 330}}
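For the other direction, pass atob=False; round-tripping the converted dict should reproduce the original message (d2 is just an intermediate name):
>>> d2 = convert(d1, relations, True)
>>> print(convert(d2, relations, False))
{'Person': {'name': {'first': 'John', 'last': 'Smith'}},
 'birth_date': '1997.01.12',
 'points': '330'}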
Tobias has answered it quite well. If you are looking for a library that handles the model transformation dynamically, you can explore Python's model transformation library PyEcore.
PyEcore allows you to handle models and metamodels (structured data models) and gives you the key pieces you need for building model-driven-engineering-based tools and other applications based on a structured data model. It supports out of the box:
Data inheritance,
Two-ways relationship management (opposite references),
XMI (de)serialization,
JSON (de)serialization, etc.
Edit
I have found something more interesting for you, with an example similar to yours: check out JsonBender.
import json
from jsonbender import bend, K, S

MAPPING = {
    'Person': {
        'firstname': S('Person', 'name', 'first'),
        'lastname': S('Person', 'name', 'last'),
        'birth': S('birth_date'),
        'points': S('points')
    }
}

source = {
    "Person": {
        "name": {
            "first": "John",
            "last": "Smith"
        }
    },
    "birth_date": "1997.01.12",
    "points": "330"
}

result = bend(MAPPING, source)
print(json.dumps(result))
Output:
{"Person": {"lastname": "Smith", "points": "330", "firstname": "John", "birth": "1997.01.12"}}
Here is my take on this (converter lambdas and dot-based notation idea taken from tobias_k):
import datetime

converters = {
    (str, datetime.date): lambda s: datetime.date(*map(int, s.split("."))),
    (datetime.date, str): lambda d: d.strftime("%Y.%m.%d"),
}

mapping = [
    ('Person.name.first', str, 'Person.firstname', str),
    ('Person.name.last', str, 'Person.lastname', str),
    ('birth_date', str, 'Person.birth', datetime.date),
    ('points', str, 'Person.points', int),
]

def convert_doc(doc, mapping, converters, inverse=False):
    converted = {}
    for keys1, type1, keys2, type2 in mapping:
        if inverse:
            keys1, type1, keys2, type2 = keys2, type2, keys1, type1
        converter = converters.get((type1, type2), type2)
        keys1 = keys1.split('.')
        keys2 = keys2.split('.')
        obj1 = doc
        while keys1:
            k, *keys1 = keys1
            obj1 = obj1[k]
        dict2 = converted
        while len(keys2) > 1:
            k, *keys2 = keys2
            dict2 = dict2.setdefault(k, {})
        dict2[keys2[0]] = converter(obj1)
    return converted
# Test
doc1 = {
    "Person": {
        "name": {
            "first": "John",
            "last": "Smith"
        }
    },
    "birth_date": "1997.01.12",
    "points": "330"
}

doc2 = {
    "Person": {
        "firstname": "John",
        "lastname": "Smith",
        "birth": datetime.date(1997, 1, 12),
        "points": 330
    }
}

assert doc2 == convert_doc(doc1, mapping, converters)
assert doc1 == convert_doc(doc2, mapping, converters, inverse=True)
The nice things are that you can reuse converters (even to convert different document structures) and that you only need to define the non-trivial conversions. The drawback is that, as written, every pair of types must always use the same conversion (though it could be extended to allow optional alternative conversions).
You can use lists to describe paths to values in objects, paired with type-converting functions, for example:
import datetime

from_paths = [
    (['Person', 'name', 'first'], None),
    (['Person', 'name', 'last'], None),
    (['birth_date'], lambda s: datetime.date(*map(int, s.split(".")))),
    (['points'], lambda s: int(s))
]

to_paths = [
    (['Person', 'firstname'], None),
    (['Person', 'lastname'], None),
    (['Person', 'birth'], lambda d: d.strftime("%Y.%m.%d")),
    (['Person', 'points'], str)
]
and a little function to convert from one to the other (much like tobias suggests, but without string splitting, and using reduce to get values from the dict):
from functools import reduce
import operator

def convert(from_paths, to_paths, obj):
    to_obj = {}
    for (from_keys, convfn), (to_keys, _) in zip(from_paths, to_paths):
        value = reduce(operator.getitem, from_keys, obj)
        if convfn:
            value = convfn(value)
        curr_lvl_dict = to_obj
        for key in to_keys[:-1]:
            curr_lvl_dict = curr_lvl_dict.setdefault(key, {})
        curr_lvl_dict[to_keys[-1]] = value
    return to_obj
Test:
from_json = '''{
    "Person": {
        "name": {
            "first": "John",
            "last": "Smith"
        }
    },
    "birth_date": "1997.01.12",
    "points": "330"
}'''
>>> import json
>>> obj = json.loads(from_json)
>>> new_obj = convert(from_paths, to_paths, obj)
>>> new_obj
{'Person': {'lastname': u'Smith',
            'points': 330,
            'birth': datetime.date(1997, 1, 12),
            'firstname': u'John'}}
>>> convert(to_paths, from_paths, new_obj)
{'birth_date': '1997.01.12',
 'Person': {'name': {'last': u'Smith', 'first': u'John'}},
 'points': '330'}
