List key values for a JSON data file - Python

I have a very long JSON file that I need to make sense of in order to query the data I am interested in. To do this, I would like to extract all of the keys so I know what is available to query. Is there a quick way of doing this, or should I just write a parser that traverses the JSON file and extracts anything between either { and : or , and :?
given the example:
[{"Name": "key1", "Value": "value1"}, {"Name": "key2", "Value": "value2"}]
I am looking for the values:
"Name"
"Value"

That will depend on whether there's any nesting. But the basic pattern is something like this:
import json

with open("foo.json", "r") as fh:
    data = json.load(fh)

all_keys = set()
for datum in data:
    keys = set(datum.keys())
    all_keys.update(keys)
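If the file does turn out to be nested, a small recursive walk can gather keys at every depth (a sketch; collect_keys is a hypothetical helper name, not part of the answer above):

def collect_keys(obj, found=None):
    # Recursively gather every dict key at any depth of nesting.
    if found is None:
        found = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            found.add(k)
            collect_keys(v, found)
    elif isinstance(obj, list):
        for item in obj:
            collect_keys(item, found)
    return found

Calling collect_keys(data) on the loaded structure returns one flat set of every key name.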

This:
data = [{"Name": "key1", "Value": "value1"}, {"Name": "key2", "Value": "value2"}]
for val in data:
    print(val.keys())
gives you:
dict_keys(['Name', 'Value'])
dict_keys(['Name', 'Value'])
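If you want the keys deduplicated into a single collection rather than printed per element, a set union over the list is a small variation on the above:

all_keys = set().union(*(d.keys() for d in data))
print(all_keys)  # {'Name', 'Value'}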


parse JSON file to CSV with key values null in python

Example
{"data":"value1","version":"value2","version1":"value3"}
{"data":"value1","version1":"value3"}
{"data":"value1","version1":"value3","hi":{"a":"true,"b":"false"}}
I have a JSON file and need to convert it to CSV. However, the rows do not all have the same columns, and some rows have nested attributes. How can I convert them in a Python script?
I tried JSON-to-CSV Python code, but it gives me an error.
One straightforward way to convert a JSON file to a CSV file in Python is to use the pandas library.
import pandas as pd

data = [
    {
        "data": "value1",
        "version": "value2",
        "version1": "value3"
    },
    {
        "data": "value1",
        "version1": "value3"
    },
    {
        "data": "value1",
        "version1": "value3",
        "hi": {
            "a": "true",
            "b": "false"
        }
    }
]

df = pd.DataFrame(data)
df.to_csv('data.csv', index=False)
I have corrected the formatting of your JSON, since the original was giving errors.
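One caveat: with a plain pd.DataFrame, the nested "hi" dictionary lands in a single column as a raw dict. If you want it split into hi.a / hi.b columns instead, pandas.json_normalize is one option (a sketch using the same data list as above):

# Flatten nested dicts into dotted column names like "hi.a" and "hi.b".
df = pd.json_normalize(data)
df.to_csv('data.csv', index=False)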
You could convert the JSON data to a flat list of lists with column names on the first line. Then process that to make the CSV output.
def flatDict(D, p=""):
    # Non-dict values end the recursion: return them under an empty key.
    if not isinstance(D, dict):
        return {"": D}
    # p is the prefix for this level ("" at the top, "." below), so nested
    # keys come out joined as "parent.child".
    return {p + k + s: v for k, d in D.items() for s, v in flatDict(d, ".").items()}

def flatData(data):
    lines = [*map(flatDict, data)]
    # Union of all keys across rows, kept in order of first appearance.
    names = dict.fromkeys(k for d in lines for k in d)
    return [[*names]] + [[*map(line.get, names)] for line in lines]
The flatDict function converts a nested dictionary structure to a single-level dictionary, with nested keys combined and brought up to the top level. This is done recursively, so it works for any depth of nesting.
The flatData function flattens each entry to make a list of single-level dictionaries (lines). The union of all keys in that list forms the list of column names (using a dictionary constructor to get them in order of appearance). The list of names plus the lines is returned by converting each dictionary to a list, mapping each column name to that line's data where present (using the .get() method of dictionaries, which returns None for missing keys).
output:
E = [{"data":"value1","version":"value2","version1":"value3"},
     {"data":"value1","version1":"value3"},
     {"data":"value1","version1":"value3","hi":{"a":"true","b":"false"}}]

for line in flatData(E):
    print(line)

['data', 'version', 'version1', 'hi.a', 'hi.b']   # column names
['value1', 'value2', 'value3', None, None]        # data ...
['value1', None, 'value3', None, None]
['value1', None, 'value3', 'true', 'false']
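Since the question asks for CSV output, a minimal sketch of the final step using only the standard csv module (and the E list above) might look like:

import csv

# flatData returns a header row followed by data rows, so writerows can
# emit everything at once; swap None for "" so empty cells stay empty.
rows = [[("" if v is None else v) for v in row] for row in flatData(E)]
with open("data.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)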

How can I remove a nested dict if a specific key's value is duplicated in big JSON data?

The simple way to filter is to loop over them all, but I have a very large dataset, and looping is very time-consuming and may not be an efficient approach.
[
    {
        "from_name": "Haio",
        "from_id": 183556205,
        "receiver_name": "Shubh M",
        "targeted_id": 78545445,
        "gift_value": "$56"
    },
    {
        "from_name": "Mr. A",
        "from_id": 54545455,
        "receiver_name": "haio",
        "targeted_id": 78545445,
        "gift_value": "$7"
    }
]
What do I want to accomplish?
I just want to delete a dict if its targeted_id is the same as another's.
def remove_duplicates(lst, key=lambda x: x, acc=None, keys=None):
    # None sentinels avoid the mutable-default-argument pitfall, where the
    # same list object would be shared between separate calls.
    if acc is None:
        acc = []
    if keys is None:
        keys = []
    if lst == []:
        return acc
    elif key(lst[0]) in keys:
        return remove_duplicates(lst[1:], key=key, acc=acc, keys=keys)
    else:
        return remove_duplicates(lst[1:], key=key, acc=acc + [lst[0]], keys=keys + [key(lst[0])])
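A usage sketch for the data above, assuming it has been loaded into a list called data:

# Keep only the first record seen for each targeted_id.
deduped = remove_duplicates(data, key=lambda d: d["targeted_id"])

One caveat: recursing once per record runs into Python's default recursion limit (around 1000 frames) on long lists, so for truly massive data the pandas approach below is the safer bet.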
Provided you can load the whole dataset into memory, use pandas and drop_duplicates. By default, it keeps the first of each set of duplicate records. There is some overhead in creating a DataFrame, but if there are lots of records to go through, dropping duplicates this way should be much faster than a Python loop.
import pandas as pd

data = [
    {
        "from_name": "Haio",
        "from_id": 183556205,
        "receiver_name": "Shubh M",
        "targeted_id": 78545445,
        "gift_value": '$56'
    },
    {
        "from_name": "Mr. A",
        "from_id": 54545455,
        "receiver_name": "haio",
        "targeted_id": 78545445,
        "gift_value": '$7'
    }
]
df = pd.DataFrame(data).drop_duplicates(subset=['targeted_id'])
print(df.to_json()) # or specify a file name to save it
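One detail worth knowing: df.to_json() defaults to column-oriented output. If you want back the same list-of-records shape as the input, orient="records" does that:

print(df.to_json(orient="records"))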

Using pandas.json_normalize to "unfold" a dictionary of a list of dictionaries

I am new to Python (and coding in general) so I'll do my best to explain the challenge I'm trying to work through.
I'm working with a large dataset which was exported as a CSV from a database. However, there is one column within this CSV export that contains a nested list of dictionaries (as best as I can tell). I've looked around extensively online, including on Stack Overflow, but haven't quite found a full solution. I think I understand conceptually what I'm trying to accomplish, but I'm not clear on the best method or data-prepping process to use.
Here is an example of the data (pared down to just the two columns I'm interested in):
{
    "app_ID": {
        "0": 1abe23574,
        "1": 4gbn21096
    },
    "locations": {
        "0": "[ {"loc_id" : "abc1", "lat" : "12.3456", "long" : "101.9876" },
               {"loc_id" : "abc2", "lat" : "45.7890", "long" : "102.6543"} ]",
        "1": "[ ]"
    }
}
Basically each app_ID can have multiple locations tied to a single ID, or it can be empty as seen above. I have attempted to follow some guides I found online using pandas' json_normalize() function to "unfold" the list of dictionaries into their own rows in a pandas dataframe.
I'd like to end up with something like this:
loc_id   lat       long       app_ID
abc1     12.3456   101.9876   1abe23574
abc2     45.7890   102.6543   1abe23574
etc...
I am learning how to use the different parameters of json_normalize, like record_path and meta, but haven't been able to get it to work yet.
I have tried loading the json file into a Jupyter Notebook using:
with open('location_json.json', 'r') as f:
    data = json.loads(f.read())

df = pd.json_normalize(data, record_path = ['locations'])
but it only creates a dataframe that is 1 row and multiple columns long, where I'd like to have multiple rows generated from the inner-most dictionary that tie back to the app_ID and loc_ID fields.
Attempt at a solution:
I was able to get close to the dataframe format I wanted using:
with open('location_json.json', 'r') as f:
    data = json.loads(f.read())

df = pd.json_normalize(data['locations']['0'])
but that would then require some kind of iteration through the list in order to create a dataframe, and then I'd lose the connection to the app_ID fields. (As best as I can understand how the json_normalize function works).
Am I on the right track trying to use json_normalize, or should I start over again and try a different route? Any advice or guidance would be greatly appreciated.
I can't say that suggesting the convtools library to a beginner is a good thing, because this library is almost like another Python on top of Python. It helps to dynamically define data conversions (generating Python code under the hood).
But anyway, here is the code, if I understood the input data right:
import json
from convtools import conversion as c
data = {
"app_ID": {"0": "1abe23574", "1": "4gbn21096"},
"locations": {
"0": """[ {"loc_id" : "abc1", "lat" : "12.3456", "long" : "101.9876" },
{"loc_id" : "abc2", "lat" : "45.7890", "long" : "102.6543"} ]""",
"1": "[ ]",
},
}
# define it once and use multiple times
converter = (
c.join(
# converts "app_ID" data to iterable of dicts
(
c.item("app_ID")
.call_method("items")
.iter({"id": c.item(0), "app_id": c.item(1)})
),
# converts "locations" data to iterable of dicts,
# where each id like "0" is zipped to each location.
# the result is iterable of dicts like {"id": "0", "loc": {"loc_id": ... }}
(
c.item("locations")
.call_method("items")
.iter(
c.zip(id=c.repeat(c.item(0)), loc=c.item(1).pipe(json.loads))
)
.flatten()
),
# join on "id"
c.LEFT.item("id") == c.RIGHT.item("id"),
how="full",
)
# process results, where 0 index is LEFT item, 1 index is the RIGHT one
.iter(
{
"loc_id": c.item(1, "loc", "loc_id", default=None),
"lat": c.item(1, "loc", "lat", default=None),
"long": c.item(1, "loc", "long", default=None),
"app_id": c.item(0, "app_id"),
}
)
.as_type(list)
.gen_converter()
)
result = converter(data)
assert result == [
{'loc_id': 'abc1', 'lat': '12.3456', 'long': '101.9876', 'app_id': '1abe23574'},
{'loc_id': 'abc2', 'lat': '45.7890', 'long': '102.6543', 'app_id': '1abe23574'},
{'loc_id': None, 'lat': None, 'long': None, 'app_id': '4gbn21096'}
]
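For comparison, and to answer the "should I start over and try a different route" part: the same unfold can be done with a short plain-Python loop feeding a DataFrame, no extra library needed (a sketch, assuming the corrected data dict from this answer):

import json
import pandas as pd

rows = []
for key, app_id in data["app_ID"].items():
    # Each "locations" value is itself a JSON string, so parse it first.
    locations = json.loads(data["locations"].get(key, "[]"))
    if not locations:
        rows.append({"app_ID": app_id})  # keep apps that have no locations
    for loc in locations:
        rows.append({**loc, "app_ID": app_id})

df = pd.DataFrame(rows)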

Writing 3 python dictionaries to a csv

I have 3 dictionaries (2 of them are setdefault dicts with multiple values):
Score_dict:
{'Id_1': [('100001124156327', 0.0),
          ('100003643614411', 0.0)],
 'Id_2': [('100000435456546', 5.7),
          ('100000234354556', 3.5)]}
post_dict:
{'Id_1': [('+', 100004536)],
 'Id_2': [('-', 100035430)]}
comment_dict:
{'Id_1': [('+', 1023434234)],
 'Id_2': [('-', 10343534534),
          ('*', 1097963644)]}
My current approach is to write them into 3 different CSV files and then merge them; I want to merge them according to a common first column (the ID column).
But I am unable to figure out how to merge 3 CSV files into a single CSV file. Also, is there any way to write all 3 dictionaries into a single CSV without writing them individually?
Output required:
Ids    Score_Ids               Post_Ids       Comment_Ids
Id_1   100001124156327, 0.0    +, 100004536   +, 1023434234
       100003643614411, 0.0
Id_2   100000435456546, 5.7    -, 100035430   -, 10343534534
       100000234354556, 3.5                   *, 1097963644
How can I do this correctly, and what is the best approach?
You can merge them all first, then write them to a csv file:
import pprint

scores = {
    'Id_1': [
        ('100001124156327', 0.0),
        ('100003643614411', 0.0)
    ],
    'Id_2': [
        ('100000435456546', 5.7),
        ('100000234354556', 3.5)
    ]
}
post_dict = {
    'Id_1': [
        ('+', 100004536)
    ],
    'Id_2': [
        ('-', 100035430)
    ]
}
comment_dict = {
    'Id_1': [
        ('+', 1023434234)
    ],
    'Id_2': [
        ('-', 10343534534),
        ('*', 1097963644)
    ]
}

merged = {
    key: {
        "Score_Ids": value,
        "Post_Ids": post_dict[key],
        "Comment_Ids": comment_dict[key]
    }
    for key, value
    in scores.items()  # .iteritems() is Python 2 only; use .items() in Python 3
}

pp = pprint.PrettyPrinter(depth=6)
pp.pprint(merged)
For reference: https://repl.it/repls/SqueakySlateblueDictionaries
I suggest you transform your three dicts into one list of dicts before writing it to a CSV file.
Example
rows = [
    {"Score_Id": "...", "Post_Id": "...", "Comment_Id": "..."},
    {"Score_Id": "...", "Post_Id": "...", "Comment_Id": "..."},
    {"Score_Id": "...", "Post_Id": "...", "Comment_Id": "..."},
    ...
]
And then use the csv.DictWriter class to write all the rows.
Since you have commas in your values (are you sure that's good behaviour? Maybe splitting them into two different columns would be a better approach), be careful to use tabs or something else as the separator.
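A sketch of that DictWriter step, tab-separated as suggested (rows being the list of dicts sketched above):

import csv

fieldnames = ["Score_Id", "Post_Id", "Comment_Id"]
with open("out.tsv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)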
I suggest writing all three to the same file
You could get common keys by doing something like:
# dict views can't be concatenated with + in Python 3, so take a set union
common_keys = set(score_dict) | set(post_dict) | set(comment_dict)
for key_ in common_keys:
    val_score = score_dict.get(key_, some_default_value)
    post_score = post_dict.get(key_, some_default_value)
    comment_score = comment_dict.get(key_, some_default_value)
    # print key and vals to csv as before
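To fill in that last comment, the write-out could look something like this (a sketch, using "" as the default value and the standard csv module):

import csv

with open("merged.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Ids", "Score_Ids", "Post_Ids", "Comment_Ids"])
    for key_ in sorted(common_keys):
        writer.writerow([key_,
                         score_dict.get(key_, ""),
                         post_dict.get(key_, ""),
                         comment_dict.get(key_, "")])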

Create a data frame from a complex nested dictionary?

I have a big, deeply nested JSON file saved in .txt format. I need to access some specific key pairs and create a data frame or another transformed JSON object for further use. Here is a small sample with 2 key pairs.
[
    {
        "ko_id": [819752],
        "concepts": [
            {
                "id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
                "uri": ["http://ontology.intranet.com/Taxonomy/116"],
                "language": ["en"],
                "prefLabel": ["Client coverage & relationship management"]
            }
        ]
    },
    {
        "ko_id": [819753],
        "concepts": [
            {
                "id": ["11A71731B880:http://ontology.intranet.com/Taxonomy/116#en"],
                "uri": ["http://ontology.intranet.com/Taxonomy/116"],
                "language": ["en"],
                "prefLabel": ["Client coverage & relationship management"]
            }
        ]
    }
]
The following code loads the data as a list, but I need to access the data (probably as a dictionary), and I need the "ko_id", "uri" and "prefLabel" from each key pair to put into a pandas data frame or a dictionary for further analysis.
import json as js

with open('sample_data.txt') as data_file:
    json_sample = js.load(data_file)
The following code gives me the exact values for the first element, but I don't actually know how to put it all together and build the algorithm to create the dataframe.
print(sample_dict["ko_id"][0])
print(sample_dict["concepts"][0]["prefLabel"][0])
print(sample_dict["concepts"][0]["uri"][0])

for record in sample_dict:
    df = pd.DataFrame(record['concepts'])
    df['ko_id'] = record['ko_id']
    final_df = final_df.append(df)
You can pass the data to pandas.DataFrame using a generator:
import pandas as pd
import json as js

with open('sample_data.txt') as data_file:
    json_sample = js.load(data_file)

df = pd.DataFrame(data = ((key["ko_id"][0],
                           key["concepts"][0]["prefLabel"][0],
                           key["concepts"][0]["uri"][0]) for key in json_sample),
                  columns = ("ko_id", "prefLabel", "uri"))
Output:
>>> df
    ko_id                                   prefLabel                                         uri
0  819752  Client coverage & relationship management  http://ontology.intranet.com/Taxonomy/116
1  819753  Client coverage & relationship management  http://ontology.intranet.com/Taxonomy/116
