Reading JSON / array of JSONs from set of files - python

I have a series of files, each with a JSON, that I'd like to read into Pandas. This is pretty straightforward:
import json
import pandas
from pandas import json_normalize  # lived in pandas.io.json before pandas 1.0

data_unfiltered = [json.load(open(jf)) for jf in json_input_files]
# next call used to be df = pandas.DataFrame(data_unfiltered)
# instead, json_normalize flattens nested dicts
df = json_normalize(data_unfiltered)
Now, a new wrinkle. Some of these input files no longer contain just a single JSON object but instead a JSON array of objects: [ { JSON }, { JSON }... ].
json.load is pretty great because it reads a whole file and parses it in one call; I don't have to parse the file myself. How would I now turn a list of files, some of which contain a single JSON object and some of which contain a list of JSON objects, into one flat list of parsed JSON objects?
Bonus question: I used to be able to add the filename into each JSON with
df['filename'] = pandas.Series(json_input_files).values
but I can't do that any more, since one input file might now correspond to many JSONs in the output list. How can I add the filenames to the JSONs before I put them into a pandas dataframe?
Edit: Work in progress, per request in comments:
data_unfiltered = []
for jf_file in json_input_files:
    jf = open(jf_file)
    if isinstance(jf, list):  # this is obviously wrong
        for j in jf:
            d = json.load(j)  # this is what I need to fix
            d['details'] = jf_file
            data_unfiltered.append(d)
    else:  # not a list, assume dict
        d = json.load(jf)
        d['details'] = jf_file
        data_unfiltered.append(d)
but json.load() worked perfectly for what I wanted (file object straight to parsed JSON) and has no equivalent for arrays. I figure I have to manually split the file into a list of blobs and then call json.loads() on each blob? That's pretty kludgy though.
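One way to restructure that loop, as a sketch (assuming json_input_files is defined as above and every file parses to either a single dict or a list of dicts): no second loader is needed, because json.load already returns a Python list when the file contains a JSON array; the isinstance check just has to run on the parsed result instead of on the file object.
import json
from pandas import json_normalize

data_unfiltered = []
for jf_file in json_input_files:
    with open(jf_file) as jf:
        parsed = json.load(jf)  # a dict or a list, depending on the file
    # wrap a lone object so both cases become a list of dicts
    records = parsed if isinstance(parsed, list) else [parsed]
    for d in records:
        d['filename'] = jf_file  # tags each record, answering the bonus question
        data_unfiltered.append(d)

df = json_normalize(data_unfiltered)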

Related

How to parse JSON with a list of lists?

I am trying to parse a "complicated" JSON string that is returned to me by an API.
It looks like this:
{
    "data": [
        ["Distance to last strike", "23.0", "miles"],
        ["Time of last strike", "1/14/2022 9:23:42 AM", ""],
        ["Number of strikes today", "1", ""]
    ]
}
While the end goal is to extract the distance, date/time, and count, for right now I am just trying to successfully get the distance.
My python script is:
import requests
import json
response_API = requests.get('http://localhost:8998/api/extra/lightning.json')
data = response_API.text
parse_json = json.loads(data)
value = parse_json['Distance to last strike']
print(value)
This does not work. If I change the value line to
value = parse_json['data']
then the entire string I listed above is returned.
I am hoping it's just a simple formatting issue. Suggestions?
You have an object containing a list of lists. If you fetch
value = parse_json['data']
then you will have a list containing three lists. So:
print(value[0][1])
will print "23.0" (first row, second element).
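Since the stated end goal is all three values, one possible follow-up (a sketch against the JSON shown above): each inner list is [label, value, unit], so a dict keyed by label makes every reading easy to reach.
# build {label: value} from the list of [label, value, unit] triples
readings = {label: value for label, value, unit in parse_json['data']}
print(readings['Distance to last strike'])   # 23.0
print(readings['Time of last strike'])       # 1/14/2022 9:23:42 AM
print(readings['Number of strikes today'])   # 1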

How to store and load a Python dictionary with HDF5

I'm having issues loading a dictionary (string keys and array/list values) from an HDF5 file (I think storing works: a file is being created and contains data). I'm receiving the following error:
ValueError: malformed node or string: <HDF5 dataset "dataset_1": shape (), type "|O">
My code is:
import h5py
import numpy as np
import ast

def store_table(self, filename):
    table = dict()
    table['test'] = list(np.zeros(7, dtype=int))
    with h5py.File(filename, "w") as file:
        file.create_dataset('dataset_1', data=str(table))
    file.close()

def load_table(self, filename):
    file = h5py.File(filename, "r")
    data = file.get('dataset_1')
    print(ast.literal_eval(data))
I've read online that ast.literal_eval should work, but it doesn't appear to help... How do I 'unpack' the HDF5 data so it's a dictionary again?
Any ideas would be appreciated.
It's not clear to me what you really want to accomplish. (I suspect your dictionaries have more than seven zeros; otherwise, HDF5 is overkill to store your data.) If you have a lot of very large dictionaries, it would be better to convert the data to a NumPy array, then either 1) create and load the dataset with data= or 2) create the dataset with an appropriate dtype and then populate it. You can create datasets with mixed datatypes, which is not addressed in the previous solution. If those situations don't apply, you might want to save the dictionary as attributes. Attributes can be associated with a group, a dataset, or the file object itself. Which is best depends on your requirements.
I wrote a short example to show how to load dictionary key/value pairs as attribute names/value pairs tagged to a group. For this example, I assumed the dictionary has a name key with the group name for association. The process is almost identical for a dataset or file object (just change the object reference).
import h5py

def load_dict_to_attr(h5f, thisdict):
    if 'name' not in thisdict:
        print('Dictionary missing name key. Skipping function.')
        return
    dname = thisdict.get('name')
    if dname in h5f:
        print('Group:' + dname + ' exists. Skipping function.')
        return
    else:
        grp = h5f.create_group(dname)
        for key, val in thisdict.items():
            grp.attrs[key] = val

###########################################
def get_grp_attrs(name, node):
    grp_dict = {}
    for k in node.attrs.keys():
        grp_dict[k] = node.attrs[k]
    print(grp_dict)

###########################################
car1 = dict(name='my_car', brand='Ford', model='Mustang', year=1964,
            engine='V6', disp=260, units='cu.in')
car2 = dict(name='your_car', brand='Chevy', model='Camaro', year=1969,
            engine='I6', disp=250, units='cu.in')
car3 = dict(name='dads_car', brand='Mercedes', model='350SL', year=1972,
            engine='V8', disp=4520, units='cc')
car4 = dict(name='moms_car', brand='Plymouth', model='Voyager', year=1989,
            engine='V6', disp=289, units='cu.in')
a_truck = dict(brand='Dodge', model='RAM', year=1984,
               engine='V8', disp=359, units='cu.in')
garage = dict(my_car=car1,
              your_car=car2,
              dads_car=car3,
              moms_car=car4,
              a_truck=a_truck)

with h5py.File('SO_61226773.h5', 'w') as h5w:
    for car in garage:
        print('\nLoading dictionary:', car)
        load_dict_to_attr(h5w, garage.get(car))

with h5py.File('SO_61226773.h5', 'r') as h5r:
    print('\nReading dictionaries from Group attributes:')
    h5r.visititems(get_grp_attrs)
If I understand what you are trying to do, this should work:
import numpy as np
import ast
import h5py

def store_table(filename):
    table = dict()
    table['test'] = list(np.zeros(7, dtype=int))
    with h5py.File(filename, "w") as file:
        file.create_dataset('dataset_1', data=str(table))

def load_table(filename):
    file = h5py.File(filename, "r")
    data = file.get('dataset_1')[...].tolist()
    file.close()
    return ast.literal_eval(data)

filename = "file.h5"
store_table(filename)
data = load_table(filename)
print(data)
My preferred solution is just to convert the dictionary to ASCII and then store this binary data.
import h5py
import json
import itertools

# generate a test dictionary
testDict = {
    "one": 1,
    "two": 2,
    "three": 3,
    "otherStuff": [{"A": "A"}]
}

# create a test dataset containing the binary representation of the dictionary
encoded = [i.encode("ascii", "ignore") for i in json.dumps(testDict)]
testFile = h5py.File("test.h5", "w")
testFile.create_dataset(name="dictionary", shape=(len(encoded), 1),
                        dtype="S10", data=encoded)
testFile.close()

testFile = h5py.File("test.h5", "r")
# load the test data back
dictionary = testFile["dictionary"][:].tolist()
dictionary = list(itertools.chain(*dictionary))
dictionary = json.loads(b''.join(dictionary))
The two key parts are:
encoded = [i.encode("ascii", "ignore") for i in json.dumps(testDict)]
which converts the dictionary to a list of ASCII characters (the dataset shape may also be calculated from its length), and the create_dataset call that writes that list into the file.
Decoding back from the hdf5 container is a little simpler:
dictionary=testFile["dictionary"][:].tolist()
dictionary=list(itertools.chain(*dictionary))
dictionary=json.loads(b''.join(dictionary))
All this does is load the string from the HDF5 container and convert it to a list of bytes. Then I coerce this into a single bytes object, which I can convert back to a dictionary with json.loads.
If you are OK with the extra library usage (json, itertools), I think this offers a somewhat more Pythonic solution (which in my case wasn't a problem since I was using them anyway).
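For comparison, a minimal sketch of the same idea without the character-by-character encoding: h5py can store a Python str directly as a variable-length string dataset, so the whole json.dumps output can go in as one value (assuming h5py 3.x, where string data reads back as bytes, which json.loads accepts).
import json
import h5py

testDict = {"one": 1, "two": 2, "three": 3, "otherStuff": [{"A": "A"}]}

# store the entire JSON document as a single string dataset
with h5py.File("test.h5", "w") as f:
    f.create_dataset("dictionary", data=json.dumps(testDict))

# read it back; [()] yields the scalar value (bytes under h5py 3.x)
with h5py.File("test.h5", "r") as f:
    restored = json.loads(f["dictionary"][()])

assert restored == testDict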

How to add an element to an empty JSON in python?

I created an empty dict & converted it into a JSON string with json.dumps. Once I want to add an element, it fails & shows
AttributeError: 'str' object has no attribute 'append'
I tried both json.insert & json.append but neither of them works.
It seems to be a data type problem. Since Python can't declare data types the way Java & C can, how can I avoid the problem?
import json
data = {}
json_data = json.dumps(data)
json_data.append(["A"]["1"])
print (json_data)
JSON is a string representation of data, such as lists and dictionaries. You don't append to the JSON; you append to the original data and then dump it.
Also, you don't use append() with dictionaries; it's used with lists.
import json

data = {}  # This is a dictionary
data["a"] = "1"  # Add an item to the dictionary
json_data = json.dumps(data)  # Convert the dictionary to a JSON string
print(json_data)
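And to illustrate the list case the answer mentions, a small sketch:
import json

data = {"letters": []}        # a dict holding a list
data["letters"].append("A")   # append() goes on the list, not on the JSON string
json_data = json.dumps(data)
print(json_data)              # {"letters": ["A"]}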

Python: JSON to CSV - special array handling

I am using this JSON to CSV Converter to convert my JSON data to CSV, which I can further work on in Excel:
https://github.com/vinay20045/json-to-csv
The structure of my JSON data looks like the following: https://pastebin.com/rPkqcXiF
{
    "page": 1,
    "pages": 1,
    "limit": 100,
    "total": 20,
    "items": [
        {...}
    ]
}
In line 64 there is an array of items; the first item spans lines 65 to 92.
The next item with the same structure would then start at line 93, when available.
My problem now is: I fetch 2 datasets from the REST API.
One of those datasets has an items array of 2 items, and the python script then generates further columns for the new items: the first item's columns are prefixed items_0, the next items_1, and so on.
Example of what I mean, with formatting for viewing in Excel: Pastebin EqGHX07U (only 2 links allowed here).
Instead of generating new columns as the number of array elements grows, I'd like to have only one set of columns in the header of the csv. When the number of array elements grows, a new line should be generated with all the other data as before; only the data of the new array element changes.
Example of what I mean, with formatting for viewing in Excel: Pastebin QLnaiqDs (only 2 links allowed here).
It would be awesome if you could help me out here! A few hints on how to solve that are highly appreciated; I am not used to python though :(
Thank you so much!
If I understood correctly, here is an approach you could think about:
headers = []
csv_text = ""

def add_empty_fields():
    """Adds a ';' char to each line in order to have empty fields for each new header"""
    global csv_text
    csv_lines = csv_text.split("\n")
    csv_text = ""
    for line in csv_lines[:-1]:
        csv_text += line + ";\n"
    csv_text += csv_lines[-1]

def add_ordered_data(json_parsed_to_dict):
    global csv_text
    # Get headers of json
    tmp_list = set(json_parsed_to_dict.keys())
    # Add ordered data following the known headers list
    for h in headers:
        if h in tmp_list:
            tmp_list.discard(h)
            csv_text += json_parsed_to_dict[h] + ";"
        else:
            csv_text += ";"
    # Add new headers behind it
    for new_header in tmp_list:
        headers.append(new_header)
        csv_text += json_parsed_to_dict[new_header] + ";"  # was [h], a typo
    # add_empty_fields()
    # You can do csv_text.replace("\n", ";\n") here instead of add_empty_fields hahah
    csv_text += "\n"
I wrote this all directly here, so it probably has some flaws, but I hope it helps you.
Apart from handling JSON data with a node element, the script can also handle a JSON array without a node element as input. Refer to the readme file and sample_2 in the repo.
So, you can pre-process the input to get all the items from both API responses and merge them before feeding the list to the converter, like...
data_to_be_processed = items_from_api_1 + items_from_api_2
You could write this pre-processor as a standalone module or modify my script just after line 77.
Hope it helps...
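For illustration, a minimal sketch of that pre-processing step (the file names are hypothetical; the "items" key matches the JSON structure quoted in the question):
import json

# hypothetical file names; substitute however you actually fetch the
# two API responses
with open("response_1.json") as f1, open("response_2.json") as f2:
    resp_1, resp_2 = json.load(f1), json.load(f2)

# concatenating the two "items" arrays gives the converter one flat list,
# so each item becomes its own row instead of new items_N columns
data_to_be_processed = resp_1["items"] + resp_2["items"]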

Error: Set not JSON Serializable while converting TSV File into JSON Format using Python

I'm looking to convert a TSV file I have into JSON format for mapping (Google Fusion maps doesn't support mapping multiple objects in the same location, so I'm converting it to JSON format to try on Mapbox). Here's my TSV file if you're curious:
https://github.com/yongcho822/Movies-in-the-park/blob/master/MovieParksGeocodeTest.tsv
And here is my corresponding python code thus far:
import json
import csv

def create_map(datafile):
    geo_map = {"type": "FeatureCollection"}
    item_list = []
    with open(datafile, 'r') as tsvfile:
        reader = csv.DictReader(tsvfile, delimiter='\t')
        for i, line in enumerate(reader):
            data = {}
            data['type'] = 'Feature'
            data['id'] = i
            data['properties'] = {'title': line['Movie Title'],
                                  'description': line['Amenities'],
                                  'date': line['Date']}
            data['name'] = {line['Location']}
            data['geometry'] = {'type': 'Point',
                                'coordinates': (line['Lat'], line['Lng'])}
            item_list.append(data)
    # print item_list
    for point in item_list:
        geo_map.setdefault('features', []).append(point)
    print 'CHECKPOINT'
    with open("thedamngeojson.geojson", 'w') as f:
        f.write(json.dumps(geo_map))

create_map('MovieParksGeocodeTest.tsv')
It's throwing me an error at the end (after it prints CHECKPOINT), saying
TypeError: set(['Edgebrook Park, Chicago ']) is not JSON serializable
I figure the last two lines are where the error is... but what's wrong and how do I fix it?
JSON is designed to be a very simple, very portable format; the only kinds of values it understands are strings, numbers, booleans, null (like Python None), objects (like a Python dict) and arrays (like a Python list).
But at least one of the values in your dictionary is a set:
data['name'] = {line['Location']}
Since JSON doesn't have a set type, you get an error telling you that set … is not JSON serializable.
If you don't actually need this to be a set instead of a list (which you probably don't; if it only ever has one element, who cares which collection type it is?), the easy answer is to change it to be a list:
data['name'] = [line['Location']]
(In fact, even when you do need this to be a set during processing, you usually don't need it to be a set during storage/interchange. If the consumer of your file needs to use it as a set, it can always convert it back later.)
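A side note beyond the original answer: if you did want to keep sets while processing, json.dumps also accepts a default= callable that is invoked for anything it can't serialize, so sets can be converted only at dump time. A minimal sketch:
import json

data = {'name': {'Edgebrook Park, Chicago '}}  # a set, as in the question
# default=list is called for any non-serializable value (here, the set)
print(json.dumps(data, default=list))
# {"name": ["Edgebrook Park, Chicago "]}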
