How to use Python to efficiently parse a frequently updated JSON file?

I need to use Python to parse a JSON file that has this format:
{
    "seldom_changed_key_0": "frequently_changed_value_0",
    "seldom_changed_key_1": "frequently_changed_value_1",
    "seldom_changed_key_2": "frequently_changed_value_2",
    "seldom_changed_key_3": "frequently_changed_value_3",
    ....
    "seldom_changed_key_n": "frequently_changed_value_n",
}
This JSON file is huge (greater than 100 MB).
As the format indicates, the values change frequently, but the format (or structure) itself seldom changes.
Each time the file is updated, I have to re-parse it from scratch.
Is there any optimization method that can reduce the parse time?
Any suggestions are appreciated.
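One option, echoed in the related answers further down, is simply to swap in a faster parser. A minimal sketch, assuming orjson is installed (the file name is a placeholder):

import orjson

# Re-parse the whole file, just with a faster parser than the stdlib json module.
with open('huge.json', 'rb') as f:   # orjson parses bytes directly
    data = orjson.loads(f.read())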

Related

Load part of JSON with `json` and part with `pandas` in Python

I have a JSON file that contains "informational" fields, followed by a massive "data" array (also in JSON format). I'll analyze the data array as a Pandas DataFrame. I'm trying to make my script robust to possibly enormous JSON files.
So I'd like to avoid loading the massive data array into memory twice. For example, the following code creates the Data array (which could eventually be huge) twice: once with json.load(), and a second time with pandas.DataFrame.from_records().
Example JSON file "dummydata.json":
{
    "ID": 4,
    "DUT": "Resistor1",
    "Timestamp": "2022-09-23T16:56:29.653-05:00",
    "Voltage Units": "V",
    "Measurement Time Units": "hr",
    "Current Units": "A",
    "Data": {
        "Voltage": 9.9984,
        "Measurement Time": [ 0.000085, 0.000363, 0.000641, ... ],
        "Current": [ 0.000000, 0.010600, 0.010500, ... ]
    }
}
(You can see the Data dict at the end is the part that can become very large over time, as we record Current over a very long time.)
# Load the whole JSON file
import json

with open('dummydata.json', 'r') as f:
    data = json.load(f)

# Create a Pandas DataFrame from the measurement data:
import pandas as pd

df = pd.DataFrame.from_records(data['Data'])
The above code has loaded the Data array into both a dict (data) and a DataFrame (df), which seems inefficient at best. Currently I'm only loading ~215k lines as we test the system, but I expect this to grow to up to 10 million data lines in later renditions.
Is there a convenient way to grab only the "informational" fields from the JSON file (and avoid the "Data" dict), then have Pandas load only the "Data" dict?
Or another way to sequentially load huge data files without overloading system memory (for things like plotting)?
By "convenient" I mean without me writing code to parse the file line by line myself, which I can certainly do, but hopefully this is a problem already solved by some brilliant module out there. Maybe indexing into the data file by line # or something similar?
My data files will always be ordered in this same order, with "Data" at the bottom, if that helps.
What you have is pretty efficient, honestly. (You could swap in a different, faster JSON parser such as orjson if you like.) Trying to mangle and parse the JSON data "by hand" just to extract Data is likely to be brittle and not worth it.
By the time you have 10M entries in the "Data" lists, you'll likely want to consider something other than JSON for interchange anyway, since encoding such a JSON file will take a while as well; not to mention that pandas will need to spend time inferring data types from that dict, which another format wouldn't require.
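For reference, a minimal sketch of that swap, assuming orjson is installed (the file name and the DataFrame call are taken from the question):

# Parse with orjson instead of the stdlib json module; everything else is
# unchanged from the question's code.
import orjson
import pandas as pd

with open('dummydata.json', 'rb') as f:   # orjson parses bytes directly
    data = orjson.loads(f.read())

df = pd.DataFrame.from_records(data['Data'])   # same call as before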

How do I use the output of beam.io.ReadFromText as an input in my Python class?

I am reading a JSON file in a Dataflow pipeline using beam.io.ReadFromText. When I pass its output to a class (a ParDo), it becomes element. I want to use this JSON file's content in my class. How do I do this?
Content of the JSON file:
{"query": "select * from tablename", "Unit": "XX", "outputFileLocation": "gs://test-bucket/data.csv", "location": "US"}
Here I want to use each of its values, such as query, Unit, location and outputFileLocation, in the class Query():
p | beam.io.ReadFromText(file_pattern=user_options.inputFile) | 'Executing Query' >> beam.ParDo(Query())
My class:
class Query(beam.DoFn):
    def process(self, element):
        # do something using content available in element
        .........
I don't think it is possible with the current set of IOs.
The reason is that a multiline JSON document requires parsing the complete file to identify a single JSON block. This would have been possible if there were no parallelism while reading. However, since file-based IOs run on multiple workers in parallel, using a certain partitioning logic and a line delimiter, parsing multiline JSON is not possible.
If you have multiple smaller files, then you can probably read those files separately and emit the parsed JSON. You can then use a reshuffle to evenly distribute the data for the downstream operations.
The pipeline would look something like this:
Get File List -> Reshuffle -> Read content of individual files and emit the parsed json -> Reshuffle -> Do things.
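A rough sketch of that shape using Beam's fileio transforms (the file pattern is a placeholder, and the Query body here only illustrates pulling out the fields from the question):

import json

import apache_beam as beam
from apache_beam.io import fileio


class Query(beam.DoFn):
    def process(self, element):
        # element is the parsed dict, so the individual fields are available here
        yield (element['query'], element['Unit'],
               element['location'], element['outputFileLocation'])


with beam.Pipeline() as p:
    _ = (
        p
        | 'Get File List' >> fileio.MatchFiles('gs://test-bucket/*.json')  # placeholder pattern
        | 'Reshuffle Files' >> beam.Reshuffle()
        | 'Read Files' >> fileio.ReadMatches()
        | 'Parse JSON' >> beam.Map(lambda f: json.loads(f.read_utf8()))
        | 'Redistribute' >> beam.Reshuffle()
        | 'Executing Query' >> beam.ParDo(Query())
    )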

How to extract some text from a JSON file without loading it?

Python's lxml can be used to extract text (e.g., with XPath) from XML files without having to fully parse the XML. For example, I can do the following, which is faster than BeautifulSoup, especially for large input. I'd like to have equivalent code for JSON.
from lxml import etree

tree = etree.XML('<foo><bar>abc</bar></foo>')
print(type(tree))
r = tree.xpath('/foo/bar')
print([x.tag for x in r])
I see http://goessner.net/articles/JsonPath/, but I don't see example Python code to extract some text from a JSON file without having to use json.load(). Could anybody show me an example? Thanks.
I'm assuming you don't want to load the entire JSON for performance reasons.
If that's the case, perhaps ijson is what you need. I used it to search huge JSON files (>8 GB) and it works well.
However, you will have to implement the search code yourself.
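Not part of the answer above, but a minimal sketch of what ijson usage can look like, mirroring the XML example from the question (the document structure here is hypothetical):

import io
import ijson

# Hypothetical JSON equivalent of the <foo><bar>abc</bar></foo> example.
doc = io.BytesIO(b'{"foo": {"bar": "abc"}}')

# Stream only the value(s) at the foo.bar prefix, without loading the whole document.
for value in ijson.items(doc, 'foo.bar'):
    print(value)   # -> abc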

Parsing N-Triples Via Streaming

I was fairly confused about this for some time, but I finally learned how to parse a large N-Triples RDF store (.nt) using Raptor and the Redland Python Extensions.
A common example is to do the following:
import RDF

parser = RDF.Parser(name="ntriples")
model = RDF.Model()
stream = parser.parse_into_model(model, "file:./mybigfile.nt")
for triple in model:
    print(triple.subject, triple.predicate, triple.object)
parse_into_model() by default loads the objects into memory, so if you are parsing a big file you could consider using a HashStorage as your model and serializing it that way.
But what if you want to just read the file and, say, add it to MongoDB without loading it into a Model or anything complicated like that?
import RDF

parser = RDF.NTriplesParser()
for triple in parser.parse_as_stream("file:./mybigNTfile.nt"):
    print(triple.subject, triple.predicate, triple.object)

Parameter with dictionary path

I am very new to Python and am not very familiar with the data structures in Python.
I am writing an automatic JSON parser in Python. The JSON message is read into a dictionary using UltraJSON (ujson):
jsonObjs = ujson.loads(data)
Now, if I try something like
jsonObjs[param1][0][param2]
it works fine.
However, I need to get the path from an external source (I read it from the DB). We initially thought we'd just write in the DB:
myPath = [param1][0][param2]
and then try to access:
jsonObjs[myPath]
But after a couple of failures I realized I'm trying to access:
jsonObjs[[param1][0][param2]]
Is there a way to fix this without parsing myPath?
Many thanks for your help and advice
Store the keys in a format that preserves type information, e.g. JSON, and then use reduce() to perform recursive accesses on the structure.
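A minimal sketch of that idea, assuming the path is stored in the DB as a JSON array of keys and indices (the concrete data and path here are illustrative):

import json
from functools import reduce

# Stand-in for the ujson.loads(data) call from the question.
jsonObjs = json.loads('{"param1": [{"param2": 42}]}')

# The path as read from the DB, stored as JSON so that the string keys
# and the integer index keep their types.
myPath = json.loads('["param1", 0, "param2"]')

# Walk the structure one key at a time.
value = reduce(lambda obj, key: obj[key], myPath, jsonObjs)
print(value)  # -> 42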
