I have a large number of JSON documents. I would like to store them in an RDBMS for querying. Once there they will never change; it's a data warehousing issue. I have lots of RDBMS data that I want to match the JSON data with, so it would be inefficient to store the JSON in a document store (e.g. CouchDB).
From hunting the web, I gather that the best approach might be to create JSON schema files using a tool such as JSON Schema Generator and then use those schemas to build a structured series of RDBMS tables. My data is sufficiently limited in scope (minimal JSON nesting) that I could do this by hand if needed, but a tool that automatically converted from JSON schema to DB DDL statements would be great if it is out there.
My question has two parts, but it is aimed mainly at the first issue: is there a tool or method by which I can create a master schema that describes all of my data, given that many instances are missing various fields (and I have tens of gigabytes of JSON data)? The second part concerns the serialization process: does there exist a library (ideally Python) that would take a schema file and a JSON object and output the DML to insert it into an RDBMS?
We just published this package at https://github.com/deepstartup/jsonutils. Maybe you will find it useful. If you need us to update something, open up a JIRA.
Try:
pip install DDLJ
from DDLj import genddl
genddl(*param1, param2, *param3, *param4)
Where:
param1 = JSON schema file
param2 = Database (default Oracle)
param3 = Glossary file
param4 = DDL output script
Some draft Python for converting JSON to DDL. You'll have to adapt it for JSON schema.
import json
import sys

# Expects a file whose top level looks like:
#   {"fields": [{"name": "id", "type": "INTEGER"}, {"name": "title", "type": "TEXT"}]}
with open(sys.argv[1]) as fp:
    jsobj = json.load(fp)

# Emit a CREATE TABLE statement with one column per field
columns = ["    %s %s" % (elt["name"], elt["type"]) for elt in jsobj["fields"]]
print("CREATE TABLE my_table (")  # adapt the table name as needed
print(",\n".join(columns))
print(");")
Related
I'm querying a real estate API using Python (requests), with POST data submitted in JSON format.
I'm getting responses as expected - however each time I want to make a query I'm editing the fields in a hardcoded JSON object in the .py file.
I'd like to do something a bit more robust, e.g. using a user prompt to populate the JSON object to be submitted, based on the API search schema (see the linked JSON file on pastebin). I'm open to alternative Python-based solutions to this.
The linked schema includes the full list of parameters available to query. I'll likely trim this down to the ones most relevant to the queries I'm building/POSTing, so that there are fewer parameters to deal with. What is a Pythonic way to cycle through the parameters in the schema and add the ones I wish to submit to the query JSON object? A rough sketch of what I mean is below.
TIA.
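For illustration, something along these lines is what I have in mind; the schema layout (a "parameters" key mapping names to descriptions), the endpoint and the field names here are made up, not taken from the real API:
import json
import requests

# Assumed schema layout: {"parameters": {"name": "description", ...}}
with open("search_schema.json") as fp:
    schema = json.load(fp)

# Prompt for each parameter; blank input means "don't include it in the query"
query = {}
for name, description in schema.get("parameters", {}).items():
    value = input("%s (%s) - leave blank to skip: " % (name, description))
    if value:
        query[name] = value

# Made-up endpoint, purely for illustration
response = requests.post("https://api.example.com/search", json=query)
print(response.json())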
I have data located within various sources (JSON files, various APIs, etc.). There is now a requirement to collate all of this data and push it into a Cayley graph database.
This will eventually act as an input for a chatbot framework. I am currently not aware of how to collate the existing data, push it into Cayley, and retrieve it from the Cayley graph database.
Help needed.
Thanks in advance.
Unfortunately, Cayley cannot import JSON data directly by design.
The main reason is that it has no way of knowing which values in JSON are node IDs and which are regular string values.
However, it supports the JSON-LD format, which is regular JSON with some additional annotations. These annotations resolve the ambiguity I mentioned.
I suggest checking the JSON-LD Playground examples first and then schema.org for a list of well-known object types. Note that it's also possible to define your own types; see the JSON-LD documentation for details.
The last step would be to use Cayley's HTTP API v2 to import the data. Make sure to pass a correct Content-Type header, or use Cayley client that supports JSON-LD.
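As a rough illustration (the port, the endpoint path and the sample data here are my assumptions, so check them against your Cayley version and deployment), importing a small JSON-LD document over HTTP could look like this:
import json
import requests

# A minimal JSON-LD document: @id marks node identifiers, while @context and
# @type (here borrowed from schema.org) tell Cayley how to interpret the keys
doc = {
    "@context": "http://schema.org/",
    "@id": "http://example.com/alice",
    "@type": "Person",
    "name": "Alice",
    "knows": {"@id": "http://example.com/bob"},
}

# Assumed default port and HTTP API v2 write endpoint; adjust to your setup
resp = requests.post(
    "http://localhost:64210/api/v2/write",
    data=json.dumps(doc),
    headers={"Content-Type": "application/ld+json"},
)
print(resp.status_code, resp.text)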
I am building a warehouse consisting of data pulled from a public-facing API. In order to store and analyze the data, I'd like to save the JSON files I'm receiving into a structured SQL database. That is, the JSON contents shouldn't all be contained in a single column; they should be parsed out and stored across various tables in a relational database.
From a process standpoint, I need to do the following:
Call API
Receive JSON
Parse JSON file
Insert/Update table(s) in a SQL database
(This process will be repeated hundreds and hundreds of times)
Is there a best practice to accomplish this - from either a process or resource standpoint? I'd like to do this in Python if possible.
Thanks.
You should be able to use json.dumps(json_value) to convert your JSON object into a JSON string that can be put into an SQL database.
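Putting the steps from the question together, here is a minimal sketch of the call/parse/insert loop; the endpoint URL, the "results" key and the table layout are made up for illustration, and sqlite3 stands in for whatever SQL database you use:
import json
import sqlite3
import requests

API_URL = "https://api.example.com/listings"  # made-up endpoint

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS listings (id TEXT PRIMARY KEY, price REAL, raw TEXT)")

# 1-2. Call the API and receive JSON
response = requests.get(API_URL)
response.raise_for_status()
payload = response.json()

# 3-4. Parse out the columns you care about and insert/update the table;
#      json.dumps keeps a copy of the full document alongside the parsed fields
for item in payload.get("results", []):
    conn.execute(
        "INSERT OR REPLACE INTO listings (id, price, raw) VALUES (?, ?, ?)",
        (item.get("id"), item.get("price"), json.dumps(item)),
    )
conn.commit()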
I am creating a new application which uses ZODB, and I need to import legacy data, mainly from a postgres database but also from some CSV files. There is a limited amount of manipulation needed on the data (SQL joins to merge linked tables and create properties, renaming some properties, dealing with empty columns, etc.).
With a subset of the postgres data I did a dump to CSV files of all the relevant tables, read these into pandas dataframes and did the manipulation. This works, but there are errors, which are partly due to transferring the data into CSV first.
I now want to load all of the data in (and get rid of the errors). I am wondering if it makes sense to connect directly to the database and use read_sql, or to carry on using the CSV files.
The largest table (CSV file) is only 8MB, so I shouldn't have memory issues, I hope. Most of the errors have to do with encoding and/or choice of separator (the data contains |, ;, : and ').
Any advice? I have also read about something called Blaze and wonder if I should actually be using that.
If your CSV files aren't very large (as you say) then I'd try loading everything into postgres with odo, then using blaze to perform the operations, then finally dumping to a format that ZODB can understand. I wouldn't worry about the performance of operations like join inside the database versus in memory at the scale you're talking about.
Here's some example code:
from blaze import odo, Data, join

# Load each CSV into its own postgres table
for csv, tablename in zip(csvs, tablenames):
    odo(csv, 'postgresql://localhost/db::%s' % tablename)

# Point blaze at the database
db = Data('postgresql://localhost/db')

# See the blaze documentation for more operations
expr = join(db.table1, db.table2, 'column_to_join_on')

# Execute `expr` and dump the result to a CSV file for loading into ZODB
odo(expr, 'joined.csv')
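As for the read_sql part of the question: reading straight from postgres sidesteps the encoding and separator problems of the CSV round trip entirely. A minimal sketch with pandas, with a placeholder connection string and table name:
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; adjust the user, password and database name
engine = create_engine("postgresql://user:password@localhost/legacy_db")

# Reading directly avoids the CSV round trip and its encoding/separator issues
df = pd.read_sql("SELECT * FROM some_table", engine)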
Is there a generic/automatic way in R or in Python to parse XML files with their nodes and attributes, automatically generate MySQL tables for storing that information, and then populate those tables?
Regarding "Is there a generic/automatic way in R to parse xml files with its nodes and attributes, automatically generate mysql tables for storing that information and then populate those tables?", the answer is a good old yes you can, at least in R.
The XML package for R can read XML documents and return R data.frame types in a single call using the xmlToDataFrame() function.
And the RMySQL package can transfer data.frame objects to the database in a single command---including table creation if need be---using the dbWriteTable() function defined in the common DBI backend for R and provided for MySQL by RMySQL.
So in short: two lines can do it, and you can easily wrap them in a helper function of your own, along with a commensurate amount of error checking.
These are three separate operations: parsing, table creation, and data population. You can do all three with Python, but there's nothing "automatic" about it, and I don't think it's so easy.
For example, XML is hierarchical while SQL is relational and set-based. I don't think you can always get a good relational schema for every XML stream you might encounter.
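To make that concrete, here is a minimal sketch of the manual route, assuming a flat layout where each record element's attributes become columns; the file names and table name are made up, and sqlite3 keeps the example self-contained (the same idea applies to MySQL):
import sqlite3
import xml.etree.ElementTree as ET

# Assumes a flat layout like <records><record id="1" name="x"/>...</records>
tree = ET.parse("records.xml")  # made-up input file
rows = [dict(rec.attrib) for rec in tree.getroot()]

# Parsing: derive the column list from the union of attributes seen
columns = sorted({key for row in rows for key in row})

# Table creation: everything as TEXT for simplicity
conn = sqlite3.connect("records.db")  # made-up target database
conn.execute("CREATE TABLE IF NOT EXISTS records (%s)" % ", ".join("%s TEXT" % c for c in columns))

# Data population: missing attributes become NULL
placeholders = ", ".join("?" for _ in columns)
for row in rows:
    conn.execute(
        "INSERT INTO records (%s) VALUES (%s)" % (", ".join(columns), placeholders),
        [row.get(c) for c in columns],
    )
conn.commit()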
There's the XML package for reading XML into R, and the RMySQL package for writing data from R into MySQL.
Between the two there's a lot of work. XML is more expressive than the relational model of an RDBMS like MySQL, so something that could handle any XML thrown at it would be either ridiculously complex or trivially useless.
We do something like this at work sometimes, but not in Python. In that case, each usage requires a custom program to be written. We only have a SAX parser available. Using an XML decoder to get a dictionary/hash in a single step would help a lot.
At the very least you'd have to tell it which tags map to which tables and fields; no pre-existing library can know that...