Split a large json file into multiple smaller files - python

I have a large JSON file, about 5 million records and a file size of about 32GB, that I need to get loaded into our Snowflake Data Warehouse. I need to get this file broken up into chunks of about 200k records (about 1.25GB) per file. I'd like to do this in either Node.JS or Python for deployment to an AWS Lambda function; unfortunately I haven't coded in either yet. I have C# and a lot of SQL experience, and learning both Node and Python is on my to-do list, so why not dive right in, right!?
My first question is "Which language would better serve this function? Python, or Node.JS?"
I know I don't want to read this entire JSON file into memory (or even the smaller output files). I need to be able to "stream" it in and out into a new file based on a record count (200k), properly close up the JSON objects, and continue into a new file for another 200k, and so on. I know Node can do this, but if Python can also do it, I feel like it would be easier to quickly start using for the other ETL stuff I'll be doing soon.
My second question is "Based on your recommendation above, can you also recommend what modules I should require/import to help me get started, primarily as it relates to not pulling the entire JSON file into memory? Maybe some tips, tricks, or 'how would you do it' pointers? And if you're feeling really generous, some code examples to help push me into the deep end on this?"
I can't include a sample of the JSON data, as it contains personal information. But I can provide the JSON schema ...
{
"$schema": "http://json-schema.org/draft-04/schema#",
"items": {
"properties": {
"activities": {
"properties": {
"activity_id": {
"items": {
"type": "integer"
},
"type": "array"
},
"frontlineorg_id": {
"items": {
"type": "integer"
},
"type": "array"
},
"import_id": {
"items": {
"type": "integer"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"is_source": {
"items": {
"type": "boolean"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"address": {
"properties": {
"city": {
"items": {
"type": "string"
},
"type": "array"
},
"congress_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"congress_dist_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"congress_end_yr": {
"items": {
"type": "integer"
},
"type": "array"
},
"congress_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"congress_start_yr": {
"items": {
"type": "integer"
},
"type": "array"
},
"county": {
"items": {
"type": "string"
},
"type": "array"
},
"formatted": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"latitude": {
"items": {
"type": "number"
},
"type": "array"
},
"longitude": {
"items": {
"type": "number"
},
"type": "array"
},
"number": {
"items": {
"type": "string"
},
"type": "array"
},
"observes_dst": {
"items": {
"type": "boolean"
},
"type": "array"
},
"post_directional": {
"items": {
"type": "null"
},
"type": "array"
},
"pre_directional": {
"items": {
"type": "null"
},
"type": "array"
},
"school_district": {
"items": {
"properties": {
"school_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"school_dist_type": {
"items": {
"type": "string"
},
"type": "array"
},
"school_grade_high": {
"items": {
"type": "string"
},
"type": "array"
},
"school_grade_low": {
"items": {
"type": "string"
},
"type": "array"
},
"school_lea_code": {
"items": {
"type": "integer"
},
"type": "array"
}
},
"type": "object"
},
"type": "array"
},
"secondary_number": {
"items": {
"type": "null"
},
"type": "array"
},
"secondary_unit": {
"items": {
"type": "null"
},
"type": "array"
},
"state": {
"items": {
"type": "string"
},
"type": "array"
},
"state_house_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"state_house_dist_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"state_senate_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"state_senate_dist_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"street": {
"items": {
"type": "string"
},
"type": "array"
},
"suffix": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"timezone": {
"items": {
"type": "string"
},
"type": "array"
},
"utc_offset": {
"items": {
"type": "integer"
},
"type": "array"
},
"zip": {
"items": {
"type": "integer"
},
"type": "array"
}
},
"type": "object"
},
"age": {
"type": "integer"
},
"anniversary": {
"properties": {
"date": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"baptism": {
"properties": {
"church_id": {
"type": "null"
},
"date": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"birth_dd": {
"type": "integer"
},
"birth_mm": {
"type": "integer"
},
"birth_yyyy": {
"type": "integer"
},
"church_attendance": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"cohabiting": {
"properties": {
"confidence": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood": {
"items": {
"type": "null"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"dating": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"divorced": {
"properties": {
"bool": {
"items": {
"type": "null"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood_considering": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"education": {
"properties": {
"est_level": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"email": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"is_work_school": {
"items": {
"type": "boolean"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"engaged": {
"properties": {
"insert_datetime_utc": {
"type": "null"
},
"likelihood": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"est_income": {
"properties": {
"est_level": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"ethnicity": {
"type": "string"
},
"first_name": {
"type": "string"
},
"formatted_birthdate": {
"type": "string"
},
"gender": {
"type": "string"
},
"head_of_household": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"home_church": {
"properties": {
"church_id": {
"type": "null"
},
"group_participant": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"is_coaching": {
"type": "null"
},
"is_giving": {
"type": "null"
},
"is_serving": {
"type": "null"
},
"membership_date": {
"type": "null"
},
"regular_attendee": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"hub_poid": {
"type": "integer"
},
"insert_datetime_utc": {
"type": "string"
},
"ip_address": {
"properties": {
"insert_datetime_utc": {
"type": "null"
},
"string": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"last_name": {
"type": "string"
},
"marriage_segment": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"married": {
"properties": {
"bool": {
"items": {
"type": "boolean"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"middle_name": {
"type": "string"
},
"miscellaneous": {
"properties": {
"attribute": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"value": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"name_suffix": {
"type": "null"
},
"name_title": {
"type": "null"
},
"newlywed": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"parent": {
"properties": {
"bool": {
"items": {
"type": "boolean"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood_expecting": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"person_id": {
"type": "integer"
},
"phone": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"number": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"type": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"property_rights": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"psychographic_cluster": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"religion": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"religious_segment": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"separated": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"significant_other": {
"properties": {
"first_name": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"last_name": {
"type": "null"
},
"middle_name": {
"type": "null"
},
"name_suffix": {
"type": "null"
},
"name_title": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"suppressed_datetime_utc": {
"type": "string"
},
"target_group": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
}
},
"type": "object"
},
"type": "array"
}

Use these commands at the Linux command prompt (split breaks the file into fixed-size pieces named xaa, xab, ..., and cat reassembles them):
split -b 53750k <your-file>
cat xa* > <your-file>
Refer to this link:
https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file

Answering whether Python or Node would be better for the task would be an opinion, and we are not allowed to voice opinions on Stack Overflow. You have to decide for yourself which you have more experience with and which you want to work in: Python or Node.
If you go with Node, there are modules that do streaming JSON parsing and can help you with this task, e.g.:
https://www.npmjs.com/package/JSONStream
https://www.npmjs.com/package/stream-json
https://www.npmjs.com/package/json-stream
If you go with Python, there are streaming JSON parsers here as well (a sketch of the streaming approach follows these links):
https://github.com/kashifrazzaqui/json-streamer
https://github.com/danielyule/naya
http://www.enricozini.org/blog/2011/tips/python-stream-json/
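For a concrete picture of what the streaming approach looks like in Python, here is a minimal sketch using the third-party ijson package (not one of the parsers linked above, but the same idea). The file names, the 200k-record chunk size, and the helper names are illustrative assumptions, not from the question:

import json
import ijson   # third-party streaming JSON parser: pip install ijson

CHUNK_SIZE = 200_000   # records per output file, per the question

def split_json_array(in_path, out_prefix, chunk_size=CHUNK_SIZE):
    """Stream a top-level JSON array and write it out as numbered chunk files."""
    chunk, part = [], 0
    with open(in_path, 'rb') as src:
        # 'item' selects each element of the top-level array without
        # ever holding the whole document in memory.
        for record in ijson.items(src, 'item'):
            chunk.append(record)
            if len(chunk) >= chunk_size:
                write_chunk(chunk, out_prefix, part)
                chunk, part = [], part + 1
    if chunk:
        write_chunk(chunk, out_prefix, part)

def write_chunk(records, out_prefix, part):
    # ijson may yield decimal.Decimal for floats; default=float converts them on the way out.
    with open(f"{out_prefix}_{part:04d}.json", 'w') as dst:
        json.dump(records, dst, default=float)

# Example usage (hypothetical file names):
# split_json_array('people.json', 'people_part')

Because only one chunk of records is held at a time, peak memory stays around the size of a single 200k-record chunk rather than the whole 32 GB file.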

Consider using jq to preprocess your JSON files.
It can split and stream your large JSON files.
jq is like sed for JSON data - you can use it to slice
and filter and map and transform structured data with
the same ease that sed, awk, grep and friends let you play with text.
See the official documentation and related questions for more.
Extra: regarding your first question, jq is written in C, so it's faster than Python/Node, isn't it?

Snowflake has very special treatment for JSON, and once we understand it, it is easy to draw up the design.
JSON/Parquet/Avro/XML are considered semi-structured data.
They are stored as the VARIANT data type in Snowflake.
While loading the JSON data from the stage location, set strip_outer_array = true:
copy into <table>
from @~/<file>.json
file_format = (type = 'JSON' strip_outer_array = true);
Each row cannot exceed 16 MB compressed when loaded into Snowflake.
Snowflake data loading works well when the file is split into pieces in the 10-100 MB range.
Use utilities that split the file on line boundaries, keeping each file to no more than 100 MB; that brings the power of parallelism as well as accuracy to your data.
For your data set size (~32 GB), you will get around 320 small files of ~100 MB each.
Running all 320 loads in parallel at once is not possible, so choose an X-Large warehouse (16 v-cores & 32 threads):
320 / 32 = (approximately) 10 rounds
Even at a few seconds per round, the load should not take more than a few minutes, depending on your network bandwidth.
Look at the warehouse configuration and throughput details, and refer to the semi-structured data loading best practices.
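If you end up driving the load from Python, the PUT/COPY steps described above could be scripted with the snowflake-connector-python package roughly as follows. This is only a sketch: the connection parameters, table, stage, and file names are placeholders, and it assumes the target table has a single VARIANT column.

import snowflake.connector   # pip install snowflake-connector-python

# All identifiers below are placeholders, not values from the answer above.
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="LOAD_WH", database="MY_DB", schema="PUBLIC",
)
cur = conn.cursor()
# Upload the split .json files to the user stage, then bulk-load them.
cur.execute("PUT file:///tmp/people_part_*.json @~ AUTO_COMPRESS = TRUE")
cur.execute("""
    COPY INTO people_raw
    FROM @~
    PATTERN = '.*people_part_.*'
    FILE_FORMAT = (TYPE = 'JSON' STRIP_OUTER_ARRAY = TRUE)
""")
conn.close()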

The easiest approach that worked for me was this:
import json

# Load the records (note: this reads the whole file into memory)
with open("<your_file>") as f:   # placeholder path, as in the original
    json_file = json.load(f)
chunks = 200                     # records per chunk
for i in range(0, len(json_file), chunks):
    print(json_file[i:i + chunks])

To split and compress at the same time with bash, resulting in files of ~100MB each:
cat bigfile.json | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'
See more: https://stackoverflow.com/a/68718176/132438
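The same idea can be sketched in Python, assuming the input has already been converted to line-delimited JSON (one record per line). The cap of roughly 1 GB of uncompressed text per chunk mirrors the -C 1000000000 above, which gzip then compresses to around 100 MB; the file names and helper name are illustrative:

import gzip

MAX_BYTES = 1_000_000_000   # ~1 GB of uncompressed text per chunk, like split -C above

def split_ndjson_gzip(in_path, out_prefix, max_bytes=MAX_BYTES):
    """Split a line-delimited JSON file into gzip-compressed chunks."""
    part, written = 0, 0
    out = gzip.open(f"{out_prefix}_{part:04d}.json.gz", "wt")
    with open(in_path, "r") as src:
        for line in src:
            if written >= max_bytes:          # start a new chunk once the cap is reached
                out.close()
                part, written = part + 1, 0
                out = gzip.open(f"{out_prefix}_{part:04d}.json.gz", "wt")
            out.write(line)
            written += len(line)               # rough byte count, fine for a sketch
    out.close()

# Example usage (hypothetical file names):
# split_ndjson_gzip("bigfile.ndjson", "output_prefix")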

You can use Python3 with the following script:
import json

def split_json(file_path):
    with open(file_path, 'r') as json_file:
        data = json.load(json_file)          # note: loads the whole file into memory
    chunk_size = -(-len(data) // 3)          # ceiling division so the last chunk keeps the remainder
    for i in range(3):
        with open(f"part{i}.json", 'w') as outfile:
            outfile.write(json.dumps(data[i * chunk_size:(i + 1) * chunk_size]))

file_path = input("Enter the file path of the JSON file: ")
split_json(file_path)

Related

Why is JSON-validation always successful using this schema containing 'allOf'?

I have a JSON schema with which I want to validate some data, using python and the jsonschema module. However, this doesn't quite work as expected, as some of the accepted data doesn't appear valid at all (to me and the purpose of my application). Sadly, the schema is provided, so I can't change the schema itself - at least not manually.
This is a shortened version of the schema ('schema.json' in code below):
{
"type": "object",
"allOf": [
{
"type": "object",
"allOf": [
{
"type": "object",
"properties": {
"firstName": {
"type": "string"
},
"lastName": {
"type": "string"
}
}
},
{
"type": "object",
"properties": {
"language": {
"type": "integer"
}
}
}
]
},
{
"type": "object",
"properties": {
"addressArray": {
"type": "array",
"items": {
"type": "object",
"properties": {
"streetNumber": {
"type": "string"
},
"street": {
"type": "string"
},
"city": {
"type": "string"
}
}
}
}
}
}
]
}
This is an example of what should be a valid instance ('person.json' in code below):
{
"firstName": "Sherlock",
"lastName": "Holmes",
"language": 1,
"addresses": [
{
"streetNumber": "221B",
"street": "Baker Street",
"city": "London"
}
]
}
This is an example of what should be considered invalid ('no_person.json' in code below):
{
"name": "eggs",
"colour": "white"
}
And this is the code I used for validating:
from json import load
from jsonschema import Draft7Validator, exceptions

with open('schema.json') as f:
    schema = load(f)
with open('person.json') as f:
    person = load(f)
with open('no_person.json') as f:
    no_person = load(f)

validator = Draft7Validator(schema)

try:
    validator.validate(person)
    print("person.json is valid")
except exceptions.ValidationError:
    print("person.json is invalid")

try:
    validator.validate(no_person)
    print("no_person.json is valid")
except exceptions.ValidationError:
    print("no_person.json is invalid")
Result:
person.json is valid
no_person.json is valid
I expected no_person.json to be invalid. What can be done so that only data like person.json validates successfully? Thank you very much for your help, I'm very new to this (I spent ages searching for an answer).
This is a working schema; pay attention to "required" (when that key is absent, a missing field is simply skipped, which is why your original schema accepted no_person.json: none of its properties were required):
{
"type": "object",
"properties": {
"firstName": {
"type": "string"
},
"lastName": {
"type": "string"
},
"language": {
"type": "integer"
},
"addresses": {
"type": "array",
"items": {
"type": "object",
"properties": {
"streetNumber": {
"type": "string"
},
"street": {
"type": "string"
},
"city": {
"type": "string"
}
},
"required": [
"streetNumber",
"street",
"city"
]
}
}
},
"required": [
"firstName",
"lastName",
"language",
"addresses"
]
}
I've got:
person.json is valid
no_person.json is invalid
If you have a more complex response structure (an array of objects which contain objects, etc.), let me know.

Script to append JSON structure

I have a JSON structure to which some code needs to be appended. I tried with sed and bash, but that only appends at the end of a string or file, not at the end of the structure.
{
"$schema": "http://json-schema.org/draft-04/schema#",
"required": [
"accounts"
],
"accounts": {
"required": "account",
"properties": {
"account": {
"type": "array",
"minItems": 1,
"maxItems": 999,
"required": [
"scheme",
"accountType",
"accountSubType"
],
"items": {
"type": "object",
"properties": {
"scheme": {
"description": "scheme",
"type": "object",
"required": [
"schemeName",
"identification"
],
"properties": {
"schemeName": {
"type": "string",
"maxLength": 40
},
"identification": {
"type": "string",
"maxLength": 256
},
"name": {
"type": "string",
"maxLength": 70
},
"secondaryIdentification": {
"type": "string",
"maxLength": 35
}
}
},
"currency": {
"type": "string",
"format": "iso-4217",
"pattern": "^[A-Z]{3,3}$",
"maxLength": 3,
"example": "EUR"
},
"accountType": {
"type": "string"
},
"accountSubType": {
"type": "string",
"maxLength": 35
}
}
}
}
}
}
}
I would like to update the above as
{
"$schema": "http://json-schema.org/draft-04/schema#",
"required": [
"accounts"
],
"accounts": {
"required": "account",
"properties": {
"account": {
"type": "array",
"minItems": 1,
"maxItems": 999,
"required": [
"scheme",
"accountType",
"accountSubType"
],
"items": {
"type": "object",
"properties": {
"scheme": {
"description": "scheme",
"type": "object",
"required": [
"schemeName",
"identification"
],
"properties": {
"schemeName": {
"type": "string",
"maxLength": 40
},
"identification": {
"type": "string",
"maxLength": 256
},
"name": {
"type": "string",
"maxLength": 70
},
"secondaryIdentification": {
"type": "string",
"maxLength": 35
}
},
"additionalProperties": false
},
"currency": {
"type": "string",
"format": "iso-4217",
"pattern": "^[A-Z]{3,3}$",
"maxLength": 3,
"example": "EUR"
},
"accountType": {
"type": "string"
},
"accountSubType": {
"type": "string",
"maxLength": 35
}
},
"additionalProperties": false
}
}
},
"additionalProperties": false
}
}
The difference is at the end of every "properties" section: I have appended "additionalProperties": false.
Is there a way to do this through a script that finds all "properties" sections and appends that key?
You can do this with jq (Requires jq 1.6 because it uses the walk() function to traverse the entire structure):
$ jq 'walk(if type == "object" and has("properties") then . + { additionalProperties: false } else . end)' your.json
Does it matter if "additionalProperties" comes after or before "properties"?
If not, you could use sed to add "additionalProperties" before the object "properties" like this:
sed -E 's/([[:space:]]*)"properties": {/\1"additionalProperties": false,|\1"properties": {/g'| tr '|' '\n'
With that you will get:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"required": [
"accounts"
],
"accounts": {
"required": "account",
"additionalProperties": false,
"properties": {
"account": {
"type": "array",
"minItems": 1,
"maxItems": 999,
"required": [
"scheme",
"accountType",
"accountSubType"
],
"items": {
"type": "object",
"additionalProperties": false,
"properties": {
"scheme": {
"description": "scheme",
"type": "object",
"required": [
"schemeName",
"identification"
],
"additionalProperties": false,
"properties": {
"schemeName": {
"type": "string",
"maxLength": 40
},
"identification": {
"type": "string",
"maxLength": 256
},
"name": {
"type": "string",
"maxLength": 70
},
"secondaryIdentification": {
"type": "string",
"maxLength": 35
}
}
},
"currency": {
"type": "string",
"format": "iso-4217",
"pattern": "^[A-Z]{3,3}$",
"maxLength": 3,
"example": "EUR"
},
"accountType": {
"type": "string"
},
"accountSubType": {
"type": "string",
"maxLength": 35
}
}
}
}
}
}
}
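For completeness, the same transformation could also be done in Python with a small recursive walk, analogous to jq's walk(). This is only a sketch; the schema.json file name is an assumption:

import json

def add_additional_properties(node):
    """Add "additionalProperties": false next to every "properties" key, recursively."""
    if isinstance(node, dict):
        if "properties" in node:
            node.setdefault("additionalProperties", False)
        for value in node.values():
            add_additional_properties(value)
    elif isinstance(node, list):
        for item in node:
            add_additional_properties(item)

with open("schema.json") as f:      # file name is a placeholder
    schema = json.load(f)
add_additional_properties(schema)
print(json.dumps(schema, indent=2))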

How to validate a specific jsonschema based on document type

I have a JSON schema for user's new message:
message_creation = {
"title": "Message",
"type": "object",
"properties": {
"post": {
"oneOf": [
{
"type": "object",
"properties": {
"content": {
"type": "string"
}
},
"additionalProperties": False,
"required": ["content"]
},
{
"type": "object",
"properties": {
"image": {
"type": "string"
}
},
"additionalProperties": False,
"required": ["image"]
},
{
"type": "object",
"properties": {
"video_path": {
"type": "string"
}
},
"additionalProperties": False,
"required": ["video"]
}
]
},
"doc_type": {
"type": "string",
"enum": ["text", "image", "video"]
}
},
"required": ["post", "doc_type"],
"additionalProperties": False
}
It's as simple as that! There are two fields: one is type and the other is post. So a payload like the one below succeeds:
{
"post": {
"image": "Hey there!"
},
"type": "image"
}
Now the problem is that if the user sets the type value to text, I cannot validate that the text schema has been given. How should I verify this? How should I check that, in case type is set to image, image exists inside of post?
You can do it, but it's complicated. This uses a boolean logic concept called implication to ensure that if schema A matches then schema B must also match.
{
"type": "object",
"properties": {
"post": {
"type": "object",
"properties": {
"content": { "type": "string" },
"image": { "type": "string" },
"video_path": { "type": "string" }
},
"additionalProperties": false
},
"doc_type": {
"type": "string",
"enum": ["text", "image", "video"]
}
},
"required": ["post", "doc_type"],
"additionalProperties": false,
"allOf": [
{ "$ref": "#/definitions/image-requires-post-image" },
{ "$ref": "#/definitions/text-requires-post-content" },
{ "$ref": "#/definitions/video-requires-post-video-path" }
],
"definitions": {
"image-requires-post-image": {
"anyOf": [
{ "not": { "$ref": "#/definitions/type-image" } },
{ "$ref": "#/definitions/post-image-required" }
]
},
"type-image": {
"properties": {
"doc_type": { "const": "image" }
}
},
"post-image-required": {
"properties": {
"post": { "required": ["image"] }
}
},
"text-requires-post-content": {
"anyOf": [
{ "not": { "$ref": "#/definitions/type-text" } },
{ "$ref": "#/definitions/post-content-required" }
]
},
"type-text": {
"properties": {
"doc_type": { "const": "text" }
}
},
"post-content-required": {
"properties": {
"post": { "required": ["content"] }
}
},
"video-requires-post-video-path": {
"anyOf": [
{ "not": { "$ref": "#/definitions/type-video" } },
{ "$ref": "#/definitions/post-video-path-required" }
]
},
"type-video": {
"properties": {
"doc_type": { "const": "video" }
}
},
"post-video-path-required": {
"properties": {
"post": { "required": ["video_path"] }
}
}
}
}
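To see the implication pattern in action from Python, something like the following could be used with the jsonschema package. The schema file name and the sample payloads are assumptions for illustration, not part of the answer:

import json
from jsonschema import Draft7Validator

with open("message_schema.json") as f:   # file name is a placeholder
    schema = json.load(f)
validator = Draft7Validator(schema)

good = {"post": {"image": "cat.png"}, "doc_type": "image"}
bad = {"post": {"content": "hello"}, "doc_type": "image"}   # doc_type says image, but post has no image

for payload in (good, bad):
    errors = list(validator.iter_errors(payload))
    print("valid" if not errors else f"invalid: {errors[0].message}")

The bad payload fails because the "image-requires-post-image" definition matches doc_type "image" but post does not contain the required "image" key.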

Edit Json schema with python

I have the main Json-schema. It looks like that:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"type": "object",
"properties": {
"listInfo": {
"type": "object",
"properties": {
"limit": {
"type": "integer"
},
"count": {
"type": "integer"
}
},
"required": [
"offset",
"count"
]
},
"items": {
"type": ["array", "null"],
"items": {
"type": "object",
"properties": {
"startDate": {
"type": "string"
},
"customer": {
"type": "object",
"properties": {
"customerId": {
"type": "integer"
},
"name": {
"type": "string"
}
},
"required": [
"customerId",
"name"
]
}
},
"required": [
"startDate",
"customer"
]
}
}
},
"required": [
"listInfo",
"items"
]
}
Every time after sending a GET query to a host, I validate the JSON against my schema.
But sometimes I don't need all of the fields in it. For example, I can add "&fields=startDate" at the end of my GET query.
How can I generate a new JSON schema for my new data? (I need to automatically delete the lines about "customer" from my old JSON schema and generate a new json-schema file.)
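The question is unanswered here, but a minimal sketch of one way to do it: load the schema as a Python dict, drop the unwanted properties from the item schema, and write the result out as a new schema file. The file names and helper name are assumptions:

import json

def prune_schema(schema_path, out_path, drop=("customer",)):
    """Write a copy of the schema with the named item properties removed."""
    with open(schema_path) as f:
        schema = json.load(f)

    item_schema = schema["properties"]["items"]["items"]   # schema of one element of "items"
    for name in drop:
        item_schema["properties"].pop(name, None)
        if "required" in item_schema:
            item_schema["required"] = [r for r in item_schema["required"] if r != name]

    with open(out_path, "w") as f:
        json.dump(schema, f, indent=2)

# File names are placeholders:
prune_schema("main_schema.json", "startDate_only_schema.json")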

Failed validating 'type' json schema

I've written a small chunk of JSON schema, but I'm getting a validation error using the Python jsonschema module.
Here is my schema:
{
"$schema": "http://json-schema.org/draft-04/schema#",
"definitions": {
"output": {
"type": "object",
"properties": {
"Type": {
"type": "object",
"properties": {
"Type": {
"type": "string"
},
"Value": {
"type": "string"
},
"Default": {
"type": "string"
},
"Description": {
"type": "string"
},
"Options": {
"type": "array"
}
},
"required": [
"Type",
"Value",
"Default",
"Description",
"Options"
]
},
"Inverted": {
"type": "object",
"properties": {
"Type": {
"type": "string"
},
"Value": {
"type": "bool"
},
"Default": {
"type": "bool"
},
"Description": {
"type": "string"
}
},
"required": [
"Type",
"Value",
"Default",
"Description"
]
},
"Pulse Width": {
"type": "object",
"properties": {
"Type": {
"type": "string"
},
"Value": {
"type": "number"
},
"Default": {
"type": "number"
},
"Description": {
"type": "string"
}
},
"required": [
"Type",
"Value",
"Default",
"Description"
]
}
},
"required": [
"Type",
"Inverted",
"Pulse Width"
]
}
}
}
Here is the error I'm receiving:
Failed validating u'type' in schema
I'm attempting to validate my schema with:
schema = ""
with open(jsonSchemaFilePath, 'r') as schema_file:
schema = schema_file.read()
try:
Draft4Validator.check_schema(schema)
except SchemaError as schemaError:
print schemaError
What am I doing wrong with the schema I've written? Am I not allowed to have a property named Type?
My problem was that Draft4Validator.check_schema takes a dict, not a string or a JSON document.
Here was my solution:
import json
from jsonschema import Draft4Validator
from jsonschema.exceptions import SchemaError

schema = {}
with open(jsonSchemaFilePath, 'r') as schema_file:
    schema = json.loads(schema_file.read())

try:
    Draft4Validator.check_schema(schema)
except SchemaError as schemaError:
    print schemaError
