I would like to modify the value of a field at a specific index of a nested array, depending on another value in the same nested object or on a field outside the nested object.
As an example, here is the current mapping of my index feed:
{
"feed": {
"mappings": {
"properties": {
"attacks_ids": {
"type": "keyword"
},
"created_by": {
"type": "keyword"
},
"date": {
"type": "date"
},
"groups_related": {
"type": "keyword"
},
"indicators": {
"type": "nested",
"properties": {
"date": {
"type": "date"
},
"description": {
"type": "text"
},
"role": {
"type": "keyword"
},
"type": {
"type": "keyword"
},
"value": {
"type": "keyword"
}
}
},
"malware_families": {
"type": "keyword"
},
"published": {
"type": "boolean"
},
"references": {
"type": "keyword"
},
"tags": {
"type": "keyword"
},
"targeted_countries": {
"type": "keyword"
},
"title": {
"type": "text"
},
"tlp": {
"type": "keyword"
}
}
}
}
}
Take the following document as example:
{
"took": 194,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 1,
"relation": "eq"
},
"max_score": 1,
"hits": [
{
"_index": "feed",
"_type": "_doc",
"_id": "W3CS7IABovFpcGfZjfyu",
"_score": 1,
"_source": {
"title": "Test",
"date": "2022-05-22T16:21:09.159711",
"created_by": "finch",
"tlp": "white",
"published": true,
"references": [
"test",
"test"
],
"tags": [
"tag1",
"tag2"
],
"targeted_countries": [
"Italy",
"Germany"
],
"malware_families": [
"family1",
"family2"
],
"groups_related": [
"group1",
"griup2"
],
"attacks_ids": [
""
],
"indicators": [
{
"value": "testest",
"description": "This is a test",
"type": "sha256",
"role": "file",
"date": "2022-05-22T16:21:09.159560"
},
{
"value": "testest2",
"description": "This is a test 2",
"type": "ipv4",
"role": "c2",
"date": "2022-05-22T16:21:09.159699"
}
]
}
}
]
}
}
I would like to make this update: indicators[0].value = 'changed'
if _id == 'W3CS7IABovFpcGfZjfyu'
or if title == 'some_title'
or if indicators[0].role == 'c2'
I already tried with a script, but I can't get it to work. I hope the explanation is clear; please ask if anything is not. Thank you.
Edit 1:
I managed to make it work, but it needs the _id. I'm still looking for a way to do it without it.
My partial solution:
update = Pulse.get(id="XHCz7IABovFpcGfZWfz9") #Pulse is my document
update.update(script="for (indicator in ctx._source.indicators) {if (indicator.value=='changed2') {indicator.value='changed3'}}")
# Modify depending on the value of a field inside the same nested object
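One way to drop the dependency on the _id is the Update By Query API (POST /feed/_update_by_query), which runs a Painless script on every document matched by a query. The following is only a sketch of the request body, built as a plain Python dict; the helper name and its parameters are hypothetical, and it has not been run against a cluster. It reuses the loop from the partial solution above and passes the values in as script params.

```python
def build_update_by_query_body(old_value, new_value):
    """Build a request body for POST /feed/_update_by_query.

    Hypothetical helper: it selects documents that contain at least one
    nested indicator whose value equals old_value, then rewrites that
    value in every matching indicator.
    """
    return {
        "query": {
            "nested": {
                "path": "indicators",
                "query": {"term": {"indicators.value": old_value}},
            }
        },
        "script": {
            "lang": "painless",
            # Same loop as in the partial solution, with params instead of literals.
            "source": (
                "for (indicator in ctx._source.indicators) {"
                " if (indicator.value == params.old_value) {"
                " indicator.value = params.new_value } }"
            ),
            "params": {"old_value": old_value, "new_value": new_value},
        },
    }


body = build_update_by_query_body("changed2", "changed3")
```

To condition on a field outside the nested object instead, the query part could be swapped for something like a match on title, while the script stays the same.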
I'm using Kafka (kafka_2.11-0.11.0.2) and Confluent 3.3.0 for the Schema Registry.
I have defined an Avro schema as follows:
{
"namespace": "com.myntra.search",
"type": "record",
"name": "SearchDataIngestionObject",
"fields": [
{"name": "timestamp","type":"long"},
{"name": "brandList", "type":{ "type" : "array", "items" : "string" }},
{"name": "articleTypeList", "type":{ "type" : "array", "items" : "string" }},
{"name": "gender", "type":{ "type" : "array", "items" : "string" }},
{"name": "masterCategoryList", "type":{ "type" : "array", "items" : "string" }},
{"name": "subCategoryList", "type":{ "type" : "array", "items" : "string" }},
{"name": "quAlgo","type":{ "type" : "array", "items" : "string" }},
{"name": "colours", "type":{ "type" : "array", "items" : "string" }},
{"name": "isLandingPage", "type": "boolean"},
{"name": "isUserQuery", "type": "boolean"},
{"name": "isAutoSuggest", "type": "boolean"},
{"name": "userQuery", "type": "string"},
{"name": "correctedQuery", "type": "string"},
{"name": "completeSolrQuery", "type": "string"},
{"name": "atsaList", "type":{"type": "map", "values":{ "type" : "array", "items" : "string" }}},
{"name": "quMeta", "type": {"type": "map", "values": "string"}},
{"name": "requestId", "type": "string"}
]
}
And I'm trying to write some data to kafka as follows:
value = {
"timestamp": 1597399323000,
"brandList": ["brand_value"],
"articleTypeList": ["articleType_value"],
"gender": ["gender_value"],
"masterCategoryList": ["masterCategory_value"],
"subCategoryList": ["subCategory_value"],
"quAlgo": ["quAlgo_value"],
"colours": ["colours_value"],
"isLandingPage": False,
"isUserQuery": False,
"isAutoSuggest": False,
"userQuery": "userQuery_value",
"correctedQuery": "correctedQuery_value",
"completeSolrQuery": "completeSolrQuery_value",
"atsaList": {
"atsa_key1": ["atsa_value1"],
"atsa_key2": ["atsa_value2"],
"atsa_key3": ["atsa_value3"]
},
"quMeta": {
"quMeta_key1": "quMeta_value1",
"quMeta_key2": "quMeta_value2",
"quMeta_key3": "quMeta_value3"
},
"requestId": "requestId_value"
}
topic = "search"
key = str(uuid.uuid4())
producer.produce(topic=topic, key=key, value=value)
producer.flush()
But I'm getting the following error:
Traceback (most recent call last):
File "producer.py", line 61, in <module>
producer.produce(topic=topic, key=key, value=value)
File "/Library/Python/2.7/site-packages/confluent_kafka/avro/__init__.py", line 99, in produce
value = self._serializer.encode_record_with_schema(topic, value_schema, value)
File "/Library/Python/2.7/site-packages/confluent_kafka/avro/serializer/message_serializer.py", line 118, in encode_record_with_schema
return self.encode_record_with_schema_id(schema_id, record, is_key=is_key)
File "/Library/Python/2.7/site-packages/confluent_kafka/avro/serializer/message_serializer.py", line 152, in encode_record_with_schema_id
writer(record, outf)
File "/Library/Python/2.7/site-packages/confluent_kafka/avro/serializer/message_serializer.py", line 86, in <lambda>
return lambda record, fp: writer.write(record, avro.io.BinaryEncoder(fp))
File "/Library/Python/2.7/site-packages/avro/io.py", line 979, in write
raise AvroTypeException(self.writers_schema, datum)
avro.io.AvroTypeException: The datum {'quAlgo': ['quAlgo_value'], 'userQuery': 'userQuery_value', 'isAutoSuggest': False, 'isLandingPage': False, 'timestamp': 1597399323000, 'articleTypeList': ['articleType_value'], 'colours': ['colours_value'], 'correctedQuery': 'correctedQuery_value', 'quMeta': {'quMeta_key1': 'quMeta_value1', 'quMeta_key2': 'quMeta_value2', 'quMeta_key3': 'quMeta_value3'}, 'requestId': 'requestId_value', 'gender': ['gender_value'], 'isUserQuery': False, 'brandList': ['brand_value'], 'masterCategoryList': ['masterCategory_value'], 'subCategoryList': ['subCategory_value'], 'completeSolrQuery': 'completeSolrQuery_value', 'atsaList': {'atsa_key1': ['atsa_value1'], 'atsa_key2': ['atsa_value2'], 'atsa_key3': ['atsa_value3']}} is not an example of the schema {
"namespace": "com.myntra.search",
"type": "record",
"name": "SearchDataIngestionObject",
"fields": [
{
"type": "long",
"name": "timestamp"
},
{
"type": {
"items": "string",
"type": "array"
},
"name": "brandList"
},
{
"type": {
"items": "string",
"type": "array"
},
"name": "articleTypeList"
},
{
"type": {
"items": "string",
"type": "array"
},
"name": "gender"
},
{
"type": {
"items": "string",
"type": "array"
},
"name": "masterCategoryList"
},
{
"type": {
"items": "string",
"type": "array"
},
"name": "subCategoryList"
},
{
"type": {
"items": "string",
"type": "array"
},
"name": "quAlgo"
},
{
"type": {
"items": "string",
"type": "array"
},
"name": "colours"
},
{
"type": "boolean",
"name": "isLandingPage"
},
{
"type": "boolean",
"name": "isUserQuery"
},
{
"type": "boolean",
"name": "isAutoSuggest"
},
{
"type": "string",
"name": "userQuery"
},
{
"type": "string",
"name": "correctedQuery"
},
{
"type": "string",
"name": "completeSolrQuery"
},
{
"type": {
"values": {
"items": "string",
"type": "array"
},
"type": "map"
},
"name": "atsaList"
},
{
"type": {
"values": "string",
"type": "map"
},
"name": "quMeta"
},
{
"type": "string",
"name": "requestId"
}
]
}
I even tried the same example as given here, but it doesn't work and throws the same error.
In your exception, the error says that the data you are providing is the following:
{'userQuery': 'userQuery_value',
'isAutoSuggest': False,
'isLandingPage': False,
'correctedQuery': 'correctedQuery_value',
'isUserQuery': False,
'timestamp': 1597399323000,
'completeSolrQuery': 'completeSolrQuery_value',
'requestId': 'requestId_value'}
This is much less than what you claim you are providing in your example.
Can you go back to your original code and, on line 60, just before producer.produce(topic=topic, key=key, value=value), add a simple print(value) to make sure you are sending the right value and that it hasn't been overwritten by some other line of code?
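Alongside the print, a quick key comparison can show which schema fields are absent from the dict. This is a minimal stdlib-only sketch; missing_fields is a hypothetical helper, it only checks that each top-level field name is present (not that the types match), and the schema string below is a shortened stand-in for the one in the question.

```python
import json


def missing_fields(schema_json, record):
    """Return the names of top-level schema fields that are absent from record."""
    schema = json.loads(schema_json)
    expected = [field["name"] for field in schema["fields"]]
    return [name for name in expected if name not in record]


# Shortened stand-in schema, for illustration only.
schema_json = """
{
  "type": "record",
  "name": "SearchDataIngestionObject",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "brandList", "type": {"type": "array", "items": "string"}},
    {"name": "requestId", "type": "string"}
  ]
}
"""

value = {"timestamp": 1597399323000, "requestId": "requestId_value"}
print(missing_fields(schema_json, value))
```

An empty list means every field name is present, in which case the mismatch is more likely in a value's type than in a missing key.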
I have a JSON schema against which I want to validate some data, using Python and the jsonschema module. However, this doesn't quite work as expected: some of the accepted data doesn't appear valid at all (to me and for the purpose of my application). Sadly, the schema is provided, so I can't change the schema itself, at least not manually.
This is a shortened version of the schema ('schema.json' in code below):
{
"type": "object",
"allOf": [
{
"type": "object",
"allOf": [
{
"type": "object",
"properties": {
"firstName": {
"type": "string"
},
"lastName": {
"type": "string"
}
}
},
{
"type": "object",
"properties": {
"language": {
"type": "integer"
}
}
}
]
},
{
"type": "object",
"properties": {
"addressArray": {
"type": "array",
"items": {
"type": "object",
"properties": {
"streetNumber": {
"type": "string"
},
"street": {
"type": "string"
},
"city": {
"type": "string"
}
}
}
}
}
}
]
}
This is an example of what should be a valid instance ('person.json' in code below):
{
"firstName": "Sherlock",
"lastName": "Holmes",
"language": 1,
"addresses": [
{
"streetNumber": "221B",
"street": "Baker Street",
"city": "London"
}
]
}
This is an example of what should be considered invalid ('no_person.json' in code below):
{
"name": "eggs",
"colour": "white"
}
And this is the code I used for validating:
from json import load
from jsonschema import Draft7Validator, exceptions
with open('schema.json') as f:
    schema = load(f)
with open('person.json') as f:
    person = load(f)
with open('no_person.json') as f:
    no_person = load(f)
validator = Draft7Validator(schema)
try:
    validator.validate(person)
    print("person.json is valid")
except exceptions.ValidationError:
    print("person.json is invalid")
try:
    validator.validate(no_person)
    print("no_person.json is valid")
except exceptions.ValidationError:
    print("no_person.json is invalid")
Result:
person.json is valid
no_person.json is valid
I expected no_person.json to be invalid. What can be done so that only data such as person.json validates successfully? Thank you very much for your help; I'm very new to this (and spent ages searching for an answer).
This schema works. Pay attention to "required": without that keyword, a property that is missing from the instance is simply skipped during validation, so the instance still passes:
{
"type": "object",
"properties": {
"firstName": {
"type": "string"
},
"lastName": {
"type": "string"
},
"language": {
"type": "integer"
},
"addresses": {
"type": "array",
"items": {
"type": "object",
"properties": {
"streetNumber": {
"type": "string"
},
"street": {
"type": "string"
},
"city": {
"type": "string"
}
},
"required": [
"streetNumber",
"street",
"city"
]
}
}
},
"required": [
"firstName",
"lastName",
"language",
"addresses"
]
}
I've got:
person.json is valid
no_person.json is invalid
If you have a more complex response structure (arrays of objects that contain objects, etc.), let me know.
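The question mentions that the provided schema cannot be edited by hand. One possible workaround is to add "required" programmatically before building the validator. The sketch below assumes that every declared property should be mandatory, which may not match the intent of the real schema; require_all_properties is a hypothetical helper and uses only the standard library.

```python
import copy


def require_all_properties(schema):
    """Return a copy of schema in which every subschema that declares
    "properties" also lists all of those properties under "required".

    Assumption: all declared properties are meant to be mandatory.
    """
    def walk(node):
        if isinstance(node, dict):
            if isinstance(node.get("properties"), dict):
                node["required"] = sorted(node["properties"])
            for child in node.values():
                walk(child)
        elif isinstance(node, list):
            for child in node:
                walk(child)

    result = copy.deepcopy(schema)  # leave the provided schema untouched
    walk(result)
    return result


provided = {
    "type": "object",
    "allOf": [
        {"type": "object",
         "properties": {"firstName": {"type": "string"},
                        "lastName": {"type": "string"}}},
    ],
}
strict = require_all_properties(provided)
```

The result can be passed to Draft7Validator in place of the original schema. Note that the shortened schema in the question names the array property addressArray while person.json uses addresses, so the two would have to agree for person.json to validate under this stricter schema.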
I have a large JSON file, about 5 million records and a file size of about 32GB, that I need to get loaded into our Snowflake Data Warehouse. I need to get this file broken up into chunks of about 200k records (about 1.25GB) per file. I'd like to do this in either Node.JS or Python for deployment to an AWS Lambda function, unfortunately I haven't coded in either, yet. I have C# and a lot of SQL experience, and learning both node and python are on my to do list, so why not dive right in, right!?
My first question is "Which language would better serve this function? Python, or Node.JS?"
I know I don't want to read this entire JSON file into memory (or even the output smaller file). I need to be able to "stream" it in and out into the new file based on a record count (200k), properly close up the json objects, and continue into a new file for another 200k, and so on. I know Node can do this, but if Python can also do this, I feel like it would be easier to quickly start using for other ETL stuff I'll be doing soon.
My second question is "Based on your recommendation above, can you also recommend which modules I should require/import to help me get started, primarily as it relates to not pulling the entire JSON file into memory? Maybe some tips, tricks, or 'How would you do it's? And if you're feeling really generous, some code example to help push me into the deep end on this?"
I can't include a sample of the JSON data, as it contains personal information. But I can provide the JSON schema ...
{
"$schema": "http://json-schema.org/draft-04/schema#",
"items": {
"properties": {
"activities": {
"properties": {
"activity_id": {
"items": {
"type": "integer"
},
"type": "array"
},
"frontlineorg_id": {
"items": {
"type": "integer"
},
"type": "array"
},
"import_id": {
"items": {
"type": "integer"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"is_source": {
"items": {
"type": "boolean"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"address": {
"properties": {
"city": {
"items": {
"type": "string"
},
"type": "array"
},
"congress_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"congress_dist_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"congress_end_yr": {
"items": {
"type": "integer"
},
"type": "array"
},
"congress_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"congress_start_yr": {
"items": {
"type": "integer"
},
"type": "array"
},
"county": {
"items": {
"type": "string"
},
"type": "array"
},
"formatted": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"latitude": {
"items": {
"type": "number"
},
"type": "array"
},
"longitude": {
"items": {
"type": "number"
},
"type": "array"
},
"number": {
"items": {
"type": "string"
},
"type": "array"
},
"observes_dst": {
"items": {
"type": "boolean"
},
"type": "array"
},
"post_directional": {
"items": {
"type": "null"
},
"type": "array"
},
"pre_directional": {
"items": {
"type": "null"
},
"type": "array"
},
"school_district": {
"items": {
"properties": {
"school_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"school_dist_type": {
"items": {
"type": "string"
},
"type": "array"
},
"school_grade_high": {
"items": {
"type": "string"
},
"type": "array"
},
"school_grade_low": {
"items": {
"type": "string"
},
"type": "array"
},
"school_lea_code": {
"items": {
"type": "integer"
},
"type": "array"
}
},
"type": "object"
},
"type": "array"
},
"secondary_number": {
"items": {
"type": "null"
},
"type": "array"
},
"secondary_unit": {
"items": {
"type": "null"
},
"type": "array"
},
"state": {
"items": {
"type": "string"
},
"type": "array"
},
"state_house_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"state_house_dist_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"state_senate_dist_name": {
"items": {
"type": "string"
},
"type": "array"
},
"state_senate_dist_number": {
"items": {
"type": "integer"
},
"type": "array"
},
"street": {
"items": {
"type": "string"
},
"type": "array"
},
"suffix": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"timezone": {
"items": {
"type": "string"
},
"type": "array"
},
"utc_offset": {
"items": {
"type": "integer"
},
"type": "array"
},
"zip": {
"items": {
"type": "integer"
},
"type": "array"
}
},
"type": "object"
},
"age": {
"type": "integer"
},
"anniversary": {
"properties": {
"date": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"baptism": {
"properties": {
"church_id": {
"type": "null"
},
"date": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"birth_dd": {
"type": "integer"
},
"birth_mm": {
"type": "integer"
},
"birth_yyyy": {
"type": "integer"
},
"church_attendance": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"cohabiting": {
"properties": {
"confidence": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood": {
"items": {
"type": "null"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"dating": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"divorced": {
"properties": {
"bool": {
"items": {
"type": "null"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood_considering": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"education": {
"properties": {
"est_level": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"email": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"is_work_school": {
"items": {
"type": "boolean"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"engaged": {
"properties": {
"insert_datetime_utc": {
"type": "null"
},
"likelihood": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"est_income": {
"properties": {
"est_level": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"ethnicity": {
"type": "string"
},
"first_name": {
"type": "string"
},
"formatted_birthdate": {
"type": "string"
},
"gender": {
"type": "string"
},
"head_of_household": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"home_church": {
"properties": {
"church_id": {
"type": "null"
},
"group_participant": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"is_coaching": {
"type": "null"
},
"is_giving": {
"type": "null"
},
"is_serving": {
"type": "null"
},
"membership_date": {
"type": "null"
},
"regular_attendee": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"hub_poid": {
"type": "integer"
},
"insert_datetime_utc": {
"type": "string"
},
"ip_address": {
"properties": {
"insert_datetime_utc": {
"type": "null"
},
"string": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"last_name": {
"type": "string"
},
"marriage_segment": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"married": {
"properties": {
"bool": {
"items": {
"type": "boolean"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"middle_name": {
"type": "string"
},
"miscellaneous": {
"properties": {
"attribute": {
"items": {
"type": "string"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"value": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"name_suffix": {
"type": "null"
},
"name_title": {
"type": "null"
},
"newlywed": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"parent": {
"properties": {
"bool": {
"items": {
"type": "boolean"
},
"type": "array"
},
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"likelihood_expecting": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"person_id": {
"type": "integer"
},
"phone": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"number": {
"items": {
"type": "integer"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"type": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"property_rights": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"psychographic_cluster": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"religion": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"religious_segment": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
},
"separated": {
"properties": {
"bool": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"significant_other": {
"properties": {
"first_name": {
"type": "null"
},
"insert_datetime_utc": {
"type": "null"
},
"last_name": {
"type": "null"
},
"middle_name": {
"type": "null"
},
"name_suffix": {
"type": "null"
},
"name_title": {
"type": "null"
},
"suppressed_datetime_utc": {
"type": "null"
}
},
"type": "object"
},
"suppressed_datetime_utc": {
"type": "string"
},
"target_group": {
"properties": {
"insert_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
},
"string": {
"items": {
"type": "string"
},
"type": "array"
},
"suppressed_datetime_utc": {
"items": {
"type": "string"
},
"type": "array"
}
},
"type": "object"
}
},
"type": "object"
},
"type": "array"
}
Use these commands at a Linux command prompt (split -b cuts the file into pieces by byte count; cat joins the pieces back together):
split -b 53750k <your-file>
cat xa* > <your-file>
Refer to this link:
https://askubuntu.com/questions/28847/text-editor-to-edit-large-4-3-gb-plain-text-file
Answering the question whether Python or Node will be better for the task would be an opinion and we are not allowed to voice our opinions on Stack Overflow. You have to decide yourself what you have more experience in and what you want to work with - Python or Node.
If you go with Node, there are some modules that can help you with that task, that do streaming JSON parsing. E.g. those modules:
https://www.npmjs.com/package/JSONStream
https://www.npmjs.com/package/stream-json
https://www.npmjs.com/package/json-stream
If you go with Python, there are streaming JSON parsers here as well:
https://github.com/kashifrazzaqui/json-streamer
https://github.com/danielyule/naya
http://www.enricozini.org/blog/2011/tips/python-stream-json/
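For the Python route, the splitting can also be sketched with the standard library alone. The code below is a minimal sketch rather than a tuned implementation: it assumes the input is a single top-level JSON array of objects (as the schema in the question suggests), reads it in small text chunks, and uses json.JSONDecoder.raw_decode to pull out one record at a time, so only the read buffer and the current record are held in memory. The function names and the output file naming are hypothetical.

```python
import json


def iter_array_items(fp, buf_size=65536):
    """Yield the elements of a top-level JSON array, reading fp in chunks.

    Assumes the elements are objects; a bare number split across two
    chunks could be decoded too early.
    """
    decoder = json.JSONDecoder()
    buf = ""
    while not buf.strip():               # find the opening bracket
        chunk = fp.read(buf_size)
        if not chunk:
            return                       # empty input
        buf += chunk
    buf = buf.lstrip()
    if buf[0] != "[":
        raise ValueError("expected a top-level JSON array")
    buf = buf[1:]
    while True:
        buf = buf.lstrip(" \t\r\n,")     # skip separators between elements
        if buf.startswith("]"):
            return
        try:
            item, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            chunk = fp.read(buf_size)    # element incomplete: read more
            if not chunk:
                raise                    # input ended inside the array
            buf += chunk
            continue
        yield item
        buf = buf[end:]


def split_json_array(src_path, out_prefix, records_per_file=200_000):
    """Write the records of src_path into numbered files, each a valid JSON array."""
    paths, out, count = [], None, 0
    with open(src_path, "r", encoding="utf-8") as src:
        for item in iter_array_items(src):
            if out is None:
                path = f"{out_prefix}{len(paths):04d}.json"
                out = open(path, "w", encoding="utf-8")
                paths.append(path)
                out.write("[")
            else:
                out.write(",\n")
            out.write(json.dumps(item))
            count += 1
            if count == records_per_file:
                out.write("]")           # close this array and start a new file
                out.close()
                out, count = None, 0
    if out is not None:
        out.write("]")
        out.close()
    return paths
```

A call such as split_json_array("big.json", "part_") would produce part_0000.json, part_0001.json, and so on, each holding up to 200,000 records. The streaming parsers linked above are likely faster on very large inputs; this version only shows the idea without extra dependencies.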
Consider using jq to preprocess your JSON files.
It can split and stream large JSON files.
jq is like sed for JSON data - you can use it to slice
and filter and map and transform structured data with
the same ease that sed, awk, grep and friends let you play with text.
See the official documentation and this question for more.
Extra: regarding your first question, jq is written in C, so it should be faster than Python/Node, shouldn't it?
Snowflake treats JSON in a particular way, and understanding that makes the design straightforward.
JSON, Parquet, Avro, and XML are considered semi-structured data.
They are stored in the VARIANT data type in Snowflake.
When loading the JSON data from the stage location, set strip_outer_array = true:
copy into <table>
from @~/<file>.json
file_format = (type = 'JSON' strip_outer_array = true);
Each row cannot exceed 16 MB compressed when loaded into Snowflake.
Snowflake data loading works well when the input is split into files of 10-100 MB each.
Use a utility that splits the file on line boundaries and keeps each file at no more than 100 MB; that brings the benefit of parallelism as well as accuracy for your data.
Given your data set size (about 32 GB), you will get around 320 files of 100 MB each.
These files will not all be loaded in parallel at once.
So choose an X-Large warehouse (16 v-core & 32 threads):
320/32 = 10 rounds
Depending on your network bandwidth, this should not take more than a few minutes. Even if we assume 3 sec per round, the load itself takes about 30 seconds.
Look at the warehouse configuration and throughput details, and refer to the semi-structured data loading best practices.
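The advice above about splitting on line boundaries into files of at most about 100 MB can be sketched in Python as follows. This is only an illustration under stated assumptions: write_ndjson_parts is a hypothetical helper, it takes any iterable of already-parsed records, writes them as newline-delimited JSON (one record per line), and applies the size limit to the uncompressed bytes before gzip, so the files on disk will usually be smaller.

```python
import gzip
import json


def write_ndjson_parts(records, out_prefix, max_bytes=100 * 1024 * 1024):
    """Write records as newline-delimited JSON into gzip files.

    A new file is started before the uncompressed size of the current
    one would exceed max_bytes. A single record larger than max_bytes
    still gets a file of its own.
    """
    paths = []
    out = None
    written = 0
    for record in records:
        line = (json.dumps(record) + "\n").encode("utf-8")
        if out is None or (written and written + len(line) > max_bytes):
            if out is not None:
                out.close()
            path = f"{out_prefix}{len(paths):04d}.json.gz"
            out = gzip.open(path, "wb")
            paths.append(path)
            written = 0
        out.write(line)
        written += len(line)
    if out is not None:
        out.close()
    return paths
```

Because each line is a complete record, files written this way never cut a record in half, which is the accuracy concern mentioned above.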
The easiest approach that worked for me was this:
json_file = <your_file>
chunks = 200
for i in range(0, len(json_file), chunks):
    print(json_file[i:i+chunks])
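To split and compress at the same time with bash (GNU split), resulting in files of ~100MB each. Note that -C 1000000000 caps each chunk at about 1 GB of whole lines before compression, so the ~100MB figure presumably refers to the gzipped output: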
cat bigfile.json | split -C 1000000000 -d -a4 - output_prefix --filter='gzip > $FILE.gz'
See more: https://stackoverflow.com/a/68718176/132438
You can use Python 3 with the following script:
import json

def split_json(file_path, parts=3):
    # Note: json.load reads the whole file into memory.
    with open(file_path, 'r') as json_file:
        data = json.load(json_file)
    # Round the chunk size up so that trailing records are not dropped.
    chunk_size = -(-len(data) // parts)
    for i in range(parts):
        with open(f"part{i}.json", 'w') as outfile:
            outfile.write(json.dumps(data[i*chunk_size:(i+1)*chunk_size]))

file_path = input("Enter the file path of the JSON file: ")
split_json(file_path)