Dynamic (Changing) JSON to Hive Schema using UDF - python

I am having a JSON file with below structure:
{
"A": {
"AId": {
"AId": "123",
"idType": "XYZ"
},
"fN": "RfN",
"oN": "ON",
"mail": [
"abc#kml.com",
"xyz#kml.com"
],
"ph": [
{
"nu": "999-999-9999",
"t": "Of",
"ext": "1234"
},
{
"nu": "999-999-9999",
"t": "Of",
"ext": "1234"
}
],
"add": {
"addLines": [
"Addr Line 1",
"Addr Line 2"
],
"c": "C",
"sC": "S"
},
"c": [
{
"cT": "CT",
"cN": "9999"
}
],
"serId": "XXX"
},
"int": {
"endTS": null,
"cId": {
"cId": "null",
"cC": "null"
},
"cmpgn": null,
"sTC": null,
"cCID": {
"tIC": "null",
"tC": "null",
"cC": []
},
"int": "Un",
"rep": [],
"pp": "null",
"cf": {
"a": 1234,
"b": 1234
},
"iA": {
"sId": {
"s": "null",
"sId": "null"
},
"cId": "null",
"lId": "null"
},
"sRequest": null,
"vBu": "VBU",
"fId": "FId",
"k": [
"k"
],
"eng": [
{
"EC": "E_CODE::12345",
"cT": "2011-01-28T23:12:12.666Z",
"up": null,
"rep": {
"rep": {
"type": "B",
"id": "ID"
},
"fullName": "FullName"
}
}
]
}
}
Few Points:
From the above structure need to create hive schema.
JSON structure can change dynamically. For each change in JSON structure. Need to regenerate hive schema.
I tried, using JSON library of Python; but not of much use. I was not able to obtain tag names, which can be used as field names of hive schema.
Want to auto mate the process of generating JSON to Hive schema.
Exploring Python JSON Encoder, Decoder class; to parse the JSON and put own logic to create Hive schema out of it. But there is no good example available to use JSON Encoder, Decoder class.
Finally, want to put everything in form of Python UDF. I good with any Java UDF alternative too.
Note: The above JSON can be structured using http://jsonlint.com/

Related

how do I access this json data in python?

hi I'm pretty new at coding and I was trying to create a program in python that reads and save in another file the data inside a json file (not everything, just what I want). I googled how to parse data but there's something I don't understand.
that's a part of the json file:
`
{
"profileRevision": 548789,
"profileId": "campaign",
"profileChangesBaseRevision": 548789,
"profileChanges": [
{
"changeType": "fullProfileUpdate",
"profile": {
"_id": "2da4f079f8984cc48e84fc99dace495d",
"created": "2018-03-29T11:02:15.190Z",
"updated": "2022-10-31T17:34:43.284Z",
"rvn": 548789,
"wipeNumber": 9,
"accountId": "63881e614ef543b2932c70fed1196f34",
"profileId": "campaign",
"version": "refund_teddy_perks_september_2022",
"items": {
"8ec8f13f-6bf6-4933-a7db-43767a055e66": {
"templateId": "Quest:heroquest_loadout_constructor_2",
"attributes": {
"quest_state": "Claimed",
"creation_time": "min",
"last_state_change_time": "2019-05-18T16:09:12.750Z",
"completion_complete_pve03_diff26_loadout_constructor": 300,
"level": -1,
"item_seen": true,
"sent_new_notification": true,
"quest_rarity": "uncommon",
"xp_reward_scalar": 1
},
"quantity": 1
},
"6940c71b-c74b-4581-9f1e-c0a87e246884": {
"templateId": "Worker:workerbasic_sr_t01",
"attributes": {
"gender": "2",
"personality": "Homebase.Worker.Personality.IsDreamer",
"level": 1,
"item_seen": true,
"squad_slot_idx": -1,
"portrait": "WorkerPortrait:IconDef-WorkerPortrait-Dreamer-F02",
"building_slot_used": -1,
"set_bonus": "Homebase.Worker.SetBonus.IsMeleeDamageLow"
}
}
}
]
}
`
I can access profileChanges. I wrote this to create another json file with only the profileChanges things:
`
myjsonfile= open("file.json",'r')
jsondata=myjsonfile.read()
obj=json.loads(jsondata)
ciso=obj['profileChanges']
for i in ciso:
print(i)
with open("file2", "w") as outfile:
json.dump( ciso, outfile, indent=1)
the issue I have is that I can't access "profile" (inside profileChanges) in the same way by parsing the new file and I have no idea on how to do it
Access to JSON or dict element is realized by list indexes, please look at below example:
a = [
{
"friends": [
{
"id": 0,
"name": "Reba May"
}
],
"greeting": "Hello, Doris Gallagher! You have 2 unread messages.",
"favoriteFruit": "strawberry"
},
]
b = a['friends']['id] # b = 0
I've added a couple of closing braces to make your snippet valid json:
s = '''{
"profileRevision": 548789,
"profileId": "campaign",
"profileChangesBaseRevision": 548789,
"profileChanges": [
{
"changeType": "fullProfileUpdate",
"profile": {
"_id": "2da4f079f8984cc48e84fc99dace495d",
"created": "2018-03-29T11:02:15.190Z",
"updated": "2022-10-31T17:34:43.284Z",
"rvn": 548789,
"wipeNumber": 9,
"accountId": "63881e614ef543b2932c70fed1196f34",
"profileId": "campaign",
"version": "refund_teddy_perks_september_2022",
"items": {
"8ec8f13f-6bf6-4933-a7db-43767a055e66": {
"templateId": "Quest:heroquest_loadout_constructor_2",
"attributes": {
"quest_state": "Claimed",
"creation_time": "min",
"last_state_change_time": "2019-05-18T16:09:12.750Z",
"completion_complete_pve03_diff26_loadout_constructor": 300,
"level": -1,
"item_seen": true,
"sent_new_notification": true,
"quest_rarity": "uncommon",
"xp_reward_scalar": 1
},
"quantity": 1
},
"6940c71b-c74b-4581-9f1e-c0a87e246884": {
"templateId": "Worker:workerbasic_sr_t01",
"attributes": {
"gender": "2",
"personality": "Homebase.Worker.Personality.IsDreamer",
"level": 1,
"item_seen": true,
"squad_slot_idx": -1,
"portrait": "WorkerPortrait:IconDef-WorkerPortrait-Dreamer-F02",
"building_slot_used": -1,
"set_bonus": "Homebase.Worker.SetBonus.IsMeleeDamageLow"
}
}
}
}
}
]
}
'''
d = json.loads(s)
print(d['profileChanges'][0]['profile']['version'])
This prints refund_teddy_perks_september_2022
Explanation:
d is a dict
d['profileChanges'] is a list of dicts
d['profileChanges'][0] is the first dict in the list
d['profileChanges'][0]['profile'] is a dict
d['profileChanges'][0]['profile']['version'] is the value of version key in the profile dict in the first entry of the profileChanges list.

Create/ re-create a list of dictionaries from a dictionary via Python Recursion function

So, I'm trying to parse this json object into multiple events, as it's the expected input for a ETL tool. I know this is quite straight forward if we do this via loops, if statements and explicitly defining the search fields for given events. This method is not feasible because I have multiple heavily nested JSON objects and I would prefer to let the python recursions handle the heavy lifting. The following is a sample object, which consist of string, list and dict (basically covers most use-cases, from the data I have).
{
"event_name": "restaurants",
"properties": {
"_id": "5a9909384309cf90b5739342",
"name": "Mangal Kebab Turkish Restaurant",
"restaurant_id": "41009112",
"borough": "Queens",
"cuisine": "Turkish",
"address": {
"building": "4620",
"coord": {
"0": -73.9180155,
"1": 40.7427742
},
"street": "Queens Boulevard",
"zipcode": "11104"
},
"grades": [
{
"date": 1414540800000,
"grade": "A",
"score": 12
},
{
"date": 1397692800000,
"grade": "A",
"score": 10
},
{
"date": 1381276800000,
"grade": "A",
"score": 12
}
]
}
}
And I want to convert it to this following list of dictionaries
[
{
"event_name": "restaurants",
"properties": {
"restaurant_id": "41009112",
"name": "Mangal Kebab Turkish Restaurant",
"cuisine": "Turkish",
"_id": "5a9909384309cf90b5739342",
"borough": "Queens"
}
},
{
"event_name": "restaurant_address",
"properties": {
"zipcode": "11104",
"ref_id": "41009112",
"street": "Queens Boulevard",
"building": "4620"
}
},
{
"event_name": "restaurant_address_coord"
"ref_id": "41009112"
"0": -73.9180155,
"1": 40.7427742
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1414540800000,
"ref_id": "41009112",
"score": 12,
"grade": "A",
"index": "0"
}
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1397692800000,
"ref_id": "41009112",
"score": 10,
"grade": "A",
"index": "1"
}
},
{
"event_name": "restaurant_grades",
"properties": {
"date": 1381276800000,
"ref_id": "41009112",
"score": 12,
"grade": "A",
"index": "2"
}
}
]
And most importantly these events will be broken up into independent structured tables to conduct joins, we need to create primary keys/ unique identifiers. So the deeply nested dictionaries should have its corresponding parents_id field as ref_id. In this case ref_id = restaurant_id from its parent dictionary.
Most of the example on the internet flatten's the whole object to be normalized and into a dataframe, but to utilise this ETL tool to its full potential it would be ideal to solve this problem via recursions and outputting as list of dictionaries.
This is what one might call a brute force method. Create a translator function to move each item into the correct part of the new structure (like a schema).
# input dict
d = {
"event_name": "demo",
"properties": {
"_id": "5a9909384309cf90b5739342",
"name": "Mangal Kebab Turkish Restaurant",
"restaurant_id": "41009112",
"borough": "Queens",
"cuisine": "Turkish",
"address": {
"building": "4620",
"coord": {
"0": -73.9180155,
"1": 40.7427742
},
"street": "Queens Boulevard",
"zipcode": "11104"
},
"grades": [
{
"date": 1414540800000,
"grade": "A",
"score": 12
},
{
"date": 1397692800000,
"grade": "A",
"score": 10
},
{
"date": 1381276800000,
"grade": "A",
"score": 12
}
]
}
}
def convert_structure(d: dict):
''' function to convert to new structure'''
# the new dict
e = {}
e['event_name'] = d['event_name']
e['properties'] = {}
e['properties']['restaurant_id'] = d['properties']['restaurant_id']
# and so forth...
# keep building the new structure / template
# return a list
return [e]
# run & print
x = convert_structure(d)
print(x)
the reuslt (for the part done) looks like this:
[{'event_name': 'demo', 'properties': {'restaurant_id': '41009112'}}]
If a pattern is identified, then the above could be improved...

Creating a config file to map model in json format to csv

How can I build a config file that has a python model in json format that maps to the csv column name?
My plan is when a json data comes in, it will look into this config file, map the data's field with the config's json field, then get the csv column name. Example as below:
First I have a json as below:
{
"id": {
"type": "integer",
"format": "int64"
},
"category": {
"$ref": "#/definitions/Category",
"x-scope": [
"https://petstore.swagger.io/v2/swagger.json"
]
},
"name": {
"type": "string",
"example": "doggie"
},
"photoUrls": {
"type": "array",
"xml": {
"name": "photoUrl",
"wrapped": true
},
"items": {
"type": "string"
}
},
"tags": {
"type": "array",
"xml": {
"name": "tag",
"wrapped": true
},
"items": {
"$ref": "#/definitions/Tag",
"x-scope": [
"https://petstore.swagger.io/v2/swagger.json"
]
}
},
"status": {
"type": "string",
"description": "pet status in the store",
"enum": [
"available",
"pending",
"sold"
]
}
}
Then I have csv column name as below:
id, category, name, photoUrls, tags, status
Now I want a config file which has mapping of this 2 field and probably some settings like delimiter type , etc.
How can I achieve that using python?

Python parse large JSON nests and lists - string indices must be integers

NIST recently released all CVE data in JSON format, and I am trying to parse it out to add to a MySQL database so I can compare my security findings to what NIST shows.
The data, is very confusing to parses because there is a lot of nesting, with some lists included.
Here is a snippet of the JSON.
{
"CVE_data_type": "CVE",
"CVE_data_format": "MITRE",
"CVE_data_version": "4.0",
"CVE_data_numberOfCVEs": "600",
"CVE_data_timestamp": "Fri Apr 28 16:00:10 EDT 2017",
"CVE_Items": [
{
"CVE_data_meta": {
"CVE_ID": "CVE-2007-6761"
},
"CVE_affects": {
"CVE_vendor": {
"CVE_data_version": "4.0",
"CVE_vendor_data": [
{
"CVE_vendor_name": "linux",
"CVE_product": {
"CVE_product_data": [
{
"CVE_data_version": "4.0",
"CVE_product_name": "linux_kernel",
"CVE_version": {
"CVE_version_data": [
{
"CVE_version_value": "2.6.23",
"CVE_version_affected": "<="
}
]
}
}
]
}
}
]
}
},
"CVE_configurations": {
"CVE_data_version": "4.0",
"CVE_configuration_data": [
{
"operator": "OR",
"cpe": [
{
"vulnerable": true,
"previousVersions": true,
"cpeMatchString": "cpe:/o:linux:linux_kernel:2.6.23",
"cpe23Uri": "cpe:2.3:o:linux:linux_kernel:2.6.23:*:*:*:*:*:*:*"
}
]
}
]
},
"CVE_description": {
"CVE_data_version": "4.0",
"CVE_description_data": [
{
"lang": "en",
"value": "drivers/media/video/videobuf-vmalloc.c in the Linux kernel before 2.6.24 does not initialize videobuf_mapping data structures, which allows local users to trigger an incorrect count value and videobuf leak via unspecified vectors, a different vulnerability than CVE-2010-5321."
}
]
},
"CVE_references": {
"CVE_data_version": "4.0",
"CVE_reference_data": [
{
"url": "http://www.linuxgrill.com/anonymous/kernel/v2.6/ChangeLog-2.6.24",
"name": "CONFIRM",
"publish_date": "04/24/2017"
},
{
"url": "http://www.securityfocus.com/bid/98001",
"name": "BID",
"publish_date": "04/26/2017"
},
{
"url": "https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=827340",
"name": "MISC",
"publish_date": "04/24/2017"
},
{
"url": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0b29669c065f60501e7289e1950fa2a618962358",
"name": "CONFIRM",
"publish_date": "04/24/2017"
},
{
"url": "https://github.com/torvalds/linux/commit/0b29669c065f60501e7289e1950fa2a618962358",
"name": "CONFIRM",
"publish_date": "04/24/2017"
}
]
},
"CVE_impact": {
"CVE_impact_cvssv2": {
"bm": {
"av": "LOCAL",
"ac": "LOW",
"au": "NONE",
"c": "PARTIAL",
"i": "PARTIAL",
"a": "PARTIAL",
"score": "4.6"
}
},
"CVE_impact_cvssv3": {
"bm": {
"av": "LOCAL",
"ac": "LOW",
"pr": "LOW",
"ui": "NONE",
"scope": "UNCHANGED",
"c": "HIGH",
"i": "HIGH",
"a": "HIGH",
"score": "7.8"
}
}
},
"CVE_problemtype": {
"CVE_data_version": "4.0",
"CVE_problemtype_data": [
{
"description": [
{
"lang": "en",
"value": "CWE-119"
}
]
}
]
}
}
]
}
When I try to parse it to get the info I want, I run into errors. Here is the code test.
import json
with open('/tmp/nvdcve-1.0-recent.json') as data_file:
cve_data = json.load(data_file)
product_list = []
for data_list in cve_data["CVE_Items"]:
for cve_tag,cve_id in data_list["CVE_data_meta"].items():
cve = str(cve_id)
for vendor_data in data_list["CVE_affects"]["CVE_vendor"]["CVE_vendor_data"]["CVE_product"]:
for data_version,product_name,version_set in vendor_data["CVE_product_data"].items():
print(product_name)
The Error
TypeError Traceback (most recent call last)
<ipython-input-10-81b0239327c1> in <module>()
10 cve = str(cve_id)
11
---> 12 for vendor_data in data_list["CVE_affects"]["CVE_vendor"]["CVE_vendor_data"]["CVE_product"]:
13 for data_version,product_name,version_set in vendor_data["CVE_product_data"].items():
14 print data_version
TypeError: list indices must be integers, not str
This is confusing to me because there is nests within nests, and lists within theses nests. I am having a hard time figuring out how to get some of this super nested info.
I feel your pain, but after closer inspection "CVE_vendor_data" is not a dictionary, but a list of dictionaries. Notice the "[]" after the colon. That is why it needs integers to index the list. Same goes for "CVE_product_data". It is also a list of dictionaries.

Python built JSON with mixed types

Actually I build Json object starting from a python object.
My starting JSON is:
responseMsgObject = {'Version': 1,
'Id': 'xc23',
'Local': "US"
'Type': "Test",
'Message' : "Message body" }
responseMsgJson = json.dumps(responseMsgObject, sort_keys=False )
Every things works but now I need to put the JSON below into "Message" field.
{
"DepID": "001",
"Assets": [
{
"Type": "xyz",
"Text": [
"abc",
"def"
],
"Metadata": {
"V": "1",
"Req": true,
"Other": "othervalue"
},
"Check": "refdw321"
},
{
"Type": "jkl",
"Text": [
"ghi"
],
"Metadata": {
"V": "6"
},
"Check": "345ghsdan"
}
]
}
I built many other json (but simpler) but I'm in trouble with this json.
Thanks for the help.
try to replace true with True works fine for me
import json
responseMsgObject = {
'Version': 1,
'Id': 'xc23',
'Local': "US",
'Type': "Test",
'Message': {
"DepID": "001",
"Assets": [{
"Type": "xyz",
"Text": [
"abc",
"def"
],
"Metadata": {
"V": "1",
"Req": True,
"Other": "othervalue"
},
"Check": "refdw321"
}, {
"Type": "jkl",
"Text": [
"ghi"
],
"Metadata": {
"V": "6"
},
"Check": "345ghsdan4"
}]
}
}
responseMsgJson = json.dumps(responseMsgObject, sort_keys=False )
print("responseMsgJson", responseMsgJson)
DEMO

Categories

Resources