I have the following invalid JSON string which I'd like to convert into valid JSON (so each "template" will have a vuln-x key before it):
{"template":"network/vsftpd-detection.yaml","matcher-status":true}{"template":"cves/2018/CVE-2018-15473.yaml","matcher-status":true}{"template":"cves/2016/CVE-2016-6210.yaml","matcher-status":true}
I'm currently doing the following in order to format it:
import json
s1 = '{"template":"network/vsftpd-detection.yaml","matcher-status":true}{"template":"cves/2018/CVE-2018-15473.yaml","matcher-status":true}{"template":"cves/2016/CVE-2016-6210.yaml","matcher-status":true}'
s2 = s1.split('{"template":')
num = s1.count('{"template":')
out_json = "{"
for x in range(num):
    out_json += '"vuln-' + str(x) + '":{"template":' + s2[x+1]
new_json = out_json.replace("true}", "true},")
cleaned_json = new_json[:-1] + "}"
print(cleaned_json)
I feel this is incredibly messy and am sure there's a cleaner way to do it - any ideas?
Here's the desired output which I'm getting with my current script:
{
  "vuln-0":{
    "template":"network/vsftpd-detection.yaml",
    "matcher-status":true
  },
  "vuln-1":{
    "template":"cves/2018/CVE-2018-15473.yaml",
    "matcher-status":true
  },
  "vuln-2":{
    "template":"cves/2016/CVE-2016-6210.yaml",
    "matcher-status":true
  }
}
Add a delimiter between the dictionaries to enable easier splitting, then process as dictionaries:
import json
s = '{"template":"network/vsftpd-detection.yaml","matcher-status":true}{"template":"cves/2018/CVE-2018-15473.yaml","matcher-status":true}{"template":"cves/2016/CVE-2016-6210.yaml","matcher-status":true}'
# add a delimiter not used in the string (nul) and split on it.
strings = s.replace('}{', '}\0{').split('\0')
# dict comprehension
data = {f'vuln-{i}': json.loads(v) for i, v in enumerate(strings)}
print(json.dumps(data, indent=2))
Output:
{
  "vuln-0": {
    "template": "network/vsftpd-detection.yaml",
    "matcher-status": true
  },
  "vuln-1": {
    "template": "cves/2018/CVE-2018-15473.yaml",
    "matcher-status": true
  },
  "vuln-2": {
    "template": "cves/2016/CVE-2016-6210.yaml",
    "matcher-status": true
  }
}
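If you'd rather not assume that '}{' never occurs inside a string value, json.JSONDecoder.raw_decode can walk the concatenated objects directly. A minimal sketch reusing s from above:

import json

def iter_concatenated(s):
    # raw_decode parses one JSON value and returns (object, end_index),
    # so we can repeatedly resume from where the previous object ended.
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(s):
        obj, idx = decoder.raw_decode(s, idx)
        yield obj
        while idx < len(s) and s[idx].isspace():  # skip any whitespace between objects
            idx += 1

data = {f'vuln-{i}': v for i, v in enumerate(iter_concatenated(s))}
print(json.dumps(data, indent=2))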
Related
Below is JSON with an array of elements. How do I get all the name values into an array? Is there a simpler way of doing it without a for loop?
import json

# Define json variable
jsondata = """[
    {
        "name":"Pen",
        "unit_price":5
    },
    {
        "name":"Eraser",
        "unit_price":3
    },
    {
        "name":"Pencil",
        "unit_price":10
    },
    {
        "name":"White paper",
        "unit_price":15
    }
]"""

# load the json data
items = json.loads(jsondata)
namelist = []
for keyval in items:
    namelist.append(keyval['name'])
print(namelist)
A list comprehension does this in one line:
names = [it['name'] for it in items]
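If you prefer to skip even the comprehension, operator.itemgetter with map is a stylistic alternative:

from operator import itemgetter

names = list(map(itemgetter('name'), items))
print(names)  # ['Pen', 'Eraser', 'Pencil', 'White paper']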
On a nested JSON object, I would like to modify values and add a JSON object.
Assume a JSON Object like this:
{
    "key1": "value1",
    "key2": {
        "key2_1": "value2_2 ",
        "key2_2": {
            "key2_2_1 ": "value2_2_1"
        },
        "key2_3": "value2_3",
        "key2_4": {
            "key2_4_1": [{
                "key2_4_1_1a": "value2_4_1_1a",
                "key2_4_1_2a": "value2_4_1_2a"
            }, {
                "key2_4_1_1b": "value2_4_1_1b",
                "key2_4_1_2b": "value2_4_1_2b"
            }]
        }
    },
    "key3": {
        "key3_1": "value3_2 ",
        "key3_2": {
            "key3_2_1 ": "value3_2_1"
        },
        "key3_3": "value3_3",
        "key3_4": {
            "key3_4_1": {
                "key3_4_1_1": "value3_4_1_1"
            }
        }
    }
}
Now the JSON is iterated recursively to search for a specific value.
The replacement value can be a string
repl = 'MyString'
a dict (as a string)
repl = '''{"MyKey": [{"MyKey1": "MyValye1"},{"MyKey2": "MyValye2"}]}'''
or a list (as a string)
repl = '''[{"MyKey1": "MyValye1"},{"MyKey2": "MyValye2"}]'''
After I have found the key where the replacement should go, I would like to replace the existing value for that key,
e.g. for the string:
a[key] = repl
How can I do this for dict or list replacements?
The result could be, depending on the replacement variable, the string (e.g. in "key2_1"), the dict (in "key2_2_1") or the list (in "key2_3"). The keys where the string, dict or list are inserted are just examples.
{
    "key1": "value1",
    "key2": {
        "key2_1": "MyString",
        "key2_2": {
            "key2_2_1 ": {"MyKey": [{"MyKey1": "MyValye1"},{"MyKey2": "MyValye2"}]}
        },
        "key2_3": [{"MyKey1": "MyValye1"},{"MyKey2": "MyValye2"}],
        "key2_4": {
            "key2_4_1": [{
                "key2_4_1_1a": "value2_4_1_1a",
                "key2_4_1_2a": "value2_4_1_2a"
            }, {
                "key2_4_1_1b": "value2_4_1_1b",
                "key2_4_1_2b": "value2_4_1_2b"
            }]
        }
    }
}
I have a search function:
def searchNreplace(data, search_val, replace_val):
    if isinstance(data, list):
        return [searchNreplace(listd, search_val, replace_val) for listd in data]
    if isinstance(data, dict):
        return {dictkey: searchNreplace(dictvalue, search_val, replace_val) for dictkey, dictvalue in data.items()}
    return replace_val if data == search_val else data
print(searchNreplace(data, "key3", repl))
If finding the key is not the problem, you can use the json library to parse your string into a Python object and assign it just like a str:
import json
repl = """{"MyKey": [{"MyKey1": "MyValye1"},{"MyKey2": "MyValye2"}]}"""
a[key] = json.loads(repl)
After that you can dump the content back to a file:
with open("my_file", "w+") as f:
    json.dump(a, f)
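Putting the two together, a minimal sketch (assuming data holds the parsed JSON from the question; note that searchNreplace compares values, not keys, so search for an existing value such as "value2_2_1"):

import json

repl = json.loads('''{"MyKey": [{"MyKey1": "MyValye1"},{"MyKey2": "MyValye2"}]}''')
# Every occurrence of the value "value2_2_1" is replaced by the parsed dict.
data = searchNreplace(data, "value2_2_1", repl)
print(json.dumps(data, indent=2))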
Problem Statement:
I have around 500 ZIP files containing lots of XMLs. I am able to convert them to JSON and parse them into parquet files, as in the example below for one nested JSON file, but I am not able to process multiple files with Spark.
The code below flattens a whole JSON document into a pandas data frame, but now I have to run it over 150,000 files. When a JSON document is very big it takes around 2 minutes to flatten all the data. And if I run it with Spark over an RDD of multiple files, it fails with either an OOM or a struct error.
Am I doing something wrong Spark-wise?
import io
import zipfile
import xmltodict
import pandas as pd

def parser(master_tree):
    flatten_tree_node = []
    def _process_leaves(tree: dict, prefix: str = "node", tree_node: dict = dict(), update: bool = True):
        is_nested = False
        if isinstance(tree, dict):
            for k in tree.keys():
                if type(tree[k]) == str:
                    colName = prefix + "_" + k
                    tree_node[colName] = tree[k]
                elif type(tree[k]) == dict:
                    prefix += "_" + k
                    leave = tree[k]
                    _process_leaves(leave, prefix=prefix, tree_node=tree_node, update=False)
            for k in tree.keys():
                if type(tree[k]) == list:
                    is_nested = True
                    prefix += "_" + k
                    for leave in tree[k]:
                        _process_leaves(leave, prefix=prefix, tree_node=tree_node.copy())
        if not is_nested and update:
            flatten_tree_node.append(tree_node)
    _process_leaves(master_tree)
    df = pd.DataFrame(flatten_tree_node)
    df.columns = df.columns.str.replace("#", "_")
    return df

def extractor(file_name, file):
    data = file.decode('utf-8')
    d = bytes(bytearray(data, encoding='utf-8'))
    dict_data = xmltodict.parse(d)
    flatten_data = parser(dict_data)
    return (file_name, flatten_data)

def extract_files(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return [extractor(file_name, file_obj.open(file_name).read()) for file_name in files]
zip_rdd = spark.read.format('binaryFile').load('/home/me/sample.zip').select('path','content').rdd
It fails here at collection time:
collected_data = zip_rdd.map(extract_files).collect()
Errors:
org.apache.spark.api.python.PythonException: 'struct.error: 'i' format requires -2147483648 <= number <= 2147483647'. Full traceback
or
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123
Everything works fine, though, when run on only a single file.
An example run of parsing nested JSON with the parser function is below. Is there a way to make it memory- and speed-efficient?
import pandas as pd

tree = {
    "products": [
        {
            "id": "0",
            "name": "First",
            "emptylist": [],
            "properties": {
                "id": "",
                "name": ""
            }
        },
        {
            "id": "1",
            "name": "Second",
            "emptylist": [],
            "properties": {
                "id": "23",
                "name": "a useful product",
                "features": [
                    {
                        "name": "Features",
                        "id": "18",
                        "features": [
                            {
                                "id": "1001",
                                "name": "Colour",
                                "value": "Black"
                            },
                            {
                                "id": "2093",
                                "name": "Material",
                                "value": "Plastic"
                            }
                        ]
                    },
                    {
                        "name": "Sizes",
                        "id": "34",
                        "features": [
                            {
                                "id": "4736",
                                "name": "Length",
                                "value": "56"
                            },
                            {
                                "id": "8745",
                                "name": "Width",
                                "value": "76"
                            }
                        ]
                    }
                ]
            }
        },
        {
            "id": "2",
            "name": "Third",
            "properties": {
                "id": "876",
                "name": "another one",
                "features": [
                    {
                        "name": "Box",
                        "id": "937",
                        "features": [
                            {
                                "id": "3758",
                                "name": "Amount",
                                "value": "1"
                            },
                            {
                                "id": "2222",
                                "name": "Packaging",
                                "value": "Blister"
                            }
                        ]
                    },
                    {
                        "name": "Features",
                        "id": "8473",
                        "features": [
                            {
                                "id": "9372",
                                "name": "Colour",
                                "value": "White"
                            },
                            {
                                "id": "9375",
                                "name": "Position",
                                "value": "A"
                            },
                            {
                                "id": "2654",
                                "name": "Amount",
                                "value": "6"
                            }
                        ]
                    }
                ]
            }
        }
    ]
}
def parser(master_tree):
    flatten_tree_node = []
    def _process_leaves(tree: dict, prefix: str = "node", tree_node: dict = dict(), update: bool = True):
        is_nested = False
        if isinstance(tree, dict):
            for k in tree.keys():
                if type(tree[k]) == str:
                    colName = prefix + "_" + k
                    tree_node[colName] = tree[k]
                elif type(tree[k]) == dict:
                    prefix += "_" + k
                    leave = tree[k]
                    _process_leaves(leave, prefix=prefix, tree_node=tree_node, update=False)
            for k in tree.keys():
                if type(tree[k]) == list:
                    is_nested = True
                    prefix += "_" + k
                    for leave in tree[k]:
                        _process_leaves(leave, prefix=prefix, tree_node=tree_node.copy())
        if not is_nested and update:
            flatten_tree_node.append(tree_node)
    _process_leaves(master_tree)
    df = pd.DataFrame(flatten_tree_node)
    df.columns = df.columns.str.replace("#", "_")
    return df
print(parser(tree))
node_products_id node_products_name ... node_products_properties_features_features_name node_products_properties_features_features_value
0 1 Second ... Colour Black
1 1 Second ... Material Plastic
2 1 Second ... Length 56
3 1 Second ... Width 76
4 2 Third ... Amount 1
5 2 Third ... Packaging Blister
6 2 Third ... Colour White
7 2 Third ... Position A
8 2 Third ... Amount 6
9 2 Third ... NaN NaN
[10 rows x 9 columns]
Do not collect this data; it will most likely never fit in memory, because you are trying to pull all of the data into the driver.
You can just save it to a file directly.
collected_data = zip_rdd.map(extract_files).toDF("column","names","go","here")
collected_data.write.parquet("/path/to/folder")
I do not have Spark 3.2, but I'm aware of the features it possesses, and in this case it will make your life easy: unionByName is a new feature that will let you magically join schemas.
collected_data = spark.createDataFrame(data=[], schema=[])
# This will likely fit in driver memory, so it's OK to collect; after all, it's just a list of file paths.
zip_array = spark.read.format('binaryFile').load('/home/me/sample.zip').select('path').collect()
for my_file in zip_array:
    collected_data = collected_data.unionByName(spark.createDataFrame(extract_files(my_file)), allowMissingColumns=True)
collected_data.write.parquet("/path/to/folder")
For better efficiency you want to use mapPartitions. There are a couple of reasons why, but this actually goes back to the map/reduce era: you want to create an iterator, as this can work at a lower level and be optimized and pipelined better (hence the use of yield).
mapPartitions code executes inside an executor and can only contain plain Python code; no Spark code is allowed, as you don't have access to the SparkContext in an executor. Sometimes this requires imports to be done in the function itself, as the scope is local, not global.
If you are looking to save more memory, you might want to reconsider an alternative to xmltodict.parse(d) and rewrite reformat accordingly. You could use a library that you initialize once per partition and reuse for the entire set of rows in the partition. That would be more efficient than the static call to xmltodict.parse(d), which uses memory to create a struct that is thrown away immediately by the garbage collector as it goes out of scope. (A search lists several alternatives you can review to determine which best fits your needs; see the sketch after the code below.)
# Keep this as a DataFrame (no collect()) so mapPartitions runs on the executors;
# row[1] is the zip's binary content.
zip_array = spark.read.format('binaryFile').load('/home/me/sample.zip').select('path', 'content')

def reformat(partitionData):
    for row in partitionData:
        in_memory_data = io.BytesIO(row[1])
        file_obj = zipfile.ZipFile(in_memory_data, "r")
        for file_name in file_obj.namelist():
            yield extractor(file_name, file_obj.open(file_name).read())

collected_data = zip_array.rdd.mapPartitions(reformat).toDF(["file_name", "flattened_data"])
collected_data.write.parquet("/path/to/folder")
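As a sketch of the "initialize once per partition" idea above (make_parser is a hypothetical factory for whatever streaming XML library you settle on, not a real API):

def reformat(partitionData):
    import io, zipfile             # local imports: this body runs on the executors
    parser_obj = make_parser()     # hypothetical: one expensive parser, built once per partition
    for row in partitionData:
        with zipfile.ZipFile(io.BytesIO(row[1]), "r") as zf:
            for file_name in zf.namelist():
                # reuse parser_obj for every file in the partition; .parse() is hypothetical
                yield (file_name, parser_obj.parse(zf.open(file_name).read()))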
I am trying to modify a JSON document with Python but I cannot do it correctly.
I have tried the json module that comes with Python by default, but I cannot work out the next step.
The JSON to modify is this:
{
    "uuid": "789ce6ed-ec0f-418b-8fad-6ba64cb8bd70",
    "assetTemplate": [
        {
            "id": 14,
            "name": "Template-conectividad"
        },
        {
            "id": 54,
            "name": "Template-discos-agata"
        },
        {
            "id": 17,
            "name": "Template-servidor-linux"
        }
    ],
    "info": null
}
And it should look like this:
{
    "uuid": "789ce6ed-ec0f-418b-8fad-6ba64cb8bd70",
    "assetTemplate": [
        {
            "id": 54,
            "name": "Template-discos-agata"
        },
        {
            "id": 17,
            "name": "Template-servidor-linux"
        },
        {
            "id": 85,
            "name": "Template-conectividad-test"
        }
    ],
    "info": null
}
This is what I tried; it removes the part I do not want, but I am still missing the part that inserts the new data:
#!/usr/bin/python
import json

# We load the JSON to modify
x = '{"uuid":"789ce6ed-ec0f-418b-8fad-6ba64cb8bd70","assetTemplate":[{"id":14,"name":"Template-conectividad"},{"id":54,"name":"Template-discos-agata"},{"id":17,"name":"Template-servidor-linux"}],"info":null}'
y = json.loads(x)
obj = y["assetTemplate"]

# We remove the object that we don't want
for i in range(len(obj)):
    if obj[i]['id'] == 14:
        del obj[i]
        break
print(obj)

# We output what has been achieved
x = json.dumps(y)
print(x)
When you load JSON, its contents are loaded as dictionaries ({} with contents as key:value pairs) and lists ([]).
That means obj is a normal list, which you already kinda know because you iterate over it.
Because it's a normal list, you can just .append what you want as a dictionary, so:
d={"id":85,"name":"Template-conectividad-test"}
obj.append(d)
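For completeness, a filter-then-append sketch that sidesteps deleting by index while looping (same y as in the question):

# Rebuild the list without id 14, then append the new template.
y["assetTemplate"] = [t for t in y["assetTemplate"] if t["id"] != 14]
y["assetTemplate"].append({"id": 85, "name": "Template-conectividad-test"})
print(json.dumps(y, indent=2))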
payload = [
    {
        "Beds:": "3"
    },
    {
        "Baths:": "2.0"
    },
    {
        "Sqft:": "1,260"
    },
]
How would I turn that list into:
payload = [{'Beds':"3","Baths":"2.0","Sqft":"1,260"}]
so there is one dictionary within the list instead of multiple dictionaries?
Try this:
payload_new = [{i: j[i] for j in payload for i in j}]
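Given the payload above, this yields one merged dictionary, though the keys keep their trailing colons:

print(payload_new)
# [{'Beds:': '3', 'Baths:': '2.0', 'Sqft:': '1,260'}]

The next answer strips the colons as well.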
This should help. Use the replace method to remove the ":" from the keys as well:
payload = [
    {
        "Beds:": "3"
    },
    {
        "Baths:": "2.0"
    },
    {
        "Sqft:": "1,260"
    },
]

newDict = [{k.replace(":", ""): v for j in payload for k, v in j.items()}]
print(newDict)
Output:
[{'Beds': '3', 'Sqft': '1,260', 'Baths': '2.0'}]
Python 3 has built-in dictionary unpacking; for a list of dicts you can fold the ** merges together with reduce:
from functools import reduce
payload = [reduce(lambda a, b: {**a, **b}, payload, {})]
To merge the dictionaries into one big dictionary, you can also just write it this way:
payload = {"Beds": 3,
           "Baths": 2.0,
           "Sqft": 1260
           }
Output:
>>> payload["Baths"]
2.0
Notes:
Using [] was making it an array/list rather than a dictionary.
Using quotes on the values (e.g. "3") was making them strings instead of numbers.