How do I convert it to a dict - python

The list I have:
[
    "Mathematics-2 (21SMT-125)",
    "Mid-Semester Test-1",
    "40",
    "23.5",
    "Mid-Semester Test-2",
    "40",
    "34",
    "Disruptive Technologies - 2 (21ECH-103)",
    "Experiment-1",
    "20",
    "19",
    "Experiment-2",
    "20",
    "17",
    "Experiment-3",
    "20",
    "18.5",
]
This list of strings is parsed from HTML using bs4.
The format to convert into:
{
    "Subject": {
        "Mathematics-2 (21SMT-125)": {
            "Mid-Semester Test-1": [40, 23.5],
            "Mid-Semester Test-2": [40, 34]
        },
        "Disruptive Technologies - 2 (21ECH-103)": {
            "Experiment-1": [20, 19],
            "Experiment-2": [20, 17],
            "Experiment-3": [20, 18.5]
        }
    }
}

The problem is that the list you provided is flat, with no indicator of each item's position in the desired hierarchy.
One approach you could consider: if the entries that represent a parent object (Mathematics, etc.) are the only entries that contain parentheses, you could iterate over the list, use string matching or a regex to identify each parent, create a top-level object for it, and then store the next two entries as the list value of each key/value pair beneath it.
This assumes you'll always have exactly two subsequent values at the child level. If the number of values isn't fixed but they're always numeric, you could instead check whether each entry is numeric and keep adding items to the value list until you hit another non-numeric entry, which would be treated as the next sibling in the hierarchy. A minimal sketch of this idea follows.
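A minimal sketch of that approach, assuming subject names are the only entries containing "(" and that every other non-numeric entry is a test name (the helper name to_nested is mine, not from the question):
def to_nested(items):
    result, subject, test = {}, None, None
    for item in items:
        try:
            value = float(item)              # numeric entry: a score
        except ValueError:
            if "(" in item:                  # parentheses mark a subject
                subject = item
                result[subject] = {}
            else:                            # anything else is a test/experiment name
                test = item
                result[subject][test] = []
        else:
            result[subject][test].append(value)
    return result
Note this yields floats throughout (40.0 rather than 40); keeping ints where possible needs the int-then-float fallback shown in a later answer.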

I would review the approach and check whether the information can be extracted from bs4 in a smarter way - try to do more scraping steps: first to reach the subject, second the "Test/Experiment" name, third the grades.
If that's not possible and the data returned from bs4 cannot be changed, the only thing you can do is try to determine whether each string is the name of a subject, a test/experiment, or a grade/score, and use some loops. The subject names seem to have a special code at the end, which a regex can distinguish from test/experiment names, and a grade/score can always be parsed as a number.
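For the first suggestion, a purely hypothetical sketch: the selectors below (div.subject, h3, tr/td) are assumptions about the page, not the asker's actual markup, and the html variable is assumed to hold the scraped page; adjust to whatever bs4 actually sees:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")
result = {}
for block in soup.select("div.subject"):      # one container per subject (assumed)
    subject = block.select_one("h3").get_text(strip=True)
    tests = {}
    for row in block.select("tr"):            # one row per test: name, max, scored (assumed)
        name, maximum, scored = (td.get_text(strip=True) for td in row.select("td"))
        tests[name] = [float(maximum), float(scored)]
    result[subject] = tests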

For data exactly like yours (where a string containing a "(" denotes a top-level entry, and there are always two numbers per entry), you could come up with a state-machine sort of thing like this -- but like I commented, you really should improve your parsing code instead, since the HTML you're scraping your data off is likely already structured.
def is_float(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def parse_inp(inp):
    # Maps a tuple key path, e.g. ("Mathematics-2 (21SMT-125)", "Mid-Semester Test-1"),
    # to its pair of scores.
    flat_map = {}
    stack = []
    x = 0
    while x < len(inp):
        if "(" in inp[x]:
            stack.clear()          # a "(" marks a new top-level entry
        if is_float(inp[x]) and is_float(inp[x + 1]):
            flat_map[tuple(stack)] = (float(inp[x]), float(inp[x + 1]))
            x += 2
            stack.pop(-1)          # done with this test name; keep the subject
            continue
        stack.append(inp[x])
        x += 1
    return flat_map

def nest_flat_map(flat_map):
    # Expands {(a, b): v} into {a: {b: v}}.
    root = {}
    for key_path, values_list in flat_map.items():
        dst = root
        for key in key_path[:-1]:
            dst = dst.setdefault(key, {})
        dst[key_path[-1]] = values_list
    return root

inp = [
    # ... data from original post
]

nested_map = nest_flat_map(parse_inp(inp))
print(nested_map)
This outputs the expected
{
    "Mathematics-2 (21SMT-125)": {
        "Mid-Semester Test-1": (40.0, 23.5),
        "Mid-Semester Test-2": (40.0, 34.0),
    },
    "Disruptive Technologies - 2 (21ECH-103)": {
        "Experiment-1": (20.0, 19.0),
        "Experiment-2": (20.0, 17.0),
        "Experiment-3": (20.0, 18.5),
    },
}

You can use a fuzzy form of itertools.groupby to find the groups in this list of strings. This assumes that every class ends with the pattern "(classref-section)", and that it is followed by test or homework names each followed by one or more numeric scores.
source_data = [
    "Mathematics-2 (21SMT-125)",
    "Mid-Semester Test-1",
    "40",
    "23.5",
    "Mid-Semester Test-2",
    "40",
    "34",
    "Disruptive Technologies - 2 (21ECH-103)",
    "Experiment-1",
    "20",
    "19",
    "Experiment-2",
    "20",
    "17",
    "Experiment-3",
    "20",
    "18.5",
]
from collections import defaultdict
import itertools
import json
import re

class_id_pattern = re.compile(r"\([A-Z0-9]+-\d+\)")

def is_class_reference(s):
    # True if the last space-separated token looks like "(21SMT-125)"
    return bool(class_id_pattern.match(s.rsplit(" ", 1)[-1]))

def group_by_class(s):
    # "Fuzzy" grouping key: remember the most recent class name seen,
    # so every following line groups under it.
    if is_class_reference(s):
        group_by_class.current_class = s
    return group_by_class.current_class
group_by_class.current_class = ""

def convert_numeric(s):
    try:
        return int(s)
    except ValueError:
        try:
            return float(s)
        except ValueError:
            return None

def is_score(s):
    return convert_numeric(s) is not None

def is_test(s):
    return not is_score(s)

def group_by_test(s):
    # Same trick one level down: group scores under the last test name seen.
    if is_test(s):
        group_by_test.current_test = s
    return group_by_test.current_test
group_by_test.current_test = ""

accum = defaultdict(lambda: defaultdict(list))
for class_name, class_name_and_tests in itertools.groupby(source_data, key=group_by_class):
    class_name, *tests = class_name_and_tests
    for test_name, test_name_and_scores in itertools.groupby(tests, key=group_by_test):
        test_name, *scores = test_name_and_scores
        accum[class_name][test_name].extend(convert_numeric(s) for s in scores)

print(json.dumps(accum, indent=4))
Prints:
{
    "Mathematics-2 (21SMT-125)": {
        "Mid-Semester Test-1": [
            40,
            23.5
        ],
        "Mid-Semester Test-2": [
            40,
            34
        ]
    },
    "Disruptive Technologies - 2 (21ECH-103)": {
        "Experiment-1": [
            20,
            19
        ],
        "Experiment-2": [
            20,
            17
        ],
        "Experiment-3": [
            20,
            18.5
        ]
    }
}
Read more about fuzzy groupby in my blog post: https://thingspython.wordpress.com/2020/11/11/fuzzy-groupby-unusual-restaurant-part-ii/

How to make JSON flattening memory efficient?

Problem Statement:
I have around 500 ZIP files containing lots of XMLs. I am able to convert them to JSON and write them out as parquet files, as in the example below for one nested JSON file.
I am also not able to process multiple files with Spark.
The code below flattens a whole JSON document into a pandas data frame, but I now have to run it over 150,000 files. When a JSON document is very big, flattening it takes around 2 minutes, and if I run it with Spark over an RDD of multiple files, it fails with either an OOM or a struct error.
Am I doing something wrong Spark-wise?
import io
import zipfile

import xmltodict
import pandas as pd

def parser(master_tree):
    flatten_tree_node = []
    def _process_leaves(tree: dict, prefix: str = "node", tree_node: dict = dict(), update: bool = True):
        is_nested = False
        if isinstance(tree, dict):
            for k in tree.keys():
                if type(tree[k]) == str:
                    colName = prefix + "_" + k
                    tree_node[colName] = tree[k]
                elif type(tree[k]) == dict:
                    prefix += "_" + k
                    leave = tree[k]
                    _process_leaves(leave, prefix=prefix, tree_node=tree_node, update=False)
            for k in tree.keys():
                if type(tree[k]) == list:
                    is_nested = True
                    prefix += "_" + k
                    for leave in tree[k]:
                        _process_leaves(leave, prefix=prefix, tree_node=tree_node.copy())
        if not is_nested and update:
            flatten_tree_node.append(tree_node)
    _process_leaves(master_tree)
    df = pd.DataFrame(flatten_tree_node)
    df.columns = df.columns.str.replace("#", "_")
    return df

def extractor(file_name, file):
    data = file.decode('utf-8')
    d = bytes(bytearray(data, encoding='utf-8'))
    dict_data = xmltodict.parse(d)
    flatten_data = parser(dict_data)
    return (file_name, flatten_data)

def extract_files(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return [extractor(file_name, file_obj.open(file_name).read()) for file_name in files]
zip_rdd = spark.read.format('binaryFile').load('/home/me/sample.zip').select('path','content').rdd
Fails here at the time of collection:
collected_data = zip_rdd.map(extract_files).collect()
The errors:
org.apache.spark.api.python.PythonException: 'struct.error: 'i' format requires -2147483648 <= number <= 2147483647'. Full traceback
or
java.lang.OutOfMemoryError
at java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:123
Everything works fine, though, when run on only a single file.
An example run of the parser function on nested JSON is below. Is there a way to make it memory- and speed-efficient?
import pandas as pd
tree = {
    "products": [
        {
            "id": "0",
            "name": "First",
            "emptylist": [],
            "properties": {
                "id": "",
                "name": ""
            }
        },
        {
            "id": "1",
            "name": "Second",
            "emptylist": [],
            "properties": {
                "id": "23",
                "name": "a useful product",
                "features": [
                    {
                        "name": "Features",
                        "id": "18",
                        "features": [
                            {"id": "1001", "name": "Colour", "value": "Black"},
                            {"id": "2093", "name": "Material", "value": "Plastic"}
                        ]
                    },
                    {
                        "name": "Sizes",
                        "id": "34",
                        "features": [
                            {"id": "4736", "name": "Length", "value": "56"},
                            {"id": "8745", "name": "Width", "value": "76"}
                        ]
                    }
                ]
            }
        },
        {
            "id": "2",
            "name": "Third",
            "properties": {
                "id": "876",
                "name": "another one",
                "features": [
                    {
                        "name": "Box",
                        "id": "937",
                        "features": [
                            {"id": "3758", "name": "Amount", "value": "1"},
                            {"id": "2222", "name": "Packaging", "value": "Blister"}
                        ]
                    },
                    {
                        "name": "Features",
                        "id": "8473",
                        "features": [
                            {"id": "9372", "name": "Colour", "value": "White"},
                            {"id": "9375", "name": "Position", "value": "A"},
                            {"id": "2654", "name": "Amount", "value": "6"}
                        ]
                    }
                ]
            }
        }
    ]
}
# parser() as defined above
print(parser(tree))
node_products_id node_products_name ... node_products_properties_features_features_name node_products_properties_features_features_value
0 1 Second ... Colour Black
1 1 Second ... Material Plastic
2 1 Second ... Length 56
3 1 Second ... Width 76
4 2 Third ... Amount 1
5 2 Third ... Packaging Blister
6 2 Third ... Colour White
7 2 Third ... Position A
8 2 Third ... Amount 6
9 2 Third ... NaN NaN
[10 rows x 9 columns]
Do not collect this data; it will likely never fit in memory, since you are trying to pull all of it into the driver.
You can just save it to a file directly (flatMap, rather than map, flattens the per-zip lists of tuples into one RDD of rows):
collected_data = zip_rdd.flatMap(extract_files).toDF(["column", "names", "go", "here"])
collected_data.write.parquet("/path/to/folder")
I do not have Spark 3.2, but I'm aware of the features it possesses, and in this case they will make your life easy: unionByName with allowMissingColumns=True will let you magically merge differing schemas.
collected_data = spark.createDataFrame(data=[], schema=[])
# this will likely fit in driver memory, since it's just a list of file paths
zip_array = [row.path for row in spark.read.format('binaryFile').load('/home/me/sample.zip').select('path').collect()]
for my_path in zip_array:
    # read one file at a time so only a single file's content is on the driver
    one_file = spark.read.format('binaryFile').load(my_path).select('path', 'content').head()
    collected_data = collected_data.unionByName(spark.createDataFrame(extract_files(one_file)), allowMissingColumns=True)
collected_data.write.parquet("/path/to/folder")
For better efficiency you want to use mapPartitions. There are a couple of reasons why, but this actually goes back to the map/reduce era: you want to work with an iterator, which can be optimized and pipelined at a lower level (hence the use of yield).
mapPartitions code executes inside an executor and can only contain plain Python code; no Spark code is allowed, as you don't have access to the SparkContext in an executor. It sometimes requires imports to be done inside the function itself, since the scope is local, not global.
If you are looking to save more memory, you might want to reconsider xmltodict.parse(d) and rewrite reformat. You could use a parser that you initialize once per partition and reuse for the entire set of rows in that partition. This would be more efficient than the static call to xmltodict.parse(d), which uses memory to create a struct that is thrown away by the garbage collector as soon as it goes out of scope. (A search will turn up several alternatives you can review to determine which best fits your needs.) A sketch of this initialize-once pattern follows the code below.
zip_df = spark.read.format('binaryFile').load('/home/me/sample.zip').select('path', 'content')

def reformat(partitionData):
    for row in partitionData:
        in_memory_data = io.BytesIO(row.content)
        with zipfile.ZipFile(in_memory_data, "r") as file_obj:
            for file_name in file_obj.namelist():
                yield extractor(file_name, file_obj.open(file_name).read())

collected_data = zip_df.rdd.mapPartitions(reformat).toDF(["file_name", "flattened_data"])
collected_data.write.parquet("/path/to/folder")
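And a hedged sketch of the initialize-once-per-partition pattern mentioned above; ReusableXmlParser is a stand-in name for whatever streaming parser you pick, not a real library:
def reformat_reusing_parser(partition_rows):
    # executor-side: keep imports local to the function
    import io
    import zipfile

    parser = ReusableXmlParser()  # hypothetical: built once per partition
    for row in partition_rows:
        with zipfile.ZipFile(io.BytesIO(row.content)) as zf:
            for name in zf.namelist():
                # the same parser object serves every file in the partition
                yield (name, parser.parse(zf.read(name)))

# zip_df.rdd.mapPartitions(reformat_reusing_parser)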

python for loop processed in strict order

I am using a Python for loop to process JSON data, but I need certain data to be processed first. Within my list, I want items where each_ADSL["VRF"] == "17" to be processed before any others.
The JSON data I am interpreting looks something like this:
"ADSL": [
{
"CE_HOSTNAME": "TESTCE-DCNCE-01",
"VRF": "19",
},
{
"CE_HOSTNAME": "TESTCE-DCNCE-01",
"VRF": "17",
}
]
I am interpreting this, then processing the data:
for each_ADSL in order["ADSL"]:
    do_something()
This needs to take into account numbers lower than 17 (so a simple sort won't work). Can I turn order["ADSL"] into a list and sort it by some criterion?
How about something like this:
myjson = {
    "ADSL": [
        {
            "CE_HOSTNAME": "TESTCE-DCNCE-01",
            "VRF": "19",
        },
        {
            "CE_HOSTNAME": "TESTCE-DCNCE-01",
            "VRF": "17",
        }
    ]
}

mylist = myjson["ADSL"]
list17 = []
for item in mylist:
    if item["VRF"] == "17":
        list17.append(item)

for item in list17:
    do_first_action()

for item in mylist:
    if item["VRF"] != "17":
        do_second_action()
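Alternatively, a small sketch using sorted() with a boolean key: False sorts before True, so the VRF "17" items come first, and because sorted() is stable everything else keeps its original order. (process is a hypothetical stand-in for whatever your loop body does.)
for each_ADSL in sorted(myjson["ADSL"], key=lambda d: d["VRF"] != "17"):
    process(each_ADSL)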

The best way to transform a response to a json format in the example

I would appreciate help with the best way to transform a result into JSON as below.
We have a result like the one below, with information on employees and companies. In the result, somehow, we get an enum-like prefix T, but not on all of the properties.
[
    {
        "T.id": "Employee_11",
        "T.category": "Employee",
        "node_id": ["11"]
    },
    {
        "T.id": "Company_12",
        "T.category": "Company",
        "node_id": ["12"],
        "employeecount": 800
    },
    {
        "T.id": "id~Employee_11_to_Company_12",
        "T.category": "WorksIn"
    },
    {
        "T.id": "Employee_13",
        "T.category": "Employee",
        "node_id": ["13"]
    },
    {
        "T.id": "Parent_Company_14",
        "T.category": "ParentCompany",
        "node_id": ["14"],
        "employeecount": 900,
        "childcompany": "Company_12"
    },
    {
        "T.id": "id~Employee_13_to_Parent_Company_14",
        "T.category": "Contractorin"
    }
]
We need to transform this result into a different structure, grouping by category: if the category is Employee, Company, or ParentCompany, the entry should go under the node_properties object; otherwise it goes under edge_properties. Apart from the common properties (property_id, property_category, and node), extra properties have to be added when the category is Company or ParentCompany. There is also some logic for deriving the from and to properties of each edge object from the "_to_" in its id. The expected response is:
"node_properties":[
{
"property_id":"Employee_11",
"property_category":"Employee",
"node":{node_id: "11"}
},
{
"property_id":"Company_12",
"property_category":"Company",
"node":{node_id: "12"},
"employeecount":800
},
{
"property_id":"Employee_13",
"property_category":"Employee",
"node":{node_id: "13"}
},
{
"property_id":"Company_14",
"property_category":"ParentCompany",
"node":{node_id: "14"},
"employeecount":900,
"childcompany":"Company_12"
}
],
"edge_properties":[
{
"from":"Employee_11",
"to":"Company_12",
"property_id":"Employee_11_to_Company_12",
},
{
"from":"Employee_13",
"to":"Parent_Company_14",
"property_id":"Employee_13_to_Parent_Company_14",
}
]
In Java, we would use an enhanced for loop, switch, etc. How can we write Python code to produce the structure above from the initial result? (I am new to Python.) Thank you in advance.
Here is a method I quickly put together; you can adjust it to your requirements. You can use a regex or your own function to get the IDs for the edge_properties and then assign them to an object the way I did for the nodes. I am not sure of your full requirements, but if the list you gave covers all the categories, this will be sufficient.
def transform(input_list):
    node_properties = []
    edge_properties = []
    for input_obj in input_list:
        new_obj = {}
        if input_obj['T.category'] in ('Employee', 'Company', 'ParentCompany'):
            new_obj['property_id'] = input_obj['T.id']
            new_obj['property_category'] = input_obj['T.category']
            new_obj['node'] = {"node_id": input_obj['node_id'][0]}
            if "employeecount" in input_obj:
                new_obj['employeecount'] = input_obj['employeecount']
            if "childcompany" in input_obj:
                new_obj['childcompany'] = input_obj['childcompany']
            node_properties.append(new_obj)
        else:  # you can use elif on specific categories instead if there are other outliers
            # split the "id~<from>_to_<to>" string here (regex or otherwise) and
            # fill in "from", "to" and "property_id" like above
            edge_properties.append(new_obj)
    return [node_properties, edge_properties]
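For the edge branch, a hedged sketch of the ID splitting the comment alludes to; split_edge_id is my name for it, and it assumes every edge id looks exactly like "id~<from>_to_<to>":
import re

def split_edge_id(t_id):
    # "id~Employee_11_to_Company_12" -> ("Employee_11", "Company_12")
    m = re.match(r"^id~(.+)_to_(.+)$", t_id)
    return (m.group(1), m.group(2)) if m else (None, None)

# inside the else branch above:
# new_obj['from'], new_obj['to'] = split_edge_id(input_obj['T.id'])
# new_obj['property_id'] = input_obj['T.id'][len("id~"):]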

Python, parsing JSON-order by/sort by

I have this JSON data:
"InstanceProfileList": [
{
"InstanceProfileId": "AIPAI6ZC646GGONRADRSK",
"Roles": [
{
"AssumeRolePolicyDocument": {
"Version": "2012-10-17",
"Statement": [
{
"Action": "sts:AssumeRole",
"Effect": "Allow",
"Principal": {
"Service": [
"ec2.amazonaws.com",
"ssm.amazonaws.com"
]
}
}
]
},
"RoleId": "AROAJMI3DEQ4AW5JJMFII",
"CreateDate": "2018-03-23T15:23:28Z",
"RoleName": "ec2ssmMaintWindow",
"Path": "/",
"Arn": "arn:aws:iam::279052847476:role/ec2ssmMaintWindow"
}
]
I use the following code to parse it:
def get_user_group_service(element):
    s = ''
    for e in element['AssumeRolePolicyDocument']['Statement']:
        p = e['Principal']
        if 'Federated' in p:
            s += p['Federated']
        if 'Service' in p:
            obj = p['Service']
            if type(obj) is str:
                s += obj           # element is a string
            else:
                s += ''.join(obj)  # element is an array of strings
        if 'AWS' in p:
            s += p['AWS']
    return s
Now, the issue is that sometimes the Service element contains:
ec2.amazonaws.com ssm.amazonaws.com
and sometimes:
ssm.amazonaws.com ec2.amazonaws.com
The order is different every time.
It really doesn't matter in which order it will be shown, I just need the output to be consistent. Is there any way to order this output alphabetically?
I googled it and it seems obj.sort() will fix it but don't know how to apply it.
From what I understand, you want the space-separated parts sorted. Here is my approach.
x = 'ec2ssmMaintWindow,AmazonSSMMaintenanceWindowRole,ec2.amazonaws.com ssm.amazonaws.com'
sorted_only_space_separated = [' '.join(z for z in sorted(y.split(' '), reverse=True)) for y in x.split(',')]
print(','.join(str(i) for i in sorted_only_space_separated))
Output:
ec2ssmMaintWindow,AmazonSSMMaintenanceWindowRole,ssm.amazonaws.com ec2.amazonaws.com
Let me know if it helps.
The problem is that your strings mix upper case and lower case; use the key parameter of the sorted method to sort the data irrespective of case:
Services = ["ec2ssmMaintWindow", "AmazonSSMMaintenanceWindowRole", "ssm.amazonaws.com", "ec2.amazonaws.com"]
s = ""
s += " ".join(sorted(Services, key=lambda x: x.lower()))
Output:
AmazonSSMMaintenanceWindowRole ec2.amazonaws.com ec2ssmMaintWindow ssm.amazonaws.com
If Services always has these 4 values, sorting the list makes the indexes stable across runs, so you can simply access each value by its index.
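For example (a small sketch of that index trick):
services = sorted(Services, key=lambda x: x.lower())
# stable order: ['AmazonSSMMaintenanceWindowRole', 'ec2.amazonaws.com',
#                'ec2ssmMaintWindow', 'ssm.amazonaws.com']
role_name = services[2]  # always "ec2ssmMaintWindow" for this fixed set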
Thanks everyone, I found another solution. My issue was that the code in the question is native Python 3 code (I ran it under Python 2.7), and obj.sort() threw the error
'unicode' object has no attribute 'sort'
so I corrected the code to check whether obj is a str/unicode:
def get_user_group_service(element):
    s = ''
    for e in element['AssumeRolePolicyDocument']['Statement']:
        p = e['Principal']
        if 'Federated' in p:
            s += p['Federated']
        if 'Service' in p:
            obj = p['Service']
            if type(obj) in (str, unicode):
                s += obj           # element is a string
            else:
                obj.sort()
                s += ''.join(obj)  # element is an array of strings
        if 'AWS' in p:
            s += p['AWS']
    return s
Now the values are alphabetically sorted on every iteration.
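A sketch that works under both Python 2 and 3, in case the script ever moves interpreters (the try/except detects whether the unicode type exists):
try:
    string_types = (str, unicode)  # Python 2
except NameError:
    string_types = (str,)          # Python 3

# in get_user_group_service:
# if isinstance(obj, string_types):
#     s += obj
# else:
#     s += ''.join(sorted(obj))    # sorted() returns a new list, so this also works on tuples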

Dynamic approach to iterate nested dict and list of dict in Python

I am looking for a dynamic approach to solve my issue. I have a very complex structure, but for simplicity
I have a dictionary structured like this:
dict1 = {
    "outer_key1": {
        "total": 5               # 1. I want the value of "total"
    },
    "outer_key2": [{
        "type": "ABC",           # 2. I want to count whole structures where type == "ABC"
        "comments": {
            "nested_comment": [  # 3. Count the dicts inside this list.
                {
                    "key": "value",
                    "id": 1
                },
                {
                    "key": "value",
                    "id": 2
                }
            ]
        }
    }]
}
I want to iterate this dictionary and solve #1, #2 and #3.
My attempt at solving #1 and #3:
def getTotal(dict1):
    # for solving #1
    for key, val in dict1.iteritems():
        val = dict1[key]
        if isinstance(val, dict):
            for k1 in val:
                if k1 == 'total':
                    total = val[k1]
                    print total  # gives output 5
        # for solving #3
        if isinstance(val, list):
            print len(val[0]['comments']['nested_comment'])  # gives output 2
            # How can I get this dynamically?
Output:
total=5
2
Que 1: What is a pythonic way to get the total number of dictionaries under the "nested_comment" list?
Que 2: How can I get the total count where type == "ABC"? (Note: type is a nested key under "outer_key2".)
Que 1: What is a pythonic way to get the total number of dictionaries under the "nested_comment" list?
Count them directly with a generator expression; note that collections.Counter won't work here, since dicts aren't hashable and can't be Counter keys:
my_list = [{'hello': 'world'}, {'foo': 'bar'}, 1, 2, 'hello']
dict_count = sum(1 for x in my_list if isinstance(x, dict))  # 2
Que 2: How can I get the total count where type == "ABC"? (Note: type is a nested key under "outer_key2".)
It's not clear what you're asking for here. If by "total count", you are referring to the total number of comments in all dicts where "type" equals "ABC":
abcs = [x for x in dict1['outer_key2'] if x['type'] == 'ABC']
comment_count = sum([len(x['comments']['nested_comment']) for x in abcs])
But I've gotta say, that is some weird data you're dealing with.
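If you want something truly dynamic (no hard-coded paths), here is a recursive sketch that walks any mix of dicts and lists and counts every dict whose type is "ABC"; count_type is my name, not from the question:
def count_type(node, wanted="ABC"):
    if isinstance(node, dict):
        hit = 1 if node.get("type") == wanted else 0
        return hit + sum(count_type(v, wanted) for v in node.values())
    if isinstance(node, list):
        return sum(count_type(v, wanted) for v in node)
    return 0

print(count_type(dict1))  # 1 for the sample dict1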
You got answers for #1 and #3; check this too:
# dict1 as defined in the question

print "total: ", dict1['outer_key1']['total']
print "No of nested comments: ", len(dict1['outer_key2'][0]['comments']['nested_comment'])
Assuming the structure below for outer_key2, this is how you get the total number of entries with type = 'ABC':
dict2 = {
    "outer_key1": {
        "total": 5
    },
    "outer_key2": [
        {
            "type": "ABC",
            "comments": {'...'}
        },
        {
            "type": "ABC",
            "comments": {'...'}
        },
        {
            "type": "ABC",
            "comments": {'...'}
        }
    ]
}
i = 0
k = 0
while k < len(dict2['outer_key2']):
    if dict2['outer_key2'][k]['type'] == 'ABC':
        i += 1
    k += 1

print ("\r\nNo of dictionaries with type = 'ABC' : "), i
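The same count as a one-line sketch, using sum() over a generator expression:
count_abc = sum(1 for entry in dict2['outer_key2'] if entry['type'] == 'ABC')  # 3 for dict2 above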
