I am trying to create a list of all possible paths in a tree. I have the following structure given (a subset from the DB):
text = """
1,Product1,INVOICE_FEE,
3,Product3,INVOICE_FEE,
7,Product7,DEFAULT,
2,Product2,DEFAULT,7
4,Product4,DEFAULT,7
5,Product5,DEFAULT,2
"""
where the columns are: ID, product-name, invoice-type, reference-to-parent-ID.
I would like to create a list of all possible paths, as in this example:
[[Product1],[Product3],[Product7,Product2,Product5],[Product7,Product4]]
I do the following:
from collections import defaultdict

lines = [l.strip() for l in text.strip().splitlines()]
hierarchy = [tuple(l.split(',')) for l in lines]
parents = defaultdict(list)
for p in hierarchy:
    parents[p[3]].append(p)
to create the tree, and then I would like to find all paths:
def pathsMet(parents, node=''):
    childNodes = parents.get(node)
    if not childNodes:
        return []
    paths = []
    for ID, productName, invoiceType, parentID in childNodes:
        paths.append([productName] + pathsMet(parents, ID))
    return paths
print(pathsMet(parents))
The result which I get is the following (note the unwanted nesting):
[['Product1'], ['Product3'], ['Product7', ['Product2', ['Product5']], ['Product4']]]
How can I correct the code to get the following output?
[['Product1'], ['Product3'], ['Product7', 'Product2', 'Product5'], ['Product7', 'Product4']]
You can do this by first building a tree of your data nodes and then going through all branches to build a list of paths:
text = """
1,Product1,INVOICE_FEE,
3,Product3,INVOICE_FEE,
7,Product7,DEFAULT,
2,Product2,DEFAULT,7
4,Product4,DEFAULT,7
5,Product5,DEFAULT,2
"""
data = [line.split(",") for line in text.split("\n") if line.strip()]
keys = {k: name for k, name, *_ in data}  # to get names from keys
tree = {k: {} for k in keys}              # initial tree structure with all keys
root = tree[""] = dict()                  # tree root
for k, _, _, parent in data:
    tree[parent].update({k: tree[k]})     # connect children to their parent

nodes = [[k] for k in root]               # cumulative paths of keys
paths = []                                # list of paths by name
while nodes:
    kPath = nodes.pop(0)
    subs = tree[kPath[-1]]                # get children
    if subs: nodes.extend(kPath + [k] for k in subs)  # accumulate nodes
    else:    paths.append([keys[k] for k in kPath])   # record path at leaf node
print(paths)
Output:
[['Product1'], ['Product3'], ['Product7', 'Product4'], ['Product7', 'Product2', 'Product5']]
Your code is nearly correct, but it appends the entire list of sub-paths as a single element instead of extending each sub-path, and the base case must return [[]] (a list containing one empty path) rather than [] so that leaf nodes still produce a path.
Try this modification:
def pathsMet(parents, node=''):
    childNodes = parents.get(node)
    if not childNodes:
        return [[]]
    paths = []
    for ID, productName, invoiceType, parentID in childNodes:
        for p in pathsMet(parents, ID):
            paths.append([productName] + p)
    return paths
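For completeness, here is the question's setup and the corrected function assembled into one runnable sketch, using the sample data from the question:

```python
from collections import defaultdict

text = """
1,Product1,INVOICE_FEE,
3,Product3,INVOICE_FEE,
7,Product7,DEFAULT,
2,Product2,DEFAULT,7
4,Product4,DEFAULT,7
5,Product5,DEFAULT,2
"""

# Index every record under its parent ID ('' for root records)
parents = defaultdict(list)
for line in text.strip().splitlines():
    record = tuple(line.strip().split(','))
    parents[record[3]].append(record)

def pathsMet(parents, node=''):
    childNodes = parents.get(node)
    if not childNodes:
        return [[]]  # one empty path, so leaf nodes still produce a path
    paths = []
    for ID, productName, invoiceType, parentID in childNodes:
        for p in pathsMet(parents, ID):
            paths.append([productName] + p)
    return paths

print(pathsMet(parents))
# [['Product1'], ['Product3'], ['Product7', 'Product2', 'Product5'], ['Product7', 'Product4']]
```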
I have a list of tuples. This could look like this:
tuple_list = [
('species', 'flower'),
('flower', 'dorsal flower'),
('dorsal flower', 'pink'),
('pink', 'white'),
('pink', 'greenish'),
('species', 'branch'),
]
Note: the tuples are not sorted, and in this example their order could vary. The depth can also vary.
I would like to create a dict of dict that would look like this:
dod = {'species': {'branch': {}, 'flower': {'dorsal flower': {'pink': {'white': {}, 'greenish': {}}}}}}
In this case I want 'species' at the top level, as there is no tuple in which 'species' appears as a child. E.g. 'species' contains 'flower' and 'branch', and so on.
I feel this entire process can be wrapped in a simple recursive function (e.g. using yield from) instead of writing an elaborate for loop that iterates over all values.
In the end, I want to use this function to create a list of lists that contains the proper values (kudos to @Stef for this function):
def undict_to_lists(d, acc=[]):
    if d == {}:
        yield acc
    else:
        for k, v in d.items():
            yield from undict_to_lists(v, acc + [k])
This would result in the following:
print(list(undict_to_lists(dod)))
[['species', 'branch'],
['species', 'flower', 'dorsal flower', 'pink', 'white'],
['species', 'flower', 'dorsal flower', 'pink', 'greenish']]
Thanks for thinking along! All suggestions are welcome.
You could first create a dictionary key (with {} as value) for each key that occurs in the input. Then iterate those tuples to find the value that corresponds to the start key, and populate the sub dictionary with the end key, and the subdictionary that corresponds to that end key.
Finally, derive which is the root by excluding all those nodes that are children.
tuple_list = [('species', 'flower'), ('flower', 'dorsal flower'), ('dorsal flower', 'pink'), ('pink', 'white'), ('pink', 'greenish'), ('species', 'branch')]

d = {key: {} for pair in tuple_list for key in pair}
for start, end in tuple_list:
    d[start][end] = d[end]

root = None
for key in set(d.keys()).difference(end for _, end in tuple_list):
    root = {key: d[key]}
print(root)
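As a sanity check, the same steps can be reassembled into a standalone sketch that asserts the expected nesting (variable names here are illustrative):

```python
tuple_list = [('species', 'flower'), ('flower', 'dorsal flower'), ('dorsal flower', 'pink'),
              ('pink', 'white'), ('pink', 'greenish'), ('species', 'branch')]

# One (initially empty) dict per key, then link each child dict into its parent
d = {key: {} for pair in tuple_list for key in pair}
for start, end in tuple_list:
    d[start][end] = d[end]

# The root is the only key that never appears as a child
children = {end for _, end in tuple_list}
(root_key,) = set(d) - children
tree = {root_key: d[root_key]}

print(tree)
```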
tuple_list = [
('species', 'flower'),
('flower', 'dorsal flower'),
('dorsal flower', 'pink'),
('pink', 'white'),
('pink', 'greenish'),
('species', 'branch'),
]
# Create the nested dict, using a "master" dict
# to quickly look up nodes in the nested dict.
nested_dict, master_dict = {}, {}
for a, b in tuple_list:
    if a not in master_dict:
        nested_dict[a] = master_dict[a] = {}
    master_dict[a][b] = master_dict[b] = {}

# Flatten into lists.
def flatten_dict(d):
    if not d:
        return [[]]
    return [[k] + f for k, v in d.items() for f in flatten_dict(v)]
print(flatten_dict(nested_dict))
#[['species', 'flower', 'dorsal flower', 'pink', 'white'],
# ['species', 'flower', 'dorsal flower', 'pink', 'greenish'],
# ['species', 'branch']]
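Reassembled into one standalone block, the claimed output can be verified directly:

```python
tuple_list = [
    ('species', 'flower'),
    ('flower', 'dorsal flower'),
    ('dorsal flower', 'pink'),
    ('pink', 'white'),
    ('pink', 'greenish'),
    ('species', 'branch'),
]

# Build the nested dict via a "master" lookup table of node dicts
nested_dict, master_dict = {}, {}
for a, b in tuple_list:
    if a not in master_dict:
        nested_dict[a] = master_dict[a] = {}
    master_dict[a][b] = master_dict[b] = {}

def flatten_dict(d):
    if not d:
        return [[]]
    return [[k] + f for k, v in d.items() for f in flatten_dict(v)]

print(flatten_dict(nested_dict))
```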
Here's another alternative (loosely based on @trincot's answer) that uses a defaultdict to simplify the code slightly and which figures out the root of the tree as it goes through the list of tuples:
from collections import defaultdict

d = defaultdict(dict)
root = tuple_list[0][0]  # first parent value
for parent, child in tuple_list:
    d[parent][child] = d[child]
    if root == child:
        root = parent

result = {root: d[root]}
Output:
{
"species": {
"branch": {},
"flower": {
"dorsal flower": {
"pink": {
"greenish": {},
"white": {}
}
}
}
}
}
Alternative:
def find_node(tree, parent, child):
    if parent in tree:
        tree[parent][child] = {}
        return True
    for node in tree.values():
        if find_node(node, parent, child):
            return True
    # new node
    tree[parent] = {child: {}}

root = {}
for parent, child in tuple_list:
    find_node(root, parent, child)
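A quick standalone check of this alternative (it assumes, as in the question's tuple_list, that a parent is inserted into the tree before it appears again deeper down):

```python
tuple_list = [
    ('species', 'flower'),
    ('flower', 'dorsal flower'),
    ('dorsal flower', 'pink'),
    ('pink', 'white'),
    ('pink', 'greenish'),
    ('species', 'branch'),
]

def find_node(tree, parent, child):
    # attach child where parent already lives in the tree...
    if parent in tree:
        tree[parent][child] = {}
        return True
    for node in tree.values():
        if find_node(node, parent, child):
            return True
    # ...or start a new top-level node
    tree[parent] = {child: {}}

root = {}
for parent, child in tuple_list:
    find_node(root, parent, child)

print(root)
```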
How can I get all root-to-leaf paths (as lists) from a given dictionary in Python?
My dictionary input:
node_data = {
    "1": ["2", "3", "4", "5"],
    "2": ["7", "8"],
    "3": ["6"],
    "4": [],
    "5": [],
    "6": ["11"],
    "7": [],
    "8": ["9", "10"],
    "9": ["12"],
    "10": [],
    "11": ["13"],
    "12": [],
    "13": ["14"],
    "14": []
}
Desired output (sorted by path length, longest first):
["1","3","6","11","13","14"]
["1","2","8","9","12"]
["1","2","8","10"]
["1","2","7"]
["1","4"]
["1","5"]
I did something like this and it seems to work:
def recurse(current, nodes, path, all_path):
    path.append(current)
    if nodes[current]:
        for child in nodes[current]:
            recurse(child, nodes, path.copy(), all_path)
    else:
        all_path.append(path)
    return all_path
if __name__ == '__main__':
    node_data = {
        "1": ["2", "3", "4", "5"],
        "2": ["7", "8"],
        "3": ["6"],
        "4": [],
        "5": [],
        "6": ["11"],
        "7": [],
        "8": ["9", "10"],
        "9": ["12"],
        "10": [],
        "11": ["13"],
        "12": [],
        "13": ["14"],
        "14": []
    }
    toto = recurse("1", node_data, [], [])
    toto.sort(key=len, reverse=True)
    print(toto)
Hope it'll help you
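If the tree can get very deep, the same traversal can also be written iteratively with an explicit stack, which sidesteps Python's recursion limit. A sketch (all_paths is an illustrative name):

```python
node_data = {
    "1": ["2", "3", "4", "5"], "2": ["7", "8"], "3": ["6"], "4": [], "5": [],
    "6": ["11"], "7": [], "8": ["9", "10"], "9": ["12"], "10": [],
    "11": ["13"], "12": [], "13": ["14"], "14": [],
}

def all_paths(nodes, start):
    paths = []
    stack = [[start]]          # each stack entry is a partial root path
    while stack:
        path = stack.pop()
        children = nodes[path[-1]]
        if children:
            stack.extend(path + [c] for c in children)
        else:
            paths.append(path)  # leaf reached: record the full path
    return paths

result = sorted(all_paths(node_data, "1"), key=len, reverse=True)
print(result)
```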
I am trying to read an XML file and convert it to a pandas DataFrame. However, it returns empty data.
This is a sample of the XML structure:
<Instance ID="1">
<MetaInfo StudentID ="DTSU040" TaskID="LP03_PR09.bLK.sh" DataSource="DeepTutorSummer2014"/>
<ProblemDescription>A car windshield collides with a mosquito, squashing it.</ProblemDescription>
<Question>How does this work tion?</Question>
<Answer>tthis is my best </Answer>
<Annotation Label="correct(0)|correct_but_incomplete(1)|contradictory(0)|incorrect(0)">
<AdditionalAnnotation ContextRequired="0" ExtraInfoInAnswer="0"/>
<Comments Watch="1"> The student forgot to tell the opposite force. Opposite means opposite direction, which is important here. However, one can argue that the opposite is implied. See the reference answers.</Comments>
</Annotation>
<ReferenceAnswers>
1: Since the windshield exerts a force on the mosquito, which we can call action, the mosquito exerts an equal and opposite force on the windshield, called the reaction.
</ReferenceAnswers>
</Instance>
I have tried this code, however it's not working on my side; it returns an empty DataFrame.
import pandas as pd
import xml.etree.ElementTree as et

xtree = et.parse("grade_data.xml")
xroot = xtree.getroot()

df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", 'Question', 'Answer',
           'ContextRequired', 'ExtraInfoInAnswer', 'Comments', 'Watch', 'ReferenceAnswers']

rows = []
for node in xroot:
    s_name = node.attrib.get("ID")
    s_student = node.find("StudentID")
    s_task = node.find("TaskID")
    s_source = node.find("DataSource")
    s_desc = node.find("ProblemDescription")
    s_question = node.find("Question")
    s_ans = node.find("Answer")
    s_label = node.find("Label")
    s_contextrequired = node.find("ContextRequired")
    s_extraInfoinAnswer = node.find("ExtraInfoInAnswer")
    s_comments = node.find("Comments")
    s_watch = node.find("Watch")
    s_referenceAnswers = node.find("ReferenceAnswers")
    rows.append({"ID": s_name, "StudentID": s_student, "TaskID": s_task,
                 "DataSource": s_source, "ProblemDescription": s_desc,
                 "Question": s_question, "Answer": s_ans, "Label": s_label,
                 "s_contextrequired": s_contextrequired, "ExtraInfoInAnswer": s_extraInfoinAnswer,
                 "Comments": s_comments, "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers,
                 })

out_df = pd.DataFrame(rows, columns=df_cols)
The problem in your solution is that the element data extraction is not done properly. The xml you mentioned in the question is nested in several layers, and that is why we need to read and extract the data recursively. The following solution should give you what you need in this case, although I would encourage you to look at this article and the python documentation for more clarity.
Method: 1
import numpy as np
import pandas as pd
import xml.etree.ElementTree as ET

def xml2df(xml_source, df_cols, source_is_file=False, show_progress=True):
    """Parse the input XML source and store the result in a pandas
    DataFrame with the given columns.

    For xml_source = xml_file, set: source_is_file = True
    For xml_source = xml_string, set: source_is_file = False

    <element attribute_key1=attribute_value1, attribute_key2=attribute_value2>
        <child1>Child 1 Text</child1>
        <child2>Child 2 Text</child2>
        <child3>Child 3 Text</child3>
    </element>

    Note that for an xml structure as shown above, the children of an
    element can be accessed with list(element), its attributes with
    element.items(), any text associated with the <element> tag with
    element.text, and the name of the tag itself with element.tag.
    """
    if source_is_file:
        xtree = ET.parse(xml_source)       # xml_source = xml_file
        xroot = xtree.getroot()
    else:
        xroot = ET.fromstring(xml_source)  # xml_source = xml_string

    consolidator_dict = dict()
    default_instance_dict = {label: None for label in df_cols}

    def get_children_info(children, instance_dict):
        # We avoid element.getchildren() as it is deprecated;
        # list(element) gives the same list of child elements.
        for child in children:
            if len(list(child)) > 0:
                instance_dict = get_children_info(list(child), instance_dict)
            if len(list(child.keys())) > 0:
                items = child.items()
                instance_dict.update({key: value for (key, value) in items})
            instance_dict.update({child.tag: child.text})
        return instance_dict

    # Loop over all instances
    for instance in list(xroot):
        instance_dict = default_instance_dict.copy()
        ikey, ivalue = instance.items()[0]  # The first attribute is "ID"
        instance_dict.update({ikey: ivalue})
        if show_progress:
            print('{}: {}={}'.format(instance.tag, ikey, ivalue))
        # Loop inside every instance
        instance_dict = get_children_info(list(instance), instance_dict)
        consolidator_dict[ivalue] = instance_dict.copy()

    df = pd.DataFrame(consolidator_dict).T
    df = df[df_cols]
    return df
Run the following to generate the desired output.
xml_source = r'grade_data.xml'
df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", "Question", "Answer",
"ContextRequired", "ExtraInfoInAnswer", "Comments", "Watch", 'ReferenceAnswers']
df = xml2df(xml_source, df_cols, source_is_file = True)
df
Method: 2
Given you have the xml_string, you could convert xml >> dict >> dataframe. Run the following to get the desired output.
Note: You will need to install xmltodict to use Method 2. This method is inspired by the solution suggested by @martin-blech in How to convert XML to JSON in Python?. Kudos to @martin-blech for making it.
pip install -U xmltodict
Solution
def read_recursively(x, instance_dict):
    txt = ''
    for key in x.keys():
        k = key.replace("@", "")  # xmltodict prefixes attribute keys with "@"
        if k in df_cols:
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)
            instance_dict.update({k: x.get(key)})
        else:
            # dig deeper if value is another dict
            if isinstance(x.get(key), dict):
                instance_dict, txt = read_recursively(x.get(key), instance_dict)
            # add simple text associated with element
            if k == '#text':
                txt = x.get(key)
            # update text to corresponding parent element
            if (k != '#text') and (txt != ''):
                instance_dict.update({k: txt})
    return (instance_dict, txt)
You will need the function read_recursively() given above. Now run the following.
import xmltodict, json

o = xmltodict.parse(xml_string)  # INPUT: XML_STRING
#print(json.dumps(o))            # uncomment to see the xml converted to json

consolidated_dict = dict()
oi = o['Instances']['Instance']
for x in oi:
    instance_dict = dict()
    instance_dict, _ = read_recursively(x, instance_dict)
    consolidated_dict.update({x.get("@ID"): instance_dict.copy()})

df = pd.DataFrame(consolidated_dict).T
df = df[df_cols]
df
Several issues:
1. Calling .find on the loop variable, node, expects a child node to exist: current_node.find('child_of_current_node'). However, since all the nodes here are children of the root, they do not maintain their own children, so no loop is required.
2. Not checking for NoneType, which results from calling find() on missing nodes and prevents retrieving .tag, .text, or other attributes.
3. Not retrieving node content with .text, without which the <Element ...> object itself is stored.
Consider this adjustment using the ternary conditional expression, a if condition else b, to ensure each variable has a value regardless:
rows = []

s_name = xroot.attrib.get("ID")
s_student = xroot.find("StudentID").text if xroot.find("StudentID") is not None else None
s_task = xroot.find("TaskID").text if xroot.find("TaskID") is not None else None
s_source = xroot.find("DataSource").text if xroot.find("DataSource") is not None else None
s_desc = xroot.find("ProblemDescription").text if xroot.find("ProblemDescription") is not None else None
s_question = xroot.find("Question").text if xroot.find("Question") is not None else None
s_ans = xroot.find("Answer").text if xroot.find("Answer") is not None else None
s_label = xroot.find("Label").text if xroot.find("Label") is not None else None
s_contextrequired = xroot.find("ContextRequired").text if xroot.find("ContextRequired") is not None else None
s_extraInfoinAnswer = xroot.find("ExtraInfoInAnswer").text if xroot.find("ExtraInfoInAnswer") is not None else None
s_comments = xroot.find("Comments").text if xroot.find("Comments") is not None else None
s_watch = xroot.find("Watch").text if xroot.find("Watch") is not None else None
s_referenceAnswers = xroot.find("ReferenceAnswers").text if xroot.find("ReferenceAnswers") is not None else None

rows.append({"ID": s_name, "StudentID": s_student, "TaskID": s_task,
             "DataSource": s_source, "ProblemDescription": s_desc,
             "Question": s_question, "Answer": s_ans, "Label": s_label,
             "ContextRequired": s_contextrequired, "ExtraInfoInAnswer": s_extraInfoinAnswer,
             "Comments": s_comments, "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers
             })

out_df = pd.DataFrame(rows, columns=df_cols)
Alternatively, run a more dynamic version assigning to an inner dictionary using the iterator variable:
rows = []
for node in xroot:
    inner = {}
    inner[node.tag] = node.text
    rows.append(inner)

out_df = pd.DataFrame(rows, columns=df_cols)
Or a list/dict comprehension:
rows = [{node.tag: node.text} for node in xroot]
out_df = pd.DataFrame(rows, columns=df_cols)
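One more pitfall worth noting: in the sample XML, StudentID, TaskID, and DataSource are attributes of the MetaInfo child (and Label and Watch are attributes of Annotation and Comments), not elements of their own, so even a guarded find().text returns None for them. A sketch that reads them from .attrib instead (element and attribute names taken from the sample above, trimmed for brevity):

```python
import xml.etree.ElementTree as ET

xml_string = """<Instance ID="1">
<MetaInfo StudentID="DTSU040" TaskID="LP03_PR09.bLK.sh" DataSource="DeepTutorSummer2014"/>
<Question>How does this work tion?</Question>
<Annotation Label="correct(0)|correct_but_incomplete(1)|contradictory(0)|incorrect(0)">
<Comments Watch="1">The student forgot to tell the opposite force.</Comments>
</Annotation>
</Instance>"""

node = ET.fromstring(xml_string)

def child_text(parent, tag):
    # guard against missing children before touching .text
    el = parent.find(tag)
    return el.text if el is not None else None

row = {"ID": node.attrib.get("ID")}
row.update(node.find("MetaInfo").attrib)             # attributes, not child elements
row["Question"] = child_text(node, "Question")
row["Label"] = node.find("Annotation").attrib.get("Label")
row["Watch"] = node.find("Annotation/Comments").attrib.get("Watch")  # XPath-style path

print(row)
```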
I really enjoyed the graph traversals of ArangoDB, which allow me to visit any path or node with little query sweat. However, I am stuck with a context that is already implemented in neo4j. I believe anyone using ArangoDB might find this useful for future operations.
I have successfully imported the list of product categories (the Google product taxonomy) into an ArangoDB database, in a vertex collection named taxonomy and an edge collection named catof.
If I'm correct, with this query I am able to fetch all vertices and linked edges:
FOR t IN taxonomy
  FOR c IN INBOUND t catof
    SORT c.name ASC
    RETURN {c}
While feeding the taxonomy documents, a parent vertex does not get an edge if either of the _from, _to parts is null. I should mention that I am using flask-script and python-arango for these operations; they have been helpful.
manager = Manager(app)
tax_item = storegraph.vertex_collection('taxonomy')
catof = storegraph.edge_collection('catof')

@manager.command
def fetch_tree():
    dictionary = {}
    with open('input.csv') as file:
        for row in file.readlines():
            things = row.strip().split(' > ')
            dictionary[things[0]] = None
            i, j = 0, 1
            while j < len(things):
                parent, child = things[i], things[j]
                dictionary[child] = parent
                i += 1
                j += 1
    # for key in dictionary:
    #     tax_item.insert({"name": key})
    for child, parent in dictionary.iteritems():
        # edge_collection.insert_edge({from: vertex_collection/parent,
        #                              to: vertex_collection/child})
        chl, par = tax_item.find({'name': child}), tax_item.find({'name': parent})
        c, p = [h for h in chl], [a for a in par]
        if c and p:
            # print 'Child: %s parent: %s' % (c[0]['_id'], p[0]['_id'])
            catof.insert({'_from': c[0]['_id'], '_to': p[0]['_id']})
After the operation, I have the following sample vertices:
[{"_key": "5246198", "_id": "taxonomy/5246198", "name": "Computers"},
 {"_key": "5252911", "_id": "taxonomy/5252911", "name": "Hardwares"},
 {"_key": "5257587", "_id": "taxonomy/5257587", "name": "Hard disk"}]
and edges:
[{"_key": "5269883", "_id": "catof/5269883", "_from": "taxonomy/5246198", "_to": "taxonomy/5252911"},
 {"_key": "5279833", "_id": "catof/5279833", "_from": "taxonomy/5252911", "_to": "taxonomy/5257587"}]
Now my questions are:
How do I fetch only parent documents, i.e. Computers?
From the parent documents, how do I print all their children, in the format: Computers, Hardwares, Hard disk?
I have a list of paths and contents similar to this:
paths = [
("/test/file1.txt", "content1"),
("/test/file2.txt", "content2"),
("/file3.txt", "content3"),
("/test1/test2/test3/file5.txt", "content5"),
("/test2/file4.txt", "content4")
]
I would like to transform this path list to:
structure = {
    "file3.txt": "content3",
    "test": {
        "file1.txt": "content1",
        "file2.txt": "content2"
    },
    "test1": {
        "test2": {
            "test3": {
                "file5.txt": "content5"
            }
        }
    },
    "test2": {
        "file4.txt": "content4"
    }
}
Is there any simple solution to this problem?
Since the file paths can be of arbitrary depth, we need something scalable.
Here is a recursive approach - splitting the path recursively until we get to the root /:
import os

paths = [
    ("/test/file1.txt", "content1"),
    ("/test/file2.txt", "content2"),
    ("/file3.txt", "content3"),
    ("/test1/test2/test3/file5.txt", "content5"),
    ("/test2/file4.txt", "content4")
]

def deepupdate(original, update):
    for key, value in original.items():
        if key not in update:
            update[key] = value
        elif isinstance(value, dict):
            deepupdate(value, update[key])
    return update

def traverse(key, value):
    directory = os.path.dirname(key)
    filename = os.path.basename(key)
    if directory == "/":
        return value if isinstance(value, dict) else {filename: value}
    else:
        path, directory = os.path.split(directory)
        return traverse(path, {directory: {filename: value}})

result = {}
for key, value in paths:
    result = deepupdate(result, traverse(key, value))

print(result)
Using deepupdate() function suggested here.
It prints:
{'file3.txt': 'content3',
'test': {'file1.txt': 'content1', 'file2.txt': 'content2'},
'test1': {'test2': {'test3': {'file5.txt': 'content5'}}},
'test2': {'file4.txt': 'content4'}}
I think .setdefault() will do the job:
paths = [
    ("/test/file1.txt", "content1"),
    ("/test/file2.txt", "content2"),
    ("/file3.txt", "content3"),
    ("/test2/file4.txt", "content4")
]

dirs = {}
for p in paths:
    current = dirs
    ps = p[0].split('/')
    for d in ps[:-1]:
        if d:
            current = current.setdefault(d, {})
    current[ps[-1]] = p[1]

print(dirs)
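The same idea can be condensed with functools.reduce walking the directory parts (a sketch, equivalent in behavior to the loop above):

```python
from functools import reduce

paths = [
    ("/test/file1.txt", "content1"),
    ("/test/file2.txt", "content2"),
    ("/file3.txt", "content3"),
    ("/test1/test2/test3/file5.txt", "content5"),
    ("/test2/file4.txt", "content4"),
]

tree = {}
for path, content in paths:
    *dirs, filename = [part for part in path.split("/") if part]
    # descend (creating dicts as needed), then attach the file at the end
    reduce(lambda node, d: node.setdefault(d, {}), dirs, tree)[filename] = content

print(tree)
```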
Try using recursion:
paths = [
    ("/test/file1.txt", "content1"),
    ("/test/file2.txt", "content2"),
    ("/file3.txt", "content3"),
    ("/test2/file4.txt", "content4"),
    ('/test1/test2/test3/file.txt', 'content'),
    ('/test10/test20/test30/test40/file.txt', 'content100')
]

def create_structure(elems, count, mylen, p_1, var):
    if mylen <= 2:
        var[elems[count]] = p_1
        return
    create_structure(elems, count + 1, mylen - 1, p_1, var.setdefault(elems[count], {}))

structure = {}
for p in paths:
    elems = p[0].split('/')
    create_structure(elems, 1, len(elems), p[1], structure)
print(structure)