parse xml to pandas data frame in python

parse xml to pandas data frame in python - python

I am trying to read the XML file and convert it to pandas. However it returns empty data
This is the sample of xml structure:
<Instance ID="1">
<MetaInfo StudentID ="DTSU040" TaskID="LP03_PR09.bLK.sh" DataSource="DeepTutorSummer2014"/>
<ProblemDescription>A car windshield collides with a mosquito, squashing it.</ProblemDescription>
<Question>How does this work tion?</Question>
<Answer>tthis is my best </Answer>
<Annotation Label="correct(0)|correct_but_incomplete(1)|contradictory(0)|incorrect(0)">
<AdditionalAnnotation ContextRequired="0" ExtraInfoInAnswer="0"/>
<Comments Watch="1"> The student forgot to tell the opposite force. Opposite means opposite direction, which is important here. However, one can argue that the opposite is implied. See the reference answers.</Comments>
</Annotation>
<ReferenceAnswers>
1: Since the windshield exerts a force on the mosquito, which we can call action, the mosquito exerts an equal and opposite force on the windshield, called the reaction.
</ReferenceAnswers>
</Instance>
I have tried this code, however it's not working on my side. It returns empty dataframe.
import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("grade_data.xml")
xroot = xtree.getroot()
df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", 'Question', 'Answer',
'ContextRequired', 'ExtraInfoInAnswer', 'Comments', 'Watch', 'ReferenceAnswers']
rows = []
for node in xroot:
s_name = node.attrib.get("ID")
s_student = node.find("StudentID")
s_task = node.find("TaskID")
s_source = node.find("DataSource")
s_desc = node.find("ProblemDescription")
s_question = node.find("Question")
s_ans = node.find("Answer")
s_label = node.find("Label")
s_contextrequired = node.find("ContextRequired")
s_extraInfoinAnswer = node.find("ExtraInfoInAnswer")
s_comments = node.find("Comments")
s_watch = node.find("Watch")
s_referenceAnswers = node.find("ReferenceAnswers")
rows.append({"ID": s_name,"StudentID":s_student, "TaskID": s_task,
"DataSource": s_source, "ProblemDescription": s_desc ,
"Question": s_question , "Answer": s_ans ,"Label": s_label,
"s_contextrequired": s_contextrequired , "ExtraInfoInAnswer": s_extraInfoinAnswer ,
"Comments": s_comments , "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers,
})
out_df = pd.DataFrame(rows, columns = df_cols)

The problem in your solution was that the "element data extraction" was not done properly. The xml you mentioned in the question is nested in several layers. And that is why we need to recursively read and extract the data. The following solution should give you what you need in this case. Although I would encourage you to look at this article and the python documentation for more clarity.
Method: 1
import numpy as np
import pandas as pd
#import os
import xml.etree.ElementTree as ET
def xml2df(xml_source, df_cols, source_is_file = False, show_progress=True):
"""Parse the input XML source and store the result in a pandas
DataFrame with the given columns.
For xml_source = xml_file, Set: source_is_file = True
For xml_source = xml_string, Set: source_is_file = False
<element attribute_key1=attribute_value1, attribute_key2=attribute_value2>
<child1>Child 1 Text</child1>
<child2>Child 2 Text</child2>
<child3>Child 3 Text</child3>
</element>
Note that for an xml structure as shown above, the attribute information of
element tag can be accessed by list(element). Any text associated with <element> tag can be accessed
as element.text and the name of the tag itself can be accessed with
element.tag.
"""
if source_is_file:
xtree = ET.parse(xml_source) # xml_source = xml_file
xroot = xtree.getroot()
else:
xroot = ET.fromstring(xml_source) # xml_source = xml_string
consolidator_dict = dict()
default_instance_dict = {label: None for label in df_cols}
def get_children_info(children, instance_dict):
# We avoid using element.getchildren() as it is deprecated.
# Instead use list(element) to get a list of attributes.
for child in children:
#print(child)
#print(child.tag)
#print(child.items())
#print(child.getchildren()) # deprecated method
#print(list(child))
if len(list(child))>0:
instance_dict = get_children_info(list(child),
instance_dict)
if len(list(child.keys()))>0:
items = child.items()
instance_dict.update({key: value for (key, value) in items})
#print(child.keys())
instance_dict.update({child.tag: child.text})
return instance_dict
# Loop over all instances
for instance in list(xroot):
instance_dict = default_instance_dict.copy()
ikey, ivalue = instance.items()[0] # The first attribute is "ID"
instance_dict.update({ikey: ivalue})
if show_progress:
print('{}: {}={}'.format(instance.tag, ikey, ivalue))
# Loop inside every instance
instance_dict = get_children_info(list(instance),
instance_dict)
#consolidator_dict.update({ivalue: instance_dict.copy()})
consolidator_dict[ivalue] = instance_dict.copy()
df = pd.DataFrame(consolidator_dict).T
df = df[df_cols]
return df
Run the following to generate the desired output.
xml_source = r'grade_data.xml'
df_cols = ["ID", "TaskID", "DataSource", "ProblemDescription", "Question", "Answer",
"ContextRequired", "ExtraInfoInAnswer", "Comments", "Watch", 'ReferenceAnswers']
df = xml2df(xml_source, df_cols, source_is_file = True)
df
Method: 2
Given you have the xml_string, you could convert xml >> dict >> dataframe. run the following to get the desired output.
Note: You will need to install xmltodict to use Method-2. This method is inspired by the solution suggested by #martin-blech at How to convert XML to JSON in Python? [duplicate]
. Kudos to #martin-blech for making it.
pip install -U xmltodict
Solution
def read_recursively(x, instance_dict):
#print(x)
txt = ''
for key in x.keys():
k = key.replace("#","")
if k in df_cols:
if isinstance(x.get(key), dict):
instance_dict, txt = read_recursively(x.get(key), instance_dict)
#else:
instance_dict.update({k: x.get(key)})
#print('{}: {}'.format(k, x.get(key)))
else:
#print('else: {}: {}'.format(k, x.get(key)))
# dig deeper if value is another dict
if isinstance(x.get(key), dict):
instance_dict, txt = read_recursively(x.get(key), instance_dict)
# add simple text associated with element
if k=='#text':
txt = x.get(key)
# update text to corresponding parent element
if (k!='#text') and (txt!=''):
instance_dict.update({k: txt})
return (instance_dict, txt)
You will need the function read_recursively() given above. Now run the following.
import xmltodict, json
o = xmltodict.parse(xml_string) # INPUT: XML_STRING
#print(json.dumps(o)) # uncomment to see xml to json converted string
consolidated_dict = dict()
oi = o['Instances']['Instance']
for x in oi:
instance_dict = dict()
instance_dict, _ = read_recursively(x, instance_dict)
consolidated_dict.update({x.get("#ID"): instance_dict.copy()})
df = pd.DataFrame(consolidated_dict).T
df = df[df_cols]
df

Several issues:
Calling .find on the loop variable, node, expects a child node to exist: current_node.find('child_of_current_node'). However, since all the nodes are the children of root they do not maintain their own children, so no loop is required;
Not checking NoneType that can result from missing nodes with find() and prevents retrieving .tag or .text or other attributes;
Not retrieving node content with .text, otherwise the <Element... object is returned;
Consider this adjustment using the ternary condition expression a if condition else b to ensure variable has a value regardless:
rows = []
s_name = xroot.attrib.get("ID")
s_student = xroot.find("StudentID").text if xroot.find("StudentID") is not None else None
s_task = xroot.find("TaskID").text if xroot.find("TaskID") is not None else None
s_source = xroot.find("DataSource").text if xroot.find("DataSource") is not None else None
s_desc = xroot.find("ProblemDescription").text if xroot.find("ProblemDescription") is not None else None
s_question = xroot.find("Question").text if xroot.find("Question") is not None else None
s_ans = xroot.find("Answer").text if xroot.find("Answer") is not None else None
s_label = xroot.find("Label").text if xroot.find("Label") is not None else None
s_contextrequired = xroot.find("ContextRequired").text if xroot.find("ContextRequired") is not None else None
s_extraInfoinAnswer = xroot.find("ExtraInfoInAnswer").text if xroot.find("ExtraInfoInAnswer") is not None else None
s_comments = xroot.find("Comments").text if xroot.find("Comments") is not None else None
s_watch = xroot.find("Watch").text if xroot.find("Watch") is not None else None
s_referenceAnswers = xroot.find("ReferenceAnswers").text if xroot.find("ReferenceAnswers") is not None else None
rows.append({"ID": s_name,"StudentID":s_student, "TaskID": s_task,
"DataSource": s_source, "ProblemDescription": s_desc ,
"Question": s_question , "Answer": s_ans ,"Label": s_label,
"s_contextrequired": s_contextrequired , "ExtraInfoInAnswer": s_extraInfoinAnswer ,
"Comments": s_comments , "Watch": s_watch, "ReferenceAnswers": s_referenceAnswers
})
out_df = pd.DataFrame(rows, columns = df_cols)
Alternatively, run a more dynamic version assigning to an inner dictionary using the iterator variable:
rows = []
for node in xroot:
inner = {}
inner[node.tag] = node.text
rows.append(inner)
out_df = pd.DataFrame(rows, columns = df_cols)
Or list/dict comprehension:
rows = [{node.tag: node.text} for node in xroot]
out_df = pd.DataFrame(rows, columns = df_cols)

Related

problem with creating column name dynamically in a dataframe

I am trying to create a pandas dataframe dynamically. So far I'm fine with capturing data within the dataframe, but not with the name of the columns.
I also want the name of the columns to be based on the 'category' of the data that comes with the record that I am reading in my function, but I always get the last one.
def funct_example(client):
documents = [ v_document ]
poller = client.begin_analyze_entities(documents)
result = poller.result()
docs = [doc for doc in result if not doc.is_error]
i = 1
df_final = pd.DataFrame()
for idx, doc in enumerate(docs):
for relation in doc.entity_relations:
for role in relation.roles:
name = str([format(entity.category)]) + str(i) # <---- THIS LINE ALWAYS IS THE LAST REGISTER
d = {name : "'{}' with entity '{}'".format(role.name, role.entity.text)} # <---THIS IS OK
df = pd.DataFrame(data=d, index=[0])
df_final = pd.concat([df_final, df], axis=1)
i = i + 1
display(df_final)
return(df_final)
df_new_2 = funct_example(client)
I've tried adding an extra loop between creating the dataframe sentence and concat function like so:
for col in df.columns:
name = str([format(entity.category)]) + str(i)
df = df.rename(columns={col: name })
But the last category still appears in the column name...
How can I solve that?
From already thank you very much.
SOLUTION:
for idx, doc in enumerate(docs):
for relation in doc.entity_relations:
for role in relation.roles:
name = 'Relation_' + format(relation.relation_type) + '_' + str(i)
d = {name : "'{}' with entity '{}'".format(role.name, role.entity.text)}
df = pd.DataFrame(data=d, index=[0])
df_final = pd.concat([df_final, df], axis=1)
i = i + 1
display(df_final)
return(df_final)
df_relations = funct_example(client)
Regards!! :D

its difficult to suggest a solution without knowing the properties of all the objects being used.
'entity' object, is not defined within the function, so is it a global variable?
Does the role has 'entity', which then has the 'category' property? Im assuming as such, since role does have entity property
d = {name : "'{}' with entity '{}'".format(role.name, role.entity.text)} # <---THIS IS OK
Beside, the name variable while initialized is not used.
maybe you try
name = str([format(role.entity.category)]) + str(i)

What is the best way to parse large XML and genarate a dataframe with the data in the XML (with python or else)?

I try to make a table (or csv, I'm using pandas dataframe) from the information of an XML file.
The file is here (.zip is 14 MB, XML is ~370MB), https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip . It has package information of different languages - node.js, python, java etc. aka, CPE 2.3 list by the US government org NVD.
this is how it looks like in the first 30 rows:
<cpe-list xmlns:config="http://scap.nist.gov/schema/configuration/0.1" xmlns="http://cpe.mitre.org/dictionary/2.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:scap-core="http://scap.nist.gov/schema/scap-core/0.3" xmlns:cpe-23="http://scap.nist.gov/schema/cpe-extension/2.3" xmlns:ns6="http://scap.nist.gov/schema/scap-core/0.1" xmlns:meta="http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2" xsi:schemaLocation="http://scap.nist.gov/schema/cpe-extension/2.3 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary-extension_2.3.xsd http://cpe.mitre.org/dictionary/2.0 https://scap.nist.gov/schema/cpe/2.3/cpe-dictionary_2.3.xsd http://scap.nist.gov/schema/cpe-dictionary-metadata/0.2 https://scap.nist.gov/schema/cpe/2.1/cpe-dictionary-metadata_0.2.xsd http://scap.nist.gov/schema/scap-core/0.3 https://scap.nist.gov/schema/nvd/scap-core_0.3.xsd http://scap.nist.gov/schema/configuration/0.1 https://scap.nist.gov/schema/nvd/configuration_0.1.xsd http://scap.nist.gov/schema/scap-core/0.1 https://scap.nist.gov/schema/nvd/scap-core_0.1.xsd">
<generator>
<product_name>National Vulnerability Database (NVD)</product_name>
<product_version>4.9</product_version>
<schema_version>2.3</schema_version>
<timestamp>2022-03-17T03:51:01.909Z</timestamp>
</generator>
<cpe-item name="cpe:/a:%240.99_kindle_books_project:%240.99_kindle_books:6::~~~android~~">
<title xml:lang="en-US">$0.99 Kindle Books project $0.99 Kindle Books (aka com.kindle.books.for99) for android 6.0</title>
<references>
<reference href="https://play.google.com/store/apps/details?id=com.kindle.books.for99">Product information</reference>
<reference href="https://docs.google.com/spreadsheets/d/1t5GXwjw82SyunALVJb2w0zi3FoLRIkfGPc7AMjRF0r4/edit?pli=1#gid=1053404143">Government Advisory</reference>
</references>
<cpe-23:cpe23-item name="cpe:2.3:a:\$0.99_kindle_books_project:\$0.99_kindle_books:6:*:*:*:*:android:*:*"/>
</cpe-item>
The tree structure of the XML file is quite simple, the root is 'cpe-list', the child element is 'cpe-item', and the grandchild elements are 'title', 'references' and 'cpe23-item'.
From 'title', I want the text in the element;
From 'cpe23-item', I want the attribute 'name';
From 'references', I want the attributes 'href' from its great-grandchildren, 'reference'.
The dataframe should look like this:
| cpe23_name | title_text | ref1 | ref2 | ref3 | ref_other
0 | 'cpe23name 1'| 'this is a python pkg'| 'url1'| 'url2'| NaN | NaN
1 | 'cpe23name 2'| 'this is a java pkg' | 'url1'| 'url2'| NaN | NaN
...
my code is here，finished in ~100sec：
import xml.etree.ElementTree as et
xtree = et.parse("official-cpe-dictionary_v2.3.xml")
xroot = xtree.getroot()
import time
start_time = time.time()
df_cols = ["cpe", "text", "vendor", "product", "version", "changelog", "advisory", 'others']
title = '{http://cpe.mitre.org/dictionary/2.0}title'
ref = '{http://cpe.mitre.org/dictionary/2.0}references'
cpe_item = '{http://scap.nist.gov/schema/cpe-extension/2.3}cpe23-item'
p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
rows = []
i = 0
while i < len(xroot):
for elm in xroot[i]:
if elm.tag == title:
p_text = elm.text
#assign p_text
elif elm.tag == ref:
for nn in elm:
s = nn.text.lower()
#check the lower text in refs
if 'version' in s:
p_vers = nn.attrib.get('href')
#assign p_vers
elif 'advisor' in s:
p_advi = nn.attrib.get('href')
#assign p_advi
elif 'product' in s:
p_prod = nn.attrib.get('href')
#assign p_prod
elif 'vendor' in s:
p_vend = nn.attrib.get('href')
#assign p_vend
elif 'change' in s:
p_chan = nn.attrib.get('href')
#assign p_vend
else:
p_othe = nn.attrib.get('href')
elif elm.tag == cpe_item:
p_cpe = elm.attrib.get("name")
#assign p_cpe
else:
print(elm.tag)
row = [p_cpe, p_text, p_vend, p_prod, p_vers, p_chan, p_advi, p_othe]
rows.append(row)
p_cpe = None
p_text = None
p_vend = None
p_prod = None
p_vers = None
p_chan = None
p_advi = None
p_othe = None
print(len(rows)) #this shows how far I got during the running time
i+=1
out_df1 = pd.DataFrame(rows, columns = df_cols)# move this part outside the loop by removing the indent
print("---853k rows take %s seconds ---" % (time.time() - start_time))
updated: the faster way is to move the 2nd last row out side the loop. Since 'rows' already get info in each loop, there is no need to make a new dataframe every time.
the running time now is 136.0491042137146 seconds. yay!

Since your XML is fairly flat, consider the recently added IO module, pandas.read_xml introduced in v1.3. Given XML uses a default namespace, to reference elements in xpath use namespaces argument:
url = "https://nvd.nist.gov/feeds/xml/cpe/dictionary/official-cpe-dictionary_v2.3.xml.zip"
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}
)
If you do not have the default parser, lxml, installed, use the etree parser:
df = pd.read_xml(
url, xpath=".//doc:cpe-item", namespaces={'doc': 'http://cpe.mitre.org/dictionary/2.0'}, parser="etree"
)

Parse xml w/ xsd to CSV with Python?

I am trying to parse a very large XML file which I downloaded from OSHA's website and convert it into a CSV so I can use it in a SQLite database along with some other spreadsheets. I would just use an online converter, but the osha file is apparently too big for all of them.
I wrote a script in Python which looks like this:
import csv
import xml.etree.cElementTree as ET
tree = ET.parse('data.xml')
root = tree.getroot()
xml_data_to_csv =open('Out.csv', 'w')
list_head=[]
Csv_writer=csv.writer(xml_data_to_csv)
count=0
for element in root.findall('data'):
List_nodes =[]
if count== 0:
inspection_number = element.find('inspection_number').tag
list_head.append(inspection_number)
establishment_name = element.find('establishment_name').tag
list_head.append(establishment_name)
city = element.find('city')
list_head.append(city)
state = element.find('state')
list_head.append(state)
zip_code = element.find('zip_code')
list_head.append(zip_code)
sic_code = element.find('sic_code')
list_head.append(sic_code)
naics_code = element.find('naics_code')
list_head.append(naics_code)
sampling_number = element.find('sampling_number')
list_head.append(sampling_number)
office_id = element.find('office_id')
list_head.append(office_id)
date_sampled = element.find('date_sampled')
list_head.append(date_sampled)
date_reported = element.find('date_reported')
list_head.append(date_reported)
eight_hour_twa_calc = element.find('eight_hour_twa_calc')
list_head.append(eight_hour_twa_calc)
instrument_type = element.find('instrument_type')
list_head.append(instrument_type)
lab_number = element.find('lab_number')
list_head.append(lab_number)
field_number = element.find('field_number')
list_head.append(field_number)
sample_type = element.find('sample_type')
list_head.append(sample_type)
blank_used = element.find('blank_used')
list_head.append(blank_used)
time_sampled = element.find('time_sampled')
list_head.append(time_sampled)
air_volume_sampled = element.find('air_volume_sampled')
list_head.append(air_volume_sampled)
sample_weight = element.find('sample_weight')
list_head.append(sample_weight)
imis_substance_code = element.find('imis_substance_code')
list_head.append(imis_substance_code)
substance = element.find('substance')
list_head.append(substance)
sample_result = element.find('sample_result')
list_head.append(sample_result)
unit_of_measurement = element.find('unit_of_measurement')
list_head.append(unit_of_measurement)
qualifier = element.find('qualifier')
list_head.append(qualifier)
Csv_writer.writerow(list_head)
count = +1
inspection_number = element.find('inspection_number').text
List_nodes.append(inspection_number)
establishment_name = element.find('establishment_name').text
List_nodes.append(establishment_name)
city = element.find('city').text
List_nodes.append(city)
state = element.find('state').text
List_nodes.append(state)
zip_code = element.find('zip_code').text
List_nodes.append(zip_code)
sic_code = element.find('sic_code').text
List_nodes.append(sic_code)
naics_code = element.find('naics_code').text
List_nodes.append(naics_code)
sampling_number = element.find('sampling_number').text
List_nodes.append(sampling_number)
office_id = element.find('office_id').text
List_nodes.append(office_id)
date_sampled = element.find('date_sampled').text
List_nodes.append(date_sampled)
date_reported = element.find('date_reported').text
List_nodes.append(date_reported)
eight_hour_twa_calc = element.find('eight_hour_twa_calc').text
List_nodes.append(eight_hour_twa_calc)
instrument_type = element.find('instrument_type').text
List_nodes.append(instrument_type)
lab_number = element.find('lab_number').text
List_nodes.append(lab_number)
field_number = element.find('field_number').text
List_nodes.append(field_number)
sample_type = element.find('sample_type').text
List_nodes.append(sample_type)
blank_used = element.find('blank_used').text
List_nodes.append()
time_sampled = element.find('time_sampled').text
List_nodes.append(time_sampled)
air_volume_sampled = element.find('air_volume_sampled').text
List_nodes.append(air_volume_sampled)
sample_weight = element.find('sample_weight').text
List_nodes.append(sample_weight)
imis_substance_code = element.find('imis_substance_code').text
List_nodes.append(imis_substance_code)
substance = element.find('substance').text
List_nodes.append(substance)
sample_result = element.find('sample_result').text
List_nodes.append(sample_result)
unit_of_measurement = element.find('unit_of_measurement').text
List_nodes.append(unit_of_measurement)
qualifier= element.find('qualifier').text
List_nodes.append(qualifier)
Csv_writer.writerow(List_nodes)
xml_data_to_csv.close()
But when I run the code I get a CSV with nothing in it. I suspect this may have something to do with the XSD file associated with the XML, but I'm not totally sure.
Does anyone know what the issue is here?

The code below is a 'compact' version of your code.
It assumes that the XML structure looks like in the script variable xml. (Based on https://www.osha.gov/opengov/sample_data_2011.zip)
The main difference bwtween this sample code and yours is that I define the fields that I want to collect once (see FIELDS) and I use this definition across the script.
import xml.etree.ElementTree as ET
FIELDS = ['lab_number', 'instrument_type'] # TODO add more fields
xml = '''<main xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="health_sample_data.xsd">
<DATA_RECORD>
<inspection_number>316180165</inspection_number>
<establishment_name>PROFESSIONAL ENGINEERING SERVICES, LLC.</establishment_name>
<city>EUFAULA</city>
<state>AL</state>
<zip_code>36027</zip_code>
<sic_code>1799</sic_code>
<naics_code>238990</naics_code>
<sampling_number>434866166</sampling_number>
<office_id>418600</office_id>
<date_sampled>2011-12-30</date_sampled>
<date_reported>2011-12-30</date_reported>
<eight_hour_twa_calc>N</eight_hour_twa_calc>
<instrument_type>TBD</instrument_type>
<lab_number>L13645</lab_number>
<field_number>S1</field_number>
<sample_type>B</sample_type>
<blank_used>N</blank_used>
<time_sampled></time_sampled>
<air_volume_sampled></air_volume_sampled>
<sample_weight></sample_weight>
<imis_substance_code>S777</imis_substance_code>
<substance>Soil</substance>
<sample_result>0</sample_result>
<unit_of_measurement>AAAAA</unit_of_measurement>
<qualifier></qualifier>
</DATA_RECORD>
<DATA_RECORD>
<inspection_number>315516757</inspection_number>
<establishment_name>MARGUERITE CONCRETE CO.</establishment_name>
<city>WORCESTER</city>
<state>MA</state>
<zip_code>1608</zip_code>
<sic_code>1771</sic_code>
<naics_code>238110</naics_code>
<sampling_number>423259902</sampling_number>
<office_id>112600</office_id>
<date_sampled>2011-12-30</date_sampled>
<date_reported>2011-12-30</date_reported>
<eight_hour_twa_calc>N</eight_hour_twa_calc>
<instrument_type>GRAV</instrument_type>
<lab_number>L13355</lab_number>
<field_number>9831B</field_number>
<sample_type>P</sample_type>
<blank_used>N</blank_used>
<time_sampled>184</time_sampled>
<air_volume_sampled>340.4</air_volume_sampled>
<sample_weight>.06</sample_weight>
<imis_substance_code>9135</imis_substance_code>
<substance>Particulates not otherwise regulated (Total Dust)</substance>
<sample_result>0.176</sample_result>
<unit_of_measurement>M</unit_of_measurement>
<qualifier></qualifier>
</DATA_RECORD></main>'''
root = ET.fromstring(xml)
records = root.findall('.//DATA_RECORD')
with open('out.csv', 'w') as out:
out.write(','.join(FIELDS) + '\n')
for record in records:
values = [record.find(f).text for f in FIELDS]
out.write(','.join(values) + '\n')
out.csv
lab_number,instrument_type
L13645,TBD
L13355,GRAV

Transforming XML response into pandas dataframe

Been trying to structure an API response from XML to a readable pandas dataframe. I found a lot of inspiration in earlier threads on this topic, but the values in my dataframe still displays as "None".
The XML response:
<VehiclePositionResponse xmlns="http://fms-standard.com/rfms/v1.0.0/xsd/position" xmlns:ns2="http://fms-standard.com/rfms/v1.0.0/xsd/common/position">
<VehiclePosition>
<VIN>YS2R8X40005440923</VIN>
<TriggerType>OTHER</TriggerType>
<CreatedDateTime>2019-07-31T16:50:28</CreatedDateTime>
<ReceivedDateTime>2019-07-31T16:50:29</ReceivedDateTime>
<GNSSPosition>
<ns2:Latitude>62.098339</ns2:Latitude>
<ns2:Longitude>10.542222</ns2:Longitude>
<ns2:Heading>291</ns2:Heading>
<ns2:Altitude>655</ns2:Altitude>
<ns2:Speed>0</ns2:Speed>
<ns2:PositionDateTime>2019-07-31T16:50:28</ns2:PositionDateTime>
</GNSSPosition>
<WheelBasedSpeed></WheelBasedSpeed>
</VehiclePosition>
<VehiclePosition>
<VIN>YS2R8X40005441367</VIN>
<TriggerType>OTHER</TriggerType>
<CreatedDateTime>2019-07-31T18:13:24</CreatedDateTime>
<ReceivedDateTime>2019-07-31T18:13:25</ReceivedDateTime>
<GNSSPosition>
<ns2:Latitude>62.127206</ns2:Latitude>
<ns2:Longitude>10.608676</ns2:Longitude>
<ns2:Heading>3</ns2:Heading>
etc.
Code:
headers={'Authorization':Token,'Content-Type':'application/xml'}
r=requests.get(url, headers=headers)
def getvalueofnode(node):
return node.text if node is not None else None
def main():
root = cET.fromstring(r.content)
dfcols = ['VIN', 'CreatedDateTime', 'ReceivedDateTime', 'Latitude', 'Longitude', 'Altitude']
df_xml = pd.DataFrame(columns=dfcols)
for node in root:
VIN = node.find('VIN')
CreatedDateTime = node.find('CreatedDateTime')
ReceivedDateTime = node.find('ReceivedDateTime')
Latitude = node.find('Latitude')
Longitude = node.find('Longitude')
Altitude = node.find('Altitude')
df_xml = df_xml.append(
pd.Series([getvalueofnode(VIN), getvalueofnode(CreatedDateTime), getvalueofnode(ReceivedDateTime), getvalueofnode(Latitude), getvalueofnode(Longitude), getvalueofnode(Altitude)], index=dfcols),
ignore_index=True)
print(df_xml)
main()
This is how my response looks like:
Current Dataframe

Essentially, you are not accounting for the namespaces in the XML located in root tag and likely the reason for all None results. Consider parsing with namespaces defined. Since one is the default namespace, give it any prefix like data and parse with it:
ns = {"doc":"http://fms-standard.com/rfms/v1.0.0/xsd/position",
"ns2":"http://fms-standard.com/rfms/v1.0.0/xsd/common/position"}
for node in root:
VIN = node.find("doc:VIN", ns)
CreatedDateTime = node.find('doc:CreatedDateTime', ns)
ReceivedDateTime = node.find('doc:ReceivedDateTime', ns)
Latitude = node.find('doc:GNSSPosition/ns2:Latitude', ns)
Longitude = node.find('doc:GNSSPosition/ns2:Longitude', ns)
Altitude = node.find('doc:GNSSPosition/ns2:Altitude', ns)
Additionally, avoid the quadratic copy of calling append in a loop. Instead, build a list of dictionaries to bind into a DataFrame() constructor.
def main2():
root = cET.fromstring(r.content)
ns = {"doc":"http://fms-standard.com/rfms/v1.0.0/xsd/position",
"ns2":"http://fms-standard.com/rfms/v1.0.0/xsd/common/position"}
data_list = [{'VIN': getvalueofnode(node.find("doc:VIN", ns)),
'CreatedDateTime': getvalueofnode(node.find('doc:CreatedDateTime', ns)),
'ReceivedDateTime': getvalueofnode(node.find('doc:ReceivedDateTime', ns)),
'Latitude': getvalueofnode(node.find('doc:GNSSPosition/ns2:Latitude', ns)),
'Longitude': getvalueofnode(node.find('doc:GNSSPosition/ns2:Longitude', ns)),
'Altitude': getvalueofnode(node.find('doc:GNSSPosition/ns2:Altitude', ns))} \
for node in root]
df_xml = pd.DataFrame(data_list)
Output
print(df_xml)
# Altitude CreatedDateTime Latitude Longitude ReceivedDateTime VIN
# 0 655 2019-07-31T16:50:28 62.098339 10.542222 2019-07-31T16:50:29 YS2R8X40005440923
# 1 None 2019-07-31T18:13:24 62.127206 10.608676 2019-07-31T18:13:25 YS2R8X40005441367

python key dict error multilevel dict

I'm not entirely sure why im getting a dictionary key error. I'm trying to create a multi level dict with = sign and getting a key error on metrics, but not on the first two.
doc['timestamp']
and
doc['instance_id']
both work fine, but when it gets to metrics it gives me a metrics key error. I'm not entirely sure why.
doc = {}
doc['timestamp'] = datetime.now()
#doc['instance_id'] = get_cloud_app_name()
doc['instance_id'] = "MyMac"
cpu_dict_returned = get_cpu_info()
doc['metrics']['cpu_usage']['user_cpu'] = cpu_dict_returned['user_cpu']
doc['metrics']["cpu_usage"]['system_cpu'] = cpu_dict_returned['system_cpu']
doc['metrics']["cpu_usage"]['idle_cpu'] = cpu_dict_returned['idle_cpu']
doc['metrics']["cpu_usage"]['cpu_count'] = cpu_dict_returned['cpu_count']

You must create the sub-dictionnaries before using them:
doc = {}
doc['timestamp'] = datetime.now()
doc['instance_id'] = "MyMac"
cpu_dict_returned = get_cpu_info()
doc['metrics'] = {}
doc['metrics']['cpu_usage'] = {}
doc['metrics']['cpu_usage']['user_cpu'] = cpu_dict_returned['user_cpu']
doc['metrics']["cpu_usage"]['system_cpu'] = cpu_dict_returned['system_cpu']
doc['metrics']["cpu_usage"]['idle_cpu'] = cpu_dict_returned['idle_cpu']
doc['metrics']["cpu_usage"]['cpu_count'] = cpu_dict_returned['cpu_count']

You can do this more succinctly using a dictionary comprehension:
doc = {}
doc['timestamp'] = datetime.now()
doc['instance_id'] = "MyMac"
cpu_dict_returned = get_cpu_info()
doc['metrics'] = {
'cpu_usage':
{k: cpu_dict_returned.get(k)
for k in ['user_cpu', 'system_cpu', 'idle_cpu', 'cpu_count']}
}
Note that the sub dictionary cpu_usage is first created, and then the nested dictionary is inserted.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

parse xml to pandas data frame in python - python

Related

problem with creating column name dynamically in a dataframe

What is the best way to parse large XML and genarate a dataframe with the data in the XML (with python or else)?

Parse xml w/ xsd to CSV with Python?

Transforming XML response into pandas dataframe

python key dict error multilevel dict

Categories

Resources