Parsing Solr XML into Python Dictionary - python

I am new to Python and am trying to parse an XML document (filled with documents for a Solr instance) into a Python dictionary, but I am having trouble actually accomplishing this. I have tried ElementTree and minidom, but I can't seem to get the right results.
Here is my XML Structure:
<add>
  <doc>
    <field name="genLatitude">45.639968</field>
    <field name="carOfficeHoursEnd">2000-01-01T09:00:00.000Z</field>
    <field name="genLongitude">5.879745</field>
  </doc>
  <doc>
    <field name="genLatitude">46.639968</field>
    <field name="carOfficeHoursEnd">2000-01-01T09:00:00.000Z</field>
    <field name="genLongitude">6.879745</field>
  </doc>
</add>
And from this I need to turn it into a dictionary that looks like:
doc {
    "genLatitude": '45.639968',
    "carOfficeHoursEnd": '2000-01-01T09:00:00.000Z',
    "genLongitude": '5.879745',
}
I am not too familiar with how dictionaries work, but is there also a way to get all the "docs" into one dictionary?
cheers.

import xml.etree.cElementTree as etree
from pprint import pprint

root = etree.fromstring(xmlstr)  # or etree.parse(filename_or_file).getroot()
docs = [{f.attrib['name']: f.text for f in doc.iterfind('field[@name]')}
        for doc in root.iterfind('doc')]
pprint(docs)
Output
[{'carOfficeHoursEnd': '2000-01-01T09:00:00.000Z',
  'genLatitude': '45.639968',
  'genLongitude': '5.879745'},
 {'carOfficeHoursEnd': '2000-01-01T09:00:00.000Z',
  'genLatitude': '46.639968',
  'genLongitude': '6.879745'}]
Where xmlstr is:
xmlstr = """
<add>
  <doc>
    <field name="genLatitude">45.639968</field>
    <field name="carOfficeHoursEnd">2000-01-01T09:00:00.000Z</field>
    <field name="genLongitude">5.879745</field>
  </doc>
  <doc>
    <field name="genLatitude">46.639968</field>
    <field name="carOfficeHoursEnd">2000-01-01T09:00:00.000Z</field>
    <field name="genLongitude">6.879745</field>
  </doc>
</add>
"""

Solr can return a Python dictionary if you add wt=python to the request parameters. To convert this text response into a Python object, use ast.literal_eval(text_response).
This is much simpler than parsing the XML.
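A minimal sketch of that round trip, assuming response_text holds the body of a wt=python response (the payload below is an illustrative sample, not a real Solr reply):

```python
import ast

# Illustrative body of a Solr response requested with wt=python
response_text = ("{'responseHeader': {'status': 0}, "
                 "'response': {'numFound': 1, "
                 "'docs': [{'genLatitude': '45.639968'}]}}")

# literal_eval only evaluates Python literals, so unlike eval()
# it cannot run arbitrary code embedded in the response
data = ast.literal_eval(response_text)
docs = data['response']['docs']
```

Each entry of docs is then a plain dictionary, exactly the structure asked for above.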

A possible solution using ElementTree, with the output pretty-printed for the sake of example:
>>> import xml.etree.ElementTree as etree
>>> root = etree.parse(document).getroot()
>>> docs = []
>>> for doc in root.findall('doc'):
...     fields = {}
...     for field in doc:
...         fields[field.attrib['name']] = field.text
...     docs.append(fields)
...
>>> print docs
[{'genLongitude': '5.879745',
  'genLatitude': '45.639968',
  'carOfficeHoursEnd': '2000-01-01T09:00:00.000Z'},
 {'genLongitude': '6.879745',
  'genLatitude': '46.639968',
  'carOfficeHoursEnd': '2000-01-01T09:00:00.000Z'}]
The XML document you show does not provide a way to distinguish one doc from another, so I would maintain that a list is the best structure to collect the dictionaries.
Indeed, if you want to insert each doc's data into another dictionary, of course you can, but you need to choose a suitable key for that dictionary. For example, using the id Python provides for each object, you could write:
>>> docs = {}
>>> for doc in root.findall('doc'):
...     fields = {}
...     for field in doc:
...         fields[field.attrib['name']] = field.text
...     docs[id(fields)] = fields
...
>>> print docs
{3076930796L: {'genLongitude': '6.879745',
               'genLatitude': '46.639968',
               'carOfficeHoursEnd': '2000-01-01T09:00:00.000Z'},
 3076905540L: {'genLongitude': '5.879745',
               'genLatitude': '45.639968',
               'carOfficeHoursEnd': '2000-01-01T09:00:00.000Z'}}
This example is designed just to let you see how to use the outer dictionary. If you decide to go down this path, I suggest you find a meaningful and stable key instead of the object's memory address returned by id, which can change from run to run.
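For instance, here is a sketch that keys each doc on its genLatitude value (an assumption made for illustration; in practice pick any field that is genuinely unique per doc, such as a Solr id field):

```python
import xml.etree.ElementTree as etree

xmlstr = """
<add>
  <doc>
    <field name="genLatitude">45.639968</field>
    <field name="genLongitude">5.879745</field>
  </doc>
  <doc>
    <field name="genLatitude">46.639968</field>
    <field name="genLongitude">6.879745</field>
  </doc>
</add>
"""

root = etree.fromstring(xmlstr)
docs = {}
for doc in root.findall('doc'):
    fields = {field.attrib['name']: field.text for field in doc}
    # key on a field value instead of a memory address
    docs[fields['genLatitude']] = fields
```

Lookups then read naturally, e.g. docs['45.639968']['genLongitude'].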

It's risky to eval any string that comes from outside directly into Python. Who knows what's in there.
I'd suggest using the JSON interface instead. Something like:

import json
import urllib2

response_dict = json.loads(urllib2.urlopen('http://localhost:8080/solr/combined/select?wt=json&q=*&rows=1').read())
# to view the dict
print json.dumps(response_dict, indent=1)

Related

how to handle web SQL queries and xml replies in Python

I have a distant database on which I can send SQL select queries through a web service like this:
http://aa.bb.cc.dd:85/SQLWEB?query=select+*+from+machine&output=xml_v2
which returns
<Query>
  <SQL></SQL>
  <Fields>
    <MACHINEID DataType="Integer" DataSize="4"/>
    <NAME DataType="WideString" DataSize="62"/>
    <MACHINECLASSID DataType="Integer" DataSize="4"/>
    <SUBMACHINECLASS DataType="WideString" DataSize="22"/>
    <DISABLED DataType="Integer" DataSize="4"/>
  </Fields>
  <Record>
    <MACHINEID>1</MACHINEID>
    <NAME>LOADER</NAME>
    <MACHINECLASSID>16</MACHINECLASSID>
    <SUBMACHINECLASS>A</SUBMACHINECLASS>
    <DISABLED>0</DISABLED>
  </Record>
  <Record>
  ...
  </Record>
  ...
</Query>
Then I need to insert the records into a local SQL database.
What's the easiest way? Thanks!
First of all, putting SQL queries in the URL is a horrible idea for security.
Use an XML library to parse the reply, then iterate over the result to add each record to the database:
import xml.etree.ElementTree as ET

tree = ET.parse('xml file')
root = tree.getroot()
# root = ET.fromstring(xml_as_string) if you start from a string
for record in root.findall('Record'):
    # MACHINEID etc. are child elements, not attributes,
    # so use findtext() rather than get()
    MACHINEID = record.findtext('MACHINEID')
    NAME = record.findtext('NAME')
    MACHINECLASSID = record.findtext('MACHINECLASSID')
    SUBMACHINECLASS = record.findtext('SUBMACHINECLASS')
    DISABLED = record.findtext('DISABLED')
    # your code to add this result to the db
ElementTree XML API

lxml: How do I search for fields without adding a xmlns (localhost) path to each search term?

I'm trying to locate fields in a SOAP XML file using lxml (3.6.0):
...
<soap:Body>
  <Request xmlns="http://localhost/">
    <Test>
      <field1>hello</field1>
      <field2>world</field2>
    </Test>
  </Request>
</soap:Body>
...
In this example I'm trying to find field1 and field2.
I need to add the namespace to the search term to find the field:
print (myroot.find(".//{http://localhost/}field1").tag) # prints '{http://localhost/}field1'
without it, I don't find anything:
print (myroot.find("field1"))  # prints None, so .tag would raise AttributeError
Is there any other way to search for the field tag (here field1) without giving path info?
Full example below:
from lxml import etree
example = """<?xml version="1.0" encoding="utf-8"?><soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body><Request xmlns="http://localhost/">
<Test><field1>hello</field1><field2>world</field2></Test>
</Request></soap:Body></soap:Envelope>
"""
myroot = etree.fromstring(example)
# this works
print (myroot.find(".//{http://localhost/}field1").text)
print (myroot.find(".//{http://localhost/}field2").text)
# this fails
print (myroot.find(".//field1").text)
print (myroot.find("field1").text)
Comment: The input of the SOAP request is given, I can't change any of it in real live to make things easier.
There is a way to ignore namespaces when selecting elements with XPath, but that isn't good practice; the namespace is there for a reason. Anyway, there is a cleaner way to reference an element in a namespace: map a prefix to the namespace URI once and use the prefix, instead of writing out the actual namespace URI every time:
.....
>>> ns = {'d': 'http://localhost/'}
>>> print (myroot.find(".//d:field1", ns).text)
hello
>>> print (myroot.find(".//d:field2", ns).text)
world

Python: Get value with xmltodict

I have an XML-file that looks like this:
<?xml version="1.0" encoding="utf-8"?>
<session id="2934" name="Valves" docVersion="5.0.1">
  <docInfo>
    <field name="Employee" isMandotory="True">Jake Roberts</field>
    <field name="Section" isOpen="True" isMandotory="False">5</field>
    <field name="Location" isOpen="True" isMandotory="False">Munchen</field>
  </docInfo>
</session>
Using xmltodict I want to get the Employee in a string. It is probably quite simple but I can't seem to figure it out.
Here's my code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys
import xmltodict

with open('valves.xml') as fd:
    doc = xmltodict.parse(fd.read())

print "ID : %s" % doc['session']['@id']
print "Name : %s" % doc['session']['@name']
print "Doc Version : %s" % doc['session']['@docVersion']
print "Employee : %s" % doc['session']['docInfo']['field']
sys.exit(0)
With this, I do get all fields in a list, but presumably with xmltodict every individual field attribute or element is accessible as a key-value pair.
How can I access the value "Jake Roberts" the same way I access the value of docVersion, for example?
What you are getting is a list of fields, where every field is represented by a dict. Explore this dict (e.g. in the Python interactive shell) to narrow down how to get the value you want.
>>> doc["session"]["docInfo"]["field"][0]
OrderedDict([(u'@name', u'Employee'), (u'@isMandotory', u'True'), ('#text', u'Jake Roberts')])
In order to get to the element value add ["#text"] to the end of the line in the snippet above.
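Concretely, here is a sketch using a hand-built dict shaped the way xmltodict parses the file above (by default xmltodict keys attributes with an '@' prefix and element text under '#text'; the dict below stands in for the real xmltodict.parse output):

```python
from collections import OrderedDict

# Stand-in for xmltodict.parse() output on the <docInfo> element above
doc = {
    "session": OrderedDict([
        ("@id", "2934"),
        ("docInfo", {
            "field": [
                OrderedDict([("@name", "Employee"),
                             ("@isMandotory", "True"),
                             ("#text", "Jake Roberts")]),
                OrderedDict([("@name", "Section"),
                             ("#text", "5")]),
            ]
        }),
    ])
}

# Index into the field list, then take the element text
employee = doc["session"]["docInfo"]["field"][0]["#text"]
```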

How to use python to parse XML to required custom fields

I've got a directory full of Salesforce objects in XML format. I'd like to identify the <fullName> and parent file of all the custom <fields> where <required> is true. Here is some truncated sample data; let's call it "Custom_Object__c":
<?xml version="1.0" encoding="UTF-8"?>
<CustomObject xmlns="http://soap.sforce.com/2006/04/metadata">
  <deprecated>false</deprecated>
  <description>descriptiontext</description>
  <fields>
    <fullName>custom_field1</fullName>
    <required>false</required>
    <type>Text</type>
    <unique>false</unique>
  </fields>
  <fields>
    <fullName>custom_field2</fullName>
    <deprecated>false</deprecated>
    <visibleLines>5</visibleLines>
  </fields>
  <fields>
    <fullName>custom_field3</fullName>
    <required>false</required>
  </fields>
  <fields>
    <fullName>custom_field4</fullName>
    <deprecated>false</deprecated>
    <description>custom field 4 description</description>
    <externalId>true</externalId>
    <required>true</required>
    <scale>0</scale>
    <type>Number</type>
    <unique>false</unique>
  </fields>
  <fields>
    <fullName>custom_field5</fullName>
    <deprecated>false</deprecated>
    <description>Creator of this log message. Application-specific.</description>
    <externalId>true</externalId>
    <label>Origin</label>
    <length>255</length>
    <required>true</required>
    <type>Text</type>
    <unique>false</unique>
  </fields>
  <label>App Log</label>
  <nameField>
    <displayFormat>LOG-{YYYYMMDD}-{00000000}</displayFormat>
    <label>Entry ID</label>
    <type>AutoNumber</type>
  </nameField>
</CustomObject>
The desired output would be a dictionary with format something like:
required_fields = {'Custom_Object__1': 'custom_field4', 'Custom_Object__1': 'custom_field5',... etc for all the required fields in all files in the fold.}
or anything similar.
I've already gotten my list of objects through glob.glob, and I can get a list of all the children and their attributes with ElementTree but I'm struggling past there. I feel like I'm very close but I'd love a hand finishing this task off. Here is my code so far:
import os
import glob
import xml.etree.ElementTree as ET

os.chdir("/Users/paulsallen/workspace/fforce/FForce Dev Account/config/objects/")

objs = []
for file in glob.glob("*.object"):
    objs.append(file)

fields_dict = {}
for object in objs:
    root = ET.parse(object).getroot()
    ....
and once I get the XML data parsed I don't know where to take it from there.
You really want to switch to using lxml here, because then you can use an XPath query:
from lxml import etree as ET

os.chdir("/Users/paulsallen/workspace/fforce/FForce Dev Account/config/objects/")
objs = glob.glob("*.object")

fields_dict = {}
for filename in objs:
    root = ET.parse(filename).getroot()
    required = root.xpath('.//n:fullName[../n:required/text()="true"]/text()',
                          namespaces={'n': root.nsmap[None]})
    fields_dict[os.path.splitext(filename)[0]] = required
With that code you end up with a dictionary of lists; each key is a filename (without the extension), each value is a list of required fields.
The XPath query looks for fullName elements in the default namespace, that have a required element as sibling with the text 'true' in them. It then takes the contained text of each of those matching elements, which is a list we can store in the dictionary.
Use this function to find all required fields under a given root. It should also serve as an example/starting point for future parsing needs:
def find_required_fields(root):
    NS = {'soap': 'http://soap.sforce.com/2006/04/metadata'}
    required_fields = []
    for field in root.findall('soap:fields', namespaces=NS):
        required = field.findtext('soap:required', namespaces=NS) == "true"
        name = field.findtext('soap:fullName', namespaces=NS)
        if required:
            required_fields.append(name)
    return required_fields
Example usage:
>>> import xml.etree.ElementTree as ET
>>> root = ET.parse('objects.xml') # where objects.xml contains the example in the question
>>> print find_required_fields(root)
['custom_field4', 'custom_field5']
>>>

Accessing Elements with and without namespaces using lxml

Is there a way to search for the same element, at the same time, within a document that occur with and without namespaces using lxml? As an example, I would want to get all occurences of the element identifier irrespective of whether or not it is associated with a specific namespace. I am currently only able to access them separately as below.
Code:
from lxml import etree

xmlfile = etree.parse('xmlfile.xml')
root = xmlfile.getroot()

for l in root.iter('identifier'):
    print l.text

for l in root.iter('{http://www.openarchives.org/OAI/2.0/provenance}identifier'):
    print l.text
File: xmlfile.xml
<?xml version="1.0"?>
<record>
  <header>
    <identifier>identifier1</identifier>
    <datestamp>datastamp1</datestamp>
    <setSpec>setspec1</setSpec>
  </header>
  <metadata>
    <oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
      <dc:title>title1</dc:title>
      <dc:title>title2</dc:title>
      <dc:creator>creator1</dc:creator>
      <dc:subject>subject1</dc:subject>
      <dc:subject>subject2</dc:subject>
    </oai_dc:dc>
  </metadata>
  <about>
    <provenance xmlns="http://www.openarchives.org/OAI/2.0/provenance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/provenance http://www.openarchives.org/OAI/2.0/provenance.xsd">
      <originDescription altered="false" harvestDate="2011-08-11T03:47:51Z">
        <baseURL>baseURL1</baseURL>
        <identifier>identifier3</identifier>
        <datestamp>datestamp2</datestamp>
        <metadataNamespace>xxxxx</metadataNamespace>
        <originDescription altered="false" harvestDate="2010-10-10T06:15:53Z">
          <baseURL>xxxxx</baseURL>
          <identifier>identifier4</identifier>
          <datestamp>2010-04-27T01:10:31Z</datestamp>
          <metadataNamespace>xxxxx</metadataNamespace>
        </originDescription>
      </originDescription>
    </provenance>
  </about>
</record>
You could use XPath to solve that kind of issue:
from lxml import etree
xmlfile = etree.parse('xmlfile.xml')
identifier_nodes = xmlfile.xpath("//*[local-name() = 'identifier']")
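If lxml is not an option, here is a sketch of the same namespace-agnostic lookup using only the standard library, comparing each element's local name (the part of .tag after '}') by hand; the two-identifier document below is a trimmed-down stand-in for the file above:

```python
import xml.etree.ElementTree as ET

xmlstr = """
<record>
  <header><identifier>identifier1</identifier></header>
  <about xmlns:p="http://www.openarchives.org/OAI/2.0/provenance">
    <p:identifier>identifier3</p:identifier>
  </about>
</record>
"""

root = ET.fromstring(xmlstr)
# ElementTree stores namespaced tags as '{uri}localname',
# so stripping everything up to '}' ignores the namespace
identifiers = [el.text for el in root.iter()
               if el.tag.rsplit('}', 1)[-1] == 'identifier']
```

This matches identifier elements in any namespace and in none, at the cost of the precision the namespaces were meant to provide.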

Categories

Resources