Parsing an XML File to CSV without hardcoding values

I was wondering whether there is a way to parse an XML file, collect all of the tags (or as many as possible), and put them into columns without hardcoding the tag names.
Take the eventType tag in my XML as an example. I would like the script to create a column named "eventType" the first time it sees the tag and put the tag's value underneath that column. Every later "eventType" tag it parses would go into the same column.
This is roughly what I am trying to make the output look like:
Here is the XML sample:
<?xml version="1.0" encoding="UTF-8"?>
<faults version="1" xmlns="urn:nortel:namespaces:mcp:faults" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:nortel:namespaces:mcp:faults NortelFaultSchema.xsd ">
<family longName="1OffMsgr" shortName="OOM"/>
<family longName="ACTAGENT" shortName="ACAT">
<logs>
<log>
<eventType>RES</eventType>
<number>1</number>
<severity>INFO</severity>
<descTemplate>
<msg>Accounting is enabled upon this NE.</msg>
</descTemplate>
<note>This log is generated when setting a Session Manager's AM from &lt;none&gt; to a valid AM.</note>
<om>On all instances of this Session Manager, the &lt;NE_Inst&gt;:&lt;AM&gt;:STD:acct OM row in the StdRecordStream group will appear and start counting the recording units sent to the configured AM.
On the configured AM, the &lt;NE_inst&gt;:acct OM rows in RECSTRMCOLL group will appear and start counting the recording units received from this Session Manager's instances.
</om>
</log>
<log>
<eventType>RES</eventType>
<number>2</number>
<severity>ALERT</severity>
<descTemplate>
<msg>Accounting is disabled upon this NE.</msg>
</descTemplate>
<note>This log is generated when setting a Session Manager's AM from a valid AM to &lt;none&gt;.</note>
<action>If you do not intend for the Session Manager to produce accounting records, then no action is required. If you do intend for the Session Manager to produce accounting records, then you should set the Session Manager's AM to a valid AM.</action>
<om>On all instances of this Session Manager, the &lt;NE_Inst&gt;:&lt;AM&gt;:STD:acct OM row in the StdRecordStream group that matched the previous datafilled AM will disappear.
On the previously configured AM, the &lt;NE_inst&gt;:acct OM rows in RECSTRMCOLL group will disappear.
</om>
</log>
</logs>
</family>
<family longName="ACODE" shortName="AC">
<alarms>
<alarm>
<eventType>ADMIN</eventType>
<number>1</number>
<probableCause>INFORMATION_MODIFICATION_DETECTED</probableCause>
<descTemplate>
<msg>Configured data for audiocode server updated: $1</msg>
<param>
<num>1</num>
<description>AudioCode configuration data got updated</description>
<exampleValue>acgwy1</exampleValue>
</param>
</descTemplate>
<manualClearable></manualClearable>
<correctiveAction>None. Acknowledge/Clear alarm and deploy the audiocode server if appropriate.</correctiveAction>
<alarmName>Audiocode Server Updated</alarmName>
<severities>
<severity>MINOR</severity>
</severities>
</alarm>
<alarm>
<eventType>ADMIN</eventType>
<number>2</number>
<probableCause>CONFIG_OR_CUSTOMIZATION_ERROR</probableCause>
<descTemplate>
<msg>Deployment for audiocode server failed: $1. Reason: $2.</msg>
<param>
<num>1</num>
<description>AudioCode Name</description>
<exampleValue>audcod</exampleValue>
</param>
<param>
<num>2</num>
<description>AudioCode Deployment failed reason</description>
<exampleValue>Failed to parse audiocode configuration data</exampleValue>
</param>
</descTemplate>
<manualClearable></manualClearable>
<correctiveAction>Check the configuration of audiocode server. Acknowledge/Clear alarm and deploy the audiocode server if appropriate.</correctiveAction>
<alarmName>Audiocode Server Deploy Failed</alarmName>
<severities>
<severity>MINOR</severity>
<severity>MAJOR</severity>
</severities>
</alarm>
<alarm>
<eventType>COMM</eventType>
<number>2</number>
<probableCause>LOSS_OF_FRAME</probableCause>
<descTemplate>
<msg>Far end LOF (a.k.a., Yellow Alarm). Trunk (DS1 Number): $1.</msg>
<param>
<num>1</num>
<description>Trunk Number of Trunk with configuration problem</description>
<exampleValue>2</exampleValue>
</param>
</descTemplate>
<clearCondition>Far end is correctly configured for proper framing.</clearCondition>
<correctiveAction>Check that the far end is configured for the proper framing.</correctiveAction>
<alarmName>Far end LOF</alarmName>
<severities>
<severity>CRITICAL</severity>
</severities>
<note>This alarm indicates the Trunk Framing settings on the connected PSTN switch do not match those provisioned on the Audiocodes Mediant 2k.</note>
</alarm>
<alarm>
<eventType>COMM</eventType>
<number>3</number>
<probableCause>LOSS_OF_FRAME</probableCause>
<descTemplate>
<msg>Near end sending LOF Indication. Trunk (DS1 Number): $1.</msg>
<param>
<num>1</num>
<description>Trunk Number of Trunk with configuration problem</description>
<exampleValue>2</exampleValue>
</param>
</descTemplate>
<clearCondition>Gateway is correctly configured for proper framing.</clearCondition>
<correctiveAction>Check that the Audiocodes gateway is configured for the proper framing.</correctiveAction>
<alarmName>Near end sending LOF Indication</alarmName>
<severities>
<severity>CRITICAL</severity>
</severities>
</alarm>
</alarms>
</family>
</faults>
This is my code; as you can see, the tag names are hardcoded:
from xml.etree import ElementTree
import csv
import lxml.etree
import pandas as pd
from copy import copy
from pprint import pprint
tree = ElementTree.parse('FaultFamilies.xml')
sitescope_data = open('Out.csv', 'w', newline='', encoding='utf-8')
csvwriter = csv.writer(sitescope_data)
# Create all needed columns here in order and writes them to excel file
col_names = ['longName', 'shortName', 'eventType', 'ProbableCause', 'Severity', 'alarmName', 'clearCondition',
             'correctiveAction', 'note', 'action', 'om']
csvwriter.writerow(col_names)
def recurse(root, props):
    # Finds every single tag in the xml file
    for child in root:
        #print(child.text)
        if child.tag == '{urn:nortel:namespaces:mcp:faults}family':
            # copy of the dictionary
            p2 = copy(props)
            # adds to the dictionary the longNm name and shortName
            p2['longName'] = child.attrib.get('longName', '')
            p2['shortName'] = child.attrib.get('shortName', '')
            recurse(child, p2)
        else:
            recurse(child, props)
    # FIND ALL NEEDED ALARMS INFORMATION
    for event in root.findall('{urn:nortel:namespaces:mcp:faults}alarm'):
        event_data = [props.get('longName',''), props.get('shortName', '')]
        # Find eventType and appends it
        event_id = event.find('{urn:nortel:namespaces:mcp:faults}eventType')
        if event_id != None:
            event_id = event_id.text
        # appends to the to the list with comma
        event_data.append(event_id)
        # Find probableCause and appends it
        probableCause = event.find('{urn:nortel:namespaces:mcp:faults}probableCause')
        if probableCause != None:
            probableCause = probableCause.text
        event_data.append(probableCause)
        # Find severities and appends it
        severities = event.find('{urn:nortel:namespaces:mcp:faults}severities')
        if severities:
            severity_data = ','.join(
                [sv.text for sv in severities.findall('{urn:nortel:namespaces:mcp:faults}severity')])
            event_data.append(severity_data)
        else:
            event_data.append("")
        # Find alarmName and appends it
        alarmName = event.find('{urn:nortel:namespaces:mcp:faults}alarmName')
        if alarmName != None:
            alarmName = alarmName.text
        event_data.append(alarmName)
        clearCondition = event.find('{urn:nortel:namespaces:mcp:faults}clearCondition')
        if clearCondition != None:
            clearCondition = clearCondition.text
        event_data.append(clearCondition)
        correctiveAction = event.find('{urn:nortel:namespaces:mcp:faults}correctiveAction')
        if correctiveAction != None:
            correctiveAction = correctiveAction.text
        event_data.append(correctiveAction)
        note = event.find('{urn:nortel:namespaces:mcp:faults}note')
        if note != None:
            note = note.text
        event_data.append(note)
        action = event.find('{urn:nortel:namespaces:mcp:faults}action')
        if action != None:
            action = action.text
        event_data.append(action)
        csvwriter.writerow(event_data)
    # FIND ALL LOGS INFORMATION
    for event in root.findall('{urn:nortel:namespaces:mcp:faults}log'):
        event_data = [props.get('longName', ''), props.get('shortName', '')]
        event_id = event.find('{urn:nortel:namespaces:mcp:faults}eventType')
        if event_id != None:
            event_id = event_id.text
        event_data.append(event_id)
        probableCause = event.find('{urn:nortel:namespaces:mcp:faults}probableCause')
        if probableCause != None:
            probableCause = probableCause.text
        event_data.append(probableCause)
        severities = event.find('{urn:nortel:namespaces:mcp:faults}severity')
        if severities != None:
            severities = severities.text
        event_data.append(severities)
        alarmName = event.find('{urn:nortel:namespaces:mcp:faults}alarmName')
        if alarmName != None:
            alarmName = alarmName.text
        event_data.append(alarmName)
        # Find clearCondition and appends it
        clearCondition = event.find('{urn:nortel:namespaces:mcp:faults}clearCondition')
        if clearCondition != None:
            clearCondition = clearCondition.text
        event_data.append(clearCondition)
        correctiveAction = event.find('{urn:nortel:namespaces:mcp:faults}correctiveAction')
        if correctiveAction != None:
            correctiveAction = correctiveAction.text
        event_data.append(correctiveAction)
        note = event.find('{urn:nortel:namespaces:mcp:faults}note')
        if note != None:
            note = note.text
        event_data.append(note)
        action = event.find('{urn:nortel:namespaces:mcp:faults}action')
        if action != None:
            action = action.text
        event_data.append(action)
        csvwriter.writerow(event_data)
root = tree.getroot()
recurse(root, {}) # root + empty dictionary
print("File successfully converted to CSV")
sitescope_data.close()
When running @tdelaney's solution:

You could build a list of lists to represent the rows of the table. Whenever it's time for a new row, build a new list with all known columns defaulted to "" and append it to the bottom of the outer list. When a new column needs to be inserted, it's just a case of spinning through the existing inner lists and appending a default "" cell. Keep a map of known column names to their index in the row. Now when you spin through the events, you use the tag name to find the column index and add its value to the latest row in the table.
It looks like you want "log" and "alarm" tags, but I wrote the element selector to take any element that has an "eventType" child element. Since "longName" and "shortName" are common to all events under a given <family>, there is an outer loop to grab those and apply them to each new row of the table. I switched to xpath so that I could set up namespaces and write the selectors more tersely. That is personal preference, but I think it makes the xpath more readable.
import csv
import lxml.etree
from lxml.etree import QName
import operator
class ExpandingTable:
    """A 2 dimensional table where columns are expanded as new column
    types are discovered"""
    def __init__(self):
        """Create table that can expand rows and columns"""
        self.name_to_col = {}
        self.table = []
    def add_column(self, name):
        """Add column named `name` unless already included"""
        if name not in self.name_to_col:
            self.name_to_col[name] = len(self.name_to_col)
            for row in self.table:
                row.append('')
    def add_cell(self, name, value):
        """Add value to named column in the current row"""
        if value:
            self.add_column(name)
            self.table[-1][self.name_to_col[name]] = value.strip().replace("\r\n", " ")
    def new_row(self):
        """Create a new row and make it current"""
        self.table.append([''] * len(self.name_to_col))
    def header(self):
        """Gather discovered column names into a header list"""
        idx_1 = operator.itemgetter(1)
        return [name for name, _ in sorted(self.name_to_col.items(), key=idx_1)]
    def prepend_header(self):
        """Gather discovered column names into a header and
        prepend it to the list"""
        self.table.insert(0, self.header())
def events_to_table(elem):
    """ Builds table from <family> child elements and their contained alarms and
    logs."""
    ns = {"f":"urn:nortel:namespaces:mcp:faults"}
    table = ExpandingTable()
    for family in elem.xpath("f:family", namespaces=ns):
        longName = family.get("longName")
        shortName = family.get("shortName")
        for event in family.xpath("*/*[f:eventType]", namespaces=ns):
            table.new_row()
            table.add_cell("longName", longName)
            table.add_cell("shortName", shortName)
            for cell in event:
                tag = QName(cell.tag).localname
                if tag == "severities":
                    # fold the <severity> children into one comma-separated cell
                    tag = "severity"
                    text = ",".join(severity.text for severity in cell.xpath("*"))
                    print("severities", repr(text))
                else:
                    text = cell.text
                table.add_cell(tag, text)
    table.prepend_header()
    return table.table
def main(filename):
    doc = lxml.etree.parse(filename)
    table = events_to_table(doc.getroot())
    with open('test.csv', 'w', newline='', encoding='utf-8') as fileobj:
        csv.writer(fileobj).writerows(table)
main('test.xml')
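For comparison, the same "discover columns as you go" idea can be sketched with only the standard library, collecting each event as a dict and letting csv.DictWriter fill in the blanks. This is a rough sketch rather than a drop-in replacement: the trimmed SAMPLE document and the helper names local_name, events_to_rows and rows_to_csv are made up for illustration, and it handles "severities" the same way the answer above does.

```python
import csv
import io
import xml.etree.ElementTree as ET

NS = "{urn:nortel:namespaces:mcp:faults}"

# Trimmed, hypothetical sample in the same shape as the question's file.
SAMPLE = """<faults xmlns="urn:nortel:namespaces:mcp:faults">
  <family longName="ACTAGENT" shortName="ACAT">
    <logs>
      <log><eventType>RES</eventType><number>1</number><severity>INFO</severity></log>
    </logs>
  </family>
  <family longName="ACODE" shortName="AC">
    <alarms>
      <alarm>
        <eventType>ADMIN</eventType>
        <number>2</number>
        <alarmName>Audiocode Server Deploy Failed</alarmName>
        <severities><severity>MINOR</severity><severity>MAJOR</severity></severities>
      </alarm>
    </alarms>
  </family>
</faults>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def events_to_rows(root):
    """Return (columns, rows); columns are kept in order of first appearance."""
    columns, rows = [], []
    for family in root.findall(NS + "family"):
        for event in family.findall("*/*"):
            if event.find(NS + "eventType") is None:
                continue  # not a log/alarm style element
            row = {"longName": family.get("longName", ""),
                   "shortName": family.get("shortName", "")}
            for cell in event:
                name = local_name(cell.tag)
                if name == "severities":
                    # fold the <severity> children into one cell
                    name = "severity"
                    value = ",".join((sv.text or "").strip() for sv in cell)
                else:
                    value = (cell.text or "").strip()
                row[name] = value
            for name in row:
                if name not in columns:
                    columns.append(name)
            rows.append(row)
    return columns, rows


def rows_to_csv(columns, rows):
    """Write the rows as CSV text; missing cells become empty strings."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


columns, rows = events_to_rows(ET.fromstring(SAMPLE))
csv_text = rows_to_csv(columns, rows)
```

Because every row is a dict, a column that first shows up late in the file needs no back-filling; DictWriter's restval supplies the empty cells when the file is written.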

Related

python ctypes structure pointers don't resolve as expected

I've built a ctypes interface to libxml2. The Python xmlDoc structure is:
class xmlDoc(ctypes.Structure):
    _fields_ = [
        ("_private",ctypes.c_void_p), # application data
        ("type",ctypes.c_uint16), # XML_DOCUMENT_NODE, must be second !
        ("name",ctypes.c_char_p), # name/filename/URI of the document
        ("children",ctypes.c_void_p), # the document tree
        ("last",ctypes.c_void_p), # last child link
        ("parent",ctypes.c_void_p), # child->parent link
        ("next",ctypes.c_void_p), # next sibling link
        ("prev",ctypes.c_void_p), # previous sibling link
        ("doc",ctypes.c_void_p), # autoreference to itself End of common part
        ("compression",ctypes.c_int), # level of zlib compression
        ("standalone",ctypes.c_int), # standalone document (no external refs) 1 if standalone="yes" 0 if sta
        ("intSubset",ctypes.c_void_p), # the document internal subset
        ("extSubset",ctypes.c_void_p), # the document external subset
        ("oldNs",ctypes.c_void_p), # Global namespace, the old way
        ("version",ctypes.c_char_p), # the XML version string
        ("encoding",ctypes.c_char_p), # external initial encoding, if any
        ("ids",ctypes.c_void_p), # Hash table for ID attributes if any
        ("refs",ctypes.c_void_p), # Hash table for IDREFs attributes if any
        ("URL",ctypes.c_char_p), # The URI for that document
        ("charset",ctypes.c_int), # Internal flag for charset handling, actually an xmlCharEncoding
        ("dict",ctypes.c_void_p), # dict used to allocate names or NULL
        ("psvi",ctypes.c_void_p), # for type/PSVI information
        ("parseFlags",ctypes.c_int), # set of xmlParserOption used to parse the document
        ("properties",ctypes.c_int), # set of xmlDocProperties for this document set at the end of parsing
    ]
The char* pointers all make sense, but the xmlNode* and xmlDoc* pointers don't: xmlDoc->doc should point to the same location as the document itself (from VS Code):

The solution was in my own code, which came from the ctypes templates. Effectively, the template casts a ctypes.c_void_p to a ctypes.POINTER() of a structure type, which in this case is my structure definition of xmlNode. The line of code is:
# perfect use of lambda
xmlNode = lambda x: ctypes.cast(x, ctypes.POINTER(LibXml.xmlNode))
And here is the fixed code:
def InsertChild(tree: QTreeWidget, item: QTreeWidgetItem, node: ctypes.c_void_p):
    cur = node.contents
    xmlNode = lambda x: ctypes.cast(x, ctypes.POINTER(LibXml.xmlNode))
    while cur:
        item.setText(0, cur.name.decode('utf-8'))
        # if cur.content: item.setText(1, cur.content.decode('utf-8'))
        item.setText(2, utils.PtrToHex(ctypes.addressof(cur)))
        if cur.children:
            child = QTreeWidgetItem(tree)
            item.addChild(child)
            InsertChild(tree, child, xmlNode(cur.children))
        if cur.next:
            cur = xmlNode(cur.next)
            item = QTreeWidgetItem(tree)
        else:
            cur = None
    return
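The cast itself can be tried without libxml2. The sketch below uses a made-up two-field Node structure (not the real xmlNode layout) whose link is stored as an untyped c_void_p, and shows how casting that address to POINTER(Node) makes the fields reachable through .contents. The names Node, as_node and walk_names are illustrative only.

```python
import ctypes


class Node(ctypes.Structure):
    """Hypothetical stand-in for xmlNode: a name plus an untyped link."""
    _fields_ = [
        ("name", ctypes.c_char_p),
        ("next", ctypes.c_void_p),  # raw address, like the c_void_p fields above
    ]


def as_node(address):
    """Cast a raw address (an int, or None for NULL) to a typed Node pointer."""
    return ctypes.cast(ctypes.c_void_p(address), ctypes.POINTER(Node))


def walk_names(first):
    """Follow the untyped 'next' links and collect each node's name."""
    names = []
    ptr = ctypes.pointer(first)
    while ptr:  # a NULL pointer is falsy
        node = ptr.contents  # dereference before reading fields
        names.append(node.name.decode("utf-8"))
        ptr = as_node(node.next)
    return names


# Build a two-element chain by hand; 'second' must stay alive while we walk.
second = Node(b"second", None)
first = Node(b"first", ctypes.addressof(second))
names = walk_names(first)
```

Reading a c_void_p field gives a plain integer (or None), which is why the debugger view shows addresses that cannot be expanded until they are cast to a typed pointer.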

Python script - Blogger2Wordpress - how to save file?

I use the blogger2wordpress Python script that Google released back in 2010 (https://code.google.com/archive/p/google-blog-converters-appengine/downloads) to convert a 95 MB Blogger export file to the WordPress WXR format.
However, the script has this code:
#!/usr/bin/env python
# Copyright 2008 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0.txt
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os.path
import logging
import re
import sys
import time
from xml.sax.saxutils import unescape
import BeautifulSoup
import gdata
from gdata import atom
import iso8601
import wordpress
__author__ = 'JJ Lueck (EMAIL#gmail.com)'
###########################
# Constants
###########################
BLOGGER_URL = 'http://www.blogger.com/'
BLOGGER_NS = 'http://www.blogger.com/atom/ns#'
KIND_SCHEME = 'http://schemas.google.com/g/2005#kind'
YOUTUBE_RE = re.compile('http://www.youtube.com/v/([^&]+)&?.*')
YOUTUBE_FMT = r'[youtube=http://www.youtube.com/watch?v=\1]'
GOOGLEVIDEO_RE = re.compile('(http://video.google.com/googleplayer.swf.*)')
GOOGLEVIDEO_FMT = r'[googlevideo=\1]'
DAILYMOTION_RE = re.compile('http://www.dailymotion.com/swf/(.*)')
DAILYMOTION_FMT = r'[dailymotion id=\1]'
###########################
# Translation class
###########################
class Blogger2Wordpress(object):
    """Performs the translation of a Blogger export document to WordPress WXR."""
    def __init__(self, doc):
        """Constructs a translator for a Blogger export file.
        Args:
          doc: The WXR file as a string
        """
        # Ensure UTF8 chars get through correctly by ensuring we have a
        # compliant UTF8 input doc.
        self.doc = doc.decode('utf-8', 'replace').encode('utf-8')
        # Read the incoming document as a GData Atom feed.
        self.feed = atom.FeedFromString(self.doc)
        self.next_id = 1
    def Translate(self):
        """Performs the actual translation to WordPress WXR export format.
        Returns:
          A WordPress WXR export document as a string, or None on error.
        """
        # Create the top-level document and the channel associated with it.
        channel = wordpress.Channel(
            title = self.feed.title.text,
            link = self.feed.GetAlternateLink().href,
            base_blog_url = self.feed.GetAlternateLink().href,
            pubDate = self._ConvertPubDate(self.feed.updated.text))
        posts_map = {}
        for entry in self.feed.entry:
            # Grab the information about the entry kind
            entry_kind = ""
            for category in entry.category:
                if category.scheme == KIND_SCHEME:
                    entry_kind = category.term
            if entry_kind.endswith("#comment"):
                # This entry will be a comment, grab the post that it goes to
                in_reply_to = entry.FindExtensions('in-reply-to')
                post_item = None
                # Check to see that the comment has a corresponding post entry
                if in_reply_to:
                    post_id = self._ParsePostId(in_reply_to[0].attributes['ref'])
                    post_item = posts_map.get(post_id, None)
                # Found the post for the comment, add the comment to it
                if post_item:
                    # The author email may not be included in the file
                    author_email = ''
                    if entry.author[0].email:
                        author_email = entry.author[0].email.text
                    # Same for the author's url
                    author_url = ''
                    if entry.author[0].uri:
                        author_url = entry.author[0].uri.text
                    post_item.comments.append(wordpress.Comment(
                        comment_id = self._GetNextId(),
                        author = entry.author[0].name.text,
                        author_email = author_email,
                        author_url = author_url,
                        date = self._ConvertDate(entry.published.text),
                        content = self._ConvertContent(entry.content.text)))
            elif entry_kind.endswith('#post'):
                # This entry will be a post
                post_item = self._ConvertEntry(entry, False)
                posts_map[self._ParsePostId(entry.id.text)] = post_item
                channel.items.append(post_item)
            elif entry_kind.endswith('#page'):
                # This entry will be a static page
                page_item = self._ConvertEntry(entry, True)
                posts_map[self._ParsePageId(entry.id.text)] = page_item
                channel.items.append(page_item)
        wxr = wordpress.WordPressWxr(channel=channel)
        return wxr.WriteXml()
    def _ConvertEntry(self, entry, is_page):
        """Converts the contents of an Atom entry into a WXR post Item element."""
        # A post may have an empty title, in which case the text element is None.
        title = ''
        if entry.title.text:
            title = entry.title.text
        # Check here to see if the entry points to a draft or regular post
        status = 'publish'
        if entry.control and entry.control.draft:
            status = 'draft'
        # If no link is present in the Blogger entry, just link
        if entry.GetAlternateLink():
            link = entry.GetAlternateLink().href
        else:
            link = BLOGGER_URL
        # Declare whether this is a post or a page
        post_type = 'post'
        if is_page:
            post_type = 'page'
        blogger_blog = ''
        blogger_permalink = ''
        if entry.GetAlternateLink():
            blogger_path_full = entry.GetAlternateLink().href.replace('http://', '')
            blogger_blog = blogger_path_full.split('/')[0]
            blogger_permalink = blogger_path_full[len(blogger_blog):]
        # Create the actual item element
        post_item = wordpress.Item(
            title = title,
            link = link,
            pubDate = self._ConvertPubDate(entry.published.text),
            creator = entry.author[0].name.text,
            content = self._ConvertContent(entry.content.text),
            post_id = self._GetNextId(),
            post_date = self._ConvertDate(entry.published.text),
            status = status,
            post_type = post_type,
            blogger_blog = blogger_blog,
            blogger_permalink = blogger_permalink,
            blogger_author = entry.author[0].name.text)
        # Convert the categories which specify labels into wordpress labels
        for category in entry.category:
            if category.scheme == BLOGGER_NS:
                post_item.labels.append(category.term)
        return post_item
    def _ConvertContent(self, text):
        """Unescapes the post/comment text body and replaces video content.
        All <object> and <embed> tags in the post that relate to video must be
        changed into the WordPress tags for embedding video,
        e.g. [youtube=http://www.youtube.com/...]
        If no text is provided, the empty string is returned.
        """
        if not text:
            return ''
        # First unescape all XML tags as they'll be escaped by the XML emitter
        content = unescape(text)
        # Use an HTML parser on the body to look for video content
        content_tree = BeautifulSoup.BeautifulSoup(content)
        # Find the object tag
        objs = content_tree.findAll('object')
        for obj_tag in objs:
            # Find the param tag within which contains the URL to the movie
            param_tag = obj_tag.find('param', { 'name': 'movie' })
            if not param_tag:
                continue
            # Get the video URL
            video = param_tag.attrMap.get('value', None)
            if not video:
                continue
            # Convert the video URL if necessary
            video = YOUTUBE_RE.subn(YOUTUBE_FMT, video)[0]
            video = GOOGLEVIDEO_RE.subn(GOOGLEVIDEO_FMT, video)[0]
            video = DAILYMOTION_RE.subn(DAILYMOTION_FMT, video)[0]
            # Replace the portion of the contents with the video
            obj_tag.replaceWith(video)
        return str(content_tree)
    def _ConvertPubDate(self, date):
        """Translates to a pubDate element's time/date format."""
        date_tuple = iso8601.parse_date(date)
        return date_tuple.strftime('%a, %d %b %Y %H:%M:%S %z')
    def _ConvertDate(self, date):
        """Translates to a wordpress date element's time/date format."""
        date_tuple = iso8601.parse_date(date)
        return date_tuple.strftime('%Y-%m-%d %H:%M:%S')
    def _GetNextId(self):
        """Returns the next identifier to use in the export document as a string."""
        next_id = self.next_id
        self.next_id += 1
        return str(next_id)
    def _ParsePostId(self, text):
        """Extracts the post identifier from a Blogger entry ID."""
        matcher = re.compile('post-(\d+)')
        matches = matcher.search(text)
        return matches.group(1)
    def _ParsePageId(self, text):
        """Extracts the page identifier from a Blogger entry ID."""
        matcher = re.compile('page-(\d+)')
        matches = matcher.search(text)
        return matches.group(1)
if __name__ == '__main__':
    if len(sys.argv) <= 1:
        print 'Usage: %s <blogger_export_file>' % os.path.basename(sys.argv[0])
        print
        print ' Outputs the converted WordPress export file to standard out.'
        sys.exit(-1)
    wp_xml_file = open(sys.argv[1])
    wp_xml_doc = wp_xml_file.read()
    translator = Blogger2Wordpress(wp_xml_doc)
    print translator.Translate()
    wp_xml_file.close()
This script outputs the WXR file to the terminal window, which is useless to me when the import file has tons of entries.
As I am not familiar with Python, how can I modify the script so that it writes the data to an .xml file?
Edit:
I changed the end of the script to:
wp_xml_file = open(sys.argv[1])
wp_xml_doc = wp_xml_file.read()
translator = Blogger2Wordpress(wp_xml_doc)
print translator.Translate()
fh = open("testoutput.xml", "w")
fh.write(wp_xml_doc);
fh.close();
wp_xml_file.close()
But the produced file is an "invalid wxr file" :/
Can anybody help? Thanks!

Quick and dirty answer:
Output to the stdout is normal behaviour.
You might want to redirect it to a file for instance:
python2 blogger2wordpress your_blogger_export_file > backup
The output will be saved in the file named backup.
Or you can replace print translator.Translate() with
with open('output_file', 'w') as fd:
    fd.write(translator.Translate())
This should do the trick (I haven't tried it).
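Note that the edit in the question writes wp_xml_doc, which is the Blogger input that was read from disk, so the output file is just a copy of the export and not WXR. The value to write is whatever Translate() returns. A small Python 3 sketch of that pattern follows; translate here is a made-up stand-in for Blogger2Wordpress(...).Translate(), and the file names are placeholders.

```python
import os
import tempfile


def translate(source_text):
    """Hypothetical stand-in for Blogger2Wordpress(source_text).Translate()."""
    return "<wxr>" + source_text + "</wxr>"


def convert_file(in_path, out_path):
    """Read the export, convert it, and write the converted text to out_path."""
    with open(in_path, encoding="utf-8") as src:
        source_text = src.read()
    converted = translate(source_text)
    with open(out_path, "w", encoding="utf-8") as dst:
        dst.write(converted)  # the converted text, not source_text
    return converted


# Tiny demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    in_path = os.path.join(workdir, "export.xml")
    out_path = os.path.join(workdir, "converted.xml")
    with open(in_path, "w", encoding="utf-8") as handle:
        handle.write("<feed/>")
    convert_file(in_path, out_path)
    with open(out_path, encoding="utf-8") as handle:
        written = handle.read()
```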

Get parent node?

I am writing a script that deletes unwanted objects from huge datasets by their ID prefix.
This is how the objects are structured:
<wfsext:Replace vendorId="AdV" safeToIgnore="false">
    <AX_Anschrift gml:id="DENWAEDA0000001G20161222T083308Z">
        <gml:identifier codeSpace="http://www.adv-online.de/">urn:adv:oid:DENWAEDA0000001G</gml:identifier>
        ...
    </AX_Anschrift>
    <ogc:Filter>
        <ogc:FeatureId fid="DENWAEDA0000001G20161222T083308Z" />
    </ogc:Filter>
</wfsext:Replace>
I would like to delete the full snippet, i.e. everything within <wfsext:Replace>...</wfsext:Replace>.
Here is a code snippet from my script:
file = etree.parse(portion_file)
root = file.getroot()
nsmap = root.nsmap.copy()
nsmap['adv'] = nsmap.pop(None)
node = root.xpath(".//adv:geaenderteObjekte/wfs:Transaction", namespaces=nsmap)[0]
for t in node:
    for obj in t:
        objecttype = str(etree.QName(obj.tag).localname)
        if objecttype == 'Filter':
            pass
        else:
            objid = (obj.xpath('@gml:id', namespaces=nsmap))[0][:16]
            if debug:
                print('{} - {}'.format(objid[:16], objecttype))
            if objid[:6] != prefix:
                #parent = obj.getparent()
                t.remove(obj)
The t.remove(obj) call removes <AX_Anschrift>..</AX_Anschrift> but not the rest of the object. I tried to get the parent node by using obj.getparent(), but this gives me an error. How can I get hold of the parent?

obj.getparent() is t, so you don't actually need to call getparent(). Simply remove the entire object with:
node.remove(t)
or, if you want to remove the entire wfs:Transaction,
node.getparent().remove(node)
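One thing to watch when removing inside a loop: deleting children from the element you are iterating over can cause siblings to be skipped, so it is safer to iterate over a copy such as list(node). The sketch below shows the idea with the standard library's ElementTree on a simplified, namespace-free document; the element names, the id attribute and the helper drop_foreign_wrappers are made up for illustration, and lxml's remove() is called the same way.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the wfs:Transaction / wfsext:Replace structure.
SAMPLE = """<Transaction>
  <Replace><Obj id="DENW000000000001"/><Filter/></Replace>
  <Replace><Obj id="DEBY000000000002"/><Filter/></Replace>
  <Replace><Obj id="DENW000000000003"/><Filter/></Replace>
</Transaction>"""


def drop_foreign_wrappers(transaction, prefix):
    """Remove every wrapper whose object id does not start with prefix."""
    removed = 0
    for wrapper in list(transaction):  # copy, so removal cannot skip siblings
        ids = [child.get("id") for child in wrapper if child.get("id")]
        if ids and not ids[0].startswith(prefix):
            transaction.remove(wrapper)  # drops the object and its Filter together
            removed += 1
    return removed


root = ET.fromstring(SAMPLE)
removed = drop_foreign_wrappers(root, "DENW")
remaining_ids = [wrapper[0].get("id") for wrapper in root]
```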

transferring rdf to 4store

I have a script named rdf.py that generates RDF. I would like to load its output directly into 4store. The entire generated RDF is stored in a variable, and I want to pass that variable straight to 4store. Is that possible?
The code of rdf.py is below.
rdf_code contains the entire RDF that is generated.
import rdflib
from rdflib.events import Dispatcher, Event
from rdflib.graph import ConjunctiveGraph as Graph
from rdflib import plugin
from rdflib.store import Store, NO_STORE, VALID_STORE
from rdflib.namespace import Namespace
from rdflib.term import Literal
from rdflib.term import URIRef
from tempfile import mkdtemp
from gstudio.models import *
from objectapp.models import *
from reversion.models import Version
from optparse import make_option
def get_nodetype(name):
    """
    returns the model the id belongs to.
    """
    try:
        """
        ALGO: get object id, go to version model, return for the given id.
        """
        node = NID.objects.get(title=str(name))
        # Retrieving only the relevant tupleset for the versioned objects
        vrs = Version.objects.filter(type=0 , object_id=node.id)
        # Returned value is a list, so splice it .
        vrs = vrs[0]
    except Error:
        return "The item was not found."
    return vrs.object._meta.module_name
def rdf_description(name, notation='xml' ):
    """
    Function takes title of node, and rdf notation.
    """
    valid_formats = ["xml", "n3", "ntriples", "trix"]
    default_graph_uri = "http://gstudio.gnowledge.org/rdfstore"
    configString = "/var/tmp/rdfstore"
    # Get the Sleepycat plugin.
    store = plugin.get('IOMemory', Store)('rdfstore')
    # Open previously created store, or create it if it doesn't exist yet
    graph = Graph(store="IOMemory",
                  identifier = URIRef(default_graph_uri))
    path = mkdtemp()
    rt = graph.open(path, create=False)
    if rt == NO_STORE:
        #There is no underlying Sleepycat infrastructure, create it
        graph.open(path, create=True)
    else:
        assert rt == VALID_STORE, "The underlying store is corrupt"
    # Now we'll add some triples to the graph & commit the changes
    # rdflib = Namespace('http://sbox.gnowledge.org/gstudio/')
    graph.bind("gstudio", "http://gnowledge.org/")
    exclusion_fields = ["id", "rght", "node_ptr_id", "image", "lft", "_state", "_altnames_cache", "_tags_cache", "nid_ptr_id", "_mptt_cached_fields"]
    node_type=get_nodetype(name)
    if (node_type=='gbobject'):
        node=Gbobject.objects.get(title=name)
    elif (node_type=='objecttype'):
        node=Objecttype.objects.get(title=name)
    elif (node_type=='metatype'):
        node=Metatype.objects.get(title=name)
    elif (node_type=='attributetype'):
        node=Attributetype.objects.get(title=name)
    elif (node_type=='relationtype'):
        node=Relationtype.objects.get(title=name)
    elif (node_type=='attribute'):
        node=Attribute.objects.get(title=name)
    elif (node_type=='complement'):
        node=Complement.objects.get(title=name)
    elif (node_type=='union'):
        node=Union.objects.get(title=name)
    elif (node_type=='intersection'):
        node=Intersection.objects.get(title=name)
    elif (node_type=='expression'):
        node=Expression.objects.get(title=name)
    elif (node_type=='processtype'):
        node=Processtype.objects.get(title=name)
    elif (node_type=='systemtype'):
        node=Systemtype.objects.get(title=name)
    node_url=node.get_absolute_url()
    site_add= node.sites.all()
    a = site_add[0]
    host_name =a.name
    #host_name=name
    link='http://'
    #Concatenating the above variables will give the url address.
    url_add=link+host_name+node_url
    rdflib = Namespace(url_add)
    # node=Objecttype.objects.get(title=name)
    node_dict=node.__dict__
    subject=str(node_dict['id'])
    for key in node_dict:
        if key not in exclusion_fields:
            predicate=str(key)
            pobject=str(node_dict[predicate])
            graph.add((rdflib[subject], rdflib[predicate], Literal(pobject)))
    rdf_code= graph.serialize(format=notation)
    # print out all the triples in the graph
    for subject, predicate, object in graph:
        print subject, predicate, object
    graph.commit()
    print rdf_code
    graph.close()
Can I pass rdf_code directly to 4store? If so, how?

The simplest way to do this is to transform that graph into ntriples and send it to http://yourhost:port/data/GRAPH_URI. If you do an HTTP POST then the triples will be appended to the existing graph represented by GRAPH_URI. If you do an HTTP PUT then the current graph will be replaced. If the graph does not exist then it will be created, no matter whether you POST or PUT.
Take this function as an example:
import urllib
import urllib2
def assert4s(data,epr,graph,contenttype,flush=False):
    try:
        # form-encode the graph URI, the triples and their MIME type
        params = urllib.urlencode({'graph': graph,
                                   'data': data,
                                   'mime-type' : contenttype })
        opener = urllib2.build_opener(urllib2.HTTPHandler)
        request = urllib2.Request(epr,params)
        # PUT replaces the graph, POST appends to it
        request.get_method = lambda: ('PUT' if flush else 'POST')
        url = opener.open(request)
        return url.read()
    except Exception, e:
        raise e
If you had the following data:
triples = """<a> <b> <c> .
<d> <e> <f> .
"""
You can do the following call:
assert4s(triples,
"http://yourhost:port/data/",
"http://some.org/graph/id",
"application/x-turtle")
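The function above is Python 2 (urllib2, "except Exception, e"). A rough Python 3 equivalent is sketched below. It only builds the request, so it can be inspected without a running server; the form field names are copied from assert4s above and have not been checked against a 4store instance here, and build_4store_request is a made-up helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_4store_request(data, endpoint, graph_uri, content_type, replace=False):
    """Build (but do not send) the HTTP request that assert4s would issue."""
    body = urlencode({
        "graph": graph_uri,
        "data": data,
        "mime-type": content_type,
    }).encode("utf-8")
    # PUT replaces the graph, POST appends to it (as described above).
    method = "PUT" if replace else "POST"
    return Request(endpoint, data=body, method=method)


triples = "<a> <b> <c> .\n<d> <e> <f> .\n"
request = build_4store_request(
    triples,
    "http://localhost:8000/data/",   # placeholder host and port
    "http://some.org/graph/id",
    "application/x-turtle",
)
# To actually send it: urllib.request.urlopen(request).read()
```

Passing the serialized graph (rdf_code in the question, serialized as ntriples) as the data argument is all that is needed; nothing has to be written to a file first.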
Edit
My previous answer assumed you were using the 4s-httpd server. You can start the SPARQL server in 4store with the command 4s-httpd -p PORT kb_name. Once it is running, you can use the following services:
http://localhost:port/sparql/ to submit queries.
http://localhost:port/data/ to PUT or POST data files.
http://localhost:port/update/ to submit SPARQL update queries.
The 4store SPARQLServer documentation is quite complete.

XML Parsing in Python using document builder factory

I am working with STAF and STAX, where Python is used for scripting, and I am new to Python.
My task is to parse an XML file in Python using DocumentBuilderFactory.
The XML file I am trying to parse is:
<?xml version="1.0" encoding="utf-8"?>
<operating_system>
<unix_80sp1>
<tests type="quick_sanity_test">
<prerequisitescript>preparequicksanityscript</prerequisitescript>
<acbuildpath>acbuildpath</acbuildpath>
<testsuitscript>test quick sanity script</testsuitscript>
<testdir>quick sanity dir</testdir>
</tests>
<machine_name>u80sp1_L004</machine_name>
<machine_name>u80sp1_L005</machine_name>
<machine_name>xyz.pxy.dxe.cde</machine_name>
<vmware id="155.35.3.55">144.35.3.90</vmware>
<vmware id="155.35.3.56">144.35.3.91</vmware>
</unix_80sp1>
</operating_system>
I need to read all the tags.
The machine_name tags should be read into a list named machname, so after reading the tags machname should be [u80sp1_L004, u80sp1_L005, xyz.pxy.dxe.cde].
I also need all the vmware tags:
all attributes should end up in vmware_attr = [155.35.3.55, 155.35.3.56]
all values should end up in vmware_value = [144.35.3.90, 144.35.3.91]
I am able to read all tags properly except the vmware and machine_name tags.
I am using the following code (I am new to XML and VMware), which needs to be modified. Any help is appreciated.
factory = DocumentBuilderFactory.newInstance()
factory.setValidating(1)
factory.setIgnoringElementContentWhitespace(0)
builder = factory.newDocumentBuilder()
document = builder.parse(xmlFileName)
vmware_value = None
vmware_attr = None
machname = None
# Get the text value for the element with tag name "vmware"
nodeList = document.getElementsByTagName("vmware")
for i in range(nodeList.getLength()):
    node = nodeList.item(i)
    if node.getNodeType() == Node.ELEMENT_NODE:
        children = node.getChildNodes()
        for j in range(children.getLength()):
            thisChild = children.item(j)
            if (thisChild.getNodeType() == Node.TEXT_NODE):
                vmware_value = thisChild.getNodeValue()
                # vmware_attr = ??? what method to use?
# Get the text value for the element with tag name "machine_name"
nodeList = document.getElementsByTagName("machine_name")
for i in range(nodeList.getLength()):
    node = nodeList.item(i)
    if node.getNodeType() == Node.ELEMENT_NODE:
        children = node.getChildNodes()
        for j in range(children.getLength()):
            thisChild = children.item(j)
            if (thisChild.getNodeType() == Node.TEXT_NODE):
                machname = thisChild.getNodeValue()
Also, how can I check whether a tag exists at all? I need to code the parsing properly.

You need to initialise vmware_value, vmware_attr and machname as lists rather than as single values, so instead of this:
vmware_value = None
vmware_attr = None
machname = None
do this:
vmware_value = []
vmware_attr = []
machname = []
Then, to add items to the list, use the append method on your lists. E.g.:
factory = DocumentBuilderFactory.newInstance()
factory.setValidating(1)
factory.setIgnoringElementContentWhitespace(0)
builder = factory.newDocumentBuilder()
document = builder.parse(xmlFileName)
vmware_value = []
vmware_attr = []
machname = []
# Get the text value for the element with tag name "vmware"
nodeList = document.getElementsByTagName("vmware")
for i in range(nodeList.getLength()):
    node = nodeList.item(i)
    vmware_attr.append(node.attributes["id"].value)
    if node.getNodeType() == Node.ELEMENT_NODE:
        children = node.getChildNodes()
        for j in range(children.getLength()):
            thisChild = children.item(j)
            if (thisChild.getNodeType() == Node.TEXT_NODE):
                vmware_value.append(thisChild.getNodeValue())
I've also edited the code to something I think should work to append the correct values to vmware_attr and vmware_value.
I had to make the assumption that STAX uses xml.dom syntax, so if that isn't the case, you will have to edit my suggestion appropriately.
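For reference, here is how the same extraction might look with CPython's xml.dom.minidom, which is the API the answer above assumes. The question's DocumentBuilderFactory is the Java DOM reached through Jython, so method names may differ slightly there (for example getChildNodes() versus the childNodes attribute); getElementsByTagName and getAttribute exist in both. The helper names text_of, has_tag and read_machines_and_vmware are made up for this sketch.

```python
from xml.dom import minidom

# Trimmed copy of the question's document (XML declaration omitted).
SAMPLE_XML = """<operating_system>
  <unix_80sp1>
    <machine_name>u80sp1_L004</machine_name>
    <machine_name>u80sp1_L005</machine_name>
    <machine_name>xyz.pxy.dxe.cde</machine_name>
    <vmware id="155.35.3.55">144.35.3.90</vmware>
    <vmware id="155.35.3.56">144.35.3.91</vmware>
  </unix_80sp1>
</operating_system>"""


def text_of(node):
    """Concatenate the direct text children of an element."""
    parts = [child.data for child in node.childNodes
             if child.nodeType == child.TEXT_NODE]
    return "".join(parts).strip()


def has_tag(document, tag_name):
    """True if at least one element with this tag name exists."""
    return len(document.getElementsByTagName(tag_name)) > 0


def read_machines_and_vmware(xml_text):
    """Return (machname, vmware_attr, vmware_value) as three lists."""
    document = minidom.parseString(xml_text)
    machname = [text_of(n) for n in document.getElementsByTagName("machine_name")]
    vmware_nodes = document.getElementsByTagName("vmware")
    vmware_attr = [n.getAttribute("id") for n in vmware_nodes]
    vmware_value = [text_of(n) for n in vmware_nodes]
    return machname, vmware_attr, vmware_value


machname, vmware_attr, vmware_value = read_machines_and_vmware(SAMPLE_XML)
document = minidom.parseString(SAMPLE_XML)
```

The has_tag helper covers the "does a tag exist at all" part of the question: getElementsByTagName returns an empty list when nothing matches, so checking its length is enough.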
