Extracting Data from Mysql XML dump with xml.dom.minidom


I exported a MySQL database to XML with phpMyAdmin and now I would like to parse it with minidom, but I'm having trouble getting the content into the form I need.
Summary: I need to assign the variable title to the text contained within <column name="news_title">This is the title</column>
The exported db looks like this:
<pma_xml_export version="1.0" >
    <database name="dbname">
        <!-- Table newsbox -->
        <table name="newsbox">
            <column name="news_id">1</column>
            <column name="news_title">This is the title</column>
            <column name="news_text">This is the news text</column>
            <column name="date">Thu, 28 Feb 2008 20:10:30 -0500</column>
            <column name="author">author</column>
            <column name="category">site_announcement</column>
        </table>
    </database>
</pma_xml_export>
I am able to extract the text with the following script but it's not in the form that I need:
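from xml.dom.minidom import parseString
# document holds the exported XML as a string
doc = parseString(document)
pmaexport = doc.getElementsByTagName("pma_xml_export")[0]
columns = pmaexport.getElementsByTagName("column")
for item in columns:
    # the column name lives in the "name" attribute, the value in the text node
    name = item.getAttribute("name")
    text = item.firstChild.data.strip()
    print(name, text)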
What I need is something where I can assign the text contents of these elements to variables which can be passed on e.g.,
for item in columns:
    title = ???
    text = ???
    date = ???
    author = ???
If the db output was in the form of <title>Here's the Title</title> I would have plenty of examples to go off, but I just can't find any reference to something like <column name="news_title">This is the title</column>

It's been a while since I've used xml.dom.minidom but this should work...
columns = [c.firstChild.data for c in pmaexport.getElementsByTagName('column') if c.getAttribute('name') == 'news_title']
Plus, like, list comprehension!
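To get all four variables at once rather than one list per column name, one option is to build a dictionary per table that maps each column's name attribute to its text. This is only a sketch of that idea: the names document, rows and row are mine, and the sample XML is a shortened copy of the export from the question.

```python
from xml.dom.minidom import parseString

# Shortened copy of the phpMyAdmin export from the question
document = """<pma_xml_export version="1.0">
<database name="dbname">
<table name="newsbox">
<column name="news_id">1</column>
<column name="news_title">This is the title</column>
<column name="news_text">This is the news text</column>
<column name="date">Thu, 28 Feb 2008 20:10:30 -0500</column>
<column name="author">author</column>
</table>
</database>
</pma_xml_export>"""

doc = parseString(document)
rows = []
for table in doc.getElementsByTagName("table"):
    # Map each column's "name" attribute to its text content
    row = {}
    for column in table.getElementsByTagName("column"):
        first = column.firstChild  # None for an empty <column/>
        row[column.getAttribute("name")] = first.data.strip() if first else ""
    rows.append(row)

for row in rows:
    title = row["news_title"]
    text = row["news_text"]
    date = row["date"]
    author = row["author"]
    print(title, text, date, author)
```

Each table element in a phpMyAdmin export is one database row, so rows ends up with one dictionary per record.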

Related

Increment list indexes to get correct values to be updated in XML data based on Title

List elements to be appended in XML data:
Sorted_TestSpecID: [10860972, 10860972, 10860972, 10860972, 10860972]
Sorted_TestCaseID: [16961435, 16961462, 16961739, 16961741, 16961745]
Sorted_TestText : ['SIG1', 'SIG2', 'SIG3', 'Signal1', 'Signal2']
original xml data:
<tc>
<title>Signal1</title>
<tcid>2c758925-dc3d-4b1d-a5e2-e0ca54c52a47</tcid>
<attributes>
<attr>
<key>TestSpec ID</key>
<value>0</value>
</attr>
<attr>
<key>TestCase ID</key>
<value>0</value>
</attr>
</attributes>
</tc>
I am trying to write a Python script that will:
Search for the title Signal1 (from Sorted_TestText) in the XML data
Then find the key TestCase ID and update its value to the corresponding 16961741
Then find the key TestSpec ID and update its value to the corresponding 10860972
from bs4 import BeautifulSoup
soup = BeautifulSoup(xml_data, 'xml')
for tc in soup.find_all('tc'):
    for title, spec, case in zip(Sorted_TestText, Sorted_TestSpecID, Sorted_TestCaseID):
        if tc.find('title').text == title:
            for attr in tc.find_all('attr'):
                if attr.find('key').text == "TestSpec ID":
                    attr.find('value').text = str(spec)
                if attr.find('key').text == "TestCase ID":
                    attr.find('value').text = str(case)
print(soup)
I've tried the script above, but it does not update spec and case based on the title; it only works if spec, case and title are in the same order as in the XML. My intention was for the script to look up the title and then update its respective attributes. Say 'SIG1', 'SIG2' and 'SIG3' are not present in my XML: I want to update Signal1 with spec 10860972 and case 16961741, but the script instead writes spec 10860972 and case 16961435. The spec and case lists need to be traversed together with the titles. I tried, but had no luck. Any support is appreciated. Thanks in advance.

I'd use a dictionary where keys are titles and values are TestCaseIDs and TestSpecIDs.
Then, to change the contents of <value> use .string instead of .text:
dct = {
    c: (str(a), str(b))
    for a, b, c in zip(Sorted_TestSpecID, Sorted_TestCaseID, Sorted_TestText)
}
for tc in soup.select("tc"):
    title = tc.title.get_text(strip=True)
    if title not in dct:
        continue
    val = tc.select_one('attr:has(key:-soup-contains("TestSpec ID")) value')
    if val:
        val.string = str(dct[title][0])
    val = tc.select_one('attr:has(key:-soup-contains("TestCase ID")) value')
    if val:
        val.string = str(dct[title][1])
print(soup.prettify())
Prints:
<?xml version="1.0" encoding="utf-8"?>
<tc>
 <title>
  Signal1
 </title>
 <tcid>
  2c758925-dc3d-4b1d-a5e2-e0ca54c52a47
 </tcid>
 <attributes>
  <attr>
   <key>
    TestSpec ID
   </key>
   <value>
    10860972
   </value>
  </attr>
  <attr>
   <key>
    TestCase ID
   </key>
   <value>
    16961741
   </value>
  </attr>
 </attributes>
</tc>
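If BeautifulSoup is not available, the same title-keyed lookup can be sketched with the standard library's xml.etree.ElementTree instead. This swaps in a different parser from the one used in the answer above; the names updates and new_values are mine, and the XML is a compacted copy of the sample from the question.

```python
import xml.etree.ElementTree as ET

Sorted_TestSpecID = [10860972, 10860972, 10860972, 10860972, 10860972]
Sorted_TestCaseID = [16961435, 16961462, 16961739, 16961741, 16961745]
Sorted_TestText = ['SIG1', 'SIG2', 'SIG3', 'Signal1', 'Signal2']

xml_data = """<tc>
<title>Signal1</title>
<tcid>2c758925-dc3d-4b1d-a5e2-e0ca54c52a47</tcid>
<attributes>
<attr><key>TestSpec ID</key><value>0</value></attr>
<attr><key>TestCase ID</key><value>0</value></attr>
</attributes>
</tc>"""

# title -> {key text: new value}, so the lookup no longer depends on list order
updates = {
    title: {"TestSpec ID": str(spec), "TestCase ID": str(case)}
    for title, spec, case in zip(Sorted_TestText, Sorted_TestSpecID, Sorted_TestCaseID)
}

root = ET.fromstring(xml_data)
# iter("tc") also yields the root itself when the root is a <tc> element
for tc in root.iter("tc"):
    new_values = updates.get((tc.findtext("title") or "").strip())
    if new_values is None:
        continue  # this title is not in the lists, leave it untouched
    for attr in tc.iter("attr"):
        key = (attr.findtext("key") or "").strip()
        value = attr.find("value")
        if key in new_values and value is not None:
            value.text = new_values[key]

result = ET.tostring(root, encoding="unicode")
print(result)
```

Because the dictionary is keyed by title, titles such as SIG1 that are missing from the XML are simply never looked up.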

ElementTree, .set() and iteration

This is my first post on Stack Overflow and I am a novice programmer.
I am having trouble using ElementTree and the .set() method. Using an f-string I am able to assign recipe_id with the correct number.
When I try to set the recipe_name attribute, it returns only the last element in the name_list array. I'm a bit lost! I'm sure it's something in my syntax or just my understanding of how I'm actually iterating through the items...I just don't understand because the recipe_id portion works just fine.
Expected output (within the XML)
<recipes recipe_id="1" recipe_name="Apples">
<recipes recipe_id="2" recipe_name="Oranges">
Instead I get:
<recipes recipe_id="1" recipe_name="Oranges">
<recipes recipe_id="2" recipe_name="Oranges">
My code:
#!/usr/bin/env python3
import os
import os.path
import xml.etree.ElementTree as ET
filename = "my_recipes.xml"
xmlTree = ET.parse(filename)
root = xmlTree.getroot()
#change the recipe id in recipes
i = 0
for element in root.iter("recipes"):
    i += 1
    element.set('recipe_id', f"{i}")
#for every tbody in the tree
name_list = []
for tbody in root.iter("tbody"):
    #for every time you find a row
    for row in tbody.findall('row'):
        data = row.find('entry').text
        #get those rows attribs
        rec_name = row.find('entry').attrib
        #if the row is the row that i want (contains the recipe name)...i couldn't figure out a better way to get this value precisely
        if rec_name == {'namest': 'c1', 'nameend': 'c2', 'align': 'left', 'valign': 'bottom'}:
            #yoink name and stick it in name_list
            name_list.append(data)
for recipes in root.findall('recipes'):
    for i in range(len(name_list)):
        recipes.set('recipe_name', F"{name_list[i]}")
xmlTree.write(filename, encoding='UTF-8', xml_declaration=True)
My XML:
<?xml version='1.0' encoding='UTF-8'?>
<Root>
    <recipes>
        <tbody>
            <row>
                <entry namest="c1" nameend="c2" align="left" valign="bottom">Apples</entry>
            </row>
        </tbody>
    </recipes>
    <recipes>
        <tbody>
            <row>
                <entry namest="c1" nameend="c2" align="left" valign="bottom">Oranges</entry>
            </row>
        </tbody>
    </recipes>
</Root>
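FIXED:
The inner loop over name_list ran to completion for every recipes element, so each one ended up with the last name. The code should be
for i, recipes in enumerate(root.findall('recipes')):
    recipes.set('recipe_name', name_list[i])
and I am just stupid.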
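A variation on the fix, as a sketch only: instead of collecting names into a separate list and pairing them up by index, each recipes element can look up its own entry, so ids and names cannot drift apart. The XML is inlined here so the example runs on its own, and matching on namest="c1" alone is my assumption about what identifies the name row.

```python
import xml.etree.ElementTree as ET

# Inline copy of the XML from the question, so no file is needed
xml_source = """<Root>
<recipes><tbody><row>
<entry namest="c1" nameend="c2" align="left" valign="bottom">Apples</entry>
</row></tbody></recipes>
<recipes><tbody><row>
<entry namest="c1" nameend="c2" align="left" valign="bottom">Oranges</entry>
</row></tbody></recipes>
</Root>"""

root = ET.fromstring(xml_source)
for number, recipe in enumerate(root.findall("recipes"), start=1):
    recipe.set("recipe_id", str(number))
    # Find the name entry inside this recipe only (assumed to be the c1 entry)
    entry = recipe.find("./tbody/row/entry[@namest='c1']")
    if entry is not None:
        recipe.set("recipe_name", entry.text)

names = [recipe.get("recipe_name") for recipe in root.findall("recipes")]
print(names)
```

The attribute predicate replaces the comparison against the full attrib dictionary from the original script.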

ElementTree wrong encoding

I've been searching for hours but can't find a solution online, so I'm asking here.
I just want to print the inner content of an HTML tag in an XML document, but I only get things like &lt;, &gt; and so on.
It looks like this in the XML document:
<data table="tt_content" elementUid="2490" key="tt_content:NEW/1/2490:bodytext"><![CDATA[<img src="/fileadmin/public/Redaktion/Bilder/Icons/Icon-CE.png" width="28" height="21" class="float-left mt-1 mr-2">
<h4>EU-Baumusterprüfbescheinigung</h4>
When I print it, it looks like this:
<data table="tt_content" elementUid="2490" key="tt_content:NEW/1/2490:bodytext">&lt;img src="/fileadmin/public/Redaktion/Bilder/Icons/Icon-CE.png" width="28" height="21" class="float-left mt-1 mr-2"&gt;
&lt;h4&gt;EU-Baumusterpr&#252;fbescheinigung&lt;/h4&gt;
As you can see, it is very different: not only are the German characters not displayed, but the "CDATA" section, which is very important to me, is gone as well.
The markup characters are replaced with &lt; and so on.
And now to my Code
# data is the <data> element shown above (key="tt_content:NEW/1/2490:bodytext")
raw = ET.tostring(data).decode()
print(raw)  # prints the output shown above
What I've also tried
# raw = ET.tostring(raw, encoding='unicode', method='xml')
# raw = ET.tostring(raw, encoding='unicode', method='xml')
First I iterate to the position of the data element I showed above:
def copy_content():
    for pageGrp in root.findall('pageGrp'):
        for data in pageGrp.iter('data'):
            tag = data.get("key").split(":")[2]
            if tag == "bodytext":
                raw = ET.tostring(data).decode()  # it starts here
                # ET.dump(data)
                # print(raw)
                # file = open('new.xml', 'a')
                # file.write(raw)
                print(raw)
I hope you can help me. Thanks in advance.
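One likely explanation, offered as a sketch rather than a confirmed answer: ElementTree's parser folds a CDATA section into ordinary text, and ET.tostring() then re-serialises the element, escaping markup characters in that text. Reading the element's .text attribute gives the inner content unescaped. The XML below is a shortened stand-in for the real file, and the names xml_source, serialized and body are mine.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the real document; the CDATA section is closed here
xml_source = """<pageGrp><data table="tt_content" elementUid="2490" key="tt_content:NEW/1/2490:bodytext"><![CDATA[<img src="/x/Icon-CE.png" width="28" height="21">
<h4>EU-Baumusterprüfbescheinigung</h4>]]></data></pageGrp>"""

root = ET.fromstring(xml_source)
data = root.find("data")

# tostring() serialises the element again, so < and > in the text become &lt; and &gt;
serialized = ET.tostring(data, encoding="unicode")

# .text is the parsed character data: the CDATA wrapper is gone, the content is intact
body = data.text
print(body)
```

With encoding="unicode" the umlaut is kept as a character; the default ASCII encoding of tostring() would write it as a numeric character reference instead. If the CDATA wrapper itself has to survive a round trip, it would need to be added back by hand, since ElementTree does not keep track of it.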

How to parse XML grouped by specific tag id

I have the following XML file and I would like to group its contents by Table Id.
xml = """
<Tables Count="19">
    <Table Id="1" >
        <Data>
            <Cell>
                <Brush/>
                <Text>AA</Text>
                <Text>BB</Text>
            </Cell>
        </Data>
    </Table>
    <Table Id="2" >
        <Data>
            <Cell>
                <Brush/>
                <Text>CC</Text>
                <Text>DD</Text>
            </Cell>
        </Data>
    </Table>
</Tables>
"""
I would like to parse it and get something like this.
I have tried the code below but couldn't figure it out.
from lxml import etree
tree = etree.fromstring(xml)
users = {}
for user in tree.xpath("//Tables"):
    name = user.xpath("Table")[0].text
    users[name] = []
    for group in user.xpath("Data/Cell/Text"):
        users[name].append(group.text)
print (users)
Is that possible to get the above result? if so, could anyone help me to do this? I really appreciate your effort.

You need to change your xpath queries to:
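from lxml import etree
tree = etree.fromstring(xml)
users = {}
# changed: select each Table element instead of the Tables root
for user in tree.xpath("//Tables/Table"):
    name = user.attrib['Id']
    users[name] = []
    # changed: the leading .// makes the query relative to the current Table
    for group in user.xpath(".//Data/Cell/Text"):
        users[name].append(group.text)
print (users)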
...and use the attrib dictionary.
This yields for your string:
{'1': ['AA', 'BB'], '2': ['CC', 'DD']}
If you're into "one-liners", you could even do:
users = {name: [group.text for group in user.xpath(".//Data/Cell/Text")]
         for user in tree.xpath("//Tables/Table")
         for name in [user.attrib["Id"]]}
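For comparison, the same grouping can be sketched with the standard library's xml.etree.ElementTree, which supports only a subset of XPath but is enough for this document. This swaps lxml for the standard library; the name grouped is mine.

```python
import xml.etree.ElementTree as ET

xml = """
<Tables Count="19">
    <Table Id="1" >
        <Data><Cell><Brush/><Text>AA</Text><Text>BB</Text></Cell></Data>
    </Table>
    <Table Id="2" >
        <Data><Cell><Brush/><Text>CC</Text><Text>DD</Text></Cell></Data>
    </Table>
</Tables>
"""

root = ET.fromstring(xml)  # root is the <Tables> element
grouped = {
    # key: the Id attribute, value: the text of every Text element in that table
    table.get("Id"): [text.text for text in table.iterfind("./Data/Cell/Text")]
    for table in root.iterfind("Table")
}
print(grouped)
```

Because the path is evaluated relative to each Table element, the Text values stay grouped under their own Id.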

Replacing a custom "HTML" tag in a Python string

I want to be able to include a custom "HTML" tag in a string, such as: "This is a <photo id="4" /> string".
In this case the custom tag is <photo id="4" />. I would also be fine changing this custom tag to be written differently if it makes it easier, ie [photo id:4] or something.
I want to be able to pass this string to a function that will extract the tag <photo id="4" />, and allow me to transform this to some more complicated template like <div class="photo"><img src="...." alt="..."></div>, which I can then use to replace the tag in the original string.
I'm imagining it working something like this:
>>> content = 'This is a <photo id="4" /> string'
# Pass the string to a function that returns all the tags with the given name.
>>> tags = parse_tags('photo', content)
>>> print(tags)
[{'tag': 'photo', 'id': 4, 'raw': '<photo id="4" />'}]
# Now that I know I need to render a photo with ID 4, so I can pass that to some sort of template thing
>>> rendered = render_photo(id=tags[0]['id'])
>>> print(rendered)
<div class="photo"><img src="...." alt="..."></div>
>>> content = content.replace(tags[0]['raw'], rendered)
>>> print(content)
This is a <div class="photo"><img src="...." alt="..."></div> string
I think this is a fairly common pattern, for something like putting a photo in a blog post, so I'm wondering if there is a library out there that will do something similar to the example parse_tags function above. Or do I need to write it?
This example of the photo tag is just a single example. I would want to have tags with different names. As a different example, maybe I have a database of people and I want a tag like <person name="John Doe" />. In that case the output I want is something like {'tag': 'person', 'name': 'John Doe', 'raw': '<person name="John Doe" />'}. I can then use the name to look that person up and return a rendered template of the person's vcard or something.

If you're working with HTML5, I would suggest looking into the xml module (etree). It will allow you to parse the whole document into a tree structure and manipulate tags individually (and then turn the result back into an HTML document).
You could also use regular expressions to perform text substitutions. This would likely be faster than loading an XML tree structure if you don't have too many changes to make.
import re
text = """<html><body>some text <photo> and tags <photo id="4"> more text <person name="John Doe"> yet more text"""
tags = ["photo","person","abc"]
patterns = "|".join([ f"(<{tag} .*?>)|(<{tag}>)" for tag in tags ])
matches = list(re.finditer(patterns,text))
for match in reversed(matches):
    tag = text[match.start():match.end()]
    print(match.start(),match.end(),tag)
    # substitute what you need for that tag
    text = text[:match.start()] + "***" + text[match.end():]
print(text)
This will be printed:
64 88 <person name="John Doe">
39 53 <photo id="4">
22 29 <photo>
<html><body>some text *** and tags *** more text *** yet more text
Performing the replacements in reverse order ensures that the ranges found by finditer() remain valid as the text changes with the substitutions.
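A related approach, sketched here under my own assumptions, is to let re.sub() call a function for each match, which avoids the manual index bookkeeping. The regular expressions, the render function and the output templates below are illustrative and not taken from the question; the pattern only handles double-quoted attributes on known tag names.

```python
import re

# Matches <photo ...> or <person ...>, with an optional self-closing slash
TAG_RE = re.compile(r'<(?P<tag>photo|person)\b(?P<attrs>[^>]*?)/?>')
# Matches name="value" pairs inside the tag
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def render(match):
    """Return the replacement markup for one matched custom tag."""
    attrs = dict(ATTR_RE.findall(match.group("attrs")))
    if match.group("tag") == "photo":
        photo_id = attrs.get("id", "")
        # Hypothetical template; a real one would look the photo up by id
        return '<div class="photo"><img src="/img/{0}.jpg" alt="photo {0}"></div>'.format(photo_id)
    return '<span class="person">{}</span>'.format(attrs.get("name", ""))

content = 'This is a <photo id="4" /> string by <person name="John Doe" />'
result = TAG_RE.sub(render, content)
print(result)
```

re.sub() builds the new string itself, so the replacements cannot invalidate each other's positions.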

For this kind of "surgical" parsing (where you want to isolate specific tags instead of creating a full hierarchical document), pyparsing's makeHTMLTags method can be very useful.
See the annotated script below, showing the creation of the parser, and using it for parseTag and replaceTag methods:
import pyparsing as pp
def make_tag_parser(tag):
    # makeHTMLTags returns 2 parsers, one for the opening tag and one for the
    # closing tag - we only need the opening tag; the parser will return parsed
    # fields of the tag itself
    tag_parser = pp.makeHTMLTags(tag)[0]
    # instead of returning parsed bits of the tag, use originalTextFor to
    # return the raw tag as token[0] (specifying asString=False will retain
    # the parsed attributes and tag name as attributes)
    parser = pp.originalTextFor(tag_parser, asString=False)
    # add one more callback to define the 'raw' attribute, copied from t[0]
    def add_raw_attr(t):
        t['raw'] = t[0]
    parser.addParseAction(add_raw_attr)
    return parser
# parseTag to find all the matches and report their attributes
def parseTag(tag, s):
    return make_tag_parser(tag).searchString(s)
content = """This is a <photo id="4" /> string"""
tag_matches = parseTag("photo", content)
for match in tag_matches:
    print(match.dump())
    print("raw: {!r}".format(match.raw))
    print("tag: {!r}".format(match.tag))
    print("id: {!r}".format(match.id))
# transform tag to perform tag->div transforms
def replaceTag(tag, transform, s):
    parser = make_tag_parser(tag)
    # add one more parse action to do transform
    parser.addParseAction(lambda t: transform.format(**t))
    return parser.transformString(s)
print(replaceTag("photo",
                 '<div class="{tag}"><img src="<src_path>/img_{id}.jpg." alt="{tag}_{id}"></div>',
                 content))
Prints:
['<photo id="4" />']
- empty: True
- id: '4'
- raw: '<photo id="4" />'
- startPhoto: ['photo', ['id', '4'], True]
  [0]:
    photo
  [1]:
    ['id', '4']
  [2]:
    True
- tag: 'photo'
raw: '<photo id="4" />'
tag: 'photo'
id: '4'
This is a <div class="photo"><img src="<src_path>/img_4.jpg." alt="photo_4"></div> string
