Parse XML and Re-write the Filename Using an XML Element - python

I am trying to parse an XML and re-name the original XML using one of its child elements, specifically as a prefix for the filename of an XML to be overwritten. In the sample XML below, I want to extract the "to" element and insert its name "Tove" into a newly written XML filename. If the original file was named "reminder.xml", could the name "Tove" be parsed and inserted into a newly written file called "Tove_reminder.xml"? Is this possible with XMLs?
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>
It seems that Python has more flexibility extracting text and strings in other file formats, but I cannot find much that pertains to XML. Any help is most appreciated!

You can use beautifulsoup4 to extract attributes and inner text from an XML document.
First, install beautifulsoup4:
pip install beautifulsoup4
Then read the file into a variable named xml_text and parse it:
from bs4 import BeautifulSoup
file_name = "reminder.xml"
xml_file = open(file_name, 'r')
xml_text = xml_file.read()
xml_file.close()
soup = BeautifulSoup(xml_text, "html.parser")
To extract the text from a tag, you can then use
to = soup.find("to")
name = to.text #contains Tove now
Finally, you can use the "name" variable to save the file
file_name = name + "_" + file_name
xml_file = open(file_name, "w")
xml_file.write(xml_text)
xml_file.close()
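The same steps can also be done with the standard library alone. The following is a minimal sketch, assuming the element of interest is a direct child of the root (as `<to>` is in the sample). The function name `copy_with_prefix` and its `tag` parameter are made up for illustration. It copies the file byte for byte instead of re-writing the text, so the original encoding declaration stays valid.

```python
import os
import shutil
import xml.etree.ElementTree as ET


def copy_with_prefix(path, tag="to"):
    """Copy `path` to a new file whose name is prefixed with the text of `tag`.

    Hypothetical helper: assumes `tag` is a direct child of the root element.
    """
    root = ET.parse(path).getroot()
    # findtext returns None when the element is missing
    prefix = root.findtext(tag)
    if not prefix or not prefix.strip():
        raise ValueError("no <%s> text found in %s" % (tag, path))
    directory, base = os.path.split(path)
    new_path = os.path.join(directory, prefix.strip() + "_" + base)
    # Copy the bytes unchanged so the declared encoding still matches the content
    shutil.copyfile(path, new_path)
    return new_path
```

Calling `copy_with_prefix("reminder.xml")` on the sample file should create `Tove_reminder.xml` next to it. If the extracted text may contain characters that are not valid in filenames, it would need sanitizing first.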

Related

Get xpath from html file using LXML - Python

I am learning how to parse documents using lxml. To do so, I'm trying to parse my linkedin page. It has plenty of information and I thought it would be a good training.
Enough with the context. Here is what I'm doing:
going to the url: https://www.linkedin.com/in/NAME/
opening the source code and saving it as "linkedin.html"
as I'm trying to extract my current job, I'm doing the following:
from io import StringIO, BytesIO
from lxml import html, etree
# read file
filename = 'linkedin.html'
file = open(filename).read()
# building parser
parser = etree.HTMLParser()
tree = etree.parse(StringIO(file), parser)
# parse an element
title = tree.xpath('/html/body/div[6]/div[4]/div[3]/div/div/div/div/div[2]/main/div[1]/section/div[2]/div[2]/div[1]/h2')
print(title)
The tree variable's type is
But it always returns an empty list for my variable title.
I've been trying all day but still don't understand what I'm doing wrong.
I found the answer to my problem: adding an encoding parameter to the open() call.
Here is what I did:
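def parse_html_file(filename):
    # Read the file as UTF-8 explicitly instead of relying on the platform default
    with open(filename, encoding="utf8") as f:
        text = f.read()
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(text), parser)
    return tree
tree = parse_html_file('linkedin.html')
# Attribute predicates in XPath use @, e.g. [@class="..."]
name = tree.xpath('//li[@class="inline t-24 t-black t-normal break-words"]')
print(name[0].text.strip())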
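A relative query with an attribute predicate, written `[@class="..."]`, tends to survive layout changes better than a long absolute path such as `/html/body/div[6]/...`. The sketch below shows the predicate syntax using the standard library's ElementTree, which supports only a subset of XPath. The markup and the name "Jane Doe" are invented stand-ins, and real pages are rarely well-formed XML, so lxml's HTMLParser is still the right tool for the saved page itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the saved page
snippet = """
<html><body><ul>
  <li class="inline t-24 t-black t-normal break-words"> Jane Doe </li>
  <li class="other">ignored</li>
</ul></body></html>
""".strip()

root = ET.fromstring(snippet)
# ".//li" searches the whole tree; the predicate compares the full class string
matches = root.findall(".//li[@class='inline t-24 t-black t-normal break-words']")
names = [li.text.strip() for li in matches]
print(names)
```

Note that the predicate compares the entire attribute value, so it stops matching if the page adds or reorders class names.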

Parsing XML files that do not have 'root' node in Python

My client wants me to parse over 100,00 XML files and convert them into a text file.
I have successfully parsed a couple of files and converted them into a text file. However, I managed that only by editing the XML and adding <root></root> to the file.
This seems inefficient, since I would have to edit nearly 100,00 XML files to achieve my desired result.
Is there any way for my Python code to recognize the first node and read it as the root node?
I have tried the method shown in "Python XML Parsing without root"; however, I do not fully understand it and do not know where to implement it.
The XML format is as follows:
<Thread>
<ThreadID></ThreadID>
<Title></Title>
<InitPost>
<UserID></UserID>
<Date></Date>
<icontent></icontent>
</InitPost>
<Post>
<UserID></UserID>
<Date></Date>
<rcontent></rcontent>
</Post>
</Thread>
And this is my code on how to parse the XML files:
import os
from xml.etree import ElementTree
saveFile = open('test3.txt','w')
for path, dirs, files in os.walk("data/sample"):
    for f in files:
        fileName = os.path.join(path, f)
        with open(fileName, "r", encoding="utf8") as myFile:
            dom = ElementTree.parse(myFile)
            thread = dom.findall('Thread')
            for t in thread:
                threadID = str(t.find('ThreadID').text)
                threadID = threadID.strip()
                title = str(t.find('Title').text)
                title = title.strip()
                userID = str(t.find('InitPost/UserID').text)
                userID = userID.strip()
                date = str(t.find('InitPost/Date').text)
                date = date.strip()
                initPost = str(t.find('InitPost/icontent').text)
                initPost = initPost.strip()
            post = dom.findall('Thread/Post')
The rest of the code is just writing to the output text file.
Load the XML as text and wrap it in a root element.
'1.xml' is the XML you posted.
from xml.etree import ElementTree as ET
files = ['1.xml'] # your list of files goes here
for file in files:
    with open(file) as f:
        # wrap it with <r>
        xml = '<r>' + f.read() + '</r>'
        root = ET.fromstring(xml)
        print('Now we are ready to work with the xml')
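One caveat with wrapping: if a file starts with an XML declaration (`<?xml ...?>`), the declaration ends up inside the wrapper and the parser rejects it. Here is a small sketch that strips a leading declaration before wrapping. The helper name `parse_fragments` and the sample data are assumptions for illustration, not part of the original code.

```python
import re
import xml.etree.ElementTree as ET

# Matches an optional leading XML declaration such as <?xml version="1.0"?>
_XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")


def parse_fragments(text, wrapper="r"):
    """Parse text holding one or more top-level elements by wrapping it in a root.

    Hypothetical helper; `wrapper` is the name of the synthetic root element.
    """
    body = _XML_DECL.sub("", text, count=1)
    return ET.fromstring("<%s>%s</%s>" % (wrapper, body, wrapper))


# Made-up sample with two top-level <Thread> elements and a declaration
sample = (
    '<?xml version="1.0"?>\n'
    "<Thread><ThreadID>1</ThreadID></Thread>"
    "<Thread><ThreadID>2</ThreadID></Thread>"
)
root = parse_fragments(sample)
ids = [t.findtext("ThreadID") for t in root.findall("Thread")]
print(ids)
```

With the wrapper in place, `root.findall('Thread')` finds every top-level `<Thread>`, which is what the original `dom.findall('Thread')` call expects.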
I don't know if the Python parser supports DTDs, but if it does, then one approach is to define a simple wrapper document like this
<!DOCTYPE root [
<!ENTITY e SYSTEM "realdata.xml">
]>
<root>&e;</root>
and point the parser at this wrapper document instead of at realdata.xml
Not sure about Python, but generally speaking you can use SGML to infer missing tags, whether at the document element (root) level or elsewhere. The basic technique is creating a DTD for declaring the document element like so
<!DOCTYPE root [
<!ELEMENT root O O ANY>
]>
<!-- your document character data goes here -->
where the important things are the O O (letter O) tag omission indicators telling SGML that both the start- and end-element tags for root can be omitted.
See also the following questions with more details:
Querying Non-XML compliant structured data
Adding missing XML closing tags in Javascript

How to determine what the root tag name is for a XML document

I was wondering how I would go about determining the root tag of an XML document using xml.dom.minidom.
<?xml version="1.0" encoding="UTF-8"?>
<root>
<child1></child1>
<child2></child2>
<child3></child3>
</root>
In the example XML above, my root tag could be 3 or 4 different things. All I want to do is pull the tag, and then use that value to get the elements by tag name.
import re
from xml.dom.minidom import parseString
def import_from_XML(self, file_name):
    file = open(file_name)
    document = file.read()
    if re.compile(r'^<\?xml').match(document):
        xml = parseString(document)
        root = '' # <-- THIS IS WHERE IM STUCK
        elements = xml.getElementsByTagName(root)
I tried searching through the documentation for xml.dom.minidom, but it is a little hard for me to wrap my head around, and I couldn't find anything that answered this question outright.
I'm using Python 3.6.x, and I would prefer to keep with the standard library if possible.
For the line you marked "THIS IS WHERE IM STUCK", the following assigns the name of the XML document's root tag to the variable theNameOfTheRootElement:
theNameOfTheRootElement = xml.documentElement.tagName
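As a quick check, here is a self-contained sketch that applies `documentElement.tagName` to the sample document from the question. The variable names `root_name` and `children` are illustrative only.

```python
from xml.dom.minidom import parseString

document = """<?xml version="1.0" encoding="UTF-8"?>
<root>
<child1></child1>
<child2></child2>
<child3></child3>
</root>"""

xml = parseString(document)
# documentElement is the single top-level element of the document
root_name = xml.documentElement.tagName
# Skip the whitespace text nodes between the child elements
children = [
    node.tagName
    for node in xml.documentElement.childNodes
    if node.nodeType == node.ELEMENT_NODE
]
print(root_name, children)
```

Since `xml.documentElement` is already the root element, iterating its `childNodes` is probably more direct than calling `getElementsByTagName` with the root's name, which would return the root element itself.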
This is what I did the last time I processed XML. I didn't use your approach, but I hope it helps.
# Python 3: urllib2 was split into urllib.request and urllib.error
import urllib.request
import urllib.error
from xml.etree import ElementTree as ET
# `site` must hold the URL of the XML document; it is not defined in this snippet
req = urllib.request.Request(site)
try:
    with urllib.request.urlopen(req) as response:
        data = response.read()
except urllib.error.URLError as e:
    print(e.reason)
    raise
root = ET.fromstring(data)
print("root", root)
for child in root.findall('parent element'):
    print(child.text, child.attrib)

How to retain text which is present inside a tag after using beautifulsoup package in python

I have an HTML tag as follows:
CWE-134
I want to retain the href part inside it.
Please suggest steps for doing so.
Extract:
a_tag['href']
Save to file:
with open('output.txt', 'w') as f:
f.write(a_tag['href'])
Write it to a file, such as TXT or CSV, or store it in a database.
import re
for tag in soup.find_all('a'):
    print(tag)
    # Take what follows the last =" and cut it at the closing ">
    text = re.split(r'">', re.split(r'="', str(tag))[-1])[0]
    print(text)
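Splitting the tag's string form on quotes breaks easily, for example when `href` is not the last attribute. As an alternative that is not BeautifulSoup, the standard library's `html.parser` can collect `href` values directly. This is a rough sketch: the class name `HrefCollector` is made up, and the markup and URL are placeholders because the original tag is not shown in the question.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# Hypothetical markup; the URL is a placeholder, not the one from the question
html_text = '<p><a href="https://example.com/cwe-134" class="ref">CWE-134</a></p>'
collector = HrefCollector()
collector.feed(html_text)
print(collector.hrefs)
```

With BeautifulSoup already in use, `a_tag['href']` from the earlier answer remains the shortest route; this version only removes the dependency.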

Trying to extract xml element using python 2.7

I am trying to extract the name elements under the sequence in xml files. I have pasted in the top of a sample xml to illustrate. With this I want to get the text from 01 Interview_been successful through mentorship and write it to a file. There are multiple sequence tags in the xml and I am trying to figure out how to go through it and extract it. I have tried to figure out how to use xml.etree and xml.dom.minidom but I can't seem to wrap my brain around it. I was able to get all of the id values from the sequence tags but not the name elements. I'm pasting in my code before the xml.
from xml.etree import ElementTree
file = open("xmldump.txt", "r")
filedata = file.read()
file.close()
with open('test.xml', 'rt') as f:
    tree = ElementTree.parse(f)
for node in tree.iter('name'):
    sequenceid = node.attrib.get('name')
    print ' %s' % (sequenceid)
    newLine = sequenceid + "\n"
    file = open("xmldump.txt", "w")
    file.write(newLine)
file.close()
Here is the XML:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xmeml>
<xmeml version="5">
<bin>
<uuid>0F5D72FA-54E4-4DE8-81D7-CC33F5C43836</uuid>
<updatebehavior>add</updatebehavior>
<name>Logged</name>
<children>
<sequence id="01 Interview_been successful through mentorship">
<uuid>12FB944D-83EA-4527-9A54-2130A42E3A06</uuid>
<updatebehavior>add</updatebehavior>
<name>01 Interview_been successful through mentorship</name>
<duration>1195</duration>
<rate>
<ntsc>TRUE</ntsc>
<timebase>24</timebase>
</rate>
<timecode>
Well, I'm not sure whether you want the "id" attribute or the name tag. Your code is confusing: it tries to read a "name" attribute, but the "sequence" tag only has an "id" attribute. Below is code that extracts both, which should help you get started on figuring out how ElementTree works.
from xml.etree import ElementTree
with open('test.xml', 'rt') as f:
    tree = ElementTree.parse(f)
for node in tree.iter('sequence'):
    sequenceid = node.attrib.get('id')
    name = node.findtext('name')
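To write the results to a file as the question asks, the names can be collected first and written once, so the output file is not reopened and overwritten on every iteration. The sketch below uses Python 3 syntax rather than the question's Python 2.7. The sample XML is a trimmed, closed-off stand-in for the truncated file, and `sequence_names` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the xmeml file in the question
sample = """<xmeml version="5"><bin><name>Logged</name><children>
<sequence id="01 Interview_been successful through mentorship">
<name>01 Interview_been successful through mentorship</name>
</sequence>
</children></bin></xmeml>"""


def sequence_names(root):
    """Return the <name> text of every <sequence> element under root."""
    # findtext looks only at direct children, so the <bin> name is not picked up
    return [seq.findtext("name", default="") for seq in root.iter("sequence")]


names = sequence_names(ET.fromstring(sample))
# Open the output once and write one name per line
with open("xmldump.txt", "w") as out:
    out.write("\n".join(names) + "\n")
```

Iterating over `sequence` instead of `name` avoids picking up the `<name>Logged</name>` element that belongs to the enclosing `<bin>`.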
