Get the URL from XML tag - python

My XML file:
<xml
xmlns="http://www.myweb.org/2003/instance"
xmlns:link="http://www.myweb.org/2003/linkbase"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:iso4217="http://www.myweb.org/2003/iso4217"
xmlns:utr="http://www.myweb.org/2009/utr">
<link:schemaRef xlink:type="simple" xlink:href="http://www.myweb.com/form/2020-01-01/test.xsd"></link:schemaRef>
I want to get the URL: http://www.myweb.com/folder/form/1/2020-01-01/test.xsd from the <link:schemaRef> tag.
My Python code below finds the <link:schemaRef> tag, but I am unable to retrieve the URL.
from lxml import etree
with open(filepath,'rb') as f:
    file = f.read()
root = etree.XML(file)
print(root.nsmap["link"]) #http://www.myweb.org/2003/linkbase
print(root.find(".//{"+root.nsmap["link"]+"}"+"schemaRef"))

Try it this way and see if it works:
import lxml.etree
for i in root.xpath('//*/node()'):
    if isinstance(i, lxml.etree._Element):
        print(i.values()[1])  # second attribute value, i.e. xlink:href
Output:
http://www.myweb.com/form/2020-01-01/test.xsd

Use:
>>> child = root.getchildren()[0]
>>> child.attrib
{'{http://www.w3.org/1999/xlink}type': 'simple', '{http://www.w3.org/1999/xlink}href': 'http://www.myweb.com/form/2020-01-01/test.xsd'}
>>> url = child.attrib['{http://www.w3.org/1999/xlink}href']
However, I suspect the real difficulty is knowing which key to use (i.e. {http://www.w3.org/1999/xlink}href). If so, build it from the namespace map:
>>> key_prefix = root.nsmap['xlink'] # the requested URL is the href attribute in the xlink namespace
>>> print(key_prefix)
http://www.w3.org/1999/xlink
>>> key_url = "{"+key_prefix+"}href"
>>> print(child.attrib[key_url])
http://www.myweb.com/form/2020-01-01/test.xsd
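The namespace-aware lookup can also be written with a prefix map and Element.get(). The following is a minimal sketch, assuming the standard library's xml.etree.ElementTree in place of lxml (find() with a namespaces mapping and get() with a {uri}name key are available on lxml elements as well). The root element name instance, the NS mapping, and the helper name schema_ref_href are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the document from the question; the root name "instance"
# is an assumption made for this example.
XML_DOC = """<instance
    xmlns="http://www.myweb.org/2003/instance"
    xmlns:link="http://www.myweb.org/2003/linkbase"
    xmlns:xlink="http://www.w3.org/1999/xlink">
  <link:schemaRef xlink:type="simple"
      xlink:href="http://www.myweb.com/form/2020-01-01/test.xsd"/>
</instance>"""

# Prefix-to-URI map used for the lookup (hypothetical name).
NS = {
    "link": "http://www.myweb.org/2003/linkbase",
    "xlink": "http://www.w3.org/1999/xlink",
}


def schema_ref_href(xml_text):
    """Return the xlink:href of the first link:schemaRef, or None if absent."""
    root = ET.fromstring(xml_text)
    ref = root.find(".//link:schemaRef", NS)
    if ref is None:
        return None
    # Attribute keys are stored in {namespace-uri}localname form.
    return ref.get("{%s}href" % NS["xlink"])


url = schema_ref_href(XML_DOC)
print(url)
```

Returning None when the element is missing keeps the caller from hitting an AttributeError on documents that lack a schemaRef.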

Related

Python - How to parse xml response and store a elements value in a variable?

I am getting the XML response from the API call.
I need the "testId" attribute value from this response. Please help me on this.
r = requests.get( myconfig.URL_webpagetest + "?url=" + testurl + "&f=xml&k=" + myconfig.apikey_webpagetest )
xmltxt = r.content
print(xmltxt)
testId = XML(xmltxt).find("testId").text
r = requests.get("http://www.webpagetest.org/testStatus.php?f=xml&test=" + testId )
xml response:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<statusCode>200</statusCode>
<statusText>Ok</statusText>
<data>
<testId>180523_YM_054fd7d84fd4ea7aed237f87289e0c7c</testId>
<ownerKey>dfc65d98de13c4770e528ef5b65e9629a52595e9</ownerKey>
<jsonUrl>http://www.webpagetest.org/jsonResult.php?test=180523_YM_054fd7d84fd4ea7aed237f87289e0c7c</jsonUrl>
</data>
</response>
The following error is produced:
Traceback (most recent call last):
File "/pagePerformance.py", line 52, in <module>
testId = XML (xmltxt).find("testId").text
AttributeError: 'NoneType' object has no attribute 'text'
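find("testId") returns None because find() only looks at direct children of the root element, and testId is nested inside data. Use the following to collect testId from the response:
import xml.etree.ElementTree as ET
response_xml_as_string = "xml response string from API"
responseXml = ET.fromstring(response_xml_as_string)
testId = responseXml.find('data').find('testId')
print(testId.text)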
from lxml.etree import fromstring
string = '<?xml version="1.0" encoding="UTF-8"?> <response> <statusCode>200</statusCode> <statusText>Ok</statusText> <data><testId>180523_YM_054fd7d84fd4ea7aed237f87289e0c7c</testId> <ownerKey>dfc65d98de13c4770e528ef5b65e9629a52595e9</ownerKey> <jsonUrl>http://www.webpagetest.org/jsonResult.php?test=180523_YM_054fd7d84fd4ea7aed237f87289e0c7c</jsonUrl> </data> </response>'
response = fromstring(string.encode('utf-8'))
elm = response.xpath('/response/data/testId').pop()
testId = elm.text
This way you can search for any element within the xml from the root/parent element via the XPATH.
Side note: I don't particularly like using the pop method to take the item out of a single-item list, so if anyone has a better way, please let me know. So far I've considered:
1) elm = next(iter(response.xpath('/response/data/testId')))
2) simply leaving it in a list so it can be used as a star-arg
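Two alternatives to pop() that avoid the intermediate list are sketched below with the standard library's xml.etree.ElementTree rather than lxml (an assumption; lxml elements offer find() and findtext() too). The sample document is trimmed from the response above, and get_test_id is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RESPONSE = b"""<?xml version="1.0" encoding="UTF-8"?>
<response>
  <statusCode>200</statusCode>
  <data>
    <testId>180523_YM_054fd7d84fd4ea7aed237f87289e0c7c</testId>
  </data>
</response>"""


def get_test_id(xml_bytes):
    """Return the text of response/data/testId, or None if it is missing."""
    root = ET.fromstring(xml_bytes)
    # findtext() returns the text of the first match, or None: no list, no pop.
    return root.findtext("data/testId")


root = ET.fromstring(RESPONSE)
# find("testId") only inspects direct children of <response>, so it finds nothing.
direct_child = root.find("testId")
# ".//testId" searches all descendants instead.
descendant = root.find(".//testId")

print(direct_child)
print(descendant.text)
print(get_test_id(RESPONSE))
```

The first lookup also shows why the question's code failed: the result was None, and None has no text attribute.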
I found this article the other day when it appeared on my feed, and it may suit your needs. I skimmed it, but in general the package parses xml data and converts the tags/attributes/values into a dictionary. Additionally, the author points out that it maintains the nesting structure of the xml as well.
https://www.oreilly.com/learning/jxmlease-python-xml-conversion-data-structures
Applied to your use case:
>>> import jxmlease
>>> xml = '<?xml version="1.0" encoding="UTF-8"?> <response> <statusCode>200</statusCode> <statusText>Ok</statusText> <data> <testId>180523_YM_054fd7d84fd4ea7aed237f87289e0c7c</testId> <ownerKey>dfc65d98de13c4770e528ef5b65e9629a52595e9</ownerKey> <jsonUrl>http://www.webpagetest.org/jsonResult.php?test=180523_YM_054fd7d84fd4ea7aed237f87289e0c7c</jsonUrl> </data> </response>'
>>> root = jxmlease.parse(xml)
>>> testid = root['response']['data']['testId'].get_cdata()
>>> print(testid)
180523_YM_054fd7d84fd4ea7aed237f87289e0c7c

Keep lxml from creating self-closing tags

I have an (old) tool which does not understand self-closing tags like <STATUS/>. So, we need to serialize our XML files with opened/closed tags like this: <STATUS></STATUS>.
Currently I have:
>>> from lxml import etree
>>> para = """<ERROR>The status is <STATUS></STATUS>.</ERROR>"""
>>> tree = etree.XML(para)
>>> etree.tostring(tree)
'<ERROR>The status is <STATUS/>.</ERROR>'
How can I serialize with opened/closed tags?
<ERROR>The status is <STATUS></STATUS>.</ERROR>
Solution
Given by wildwilhelm, below:
>>> from lxml import etree
>>> para = """<ERROR>The status is <STATUS></STATUS>.</ERROR>"""
>>> tree = etree.XML(para)
>>> for status_elem in tree.xpath("//STATUS[string() = '']"):
...     status_elem.text = ""
>>> etree.tostring(tree)
'<ERROR>The status is <STATUS></STATUS>.</ERROR>'
It seems like the <STATUS> tag gets assigned a text attribute of None:
>>> tree[0]
<Element STATUS at 0x11708d4d0>
>>> tree[0].text
>>> tree[0].text is None
True
If you set the text attribute of the <STATUS> tag to an empty string, you should get what you're looking for:
>>> tree[0].text = ''
>>> etree.tostring(tree)
'<ERROR>The status is <STATUS></STATUS>.</ERROR>'
With this in mind, you can probably walk the tree and fix up the text attributes before writing out your XML. Something like this:
# prevent creation of self-closing tags
for node in tree.iter():
    if node.text is None:
        node.text = ''
If the lxml DOM you are serializing is HTML, you can use
etree.tostring(html_dom, method='html')
to prevent self-closing tags like <a />.
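If switching serializers is acceptable, the standard library's xml.etree.ElementTree (not lxml) exposes a short_empty_elements flag on tostring() that produces opened/closed tags without touching the tree. This is a minimal sketch under that assumption; serialize_expanded is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def serialize_expanded(xml_text):
    """Serialize XML so that empty elements are written as <TAG></TAG>."""
    root = ET.fromstring(xml_text)
    # short_empty_elements=False turns off the self-closing form for empty elements.
    return ET.tostring(root, encoding="unicode", short_empty_elements=False)


para = "<ERROR>The status is <STATUS></STATUS>.</ERROR>"
expanded = serialize_expanded(para)
print(expanded)
```

Because the flag acts at serialization time, the parsed tree stays unchanged and no text attributes need to be rewritten.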

How to extract one value from an xml url using xml.etree

I'm trying to print the value of just one field of an XML tree. Here is an example of the tree I get when I request it:
<puco>
<resultado>OK</resultado>
<coberturaSocial>O.S.P. TIERRA DEL FUEGO(IPAUSS)</coberturaSocial>
<denominacion>DAMIAN GUTIERREZ DEL RIO</denominacion>
<nrodoc>32443324</nrodoc>
<rnos>924001</rnos>
<tipodoc>DNI</tipodoc>
</puco>
Now, I just want to print the "coberturaSocial" value. Here is the request that I have in my views.py:
def get(request):
    r = requests.get('https://sisa.msal.gov.ar/sisa/services/rest/puco/38785898')
    dom = r.content
    asd = etree.fromstring(dom)
If I print "asd", I get this error: The view didn't return an HttpResponse object. It returned None instead.
and I also get this in the console.
I just want to print coberturaSocial. Please help, I'm new to XML parsing!
You need to extract the contents of the tag and then return it wrapped in a response, like so:
return HttpResponse(asd.find('coberturaSocial').text)
I'm guessing etree was imported with import xml.etree.ElementTree as etree.
You can use:
text = r.content
dom = etree.fromstring(text)
el = dom.find('coberturaSocial')
el.text # this is where the string is
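Put together, the parsing step can be isolated from the view so that it is easy to test. This is a minimal sketch, assuming xml.etree.ElementTree as guessed above; extract_field and its default argument are hypothetical names, and SAMPLE stands in for r.content.

```python
import xml.etree.ElementTree as etree

# Stand-in for r.content, trimmed from the response shown in the question.
SAMPLE = b"""<puco>
<resultado>OK</resultado>
<coberturaSocial>O.S.P. TIERRA DEL FUEGO(IPAUSS)</coberturaSocial>
<tipodoc>DNI</tipodoc>
</puco>"""


def extract_field(xml_bytes, tag, default=""):
    """Return the text of the first direct child named `tag`, or `default`."""
    dom = etree.fromstring(xml_bytes)
    # findtext() returns `default` instead of raising when the tag is absent.
    return dom.findtext(tag, default=default)


cobertura = extract_field(SAMPLE, "coberturaSocial")
print(cobertura)
```

In the view, the result would then be wrapped as HttpResponse(extract_field(r.content, 'coberturaSocial')), so the view always returns a response object.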

how to get the text of the url while scraping webpage [duplicate]

I am trying to extract the content of a single "value" attribute in a specific "input" tag on a webpage. I use the following code:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTag = soup.findAll(attrs={"name" : "stainfo"})
output = inputTag['value']
print str(output)
I get TypeError: list indices must be integers, not str
Even though, from the Beautifulsoup documentation, I understand that strings should not be a problem here... but I am no specialist, and I may have misunderstood.
Any suggestion is greatly appreciated!
.find_all() returns a list of all found elements, so:
input_tag = soup.find_all(attrs={"name" : "stainfo"})
input_tag is a list (probably containing only one element), and indexing a list with a string raises the TypeError. Depending on what exactly you want, you should either do:
output = input_tag[0]['value']
or use .find() method which returns only one (first) found element:
input_tag = soup.find(attrs={"name": "stainfo"})
output = input_tag['value']
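The list-versus-single-element distinction can be illustrated without BeautifulSoup. The sketch below uses the standard library's html.parser instead (a swapped-in technique, not the BeautifulSoup API); InputValueCollector, the sample HTML, and the values in it are made up for illustration. It collects every matching value into a list, so taking the first item plays the role of find().

```python
from html.parser import HTMLParser


class InputValueCollector(HTMLParser):
    """Collect the value attribute of every <input> with a given name."""

    def __init__(self, name):
        super().__init__()
        self.name = name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (key, value) pairs.
        attr_map = dict(attrs)
        if tag == "input" and attr_map.get("name") == self.name:
            self.values.append(attr_map.get("value"))


# Hypothetical page content standing in for the fetched HTML.
HTML = '<form><input name="stainfo" value="abc"><input name="other" value="x"/></form>'

collector = InputValueCollector("stainfo")
collector.feed(HTML)

all_values = collector.values                           # like find_all(): a list
first_value = all_values[0] if all_values else None     # like find(): one item or None
print(all_values, first_value)
```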
In Python 3.x, simply use get(attr_name) on the tag objects that find_all returns:
from bs4 import BeautifulSoup
xmlData = None
with open('conf//test1.xml', 'r') as xmlFile:
    xmlData = xmlFile.read()
xmlDecoded = xmlData
xmlSoup = BeautifulSoup(xmlData, 'html.parser')
# html.parser lowercases tag names, hence 'repeatingelement'
repElemList = xmlSoup.find_all('repeatingelement')
for repElem in repElemList:
    print("Processing repElem...")
    repElemID = repElem.get('id')
    repElemName = repElem.get('name')
    print("Attribute id = %s" % repElemID)
    print("Attribute name = %s" % repElemName)
against XML file conf//test1.xml that looks like:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
<singleElement>
<subElementX>XYZ</subElementX>
</singleElement>
<repeatingElement id="11" name="Joe"/>
<repeatingElement id="12" name="Mary"/>
</root>
prints:
Processing repElem...
Attribute id = 11
Attribute name = Joe
Processing repElem...
Attribute id = 12
Attribute name = Mary
For me:
<input id="color" value="Blue"/>
This can be fetched with the snippet below.
page = requests.get("https://www.abcd.com")
soup = BeautifulSoup(page.content, 'html.parser')
colorName = soup.find(id='color')
print(colorName['value'])
If you want to retrieve multiple values of attributes from the source above, you can use findAll and a list comprehension to get everything you need:
import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()
from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)
inputTags = soup.findAll(attrs={"name" : "stainfo"})
### You may be able to do findAll("input", attrs={"name" : "stainfo"})
output = [x["value"] for x in inputTags]
print output
### This will print a list of the values.
I would suggest a time-saving approach, assuming you know which tags carry the attribute.
Suppose a tag xyz has an attribute named "staininfo":
full_tag = soup.findAll("xyz")
Note that full_tag is a list:
for each_tag in full_tag:
    staininfo_attrb_value = each_tag["staininfo"]
    print(staininfo_attrb_value)
This gives you the staininfo attribute value of every xyz tag.
You can also use this:
import requests
from bs4 import BeautifulSoup
import csv
url = "http://58.68.130.147/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("input", attrs={"name":"stainfo"})
for val in get_details:
    get_val = val["value"]
    print(get_val)
You could try to use the new powerful package called requests_html:
from requests_html import HTMLSession
session = HTMLSession()
r = session.get("https://www.bbc.co.uk/news/technology-54448223")
date = r.html.find('time', first = True) # finding a "tag" called "time"
print(date) # you will have: <Element 'time' datetime='2020-10-07T11:41:22.000Z'>
# To get the text inside the "datetime" attribute use:
print(date.attrs['datetime']) # you will get '2020-10-07T11:41:22.000Z'
I am using this with Beautifulsoup 4.8.1 to get the value of all class attributes of certain elements:
from bs4 import BeautifulSoup
html = "<td class='val1'/><td col='1'/><td class='val2' />"
bsoup = BeautifulSoup(html, 'html.parser')
for td in bsoup.find_all('td'):
    if td.has_attr('class'):
        print(td['class'][0])
It's important to note that td['class'] returns a list even when the attribute has only a single value, because class is a multi-valued attribute.
Here is an example of how to extract the href attributes of all a tags:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "http://www.cde.ca.gov/ds/sp/ai/"
page = rq.get(url)
html = bs(page.text, 'lxml')
hrefs = html.find_all("a")
all_hrefs = []
for href in hrefs:
    # print(href.get("href"))
    links = href.get("href")
    all_hrefs.append(links)
print(all_hrefs)
You can try gazpacho:
Install it using pip install gazpacho
Get the HTML and make the Soup using:
from gazpacho import get, Soup
soup = Soup(get("http://ip.add.ress.here/")) # get directly returns the html
inputs = soup.find('input', attrs={'name': 'stainfo'}) # Find all the input tags
if inputs:
    if type(inputs) is list:
        for input in inputs:
            print(input.attrs.get('value'))
    else:
        print(inputs.attrs.get('value'))
else:
    print('No <input> tag found with the attribute name="stainfo"')

extract xml tag name, attributes and values

I am looking for a way to extract all information available in a xml file to a flat file or a database.
For example
<r>
<P>
<color val="1F497D"/>
</P>
<t val="123" val2="234">TEST REPORT</t>
</r>
I would want this as
r
P
color,val,1f497d
t,val,123
t,val2,234
Any pointers on how to go about this in python?
Install lxml, then:
>>> from lxml import etree
>>> parser = etree.XMLParser(remove_blank_text=True)
>>> parsed_xml = etree.XML(s, parser)  # s holds the XML text from the question
>>> for i in parsed_xml.iter('*'):
...     print(i.tag)
...     for x in i.items():
...         print('%s,%s' % (x[0], x[1]))
...
r
P
color
val,1F497D
t
val,123
val2,234
I'll leave it up to you to format the output.
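One way to format the output as the question asks is to emit one tag,attribute,value row per attribute and a bare tag name for elements without attributes. This is a minimal sketch that assumes xml.etree.ElementTree from the standard library instead of lxml (iter() and items() exist on both); flatten is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

XML_DOC = """<r>
<P>
<color val="1F497D"/>
</P>
<t val="123" val2="234">TEST REPORT</t>
</r>"""


def flatten(xml_text):
    """Return 'tag' or 'tag,attr,value' rows for every element, in document order."""
    rows = []
    for elem in ET.fromstring(xml_text).iter():
        attrs = elem.items()
        if attrs:
            # One row per attribute, prefixed with the element's tag name.
            rows.extend("%s,%s,%s" % (elem.tag, key, value) for key, value in attrs)
        else:
            rows.append(elem.tag)
    return rows


rows = flatten(XML_DOC)
print("\n".join(rows))
```

Returning a list rather than printing directly makes it straightforward to write the rows to a flat file or insert them into a database afterwards.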
I think your best bet is to use BeautifulSoup
e.g. (from their docs):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
soup.title
# <title>The Dormouse's story</title>
soup.p['class']
# u'title'
for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie
You could also take a look at lxml; it's easy and efficient, and BeautifulSoup can use it as its underlying parser. Specifically, you might want to take a look at this page.
I'm not really sure why you'd want this, but you should take a look at lxml or BeautifulSoup for Python.
Alternatively, if you just want it in exactly the form you've presented above:
def parse_html(html_string):
    import re
    # capture the contents of each opening tag (closing tags are skipped)
    fields = re.findall(r'(?<=\<)[\w=\s\"\']+?(?=\/?\>)', html_string)
    out = []
    for field in fields:
        # greedy \w+ so the whole tag name is captured, not just its first letter
        tag = re.match(r'(?P<tag>\w+)', field).group('tag')
        attrs = re.findall(r' (\w+?)\=[\"\'](.+?)[\"\']', field)
        if attrs:
            for x in attrs:
                out.append(','.join([tag] + list(x)))
        else:
            out.append(tag)
    print('\n'.join(out))
This is a little over the top, and that's why you should generally use lxml or BeautifulSoup, but it gets this particular job done.
Output of the program above:
r
P
color,val,1F497D
t,val,123
t,val2,234
