In all the examples and tutorials I have seen of BeautifulSoup, an HTML/XML document is passed and a soup object is returned which can then be used to modify the document. However, how can I use BeautifulSoup to create an HTML/XML document from scratch? In other words, I have data that I would like to put in an XML file, but the XML file does not exist yet and I would like to build it from scratch. How can I go about it?
Just create an empty BeautifulSoup() object:
soup = BeautifulSoup()
and start adding elements:
soup.append(soup.new_tag("a", href="http://www.example.com"))
For XML, you can start out with an XML declaration by using the xml tree builder:
soup = BeautifulSoup(features='xml')
This requires lxml to be installed first. This sets the .is_xml flag on the BeautifulSoup object (which can also be set manually).
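Putting the pieces above together, here is a minimal sketch of building a small XML document from in-memory data. The records and the element and attribute names (catalog, item, price) are made up for illustration, and the xml builder is assumed to be available (that is, lxml is installed):

```python
from bs4 import BeautifulSoup

# Hypothetical data; the element and attribute names below are illustrative only.
records = [("Book A", "10.00"), ("Book B", "12.50")]

# An empty soup using the xml tree builder (assumes lxml is installed).
soup = BeautifulSoup(features="xml")

# Create the root element and attach it to the empty document.
catalog = soup.new_tag("catalog")
soup.append(catalog)

# Add one child element per record: keyword arguments become attributes,
# and .string sets the text content.
for title, price in records:
    item = soup.new_tag("item", price=price)
    item.string = title
    catalog.append(item)

# Serialize the tree; write xml_text to a file to create the XML file.
xml_text = str(soup)
print(xml_text)
```

The same soup.new_tag() / append() calls work on a plain BeautifulSoup() object if you are building HTML instead.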
Related
I am trying to get some text out of a table from an online xml file. I can find the tables:
from lxml import etree
import requests
main_file = requests.get('https://training.gov.au/TrainingComponentFiles/CUA/CUAWRT601_R1.xml')
main_file.encoding = 'utf-8-sig'
root = etree.fromstring(main_file.content)
tables = root.xpath('//foo:table', namespaces={"foo": "http://www.authorit.com/xml/authorit"})
print(tables)
But I can't get any further than that. The text that I am looking for is:
Prepare to write scripts
Write draft scripts
Produce final scripts
When I paste the xml in here: http://xpather.com/
I can get it using the following expression:
//table[1]/tr/td[@width="2700"]/p[@id="4"][not(*)]/text()
but that doesn't work here and I'm out of ideas. How can I get that text?
Use the namespace prefix you declared (with namespaces={"foo": "http://www.authorit.com/xml/authorit"}) on every element step of the path. For example, instead of //table[1]/tr/td[@width="2700"]/p[@id="4"][not(*)]/text() use //foo:table[1]/foo:tr/foo:td[@width="2700"]/foo:p[@id="4"][not(*)]/text().
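To see the difference without hitting the network, here is a small sketch against an inline sample. The sample XML is a made-up stand-in for the real file (its actual structure is an assumption); only the namespace URI comes from the question:

```python
from lxml import etree

NS = {"foo": "http://www.authorit.com/xml/authorit"}

# Hypothetical stand-in for the real document: every element lives in the
# default namespace declared on the root.
SAMPLE = b"""<root xmlns="http://www.authorit.com/xml/authorit">
  <table>
    <tr><td width="2700"><p id="4">Prepare to write scripts</p></td></tr>
    <tr><td width="2700"><p id="4">Write draft scripts</p></td></tr>
    <tr><td width="1000"><p id="4">Ignored cell</p></td></tr>
  </table>
</root>"""

root = etree.fromstring(SAMPLE)

# Unprefixed names only match elements in no namespace, so nothing is found.
unprefixed = root.xpath('//table[1]/tr/td[@width="2700"]/p[@id="4"][not(*)]/text()')

# Prefixing every element step makes the same path match.
prefixed = root.xpath(
    '//foo:table[1]/foo:tr/foo:td[@width="2700"]/foo:p[@id="4"][not(*)]/text()',
    namespaces=NS,
)

print(unprefixed)
print(prefixed)
```

Attributes such as width and id stay unprefixed because they are not in a namespace here.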
I'm trying to get content out of XML from an API call. I'm able to use requests to get the xml content, but can't seem to parse it correctly. Here is the code that has been semi-successful so far:
import requests
from lxml import etree
data = requests.get('http://elections.huffingtonpost.com/pollster/api/polls.xml', params={'sort':'updated'})
tree = etree.XML(data.content)
The tree is showing the line breaks from the xml as text, and some of the nodes that are more than 3 levels deep are gone.
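One possible explanation for the line breaks, offered as an assumption since the API response is not shown here: by default lxml keeps the whitespace between elements as .text and .tail values. A parser created with remove_blank_text=True drops those whitespace-only nodes. A minimal sketch on a made-up inline document:

```python
from lxml import etree

# Hypothetical pretty-printed input; not the actual API response.
raw = b"<root>\n  <a/>\n  <b>text</b>\n</root>"

# Default parser: indentation shows up as text on the root element.
default_root = etree.XML(raw)

# Parser that discards whitespace-only text between elements.
parser = etree.XMLParser(remove_blank_text=True)
clean_root = etree.XML(raw, parser)

print(repr(default_root.text))
print(repr(clean_root.text))
print(etree.tostring(clean_root))
```

This does not explain the missing deep nodes; checking them with etree.tostring(tree, pretty_print=True) would show whether they are really absent or just not displayed.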
Basically what I am doing is using urllib.request to make an API call to pubmed, receive an XML file in return, and am trying to parse it with no luck.
I have tried using Element Tree and other modules with no luck. I believe there may be an issue with XML object itself.
#Importing URL Request Modules for API Calls
#Also importing ElementTree as it seems to be best for XML parsing
import urllib.request
import urllib.parse
import re
import xml.etree.ElementTree as ET
from urllib import request
#Now I can make the API call.
id_request = urllib.request.urlopen('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=17570568')
#id_request will be an object that I'm not sure I understand?
#id_request Returns: "<http.client.HTTPResponse object at 0x0000000003693FD0>"
#Let's now read this baby in XML format!
id_pubmed = id_request.read()
#If I look at the id_pubmed object, I now have the XML file I want to parse.
You can see what the XML file id_pubmed is calling/prints here: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=17570568
My issue is I can't get Element Tree to parse this at all. I have tried:
tree = ET.parse(id_pubmed)
root = tree.getroot()
as well as various other suggestions from https://docs.python.org/3/library/xml.etree.elementtree.html#module-xml.etree.ElementTree
The ET.parse() method requires either the location of the XML file (on the local file system) or a file-like object, but your id_pubmed holds the document content itself (in Python 3, id_request.read() returns a bytes object).
In that case, you should use ET.fromstring(). Example:
root = ET.fromstring(id_pubmed)
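Here is a minimal offline sketch of both options. The payload is a shortened, made-up stand-in for the eSummary response (the element names and the title are assumptions for illustration), so it runs without calling the API:

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the bytes returned by id_request.read().
payload = b"""<?xml version="1.0" encoding="UTF-8"?>
<eSummaryResult>
  <DocSum>
    <Id>17570568</Id>
    <Item Name="Title" Type="String">Example title</Item>
  </DocSum>
</eSummaryResult>"""

# Option 1: parse the content directly.
root = ET.fromstring(payload)

# Option 2: ET.parse() also works if the content is wrapped in a file-like object.
root_from_file = ET.parse(io.BytesIO(payload)).getroot()

# Pull a couple of values out of the parsed tree.
doc_id = root.findtext("DocSum/Id")
title = root.find("DocSum/Item[@Name='Title']").text

print(doc_id, title)
```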
I've a problem with extracting text out of .docx after removing table.
The docx files I'm dealing with contain a lot of tables that I would like to get rid of before extracting the text.
I first use docx2html to convert a docx file to html, and then use BeautifulSoup to remove the table tag and extract the text.
from docx2html import convert
from bs4 import BeautifulSoup
...
temp = convert(FileToConvert)
soup = BeautifulSoup(temp)
for i in range(0,len(soup('table'))):
    soup.table.decompose()
Text = soup.get_text()
While this process works and produces what I need, docx2html.convert() is slow. Since .docx files are in fact .xml files, would it be possible to skip the procedure of converting docx into html and just extract the text from the xml after removing the tables?
docx files are not just xml files but rather a zipped, XML-based format, so you won't be able to pass a docx file directly to BeautifulSoup. The format seems pretty simple though, as the zipped docx contains a file called word/document.xml, which is probably the xml file you want to parse. You can use Python's zipfile module to extract this file and pass its contents directly to BeautifulSoup:
import sys
import zipfile
from bs4 import BeautifulSoup
# Open the .docx (a zip archive) and read the main document part.
with zipfile.ZipFile(sys.argv[1], 'r') as zfp:
    with zfp.open('word/document.xml') as fp:
        soup = BeautifulSoup(fp.read(), 'xml')
print(soup)
However, you might also want to look at https://github.com/mikemaccana/python-docx, which might do a lot of what you want already. I haven't tried it so I can't vouch for its suitability for your specific use-case.
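To go one step further and drop the tables before extracting text, here is a rough sketch. Note that it swaps BeautifulSoup for the standard library's xml.etree.ElementTree, and it assumes that tables are w:tbl elements and text runs are w:t elements in the WordprocessingML namespace. The helper name docx_text_without_tables and the tiny in-memory "docx" are made up for illustration; a real .docx contains more parts than word/document.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text_without_tables(docx_file):
    """Return paragraph text from word/document.xml, skipping tables."""
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    # Detach every <w:tbl> element from its parent.
    for parent in list(root.iter()):
        for tbl in parent.findall(f"{{{W}}}tbl"):
            parent.remove(tbl)

    # Join the text runs of each remaining paragraph.
    paragraphs = []
    for p in root.iter(f"{{{W}}}p"):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


# Build a minimal fake .docx in memory so the sketch runs without a real file.
document_xml = (
    f'<w:document xmlns:w="{W}"><w:body>'
    '<w:p><w:r><w:t>Before table</w:t></w:r></w:p>'
    '<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Cell text</w:t></w:r></w:p></w:tc></w:tr></w:tbl>'
    '<w:p><w:r><w:t>After table</w:t></w:r></w:p>'
    '</w:body></w:document>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("word/document.xml", document_xml)
buffer.seek(0)

text = docx_text_without_tables(buffer)
print(text)
```

For a real file, pass its path instead of the buffer, e.g. docx_text_without_tables(sys.argv[1]).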
I've been trying for some hours to grab the response from the imgur API. I got the XML in the terminal, but I don't know how to grab it and parse it. Here's my code.
c = pycurl.Curl()
values = [
("key", "Super Secret API Number"),
("image", (c.FORM_FILE, "pic.jpg"))]
c.setopt(c.URL, "http://api.imgur.com/2/upload.xml")
c.setopt(c.HTTPPOST, values)
c.perform()
c.close()
I'm a big noob with python, this is my first time. Python virgin. I read that you can parse the xml with ElementTree, but I can't find any cool documentation.
Hope you can help me. Thanks.
Store the response from the imgur API in a file (or a variable). Then use an XML parser to parse the XML response you are getting from the imgur API.
There are lots of options available, like lxml or BeautifulSoup.
Here is an example of how to use lxml with XPath expressions.
>>> from lxml import etree
>>> xml = """<foo>baz!</foo>"""
>>> xp = etree.fromstring(xml)
>>> values = xp.xpath("//foo/text()")
>>> values
['baz!']
If you need to parse an XML file:
# parse from file
et = etree.parse(source_xml)
value = et.xpath("your xpath xpr here")
If you need to parse directly from a URL:
# parse from URL
etree.parse("http://example.com/somefile.xml")
For building XPath expressions, use Firefox's Firebug extension or install FirePath.
When I started using the included ElementTree module I found the documentation lacking good examples (currently there are only 3, and only one of those shows anything immediately practical).
I've answered a couple of questions here on SO related to lxml/ElementTree, and I usually see people getting stuck trying to write these weird list comprehensions to deal with something XPath handles in one line much more clearly:
Parsing lxml.etree._Element contents
lxml classic: Get text content except for that of nested tags?
If you have a more specific question, please post some source XML and desired effect.
I hope this helps.