XML parser in BeautifulSoup only scrapes the first symbol out of two - python

I wish to read symbols from some XML content stored in a text file. When I use xml as the parser, I get only the first symbol. However, I get both symbols when I use the lxml parser. Here is the XML content.
<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS">
<key>
<symbol>WDS</symbol>
<exchange>NYE</exchange>
<openfigi>BBG001S5WCY6</openfigi>
<qmidentifier>USI79Z473117AAG</qmidentifier>
</key>
<equityinfo>
<longname>
Woodside Energy Group Limited American Depositary Shares each representing one
</longname>
<shortname>Woodside Energy </shortname>
2
<instrumenttype>equity</instrumenttype>
<sectype>DR</sectype>
<isocfi>EDSXFR</isocfi>
<issuetype>AD</issuetype>
<proprietaryquoteeligible>false</proprietaryquoteeligible>
</equityinfo>
</lookupdata>
<lookupdata symbolstring="PAM">
<key>
<symbol>PAM</symbol>
<exchange>NYE</exchange>
<openfigi>BBG001T5K0S1</openfigi>
<qmidentifier>USI68Z3Z75887AS</qmidentifier>
</key>
<equityinfo>
<longname>Pampa Energia S.A.</longname>
<shortname>PAM</shortname>
<instrumenttype>equity</instrumenttype>
<sectype>DR</sectype>
<isocfi>EDSXFR</isocfi>
<issuetype>AD</issuetype>
</equityinfo>
</lookupdata>
When I read the xml content from a text file and parse the symbols, I get only the first symbol.
from bs4 import BeautifulSoup
with open("input_xml.txt") as infile:
    item = infile.read()
soup = BeautifulSoup(item,"xml")
for item in soup.select("lookupdata symbol"):
    print(item.text)
current output:
WDS
If I replace xml with lxml in BeautifulSoup(item,"xml"), I get both symbols. When I use lxml, a warning pops up, though.
As the content is xml, I would like to stick to xml parser instead of lxml.
Expected output:
WDS
PAM

The issue is that the xml parser stops after the first lookupdata element ends. A well-formed XML document has exactly one top-level (root) element, but your file has two sibling lookupdata elements at the top level, so the parser treats the first one as the whole document and drops the rest. Note that "xml" in BeautifulSoup selects lxml's XML parser, while "lxml" selects lxml's more lenient HTML parser, which is why the latter returns both symbols. You can add a print(soup) after loading to see what was actually parsed.
Alternatively, BeautifulSoup(item, "html.parser") uses the standard library's HTML parser, which does not require a single root element and returns both symbols.
Or, to keep using the xml parser, surround the content with a dummy top-level element, like:
from bs4 import BeautifulSoup
with open("input_xml.txt") as infile:
    item = infile.read()
patched = f"<root>{item}</root>"
soup = BeautifulSoup(patched, "xml")
for found in soup.select("lookupdata symbol"):
    print(found.text)
Output:
WDS
PAM
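One caveat with the wrapper: the XML declaration ends up inside the dummy element, which strict parsers reject. Here is a minimal sketch of the same idea using only the standard library, assuming the declaration can simply be dropped. The function name extract_symbols and the shortened sample are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

def extract_symbols(fragment_text):
    """Return the text of every <symbol> in a multi-root XML fragment."""
    # Drop a leading XML declaration; it is only legal at the very start of
    # a document and would otherwise end up inside the wrapper element.
    body = re.sub(r"^\s*<\?xml[^>]*\?>", "", fragment_text)
    # Wrap the sibling <lookupdata> elements in a single dummy root.
    root = ET.fromstring("<root>" + body + "</root>")
    return [node.text for node in root.iter("symbol")]

# Shortened stand-in for the file contents in the question.
sample = """<?xml version="1.0" encoding="utf-8"?>
<lookupdata symbolstring="WDS"><key><symbol>WDS</symbol></key></lookupdata>
<lookupdata symbolstring="PAM"><key><symbol>PAM</symbol></key></lookupdata>"""

symbols = extract_symbols(sample)
print(symbols)
```

ElementTree is strict, so this raises a ParseError on malformed input instead of silently truncating it, which makes the missing root easier to notice.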

Related

BeautifulSoup using an iterable instead of string?

I am parsing a Wikipedia metadata file with bs4 and python 3.5
This works for extraction, from a test slice of the (much larger) file:
from bs4 import BeautifulSoup
with open("Wikipedia/test.xml", 'r') as xml_file:
    xml = xml_file.read()
print(BeautifulSoup(xml, 'lxml').select("timestamp"))
The issue is that the metadata files are all 12+ gigs, so rather than slurping in the entire file as a string before ensoupification, I'd like to have BeautifulSoup read the data as an iterator (possibly even from gzcat to avoid having the data sitting around in uncompressed files).
However, my attempts to hand BS anything other than a string cause it to choke. Is there a way to get BS to read data as a stream instead of a string?
You can give BS a file handle object.
with open("Wikipedia/test.xml", 'r') as xml_file:
    soup = BeautifulSoup(xml_file, 'lxml')
This is the first example in the documentation of Making the Soup
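BeautifulSoup has no streaming mode, but you can use iterparse() (available in the standard library's ElementTree and in lxml) to process a large XML file incrementally:
import xml.etree.ElementTree as etree
for event, elem in etree.iterparse("Wikipedia/test.xml", events=('start', 'end')):
    ...  # per-event work goes here
    if event == 'end':
        ...  # process the completed element here
        elem.clear()  # free the memory held by this element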
read more here or here
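To cover the gzcat part of the question, here is a rough sketch that combines iterparse() with gzip.open(), so the data is decompressed and parsed as a stream. The helper name iter_timestamps and the tiny in-memory archive are stand-ins, not the real dump. It also assumes plain timestamp tags, whereas real Wikipedia dumps may use namespaced tag names.

```python
import gzip
import io
import xml.etree.ElementTree as ET

def iter_timestamps(stream):
    """Yield the text of each <timestamp> element as the parser reaches it."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "timestamp":
            yield elem.text
        # Drop the finished element's text, attributes and children.
        elem.clear()

# Stand-in for a large compressed dump: a tiny gzip archive built in memory.
raw = (
    b"<pages>"
    b"<page><timestamp>2016-01-01T00:00:00Z</timestamp></page>"
    b"<page><timestamp>2016-01-02T00:00:00Z</timestamp></page>"
    b"</pages>"
)
compressed = io.BytesIO(gzip.compress(raw))

# gzip.open() accepts a file name or an already open binary file object.
with gzip.open(compressed, "rb") as xml_stream:
    timestamps = list(iter_timestamps(xml_stream))
print(timestamps)
```

With a real file you would pass its path to gzip.open() and consume the generator lazily rather than building a list.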

Create new list from old using re.sub() in python 2.7

My goal is to take an XML file, pull out all instances of a specific element, remove the XML tags, then work on the remaining text.
I started with this, which works to remove the XML tags, but only from the entire XML file:
from urllib import urlopen
import re
url = [URL of XML FILE HERE] #the url of the file to search
raw = urlopen(url).read() #open the file and read it into variable
exp = re.compile(r'<.*?>')
text_only = exp.sub('',raw).strip()
I've also got this, text2 = soup.find_all('quoted-block'), which creates a list of all the quoted-block elements (yes, I know I need to import BeautifulSoup).
But I can't figure out how to apply the regex to the list resulting from the soup.find_all. I've tried to use text_only = [item for item in text2 if exp.sub('',item).strip()] and variations but I keep getting this error: TypeError: expected string or buffer
What am I doing wrong?
You don't want to regex this. Instead just use BeautifulSoup's existing support for grabbing text:
quoted_blocks = soup.find_all('quoted-block')
text_chunks = [block.get_text() for block in quoted_blocks]
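As for the TypeError itself: find_all() returns Tag objects rather than strings, and the comprehension in the question uses the regex as a filter condition instead of applying it to each item. If you did want to keep the regex, a sketch along these lines should work. The plain strings below stand in for the Tag objects, and strip_tags is a made-up helper name.

```python
import re

TAG_PATTERN = re.compile(r"<.*?>")

def strip_tags(items):
    """Apply the tag-stripping regex to each item, converting to str first."""
    # str(item) matters for BeautifulSoup Tag objects: re.sub() only accepts
    # strings, which is what triggers the TypeError in the question.
    return [TAG_PATTERN.sub("", str(item)).strip() for item in items]

# Plain strings standing in for the elements returned by soup.find_all().
blocks = [
    "<quoted-block><p>first</p></quoted-block>",
    "<quoted-block> second </quoted-block>",
]
cleaned = strip_tags(blocks)
print(cleaned)
```

get_text() is still the more robust choice, since the regex knows nothing about comments, entities or attributes that contain a > character.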

inserting html into an xml query using CDATA

I am trying to insert the content of an HTML file into an XML request.
I am opening the HTML file this way:
page = open(html).read()
then inserting the content into the XML this way:
"<Description><![CDATA["+page+"]]</Description>"+\
This errors out this way :
XML Parse error. XML Error Text: "; nested exception is:
org.xml.sax.SAXParseException: XML document structures must start and
end within the same entity."
I'm assuming I have to do a bit more than just dumping the content of the html file into a CDATA tag ? or maybe do it in a different way ?
Two potential issues.
First, the correct way to end a CDATA block is with ]]>, not ]]
Second, your HTML data might include CDATA blocks, and nested CDATA blocks are not allowed. You might consider encoding your HTML data, using Base64 for example:
import base64
encPage = base64.b64encode(page)
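If the receiving side cannot decode Base64, another common workaround is to split any embedded ]]> terminator across two adjacent CDATA sections. This is only a sketch: wrap_cdata is a made-up helper name, the sample page is invented, and the parse at the end is just a round-trip check.

```python
import xml.etree.ElementTree as ET

def wrap_cdata(text):
    """Wrap text in a CDATA section, splitting any embedded ']]>' terminator."""
    # A CDATA section ends at the first ']]>'. Replacing each terminator makes
    # the current section end right after ']]' and a new one start with '>'.
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<![CDATA[" + safe + "]]>"

# Invented HTML that itself contains a CDATA block.
page = "<p>Hello <![CDATA[inner]]> world</p>"
request = "<Description>" + wrap_cdata(page) + "</Description>"

# Round-trip check: the parser should hand back the original HTML unchanged.
parsed = ET.fromstring(request)
print(parsed.text == page)
```

The receiver sees the adjacent sections as one continuous run of text, so no extra decoding step is needed on that side.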
You forgot the closing > for the CDATA element:
"<Description><![CDATA["+page+"]]></Description>"+\

Python Regex - Parsing HTML

I have this little code and it's giving me AttributeError: 'NoneType' object has no attribute 'group'.
import sys
import re
#def extract_names(filename):
f = open('name.html', 'r')
text = f.read()
match = re.search (r'<hgroup><h1>(\w+)</h1>', text)
second = re.search (r'<li class="hover">Employees: <b>(\d+,\d+)</b></li>', text)
outf = open('details.txt', 'a')
outf.write(match)
outf.close()
My intention is to read a .HTML file looking for the <h1> tag value and the number of employees and append them to a file. But for some reason I can't seem to get it right.
Your help is greatly appreciated.
You are using a regular expression, but matching HTML or XML with such expressions gets too complicated, too fast. Don't do that.
Use an HTML parser instead; Python has several to choose from:
ElementTree is part of the standard library
BeautifulSoup is a popular 3rd party library
lxml is a fast and feature-rich C-based library.
The latter two handle malformed HTML quite gracefully as well, making decent sense of many a botched website.
ElementTree example:
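from xml.etree import ElementTree
tree = ElementTree.parse('filename.html')  # only works if the file is well-formed XML (e.g. XHTML)
for elem in tree.iter('h1'):  # iter() searches the whole tree, not just the root's children
    print(ElementTree.tostring(elem))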
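If the page is not well-formed XML, the standard library's html.parser module (Python 3) is more forgiving. Below is a minimal sketch that collects the text of every h1 element. The class name H1Collector and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class H1Collector(HTMLParser):
    """Collect the text content of every <h1> element."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False

    def handle_data(self, data):
        # Only accumulate text that appears between <h1> and </h1>.
        if self.in_h1:
            self.headings[-1] += data

collector = H1Collector()
# The unclosed <p> is deliberate: this parser does not require well-formed input.
collector.feed("<html><body><hgroup><h1>Acme</h1></hgroup><p>unclosed</body></html>")
collector.close()
headings = collector.headings
print(headings)
```

BeautifulSoup and lxml give you the same result with far less code, so this is mainly useful when you cannot install third-party packages.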
Just for the sake of completeness: your error message indicates that your regular expression failed to match, so re.search() returned None instead of a match object.
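If you do keep the regex for a quick one-off, guard against the None case before calling .group(). Here is a small sketch that reuses the two patterns from the question. The function name extract_details and the sample HTML are assumptions, not the asker's actual file.

```python
import re

def extract_details(html_text):
    """Return (name, employee count), or None if either pattern is missing."""
    name = re.search(r"<hgroup><h1>(\w+)</h1>", html_text)
    staff = re.search(
        r'<li class="hover">Employees: <b>(\d+,\d+)</b></li>', html_text
    )
    # re.search() returns None when nothing matches, so check before .group().
    if name is None or staff is None:
        return None
    return name.group(1), staff.group(1)

# Invented markup shaped like what the patterns expect.
sample = (
    "<hgroup><h1>Acme</h1></hgroup>"
    '<ul><li class="hover">Employees: <b>12,345</b></li></ul>'
)
details = extract_details(sample)
missing = extract_details("<p>no match here</p>")
print(details, missing)
```

Note that write() needs a string, so you would write the captured groups (for example name.group(1)) to the file, not the match object itself.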

How to read an entire web page into a variable

I am trying to read an entire web page and assign it to a variable, but am having trouble doing that. The variable seems to only be able to hold the first 512 or so lines of the page source.
I tried using readlines() to just print all lines of the source to the screen, and that gave me the source in its entirety, but I need to be able to parse it with regex, so I need to store it in a variable somehow. Help?
data = urllib2.urlopen(url)
print data
Only gives me about 1/3 of the source.
data = urllib2.urlopen(url)
for lines in data.readlines():
    print lines
This gives me the entire source.
Like I said, I need to be able to parse the string with regex, but the part I need isn't in the first 1/3 I'm able to store in my variable.
You probably are looking for beautiful soup: http://www.crummy.com/software/BeautifulSoup/ It's an open source web parsing library for python. Best of luck!
You should be able to use file.read() to read the entire file into a string. That will give you the entire source. Something like
data = urllib2.urlopen(url)
print data.read()
should give you the entire webpage.
From there, don't parse HTML with regex (well-worn post to this effect here), but use a dedicated HTML parser instead. Alternatively, clean up the HTML and convert it to XHTML (for instance with HTML Tidy), and then use an XML parsing library like the standard ElementTree. Which approach is best depends on your application.
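For reference, on Python 3 urllib2 became urllib.request, and read() returns bytes that need decoding. The sketch below uses a data: URL as a stand-in for a real page so it runs without network access. fetch_text is a made-up helper name, and UTF-8 is an assumed encoding.

```python
from urllib.request import urlopen

def fetch_text(url, encoding="utf-8"):
    """Read the whole response body and decode it into one string."""
    with urlopen(url) as response:
        # read() with no argument returns every remaining byte of the body.
        return response.read().decode(encoding)

# A data: URL stands in for a real page; %0A is an encoded newline.
demo_url = "data:text/plain;charset=utf-8,line%201%0Aline%202%0Aline%203"
page = fetch_text(demo_url)
print(len(page.splitlines()))
```

For a real site you would pass the http(s) URL instead, and ideally take the encoding from the response headers rather than assuming UTF-8.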
Actually, print data should not give you any HTML content, because it's just a file-like object, not the page body. From the official documentation (https://docs.python.org/2/library/urllib2.html):
This function returns a file-like object
This is what I got :
print data
<addinfourl at 140131449328200 whose fp = <socket._fileobject object at 0x7f72e547fc50>>
readlines() returns a list of the lines of the HTML source, which you can also join into a single string:
import urllib2
data = urllib2.urlopen(url)
l = []
for line in data.readlines():
    l.append(line)
s = ''.join(l)  # each line already ends with its own newline
You can use either the list l or the string s, depending on your needs.
I would also recommend using an open-source HTML parsing library rather than regex for full HTML parsing; you may still want regex for narrower jobs such as URL parsing.
If you want to parse over the variable afterwards you might use gazpacho:
from gazpacho import Soup
url = "https://www.example.com"
soup = Soup.get(url)
str(soup)
That way you can perform finds to extract the information you're after!
