Parsing HTML nested within XML file (using BeautifulSoup)

Parsing HTML nested within XML file (using BeautifulSoup) - python

I am trying to parse some data in an XML file that contains HTML in its description field.
For example, the data looks like:
<xml>
<description>
<body>
HTML I want
</body>
</description
<description>
<body>
- more data I want -
</body>
</description>
</xml>
So far, what I've come up with is this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(myfile, 'html.parser')
descContent = soup.find_all('description')
for i in descContent:
bodies = i.find_all('body')
# This will return an object of type 'ResultSet'
for n in bodies:
print n
# Nothing prints here.
I'm not sure where I'm going wrong; when I enumerate the entries in descContent it shows the content I'm looking for; the tricky part is getting in to the nested entries for <body>. Thanks for looking!
EDIT: After further playing around, it seems that BeautifulSoup doesn't recognize that there is HTML in the <description> tag - it appears as just text, hence the problem. I'm thinking of saving the results as an HTML file and reparsing that, but not sure if that will work, as saving contains the literal strings for all the carriage returns and new lines...

use xml parser in lxml
you can install lxml parser with
pip install lxml
with open("file.html") as fp:
soup = BeautifulSoup(fp, 'xml')
for description in soup.find_all('description'):
for body in description.find_all('body'):
print body.text.replace('-', '').replace('\n', '').lstrip(' ')
or u can just type
print body.text

Related

Python: BeautifulSoup Pulling/Parsing data from within html tag

I'm attempting to pull sporting data from a url using Beautiful Soup in Python code. The issue I'm having with this data source is the data appears within the html tag. Specifically this tag is titled ""
I'm after the players data - which seems to be in XML format. However this data is appearing within the "match" tag rather that as the content within the start/end tag.
So like this:
print(soup.match)
Returns: (not going to include all the text):
<match :matchdata='{"match":{"id":"5dbb8e20-6f37-11eb-924a-1f6b8ad68.....ALL DATA HERE....>
</match>
Because of this when I try to output the contents as text it returns empty.
print(soup.match.text)
Returns: nothing
How would I extract this data from within the "" html tag. After this I would like to either save as an XML file or even better a CSV file would be ideal.
My python program from the beginning is:
from bs4 import BeautifulSoup
import requests
url="___MY_URL_HERE___"
# Make a GET request for html content
html_content = requests.get(url).text
# Parse the html content
soup = BeautifulSoup(html_content, "lxml")
## type(soup)
## <class 'bs4.BeautifulSoup'>
print(soup.match)
Thanks a lot!

A tag may have any number of attributes. The tag has an attribute “id” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
So in your case
print(soup.match[":matchdata"])

BeautifulSoup: how to find all the about attributes from html string

In a text file, these items have the same structure and I would like to parse it with beautiful soup.
An extract:
data = """<text id="1" sig="prenatfra-camppres-2017-part01-viewEvent-1&docRefId-0&docName-news%C2%B720170425%C2%B7LC%C2%B7assignment_862852&docIndex-3_1" title="Éditorial élection présidentielle" author="NULL" year="2017" date="25/04/2017" section="NULL" sourcename="La Croix" sourcesig="LC" polarity="Positif" about="Le Pen|Macron">
<p type="title">Éditorial élection présidentielle</p>
</text>"""
soup = BeautifulSoup(data, 'html.parser') #
I know data are not really a pure html code.
I would like to extract all "about" section for example.
print(soup.find_all('about')) => it returns an empty array!
Perhaps I use a wrong parser?
Thanks a lot.
Best regards.
Théo

If you check the documentation carefully for find_all, it looks for tags with the specified name.
So in this case, you should look for the text tag(s) and then retrieve the about attribute from them.
A working example would look like this:
from bs4 import BeautifulSoup
data = """<text id="1" sig="prenatfra-camppres-2017-part01-viewEvent-1&docRefId-0&docName-news%C2%B720170425%C2%B7LC%C2%B7assignment_862852&docIndex-3_1" title="Éditorial élection présidentielle" author="NULL" year="2017" date="25/04/2017" section="NULL" sourcename="La Croix" sourcesig="LC" polarity="Positif" about="Le Pen|Macron">
<p type="title">Éditorial élection présidentielle</p>
</text>"""
soup = BeautifulSoup(data, 'html.parser')
# to get the 'about' attribute from the first text element
print(soup.find_all('text')[0]['about'])
# to get the 'about' attributes from all the text elements, as a list
print([text['about'] for text in soup.find_all('text')])
Output:
Le Pen|Macron
['Le Pen|Macron']

Parsing the html of the child element [BeautifulSoup]

I have only two weeks learning python.
I'm scraping an XML file and one of the elements of the loop [item->description], have HTML inside, how could I get the text inside p?
url="https://www.milenio.com/rss"
source=requests.get(url)
soup=BeautifulSoup(source.content, features="xml")
items=soup.findAll('item')
for item in items:
html_text=item.description
# This returns HTML code: <p>Paragraph 1</p> <p>Paragraph 2</p>
This next line could work, BUT I got some internal, external links and images, which isn't required.
desc=item.description.get_text()
So, if I make a loop o trying to get all the p, it doesn't work.
for p in html_text.find_all('p'):
print(p)
AttributeError: 'NoneType' object has no attribute 'find_all'
Thank you so much!

The issue is how bs4 processes CData (it's pretty well documented but not very solved).
You'll need to import CData from bs4 which will help extract the CData as a string and use the html.parser library, from there create a new bs4 object with that string to give it a findAll attribute and iterate over it's contents.
from bs4 import BeautifulSoup, CData
import requests
url="https://www.milenio.com/rss"
source=requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')
items=soup.findAll('item')
for item in items:
html_text = item.description
findCdata = html_text.find(text=lambda tag: isinstance(tag, CData))
newSoup = BeautifulSoup(findCdata, 'html.parser')
paragraphs = newSoup.findAll('p')
for p in paragraphs:
print(p.get_text())
Edit:
OP needed to extract link text and found that to only be possible inside the item loop using link = item.link.nextSibling because the link content was jumping outside of its tag like so </link>http://www.... In XML tree view this particular XML doc showed a drop down for the link element which is likely the cause.
To get content from other tags inside the document that don't show a dropdown in XML tree view and don't have nested CData convert the tag to lowercase and return the text as usual:
item.pubdate.get_text() # Gets contents the tag <pubDate>
item.author.get_text() # Gets contents of the tag <author>

this should look like this:
for item in items:
html_text=item.description #??
#!! dont use html_text.find_all !!
for p in item.find_all('p'):
print(p)

BeautifulSoup parse XML with HTML content

I have an XML file (formally XBRL) in which some of the tags contain escaped HTML. I'd like to parse the document an XML, and then extract the HTML from these tags.
However, it appears that the escaped characters are somehow deleted by BeautifulSoup. So when I try to get mytag.text all the escaped characters (e.g. &lt ;) are not present anymore. For instance:
'<' in raw_text # True
'<' in str(BeautifulSoup(raw_text, 'xml')) # False
I have tried to create a simple example to reproduce the issue, but I haven't been able to do that, in the sense that the simple example I wanted to provide is working without any issue:
raw_text = '<xmltag><t><p>test</p><t><xmltag>'
soup = BeautifulSoup(raw_text, 'xml')
'<' in str(soup) # True
So you can find the file that I am parsing here: https://drive.google.com/open?id=1lQz1Tfy8u7TBvatP8-QjlnzUi6rNUR79
The code I am using is:
with open('test.xml', 'r') as fp:
raw_text = fp.read()
soup = BeautifulSoup(raw_text, 'xml')
mytag = soup.find('QuarterlyFinancialInformationTextBlock')
print(mytag.text[:100])
# prints: div div style="margin-left:0pt;margin-righ
# original file: <div> <div style=

Solutions using simplifieddoc
from simplified_scrapy.simplified_doc import SimplifiedDoc
doc = SimplifiedDoc('<xmltag><t><p>test</p></t></xmltag>')
print (doc.t.html)
print (doc.xmltag.t.html)
print (doc.t.unescape())
result:
<p>test</p>
<p>test</p>
<p>test</p>

Try to use another parser for XBRL, i.e. python-xbrl
Check this link- Xbrl parser written in Python

Prettify the code using Beautiful Soup

Using function prettify() I can print the html code very well formated, and I have read that this function prints even a broken html code properly (for example if the tag is opened but never closed, prettify helps to fix that). But only this function can do that or after loading the data to Beautiful Soup object like:
soup = BeautifulSoup(data),
causes that now the soup contains a code which is resistant for a broken html code.
For example if I have a broken code:
<body>
<p><b>Paragraph.</p>
</body>
and I will load it to the BS object it is seen inside of soup object as above or as a fixed one?:
<body>
<p><b>Paragraph.</b></p>
</body>

The HTML marekup is corrected at the time of creating the soup, not at the time of pretty-printing it. This is needed so that BeautifulSoup can navigate the document correctly.
As you can see below, the string representation of the soup contains corrected markup:
>>> from bs4 import BeautifulSoup
>>> text="""<body>
... <p><b>Paragraph.</p>
... </body>
... """
>>> soup = BeautifulSoup(text)
>>> str(soup)
'<body>\n<p><b>Paragraph.</b></p>\n</body>\n'
>>>
If you read the source for class BeautifulStoneSoup, you will find the following comment which addresses your broken markup:
This class contains the basic parser and search code. It defines
a parser that knows nothing about tag behavior except for the
following:
You can't close a tag without closing all the tags it encloses.
That is, "<foo><bar></foo>" actually means
"<foo><bar></bar></foo>".
And then further down the source, you can see that BeautifulSoup inherits from BeautifulStoneSoup.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parsing HTML nested within XML file (using BeautifulSoup) - python

Related

Python: BeautifulSoup Pulling/Parsing data from within html tag

BeautifulSoup: how to find all the about attributes from html string

Parsing the html of the child element [BeautifulSoup]

BeautifulSoup parse XML with HTML content

Prettify the code using Beautiful Soup

Categories

Resources