Split HTML nested list into python list - python

I have an HTML list structured like this (it is what CKEditor generates for nested lists):
<ul>
<li>niv1alone</li>
<li>niv1
<ul>
<li>niv2
<ul>
<li>niv3
<ul>
<li>niv4</li>
</ul></li></ul></li></ul></li>
<li>autre niv1 alone</li>
</ul>
How do I build a "recursive list" like this:
[
('niv1alone',[]),('niv1',[('niv2',[('niv3',[('niv4',[])])])]),('autre niv1 alone',[])
]
I have tried several things with BeautifulSoup, but I can't get the desired result.

Here's a recursive function that does something similar to what you ask. The trick to writing a recursive function is to make the problem smaller and then recurse on it. Here I walk down the element tree and pass each element's children to the next call, which is always a strictly smaller set than the one before.
import bs4
html = '''
<ul>
<li>niv1alone</li>
<li>niv1
<ul>
<li>niv2
<ul>
<li>niv3
<ul>
<li>niv4</li>
</ul></li></ul></li></ul></li>
<li>autre niv1 alone</li>
</ul>
'''
def make_tree(body: bs4.Tag):
    branch = []
    for ch in body.children:
        if isinstance(ch, bs4.NavigableString):
            # skip whitespace
            if str(ch).strip():
                branch.append(str(ch).strip())
        else:
            branch.append(make_tree(ch))
    return branch
if __name__ == '__main__':
    soup = bs4.BeautifulSoup(html, 'html.parser')
    tree = make_tree(soup.select_one('ul'))
    print(tree)
output:
[['niv1alone'], ['niv1', [['niv2', [['niv3', [['niv4']]]]]]], ['autre niv1 alone']]
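The output above uses plain nested lists rather than the (text, children) tuples the question asks for. Here is a minimal sketch of that exact shape. It swaps BeautifulSoup for the standard library's xml.etree.ElementTree, so it assumes the fragment is well-formed XML (the CKEditor sample is); the function name to_tuples is made up for illustration.

```python
import xml.etree.ElementTree as ET

html = """
<ul>
<li>niv1alone</li>
<li>niv1
<ul>
<li>niv2
<ul>
<li>niv3
<ul>
<li>niv4</li>
</ul></li></ul></li></ul></li>
<li>autre niv1 alone</li>
</ul>
"""

def to_tuples(ul):
    """Turn a <ul> element into a list of (text, children) tuples."""
    items = []
    for li in ul.findall('li'):          # direct <li> children only
        text = (li.text or '').strip()   # text that precedes any nested <ul>
        nested = li.find('ul')           # direct nested <ul>, if there is one
        items.append((text, to_tuples(nested) if nested is not None else []))
    return items

tree = to_tuples(ET.fromstring(html.strip()))
print(tree)
```

For HTML that is not well-formed XML, the same recursion can be written against BeautifulSoup tags instead.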

Related

How to get text into groups according to their HTML structure?

I am dealing with an HTML processing problem: I need to group text according to the structure of the document. The original file is very complex, so I have simplified it below.
I already have the list of names: ['Json Bell','Jim Charlie','Mike Alfie','William Cyphort','Juniper Egbert']
and the HTML file:
<p>studets</p>
<ul>
<li>Json Bell</li>
<li>Jim Charlie</li>
<li>Mike Alfie</li>
</ul>
<p>teachers</p>
<ul>
<li>William Cyphort</li>
<li>Juniper Egbert</li>
</ul>
How can I get the groups ['Json Bell','Jim Charlie','Mike Alfie'] and ['William Cyphort','Juniper Egbert']?
Any idea is welcome! I am familiar with Python, so a solution in Python would be best.
Put another way: how can I tell whether two strings share the same parent node?
For this problem I would use a list comprehension: first select all <ul> tags, then for each <ul> select its <li> tags and extract their text:
data = """
<p>studets</p>
<ul>
<li>Json Bell</li>
<li>Jim Charlie</li>
<li>Mike Alfie</li>
</ul>
<p>teachers</p>
<ul>
<li>William Cyphort</li>
<li>Juniper Egbert</li>
</ul>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
print([[li.text for li in ul.select('li')] for ul in soup.select('ul')])
Output is:
[['Json Bell', 'Jim Charlie', 'Mike Alfie'], ['William Cyphort', 'Juniper Egbert']]
EDIT:
For comparing parents you can use .parent property:
tag1 = soup.find(text='Json Bell')
tag2 = soup.find(text='Jim Charlie')
print(tag1.parent == tag2.parent) #tag1 and tag2 parents are <li> tags (different)
print(tag1.parent.parent == tag2.parent.parent) #this compares <ul> tags, which are the same
This prints:
False
True
Note: soup.find(text='Json Bell') returns a NavigableString object. The parent of this NavigableString is the <li> tag, the parent of the <li> tag is the <ul> tag, and so on.
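The same-parent idea can also be used to group an existing list of names directly. The following is a rough sketch using only the standard library's xml.etree.ElementTree rather than BeautifulSoup; it assumes the fragment is well-formed XML, and group_by_parent is a hypothetical helper name. ElementTree elements have no .parent attribute, so the sketch builds a child-to-parent map first.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

data = """
<p>studets</p>
<ul>
<li>Json Bell</li>
<li>Jim Charlie</li>
<li>Mike Alfie</li>
</ul>
<p>teachers</p>
<ul>
<li>William Cyphort</li>
<li>Juniper Egbert</li>
</ul>"""

def group_by_parent(fragment, names):
    # Wrap in a dummy root because the fragment has several top-level tags.
    root = ET.fromstring(f"<root>{fragment}</root>")
    # Map every element to its parent element.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    groups = defaultdict(list)
    for li in root.iter('li'):
        text = (li.text or '').strip()
        if text in names:
            # Names whose <li> tags share a <ul> land in the same group.
            groups[parent_of[li]].append(text)
    return list(groups.values())

names = ['Json Bell', 'Jim Charlie', 'Mike Alfie', 'William Cyphort', 'Juniper Egbert']
groups = group_by_parent(data, names)
print(groups)
```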

Is there an alternative to bs4's find_all() method that returns another soup object instead of a list, for further navigation?

After finding all the <ul>s, I'd like to extract the text and the hrefs from them. The difficulty with this particular HTML is that I need most, but not all, of the <li> items on the page. find_all() returns a list-like object, which I cannot navigate further as if it were a soup object.
For example, in the below snippet, to ultimately create a dictionary of {'cityName': 'href',}, I have tried:
city_list = soup.find_all('ul', {'class': ''})
city_dict = {}
for city in city_list:
    city_dict[city.text] = city['href']
Here is the sample minimal HTML:
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>southeast alaska</li>
</ul>
<h4>Arizona</h4>
<ul>
<li>flagstaff / sedona</li>
<li>yuma</li>
</ul>
<ul>
<li>help</li>
<li>safety</li>
<li class="fsel mobile linklike" data-mode="regular">desktop</li>
</ul>
How can I, essentially, find_all() the ul's first, and then further find only the li's that interest me?
Probably you need something like this:
city_dict = {}
for ul in soup.find_all('ul', {'class': ''}):
    state_name = ul.find_previous_sibling('h4').text
    print(state_name)
    for link in ul.find_all('a'):
        print(link['href'])
city_dict = {}
for li in soup.find_all('li'):
    city_name = li.text
    for link in li.find_all('a'):
        city_dict[city_name] = link['href']
Try this, Thank me later :)
list_items = soup.find_all('ul',{'class':''})
list_of_dicts = []
for item in list_items:
    for i in item.find_all('li'):
        new_dict = {i.text:i.a.get('href')}
        list_of_dicts.append(new_dict)
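One way to keep only the <li> items of interest is to accept a <ul> only when the element right before it is an <h4> state heading, which leaves out the trailing help/safety list. Below is a rough standard-library sketch of that filter, using xml.etree.ElementTree instead of BeautifulSoup. It assumes the fragment is well-formed XML, and cities_by_state is a hypothetical name. The sample as posted contains no <a> tags, so the sketch collects city names only.

```python
import xml.etree.ElementTree as ET

page = """
<h4>Alabama</h4>
<ul>
<li>auburn</li>
<li>birmingham</li>
<li>tuscaloosa</li>
</ul>
<h4>Alaska</h4>
<ul>
<li>anchorage / mat-su</li>
<li>southeast alaska</li>
</ul>
<h4>Arizona</h4>
<ul>
<li>flagstaff / sedona</li>
<li>yuma</li>
</ul>
<ul>
<li>help</li>
<li>safety</li>
<li class="fsel mobile linklike" data-mode="regular">desktop</li>
</ul>
"""

def cities_by_state(fragment):
    # Wrap in a dummy root because the fragment has several top-level tags.
    root = ET.fromstring(f"<root>{fragment}</root>")
    result = {}
    previous = None
    for element in root:
        # Keep a <ul> only when the element just before it is an <h4>.
        if element.tag == 'ul' and previous is not None and previous.tag == 'h4':
            state = (previous.text or '').strip()
            result[state] = [''.join(li.itertext()).strip()
                             for li in element.findall('li')]
        previous = element
    return result

cities = cities_by_state(page)
print(cities)
```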

Identifying branches different in tag structure

I'm hoping to check whether two HTML documents differ in tag structure alone, ignoring the text, and to pick out the branch(es) that differ.
For example :
html_1 = """
<p>i love it</p>
"""
html_2 = """
<p>i love it really</p>
"""
They share the same tag structure, so they're seen to be the same. However:
html_1 = """
<div>
<p>i love it</p>
</div>
<p>i love it</p>
"""
html_2 = """
<div>
<p>i <em>love</em> it</p>
</div>
<p>i love it</p>
"""
I'd expect it to return the <div> branch, because the tag structures are different. Could lxml, BeautifulSoup or some other lib achieve this? I'm trying to find a way to actually pick out the different branches.
Thanks
A more reliable approach would be to construct a Tree of tag names out of the document as discussed here:
HTML Parse tree using Python 2.7
Here is an example working solution based on treelib.Tree:
from bs4 import BeautifulSoup
from treelib import Tree
def traverse(parent, tree):
    tree.create_node(parent.name, parent.name, parent=parent.parent.name if parent.parent else None)
    for node in parent.find_all(recursive=False):
        tree.create_node(node.name, parent=parent.name)
        traverse(node, tree)
def compare(html1, html2):
    tree1 = Tree()
    traverse(BeautifulSoup(html1, "html.parser"), tree1)
    tree2 = Tree()
    traverse(BeautifulSoup(html2, "html.parser"), tree2)
    return tree1.to_json() == tree2.to_json()
print(compare("<p>i love it</p>", "<p>i love it really</p>"))
print(compare("<p>i love it</p>", "<p>i <em>love</em> it</p>"))
Prints:
True
False
Sample code to check whether the tag structure of two HTML documents is the same or not:
Demo:
# Assumption: the original does not show what PARSER is; lxml.html matches the calls used here.
import lxml.html as PARSER
def getTagSequence(content):
    """
    Return the tag names of all elements in document order.
    """
    root = PARSER.fromstring(content)
    tag_sequence = []
    # iter() replaces the deprecated getiterator()
    for elm in root.iter():
        tag_sequence.append(elm.tag)
    return tag_sequence
html_1_tags = getTagSequence(html_1)
html_2_tags = getTagSequence(html_2)
if html_1_tags == html_2_tags:
    print("Tagging structure is same.")
else:
    print("Tagging structure is different.")
    print("HTML 1 Tagging:", html_1_tags)
    print("HTML 2 Tagging:", html_2_tags)
Note:
The code above checks only the tag sequence, not the parent-child relationships. For example, these two documents produce the same tag sequence even though their nesting differs:
html_1 = """ <p> This <span>is <em>p</em></span> tag</p>"""
html_2 = """ <p> This <span>is </span><em>p</em> tag</p>"""
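To take nesting into account and report which top-level branch differs, each element can be reduced to a nested tuple of tag names and the two documents compared branch by branch. This is a minimal sketch using the standard library's xml.etree.ElementTree, not lxml or treelib. It assumes both fragments are well-formed XML and compares top-level branches by position only; signature and differing_branches are hypothetical names.

```python
from itertools import zip_longest
import xml.etree.ElementTree as ET

def signature(element):
    """Nested tuple of tag names; text and attributes are ignored."""
    return (element.tag, tuple(signature(child) for child in element))

def differing_branches(html_a, html_b):
    # Wrap each fragment in a dummy root so several top-level tags are allowed.
    a = ET.fromstring(f"<root>{html_a}</root>")
    b = ET.fromstring(f"<root>{html_b}</root>")
    diffs = []
    for index, (x, y) in enumerate(zip_longest(a, b)):
        sig_x = signature(x) if x is not None else None
        sig_y = signature(y) if y is not None else None
        if sig_x != sig_y:
            diffs.append((index, sig_x, sig_y))
    return diffs

html_1 = """
<div>
<p>i love it</p>
</div>
<p>i love it</p>
"""
html_2 = """
<div>
<p>i <em>love</em> it</p>
</div>
<p>i love it</p>
"""
diffs = differing_branches(html_1, html_2)
for index, sig_1, sig_2 in diffs:
    print(index, sig_1, sig_2)
```

Each reported entry holds the position of the top-level branch and both signatures, so the <div> branch from the question shows up as index 0 while the identical trailing <p> is not reported.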

Scraping data from website using Beautiful Soup and show it in jinja2

I am trying to pull a list of data from a website using Beautiful Soup:
class burger(webapp2.RequestHandler):
    Husam = urlopen('http://www.qaym.com/city/77/category/3/%D8%A7%D9%84%D8%AE%D8%A8%D8%B1/%D8%A8%D8%B1%D8%AC%D8%B1/').read()
    def get(self, soup = BeautifulSoup(Husam)):
        tago = soup.find_all("a", class_ = "bigger floatholder")
        for tag in tago:
            me2 = tag.get_text("\n")
            template_values = {
                'me2': me2
            }
            for template in template_values:
                template = jinja_environment.get_template('index.html')
                self.response.out.write(template.render(template_values))
When I try to show the data in the template using jinja2, the whole template is repeated once per list item, with a single entry in each copy.
How can I put the whole list in one tag, and still be able to edit the other tags, without the repetition?
<li>{{ me2}}</li>
To output a list of entries, you can loop over them in your jinja2 template like this:
{%for entry in me2%}
<li> {{entry}} </li>
{% endfor %}
To use this, your python code also has to put the tags into a list.
Something like this should work:
def get(self, soup=BeautifulSoup(Husam)):
    tago = soup.find_all("a", class_="bigger floatholder")
    # Create a list to store your entries
    values = []
    for tag in tago:
        me2 = tag.get_text("\n")
        # Append each tag to the list
        values.append(me2)
    template = jinja_environment.get_template('index.html')
    # Put the list of values into a dict entry for jinja2 to use
    template_values = {'me2': values}
    # Render the template with the dict that contains the list
    self.response.out.write(template.render(template_values))
References:
Jinja2 template documentation

How to turn a Html nested list into a Python's one

I have this kind of HTML list:
lista = """
<ul>
<li>Arts & Entertainment
<ul>
<li>Celebrities & Entertainment News</li>
<li>Comics & Animation
<ul>
<li>Anime & Manga</li>
<li>Cartoons</li>
<li>Comics</li>
</ul>
</li>
</ul>
</li>
</ul>
"""
and I would like to convert it into a useful python structure for further processing:
what structure do you suggest? and also how would you do that?
With BeautifulSoup, I'd do something like this:
from BeautifulSoup import BeautifulSoup
from pprint import pprint
def parseList(tag):
    if tag.name == 'ul':
        return [parseList(item)
                for item in tag.findAll('li', recursive=False)]
    elif tag.name == 'li':
        if tag.ul is None:
            return tag.text
        else:
            return (tag.contents[0].string.strip(), parseList(tag.ul))
soup = BeautifulSoup(lista)
pprint(parseList(soup.ul))
Example output:
[(u'Arts & Entertainment',
  [u'Celebrities & Entertainment News',
   (u'Comics & Animation',
    [u'Anime & Manga', u'Cartoons', u'Comics'])])]
Note that for list items that contain an unordered list, a tuple is returned: its first element is the string in the list item, and its second element is a list with the contents of the nested unordered list.
You can use the Mapping Type: Dictionaries
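As a rough illustration of the dictionary suggestion, each item's text can become a key whose value is a dict of its sub-items (empty for leaves). The sketch below uses the standard library's html.parser rather than BeautifulSoup, because the sample contains bare & characters that a strict XML parser would reject. The class name ListToDict is made up for illustration, and the sketch assumes item texts are unique among siblings.

```python
from html.parser import HTMLParser

lista = """
<ul>
<li>Arts & Entertainment
<ul>
<li>Celebrities & Entertainment News</li>
<li>Comics & Animation
<ul>
<li>Anime & Manga</li>
<li>Cartoons</li>
<li>Comics</li>
</ul>
</li>
</ul>
</li>
</ul>
"""

class ListToDict(HTMLParser):
    """Collect nested <ul>/<li> markup into nested dicts keyed by item text."""

    def __init__(self):
        super().__init__()
        self.tree = {}      # result for the outermost <ul>
        self.targets = []   # dict being filled by each open <ul>
        self.items = []     # (text_parts, children) for each open <li>

    def handle_starttag(self, tag, attrs):
        if tag == 'ul':
            # A nested <ul> fills the children dict of the innermost open <li>.
            self.targets.append(self.items[-1][1] if self.items else self.tree)
        elif tag == 'li':
            self.items.append(([], {}))

    def handle_data(self, data):
        if self.items:
            self.items[-1][0].append(data)

    def handle_endtag(self, tag):
        if tag == 'li':
            parts, children = self.items.pop()
            key = ' '.join(''.join(parts).split())  # normalise whitespace
            self.targets[-1][key] = children
        elif tag == 'ul':
            self.targets.pop()

parser = ListToDict()
parser.feed(lista)
parser.close()
print(parser.tree)
```

A dict gives direct lookup by category name, at the cost of silently merging siblings that share the same text.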
