Extracting data from HTML bulleted lists in Python - python

I have an HTML document with the following bulleted list:
Body=<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>
When rendered as plain text, the list reads:
PreconditionsPC1PC2Use Case TriggersT1T2PostconditionsPO1PO2
I'm trying to write a function in Python that will dissect this list and pull out groups of data. The goal is to put the data in a matrix that looks like the following:
[[Preconditions, PC1],[Preconditions, PC2],[Use Case Triggers, T1],[Use Case Triggers, T2],[Postconditions, PO1],[Postconditions,PO2]]
The other hurdle is that the matrix needs to be generated regardless of the number of ul and li elements.
Any guidance is appreciated!

You can write a function that takes raw HTML and strips out all of the tags:
import re

def cleanhtml(raw_html):
    cleanr = re.compile(r"<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});")
    cleantext = re.sub(cleanr, " ", raw_html)
    return cleantext
Some other cleanr options:
cleanr = re.compile(r"<[A-Za-z/][^>]*>")
cleanr = re.compile(r"<[^>]*>")
cleanr = re.compile(r"</?\w+\s*[^>]*?/?>")
But there is a better and easier way with BeautifulSoup:
import requests
from bs4 import BeautifulSoup

def clean_with_soup(url: str) -> str:
    r = requests.get(url).text
    soup = BeautifulSoup(r, "html.parser")
    return soup.get_text()

A good library for parsing HTML is BeautifulSoup. Code example:
html = "<ul><li>Preconditions<ul><li>PC1</li><li>PC2</li></ul></li><li>Use Case Triggers<ul><li>T1</li><li>T2</li></ul></li><li>Postconditions<ul><li>PO1</li><li>PO2</li></ul></li></ul>"
from bs4 import BeautifulSoup
bs = BeautifulSoup(html, "html.parser")
uls = bs.findAll("ul")
for ul in uls:
print(ul.findAll("li"))
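Neither snippet builds the matrix the question actually asks for, so here is a minimal sketch of that last step. It assumes the structure from the example: a single top-level ul whose direct li children each hold a heading text node followed by one nested ul (deeper nesting would need recursion):
from bs4 import BeautifulSoup

def li_matrix(html):
    soup = BeautifulSoup(html, "html.parser")
    matrix = []
    # Only the direct li children of the outermost ul are headings.
    for li in soup.find("ul").find_all("li", recursive=False):
        # The heading is the li's own leading text node, before the nested ul.
        heading = li.find(string=True).strip()
        nested = li.find("ul")
        if nested:
            for item in nested.find_all("li"):
                matrix.append([heading, item.get_text().strip()])
    return matrix

print(li_matrix(html))
# [['Preconditions', 'PC1'], ['Preconditions', 'PC2'],
#  ['Use Case Triggers', 'T1'], ['Use Case Triggers', 'T2'],
#  ['Postconditions', 'PO1'], ['Postconditions', 'PO2']]
Because it iterates whatever lis it finds at those two levels, this works regardless of the number of ul and li elements.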

Related

Problem Scraping Element & Child Text with lxml & etree

I am trying to scrape lists from Wikipedia pages (like this one for example: https://de.wikipedia.org/wiki/Liste_der_Bisch%C3%B6fe_von_Sk%C3%A1lholt) in a particular format. I am encountering issues getting 'li' and 'a href' to match up.
For example, from the above page, the ninth bullet has text:
1238–1268: Sigvarður Þéttmarsson (Norweger)
with HTML:
<li>1238–1268: Sigvarður Þéttmarsson (Norweger)</li>
I want to pull it together as a dictionary:
'1238–1268: Sigvarður Þéttmarsson (Norweger)': '/wiki/Sigvar%C3%B0ur_%C3%9E%C3%A9ttmarsson'
[Entire text of both parts of 'li' and 'a' child]: [href of 'a' child]
I know I can use lxml/etree to do this, but I'm not entirely sure how. Some recombination of the below?
from lxml import etree

tree = etree.HTML(html)
bishops = tree.cssselect('li')
texts = [li.text for li in bishops]
links = tree.cssselect('li a')
hrefs = [bishop.get('href') for bishop in links]
Update: I have figured this out using BeautifulSoup as follows:
from bs4 import BeautifulSoup

html = driver.page_source  # page source from Selenium
soup = BeautifulSoup(html, 'html.parser')
bishops_with_links = {}
bishops = soup.select('li')
for bishop in bishops:
    if bishop.findChildren('a'):
        bishops_with_links[bishop.text] = 'https://de.wikipedia.org' + bishop.a.get('href')
    else:
        bishops_with_links[bishop.text] = ''
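For completeness, the same idea in lxml/etree might look like the sketch below. It assumes html already holds the page source; xpath('string()') collapses the li's own text and its a child's text into one string, which is exactly the "both parts" the question mentions:
from lxml import etree

tree = etree.HTML(html)
bishops_with_links = {}
for li in tree.xpath('//li'):
    text = li.xpath('string()').strip()  # combined text of li and its children
    a = li.find('.//a')  # first anchor descendant, if any
    if a is not None and a.get('href'):
        bishops_with_links[text] = 'https://de.wikipedia.org' + a.get('href')
    else:
        bishops_with_links[text] = ''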

BeautifulSoup: remove page numbers at bottom

I'm trying to remove the page numbers from this HTML. They seem to follow the pattern '\n', 'number', '\n' if you look at the list of texts. Would I be able to do it with BeautifulSoup? If not, how do I remove that pattern from the list?
import requests
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    if element.parent.name in ['sup']:
        return False
    if isinstance(element, Comment):
        return False
    return True

url = 'https://www.sec.gov/Archives/edgar/data/1318605/000156459018019254/tsla-10q_20180630.htm'
html = requests.get(url)
soup = BeautifulSoup(html.text, 'html.parser')
texts = soup.findAll(text=True)
### could remove ['\n','number','\n']
visible_texts = filter(tag_visible, texts)
You can extract the tags containing page numbers from the soup before getting the text. Each page number sits in a p immediately before an hr:
soup = BeautifulSoup(html.text, 'html.parser')
for hr in soup.select('hr'):
    hr.find_previous('p').extract()
texts = soup.findAll(text=True)
This removes the page-number tags, which are styled like:
<p style="text-align:center;margin-top:12pt;margin-bottom:0pt;text-indent:0%;font-size:10pt;font-family:Times New Roman;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">57</p>
<p style="text-align:center;margin-top:12pt;margin-bottom:0pt;text-indent:0%;font-size:10pt;font-family:Times New Roman;font-weight:normal;font-style:normal;text-transform:none;font-variant: normal;">58</p>
... etc.
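Alternatively, if you would rather filter the extracted list itself, you could drop text nodes that are nothing but digits. This is a sketch and assumes page numbers are the only purely numeric strings you are willing to lose:
texts = soup.findAll(text=True)
visible_texts = [t for t in texts
                 if tag_visible(t) and t.strip() and not t.strip().isdigit()]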

After finding all the links and text using find_all in Beautiful Soup, how do you grab the one that you need?

My example
from bs4 import BeautifulSoup
import requests
result = requests.get("https://pythonprogramming.net/parsememcparseface/")
c = result.content
soup = BeautifulSoup(c,'lxml')
patch_name = soup.find_all(["a", "p"])
u = soup.get_text()
print(u)
How do I get the text I need so that I can store it in a variable for later use?
This will return a list of a and p tags:
patch_name = soup.find_all(["a", "p"])
You can get the text of every tag in the list:
[tag.get_text() for tag in patch_name]
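To grab one specific entry rather than all of them, filter the list by its text. A sketch, where "keyword" is purely an illustrative placeholder for whatever you are actually looking for:
# first tag whose text contains the (hypothetical) keyword, else None
target = next((tag.get_text() for tag in patch_name
               if "keyword" in tag.get_text().lower()), None)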

What is the ideal way to use xml data in python html parsing with Beautiful Soup?

What is the ideal way to convert xml to text in python html parsing with Beautiful Soup?
When I am doing HTML parsing with the Python 2.7 BeautifulSoup library, I can get as far as building the soup, but I have no idea how to extract the data I need, so I tried converting it all to a string.
In the following example, I want to extract all number in the span tag and add them up. Is there a better way?
XML data:
http://python-data.dr-chuck.net/comments_324255.html
CODE:
import urllib2
from BeautifulSoup import *
import re

url = 'http://python-data.dr-chuck.net/comments_324255.html'
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
spans = soup('span')
span_str = str(spans)
sp = re.findall('([0-9]+)', span_str)
count = 0
for i in sp:
    count = count + int(i)
print('Sum:', count)
No regex needed:
from bs4 import BeautifulSoup
from requests import get
url = 'http://python-data.dr-chuck.net/comments_324255.html'
html = get(url).text
soup = BeautifulSoup(html, 'lxml')
count = sum(int(n.text) for n in soup.findAll('span'))
import requests, bs4

r = requests.get("http://python-data.dr-chuck.net/comments_324255.html")
soup = bs4.BeautifulSoup(r.text, 'lxml')
print(sum(int(span.text) for span in soup.find_all(class_="comments")))
output:
2788

Python text parsing between two words

I'm using BeautifulSoup and want to extract all the text between two words on a webpage.
Ex, imagine the following website text:
This is the text of the webpage. It is just a string of a bunch of stuff and maybe some tags in between.
I want to pull out everything on the page that starts with "text" and ends with "bunch".
In this case I'd want only:
text of the webpage. It is just a string of a bunch
However, there's a chance there could be multiple instances of this on a page.
What is the best way to do this?
This is my current setup:
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
mech = Browser()
urls = [
http://ca.news.yahoo.com/forget-phoning-business-app-sends-text-instead-100143774--sector.html
]
for url in urls:
page = mech.open(url)
html = page.read()
soup = BeautifulSoup(html)
text= soup.prettify()
texts = soup.findAll(text=True)
def visible(element):
if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
# If the parent of your element is any of those ignore it
return False
elif re.match('<!--.*-->', str(element)):
# If the element matches an html tag, ignore it
return False
else:
# Otherwise, return True as these are the elements we need
return True
visible_texts = filter(visible, texts)
# Filter only returns those items in the sequence, texts, that return True.
# We use those to build our final list.
for line in visible_texts:
print line
Since you're just searching the extracted text, a regex is all you need:
import re

result = re.findall("text.*?bunch", text_from_web_page)
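findall already returns every match, which covers the multiple-instances case. One caveat: by default . does not match newlines, so if a phrase can span lines, pass re.DOTALL (text_from_web_page stands for the visible text collected above):
result = re.findall("text.*?bunch", text_from_web_page, re.DOTALL)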
