Parsing HTML with BeautifulSoup and no Classes (just paragraphs) - python

I'm trying to parse 'https://projecteuler.net/problem=8' for the middle bit with the number. Since it doesn't have a separate class to select it by, I have used
import requests
from bs4 import BeautifulSoup

r = requests.get('https://projecteuler.net/problem=8')
data = r.text
soup = BeautifulSoup(data, "lxml")
[para1, para2, para3] = soup.find_all('p')
to separate the paragraphs, but this leaves a lot of extra junk (<p> and <br>) in there. Is there a command to clear all that out? Is there a better command to do the splitting than the one I am currently using? I've never really done much web crawling in Python...

soup.find_all returns a list of matching tag nodes (a ResultSet), markup included. If you want just the text of a node, use its .text attribute. Applying this to para2 gives:
para2.text.split()
#['73167176531330624919225119674426574742355349194934',
# '96983520312774506326239578318016984801869478851843',
# '85861560789112949495459501737958331952853208805511',
# '12540698747158523863050715693290963295227443043557',
# ...
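Those rows can then be joined back into the single 1000-digit string. A minimal sketch of that step (assuming the page layout stays as above, with the digits in the second <p>):

import requests
from bs4 import BeautifulSoup

r = requests.get('https://projecteuler.net/problem=8')
soup = BeautifulSoup(r.text, "lxml")
para2 = soup.find_all('p')[1]  # the paragraph holding the digit rows

# .split() discards the line breaks produced by the <br> tags,
# so joining the pieces yields one continuous digit string
digits = ''.join(para2.text.split())
print(len(digits))  # expected: 1000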

Related

How to webscrape with Python keeping meta-information in the text?

I am trying to scrape this website. To do so, I run the following code:
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/pressconf/2022/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)
data = []
for link in soup.select('div.title > a'):
    soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{link['href']}").content)
    data.append({
        'text': ' '.join([p.text for p in soup.select('main .section p:not([class])')])
    })
print(data)
This works fine. What is the issue? The problem is that the scraped text comes back with no information about paragraph breaks or bold formatting. This is a problem, since I then need to make some decisions based on that structure.
Can anyone suggest how to maintain this meta-information in the text?
Thanks a lot!
A solution is to determine, from the website's source code, which markers delimit paragraph breaks and bold text. Then, in the soup variable, you can locate what interests you by searching for those markers as strings. Looking briefly at the source code of your website, I think the answer lies in the following markers:
"</a></div><div class="subtitle">"

Traversal Parsing of Text from .HTML

I am trying to scrape text from webpages contained in tags of type title, headings, or paragraphs. When I try the code below, I get mixed results depending on where the URL is from. For some sources (e.g. Wikipedia or Reuters) the code works more or less fine and at least finds all the text. For other sources (e.g. Politico, The Economist) I miss a lot of the text contained in the webpage.
I am using a traversal algorithm to walk through the tree and check whether each tag is 'of interest'. Maybe find_all(True, recursive=False) is for some reason missing children that subsequently contain the text I am looking for? I'm unsure how to investigate that. Or maybe some sites are blocking the scraping somehow? But then why can I scrape one paragraph from The Economist?
The code below replicates the issue for me - you should see the Wikipedia page (urls[3]) print as desired, the Politico page (urls[0]) missing all text in the article, and The Economist (urls[1]) missing all but one paragraph.
from bs4 import BeautifulSoup
import requests
urls = ["https://www.politico.com/news/2022/01/17/democrats-biden-clean-energy-527175",
        "https://www.economist.com/finance-and-economics/the-race-to-power-the-defi-ecosystem-is-on/21807229",
        "https://www.reuters.com/world/significant-damage-reported-tongas-main-island-after-volcanic-eruption-2022-01-17/",
        "https://en.wikipedia.org/wiki/World_War_II"]
# get soup
url = urls[0] # first two urls don't work, last two do work
response = requests.get(url)
soup = BeautifulSoup(response.text, features="html.parser")
# tags with text that i want to print
tags_of_interest = ['p', 'title'] + ['h' + str(i) for i in range(1, 7)]
def read(soup):
    for tag in soup.find_all(True, recursive=False):
        if (tag.name in tags_of_interest):
            print(tag.name + ": ", tag.text.strip())
        for child in tag.find_all(True, recursive=False):
            read(child)
# call the function
read(soup)
BeautifulSoup's find_all() returns a list of tags in depth-first traversal (DFT) order, so it can do the tree walk for you and give easy access to the desired elements.
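A minimal sketch of that rewrite (my reading of the answer, not its verbatim code): let find_all() do the depth-first walk instead of recursing manually:

from bs4 import BeautifulSoup
import requests

tags_of_interest = ['p', 'title'] + ['h' + str(i) for i in range(1, 7)]
response = requests.get("https://en.wikipedia.org/wiki/World_War_II")
soup = BeautifulSoup(response.text, features="html.parser")

# find_all() already traverses the whole tree depth-first, so passing the
# list of tag names visits every matching element once, in document order
for tag in soup.find_all(tags_of_interest):
    print(tag.name + ": ", tag.text.strip())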

Python BeautifulSoup - Add Tags around found keyword

I am currently working on a project in which I want to allow regex search in/on a huge set of HTML files.
After first pinpointing the files of my interest I now want to highlight the found keyword!
Using BeautifulSoup I can determine the node in which my keyword is found. One thing I already do is change the color of the whole parent.
However, I would also like to add my own <span> tags around just the keyword(s) I found.
Determining the position and such is no big deal using the find() functions provided by BeautifulSoup. But adding my tags around regular text seems to be impossible?
# match = keyword found by another regex
# node = the node I found using the soup.find(text=myRE)
node.parent.setString(node.replace(match, "<myspan>"+match+"</myspan>"))
This way I only add mere text, not a proper tag, since the document is not freshly parsed - and re-parsing is what I hope to avoid!
I hope my problem became a little clear :)
Here's a simple example showing one way to do it:
import re
from bs4 import BeautifulSoup as Soup
html = '''
<html><body><p>This is a paragraph</p></body></html>
'''
(1) store the text and empty the tag
soup = Soup(html, 'html.parser')
text = soup.p.string
soup.p.clear()
print(soup)
(2) get the start and end positions of the word to be made bold
match = re.search(r'\ba\b', text)
start, end = match.start(), match.end()
(3) split the text and append the first part
soup.p.append(text[:start])
print(soup)
(4) create a tag, add the relevant text to it, and append it to the parent
b = soup.new_tag('b')
b.append(text[start:end])
soup.p.append(b)
print(soup)
(5) append the rest of the text
soup.p.append(text[end:])
print(soup)
Here is the output from the steps above:
<html><body><p></p></body></html>
<html><body><p>This is </p></body></html>
<html><body><p>This is <b>a</b></p></body></html>
<html><body><p>This is <b>a</b> paragraph</p></body></html>
If you build the modified text...
my_tag = node.replace(match, "<myspan>" + match + "</myspan>")
...and pass it through BeautifulSoup once more
new_soup = BeautifulSoup(my_tag, 'html.parser')
it will be parsed into proper tag objects, available for further manipulation.
You could apply these changes to the original mass of text and run it through as a whole, to avoid repetition.
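A tiny demonstration of that claim (the string below stands in for your modified text):

from bs4 import BeautifulSoup

raw = "This is <myspan>a</myspan> paragraph"  # plain string with markup added
new_soup = BeautifulSoup(raw, "html.parser")  # re-parse it
print(new_soup.myspan)  # <myspan>a</myspan> - now a real Tag object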
EDIT:
From the (legacy BeautifulSoup 3) docs:
# Here is a more complex example that replaces one tag with another:
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup("<b>Argh!<a>Foo</a></b><i>Blah!</i>")
tag = Tag(soup, "newTag", [("id", 1)])
tag.insert(0, "Hooray!")
soup.a.replaceWith(tag)
print soup
# <b>Argh!<newTag id="1">Hooray!</newTag></b><i>Blah!</i>
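For reference, the same snippet translated to the modern bs4 API (my sketch, not from the docs): new_tag() replaces the Tag constructor, and replace_with() replaces replaceWith():

from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Argh!<a>Foo</a></b><i>Blah!</i>", "html.parser")
tag = soup.new_tag("newTag", id=1)
tag.insert(0, "Hooray!")
soup.a.replace_with(tag)
print(soup)
# <b>Argh!<newTag id="1">Hooray!</newTag></b><i>Blah!</i>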

How can I iterate over specific elements in HTML file and replace them?

I need to do a seemingly simple thing in Python which turned out to be quite complex. What I need to do is:
Open an HTML file.
Match all instances of a specific HTML element, for example table.
For each instance, extract the element as a string, pass that string to an external command which will do some modifications, and finally replace the original element with a new string returned from the external command.
I can't simply do a re.sub(), because in each case the replacement string is different and based on the original string.
Any suggestions?
You could use Beautiful Soup to do this.
Although for what you need, something simpler like lxml.etree would work fine.
Sounds like you want BeautifulSoup. Likely, you'd want to do something like:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc)
tables = soup.find_all( 'table' )
for table in tables:
    contents = str(table)
    new_contents = transform(contents)
    table.replaceWith(new_contents)
Alternatively, you may be looking for something closer to soup.replace_with
EDIT: Updated to the eventual solution.
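One pitfall with the snippet above: replacing a tag with a plain string inserts escaped text, not markup. A sketch of a parse-then-replace variant (assumptions: transform() stands in for the external command from the question, html_doc for the file contents):

from bs4 import BeautifulSoup

html_doc = "<html><body><table><tr><td>1</td></tr></table></body></html>"

def transform(table_html):
    # placeholder for the external command that rewrites the element
    return table_html.replace("<td>", '<td class="x">')

soup = BeautifulSoup(html_doc, "html.parser")
for table in soup.find_all('table'):
    new_html = transform(str(table))
    # re-parse the returned string so real tags, not escaped text, go back in
    table.replace_with(BeautifulSoup(new_html, "html.parser"))
print(soup)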
I have found that parsing HTML via BeautifulSoup or any other such parser gets complex as you need to handle different pages with different structures, which are sometimes not well-formed, use JavaScript manipulation, and so on. The best solution in such cases is to access the browser's DOM directly and modify and query nodes. You can do that easily in a headless browser like PhantomJS.
For example, here is a PhantomJS script:
var page = require('webpage').create();
page.content = '<html><body><table><tr><td>1</td><td>2</td></tr></table></html>';
page.evaluate(function () {
    var elems = document.getElementsByTagName('td');
    for (var i = 0; i < elems.length; i++) {
        elems[i].innerHTML = '!' + elems[i].innerHTML + '!';
    }
});
console.log(page.content);
phantom.exit();
It changes the text of every td; the output is:
<html><head></head><body><table><tbody><tr><td>!1!</td><td>!2!</td></tr></tbody></table></body></html>

Getting html stripped of script and style tags with BeautifulSoup?

I have a simple script where I am fetching an HTML page, passing it to BeautifulSoup to remove all script and style tags, then I want to pass the HTML result to another method. Is there an easy way to do this? Skimming the BeautifulSoup.py, I haven't seen it yet.
soup = BeautifulSoup(html)
for script in soup("script"):
    soup.script.extract()
for style in soup("style"):
    soup.style.extract()
contents = soup.html.contents
text = loader.extract_text(contents)
soup.html.contents just gives me a list, with everything wrapped in BeautifulSoup's own classes. Is there a method that returns the raw HTML after soup manipulates it? Or do I need to walk the contents list and piece the HTML back together myself, excluding the script and style tags?
Or is there an even better solution to accomplish what I want?
unicode(soup) (or str(soup) on Python 3) gives you the HTML.
Also, what you want is this:
for elem in soup.findAll(['script', 'style']):
    elem.extract()
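Putting both answers together, a minimal end-to-end sketch (Python 3; loader.extract_text from the question is the asker's own helper, so plain str(soup) stands in for the hand-off):

from bs4 import BeautifulSoup

html = ('<html><head><style>p {color: red}</style></head>'
        '<body><p>Hello</p><script>alert(1)</script></body></html>')

soup = BeautifulSoup(html, "html.parser")
for elem in soup.find_all(['script', 'style']):
    elem.extract()          # removes the tag from the tree

clean_html = str(soup)      # the raw HTML after manipulation
print(clean_html)           # <html><head></head><body><p>Hello</p></body></html>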
