Traversal Parsing of Text from .HTML - python

I am trying to scrape text from webpages contained in title, heading, or paragraph tags. When I run the code below I get mixed results depending on where the URL is from. For some sources (e.g. Wikipedia or Reuters) the code works more or less fine and at least finds all the text. For other sources (e.g. Politico, The Economist) I miss a lot of the text contained in the webpage.
I am using a traversal algorithm to walk through the tree and check whether each tag is 'of interest'. Maybe find_all(True, recursive=False) is for some reason missing children that subsequently contain the text I am looking for? I'm unsure how to investigate that. Or maybe some sites are blocking the scraping somehow? But then why can I scrape one paragraph from The Economist?
The code below replicates the issue for me: the Wikipedia page (urls[3]) should print as desired, while the Politico page (urls[0]) is missing all the text in the article and The Economist page (urls[1]) is missing all but one paragraph.
from bs4 import BeautifulSoup
import requests

urls = ["https://www.politico.com/news/2022/01/17/democrats-biden-clean-energy-527175",
        "https://www.economist.com/finance-and-economics/the-race-to-power-the-defi-ecosystem-is-on/21807229",
        "https://www.reuters.com/world/significant-damage-reported-tongas-main-island-after-volcanic-eruption-2022-01-17/",
        "https://en.wikipedia.org/wiki/World_War_II"]

# get soup
url = urls[0]  # first two urls don't work, last two do work
response = requests.get(url)
soup = BeautifulSoup(response.text, features="html.parser")

# tags with text that I want to print
tags_of_interest = ['p', 'title'] + ['h' + str(i) for i in range(1, 7)]

def read(soup):
    for tag in soup.find_all(True, recursive=False):
        if tag.name in tags_of_interest:
            print(tag.name + ": ", tag.text.strip())
        for child in tag.find_all(True, recursive=False):
            read(child)

# call the function
read(soup)

BeautifulSoup's find_all() returns the matching tags in the order of a DFT (depth-first traversal), as per this answer here, so a single flat find_all already gives easy access to the desired elements in document order.
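A minimal sketch of that flat approach, reusing tags_of_interest and one of the URLs from the question; find_all accepts a list of tag names, so no manual recursion is needed:
from bs4 import BeautifulSoup
import requests

tags_of_interest = ['p', 'title'] + ['h' + str(i) for i in range(1, 7)]

url = "https://en.wikipedia.org/wiki/World_War_II"
soup = BeautifulSoup(requests.get(url).text, features="html.parser")

# find_all with a list of names walks the whole tree depth-first
# and yields every matching tag in document order
for tag in soup.find_all(tags_of_interest):
    print(tag.name + ": ", tag.text.strip())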

Related

How to take hyperlinks from a wikipedia page

My current project requires obtaining the summaries of some Wikipedia pages. This is really easy to do, but I want to make a general script for it. More specifically, I also want to obtain the summaries of hyperlinks. For example, I want to get the summary of this page: https://en.wikipedia.org/wiki/Creative_industries (this is easy). Moreover, I would also like to get the summaries of the hyperlinks in the section Definitions -> 'Advertising', 'Marketing', 'Architecture', ..., 'visual arts'. My problem is that some of these hyperlinks have different page names. For example, the previously mentioned page has the hyperlink 'Software' (number 6), but I want the summary of the target page, which is 'Software Engineering'.
Can someone help me with that? I can find the summaries of the pages whose names match the hyperlink text, but that is not always the case. So basically I am looking for a way to restrict (page.links) to only one area of the page.
Thank you in advance
Try using BeautifulSoup; this will print all the links with the given prefix:
from bs4 import BeautifulSoup
import requests, re

# Don't forget to install/set up the 'lxml' parser package
url = "your link"
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')

# This will print every available link
for tag in tags:
    print(tag.get('href'))

# This will print only the links with the given prefix
for link in soup.find_all('a', attrs={'href': re.compile("^{{your prefix here}}")}):
    print(link.get('href'))
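That prints every matching link on the page. For the section-specific part of the question, here is a rough sketch that collects only the internal links appearing between the "Definitions" heading and the next top-level section heading. It assumes the heading carries an id anchor named Definitions and that the next section starts with an h2; Wikipedia's markup changes over time, so the stop condition may need adjusting:
from bs4 import BeautifulSoup
import requests

url = "https://en.wikipedia.org/wiki/Creative_industries"
soup = BeautifulSoup(requests.get(url).text, 'lxml')

section_links = []
heading = soup.find(id="Definitions")  # anchor of the section heading
if heading is not None:
    for element in heading.find_all_next():  # walk forward in document order
        if element.name == "h2":  # reached the next top-level section
            break
        if element.name == "a" and element.get("href", "").startswith("/wiki/"):
            section_links.append(element["href"])

print(section_links)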

How to get text which has no HTML tag | Add multiple delimiters in split

The following selects the div element with class ajaxcourseindentfix and splits its text on "Prerequisite: ", giving me all the content after the prerequisite:
div = soup.select("div.ajaxcourseindentfix")[0]
" ".join([word for word in div.stripped_strings]).split("Prerequisite: ")[-1]
My div can contain not only Prerequisite but also the following splitting points:
Prerequisites
Corequisite
Corequisites
Now, whenever the text contains Prerequisite, the above works fine, but whenever any of the other three appears, the split fails and gives me the whole text.
Is there a way to split on multiple delimiters? Or how else do I solve this?
Sample pages:
Corequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96106&show
Prerequisite URL: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show
Both: http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=98590&show
[Old Thread] - How to get text which has no HTML tag
This code solves your problem unless you specifically need XPath. I would also suggest that you review the BeautifulSoup documentation on the methods I've used; you can find that HERE.
.next_element and .next_sibling can be very useful in these cases.
With .next_elements we get a generator, which we either have to convert to a list or consume as a generator.
from bs4 import BeautifulSoup
import requests
url = 'http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=96564&show'
makereq = requests.get(url).text
soup = BeautifulSoup(makereq, 'lxml')
whole = soup.find('td', {'class': 'custompad_10'})
# we select the whole table (td), not needed in this case
thedivs = whole.find_all('div')
# list of all divs and elements within them
title_h3 = thedivs[2]
# select the third div (index 2) and save it in a variable
mytitle = title_h3.h3
# using .h3 we can traverse (go to the child <h3> element)
mylist = list(mytitle.next_elements)
# title_h3.h3 is still part of the tree; we save all the following elements
the_text = mylist[3]
# we can then select specific elements
# from a generator that we've converted into a list (i.e. list(...))
prequisite = mylist[6]
which_cpsc = mylist[8]
other_text = mylist[11]
print(the_text, ' is the text')
print(which_cpsc, other_text, ' is the cpsc and othertext ')
# this is for testing purposes
This solves both issues: we don't have to use CSS selectors or those awkward list manipulations. Everything is organic and works well.
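For the multiple-delimiter part of the question, a minimal alternative sketch (not part of the answer above) is to split the div's text with a single regex that matches any of the variants; this assumes the catalog pages spell them as Prerequisite(s)/Corequisite(s) followed by a colon:
import re
import requests
from bs4 import BeautifulSoup

url = 'http://catalog.fullerton.edu/ajax/preview_course.php?catoid=16&coid=98590&show'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

div = soup.select("div.ajaxcourseindentfix")[0]
text = " ".join(div.stripped_strings)

# one pattern covers Prerequisite, Prerequisites, Corequisite and Corequisites
parts = re.split(r"(?:Prerequisites?|Corequisites?):\s*", text)
print(parts[-1])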

Parsing HTML with BeautifulSoup and no Classes (just paragraphs)

I'm trying to parse 'https://projecteuler.net/problem=8' for the middle bit with the number. Since it doesn't have a separate class to select it by, I have used
r = requests.get('https://projecteuler.net/problem=8')
data = r.text
soup = BeautifulSoup(data, "lxml")
[para1, para2, para3] = (soup.find_all('p'))
To separate the paragraphs, but this leaves a lot of extra junk (<p> and <br>) in there. Is there a command to clear all that out? Is there a better way to do the splitting than what I am currently using? I've never really done much web crawling in Python...
soup.find_all returns a list of HTML nodes, tags included; if you want to extract the text from a node, you can just call .text on it. Applying this to para2 gives:
para2.text.split()
#['73167176531330624919225119674426574742355349194934',
# '96983520312774506326239578318016984801869478851843',
# '85861560789112949495459501737958331952853208805511',
# '12540698747158523863050715693290963295227443043557',
# ...
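If the end goal is the full 1000-digit number as one string, the split pieces can simply be joined back together (a small follow-up note, not part of the original answer):
# join the chunks into a single digit string, dropping all whitespace and tags
number = "".join(para2.text.split())
print(len(number))   # expected to be 1000 for this problem
print(number[:50])   # first 50 digits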

How does table parsing work in python? Is there an easy way other that beautiful soup?

I am trying to understand how one can use beautiful soup to extract the href links for the contents under a particular column in a table on a webpage. For example consider the link: http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015.
On this page the table with class wikitable has a column title, I need to extract the href links that are behind each of the values under the column title and put them in an excel sheet. What would be the best way to do this? I am having a little difficulty in understanding the beautiful soup table parsing documentation.
You don't really have to navigate the tree literally; you can simply look at what identifies those lines.
In this example, the URLs you are looking for reside in a table with class="wikitable", and within that table they reside in td tags with align=center. That gives us a reasonably unique identification for our links, so we can start extracting them.
However, you should take into account that multiple tables with class="wikitable" and td tags with align=center may exist; if you only want the first or second table, you will have to add extra filters.
The code should look something like this for extracting all links from those tables:
import urllib2
from bs4 import BeautifulSoup, SoupStrainer

content = urllib2.urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, parse_only=filter_tag)
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
There's one more thing to note here: the use of SoupStrainer. It specifies a filter so that only the content you want to process is parsed, which helps speed things up. Try removing the parse_only argument on this line:
soup = BeautifulSoup(content, parse_only=filter_tag)
and notice the difference. (I noticed it because my PC is not that powerful.)
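The answer above is written for Python 2 (urllib2 does not exist in Python 3). A rough Python 3 equivalent, assuming the requests package used elsewhere on this page, would look like this:
import requests
from bs4 import BeautifulSoup, SoupStrainer

content = requests.get("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").text
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, "html.parser", parse_only=filter_tag)

links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
print(links)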

Python BeautifulSoup - Add Tags around found keyword

I am currently working on a project in which I want to allow regex search in/on a huge set of HTML files.
After first pinpointing the files of my interest I now want to highlight the found keyword!
Using BeautifulSoup I can determine the node in which my keyword is found. One thing I already do is change the color of the whole parent.
However, I would also like to add my own <span> tags around just the keyword(s) I found.
Determining the position and such is no big deal using the find() functions provided by BeautifulSoup. But adding my tags around regular text seems to be impossible?
# match = keyword found by another regex
# node = the node I found using the soup.find(text=myRE)
node.parent.setString(node.replace(match, "<myspan>"+match+"</myspan>"))
This way I only add plain text and not a proper Tag, since the document is not freshly parsed (which I hope to avoid anyway)!
I hope my problem became a little clear :)
Here's a simple example showing one way to do it:
import re
from bs4 import BeautifulSoup as Soup
html = '''
<html><body><p>This is a paragraph</p></body></html>
'''
(1) store the text and empty the tag
soup = Soup(html)
text = soup.p.string
soup.p.clear()
print soup
(2) get the start and end positions of the words to be made bold
match = re.search(r'\ba\b', text)
start, end = match.start(), match.end()
(3) split the text and add the first part
soup.p.append(text[:start])
print soup
(4) create a tag, add the relevant text to it and append it to the parent
b = soup.new_tag('b')
b.append(text[start:end])
soup.p.append(b)
print soup
(5) append the rest of the text
soup.p.append(text[end:])
print soup
here is the output from above:
<html><body><p></p></body></html>
<html><body><p>This is </p></body></html>
<html><body><p>This is <b>a</b></p></body></html>
<html><body><p>This is <b>a</b> paragraph</p></body></html>
If you add the text...
my_tag = node.parent.setString(node.replace(match, "<myspan>"+match+"</myspan>"))
...and pass it through BeautifulSoup once more
new_soup = BeautifulSoup(my_tag)
it should be re-parsed, and the new tag will then be a proper BS Tag object available for further parsing.
You could apply these changes to the original mass of text and run it through as a whole, to avoid repetition.
EDIT:
From the docs:
# Here is a more complex example that replaces one tag with another:
from BeautifulSoup import BeautifulSoup, Tag
soup = BeautifulSoup("<b>Argh!<a>Foo</a></b><i>Blah!</i>")
tag = Tag(soup, "newTag", [("id", 1)])
tag.insert(0, "Hooray!")
soup.a.replaceWith(tag)
print soup
# <b>Argh!<newTag id="1">Hooray!</newTag></b><i>Blah!</i>
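As a side note, newer bs4 versions can do the keyword wrapping without re-parsing the whole document. A minimal sketch under the question's assumptions (the <myspan> tag and the keyword 'a' are just placeholders taken from the examples above):
import re
from bs4 import BeautifulSoup, NavigableString

html = '<html><body><p>This is a paragraph</p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find the text node containing the keyword and locate the match inside it
node = soup.find(string=re.compile(r'\ba\b'))
match = re.search(r'\ba\b', node)

# build the wrapping tag and the plain-text pieces around it
span = soup.new_tag('myspan')
span.string = match.group(0)
before = NavigableString(node[:match.start()])
after = NavigableString(node[match.end():])

# swap the original text node for the three new pieces
node.replace_with(before)
before.insert_after(span)
span.insert_after(after)

print(soup)
# <html><body><p>This is <myspan>a</myspan> paragraph</p></body></html>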
