I'm trying to get strings from table rows using BeautifulSoup.
The strings I want to get are 'SANDAL' and 'SHORTS', from the second and third rows.
I know this can be solved with regular expressions or with string functions, but I want to learn BeautifulSoup and do as much as possible with it.
Clipped Python code:
soup = BeautifulSoup(page, 'html.parser')
table = soup.find('table')
row = table.find_next('tr')
row = row.find_next('tr')
HTML:
<html>
<body>
<div id="body">
<div class="data">
<table id="products">
<tr><td>PRODUCT<td class="ole1">ID<td class="c1">TYPE<td class="ole1">WHEN<td class="ole4">ID<td class="ole4">ID</td></tr>
<tr><td>SANDAL<td class="ole1">77313<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878717</td></tr>
<tr><td>SHORTS<td class="ole1">77314<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878718</td></tr>
</table>
</div>
</div>
</body>
</html>
To get the text from the first column of the table (sans the header), you can use this script:
from bs4 import BeautifulSoup
txt = '''
<html>
<body>
<div id="body">
<div class="data">
<table id="products">
<tr><td>PRODUCT<td class="ole1">ID<td class="c1">TYPE<td class="ole1">WHEN<td class="ole4">ID<td class="ole4">ID</td></tr>
<tr><td>SANDAL<td class="ole1">77313<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878717</td></tr>
<tr><td>SHORTS<td class="ole1">77314<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878718</td></tr>
</table>
</div>
</div>
</body>
</html>'''
soup = BeautifulSoup(txt, 'lxml') # <-- lxml is important here (to parse the HTML code correctly)

for tr in soup.find('table', id='products').find_all('tr')[1:]: # <-- [1:] because we want to skip the header
    print(tr.td.text) # <-- print contents of first <td> tag
Prints:
SANDAL
SHORTS
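For comparison, the same first-column text can also be pulled with a CSS selector; this is just a minimal sketch, assuming the same txt markup as above:
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'lxml')
cells = soup.select('table#products tr > td:first-child')  # first <td> of every row
names = [td.get_text(strip=True) for td in cells[1:]]       # [1:] skips the header row
print(names)  # ['SANDAL', 'SHORTS']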
Related
I need to get hrefs from <a> tags on a website, but not all of them, only the ones that are in the spans located in the <div>s with class arm.
<html>
<body>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="footnote">
<span>
anotherLink
anotherLink
anotherLink
</span>
</div>
</body>
</html>
import requests
from bs4 import BeautifulSoup as bs
request = requests.get("url")
html = bs(request.content, 'html.parser')
for arm in html.select(".arm"):
anchor = arm.select("span > a")
print("anchor['href']")
But my code doesn't print anything
Your code looks fine until you get to the print("anchor['href']") line, which I assume is meant to be print(anchor['href']).
Now, anchor is a ResultSet, which means you will need another loop to get the hrefs. Here is how those final lines should look if you want minimal modification to your code:
for arm in soup.select(".arm"):
anchor = arm.select("span > a")
for x in anchor:
print(x.attrs['href'])
We basically add:
for x in anchor:
    print(x.attrs['href'])
And you should get the hrefs. All the best.
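Putting it together, the whole corrected script would look roughly like this (keeping the placeholder "url" from the question):
import requests
from bs4 import BeautifulSoup as bs

request = requests.get("url")               # "url" is the placeholder from the question
html = bs(request.content, 'html.parser')

for arm in html.select(".arm"):             # every <div class="arm">
    for anchor in arm.select("span > a"):   # every <a> directly inside its <span>
        print(anchor['href'])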
Try using the find_all() method to obtain the values in specific tags and classes.
I have replicated your HTML file and obtained the values from the span tags. Please see my sample code below.
Replicated HTML file:
# Creating the HTML file
file_html = open("demo.html", "w")
# Adding the input data to the HTML file
file_html.write('''<html>
<body>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="arm">
<span>
link
link
link
</span>
</div>
<div class="footnote">
<span>
anotherLink
anotherLink
anotherLink
</span>
</div>
</body>
</html>''')
# Saving the data into the HTML file
file_html.close()
code:
import requests
from bs4 import BeautifulSoup as bs
# reading the replicated HTML file
demo = open("demo.html", "r")
results = bs(demo, 'html.parser')

# using the find_all() method to find specific tags and classes
job_elements = results.find_all("div", class_="arm")

for job_element in job_elements:
    links = job_element.find_all("a")
    for link in links:
        print(link['href'])
Reference: https://realpython.com/beautiful-soup-web-scraper-python/
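The same filtering can also be written more compactly with a CSS selector; a quick sketch against the replicated demo.html above:
from bs4 import BeautifulSoup

with open("demo.html") as demo:
    soup = BeautifulSoup(demo, 'html.parser')

# "div.arm span a" only matches anchors inside a <span> within a <div class="arm">,
# so the footnote links are skipped
for link in soup.select("div.arm span a"):
    print(link.get("href"))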
I want to get only the content in the <p> tag and remove the surplus div tags.
My code is:
page = """
<p style="text-align: justify">content that I want
<div ><!-- /316485075/agk_116000_pos_3_sidebar_mobile -->
<div id="agk_116000_pos_3_sidebar_mobile">
<script>
script code
</script>
</div>
<div class="nopadding clearfix hidden-print">
<div align="center" class="col-md-12">
<!-- /316485075/agk_116000_pos_4_conteudo_desktop -->
<div id="agk_116000_pos_4_conteudo_desktop" style="height:90px; width:728px;">
<script>
script code
</script>
</div>
</div>
</div>
</div>
</p>
"""
soup = BeautifulSoup(page, 'html.parser')
p = soup.find_all('p', {'style' : 'text-align: justify'})
I just want to get the string <p>content that I want</p> and remove all the divs.
You can use the replace_with() function to remove the tags along with their contents.
soup = BeautifulSoup(html, 'html.parser') # html is HTML you've provided in question
soup.find('div').replace_with('')
print(soup)
Output:
<p style="text-align: justify">content that I want
</p>
Note: I'm using soup.find('div') here as all the unwanted tags are inside the first div tag. Hence, if you remove that tag, all the others will be removed as well. But if you want to remove all the tags other than the p tags in an HTML document where the structure is not like this, you'll have to use this:
for tag in soup.find_all():
    if tag.name == 'p':
        continue
    tag.replace_with('')
Which is equivalent to:
[tag.replace_with('') for tag in soup.find_all(lambda t: t.name != 'p')]
If you simply want the content that I want text, you can use this:
print(soup.find('p').contents[0])
# content that I want
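If you would rather destroy the unwanted tags outright instead of replacing them, decompose() does the same job; a short sketch against the same page string from the question:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page, 'html.parser')
soup.find('div').decompose()  # destroys the outer <div> and everything nested inside it
print(soup.find('p', {'style': 'text-align: justify'}))  # the <p> with only its text left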
Capture group 2 contains your content: <(.*?)(?:\s.+?>)(.*?)</\1[>]?
See https://regex101.com/r/m8DQic/1
I have one line in my program, using BeautifulSoup's find():
print(table.find('td','monsters'))
This is the output of the above line:
<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>
Now I want to parse all five hrefs, so that it would output something like this:
/m154
/m153
/m152
/m155
/m147
I have attempted to convert my print line into a for loop by changing find() to find_all(), and then retrieve the href by using .a['href'] within the for loop. However, no matter what I try, I always get only one entry instead of five. Any suggestions for retrieving multiple hrefs? Seeing that find_all() returns an array, would it make sense to call find_all() directly on the parent of the <a> tags?
Input:
page = """<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "html.parser") # your source page parsed as html
links = soup.find_all('a', href=True) # get all links having href attribute
for i in links:
    print(i['href'])
Result:
/m154
/m153
/m152
/m155
/m147
What you want to do is something like the following:
cell = table.find('td', 'monsters')
for a_tag in cell.find_all('a'):
    print(a_tag['href'])
Full Code, similar to posts above
import bs4
HTML= """<html>
<table>
<tr>
<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>
</tr>
</table>
</html>
"""
table = bs4.BeautifulSoup(HTML, 'lxml')
anker = table.find('td', 'monsters').find_all('a')  # all <a> tags inside the monsters cell
for a in anker:
    print(a['href'])
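For completeness, the same five hrefs can also be pulled with a single CSS selector; a sketch reusing the table soup from the snippet above:
# td.monsters a matches every <a> anywhere inside the <td class="monsters">
for a in table.select('td.monsters a'):
    print(a['href'])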
I have some html files, each of which contains
<td id="MenuTD" style="vertical-align: top;">
...
</td>
where ... can contain anything, and the closing </td> matches the opening <td id="MenuTD" style="vertical-align: top;">. I would like to remove this part from the HTML files.
Similarly, I may also want to remove some other tags in the files.
How shall I program that in Python?
I am looking at the HTMLParser module in Python 2.7, but haven't figured out whether it can help.
You can accomplish this using BeautifulSoup. You have two options, depending on what you want to do with the element you're removing.
Set up:
from bs4 import BeautifulSoup
html_doc = """
<html>
<header>
<title>A test</title>
</header>
<body>
<table>
<tr>
<td id="MenuTD" style="vertical-align: top;">
Stuff here <a>with a link</a>
<p>Or paragraph tags</p>
<div>Or a DIV</div>
</td>
<td>Another TD element, without the MenuTD id</td>
</tr>
</table>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
Option 1 is to use the extract() method. Using this, you will retain a copy of your extracted element so that you can utilize it later in your application:
Code:
menu_td = soup.find(id="MenuTD").extract()
At this point, the element you are removing has been saved to the menu_td variable. Do what you want with that. Your HTML in the soup variable no longer contains your element though:
print(soup.prettify())
Outputs:
<html>
<header>
<title>
A test
</title>
</header>
<body>
<table>
<tr>
<td>
Another TD element, without the MenuTD id
</td>
</tr>
</table>
</body>
</html>
Everything in the MenuTD element has been removed. You can see it is still in the menu_td variable though:
print(menu_td.prettify())
Outputs:
<td id="MenuTD" style="vertical-align: top;">
Stuff here
<a>
with a link
</a>
<p>
Or paragraph tags
</p>
<div>
Or a DIV
</div>
</td>
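If you want to hold on to the extracted menu, for example to write it out to its own file, a quick sketch (the file name is just a placeholder):
# write the extracted element to a separate file (hypothetical file name)
with open('menu.html', 'w', encoding='utf-8') as f:
    f.write(menu_td.prettify())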
Option 2: Utilize .decompose(). If you do not need a copy of the removed element, you can utilize this function to remove it from the document and destroy the contents.
Code:
soup.find(id="MenuTD").decompose()
It doesn't return anything (unlike .extract()). It does, however, remove the element from your document:
print(soup.prettify())
Outputs:
<html>
<header>
<title>
A test
</title>
</header>
<body>
<table>
<tr>
<td>
Another TD element, without the MenuTD id
</td>
</tr>
</table>
</body>
</html>
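Since the question mentions several HTML files and possibly other tags as well, here is a rough Python 3 sketch of that workflow; the helper name, file names and the extra selector are placeholders, not taken from the original post:
from bs4 import BeautifulSoup

def strip_tags(path, selectors=('td#MenuTD',)):
    """Remove every element matching the given CSS selectors from one HTML file."""
    with open(path, encoding='utf-8') as f:
        soup = BeautifulSoup(f, 'html.parser')
    for selector in selectors:
        for tag in soup.select(selector):
            tag.decompose()  # destroy the matched element and its contents
    with open(path, 'w', encoding='utf-8') as f:
        f.write(str(soup))

# e.g. strip_tags('page1.html', selectors=('td#MenuTD', 'div.footer'))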
Given the following code:
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>
How do I extract the word test from <div class="category5"> test using BeautifulSoup, i.e. how do I deal with nested divs? I tried to look this up on the Internet but didn't find any easy-to-grasp example, so I set up this one. Thanks.
XPath would be the straightforward answer; however, it is not supported in BeautifulSoup.
Updated: with a BeautifulSoup solution
To do so, given that you know the class and the element (div) in this case, you can use a for loop with attrs to get what you want:
from bs4 import BeautifulSoup
html = '''
<html>
<body>
<div class="category1" id="foo">
<div class="category2" id="bar">
<div class="category3">
</div>
<div class="category4">
<div class="category5"> test
</div>
</div>
</div>
</div>
</body>
</html>'''
content = BeautifulSoup(html, 'html.parser')

for div in content.find_all('div', attrs={'class': 'category5'}):
    print(div.text)
test
I have no problem extracting the text from your HTML sample; as @MartijnPieters suggested, you will need to find out why your div element is missing.
Another update
Since you're missing lxml as a parser for BeautifulSoup, None was returned because nothing was parsed to begin with. Installing lxml should solve your issue.
You may also consider using lxml or similar, which supports XPath; dead easy if you ask me.
from lxml import etree
tree = etree.fromstring(html) # or etree.parse from source
tree.xpath('.//div[@class="category5"]/text()')
[' test\n ']
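For completeness, current BeautifulSoup releases also accept CSS selectors (via soupsieve), so the same extraction works without XPath; a short sketch against the same html string:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('div.category5').get_text(strip=True))  # -> test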