Parse multiple href within one parent using BeautifulSoup - python

I have one line in my program, using BeautifulSoup's find():
print(table.find('td','monsters'))
This is the output of the above line:
<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>
Now I want to parse all five hrefs, so that it would output something like this:
/m154
/m153
/m152
/m155
/m147
I have attempted to convert my print line into a for loop by changing find() to find_all(), and then retrieve the href by using .a['href'] within the for loop. However, no matter what I try, I always get only one entry instead of five. Any suggestions for retrieving multiple hrefs? Seeing that find_all() returns a list, would it make sense to call find_all() directly on the parent of the a tags?

Input:
page = """<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "html.parser") # your source page parsed as html
links = soup.find_all('a', href=True) # get all links having href attribute
for i in links:
    print(i['href'])
Result:
/m154
/m153
/m152
/m155
/m147

What you want to do is something like the following:
cell = table.find('td', 'monsters')
for a_tag in cell.find_all('a'):
    print(a_tag['href'])
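A minimal sketch of the same idea, collecting the hrefs into a list instead of printing them. The anchor markup inside each div is an assumption (the question's snippet does not show the a tags), and the second cell is hypothetical, added only to show that the search stays scoped to the monsters cell:

```python
from bs4 import BeautifulSoup

# Assumed markup: the <a> tags and the "other" cell are illustrative only.
html = """
<table><tr>
<td class="monsters">
<div class="mim mim-154"><a href="/m154"></a></div>
<div class="mim mim-153"><a href="/m153"></a></div>
</td>
<td class="other"><a href="/elsewhere"></a></td>
</tr></table>
"""

table = BeautifulSoup(html, "html.parser")
cell = table.find("td", "monsters")  # first <td> with class "monsters"

# find_all() is called on the cell, so links outside it are ignored;
# href=True skips any <a> without an href attribute.
hrefs = [a["href"] for a in cell.find_all("a", href=True)] if cell else []
print(hrefs)
```

Calling find() first and find_all() on its result is what keeps the loop from returning a single entry: find_all('td') would iterate over cells, and .a on each cell only ever yields the first link in it.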

Full code, similar to the answers above:
import bs4
HTML= """<html>
<table>
<tr>
<td class="monsters">
<div class="mim mim-154"></div>
<div class="mim mim-153"></div>
<div class="mim mim-152"></div>
<div class="mim mim-155"></div>
<div class="mim mim-147"></div>
</td>
</tr>
</table>
</html>
"""
table = bs4.BeautifulSoup(HTML, 'lxml')
anker = table.find('td', 'monsters').find_all('a')
for a in anker:
    print(a['href'])

Related

Scraping string from HTML with python3-beautifulsoup3

I'm trying to get strings from table rows using BeautifulSoup.
The strings I want are 'SANDAL' and 'SHORTS', from the second and third rows.
I know this can be solved with a regular expression or with string functions, but I want to learn BeautifulSoup and do as much as possible with it.
Clipped Python code
soup=BeautifulSoup(page,'html.parser')
table=soup.find('table')
row=table.find_next('tr')
row=row.find_next('tr')
HTML
<html>
<body>
<div id="body">
<div class="data">
<table id="products">
<tr><td>PRODUCT<td class="ole1">ID<td class="c1">TYPE<td class="ole1">WHEN<td class="ole4">ID<td class="ole4">ID</td></tr>
<tr><td>SANDAL<td class="ole1">77313<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878717</td></tr>
<tr><td>SHORTS<td class="ole1">77314<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878718</td></tr>
</table>
</div>
</div>
</body>
</html>
To get the text from the first column of the table (without the header), you can use this script:
from bs4 import BeautifulSoup
txt = '''
<html>
<body>
<div id="body">
<div class="data">
<table id="products">
<tr><td>PRODUCT<td class="ole1">ID<td class="c1">TYPE<td class="ole1">WHEN<td class="ole4">ID<td class="ole4">ID</td></tr>
<tr><td>SANDAL<td class="ole1">77313<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878717</td></tr>
<tr><td>SHORTS<td class="ole1">77314<td class="ole1">wear<td class="ole1">new<td class="ole4">id<td class="ole4">878718</td></tr>
</table>
</div>
</div>
</body>
</html>'''
soup = BeautifulSoup(txt, 'lxml') # <-- lxml is important here (to parse the HTML code correctly)
for tr in soup.find('table', id='products').find_all('tr')[1:]: # <-- [1:] because we want to skip the header
    print(tr.td.text) # <-- print contents of first <td> tag
Prints:
SANDAL
SHORTS
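The markup in the question leaves its td tags unclosed, which is why the answer above relies on lxml to repair the structure. As a rough sketch under a different assumption, namely that every cell is properly closed (the trimmed table below is hypothetical), the standard-library parser is enough and the names can be collected into a list:

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed variant of the table: every <td> is closed.
txt = """
<table id="products">
<tr><td>PRODUCT</td><td class="ole1">ID</td></tr>
<tr><td>SANDAL</td><td class="ole1">77313</td></tr>
<tr><td>SHORTS</td><td class="ole1">77314</td></tr>
</table>
"""

soup = BeautifulSoup(txt, "html.parser")
rows = soup.find("table", id="products").find_all("tr")

# rows[1:] skips the header row; tr.td is the first <td> of each row.
names = [tr.td.get_text(strip=True) for tr in rows[1:]]
print(names)
```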

is it possible to change parent of html element with python beautifulsoup

Let's assume I have a html like following:
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
I want to move all divs with the class answer-div into the previous question-div. Can I handle it with beautifulsoup?
You can also use insert():
from bs4 import BeautifulSoup
html="""
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
"""
soup=BeautifulSoup(html,'html.parser')
for div in soup.find_all('div',{"class":"answer-div"}):
    div.find_previous_sibling('div').insert(0,div)
print(soup)
Output
<div class="question-div"><div class="answer-div"></div></div>
<div class="question-div"><div class="answer-div"></div></div>
<div class="question-div"><div class="answer-div"></div></div>
I have no hands-on experience with BeautifulSoup, but I will give this one a shot.
The way I see it, you first find all the question and answer divs separately:
div_ques_Blocks = soup.find_all('div', class_="question-div")
div_ans_Blocks = soup.find_all('div', class_="answer-div")
and then loop through the answer-div blocks to find the question-div each one belongs to:
for divtag in div_ans_Blocks:
    print(divtag.find_previous_sibling('div'))
If the print statement above shows the question-div that precedes each answer-div, you can then try appending the answer-div to it instead of printing.
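The appending step suggested here might look like the following. This is only a sketch: it assumes each answer-div directly follows its question-div as a sibling, and it uses Tag.append(), which moves an element that is already in the tree rather than copying it.

```python
from bs4 import BeautifulSoup

html = """
<div class="question-div"></div>
<div class="answer-div"></div>
<div class="question-div"></div>
<div class="answer-div"></div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list, so the tree can be modified while looping.
for answer in soup.find_all("div", class_="answer-div"):
    question = answer.find_previous_sibling("div", class_="question-div")
    if question is not None:  # skip answers with no preceding question
        question.append(answer)  # moves the answer inside the question

moved = [str(q) for q in soup.find_all("div", class_="question-div")]
print(moved)
```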

How to get all hrefs (inside <a> tags) and assign them to a variable?

I need all the hrefs present in the 'a' tags, assigned to a variable.
I did this, but only got the first link:
soup_level1 = BeautifulSoup(driver.page_source, 'lxml')
userName = soup_level1.find(class_='_32mo')
link1 = (userName.get('href'))
And the output I get is
print(link1)
https://www.facebook.com/xxxxxx?ref=br_rs
But I need at least the top 3 or top 5 links.
The structure of the webpage is
<div>
<a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
I need those hrefs.
from bs4 import BeautifulSoup
html="""
<div>
<a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
"""
soup=BeautifulSoup(html,'lxml')
my_links = soup.find_all("a", {"class": "_32mo"})
for link in my_links:
    print(link.get('href'))
Output
https://www.facebook.com/xxxxx?ref=br_rs
https://www.facebook.com/yyyyy?ref=br_rs
https://www.facebook.com/zzzzz?ref=br_rs
To get top n links you can use
max_num_of_links=2
for link in my_links[:max_num_of_links]:
    print(link.get('href'))
Output
https://www.facebook.com/xxxxx?ref=br_rs
https://www.facebook.com/yyyyy?ref=br_rs
You can also save the top n links to a list
link_list=[]
max_num_of_links=2
for link in my_links[:max_num_of_links]:
    link_list.append(link.get('href'))
print(link_list)
Output
['https://www.facebook.com/xxxxx?ref=br_rs', 'https://www.facebook.com/yyyyy?ref=br_rs']
EDIT:
If you need the driver to get the links one by one
max_num_of_links=3
for link in my_links[:max_num_of_links]:
    driver.get(link.get('href'))
    # rest of your code ...
If for some reason you want them in separate variables such as link1, link2, etc.:
from bs4 import BeautifulSoup
html="""
<div>
<a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs">
</div>
<div>
<a class="_32mo" href="https://www.facebook.com/zzzzz?ref=br_rs">
</div>
"""
soup=BeautifulSoup(html,'lxml')
my_links = soup.findAll("a", {"class": "_32mo"})
link1=my_links[0].get('href')
link2=my_links[1].get('href')
link3=my_links[2].get('href')
# and so on, but be careful: indexing a link that is not there raises an IndexError
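One way to sidestep the IndexError risk is to keep the hrefs in a list and slice it, since a slice simply returns fewer items when the page has fewer links than requested. A small sketch, with closed a tags and only two links assumed for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative markup: two links only, so asking for five shows the behavior.
html = """
<div><a class="_32mo" href="https://www.facebook.com/xxxxx?ref=br_rs"></a></div>
<div><a class="_32mo" href="https://www.facebook.com/yyyyy?ref=br_rs"></a></div>
"""

soup = BeautifulSoup(html, "html.parser")

# href=True filters out any matching <a> that lacks an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", class_="_32mo", href=True)]

# Slicing past the end of a list is safe; it never raises IndexError.
top_links = hrefs[:5]
print(top_links)
```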

How to remove surplus tags from beautiful soup result

I want to get only the content in <p> tag and remove the surplus div tags.
My code is:
page = """
<p style="text-align: justify">content that I want
<div ><!-- /316485075/agk_116000_pos_3_sidebar_mobile -->
<div id="agk_116000_pos_3_sidebar_mobile">
<script>
script code
</script>
</div>
<div class="nopadding clearfix hidden-print">
<div align="center" class="col-md-12">
<!-- /316485075/agk_116000_pos_4_conteudo_desktop -->
<div id="agk_116000_pos_4_conteudo_desktop" style="height:90px; width:728px;">
<script>
script code
</script>
</div>
</div>
</div>
</div>
</p>
"""
soup = BeautifulSoup(page, 'html.parser')
p = soup.find_all('p', {'style' : 'text-align: justify'})
And I just want to get the string <p>content that I want</p> and remove all the divs
You can use the replace_with() function to remove a tag along with its contents.
soup = BeautifulSoup(html, 'html.parser') # html is HTML you've provided in question
soup.find('div').replace_with('')
print(soup)
Output:
<p style="text-align: justify">content that I want
</p>
Note: I'm using soup.find('div') here because all the unwanted tags are inside the first div tag. Hence, if you remove that tag, all the others are removed with it. But if you want to remove every tag other than the p tags in an HTML document that is not structured like this, you'll have to use this:
for tag in soup.find_all():
    if tag.name == 'p':
        continue
    tag.replace_with('')
Which is equivalent to:
[tag.replace_with('') for tag in soup.find_all(lambda t: t.name != 'p')]
If you simply want the "content that I want" text, you can use this:
print(soup.find('p').contents[0])
# content that I want
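As a variation on the approach above, here is a sketch that uses decompose() instead of replace_with(''). decompose() removes a tag and everything inside it from the tree. The page string is a shortened, hypothetical version of the one in the question, and html.parser is assumed because it keeps the div nested inside the p:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page in the question.
page = """
<p style="text-align: justify">content that I want
<div><!-- ad slot -->
<div id="ad"><script>script code</script></div>
</div>
</p>
"""

soup = BeautifulSoup(page, "html.parser")
p = soup.find("p", {"style": "text-align: justify"})

# recursive=False limits the search to direct children of <p>,
# so nested divs are removed together with their parent.
for div in p.find_all("div", recursive=False):
    div.decompose()

text = p.get_text(strip=True)
print(text)
```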
Alternatively, with a regular expression, capture group 2 contains your content:
<(.*?)(?:\s.+?>)(.*?)</\1[>]?
See https://regex101.com/r/m8DQic/1

Python Beautifulsoup Find_all except

I'm struggling to find a simple way to solve this problem and hope you might be able to help.
I've been using BeautifulSoup's find_all and trying some regex to find all the items except the 'emptyItem' line in the HTML below:
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 ">...</div>
<div class="product_item2 ">...</div>
<div class="product_item0 ">...</div>
<div class="product_item1 last">...</div>
<div class="product_item2 emptyItem">...</div>
Is there a simple way to find all the items except the one that includes 'emptyItem'?
Just skip elements containing the emptyItem class. Working sample:
from bs4 import BeautifulSoup
data = """
<div>
<div class="product_item0">test0</div>
<div class="product_item1">test1</div>
<div class="product_item2">test2</div>
<div class="product_item2 emptyItem">empty</div>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
for elm in soup.select("div[class^=product_item]"):
    if "emptyItem" in elm["class"]: # skip elements having emptyItem class
        continue
    print(elm.get_text())
Prints:
test0
test1
test2
Note that the div[class^=product_item] is a CSS selector that would match all div elements with a class starting with product_item.
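The exclusion can probably also be pushed into the selector itself with :not(). This is a sketch that assumes a BeautifulSoup version whose select() supports that pseudo-class (to my knowledge, 4.7 and later, where selectors are handled by the soupsieve package):

```python
from bs4 import BeautifulSoup

data = """
<div>
<div class="product_item0">test0</div>
<div class="product_item1">test1</div>
<div class="product_item2">test2</div>
<div class="product_item2 emptyItem">empty</div>
</div>
"""

soup = BeautifulSoup(data, "html.parser")

# Match divs whose class attribute starts with "product_item",
# excluding any that also carry the emptyItem class.
items = soup.select('div[class^="product_item"]:not(.emptyItem)')
texts = [elm.get_text() for elm in items]
print(texts)
```

Whether this reads better than the explicit loop is a matter of taste; the loop is easier to extend if more conditions are added later.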
