Python: I used BeautifulSoup, but the href value disappeared from the result

This is the HTML:
<div class="s_write">
<p style="text-align:left;"></P>
<div app_paragraph="Dc_App_Img_0" app_editorno="0">
<img src="http://dcimg6.dcinside.co.kr/viewimage.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f244141456484ebe788e4b1ac8601ef468abc7cad6754f440d9ddbfc0370c7" style="cursor:pointer;" onclick="javascript:imgPop('http://image.dcinside.com/viewimagePop.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f24452490c8b9fb90ae74e4d6a2435010d29956ad37f400586d9cb','image','fullscreen=yes,scrollbars=yes,resizable=no,menubar=no,toolbar=no,location=no,status=no');"></div>
I want to extract:
http://image.dcinside.com/viewimagePop.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f24452490c8b9fb90ae74e4d6a2435010d29956ad37f400586d9cb
So I programmed it like this:
for link in internal.find_all('div', class_="s_write"):
    print(link)
But the result is:
<div class="s_write"><p style="text-align:left;"><div app_editorno="0" app_paragraph="Dc_App_Img_0"><img src="http://dcimg6.dcinside.co.kr/viewimage.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f24452490c8a9dbf0ae34e4a6a20370e5a2d9633d5701c48dc23ac"/></p></div>
The URL I want is not in the result. What's the problem?

Issue with your code:
You are searching for the div, which returns only the div block. If you want the image URL, you need to search for the img tag.
The following is a rough example:
import bs4 as bs
markup = """<div class="s_write">
<p style="text-align:left;"></P>
<div app_paragraph="Dc_App_Img_0" app_editorno="0">
<img src="http://dcimg6.dcinside.co.kr/viewimage.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f244141456484ebe788e4b1ac8601ef468abc7cad6754f440d9ddbfc0370c7" style="cursor:pointer;" onclick="javascript:imgPop('http://image.dcinside.com/viewimagePop.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f24452490c8b9fb90ae74e4d6a2435010d29956ad37f400586d9cb','image','fullscreen=yes,scrollbars=yes,resizable=no,menubar=no,toolbar=no,location=no,status=no');"></div>"""
soup = bs.BeautifulSoup(markup, "html.parser")
imglink = soup.find_all("img")[0]
print(imglink.attrs["src"])
Output
http://dcimg6.dcinside.co.kr/viewimage.php?id=2fbcc323e7d334aa51b1d3a240&no=24b0d769e1d32ca73fef84fa11d028318f52c0eeb141bee560297996d466c894cf2d16427672bba3d66d67f244141456484ebe788e4b1ac8601ef468abc7cad6754f440d9ddbfc0370c7
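That prints the thumbnail src, but the popup URL the question asks for lives in the onclick attribute, not in src or href. A rough sketch (using a shortened stand-in URL in place of the long original) that pulls it out with a regular expression:

```python
import re

import bs4 as bs

# Stand-in markup: the real src/onclick values are much longer
markup = """<div class="s_write">
<img src="http://example/thumb.jpg" onclick="javascript:imgPop('http://image.dcinside.com/viewimagePop.php?id=abc&no=123','image','fullscreen=yes');">
</div>"""

soup = bs.BeautifulSoup(markup, "html.parser")
img = soup.find("img")
# The popup URL is the first quoted argument of imgPop(...)
match = re.search(r"imgPop\('([^']+)'", img.attrs["onclick"])
if match:
    print(match.group(1))  # http://image.dcinside.com/viewimagePop.php?id=abc&no=123
```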


Get hrefs from <a> Tags Located in Divs with Specific Classes Using BeautifulSoup

I need to get the hrefs from <a> tags on a website, but not all of them, only the ones that are in the spans located in the <div>s with class arm:
<html>
<body>
<div class="arm">
<span>
<a href="...">link</a>
<a href="...">link</a>
<a href="...">link</a>
</span>
</div>
<div class="arm">
<span>
<a href="...">link</a>
<a href="...">link</a>
<a href="...">link</a>
</span>
</div>
<div class="arm">
<span>
<a href="...">link</a>
<a href="...">link</a>
<a href="...">link</a>
</span>
</div>
<div class="footnote">
<span>
<a href="...">anotherLink</a>
<a href="...">anotherLink</a>
<a href="...">anotherLink</a>
</span>
</div>
</body>
</html>
import requests
from bs4 import BeautifulSoup as bs

request = requests.get("url")
html = bs(request.content, 'html.parser')
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    print("anchor['href']")
But my code doesn't print anything
Your code looks fine until you get to the print("anchor['href']") line, which I assume is meant to be print(anchor['href']).
Now, anchor is a ResultSet, which means you will need another loop to get the hrefs. Here is how those final lines should look if you want minimal modification to your code:
for arm in html.select(".arm"):
    anchor = arm.select("span > a")
    for x in anchor:
        print(x.attrs['href'])
We basically add:
for x in anchor:
    print(x.attrs['href'])
And you should get the hrefs. All the best.
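As a side note, the two select calls can also be collapsed into a single CSS selector. A sketch on a trimmed-down version of the markup, with made-up hrefs:

```python
from bs4 import BeautifulSoup

html = """<div class="arm"><span><a href="/a">link</a><a href="/b">link</a></span></div>
<div class="footnote"><span><a href="/c">anotherLink</a></span></div>"""

soup = BeautifulSoup(html, "html.parser")
# ".arm span > a" matches only anchors that are direct children of a span inside an .arm div
hrefs = [a['href'] for a in soup.select(".arm span > a")]
print(hrefs)  # ['/a', '/b']
```

The footnote div's anchors are skipped because they are not inside an element with class arm.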
Try using the find_all() method to obtain the values in specific tags and classes.
I have replicated your HTML file and obtained the values in the span tags. Please see my sample code below.
Replicated HTML file:
# Creating the HTML file
file_html = open("demo.html", "w")
# Adding the input data to the HTML file
file_html.write('''<html>
<body>
<div class="arm">
<span>
<a href="...">link</a>
<a href="...">link</a>
<a href="...">link</a>
</span>
</div>
<div class="arm">
<span>
<a href="...">link</a>
<a href="...">link</a>
<a href="...">link</a>
</span>
</div>
<div class="arm">
<span>
<a href="...">link</a>
<a href="...">link</a>
<a href="...">link</a>
</span>
</div>
<div class="footnote">
<span>
<a href="...">anotherLink</a>
<a href="...">anotherLink</a>
<a href="...">anotherLink</a>
</span>
</div>
</body>
</html>''')
# Saving the data into the HTML file
file_html.close()
Code:
import requests
from bs4 import BeautifulSoup as bs

# reading the replicated html file
demo = open("demo.html", "r")
results = bs(demo, 'html.parser')
# Using the find_all method to find specific tags and class
job_elements = results.find_all("div", class_="arm")
for job_element in job_elements:
    links = job_element.find_all("a")
    for link in links:
        print(link['href'])
Reference:
https://realpython.com/beautiful-soup-web-scraper-python/

Missing parts in Beautiful Soup results

I'm trying to retrieve the table in the ul tag in the following html code, which mostly looks like this:
<ul class='list' id='js_list'>
<li class="first">
<div class="meta">
<div class="avatar">...</div>
<div class="name">黑崎一护</div>
<div class="type">...</div>
</div>
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>
but just with more entries. It's from this website.
So far I have this (for specifically getting the win rates):
from bs4 import BeautifulSoup
import requests
r = requests.get("https://moba.163.com/m/wx/ss/")
soup = BeautifulSoup(r.content, 'html5lib')
win_rates = soup.find_all('div', class_ = "winrate")
But this returns nothing, and it seems like the farthest Beautiful Soup was able to get was the ul tag, with none of the information under it. Is this a parsing issue? Or is there JavaScript source code that I'm missing?
I think your issue is that your format is incorrect for pulling the div with the attribute. I was able to pull the winrate div with this:
soup.find('div', attrs={'class': 'winrate'})
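For what it's worth, that find call does work against static HTML. A quick sketch on the snippet from the question (but note: if the live page fills these values in with JavaScript, requests will only ever see the empty ul):

```python
from bs4 import BeautifulSoup

html = """<ul class='list' id='js_list'>
<li class="first">
<div class="rates">
<div class="winrate">56.11%</div>
<div class="pickrate">7.44%</div>
</div>
</li>
</ul>"""

soup = BeautifulSoup(html, 'html.parser')
# Both find styles work on static markup
win = soup.find('div', attrs={'class': 'winrate'})
print(win.text)  # 56.11%
```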

How to find the second div from HTML in Python BeautifulSoup

I'm trying to find the second div (container) with BeautifulSoup, but it shows nothing.
<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"></div> <!-- this is the div I try to select -->
My code shows nothing in the terminal:
header = soup.find_all('div', attrs={'class': 'container'})[1]
for text in header.find_all("p"):
    print(text)
driver.close()
Your code first finds all the container divs and picks the second one, which is what you are trying to select. You are then searching for <p> tags within it. Your example HTML, though, does not contain any.
The HTML would need to contain <p> tags for it to find anything, for example:
from bs4 import BeautifulSoup

html = """<div class="section-heading-page">
<div class="container">
</div>
</div>
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>"""
soup = BeautifulSoup(html, 'html.parser')
div_2 = soup.find_all('div', attrs={'class': 'container'})[1]
for p in div_2.find_all("p"):
    print(p.text)  # Display the text inside any p tag
This would display:
Hello 1
Hello 2
If you print(div_2) you would see that it contains:
<div class="container"><p>Hello 1</p><p>Hello 2</p></div>
If you are trying to display any text inside div_2 you could try:
print(div_2.text)
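For example, assuming you want all of the text in one string, on a snippet like the one above:

```python
from bs4 import BeautifulSoup

html = '<div class="container"><p>Hello 1</p><p>Hello 2</p></div>'
soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', attrs={'class': 'container'})
print(div.text)                     # Hello 1Hello 2
print(div.get_text(separator=' '))  # Hello 1 Hello 2
```

get_text(separator=...) puts a separator between the strings from each child tag, which plain .text does not.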

Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags

I've been googling through SO for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.
I'm trying to remove all the HTML tags from a snippet, but I don't want to use get_text() because there might be some other tags, like img, that I'd like to use later. BeautifulSoup doesn't quite behave as I expect it to:
from bs4 import BeautifulSoup
html = """
<div>
<div class="somewhat">
<div class="not quite">
</div>
<div class="here">
<blockquote>
<span>
<br />content<br />
</span>
</blockquote>
</div>
<div class="not here either">
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class": "somewhat"}):  # in all the "somewhat" divs
    for y in x.find_all('div', {"class": "here"}):  # find all the "here" divs
        for inp in y.find_all("blockquote"):  # in a "here" div find all blockquote tags for the relevant content
            for newlines in inp('br'):
                inp.br.replace_with("\n")  # replace br tags
            for link in inp('a'):
                inp.a.unwrap()  # unwrap all a tags
            for quote in inp('span'):
                inp.span.unwrap()  # unwrap all span tags
            for block in inp('blockquote'):
                inp.blockquote.unwrap()  # <----- should unwrap blockquote
            la_lista.append(inp)
print(la_lista)
The result is as follows:
[<blockquote>
content
</blockquote>]
Any ideas?
Each inp returned by y.find_all("blockquote") is a bs4.element.Tag, and calling inp('blockquote') only searches its descendants, so it can never match the blockquote tag itself.
The solution for you is to remove:
for block in inp('blockquote'):
    inp.blockquote.unwrap()
and replace:
la_lista.append(inp)
with:
la_lista.append(inp.decode_contents())
The answer is based on the following answer BeautifulSoup innerhtml
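A sketch of that fix on a trimmed-down version of the markup from the question:

```python
from bs4 import BeautifulSoup

html = """<div class="somewhat"><div class="here">
<blockquote><span><br />content<br /></span></blockquote>
</div></div>"""

soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class": "somewhat"}):
    for y in x.find_all('div', {"class": "here"}):
        for inp in y.find_all("blockquote"):
            for _ in inp('br'):
                inp.br.replace_with("\n")  # swap each br for a newline
            for _ in inp('span'):
                inp.span.unwrap()  # drop the span wrapper
            # decode_contents() serializes only the children, so the
            # enclosing <blockquote> tag itself is left out
            la_lista.append(inp.decode_contents())
print(la_lista)
```

The appended string contains just the content with newlines, with no blockquote or span tags around it.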

Beautiful Soup - How can I scrape images that contain a specific src attribute?

I've just started learning webscraping a few days ago and thought it would be fun to try scraping Mangadex as a mini project. Thank you for the advice in advance!
I'm trying to scrape images by extracting the src attribute of an img tag using Beautiful Soup 4 and Python 3.7
The HTML section I'm interested in is:
<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
<img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
Each image that I'm interested in contains a src attribute that begins with "https://s5.mangadex.org/data/" so I thought maybe I could target images that begin with that specific attribute.
I've tried using select() to find the img element and then using get() to find the src but didn't have any luck with that specific html section.
HTML sections that did work using select() and get() were:
<img class="mx-2" height="38px" src="/images/misc/navbar.svg?3" alt="MangaDex" title="MangaDex">
<img src="/images/misc/miku.jpg" width="100%">
<img class="mx-2" height="38px" src="/images/misc/navbar.svg?3" alt="MangaDex" title="MangaDex">
Try this:
from bs4 import BeautifulSoup
html = """
<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
<img draggable="false" class="noselect nodrag cursor-pointer" src="https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
<div class="reader-image-wrapper col-auto my-auto justify-content-center align-items-center noselect nodrag row no-gutters" data-state="2" data-page="1" style="order: 1;">
<img draggable="false" class="noselect nodrag cursor-pointer" src="https://s4.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg">
</div>
"""
soup = BeautifulSoup(html, "html.parser")
for n in soup.find_all('img'):
    if n.get('src').startswith('https://s5.mangadex.org/data/'):
        print(n.get('src'))
result:
https://s5.mangadex.org/data/554c97a14357f3972912e08817db4a03/x1.jpg
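Alternatively, BeautifulSoup's CSS support understands the ^= ("starts with") attribute selector, so the filtering can be done in the selector itself. The same idea sketched with shortened stand-in URLs:

```python
from bs4 import BeautifulSoup

html = """<div><img src="https://s5.mangadex.org/data/abc123/x1.jpg"></div>
<div><img src="https://s4.mangadex.org/data/abc123/x1.jpg"></div>"""

soup = BeautifulSoup(html, "html.parser")
# [src^="..."] keeps only the imgs whose src starts with the s5 prefix
matches = soup.select('img[src^="https://s5.mangadex.org/data/"]')
for img in matches:
    print(img['src'])  # https://s5.mangadex.org/data/abc123/x1.jpg
```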
attrs lists all the attributes set on that tag. It's a dictionary, so to get a specific attribute's value, see below.
# for getting webpages
import requests
# for beautiful soup
from bs4 import BeautifulSoup

r = requests.get(URL_LINK)
base_url = 'https://s5.mangadex.org/data/'
bs = BeautifulSoup(r.content, 'html.parser')
imgs = bs.findAll('img')
for img in imgs:
    src = img.attrs['src']
    if not src.startswith(base_url):
        src = base_url + src
    print(src)
You cannot scrape Mangadex with BeautifulSoup directly. Mangadex loads its images with JavaScript after the document is ready, so what you get with BeautifulSoup is the empty document; that is why it is failing. This website explains how to scrape web pages that rely on JavaScript to serve their content:
https://towardsdatascience.com/data-science-skills-web-scraping-javascript-using-python-97a29738353f
