I'm trying to extract Some_Product_Title from this block of HTML code
<div id="titleSection" class="a-section a-spacing-none">
<h1 id="title" class="a-size-large a-spacing-none">
<span id="productTitle" class="a-size-large">
Some_Product_Title
</span>
The lines below are working fine
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
But the code below is not
title = soup.find_all(id="productTitle")
When I try print(title), I get None as the console output.
Does anyone know how to fix this?
You're probably having trouble with .find() because the site you are creating the soup from, in all likelihood, generates its HTML via JavaScript.
If this is the case, to find an element by id, you could try the following:
soup1 = BeautifulSoup(page.content, "html.parser")
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
title = soup2.find(id="productTitle")
BS4 has CSS selectors built in, so you can use:
soup.select('#productTitle')
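For example, running that selector against the HTML block from the question (a minimal, self-contained sketch; the markup is copied from above):
from bs4 import BeautifulSoup
html = '''<div id="titleSection" class="a-section a-spacing-none">
<h1 id="title" class="a-size-large a-spacing-none">
<span id="productTitle" class="a-size-large">
Some_Product_Title
</span>
</h1>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# select() returns a list of matches; take the first and strip the whitespace
print(soup.select('#productTitle')[0].get_text(strip=True))  # Some_Product_Title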
This would also work:
title = soup.find_all("span", {"id": "productTitle"})
import requests
from bs4 import BeautifulSoup
URL = 'https://your-own.address/some-thing'
headers = {'User-Agent': 'Mozilla/5.0'}  # any realistic User-Agent header; supply your own
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find_all(attrs={"id": "productTitle"})
print(*title)
Related
HTML
<div class="QRiHXd">
"Some very secret link" <<< This is the content I want to print out / btw is a link
</div>
Code
import requests
import urllib.request
import bs4
url = 'https://www.reddit.com/' # There is actually another link
url_contents = urllib.request.urlopen(url).read()
soup = bs4.BeautifulSoup(url_contents, "html.parser")
div = soup.find('div', {'class_': 'QRiHXd'})
content = str(div)
print(content)
I need to print the text that's inside the div, but when I try to print it, it returns "None" and I do not know why.
To get the text from a tag you can use its .text attribute. Also note that when you pass attributes as a dict, the key must be "class", not "class_" (the underscore form is only for the keyword argument), so the {'class_': 'QRiHXd'} filter above matches nothing and find() returns None:
from bs4 import BeautifulSoup
html_doc = """<div class="QRiHXd">Some very secret link</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.find('div', {'class': 'QRiHXd'})
print(div.text) # Some very secret link
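Equivalently, the keyword-argument form avoids the dict entirely (same html_doc as above):
div = soup.find('div', class_='QRiHXd')
print(div.text)  # Some very secret link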
Current code:
import bs4
import requests
url = 'hidden'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
bs4_content = soup.find_all(class_='user-post-count')
print(bs4_content)
The only content I manage to get is:
[<p class="user-post-count">This user has made <strong>5 posts</strong>
</p>]
I'm trying to only get the content between the strong tags.
Thank you, all help much appreciated
You can use an inner .find_all. Note that the outer find_all returns a list, so iterate over it before searching inside each element:
import bs4
import requests
url = 'hidden'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
bs4_content = soup.find_all(class_='user-post-count')
for p in bs4_content:
    for strong in p.find_all('strong'):
        print(strong.text)
Try using a CSS Selector .user-post-count strong, which selects the <strong> tags under the user-post-count class.
from bs4 import BeautifulSoup
html = '''<p class="user-post-count">This user has made <strong>5 posts</strong>
</p>
'''
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select('.user-post-count strong'):
    print(tag.text)
Output:
5 posts
I want to extract the text here
a lot of text
I used
url = ('https://osu.ppy.sh/users/1521445')
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
mestuff = soup.find("div", {"class":"bbcode bbcode--profile-page"})
but it always returns "None" in the terminal.
How can I go about this?
Link is "https://osu.ppy.sh/users/1521445"
(This is a repost since the old question was very old; I wasn't sure whether I should have made another question.)
The data is dynamically loaded from a script tag so, as in the other answer, you can grab it from that tag. Target the tag by its id, pull out the relevant JSON, then extract the HTML from that JSON, and finally parse that HTML, which is what would have been loaded dynamically on the page (at this point you can use your original class selector).
import requests, json, pprint
from bs4 import BeautifulSoup as bs
r = requests.get('https://osu.ppy.sh/users/1521445')
soup = bs(r.content, 'lxml')
# the page data is stored as JSON inside <script id="json-user">
all_data = json.loads(soup.select_one('#json-user').text)
# the profile-page HTML lives as a string inside that JSON
soup = bs(all_data['page']['html'], 'lxml')
pprint.pprint(soup.select_one('.bbcode--profile-page').get_text('\n'))
You could try this:
import re
import requests
from bs4 import BeautifulSoup
url = 'https://osu.ppy.sh/users/1521445'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
x = soup.findAll("script", {"id": re.compile(r"json-user")})
result = re.findall(r'raw":(.+)},"previous_usernames', x[0].text.strip())
print(result)
I'm not sure why the div with class='bbcode bbcode--profile-page' is stored as a string inside the script tag with id='json-user'; that's why you can't get its value by searching for the div with class='bbcode bbcode--profile-page' directly.
Hope this helps.
I have an HTML page with multiple divs like:
<div class="post-info-wrap">
<h2 class="post-title">sample post – example 1 post</h2>
<div class="post-meta clearfix">
<div class="post-info-wrap">
<h2 class="post-title">sample post – example 2 post</h2>
<div class="post-meta clearfix">
and I need to get the value for all the divs with class post-info-wrap. I am new to BeautifulSoup,
so I need these urls:
https://www.example.com/blog/111/this-is-1st-post/
https://www.example.com/blog/111/this-is-2nd-post/
and so on...
I have tried:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.example.com/blog/author/abc")
data = r.content # Content of response
soup = BeautifulSoup(data, "html.parser")
for link in soup.select('.post-info-wrap'):
    print(link.find('a').attrs['href'])
This code doesn't seem to be working. I am new to BeautifulSoup. How can I extract the links?
link = i.find('a', href=True) does not always return an anchor tag; it may return None. So you need to check whether link is None and, if it is, continue the loop; otherwise read the link's href value.
Scrape the links by URL:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.example.com/blog/author/abc")
data = r.content # Content of response
soup = BeautifulSoup(data, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Scrape the links from HTML:
from bs4 import BeautifulSoup
html = '''<div class="post-info-wrap"><h2 class="post-title">sample post – example 1 post</h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title">sample post – example 2 post</h2><div class="post-meta clearfix">'''
soup = BeautifulSoup(html, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Update:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://www.example.com/blog/author/abc")
soup = BeautifulSoup(driver.page_source, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Output:
https://www.example.com/blog/911/article-1/
https://www.example.com/blog/911/article-2/
https://www.example.com/blog/911/article-3/
https://www.example.com/blog/911/article-4/
https://www.example.com/blog/random-blog/article-5/
For the Chrome browser, download the webdriver here:
http://chromedriver.chromium.org/downloads
Installing the webdriver for the Chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
Here '/usr/bin/chromedriver' is the path to the Chrome webdriver.
You can use soup.find_all:
from bs4 import BeautifulSoup as soup
r = [i.a['href'] for i in soup(html, 'html.parser').find_all('div', {'class':'post-info-wrap'})]
Output:
['https://www.example.com/blog/111/this-is-1st-post/', 'https://www.example.com/blog/111/this-is-2nd-post/']
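If some of the divs might not contain an anchor, i.a['href'] will raise an AttributeError; here is a guarded variant of the same comprehension (a sketch, assuming Python 3.8+ for the walrus operator):
r = [a['href']
     for i in soup(html, 'html.parser').find_all('div', {'class': 'post-info-wrap'})
     if (a := i.find('a', href=True))]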
I'm currently trying to scrape some data from a website using BS4 under Python 3.6.4, but the value returned is not what I am expecting:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.content
soup = BeautifulSoup(page, "html5lib")
price = soup.find("div", {"class" : "fieldPrice sizeC"}).text
print(price)
I should get "39 900 €" but the code returns "47 880 â¬".
NB: Even without JS, the data should be "39 900 €".
Thanks for your help!
The encoding declaration is wrong on this page so BeautifulSoup gets told to use the wrong encoding. You can force it to use the correct encoding like this:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.content
soup = BeautifulSoup(page.decode('utf-8','ignore'), "html5lib")
price = soup.find("div", {"class": "fieldPrice sizeC"}).text
print(price)
Outputs:
49 070 €
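Alternatively, you can override the encoding on the response object before reading it; requests exposes this via the response's encoding attribute (a sketch of the same request as above):
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
request.encoding = 'utf-8'  # override the incorrectly declared encoding
soup = BeautifulSoup(request.text, "html5lib")
print(soup.find("div", {"class": "fieldPrice sizeC"}).text)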
Instead of page.content, use page.text.
Ex:
import requests
from bs4 import BeautifulSoup
link = "https://www.lacentrale.fr/listing?makesModelsCommercialNames=FERRARI&sortBy=priceAsc"
request = requests.get(link)
page = request.text
soup = BeautifulSoup(page, "html.parser")
price = soup.find("div", {"class": "fieldPrice sizeC"}).text
print(price)
.text automatically decodes the content from the server, using the encoding that requests infers from the response headers.
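When the declared encoding is wrong, you can also let requests re-guess the encoding from the response body itself via its apparent_encoding attribute (a sketch continuing from the snippet above):
request.encoding = request.apparent_encoding  # re-detect the encoding from the raw bytes
page = request.text
soup = BeautifulSoup(page, "html.parser")
print(soup.find("div", {"class": "fieldPrice sizeC"}).text)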