I have an HTML page with multiple divs like:
<div class="post-info-wrap">
<h2 class="post-title">sample post – example 1 post</h2>
<div class="post-meta clearfix">
<div class="post-info-wrap">
<h2 class="post-title">sample post – example 2 post</h2>
<div class="post-meta clearfix">
and I need to get the link from every div with class post-info-wrap. I am new to BeautifulSoup,
so I need these URLs:
https://www.example.com/blog/111/this-is-1st-post/
https://www.example.com/blog/111/this-is-2nd-post/
and so on...
I have tried:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.example.com/blog/author/abc")
data = r.content # Content of response
soup = BeautifulSoup(data, "html.parser")
for link in soup.select('.post-info-wrap'):
    print(link.find('a').attrs['href'])
This code doesn't seem to be working. I am new to Beautiful Soup. How can I extract the links?
link = i.find('a', href=True) does not always return an anchor tag (a); it may return NoneType. So you need to check whether link is None and, if so, continue the for loop; otherwise get the link's href value.
Scrape links by URL:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.example.com/blog/author/abc")
data = r.content # Content of response
soup = BeautifulSoup(data, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Scrape links from HTML:
from bs4 import BeautifulSoup
html = '''<div class="post-info-wrap"><h2 class="post-title">sample post – example 1 post</h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title">sample post – example 2 post</h2><div class="post-meta clearfix">'''
soup = BeautifulSoup(html, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
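Note that the sample HTML above contains no `<a>` tags, so the loop prints nothing. A self-contained sketch with hypothetical anchors (the URLs are made up to match the question) shows the expected behavior:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the same post-info-wrap structure as above,
# but with the <a> tags the real page presumably contains.
html = '''<div class="post-info-wrap"><h2 class="post-title">
<a href="https://www.example.com/blog/111/this-is-1st-post/">sample post 1</a></h2></div>
<div class="post-info-wrap"><h2 class="post-title">
<a href="https://www.example.com/blog/111/this-is-2nd-post/">sample post 2</a></h2></div>'''

soup = BeautifulSoup(html, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
```

This prints the two URLs, one per line, since each div here actually contains an anchor with an href.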
Update:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://www.example.com/blog/author/abc")
soup = BeautifulSoup(driver.page_source, "html.parser")
for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Output:
https://www.example.com/blog/911/article-1/
https://www.example.com/blog/911/article-2/
https://www.example.com/blog/911/article-3/
https://www.example.com/blog/911/article-4/
https://www.example.com/blog/random-blog/article-5/
For the Chrome browser:
http://chromedriver.chromium.org/downloads
Install the web driver for the Chrome browser:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
where '/usr/bin/chromedriver' is the Chrome webdriver path.
You can use soup.find_all:
from bs4 import BeautifulSoup as soup
r = [i.a['href'] for i in soup(html, 'html.parser').find_all('div', {'class':'post-info-wrap'})]
Output:
['https://www.example.com/blog/111/this-is-1st-post/', 'https://www.example.com/blog/111/this-is-2nd-post/']
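Note that `i.a['href']` raises a TypeError if any matching div lacks an `<a>` tag. A guarded sketch, using made-up markup with one div that has a link and one that doesn't:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one post block with a link, one without.
html = '''<div class="post-info-wrap"><a href="https://www.example.com/blog/111/this-is-1st-post/">post 1</a></div>
<div class="post-info-wrap"><h2 class="post-title">no link here</h2></div>'''

soup = BeautifulSoup(html, 'html.parser')
# Keep only the divs that actually contain an <a> tag.
r = [i.a['href'] for i in soup.find_all('div', {'class': 'post-info-wrap'}) if i.a is not None]
print(r)
```

The `if i.a is not None` guard makes the comprehension skip link-less divs instead of crashing.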
Related
I am trying to parse a specific href link from the following website: https://www.murray-intl.co.uk/en/literature-library.
The element I seek to parse:
<a class="btn btn--naked btn--icon-left btn--block focus-within" href="https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc&_ga=2.12911351.1364356977.1629796255-1577053129.1629192717" target="blank">Portfolio Holding Summary<i class="material-icons btn__icon">library_books</i></a>
However, using BeautifulSoup I am unable to obtain the desired element, perhaps due to cookie acceptance.
from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.murray-intl.co.uk/en/literature-library')
soup = BeautifulSoup(page.content, 'html.parser')
link = soup.find_all('a', class_='btn btn--naked btn--icon-left btn--block focus-within')
url = link[0].get('href')
url
I am still new to BS4, and hope someone can set me on the right course.
Thank you in advance!
To get correct tags, remove "focus-within" class (it's added later by JavaScript):
import requests
from bs4 import BeautifulSoup
url = "https://www.murray-intl.co.uk/en/literature-library"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
links = soup.find_all("a", class_="btn btn--naked btn--icon-left btn--block")
for u in links:
    print(u.get_text(strip=True), u.get("href", ""))
Prints:
...
Portfolio Holding Summarylibrary_books https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
...
EDIT: To get only the specified link you can use for example CSS selector:
link = soup.select_one('a:-soup-contains("Portfolio Holding Summary")')
print(link["href"])
Prints:
https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
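The `:-soup-contains()` pseudo-class can be tried in isolation; a self-contained sketch using a stripped-down stand-in for the real page's markup:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page: two anchors, only one with the wanted text.
html = '''<a class="btn" href="https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc">Portfolio Holding Summary</a>
<a class="btn" href="https://www.example.com/other">Annual Report</a>'''

soup = BeautifulSoup(html, "html.parser")
# Select the first <a> whose text contains the given substring.
link = soup.select_one('a:-soup-contains("Portfolio Holding Summary")')
print(link["href"])
```

`:-soup-contains()` matches on the tag's text content, so it is handy when the class list is unstable (as here, where JavaScript adds classes later).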
Current code:
import bs4
import requests
url = 'hidden'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
bs4_content = soup.find_all(class_='user-post-count')
print(bs4_content)
The only content I manage to get is
[<p class="user-post-count">This user has made <strong>5 posts</strong>
</p>]
I'm trying to only get the content between the strong tags.
Thank you, all help much appreciated
You can use an inner .find_all:
import bs4
import requests
url = 'hidden'
res = requests.get(url)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
# find_all returns a ResultSet, so iterate over it before searching inside each tag
for content in soup.find_all(class_='user-post-count'):
    for strong in content.find_all('strong'):
        print(strong.text)
Try using a CSS Selector .user-post-count strong, which selects the <strong> tags under the user-post-count class.
from bs4 import BeautifulSoup
html = '''<p class="user-post-count">This user has made <strong>5 posts</strong>
</p>
'''
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select('.user-post-count strong'):
    print(tag.text)
Output:
5 posts
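If you want just the numeric count rather than the whole "5 posts" string, you can split the tag's text; a small sketch, assuming the "N posts" wording shown above:

```python
from bs4 import BeautifulSoup

html = '<p class="user-post-count">This user has made <strong>5 posts</strong></p>'
soup = BeautifulSoup(html, "html.parser")
strong = soup.select_one('.user-post-count strong')
# "5 posts" -> take the first whitespace-separated token and convert it
count = int(strong.text.split()[0])
print(count)
```

This assumes the number always comes first in the `<strong>` text; adjust the parsing if the site words it differently.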
How can I scrape dynamic content in Beautiful Soup (or any other library) for tags such as
<use xlink:href="#icon-verified"></use> and <span data-test="answer-box-thanks-value">21</span>?
I am unable to access these using Beautiful Soup:
# <span data-test="answer-box-thanks-value">19</span>
import requests
from bs4 import BeautifulSoup

r = requests.get('https://brainly.co.id/tugas/148')
r = r.text
# print("Terima kasih" in r)
bsoup = BeautifulSoup(r, 'html.parser')
for span in bsoup.find_all('span', {"data-test": "answer-box-thanks-value"}):
    print(span)
If you inspect the page, you see that the data is under "upvoteCount":21.
Try the following to get the correct upvote count:
import re
import requests
from bs4 import BeautifulSoup
URL = "https://brainly.co.id/tugas/148"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")
result = re.search(r'"upvoteCount":(\d+)', str(soup)).group(1)
print(result) # Output: 21
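The same regex technique works whenever the value is embedded in a script blob rather than rendered into the visible tags; a self-contained sketch with a made-up payload mimicking that layout:

```python
import re
from bs4 import BeautifulSoup

# Made-up HTML embedding the data in a <script> blob, as the real page does.
html = '''<html><body>
<script>window.__DATA__ = {"answer": {"upvoteCount":21, "thanksCount":19}};</script>
</body></html>'''

soup = BeautifulSoup(html, "html.parser")
# Search the serialized document for the JSON key we care about.
match = re.search(r'"upvoteCount":(\d+)', str(soup))
if match:
    print(match.group(1))
```

This avoids needing a browser, but it is brittle: if the site renames the key or changes the JSON shape, the regex must be updated.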
I am trying something new with pulling out all the hrefs in the a tags. It isn't pulling out the hrefs though, and I can't figure out why.
import requests
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
for href in soup.findAll('a'):
    h = href.attrs['href']
    print(h)
You should check whether the key exists, since an <a> tag may not have an href attribute.
import requests
from bs4 import BeautifulSoup
url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
for a in soup.findAll('a'):
    if 'href' in a.attrs:
        print(a.attrs['href'])
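Alternatively, passing `href=True` makes BeautifulSoup return only the anchors that actually have an href, so the key check becomes unnecessary; a sketch on made-up markup with one anchor lacking an href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one anchor with an href, one without.
html = '<a href="/ratings/1">one</a><a name="no-href">two</a>'
soup = BeautifulSoup(html, 'html.parser')
# href=True filters out <a> tags that have no href attribute.
for a in soup.find_all('a', href=True):
    print(a['href'])
```

Only `/ratings/1` is printed; the name-only anchor is never yielded by `find_all`.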
I'm trying to read some names and ids like:
<a class="inst" href="loader.aspx?ParTree=151311&i=3823243780502959"
target="3823243780502959">رتكو</a>
i = 3823243780502959
etc., from tsetmc.com. Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'http://www.tsetmc.com/Loader.aspx?ParTree=15131F'
page = requests.get(url)
soup = BeautifulSoup(page.content , 'html.parser')
first_names_Id = soup.find_all('a',class_='isnt' )
print (first_names_Id)
but it returns an empty list.
How can I read these tags? I have the same issue with other tags.
I used Selenium instead of requests to access the website, and it gave me the results you wanted.
I believe the reason the requests library doesn't return the same HTML response as Selenium is that the website you want to parse is rendered with JavaScript.
Also note that you have a typo in the class attribute value: it should be 'inst', not 'isnt'.
Code:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url = 'http://www.tsetmc.com/Loader.aspx?ParTree=15131F'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
first_names_Id = soup.findAll('a', {'class': 'inst'})
print(first_names_Id)
Output:
[<a class="inst" href="loader.aspx?ParTree=151311&i=33541897671561960" target="33541897671561960">واتي</a>, <a class="inst" href="loader.aspx?ParTree=151311&i=33541897671561960" target="33541897671561960">سرمايه گذاري آتيه دماوند</a>, <a class="inst" href="loader.aspx?ParTree=151311&i=9093654036027968" target="9093654036027968">طپنا7002</a>, <a class="inst" href="loader.aspx?ParTree=151311&i=9093654036027968" target="9093654036027968">اختيارف رمپنا-7840-19/07/1396</a>, <a class="inst" href="loader.aspx?ParTree=151311&i=19004627894176375" target="19004627894176375">طپنا7003</a>, <a class="inst" href="loader.aspx?ParTree=151311&i=19004627894176375" target="19004627894176375">اختيارف رمپنا-8340-19/07/1396</a>, **etc**]