Scrape dynamic content in Python from https://brainly.co.id/tugas/148

How can I scrape dynamic content with Beautiful Soup (or any other library) for tags such as
<use xlink:href="#icon-verified"></use> and <span data-test="answer-box-thanks-value">21</span>?
I am unable to access these using Beautiful Soup:
import requests
from bs4 import BeautifulSoup

# <span data-test="answer-box-thanks-value">19</span>
r = requests.get('https://brainly.co.id/tugas/148')
r = r.text
# print("Terima kasih" in r)
bsoup = BeautifulSoup(r, 'html.parser')
for span in bsoup.find_all('span', {"data-test": "answer-box-thanks-value"}):
    print(span)

If you inspect the page source, you will see that the value is embedded in the raw HTML as "upvoteCount":21.
Try the following to get the correct upvote count:
import re
import requests
from bs4 import BeautifulSoup

URL = "https://brainly.co.id/tugas/148"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

# The count is not in a rendered tag, so pull it out of the raw markup
# with a regular expression.
result = re.search(r'"upvoteCount":(\d+)', str(soup)).group(1)
print(result)  # Output: 21
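
A variant of the same idea that skips BeautifulSoup entirely, searches the raw response text, and guards against a missing match; a minimal sketch using the "upvoteCount" key quoted above:

import re
import requests

URL = "https://brainly.co.id/tugas/148"
html = requests.get(URL).text

# A plain regex search of the page source; no parser needed.
match = re.search(r'"upvoteCount":(\d+)', html)
if match:
    print(match.group(1))
else:
    print("upvoteCount not found in the page source")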


Unable to find element with BeautifulSoup

I am trying to parse a specific href link from the following website: https://www.murray-intl.co.uk/en/literature-library.
The element I seek to parse:
<a class="btn btn--naked btn--icon-left btn--block focus-within" href="https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc&_ga=2.12911351.1364356977.1629796255-1577053129.1629192717" target="blank">Portfolio Holding Summary<i class="material-icons btn__icon">library_books</i></a>
However, using BeautifulSoup I am unable to obtain the desired element, perhaps because of the cookie-acceptance banner.
from bs4 import BeautifulSoup
import requests

page = requests.get('https://www.murray-intl.co.uk/en/literature-library')
soup = BeautifulSoup(page.content, 'html.parser')
link = soup.find_all('a', class_='btn btn--naked btn--icon-left btn--block focus-within')
url = link[0].get('href')
print(url)
I am still new to BS4 and hope someone can set me on the right course.
Thank you in advance!
To get the correct tags, remove the "focus-within" class (it is added later by JavaScript):
import requests
from bs4 import BeautifulSoup

url = "https://www.murray-intl.co.uk/en/literature-library"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

links = soup.find_all("a", class_="btn btn--naked btn--icon-left btn--block")
for u in links:
    print(u.get_text(strip=True), u.get("href", ""))
Prints:
...
Portfolio Holding Summarylibrary_books https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
...
EDIT: To get only the specified link, you can use, for example, a CSS selector:
link = soup.select_one('a:-soup-contains("Portfolio Holding Summary")')
print(link["href"])
Prints:
https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
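
If your BeautifulSoup version does not support :-soup-contains (it needs bs4 4.7+ with the soupsieve package), a plain loop over the anchors does the same job; a minimal sketch:

# Match on the link text manually; get_text() flattens the nested
# <i> icon element, so the label is found even with children present.
link = next(
    (a for a in soup.find_all("a", href=True)
     if "Portfolio Holding Summary" in a.get_text()),
    None,  # mimic select_one, which returns None when nothing matches
)
if link is not None:
    print(link["href"])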

Why does Beautiful Soup not return the content?

I am using bs4 to scrape some results. I can see the HTML content in the page source, but when I try to fetch it with bs4 it does not return the content; instead it says "File does not exist".
from bs4 import BeautifulSoup
import requests
source = requests.get("https://result.smitcs.in/grade.php?subid=BA1106")
soup = BeautifulSoup(source.text, "html.parser")
marks_pre = soup.find("pre")
marks = marks_pre.find("div")
print(marks.prettify())
The above code returns
<div style="font-family: courier; line-height: 12px;font-size:
20px;background:white;"> File does not exist </div>
The above code works fine if I copy the source code from the web, save it locally as an HTML file, and then parse that file.
Try this; it parses the raw response bytes with the lxml parser and looks for the div directly:
from bs4 import BeautifulSoup
import requests

url = "https://result.smitcs.in/grade.php?subid=BA1106"
page = requests.get(url)

# Parse the raw bytes of the response with the lxml parser.
soup = BeautifulSoup(page.content, 'lxml')
marks = soup.find("div")
print(marks.prettify())
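
If the server still replies with "File does not exist", it may be serving different content to non-browser clients; sending a browser-like User-Agent header sometimes helps. A sketch (the header value is illustrative, not part of the original answer):

import requests
from bs4 import BeautifulSoup

url = "https://result.smitcs.in/grade.php?subid=BA1106"
# Illustrative browser-like header; adjust to taste.
headers = {"User-Agent": "Mozilla/5.0"}
page = requests.get(url, headers=headers)

soup = BeautifulSoup(page.content, "lxml")
div = soup.find("div")
if div is not None:
    print(div.prettify())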

How to extract links from a page using Beautiful Soup

I have an HTML page with multiple divs like:
<div class="post-info-wrap">
<h2 class="post-title">sample post – example 1 post</h2>
<div class="post-meta clearfix">
<div class="post-info-wrap">
<h2 class="post-title">sample post – example 2 post</h2>
<div class="post-meta clearfix">
and I need to get the link from each div with class post-info-wrap. I am new to BeautifulSoup.
The URLs I need are:
https://www.example.com/blog/111/this-is-1st-post/
https://www.example.com/blog/111/this-is-2nd-post/
and so on...
I have tried:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/blog/author/abc")
data = r.content  # content of the response
soup = BeautifulSoup(data, "html.parser")
for link in soup.select('.post-info-wrap'):
    print(link.find('a').attrs['href'])
This code doesn't seem to be working. I am new to Beautiful Soup. How can I extract the links?
i.find('a', href=True) does not always return an anchor tag; it may return None. So check whether link is None and, if it is, continue the loop; otherwise read the link's href value.
Scrape the links from the URL:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.example.com/blog/author/abc")
data = r.content  # content of the response
soup = BeautifulSoup(data, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Scrape the links from an HTML string:
from bs4 import BeautifulSoup

html = '''<div class="post-info-wrap"><h2 class="post-title">sample post – example 1 post</h2><div class="post-meta clearfix">
<div class="post-info-wrap"><h2 class="post-title">sample post – example 2 post</h2><div class="post-meta clearfix">'''
soup = BeautifulSoup(html, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Update: if the links are rendered by JavaScript, load the page with Selenium first:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.get("https://www.example.com/blog/author/abc")
soup = BeautifulSoup(driver.page_source, "html.parser")

for i in soup.find_all('div', {'class': 'post-info-wrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    print(link['href'])
Output:
https://www.example.com/blog/911/article-1/
https://www.example.com/blog/911/article-2/
https://www.example.com/blog/911/article-3/
https://www.example.com/blog/911/article-4/
https://www.example.com/blog/random-blog/article-5/
For the Chrome browser, download the webdriver here:
http://chromedriver.chromium.org/downloads
Installing the webdriver for Chrome:
https://christopher.su/2015/selenium-chromedriver-ubuntu/
Selenium tutorial:
https://selenium-python.readthedocs.io/
Here '/usr/bin/chromedriver' is the path to the Chrome webdriver executable.
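
On Selenium 4 and later, the executable path is passed through a Service object rather than the constructor; a sketch assuming the same chromedriver path:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Same path as above; Selenium 4+ takes it via a Service object.
service = Service('/usr/bin/chromedriver')
driver = webdriver.Chrome(service=service)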
You can use soup.find_all:
from bs4 import BeautifulSoup as soup
r = [i.a['href'] for i in soup(html, 'html.parser').find_all('div', {'class':'post-info-wrap'})]
Output:
['https://www.example.com/blog/111/this-is-1st-post/', 'https://www.example.com/blog/111/this-is-2nd-post/']
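
Note that the comprehension assumes every matched div contains an anchor; i.a is None otherwise, and i.a['href'] would then raise a TypeError. A None-safe sketch:

from bs4 import BeautifulSoup as soup

# Skip divs without an anchor tag instead of raising a TypeError.
r = [
    i.a['href']
    for i in soup(html, 'html.parser').find_all('div', {'class': 'post-info-wrap'})
    if i.a is not None
]
print(r)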

When I read tags using beautifulsoup I always get None

I'm trying to read some names and ids like:
<a class="inst" href="loader.aspx?ParTree=151311&i=3823243780502959"
target="3823243780502959">رتكو</a>
i = 3823243780502959
etc., from tsetmc.com. Here is my code:
import requests
from bs4 import BeautifulSoup
url = 'http://www.tsetmc.com/Loader.aspx?ParTree=15131F'
page = requests.get(url)
soup = BeautifulSoup(page.content , 'html.parser')
first_names_Id = soup.find_all('a',class_='isnt' )
print (first_names_Id)
but it returns None.
How can I read these tags? I have the same issue with other tags.
I used Selenium instead of requests to access the website, and it gave me the results you wanted.
I believe the reason the requests library does not return the HTML the way Selenium does is that the website you want to parse is rendered with JavaScript.
Also note that you have a typo in the class attribute value: it should be 'inst', not 'isnt'.
Code:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
url = 'http://www.tsetmc.com/Loader.aspx?ParTree=15131F'
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'html.parser')
first_names_Id = soup.findAll('a', {'class': 'inst'})
print(first_names_Id)
Output:
[<a class="inst" href="loader.aspx?ParTree=151311&i=33541897671561960" target="33541897671561960">واتي</a>, <a class="inst" href="loader.aspx?ParTree=151311&i=33541897671561960" target="33541897671561960">سرمايه‌ گذاري‌ آتيه‌ دماوند</a>, <a class="inst" href="loader.aspx?ParTree=151311&i=9093654036027968" target="9093654036027968">طپنا7002</a>, <a class="inst" href="loader.aspx?ParTree=151311&i=9093654036027968" target="9093654036027968">اختيارف رمپنا-7840-19/07/1396</a>, <a class="inst" href="loader.aspx?ParTree=151311&i=19004627894176375" target="19004627894176375">طپنا7003</a>, <a class="inst" href="loader.aspx?ParTree=151311&i=19004627894176375" target="19004627894176375">اختيارف رمپنا-8340-19/07/1396</a>, etc.]
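
To turn that list into (id, name) pairs, you can split the id out of each href; a sketch based on the "i=" query parameter visible in the output above:

# Each href carries the numeric instrument id in its "i=" parameter.
for a in first_names_Id:
    name = a.get_text(strip=True)
    inst_id = a["href"].split("i=")[-1]
    print(inst_id, name)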

Get value of span tag using BeautifulSoup

I have a number of Facebook groups whose member counts I would like to get. An example would be this group: https://www.facebook.com/groups/347805588637627/
I have inspected the element on the page, and the count is stored like so:
<span id="count_text">9,413 members</span>
I am trying to get "9,413 members" out of the page. I have tried using BeautifulSoup but cannot work it out. Thanks.
Edit, here is my attempt:
from bs4 import BeautifulSoup
import requests
url = "https://www.facebook.com/groups/347805588637627/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
span = soup.find("span", id="count_text")
print(span.text)
In case there is more than one span tag in the page:
from bs4 import BeautifulSoup
soup = BeautifulSoup(your_html_input, 'html.parser')
span = soup.find("span", id="count_text")
print(span.text)
You can use the text attribute of the parsed span:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<span id="count_text">9,413 members</span>', 'html.parser')
>>> soup.span
<span id="count_text">9,413 members</span>
>>> soup.span.text
'9,413 members'
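
If you need the count as an integer, strip the label and the thousands separator; a small sketch using the string from the example above:

# "9,413 members" -> 9413
text = soup.span.text
count = int(text.split()[0].replace(",", ""))
print(count)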
If you have more than one span tag, you can try this:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
tags = soup('span')
for tag in tags:
    print(tag.contents[0])
Facebook uses JavaScript to render the page and to deter bots from scraping, so you need to use Selenium to extract the data in Python.
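
A minimal Selenium sketch, assuming the span id "count_text" from the question still appears in the rendered page (Facebook's markup changes often):

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.facebook.com/groups/347805588637627/")

# Parse the JavaScript-rendered page source rather than the raw response.
soup = BeautifulSoup(driver.page_source, "html.parser")
span = soup.find("span", id="count_text")
if span is not None:
    print(span.text)
driver.quit()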
