How to scrape text from a hidden div and class using Python? - python

I am working on a script that scrapes video titles from this webpage:
https://www.google.com.eg/trends/hotvideos
The problem is that the titles are missing from the HTML page source, although I can see them when I look with the browser inspector.
This is my code. It works fine with ("class":"wrap"),
but when I use it with a hidden one such as "class":"hotvideos-single-trend-title-container", it gives me no output.
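from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.google.com.eg/trends/hotvideos').read()
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(html, 'html.parser')
# Works with "wrap"; finds nothing for the class below
print(soup.find_all('div', {"class": "hotvideos-single-trend-title-container"}))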

The page is generated/populated using JavaScript.
BeautifulSoup alone won't help you here. You need a library that supports JavaScript-generated HTML pages; see here for a list, or have a look at Selenium.
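To illustrate why the lookup comes back empty, here is a minimal sketch. The RAW_HTML string is a hypothetical stand-in for what urlopen() returns from a JavaScript-driven page (it is not the real Google Trends markup): the wrapper exists in the raw HTML, but the title containers would only be created by the script once a browser runs it.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw HTML of a JavaScript-driven page:
# the wrapper is present, the title containers are only built by the script.
RAW_HTML = """
<html><body>
  <div class="wrap"><div id="trends"></div></div>
  <script>
    // In a browser, this script would fetch the data and build
    // <div class="hotvideos-single-trend-title-container"> elements.
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# Present in the raw HTML, so BeautifulSoup finds it
static_hits = soup.find_all("div", {"class": "wrap"})
# Only mentioned inside the script text, so there is no such element to find
dynamic_hits = soup.find_all("div", {"class": "hotvideos-single-trend-title-container"})

print(len(static_hits), len(dynamic_hits))
```

BeautifulSoup only parses the text it is given; it never runs the script, so elements the script would create simply do not exist in the parsed tree.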

Related

I am trying to parse an internal network webpage with the BeautifulSoup library, but the result is not the same as the page's HTML

I'd like to make an auto-login program for an internal network website.
So I am trying to parse that site using the requests and BeautifulSoup libraries.
It works, but the HTML I get is a lot shorter than the site's HTML.
What's the problem? Maybe a security issue?
Please help me.
import requests
from bs4 import BeautifulSoup as bs
page = requests.get("http://test.com")
soup = bs(page.text, "html.parser")  # the built-in parser is named "html.parser"
print(soup)  # I get HTML that is a lot shorter than the site's HTML

Web scraping with Python and beautifulsoup: What is saved by the BeautifulSoup function?

This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the problem is that the site tipico is not available from Switzerland. I thus chose another betting site: Winamax. In the tutorial, the tipico webpage is first inspected in order to find where the betting rates are located in the HTML file. In the tipico webpage, they were stored in buttons of class "c_but_base c_but". By writing the following lines, the rates could therefore be saved and printed using the Beautiful Soup module:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.tipico.de/de/live-wetten/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
The problem is that it prints nothing: Python does not find any elements of that class (right?). I thus tried to print the soup object in order to see what the BeautifulSoup function was actually doing. I added this line:
print(soup)
When printing it (I do not show the printed soup here because it is too long), I notice that it is not the same text as what appears when I right-click and choose "Inspect" on the Winamax webpage. So what exactly is the BeautifulSoup function doing? How can I store the betting rates from the Winamax website using BeautifulSoup?
EDIT: I have never coded in HTML and I'm a beginner in Python, so some terminology might be wrong; that's why some parts are in italics.
That's because the website uses JavaScript to display these details, and BeautifulSoup does not execute JavaScript on its own.
First, find out whether the element you want to scrape is present in the page source; if it is, you can scrape pretty much everything. In your case, the button/span tags are not in the page source (meaning they are hidden or pulled in through a script).
There is no <button> tag in the page source.
So I suggest using Selenium as the solution. I tried a basic scrape of the website.
Here is the code I used:
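from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.binary_location = r'Your chrome.exe file path'
# Selenium 4 takes the driver path through a Service object
browser = webdriver.Chrome(service=Service(executable_path=r'Your chromedriver.exe file path'), options=option)
browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")
# find_elements_by_tag_name was removed in Selenium 4; use find_elements with By
span_tags = browser.find_elements(By.TAG_NAME, 'span')
for span_tag in span_tags:
    print(span_tag.text)
browser.quit()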
The output contains some junk data, but it is up to you to figure out what you need and what you don't!
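One possible way to separate the rates from the junk is to keep only the span texts that look like decimal numbers. This is just a sketch: the odds format ("1,85" or "3.40"), the keep_odds helper and the sample list are all assumptions for illustration, not something taken from the actual Winamax output.

```python
import re

# Assumption: odds are short decimal numbers such as "1,85" or "3.40"
ODDS_PATTERN = re.compile(r"\d{1,3}[.,]\d{1,2}")

def keep_odds(texts):
    """Return the strings that look like decimal odds, converted to floats."""
    odds = []
    for text in texts:
        candidate = text.strip()
        # fullmatch() rejects strings that merely contain a number
        if ODDS_PATTERN.fullmatch(candidate):
            odds.append(float(candidate.replace(",", ".")))
    return odds

# Hypothetical sample of what span_tag.text might yield
sample = ["Paris sportifs", "1,85", "", "3.40", "Se connecter", "12"]
print(keep_odds(sample))
```

With Selenium, the input would be something like [span_tag.text for span_tag in span_tags]. A plain regular expression is crude; if the odds share a recognizable class or parent element, selecting on that is likely to be more reliable.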

How to scrape link title over many pages and through specified tab

I am having trouble figuring out how to use BeautifulSoup to scrape all 100 link titles on the page, since they are under "a href = .....". I have tried the code below, but it returns nothing.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import bs4
url = 'https://www150.statcan.gc.ca/n1/en/type/data?count=100'
page = urlopen(url)
soup = bs4.BeautifulSoup(page,'html.parser')
title = soup.find_all('a')
Additionally, is there a way to ensure I am scraping everything under the "Tables (8898)" tabs? Thanks in advance!
Link:
https://www150.statcan.gc.ca/n1/en/type/data?count=100
The link you provided loads its contents with asynchronous JavaScript requests. So when you execute page = urlopen(url), it only fetches the empty HTML and the JavaScript blocks.
You need to use a browser to execute the JavaScript and load the page contents. You can check out this link to learn how to do it: https://towardsdatascience.com/web-scraping-using-selenium-python-8a60f4cf40ab

Why is Beautiful Soup not extracting all of the "a" tags from a website

I am learning BeautifulSoup and I tried to extract all the "a" tags from a website. I am getting a lot of "a" tags, but a few of them are ignored and I am confused about why that is happening. Any help will be highly appreciated.
The link I used is: https://www.w3schools.com/python/
img: https://ibb.co/mmEKTK
The red box in the image is a section that has been totally ignored by bs4. It does contain "a" tags.
Code:
import requests
import bs4
import re
import html5lib
res = requests.get('https://www.w3schools.com/python/')
soup = bs4.BeautifulSoup(res.text,'html5lib')
try:
    links_with_text = []
    for a in soup.find_all('a', href=True):
        print(a['href'])
except:
    print ('none')
sorry for the code indentation i am new here.
The links that bs4 ignores are rendered dynamically: advertisements and similar elements are not present in the HTML code but are loaded by scripts based on your browsing habits. The requests package only fetches static HTML content; you need to simulate a browser to get the dynamic content.
Selenium can be used with any browser, such as Chrome or Firefox. If you want to achieve the same results on a server (without a UI), use a headless browser like PhantomJS.
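To see which links are added by scripts, one option is to compare the hrefs found in the static HTML with those found in the browser-rendered HTML. The sketch below uses two short hypothetical HTML strings; in practice the first would come from requests.get(url).text and the second from a Selenium driver's page_source. The helper names are made up for this example.

```python
from bs4 import BeautifulSoup

def hrefs(html):
    """Collect the href values of all <a> tags in an HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    return {a["href"] for a in soup.find_all("a", href=True)}

def script_injected_links(static_html, rendered_html):
    """Links present after rendering but absent from the static HTML."""
    return sorted(hrefs(rendered_html) - hrefs(static_html))

# Hypothetical static HTML: the ad container is still empty
static_html = (
    '<div id="nav"><a href="/python/default.asp">Home</a></div>'
    '<div id="ads"></div>'
)
# Hypothetical rendered HTML: a script has filled the ad container
rendered_html = (
    '<div id="nav"><a href="/python/default.asp">Home</a></div>'
    '<div id="ads"><a href="https://ads.example/click">Ad</a></div>'
)

print(script_injected_links(static_html, rendered_html))
```

Anything in the returned list is a link that requests plus BeautifulSoup alone could not have found.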

Scraping text from HTML5 website using Python

I need a way to scrape just the text from a website using Python. I have installed BeautifulSoup 4, HTML Requests, and NLTK, but I just can't seem to figure out how to scrape.
I really need a simple snippet of code that I can plug any URL into and get the plain text. I'm trying to get it from this website
BeautifulSoup can extract all the texts from a page easily. The following is an example to extract texts inside the <body>...</body> section.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://developer.valvesoftware.com/wiki/Hammer_Selection_Tool'
# In Python 3, urlopen() responses are context managers, so closing() is not needed
with urlopen(url) as h:
    soup = BeautifulSoup(h.read(), 'html.parser')
print(soup.body.get_text())
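Depending on the BeautifulSoup version, get_text() can also return the contents of script and style tags, and the result usually contains a lot of stray whitespace. A slightly more defensive sketch is shown below; the visible_text helper and the sample HTML are made up for illustration.

```python
from bs4 import BeautifulSoup

def visible_text(html):
    """Return the text of an HTML document without script/style contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove tags whose contents are code rather than readable text
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # Fall back to the whole document if there is no <body>
    root = soup.body or soup
    # strip=True trims each text fragment and drops the empty ones
    return root.get_text(separator=" ", strip=True)

# Hypothetical sample page
sample = """<html><head><style>p {color: red}</style></head>
<body><h1>Selection Tool</h1><script>var x = 1;</script>
<p>Click an object to <b>select</b> it.</p></body></html>"""

print(visible_text(sample))
```

For a real page, pass the downloaded HTML (for example h.read() from the snippet above) instead of the sample string.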
