I'm trying to scrape the date and the minimum and maximum temperatures from the site https://www.ipma.pt/pt/otempo/prev.localidade.hora/#Porto&Gondomar.
I want to find all of the divs with the date class and all the spans with the tempMin and tempMax classes, so I wrote
pagina2= "https://www.ipma.pt/pt/otempo/prev.localidade.hora/#Porto&Gondomar"
client2= uReq(pagina2)
pagina2bs= soup(client2.read(), "html.parser")
client2.close()
data = pagina2bs.find_all("div", class_="date")
minT = pagina2bs.find_all("span", class_="tempMin")
maxT = pagina2bs.find_all("span", class_="tempMax")
but all I get are empty lists. I've compared this with similar code and I can't see where I made a mistake, since there are clearly tags with these classes.
From my perspective, the problem is the content of the pagina2bs variable: you are passing the right arguments to find_all, but the page builds those elements with JavaScript, so they never appear in the HTML you downloaded.
Use Selenium to get the rendered HTML of that website.
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome headless so no browser window opens
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options, executable_path='C:/Users/**USERNAME**/Desktop/chromedriver.exe')
startUrl = "https://www.ipma.pt/pt/otempo/prev.localidade.hora/#Porto&Gondomar"
driver.get(startUrl)
# page_source contains the HTML after the JavaScript has run
html = driver.page_source
soup = bs(html, features="html5lib")
divs = soup.find_all("div", class_="date")
print(divs)
Install all the needed packages (selenium, beautifulsoup4, html5lib) and the Selenium Chrome driver, then point executable_path at the chromedriver location on your machine, like I did above.
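For reference, the Python packages used above can be installed with pip (chromedriver itself is a separate download that must match your Chrome version):
pip install selenium beautifulsoup4 html5lib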
I have to extract the publication date displayed on the following web page with BeautifulSoup in Python:
https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410
The point is that when I search the HTML shown by the browser's 'inspect' tool, I find the publication date quickly, but when I search the HTML fetched with Python, I cannot find it, even with find() and find_all().
I tried this code:
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all('span', id='biblio-publication-number-content')
but it gives me [], even though that tag is right there in the 'inspect' view of the live page.
What am I doing wrong that makes the 'inspect' HTML differ from what I get with BeautifulSoup?
How can I solve this issue and get the number?
The problem, I believe, is that the content you are looking for is loaded by JavaScript after the initial page loads. requests only returns the initial page content, before the DOM is modified by JavaScript.
To handle this, install selenium and download a Selenium web driver for your specific browser. Put the driver in a directory that is on your PATH and then (here I am using Chrome):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs

options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')
    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()
Prints:
<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
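If the implicit wait proves flaky, an explicit wait on the specific element is an alternative; here is a sketch of the same flow using WebDriverWait:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
    # Block until the span exists in the DOM, or raise TimeoutException after 10 s
    elem = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'biblio-publication-number-content')))
    print(elem.text)
finally:
    driver.quit()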
Umberto, if you are looking for span HTML elements, use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
print(results)
If you are looking for an HTML element with the id 'biblio-publication-number-content', use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all(id='biblio-publication-number-content')
In the first case you are fetching all span elements.
In the second case you are fetching all elements with the id 'biblio-publication-number-content'.
I suggest you read up on HTML tags and elements for a deeper understanding of how they work and the semantics behind them.
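One detail worth noting, since it often trips people up: in BeautifulSoup, the class filter needs a trailing underscore (class_) because class is a Python keyword, while id is passed plainly. A minimal, self-contained illustration:
from bs4 import BeautifulSoup

html = '<span id="biblio-publication-number-content" class="num">X</span>'
soup = BeautifulSoup(html, 'html.parser')
# class is a Python keyword, so BeautifulSoup expects class_
print(soup.find_all('span', class_='num'))
# id is not a keyword, so no underscore; id_ would filter on an attribute literally named "id_"
print(soup.find_all('span', id='biblio-publication-number-content'))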
I am trying to webscrape from the website https://roll20.net/compendium/dnd5e/Monsters%20List#content and am having some issues.
My first script kept returning an empty list when finding by div and class name, which I believe is due to the site using JavaScript? But I am a little uncertain whether that is the case or not.
Here was my first attempt:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://roll20.net/compendium/dnd5e/Monsters%20List#content')
soup = BeautifulSoup(page.text, 'html.parser')
card = soup.find_all("div", class_='card')
print(card)
This one returns an empty list, so I then tried to use Selenium to scrape instead. Here is that script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
url='https://roll20.net/compendium/dnd5e/Monsters%20List#content'
driver = webdriver.Firefox(executable_path=r'C:\Windows\System32\geckodriver')
driver.get(url)
page = driver.page_source
page_soup = BeautifulSoup(page, 'html.parser')
Starting from that script, I then tried all three of these options (I ran them individually; they are just listed together here for simplicity's sake):
for card in body.find('div', {"class":"card"}):
    print(card.text)
    print(card)

for card in body.find_all('div', {"class":"card"}):
    print(card.text)
    print(card)

card = body.find_all('div', {"class":"card"})
print(card)
All of them return the same error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Where am I going wrong here?
Edit:
Fazul, thank you for your input. I guess I should be more specific: I was looking to get the contents of each card. For example, each card has a "body" class, and within that body there are many fields holding the data I want to extract. Maybe I am misunderstanding your script and what you stated. Here is a screenshot to help clarify which content I am after.
So everything that would be under the body, i.e. name, title, subtitle, etc. Those were the texts I was trying to extract.
That page is loaded by JavaScript, so BeautifulSoup alone will not work in this case. You have to use Selenium.
Also, the element you are looking for, the <div> with class name card, shows up only when you click on the drop-down arrow. Since your code performs no click event, you get an empty result.
Use Selenium to click the <div> with class name dropdown-toggle; that click event loads the <div class='card'>.
Then you can scrape the data you need, as in the sketch below.
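A minimal sketch of that click-then-parse flow, assuming the class names mentioned above (the fixed sleep is a crude stand-in for a proper wait, and stubborn toggles may need scrolling into view first):
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://roll20.net/compendium/dnd5e/Monsters%20List#content")
sleep(5)  # give the JavaScript time to render the list
# Click every drop-down arrow so the cards are inserted into the DOM
for toggle in driver.find_elements(By.CLASS_NAME, "dropdown-toggle"):
    toggle.click()
soup = BeautifulSoup(driver.page_source, "html.parser")
cards = soup.find_all("div", class_="card")
print(len(cards))
driver.quit()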
Since the page is loaded by JavaScript, you have to use Selenium.
My solution is as follows:
Every card has a link.
Use BeautifulSoup to get the link for each card, then open each link with Selenium in headless mode, because each card page is also loaded by JavaScript.
Then you can get the data you need for each card using BeautifulSoup.
Working code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from time import sleep
from bs4 import BeautifulSoup

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://roll20.net/compendium/dnd5e/Monsters%20List#content")
# Wait until the page is loaded, then grab the rendered source
sleep(13)
page = driver.page_source
soup = BeautifulSoup(page, "html.parser")

# Get all card urls
urls = soup.find_all("div", {"class": "header inapp-hide single-hide"})
cards_list = []
for url in urls:
    card = {}
    card_url = url.a["href"]
    card_link = f'https://roll20.net{card_url}#content'
    # Open Chrome in headless mode
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    driver.get(card_link)
    page = driver.page_source
    soup = BeautifulSoup(page, "html.parser")
    card["name"] = soup.select_one(".name").text.split("\n")[1]
    card["subtitle"] = soup.select_one(".subtitle").text
    # Do the same to get all the other card details
    cards_list.append(card)
    driver.quit()
Library you need to install:
pip install webdriver_manager
This library downloads an up-to-date Chrome driver for you, so there is no need to fetch chromedriver (or geckodriver) manually.
You can get the JSON data from this endpoint instead of scraping the page:
https://roll20.net/compendium/compendium/getList?bookName=dnd5e&pageName=Monsters%20List&_=1630747796747
Just parse it and take the data you need.
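A minimal sketch of that approach (this assumes the endpoint returns plain JSON; inspect the payload before relying on specific keys):
import requests

url = ("https://roll20.net/compendium/compendium/getList"
       "?bookName=dnd5e&pageName=Monsters%20List")
response = requests.get(url)
response.raise_for_status()
data = response.json()  # parse the JSON body into Python objects
print(type(data))       # inspect the structure before digging in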
I am using bs4 and Python 3.6. My problem: there is a YouTube search page, and I want the link of the first video in it. After inspecting, I found that the id of that anchor tag is video-title, so I used that parameter to find the a tag with the code below. Since every video's anchor tag has the same video-title id, I decided to use find instead of find_all.
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('https://www.youtube.com/results?search_query=unravel').text, 'lxml')
link = soup.find('a', id="video-title")
print(link)
but in return it gives
None
I have tried getting all the anchor tags, but the result doesn't include the tag I want.
Can anyone tell what the problem is?
You can use the regex r"/watch\?v=[\w-]+" to match your links; it is simpler than bs4 ☺
Use selenium together with a regex for best results.
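A rough sketch of the regex idea (it assumes the /watch?v= links appear verbatim in the downloaded page source, which can change as YouTube evolves):
import re
import requests

html = requests.get('https://www.youtube.com/results?search_query=unravel').text
# Video IDs are 11 characters drawn from word characters plus '-'
ids = re.findall(r'/watch\?v=([\w-]{11})', html)
if ids:
    print(f'https://www.youtube.com/watch?v={ids[0]}')  # first search result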
You can try this, assuming you have selenium and lxml installed in your environment.
from selenium import webdriver
from bs4 import BeautifulSoup

def get_tag():
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    driver = webdriver.Chrome('chromedriver', options=chrome_options)
    driver.get('https://www.youtube.com/results?search_query=unravel')
    # print(driver.page_source)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    atags = soup.find_all('a', {'id': 'video-title'})
    for tag in atags:
        print(tag.get('title'))

get_tag()
This function prints the title attribute of every <a> tag that has video-title as its id.
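Since the original goal was the first video's link, you could also print the first match's href; a sketch meant to sit inside get_tag() after atags is built (hrefs on this page are relative, so the domain is prepended):
# Inside get_tag(), after atags = soup.find_all(...):
if atags:
    print('https://www.youtube.com' + atags[0].get('href'))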
I was trying to scrape some URLs from a particular link. I used BeautifulSoup for scraping those links, but I'm not able to get them. I'm attaching the code I used; what I want is to scrape the URLs from the class "fxs_aheadline_tiny".
import requests
from bs4 import BeautifulSoup
url = 'https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0&dFR%5BTags%5D%5B0%5D=EURUSD'
r1 = requests.get(url)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html.parser')
coverpage_news = soup1.find_all('h4', class_='fxs_aheadline_tiny')
print(coverpage_news)
Thank you
I would use Selenium.
Please try this code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Open driver
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0&dFR%5BTags%5D%5B0%5D=EURUSD')

# Use ChroPath to identify the xpath for the 'page hits'
pagehits = driver.find_element_by_xpath("//div[@class='ais-hits']")

# Search for all a tags inside the page hits
links = pagehits.find_elements_by_tag_name("a")

# For each link, print the href
for link in links:
    print(link.get_attribute('href'))
It does exactly what you want: it pulls out every URL/link on the search page (which means it also includes the links to the author pages).
You could even consider automating the browser and moving through the search page results; for this, see the Selenium documentation: https://selenium-python.readthedocs.io/ and the rough sketch below.
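A rough sketch of that pagination idea; the next-page selector below is hypothetical, so inspect the page for the real one before using it:
from selenium import webdriver
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0&dFR%5BTags%5D%5B0%5D=EURUSD')
for _ in range(3):  # walk the first few result pages
    hits = driver.find_element(By.XPATH, "//div[@class='ais-hits']")
    for link in hits.find_elements(By.TAG_NAME, "a"):
        print(link.get_attribute('href'))
    # Hypothetical selector for the pager's 'next' control; adjust after inspecting
    driver.find_element(By.CSS_SELECTOR, '.ais-pagination--item__next a').click()
driver.quit()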
Hope this helps
So I'm trying to get the degrees from this weather site, but it keeps returning a blank answer. This is my code.
import requests
from bs4 import BeautifulSoup
# -----------------------------get site info------------------------------- #
URL = "https://www.theweathernetwork.com/ca/hourly-weather-forecast/ontario/oakville"
request = requests.get(URL)
# print(request.content)
# ----------------------parse site info---------------- #
soup = BeautifulSoup(request.content, 'html5lib')
#print(soup.prettify().encode("utf-8"))
weatherdata = soup.find('span', class_='temp')
print(weatherdata)
It might be that those values are rendered dynamically, i.e. populated by JavaScript running in the page.
requests.get() simply returns the markup received from the server, without any client-side changes, so no amount of waiting on your end will make the values appear.
You could use the Selenium Chrome webdriver to load the page URL and get the rendered page source (or you can use the Firefox driver).
Go to chrome://settings/help to check your current Chrome version and download the driver for that version from here. Make sure to keep the driver file either on your PATH or in the same folder as your Python script.
Try this:
from bs4 import BeautifulSoup as bs
from selenium.webdriver import Chrome  # pip install selenium
from selenium.webdriver.chrome.options import Options

url = "https://www.theweathernetwork.com/ca/hourly-weather-forecast/ontario/oakville"

# Make it headless, i.e. run in the background without opening a chrome window
chrome_options = Options()
chrome_options.add_argument("--headless")

# Use Chrome to get the page with its javascript-generated content
with Chrome(executable_path="./chromedriver", options=chrome_options) as browser:
    browser.get(url)
    page_source = browser.page_source

# Parse the final page source
soup = bs(page_source, 'html.parser')
weatherdata = soup.find('span', class_='temp')
print(weatherdata.text)
Prints:
10
References:
Get page generated with Javascript in Python
selenium - chromedriver executable needs to be in PATH
The problem seems to be that the data is loaded via JavaScript, so it takes a while for that specific span to get its value. At the moment your request runs it is still empty, and it only fills in after a bit. One possible solution is to use Selenium to wait for the element to load and then extract the HTML afterwards.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.theweathernetwork.com/ca/hourly-weather-forecast/ontario/oakville"
browser = webdriver.Chrome()
browser.get(url)
# Wait (up to 10 seconds) for the temperature span to be rendered by JavaScript
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "temp")))
html = browser.page_source
browser.quit()
soup = BeautifulSoup(html, 'html.parser')
elem = soup.find('span', class_='temp')
print(elem.text)