I am using Selenium with Python. See the following code:
from selenium.webdriver.common.keys import Keys
import selenium.webdriver
driver = selenium.webdriver.Firefox()
driver.get("http://finance.yahoo.com/q?s=APP")
Now, I want to do one simple thing: get the HTML of that page as a string from the driver, so that I can parse it with BeautifulSoup. Does anyone know how to do this?
Actually, I don't know how to access information from this driver at all, e.g., to get the stock price of Apple in this case.
I am totally new to Selenium. A good tutorial for it would be highly appreciated.
Thank you!
Have a look at the following code.
from selenium.webdriver.common.keys import Keys
import selenium.webdriver
driver = selenium.webdriver.Firefox()
driver.get("http://finance.yahoo.com/q?s=APP")
page_html = driver.page_source
In page_html you will have the HTML of the opened page.
You're looking for page_source.
To continue along with your example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
As another commenter noted, however, you could use a library like requests to the same effect:
import requests
from bs4 import BeautifulSoup

r = requests.get('http://finance.yahoo.com/q?s=APP')
soup = BeautifulSoup(r.content, 'html.parser')
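If you then want to pull the quote itself out of the parsed page, here is a minimal sketch; the selector is an assumption, since Yahoo Finance's markup changes often, so inspect the page and adjust it:
import selenium.webdriver
from bs4 import BeautifulSoup

driver = selenium.webdriver.Firefox()
driver.get("http://finance.yahoo.com/q?s=APP")
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Assumed selector: a <span> whose class mentions "price". Yahoo
# Finance's markup changes often, so adjust after inspecting the page.
price_span = soup.find('span', class_=lambda c: c and 'price' in c.lower())
print(price_span.text if price_span else 'Price element not found')
driver.quit()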
I am trying to scrape data from the website https://roll20.net/compendium/dnd5e/Monsters%20List#content and am having some issues.
My first script kept returning an empty list when finding by div and class name, which I believe is due to the site using JavaScript? But I'm a little uncertain whether that is the case or not.
Here was my first attempt:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://roll20.net/compendium/dnd5e/Monsters%20List#content')
soup = BeautifulSoup(page.text, 'html.parser')
card = soup.find_all("div", class_='card')
print(card)
This one returns an empty list, so then I tried to use Selenium and scrape with that. Here is that script:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from selenium import webdriver
url='https://roll20.net/compendium/dnd5e/Monsters%20List#content'
driver = webdriver.Firefox(executable_path=r'C:\Windows\System32\geckodriver')
driver.get(url)
page = driver.page_source
page_soup = BeautifulSoup(page, 'html.parser')
Starting the script with that, I then tried all three of these different options (I ran these individually; they are just listed together here for simplicity's sake):
for card in body.find('div', {"class":"card"}):
    print(card.text)
    print(card)

for card in body.find_all('div', {"class":"card"}):
    print(card.text)
    print(card)

card = body.find_all('div', {"class":"card"})
print(card)
All of them return the same error message:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
Where am I going wrong here?
Edit:
Fazul, thank you for your input on this; I guess I should be more specific. I was more looking to get the contents of each card. For example, the card has a "body" class, and within that body class there are many fields that contain the data I am looking to extract. Maybe I am misunderstanding your script and what you stated. Here is a screenshot to help specify a bit more what content I am looking to extract.
So everything that would be under the body, i.e. name, title, subtitle, etc. Those were the texts I was trying to extract.
That page is being loaded by JavaScript, so BeautifulSoup alone will not work in this case. You have to use Selenium.
Also, the element that you are looking for - a <div> with the class name card - shows up only when you click on the drop-down arrow. Since you are not doing any click event in your code, you get an empty result. (The AttributeError itself comes from calling find on the ResultSet that find_all returns; iterate over the ResultSet, or call find on a single element instead.)
Use Selenium to click the <div> with the class name dropdown-toggle. That click event loads the <div class='card'>.
Then you can scrape the data you need, as in the sketch below.
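A minimal sketch of that flow, assuming the class names dropdown-toggle and card described above (they may need adjusting, and some toggles may need scrolling into view before they are clickable):
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep

driver = webdriver.Firefox(executable_path=r'C:\Windows\System32\geckodriver')
driver.get('https://roll20.net/compendium/dnd5e/Monsters%20List#content')
sleep(10)  # give the JavaScript time to render the list

# Click every drop-down arrow so the cards get loaded into the DOM
for toggle in driver.find_elements_by_class_name('dropdown-toggle'):
    toggle.click()

# Now the <div class='card'> elements exist and can be parsed
soup = BeautifulSoup(driver.page_source, 'html.parser')
for card in soup.find_all('div', class_='card'):
    print(card.text)
driver.quit()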
Since the page is loaded by JavaScript, you have to use Selenium.
My solution is as follows:
Every card has a link, as shown:
Use BeautifulSoup to get the link for each card, then open each link using Selenium in headless mode, because it is also loaded by JavaScript.
Then you can get the data you need for each card using BeautifulSoup.
Working code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
from time import sleep
from bs4 import BeautifulSoup

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get("https://roll20.net/compendium/dnd5e/Monsters%20List#content")
# Wait until the page is loaded, then grab its source
sleep(13)
page = driver.page_source
driver.quit()
soup = BeautifulSoup(page, "html.parser")
# Get all card urls
urls = soup.find_all("div", {"class": "header inapp-hide single-hide"})
cards_list = []
for url in urls:
    card = {}
    card_url = url.a["href"]
    card_link = f'https://roll20.net{card_url}#content'
    # Open Chrome in headless mode
    chrome_options = Options()
    chrome_options.add_argument("--headless")
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)
    driver.get(card_link)
    page = driver.page_source
    driver.quit()
    card_soup = BeautifulSoup(page, "html.parser")
    card["name"] = card_soup.select_one(".name").text.split("\n")[1]
    card["subtitle"] = card_soup.select_one(".subtitle").text
    # Do the same to get all the card details
    cards_list.append(card)
Library you need to install:
pip install webdriver_manager
This library downloads an up-to-date Chrome driver for you automatically, so there is no need to download chromedriver (or geckodriver) manually.
https://roll20.net/compendium/compendium/getList?bookName=dnd5e&pageName=Monsters%20List&_=1630747796747
You can get the JSON data from this link instead; just parse it and extract the data you need.
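For example, a minimal sketch with requests (an assumption that the endpoint still returns JSON; the trailing _ parameter is just a cache-busting timestamp and can be dropped):
import requests

# Hypothetical usage of the getList endpoint quoted above; the exact
# shape of the returned JSON is not documented, so print it first.
url = ('https://roll20.net/compendium/compendium/getList'
       '?bookName=dnd5e&pageName=Monsters%20List')
data = requests.get(url).json()
print(data)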
I am new to Python and newer to webscraping, so my question might be very basic.
I am trying to use web scraping to download some results from Twitter searches. I have already figured out how the URLs of the searches work, so I am directly accessing the URLs of the searches.
I expect most of my searches to provide no results, and I would like to extract that information in those cases. I’m going to use an example of a search which returns no results. The url would be:
https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query
That would return something like:
I am trying to extract the text ‘No results for ""John Doe" stackexchange trial"’. But there is something in my code which is not working.
The html code from that part is:
The code I am trying is the following:
import os
os.getcwd()
os.chdir('my.dir')
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import pandas as pd
import urllib.request
import requests
import re
urlpage = "https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query"
page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page, 'html.parser')
results = soup.find("main", class_="css-1dbjc4n r-16y2uox r-1wbh5a2")
text_element = results.find("span", class_="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0")
text = text_element.text
print(text)
I believe the problem is when I define "results", it is not finding what I want.
I got to that version of the code by analogy with code that does actually work, which is the following:
urlpage="https://stackexchange.com/"
page = urllib.request.urlopen(urlpage)
soup = BeautifulSoup(page, 'html.parser')
results = soup.find(id='content')
print(results.prettify())
title = results.find('h3', class_='title')
print(title.text)
Thank you very much in advance for all your help!
Edit: so apparently BeautifulSoup doesn't work for this (I'm not sure why; I think it has to do with the way Twitter loads its elements with JavaScript). I had to use Selenium.
Here is a code that does the job:
from selenium import webdriver
import time

url = r'https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(10)
element = driver.find_element_by_xpath("//span[contains(text(),'No results')]")
print(element.text)
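If the fixed ten-second sleep turns out to be flaky, a sketch using an explicit wait is worth trying instead; it polls until the element shows up (same XPath as above):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

url = r'https://twitter.com/search?q=%22John%20Doe%22%20stackexchange%20trial&src=typed_query'
driver = webdriver.Chrome()
driver.get(url)
# Poll for up to 20 seconds until the "No results" span is rendered
element = WebDriverWait(driver, 20).until(
    EC.presence_of_element_located((By.XPATH, "//span[contains(text(),'No results')]"))
)
print(element.text)
driver.quit()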
I was trying to scrape some URLs from a particular page. I used BeautifulSoup for scraping those links, but I'm not able to get them. Here I'm attaching the code which I have used. Specifically, I want to scrape the URLs from the class "fxs_aheadline_tiny".
import requests
from bs4 import BeautifulSoup
url = 'https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0&dFR%5BTags%5D%5B0%5D=EURUSD'
r1 = requests.get(url)
coverpage = r1.content
soup1 = BeautifulSoup(coverpage, 'html.parser')
coverpage_news = soup1.find_all('h4', class_='fxs_aheadline_tiny')
print(coverpage_news)
Thank you
I would use Selenium.
Please try this code:
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

# Open driver
driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get('https://www.fxstreet.com/news?q=&hPP=17&idx=FxsIndexPro&p=0&dFR%5BTags%5D%5B0%5D=EURUSD')
# Use ChroPath to identify the XPath for the 'page hits'
pagehits = driver.find_element_by_xpath("//div[@class='ais-hits']")
# Search for all a tags
links = pagehits.find_elements_by_tag_name("a")
# For each link, print the href
for link in links:
    print(link.get_attribute('href'))
It does exactly what you want: it extracts all URLs/links on your search page (which means also the links to the authors' pages).
You could even consider automating the browser and moving through the search page results, as in the sketch below. For this, see the Selenium documentation: https://selenium-python.readthedocs.io/
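Continuing from the snippet above, a hedged sketch of that idea; the class name of the next-page button is an assumption, so inspect the site with your browser's dev tools to find the real one:
from selenium.common.exceptions import NoSuchElementException

# Hypothetical pagination loop: collect the links on each page until
# no "next page" button is found. The class name below is an assumption.
while True:
    pagehits = driver.find_element_by_xpath("//div[@class='ais-hits']")
    for link in pagehits.find_elements_by_tag_name("a"):
        print(link.get_attribute('href'))
    try:
        driver.find_element_by_class_name("ais-pagination--item-next").click()
    except NoSuchElementException:
        break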
Hope this helps
My first question on Stack Overflow. I'm a beginner in Python, and I want to request the likes of any Instagram photo, but my code returns an empty list.
import requests
from bs4 import BeautifulSoup
url = "https://www.instagram.com/p/BsYt_megGfN/"
r = requests.get(url)
soup = BeautifulSoup(r.content,"html.parser")
data = soup.findAll("div", {"class": "Nm9Fw"})
print(data)
I want to see the names of the people who like the photo, but I can't.
First of all, for scraping you should use a pre-packaged Python distribution like Anaconda. Download it here: https://www.anaconda.com/download/ and remember where the path to your Python executable is.
You get an empty list because Instagram uses JavaScript. requests isn't able to render the JavaScript into HTML for you, so you'll need to use a more robust method, like Selenium.
Try something like this:
Install Selenium
In your terminal:
conda install selenium
Download the Chromedriver
http://chromedriver.chromium.org/downloads
Import Selenium into Your Code
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path="path-to-chromedriver", chrome_options=chrome_options)
driver.get("https://www.instagram.com/p/BsYt_megGfN/")
html_source = driver.page_source
driver.quit()
soup = BeautifulSoup(html_source, "html.parser")
data = soup.findAll("div", {"class": "Nm9Fw"})
print(data)
Run this with your Anaconda version of python.
So, I'm trying to do web scraping for the first time using BeautifulSoup and Python. The page that I am trying to scrape is at: http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup

client = urlopen('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
page_html = client.read()
client.close()
page_soup = soup(page_html, 'html.parser')
identification = page_soup.find('div', {'data-bind': 'text: name'})
print(identification.text)
When I do this I simply get an empty string. If I print out simply the identification variable I get:
<div class="col-xs-7" data-bind="text: name"></div>
This is the line of HTML whose value I am trying to get; as you can see in the browser, there is a value A LEBLANC there in the tag.
You can try this code:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
find = driver.find_element_by_xpath('//*[@id="identificationCollapse"]/div/div/div/div[1]/div[1]/div[2]')
print(find.text)
output:
A LEBLANC
There are several ways you can achieve the same goal. However, I've used a CSS selector in my script, which is easy to understand and is less likely to break unless the HTML structure of that website is heavily changed. (The $= attribute selector matches elements whose data-bind attribute ends with name.) Try this out as well.
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
item_name = soup.select("[data-bind$='name']")[0].text
print(item_name)
Result:
A LEBLANC
Btw, the way you started will also work:
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()
driver.get('http://vesselregister.dnvgl.com/VesselRegister/vesseldetails.html?vesselid=34172')
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
item_name = soup.find('div', {'data-bind':'text: name'}).text
print(item_name)