I am trying to retrieve the comment section on regulations.gov pages. An example is the paragraph "Restrictions on Proprietary Trading... with free market driven valuations." on http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032.
I am using BeautifulSoup and Python and have the following code:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.PhantomJS()
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")

source = driver.page_source.encode('ascii', 'replace')
soup = BeautifulSoup(source, 'html.parser')
print(soup)

commentHolder = soup.find("div", {"class": "GGAAYMKDDNE"})
print(commentHolder)
When I execute print(soup) I get output (albeit a messy one), but when I execute print(commentHolder) I get None. I am not quite sure why this is happening and would appreciate any help. Thank you.
Note: I used the Selenium webdriver to try to get around the JavaScript. Is this a correct approach?
You need to make PhantomJS explicitly wait for the element to become present before reading page_source. This worked for me:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.PhantomJS()
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")

wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.GGAAYMKDGNE")))
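Once the wait succeeds, the rendered source can be handed to BeautifulSoup as before. A minimal continuation sketch (the GWT-generated class name is taken from the wait above and may change between page builds):

from bs4 import BeautifulSoup

source = driver.page_source.encode('ascii', 'replace')
soup = BeautifulSoup(source, 'html.parser')
commentHolder = soup.find("div", {"class": "GGAAYMKDGNE"})  # dynamically generated class; verify in your own session
print(commentHolder)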
I am trying to automate retrieving data from "SAP Business Client" using Python and Selenium.
Since I cannot find the element I want, even though I am sure the locator is correct, I printed out the HTML content with the following code:
from time import sleep
from selenium import webdriver
from selenium.webdriver.edge.service import Service  # assumption: Selenium 4 style Edge imports
from selenium.webdriver.edge.options import Options
from bs4 import BeautifulSoup as soup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

EDGE_PATH = r"C:\Users\XXXXXX\Desktop\WPy64-3940\edgedriver_win64\msedgedriver"

service = Service(executable_path=EDGE_PATH)
options = Options()
options.use_chromium = True
options.add_argument("headless")
options.add_argument("disable-gpu")
cc_driver = webdriver.Edge(service=service, options=options)

cc_driver.get('https://saps4.sap.XXXX.de/sap/bc/ui5_ui5/ui2/ushell/shells/abap/FioriLaunchpad.html#Z_APSuche-display')
sleep(5)

cc_html = cc_driver.page_source
cc_content = soup(cc_html, 'html.parser')
print(cc_content.prettify())
cc_driver.close()
Now I am surprised, because the printed content is different from what Firefox's "Inspect" function shows. For example, I can find the word "Nachname" in the Firefox HTML content, but no such word exists in the HTML printed by the code above.
Does anyone have an idea why the printed content is different?
Thank you for any help... Gunardi
The code you get from Selenium is the code without the JavaScript processing applied to it, so you should retrieve the JavaScript-rendered code through Selenium's JavaScript executor:
String javascript = "return arguments[0].innerHTML";
String pageSource = (String) ((JavascriptExecutor) driver)
        .executeScript(javascript, driver.findElement(By.tagName("html")));
pageSource = "<html>" + pageSource + "</html>";
System.out.println(pageSource);
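That snippet is Java; since the question uses Python, the same idea would look roughly like this (a sketch, reusing cc_driver and the By import from the question's code):

# ask the browser for the live DOM's innerHTML via JavaScript
inner_html = cc_driver.execute_script(
    "return arguments[0].innerHTML",
    cc_driver.find_element(By.TAG_NAME, "html"))
page_source = "<html>" + inner_html + "</html>"
print(page_source)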
I am using the code below to print the soup variable, which is nothing but the source code of the page.
Code
import time
from bs4 import BeautifulSoup
from selenium import webdriver

yes_url = "https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card"

driver = webdriver.Chrome(executable_path="C:\\Users\\Hari\\Downloads\\chromedriver.exe")
driver.get(yes_url)
time.sleep(3)

soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)

driver.close()
The link I am scraping the page source from is: https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card
After running the above code, it keeps running for hours without producing any output.
Please help me scrape the page source so that I get some output after running the code.
Issue: you are dealing with a modern website that checks whether the browser itself is controlled by an automation tool.
How can that be checked?
Simply open your browser console and type the following:
navigator.webdriver
If it's false, your browser isn't controlled by an automation program such as Selenium.
If it's true, it is controlled.
In your case, you have to disable that flag in order to get around the website's detection mechanism.
The code below achieves your goal:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

options = Options()
options.headless = True
# report navigator.webdriver as false so the site does not detect the automation
options.set_preference("dom.webdriver.enabled", False)

driver = webdriver.Firefox(options=options)
driver.get('https://www.yesbank.in/personal-banking/yes-first/cards/credit-card/yes-first-exclusive-credit-card')

try:
    # wait until the page title contains 'YES' before reading the source
    WebDriverWait(driver, 10).until(EC.title_contains('YES'))
    soup = BeautifulSoup(driver.page_source, 'lxml')
    print(soup.prettify())
finally:
    driver.quit()
How do I scrape the data in the input tags' value attributes from the source I inspect, as shown in the image?
I have tried using BeautifulSoup and Selenium, and neither of them works for me.
Partial code is below:
html = driver.page_source
output = driver.find_element_by_css_selector('#bookingForm > div:nth-child(1) > div.bookingType > div:nth-child(15) > div.col-md-9 > input').get_attribute("value")
print(output)
This raises a NoSuchElementException.
In fact, when I try print(html), a lot of the source code appears to be missing. I suspect it is a JavaScript-related issue, but Selenium, which renders JS most of the time, is not working for me on this site. Any idea why?
I tried these as well:
import bs4

html = driver.page_source
soup = bs4.BeautifulSoup(html, 'lxml')
test = soup.find("input", {"class": "inputDisable"})
print(test)
print(soup)
print(test) returns None, and print(soup) returns the source with most input tags entirely missing.
Check whether the element is present on the site by inspecting the page.
If it's there, Selenium is often simply too fast and the page doesn't manage to load completely. Try Selenium's wait functionality; many times that's the case:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")

delay = 3  # seconds
try:
    myElem = WebDriverWait(browser, delay).until(
        EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
Try using the find or find_all functions (https://www.crummy.com/software/BeautifulSoup/bs4/doc/):
from requests import get
from bs4 import BeautifulSoup

url = 'your url'
response = get(url)

bs = BeautifulSoup(response.text, "lxml")
test = bs.find("input", {"class": "inputDisable"})
print(test)
import time
from bs4 import BeautifulSoup
from selenium import webdriver

URL = "https://yourUrl.com"

# Chrome session
driver = webdriver.Chrome("PathOfTheBrowserDriver")
driver.get(URL)
driver.implicitly_wait(100)
time.sleep(5)

soup = BeautifulSoup(driver.page_source, "html.parser")
Try, BEFORE making the soup, to build a pause into your code, in order to give the requests time to do their job (some late requests may contain what you're looking for).
I'm trying to parse data from this website
https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010
In particular, I am trying to get the data under Criterion(ITC). The text I want says CC+ECT.
The information I want appears in the HTML as:
<a class="js-glossary" data-leg="CC+ECT">
I'm new to web scraping and tried the techniques taught in the tutorial, but they didn't work. I heard about Selenium and tried that too, but this code didn't work either.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox(executable_path=r"D:\Python work\driver\geckodriver.exe")
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")

html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
data = soup.find_all("a", attrs={"class": "js-glossary"})
The code results in an empty list. I also read that I can pull out the data by treating the soup tag like a dictionary, in this case:
data["data-leg"]
Am I on the right track, or am I way off?
The text you're trying to get is generated dynamically by JavaScript. To get it, you need to wait for it to appear:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox(executable_path=r"D:\Python work\driver\geckodriver.exe")
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")

text = WebDriverWait(driver, 5).until(
    lambda driver: driver.find_element_by_xpath(
        '//div[.="criterion(itc)"]/following-sibling::div').text)
print(text)
# 'CC + ECT'
It seems you were pretty close. You may not even require BeautifulSoup if you are using Selenium. With Selenium you need to induce a WebDriverWait for the desired element to be visible, and you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH,
    "//div[@class='lbl' and text()='criterion(itc)']//following::div[1]/a"))).get_attribute("innerHTML"))
Console Output:
CC + ECT
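To address the dictionary-style access part of the question: once the element has rendered, BeautifulSoup can read the attribute exactly that way. A sketch (it assumes the wait above has already completed, and reuses the driver and imports from the question):

from bs4 import BeautifulSoup

soup = BeautifulSoup(driver.page_source, 'lxml')
link = soup.find("a", {"class": "js-glossary"})
if link is not None:
    print(link["data-leg"])  # Tag attributes support dict-style lookup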
I am trying to parse (623) 337-**** from a JS-generated site. My code is:
from selenium import webdriver
import re

browser = webdriver.Firefox()
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')
content = browser.page_source
browser.quit()

m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
if m_obj:
    print(m_obj.group(0))
For some reason it doesn't print anything. Any help is appreciated.
Side note: is there a faster way to do this in Python?
The problem is that some of the content is loaded dynamically via AJAX requests after the page loads.
You should wait until an element becomes visible (documentation), then get the source code of the page:
import re

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

browser = webdriver.Firefox()
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')

WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.ID, "profile_details_section_header")))

content = browser.page_source

m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
if m_obj:
    print(m_obj.group(0))

browser.quit()
Prints (623) 337-****.
Or you can call time.sleep() or browser.implicitly_wait() instead, though that doesn't sound quite right.
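For reference, the implicitly_wait() alternative would look roughly like this (a sketch continuing the snippet above; the 10-second timeout is an arbitrary choice):

browser.implicitly_wait(10)  # every subsequent find_element call now polls for up to 10 seconds
browser.find_element_by_id("profile_details_section_header")
content = browser.page_source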
Hope that helps.