Parse from a JS-generated site - Python

I am trying to parse (623) 337-**** from a JS-generated site. My code is:
from selenium import webdriver
import re
browser = webdriver.Firefox()
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')
content = browser.page_source
browser.quit()
m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
if m_obj:
    print(m_obj.group(0))
For some reason it doesn't print anything. Any help is appreciated.
Side note: is there a faster way to do this in Python?

The problem is that some of the content is loaded dynamically, via AJAX requests fired after the initial page load.
You should wait until a known element is present (see the Selenium documentation on explicit waits), then get the source code of the page:
import re
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
browser = webdriver.Firefox()
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "profile_details_section_header")))
content = browser.page_source
m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
if m_obj:
    print(m_obj.group(0))
browser.quit()
Alternatively, you can call time.sleep() or browser.implicitly_wait() instead, though an explicit wait is usually more reliable.
The snippet above prints (623) 337-****.
Hope that helps.
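For completeness, here is a minimal sketch of the implicitly_wait() alternative; the 10-second timeout is an assumption. Note that the implicit wait only applies to element lookups, so you still need to locate an element before reading the page source:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Firefox()
browser.implicitly_wait(10)  # every find_element call now polls for up to 10 seconds
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')
browser.find_element(By.ID, "profile_details_section_header")  # triggers the implicit wait
content = browser.page_source
browser.quit()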

Related

Selenium cannot find elements

I am trying to automate retrieving data from "SAP Business Client" using Python and Selenium.
Since I cannot find the element I want, even though I am sure the locator is correct, I printed out the HTML content with the following code:
from time import sleep
from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.edge.options import Options
from bs4 import BeautifulSoup as soup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
EDGE_PATH = r"C:\Users\XXXXXX\Desktop\WPy64-3940\edgedriver_win64\msedgedriver"
service = Service(executable_path=EDGE_PATH)
options = Options()
options.use_chromium = True
options.add_argument("headless")
options.add_argument("disable-gpu")
cc_driver = webdriver.Edge(service=service, options=options)
cc_driver.get('https://saps4.sap.XXXX.de/sap/bc/ui5_ui5/ui2/ushell/shells/abap/FioriLaunchpad.html#Z_APSuche-display')
sleep(5)
cc_html = cc_driver.page_source
cc_content = soup(cc_html, 'html.parser')
print(cc_content.prettify())
cc_driver.close()
Now I am just surprised, because the printed content is different from what Firefox's "Inspect" function shows. For example, I can find the word "Nachname" in the Firefox HTML content, but no such word exists in the HTML printed by the code above.
Does anyone have an idea why the printed content is different?
Thank you for any help... Gunardi
The code you get from Selenium is the code before the JavaScript has been processed, so you should get the post-JavaScript code by using Selenium's interaction with JavaScript (the original snippet is Java):
String javascript = "return arguments[0].innerHTML";
String pageSource = (String) ((JavascriptExecutor) driver).executeScript(javascript, driver.findElement(By.tagName("html")));
pageSource = "<html>" + pageSource + "</html>";
System.out.println(pageSource);
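The same idea as a minimal Python sketch (the URL is a placeholder; only the technique, not the exact code, comes from the answer above):
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Edge()
driver.get("https://example.com")  # placeholder URL
# Ask the browser for the JavaScript-rendered innerHTML of the <html> element.
html_element = driver.find_element(By.TAG_NAME, "html")
page_source = driver.execute_script("return arguments[0].innerHTML", html_element)
print("<html>" + page_source + "</html>")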

Web scraping text returns 0

Whenever I try to scrape a number from a website and print it, it always returns 0, even if I add a delay to let the window load first.
Here's my code:
from selenium import webdriver
import time
url = 'https://hytrack.me/'
browser = webdriver.Chrome(r'C:\Users\kinet\OneDrive\Documents\webscraper\chromedriver.exe')
browser.get(url)
text = browser.find_element_by_xpath('//*[@id="stat_totalPlayers"]').text
time.sleep(10)
print(text)
All I need it to do is print some text that it takes from a website.
Have I done something wrong or am I just completely missing something?
You should put the delay before getting the element!
from selenium import webdriver
import time
url = 'https://hytrack.me/'
browser = webdriver.Chrome(r'C:\Users\kinet\OneDrive\Documents\webscraper\chromedriver.exe')
browser.get(url)
time.sleep(10)
text = browser.find_element_by_xpath('//*[@id="stat_totalPlayers"]').text
print(text)
However, it's better to use an explicit wait, like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = 'https://hytrack.me/'
browser = webdriver.Chrome(r'C:\Users\kinet\OneDrive\Documents\webscraper\chromedriver.exe')
wait = WebDriverWait(browser, 20)
browser.get(url)
text = wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="stat_totalPlayers"]'))).text
print(text)

Python, Selenium. Google Chrome. Web Scraping. How to navigate between 'tabs' in website

I'm quite a noob in Python and am building a web scraper in Selenium that should take all the URLs for the products in the clicked 'tab' on the web page, but my code takes the URLs from the first 'tab'. Code below. Thank you, guys. I'm starting to get kind of frustrated, lol.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from lxml import html
PATH = r'C:\Program Files (x86)\chromedriver.exe'
driver = webdriver.Chrome(PATH)
url = 'https://www.alza.sk/vypredaj-akcia-zlava/e0.htm'
driver.get(url)
driver.find_element_by_xpath('//*[@id="tabs"]/ul/li[2]').click()
links = []
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'blockFilter')))
    link = driver.find_elements_by_xpath("//a[@class='name browsinglink impression-binded']")
    for i in link:
        links.append(i.get_attribute('href'))
finally:
    driver.quit()
print(links)
To get the handle of the current tab:
current_tab = driver.current_window_handle
To switch between tabs (the first form is deprecated; prefer the second):
driver.switch_to_window(driver.window_handles[1])
driver.switch_to.window(driver.window_handles[-1])
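As a self-contained sketch (the URLs are placeholders, and switch_to.new_window requires Selenium 4+):
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://example.com')
original_tab = driver.current_window_handle
driver.switch_to.new_window('tab')     # open a fresh tab and switch to it
driver.get('https://example.org')
driver.switch_to.window(original_tab)  # jump back to the first tab
driver.quit()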
Assuming you have the link element that opens the new tab as tab_link (ActionChains clicks elements, not URL strings), you should try:
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys
action = ActionChains(driver)
action.key_down(Keys.CONTROL).click(tab_link).key_up(Keys.CONTROL).perform()
Also, apparently the li doesn't have a click event. Are you sure the element you are getting with '//*[@id="tabs"]/ul/li[2]' has the aria-selected property set to true, or one of these classes: ui-tabs-active ui-state-active?
If not, you should call click on the a tag inside this li instead, as sketched below.
You should also increase the timeout parameter of your WebDriverWait to make sure the div has loaded.
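A sketch combining both suggestions (the XPath comes from the question; the 30-second timeout is an assumption):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 30).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="tabs"]/ul/li[2]/a'))
).click()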

Web scraping not getting complete source code data via Selenium/BS4

How do I scrape the data in the input tags' value attributes from the source I inspect, as shown in the image?
I have tried using BeautifulSoup and Selenium, and neither of them works for me.
Partial code is below:
html=driver.page_source
output=driver.find_element_by_css_selector('#bookingForm > div:nth-child(1) > div.bookingType > div:nth-child(15) > div.col-md-9 > input').get_attribute("value")
print(output)
This returns a NoSuchElementException error.
In fact, when I try print(html), a lot of the source data appears to be missing. I suspect it could be a JS-related issue, but Selenium, which renders JS most of the time, is not working for me on this site. Any idea why?
I tried these as well:
html=driver.page_source
soup=bs4.BeautifulSoup(html,'lxml')
test = soup.find("input",{"class":"inputDisable"})
print(test)
print(soup)
print(test) returns None, and print(soup) returns the source with most input tags entirely missing.
Check if this element is present on the site by inspecting the page.
If it's there, Selenium is often just too fast and the page doesn't always manage to load completely; try Selenium's explicit wait function. Many times that's the case.
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print("Page is ready!")
except TimeoutException:
    print("Loading took too much time!")
Try using the find or find_all functions. (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
from requests import get
from bs4 import BeautifulSoup
url = 'your url'
response = get(url)
bs = BeautifulSoup(response.text, "lxml")
test = bs.find("input",{"class":"inputDisable"})
print(test)
from selenium import webdriver
import time
from bs4 import BeautifulSoup
URL = "https://yourUrl.com"
# Chrome session
driver = webdriver.Chrome("PathOfTheBrowserDriver")
driver.get(URL)
driver.implicitly_wait(100)
time.sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
Try, BEFORE making the soup, to pause your code in order to give the pending requests time to do their job (some late requests may contain what you're looking for).
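As a hedged alternative to the fixed sleep above, you can wait for the specific element before making the soup (the inputDisable class comes from the question; the 30-second timeout is an assumption):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CLASS_NAME, "inputDisable"))
)
soup = BeautifulSoup(driver.page_source, "html.parser")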

Trouble Parsing Text using BeautifulSoup and Python

I am trying to retrieve the comment section on regulations.gov pages. An example is the paragraph "Restrictions on Proprietary Trading... with free market driven valuations." on http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032.
I am using BeautifulSoup and Python and have the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")
source = driver.page_source.encode('ascii', 'replace')
soup = BeautifulSoup(source, "html.parser")
print(soup)
commentHolder = soup.find("div", {"class": "GGAAYMKDDNE"})
print(commentHolder)
When I execute print(soup) I get output (albeit a messy one), but when I execute print(commentHolder) I get None as the output. I am not quite sure why this is happening and would appreciate any help. Thank you.
Note: I used the Selenium webdriver to try and get around the JavaScript - is this a correct approach?
You need to let PhantomJS explicitly wait for the element to become present before reading the page_source. This worked for me:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.GGAAYMKDDNE")))
source = driver.page_source.encode('ascii', 'replace')
