I try to automate retrieving data from "SAP Business Client" using Python and Selenium.
Since I cannot find the element I wanted even though I am sure it is correct, I printed out the html content with the following code:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
EDGE_PATH = r"C:\Users\XXXXXX\Desktop\WPy64-3940\edgedriver_win64\msedgedriver"
service = Service(executable_path=EDGE_PATH)
options = Options()
options.use_chromium = True
options.add_argument("headless")
options.add_argument("disable-gpu")
cc_driver = webdriver.Edge(service = service, options=options)
cc_driver.get('https://saps4.sap.XXXX.de/sap/bc/ui5_ui5/ui2/ushell/shells/abap/FioriLaunchpad.html#Z_APSuche-display')
sleep(5)
cc_html = cc_driver.page_source
cc_content = soup(cc_html, 'html.parser')
print(cc_content.prettify())
cc_driver.close()
Now I am just surprised, because the printed out content is different than from firefox "inspect" function. For example, I can find the word "Nachname" from the firefox html content but not such word exists in the printed out html content from the code above:
Have someone an idea, why the printed out content is different?
Thank you for any help... Gunardi
the code you get from selenium is a the code without javascript process on it, then you shoul get the code from javascript using selenium interaction with javascipt,
String javascript = "return arguments[0].innerHTML"; String pageSource=(String)(JavascriptExecutor)driver) .executeScript(javascript, driver.findElement(By.tagName("html")enter code here)); pageSource = "<html>"+pageSource +"</html>"; System.out.println(pageSource);
Related
As the title said, I'd like to performa a Google Search using Selenium and then open all results of the first page on separate tabs.
Please have a look at the code, I can't get any further (it's just my 3rd day learning Python)
Thank you for your help !!
Code:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import pyautogui
query = 'New Search Query'
browser = webdriver.Chrome('/Users/MYUSERNAME/Desktop/Desktop-Files/Chromedriver/chromedriver')
browser.get('http://www.google.com')
search = browser.find_element_by_name('q')
search.send_keys(query)
search.send_keys(Keys.RETURN)
element = browser.find_element_by_class_name('LC20lb')
element.click()
The reason why I imported pyautogui is because I tried simulating a right click and then open in new tab for each result but it was a little confusing :)
Forget about pyautogui as what you want to do can be done in Selenium. Same with most of the rest. You just do not need it. See if this code meets your needs.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
query = 'sins of a solar empire' #my query about a video game
browser = webdriver.Chrome()
browser.get('http://www.google.com')
search = browser.find_element_by_name('q')
search.send_keys(query)
search.send_keys(Keys.RETURN)
links = browser.find_elements_by_class_name('r') #I went on Google Search and found the container class for the link
for link in links:
url = link.find_element_by_tag_name('a').get_attribute("href") #this code extracts the url of the HTML link
browser.execute_script('''window.open("{}","_blank");'''.format(url)) # this code uses Javascript to open a new tab and open the given url in that new tab
print(link.find_element_by_tag_name('a').get_attribute("href"))
I'm trying to parse data from this website
https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010
In particular, I am trying to get the data under Criterion(ITC). The text I want says CC+ECT
The information I want in html appears to be
<a class= js-glossary data-leg= "CC+ECT">
I'm new to web scraping and I tried the techniques taught in the tutorial but they didn't work. I heard about Selenium and tried this out too. However, this code didn't work either.
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
driver = webdriver.Firefox(executable_path = r"D:\Python work\driver\geckodriver.exe")
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')
data = soup.find_all("a", attrs= {"class":"js-glossary"})
The code results in an empty list. I also read that I can pull out the data by treating the soup tag like a dictionary. in this case
data["data-leg"]
Am I on the right track or am I way off?
The text you're trying to get generated dynamically by JavaScript. To get it you need to wait for its appearance:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
driver = webdriver.Firefox(executable_path = r"D:\Python work\driver\geckodriver.exe")
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
text = WebDriverWait(driver, 5).until(lambda driver: driver.find_element_by_xpath('//div[.="criterion(itc)"]/following-sibling::div').text)
print(text)
# 'CC + ECT'
Seems you were pretty close. You may not even require Beautiful Soup if you are using Selenium. Using Selenium you need to induce WebDriverwait for the desired element to be visible and you can use the following solution:
Code Block:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox(executable_path = r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get(r"https://findrulesoforigin.org/home/compare?reporter=392&partner=036&product=020130010")
print(WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[#class='lbl' and text()='criterion(itc)']//following::div[1]/a"))).get_attribute("innerHTML"))
Console Output:
CC + ECT
I'm trying to get the xpath of an element in site https://www.tradingview.com/symbols/BTCUSD/technicals/
Specifically the result under the summary speedometer. Whether it's buy or sell.
Speedometer
Using Google Chrome xpath I get the result
//*[#id="technicals-root"]/div/div/div[2]/div[2]/span[2]
and to try and get that data in python I plugged it into
from lxml import html
import requests
page = requests.get('https://www.tradingview.com/symbols/BTCUSD/technicals/')
tree = html.fromstring(page.content)
status = tree.xpath('//*[#id="technicals-root"]/div/div/div[2]/div[2]/span[2]/text()')
When I print status I get an empty array. But it doesn't seem like anything is wrong with the xpath. I've read that google does some shenanigans with incorrectly written HTML tables which will output the wrong xpath but that doesn't seem to be the issue.
When I run your code, the "technicals-root" div is empty. I assume javascript is filling it in. When you can't get a page statically, you can always turn to Selenium to run a browser and let it figure everything out. You may have to tweak the driver path to get it working in your environment but this works for me:
import time
import contextlib
import selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
option = webdriver.ChromeOptions()
option.add_argument(" — incognito")
with contextlib.closing(webdriver.Chrome(
executable_path='/usr/lib/chromium-browser/chromedriver',
chrome_options=option)) as browser:
browser.get('https://www.tradingview.com/symbols/BTCUSD/technicals/')
# wait until js has filled in the element - and a bit longer for js churn
WebDriverWait(browser, 20).until(EC.visibility_of_element_located(
(By.XPATH,
'//*[#id="technicals-root"]/div/div/div[2]/div[2]/span')))
time.sleep(1)
status = browser.find_elements_by_xpath(
'//*[#id="technicals-root"]/div/div/div[2]/div[2]/span[2]')
print(status[0].text)
I am trying to retrieve the comment section on regulations.gov pages. An example is the paragraph "Restrictions on Proprietary Trading... with free market driven valuations." on http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032.
I am using BeautifulSoup and Python and have the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032)
source = driver.page_source.encode('ascii', 'replace')
soup = BeautifulSoup(source)
print soup
commentHolder = soup.find("div", {"class":"GGAAYMKDDNE"})
print commentHolder
When I execute "print soup" I get an output (albeit a messy one), but when I execute "print commentHolder" I get "None" as the output. I am not quite sure why this is happening and would appreciate any help. Thank you.
Note: I used Selenium webdriver to try and get around the Javascript - is this a correct approach?
You need to let PhantomJS explicitly wait for the element to become present before reading the page_source. Worked for me:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.PhantomJS()
driver.get("http://www.regulations.gov/#!documentDetail;D=OCC-2011-0014-0032")
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div.GGAAYMKDGNE")))
I am trying to parse (623) 337-**** from a JS generated site. My code is :
from selenium import webdriver
import re
browser = webdriver.Firefox()
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')
content = browser.page_source
browser.quit()
m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
if m_obj:
print m_obj.group(0)
For some reason it doesn`t print anything. Any help is apreciated
Sidenote : Is there a faster way to do it in python
The problem is that some of the content is loaded dynamically via post page load ajax requests.
You should wait until an element becomes visible (documentation) - then get the source code of the page:
import re
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
browser = webdriver.Firefox()
browser.get('http://www.spokeo.com/search?q=Joe+Henderson,+Phoenix,+AZ&sao7=t104#:18643819031')
WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.ID, "profile_details_section_header")))
content = browser.page_source
m_obj = re.search(r"(\(\d{3}\)\s\d{3}-\*{4})", content)
if m_obj:
print m_obj.group(0)
browser.quit()
Or you can call time.sleep() or browser.implicitly_wait() instead - though it doesn't sound quite right.
Prints (623) 337-****.
Hope that helps.