How to get HTML code after logging in? - python

I am quite new to Selenium; it would be great if you could point me in the right direction.
I'm trying to access the HTML source of a website AFTER the login sequence.
I've used Selenium to direct the browser through the login sequence; the part of the HTML I need only shows up after I log in. But when I try to grab the HTML after the login sequence with page_source, it just gives me the HTML for the site before logging in.
def test_script(ticker):
    base_url = "http://amigobulls.com/stocks/%s/income-statement/quarterly" % ticker
    driver = webdriver.Firefox()
    verificationErrors = []
    accept_next_alert = True
    driver.get(base_url)
    driver.maximize_window()
    driver.implicitly_wait(30)
    driver.find_element_by_xpath("//header[@id='header_cont']/nav/div[4]/div/span[3]").click()
    driver.find_element_by_id("login_email").clear()
    driver.find_element_by_id("login_email").send_keys(email)
    driver.find_element_by_id("login_pswd").clear()
    driver.find_element_by_id("login_pswd").send_keys(pwd)
    driver.find_element_by_id("loginbtn").click()
    amigo_script = driver.page_source
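A likely fix is to wait for an element that only exists on the logged-in page before reading page_source. A minimal sketch under that assumption (the ID "income_table" is hypothetical; substitute a real post-login element):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_id("loginbtn").click()
# Block until an element that only appears after login is present;
# "income_table" is a hypothetical ID -- replace it with a real post-login element.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, "income_table")))
amigo_script = driver.page_source  # now reflects the logged-in page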

Related

Stuck in hCaptcha loop

I am using Selenium and Python plus the 2Captcha API. I was able to retrieve the tokens successfully and even submit the form using JS.
The form is submitted, but the link keeps reloading, so I cannot get past the hCaptcha loop.
Here is my code:
def Solver(self, browser):
    WebDriverWait(browser, 60).until(EC.frame_to_be_available_and_switch_to_it(
        (By.XPATH, '//*[@id="cf-hcaptcha-container"]/div[2]/iframe')))
    captcha = CaptchaRecaptcha()
    url = browser.current_url
    code = captcha.HCaptcha(url)
    # Literal JS braces are doubled so str.format only substitutes the token.
    script = ("let submitToken = (token) => {{"
              "document.querySelector('[name=h-captcha-response]').innerText = token;"
              "document.querySelector('.challenge-form').submit();"
              "}}; submitToken('{}')").format(code)
    script1 = f"document.getElementsByName('h-captcha-response')[0].innerText='{code}'"
    print(script)
    browser.execute_script(script)
    time.sleep(5)
    browser.switch_to.parent_frame()
    time.sleep(10)
I am using proxies in the webdriver and also switching the user agent.
Can someone please explain what I am doing wrong, or what I should do to break out of the loop?
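As a side note: building the submit script with str.format is fragile, because the literal JavaScript braces collide with the format placeholders. A sketch of an alternative that sidesteps escaping entirely by passing the token through execute_script's arguments array (standard Selenium behaviour):

# The token arrives as arguments[0] inside the script, so no string
# formatting or brace escaping is needed.
browser.execute_script(
    "document.querySelector('[name=h-captcha-response]').innerText = arguments[0];"
    "document.querySelector('.challenge-form').submit();",
    code)

Whether this breaks the reload loop depends on the server-side checks; if Cloudflare is fingerprinting the browser itself, a correctly submitted token may still be rejected.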

https://www.realestate.com.au/ not permitting web scraping?

I am trying to extract data from https://www.realestate.com.au/
First I create my URL based on the type of property I am looking for, and then I open the URL using the Selenium WebDriver, but the page is blank!
Any idea why this happens? Is it because this website doesn't give permission for web scraping? Is there any way to scrape this website?
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
PostCode = "2153"
propertyType = "house"
minBedrooms = "3"
maxBedrooms = "4"
page = "1"
url = "https://www.realestate.com.au/sold/property-{p}-with-{mib}-bedrooms-in-{po}/list-{pa}?maxBeds={mab}&includeSurrounding=false".format(p = propertyType, mib = minBedrooms, po = PostCode, pa = page, mab = maxBedrooms)
print(url)
# url should be "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
driver = webdriver.Edge("./msedgedriver.exe") # edit the address to where your driver is located
driver.get(url)
time.sleep(3)
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
print(soup)
You are passing the link incorrectly; try
driver.get("your link")
See the API docs: https://selenium-python.readthedocs.io/api.html?highlight=get#:~:text=ef_driver.get(%22http%3A//www.google.co.in/%22)
I did try to access realestate.com.au through Selenium, and in a different use case through Scrapy.
I even got results from Scrapy crawls by using a proper user-agent and cookies, but after a few days realestate.com.au detects Selenium / Scrapy and blocks the requests.
Additionally, it is clearly written in their terms & conditions that indexing any content on their website is strictly prohibited.
You can find more information / analysis in these questions:
Chrome browser initiated through ChromeDriver gets detected
selenium isn't loading the page
The bottom line is, you have to get past their security if you want to scrape the content.
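For what it's worth, a common first step is making the automated browser look less like one. A minimal sketch using standard Chrome options (no guarantee it defeats this particular site's detection):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Use a realistic user-agent string; substitute a current one for your platform.
options.add_argument(
    "user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36")
# Remove the "controlled by automated test software" banner and
# the navigator.webdriver automation hint.
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)

Keep in mind the terms & conditions quoted above; evading detection does not make the scraping permitted.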

Issue when scraping data on McMaster-Carr website

I'm writing a crawler for McMaster-Carr. Take the page https://www.mcmaster.com/98173A200 as an example: if I open it directly in a browser, I can view all the product data.
Because the data is dynamically loaded, I'm using Selenium + bs4.
if __name__ == "__main__":
    url = "https://www.mcmaster.com/98173A200"
    options = webdriver.ChromeOptions()
    options.add_argument("--enable-javascript")
    driver = webdriver.Chrome("C:/chromedriver/chromedriver.exe", options=options)
    driver.set_page_load_timeout(20)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    delay = 20
    try:
        email_input = WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.ID, 'MainContent')))
    except TimeoutException:
        print("Timeout loading DOM!")
    print(soup)
However, when I run the code I get a login dialog, which I don't get when I open the page directly in a browser, as mentioned above.
I also tried logging in with the code below:
try:
    email_input = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.ID, 'Email')))
    print("Page is ready!!")
    input("Press Enter to continue...")
except TimeoutException:
    print("Loading took too much time!")
email_input.send_keys(email)
password_input = driver.find_element_by_id('Password')
password_input.send_keys(password)
login_button = driver.find_element_by_class_name("FormButton_primaryButton__1kNXY")
login_button.click()
Then it shows access restricted.
I compared the request headers of the page opened by Selenium with those in my browser and couldn't find anything wrong. I also tried other webdrivers like PhantomJS and Firefox, and got the same result.
I also tried a random user-agent with the code below:
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem

software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names,
                               operating_systems=operating_systems,
                               limit=100)
user_agent = user_agent_rotator.get_random_user_agent()
chrome_options = Options()
chrome_options.add_argument('user-agent=' + user_agent)
Still the same result.
The developer tools in the page opened by Selenium showed a bunch of errors. I guess the token-authorization one is the key to this issue, but I don't know what I should do about it.
Any help would be appreciated!
The reason you see a login window is that you are accessing McMaster-Carr via ChromeDriver; when the server recognizes this behaviour, it requires you to sign in.
A typical login won't work if you haven't been authorized by McMaster-Carr (you need to sign an NDA).
You should look into the McMaster-Carr API. With the API you can access the database directly. However, you need to sign an NDA with McMaster-Carr before obtaining access: https://www.mcmaster.com/help/api/

How to get the new open page source?

I'm writing a Python crawler using the Selenium library and the PhantomJS browser. I trigger a click event on a page to open a new page, and then call the browser.page_source method, but I get the original page's source instead of the newly opened page's. How do I get the source of the newly opened page?
Here's my code:
import requests
from selenium import webdriver
url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
html = browser.page_source
print(html)
browser.quit()
You need to switch to the new window first:
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
browser.switch_to.window(browser.window_handles[-1])
html = browser.page_source
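If the new window is slow to open, window_handles may still contain only one handle at the moment you switch. A small sketch that waits for the second handle first:

from selenium.webdriver.support.ui import WebDriverWait

browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
# Wait (up to 10 s) until the new window/tab has actually been opened.
WebDriverWait(browser, 10).until(lambda b: len(b.window_handles) > 1)
browser.switch_to.window(browser.window_handles[-1])
html = browser.page_source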
I believe you need to add a wait before getting the page source.
I've used an implicit wait in the code below.
from selenium import webdriver
url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
browser.implicitly_wait(5)
html = browser.page_source
browser.quit()
It would be better to use an explicit wait, but that requires a condition, e.g. EC.element_to_be_clickable((By.ID, 'someid')).
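A sketch of that explicit-wait variant ('someid' is a placeholder for an element that actually exists on the target page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
# Block until the chosen element is clickable, or raise TimeoutException.
WebDriverWait(browser, 10).until(
    EC.element_to_be_clickable((By.ID, 'someid')))
html = browser.page_source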

How to save browser session using selenium python

I am trying to write a Python script which changes the language of YouTube, so that the next time I open the browser using the webdriver it shows me the YouTube page in the language I saved before. The problem is that my script is not working as expected and I am not sure why.
I select the language and save the cookies in the following code:
url='https://www.youtube.com'
print(url)
#page = requests.get(url)
#open web browser
browser = webdriver.Firefox()
#load specific url
browser.get(url)
#wait to load js
time.sleep(5)
#find language picker and click
browser.find_element_by_xpath('//*[@id="yt-picker-language-button"]').click()
#wait for the language list to open
time.sleep(2)
#find and click the specific language
browser.find_element_by_xpath('//*[@id="yt-picker-language-footer"]/div[2]/form/div/div[1]/button[1]').click()
pickle.dump(browser.get_cookies() , open("youtubeCookies.pkl","wb"))
Loading the data from the cookies:
url='https://www.youtube.com'
print(url)
driver=webdriver.Firefox()
driver.get(url)
for cookie in pickle.load(open("youtubeCookies.pkl", "rb")):
    driver.add_cookie(cookie)
time.sleep(3)
driver.refresh()
Please guide me on what I am doing wrong.
Thank you
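One thing worth checking: cookie dicts returned by get_cookies() sometimes contain values that add_cookie rejects (for example a float expiry, or a sameSite value some driver versions don't accept), and one failing add_cookie call aborts the whole loop. A defensive sketch of the load step, under that assumption:

import pickle
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.youtube.com')  # must be on the cookie's domain before add_cookie

for cookie in pickle.load(open("youtubeCookies.pkl", "rb")):
    cookie.pop('sameSite', None)              # some driver versions reject it
    if isinstance(cookie.get('expiry'), float):
        cookie['expiry'] = int(cookie['expiry'])
    try:
        driver.add_cookie(cookie)
    except Exception as exc:
        print("Skipping cookie %r: %s" % (cookie.get('name'), exc))

driver.refresh()  # reload so the restored cookies take effect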
