I want to know from which page the API request came, so I tried:
from selenium import webdriver

@test_bp.route("/test2")
def test2():
    options = webdriver.ChromeOptions()
    options.add_experimental_option("excludeSwitches", ["enable-logging"])
    driver = webdriver.Chrome(executable_path=r"D:\pathto/chromedriver.exe", options=options)
    path = driver.current_url
    print(path)
    return ""
and got this from print(path):
data:,
I do get the proper URL if I call driver.get(URL) before path = driver.current_url. However, I can't pass a URL that I don't yet know at the moment I'm handling the API request.
Is there any other way? Or should I just give up and ask for the additional page information from the client side?
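If you do end up taking the client-side route, here is a minimal sketch of what the Flask side could look like, relying on the standard Referer header (exposed by Flask as request.referrer, and not guaranteed to be sent by every browser):

from flask import request

@test_bp.route("/test2")
def test2():
    # request.referrer is the Referer header: the page that issued this
    # request, when the browser chose to send it; otherwise it is None.
    page = request.referrer
    print(page)
    return ""

If the header is missing, the client can instead send the page explicitly, e.g. as a query parameter or a field in the request body.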
I need to get a screenshot of the page. The site https://kad.arbitr.ru/ blocks Selenium. After pressing the search button the site does nothing. In the inspector I see a POST request made via XHR. How can I execute that POST request against this site? Sorry for the stupid question, I'm a newbie in Python. Maybe you can suggest another solution.
# driver.requests (used below) comes from the selenium-wire package, not plain Selenium
from seleniumwire import webdriver
from selenium.webdriver.firefox.options import Options

fp = webdriver.FirefoxProfile()
options = Options()
# options.headless = True
driver = webdriver.Firefox(options=options, executable_path=r'FILES\\geckodriver.exe', firefox_profile=fp)
driver.get('https://kad.arbitr.ru/')
# pickle.dump(driver.get_cookies(), open("cookies.pkl", "wb"))  # cookies
# response = webdriver.request('POST', 'https://kad.arbitr.ru/Kad/SearchInstances')
# print(response)

for request in driver.requests:
    if request.response:
        print(
            request.url,
            request.response.status_code,
            request.response.headers['Content-Type']
        )
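As for the POST request itself: since the snippet above already iterates driver.requests (selenium-wire), you can capture the body and headers of the XHR that firing the search button produces, and optionally replay it with requests. A rough sketch, assuming the endpoint is the Kad/SearchInstances URL already mentioned in the commented-out line; the payload format is whatever the page actually posts:

import requests

# Inspect the XHR that the page itself fired when the search button was pressed
for req in driver.requests:
    if req.method == 'POST' and 'Kad/SearchInstances' in req.url:
        print(req.url)
        print(req.headers)
        print(req.body.decode('utf-8', errors='replace'))  # the payload the page sent

        # Optionally replay it outside the browser, reusing the browser's
        # cookies and headers; the site may still reject non-browser clients.
        cookies = {c['name']: c['value'] for c in driver.get_cookies()}
        headers = {
            'Content-Type': req.headers.get('Content-Type', ''),
            'User-Agent': req.headers.get('User-Agent', ''),
        }
        resp = requests.post(req.url, data=req.body, headers=headers, cookies=cookies)
        print(resp.status_code)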
You can take screenshots in a simple way using driver.save_screenshot('screenshot.png').
You can read more in the first link a Google search turns up.
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from PIL import Image
from Screenshot import Screenshot_Clipping

fp = webdriver.FirefoxProfile()
options = Options()  # not sure what Options is in your code
driver = webdriver.Firefox(options=options, executable_path=r'FILES\\geckodriver.exe', firefox_profile=fp)
driver.get('https://kad.arbitr.ru/')

driver.save_screenshot('ss.png')  # this is the answer
screenshot = Image.open('ss.png')
screenshot.show()

# another method, using the Screenshot package
ss = Screenshot_Clipping.Screenshot()
image = ss.full_Screenshot(driver, save_path=r'.', image_name='name.png')
You can take a screenshot with selenium pretty easily:
driver.save_screenshot("image.png")
This question is so popular that there are dozens of APIs dedicated to taking screenshots of a webpage. You can find them with a Google search.
My goal is to get the cookies that appear in the request headers.
I have tried using requests, selenium, and selenium-wire, but the results I got from them are not the same as the ones I see in the browser (Chrome).
Code I tried:
from seleniumwire import webdriver
import pickle, time
print("Start\n")
options = webdriver.ChromeOptions()
options.add_argument('--no-sandbox')
options.add_argument('window-size=414,736')
options.add_argument('--disable-gpu')
options.add_argument('--hide-scrollbars')
options.add_argument('blink-settings=imagesEnabled=false')
options.add_argument('--headless')
options.add_argument('--enable-file-cookies')
## Get the URL
browser = webdriver.Chrome(executable_path='/usr/bin/chromedriver', options=options)
browser.get("https://www.xxxxxxx.com/xxxxxx")
time.sleep(5)
print(browser.get_cookies())
browser.quit()
I also tried using the requests library in Python, but the result is the same as with Selenium.
These are the request headers I found in the browser dev tools and want to get from Selenium:
Yes.
To get the cookies:
browser.get("https://www.example.com")
# get all the cookies for this domain
cookies = browser.get_cookies()
# store them somewhere, maybe a text file
To restore the cookies:
browser.get("https://www.example.com")
# add a cookie back (add_cookie takes one dict at a time)
cookie = {'name': 'foo', 'value': 'bar'}
browser.add_cookie(cookie)
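A fuller sketch of that save/restore round trip, using pickle (which the question's code already imports); cookies.pkl is just an example file name:

import pickle
from selenium import webdriver

browser = webdriver.Chrome()
browser.get("https://www.example.com")

# save every cookie Selenium can see for this domain
with open("cookies.pkl", "wb") as f:
    pickle.dump(browser.get_cookies(), f)

# later, in a fresh session: visit the same domain first, then re-add the cookies
browser.get("https://www.example.com")
with open("cookies.pkl", "rb") as f:
    for cookie in pickle.load(f):
        browser.add_cookie(cookie)
browser.refresh()  # reload so the restored cookies take effect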
I'm writing a crawler for McMaster-Carr. Take the page https://www.mcmaster.com/98173A200 as an example: if I open it directly in a browser, I can view all the product data.
Because the data is loaded dynamically, I'm using Selenium + bs4.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

if __name__ == "__main__":
    url = "https://www.mcmaster.com/98173A200"
    options = webdriver.ChromeOptions()
    options.add_argument("--enable-javascript")
    driver = webdriver.Chrome("C:/chromedriver/chromedriver.exe", options=options)
    driver.set_page_load_timeout(20)
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    delay = 20
    try:
        email_input = WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.ID, 'MainContent')))
    except TimeoutException:
        print("Timeout loading DOM!")
    print(soup)
However, when I run the code I get a login dialog, which I don't get when I open this page directly in a browser, as mentioned above.
I also tried logging in with the code below:
try:
    email_input = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.ID, 'Email')))
    print("Page is ready!!")
    input("Press Enter to continue...")
except TimeoutException:
    print("Loading took too much time!")

email_input.send_keys(email)
password_input = driver.find_element_by_id('Password')
password_input.send_keys(password)
login_button = driver.find_element_by_class_name("FormButton_primaryButton__1kNXY")
login_button.click()
Then it shows access restricted.
I compared the request headers in the page opened by Selenium with those in my browser and couldn't find anything wrong. I also tried other webdrivers such as PhantomJS and Firefox, and got the same result.
I also tried using a random user agent with the code below:
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
from selenium.webdriver.chrome.options import Options

software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names,
                               operating_systems=operating_systems,
                               limit=100)
user_agent = user_agent_rotator.get_random_user_agent()
chrome_options = Options()
chrome_options.add_argument('user-agent=' + user_agent)
Still the same result.
The developer tools in the page opened by Selenium showed a bunch of errors. I guess the tokenauthorization one is the key to this issue, but I don't know what I should do with it.
Any help would be appreciated!
The reason you see a login window is that you are accessing McMaster-Carr via ChromeDriver. When the server recognizes this behaviour, it requires you to sign in.
A typical login won't work if you haven't been authorized by McMaster-Carr (you need to sign an NDA).
You should look into the McMaster-Carr API. With the API, you can access the database directly. However, you need to sign an NDA with McMaster-Carr before obtaining access to the API: https://www.mcmaster.com/help/api/
I've written a script in Python using Selenium with proxies to get the text of the different links that populate upon navigating to a URL, as in this one. What I want to parse from there is the visible text connected to each link.
The script I've tried so far is capable of producing a new proxy each time the function start_script() is called. The problem is that the URL sometimes leads me to this redirected link. I can get rid of this redirection only by retrying until the URL accepts a proxy, and my current script can only try twice, with two new proxies.
How can I use a loop within the get_texts() function so that it keeps trying new proxies until it parses the required content?
My attempt so far:
import requests
import random
from itertools import cycle
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'http://www.google.com/search?q=python'

def get_proxies():
    response = requests.get('https://www.us-proxy.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
    return proxies

def start_script():
    proxies = get_proxies()
    random.shuffle(proxies)
    proxy = next(cycle(proxies))
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument(f'--proxy-server={proxy}')
    driver = webdriver.Chrome(chrome_options=chrome_options)
    return driver

def get_texts(url):
    driver = start_script()
    driver.get(url)
    if "index?continue" not in driver.current_url:
        for item in [items.text for items in driver.find_elements_by_tag_name("h3")]:
            print(item)
    else:
        get_texts(url)

if __name__ == '__main__':
    get_texts(link)
The code below works well for me; however, it can't help you with bad proxies. It loops through the list of proxies and tries them one by one until one succeeds or the list runs out.
It prints which proxy it uses so that you can see that it tries more than once.
However, as https://www.us-proxy.org/ points out:
"What is Google proxy? Proxies that support searching on Google are called Google proxy. Some programs need them to make large number of queries on Google. Since year 2016, all the Google proxies are dead. Read that article for more information."
The article referenced there, "Google Blocks Proxy in 2016", says:
"Google shows a page to verify that you are a human instead of the robot if a proxy is detected. Before the year 2016, Google allows using that proxy for some time if you can pass this human verification."
from contextlib import contextmanager
import random
from bs4 import BeautifulSoup
import requests
from selenium import webdriver

def get_proxies():
    response = requests.get('https://www.us-proxy.org/')
    soup = BeautifulSoup(response.text, "lxml")
    proxies = [':'.join([item.select_one("td").text, item.select_one("td:nth-of-type(2)").text]) for item in soup.select("table.table tbody tr") if "yes" in item.text]
    random.shuffle(proxies)
    return proxies

# Only need to fetch the proxies once
PROXIES = get_proxies()

@contextmanager
def proxy_driver():
    try:
        proxy = PROXIES.pop()
        print(f'Running with proxy {proxy}')
        chrome_options = webdriver.ChromeOptions()
        # chrome_options.add_argument("--headless")
        chrome_options.add_argument(f'--proxy-server={proxy}')
        driver = webdriver.Chrome(options=chrome_options)
        yield driver
    finally:
        driver.close()

def get_texts(url):
    with proxy_driver() as driver:
        driver.get(url)
        if "index?continue" not in driver.current_url:
            return [items.text for items in driver.find_elements_by_tag_name("h3")]
        print('recaptcha')

if __name__ == '__main__':
    link = 'http://www.google.com/search?q=python'
    while True:
        links = get_texts(link)
        if links:
            break
    print(links)
while True:
    driver = start_script()
    driver.get(url)
    if "index?continue" in driver.current_url:
        continue
    else:
        break
This will loop until index?continue is not in the URL, and then break out of the loop.
This answer only addresses your specific question; it doesn't address the problem that you might be creating a large number of web drivers but never destroying the unused / failed ones. Hint: you should.
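For completeness, a minimal sketch of that cleanup, reusing start_script() and url from the question's code; quit() releases the browser even when the proxy turns out to be bad:

from selenium.common.exceptions import WebDriverException

while True:
    driver = start_script()
    try:
        driver.get(url)
        if "index?continue" not in driver.current_url:
            break          # usable page; keep this driver open for scraping
        driver.quit()      # redirected: discard this session and its proxy
    except WebDriverException:
        driver.quit()      # bad proxy or timeout: clean up before retrying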
I am quite new to Selenium; it would be great if you could point me in the right direction.
I'm trying to access the HTML code of a website AFTER the login sequence.
I've used Selenium to direct the browser through the login sequence; the part of the HTML I need only shows up after I log in. But when I tried to grab the HTML after the login sequence with page_source, it just gave me the HTML for the site before logging in.
from selenium import webdriver

def test_script(ticker):
    base_url = "http://amigobulls.com/stocks/%s/income-statement/quarterly" % ticker
    driver = webdriver.Firefox()
    verificationErrors = []
    accept_next_alert = True
    driver.get(base_url)
    driver.maximize_window()
    driver.implicitly_wait(30)
    driver.find_element_by_xpath("//header[@id='header_cont']/nav/div[4]/div/span[3]").click()
    driver.find_element_by_id("login_email").clear()
    driver.find_element_by_id("login_email").send_keys(email)
    driver.find_element_by_id("login_pswd").clear()
    driver.find_element_by_id("login_pswd").send_keys(pwd)
    driver.find_element_by_id("loginbtn").click()
    amigo_script = driver.page_source
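page_source here is read immediately after the click, so it may be captured before the post-login content has rendered. A minimal sketch of one way around that: wait for an element that only exists after login before grabbing the source. The ID post_login_marker is a placeholder; substitute whatever element actually appears on the logged-in page.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.find_element_by_id("loginbtn").click()

# Wait until something only the logged-in page contains is present;
# "post_login_marker" is a hypothetical ID used here for illustration.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.ID, "post_login_marker")))

amigo_script = driver.page_source  # now reflects the logged-in DOM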