How to load multiple urls in driver.get()? - Python

I am trying to load 3 URLs with the code below, but how do I load the other 2 URLs?
After that, the next challenge is to pass authentication for all the URLs, which is the same for each.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome(executable_path=r"C:/Users/RYadav/AppData/Roaming/Microsoft/Windows/Start Menu/Programs/Python 3.8/chromedriver.exe")
driver.get("https://fleet.my.salesforce.com/reportbuilder/reportType.apexp")  # put here the address of your page
elem = driver.find_element_by_xpath('//*[@id="ext-gen63"]')  # put here the content you have put in Notepad, i.e. the XPath
button = driver.find_element_by_id('ext-gen63')
print(elem.get_attribute("class"))
button.click()
driver.close()

Try the code below:
from selenium import webdriver

def getUrls(targeturl):
    driver = webdriver.Chrome(executable_path=r" path for chromedriver.exe")
    driver.get("http://www." + targeturl + ".com")
    # perform your tasks here
    driver.quit()

webPages = ['google', 'facebook', 'gmail']
for page in webPages:
    print(page)
    getUrls(page)

You can't load more than one URL at a time in a single WebDriver instance. If you want to load pages in parallel, you need something like the multiprocessing module. If you want an iterative solution, just create a list with every URL you need and loop through it. That way you won't have the credential problem either.
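For the multiprocessing route, here is a minimal sketch, assuming chromedriver is on your PATH and each worker can afford its own browser instance (the URL list is just an example):
from multiprocessing import Pool

from selenium import webdriver

def fetch(url):
    # each process needs its own driver; WebDriver sessions can't be shared
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        return driver.title  # stand-in for whatever scraping you need
    finally:
        driver.quit()

if __name__ == '__main__':
    urls = ['https://www.google.com',
            'https://www.facebook.com',
            'https://mail.google.com']
    with Pool(processes=3) as pool:
        print(pool.map(fetch, urls))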

Related

Get all <thspan> contents in Python Selenium

Say that I have a piece of HTML code that looks like this:
<html>
<body>
<thspan class="sentence">He</thspan>
<thspan class="sentence">llo</thspan>
</body>
</html>
And I want to get the content of both <thspan> elements and connect them into one string in Python Selenium.
My current code looks like this:
from selenium import webdriver
from selenium.webdriver.common.by import By
browser = webdriver.Chrome()
thspans = browser.find_elements(By.CLASS_NAME, "sentence")
context = ""
for thspan in thspans:
    context.join(thspan.text)
The code can run without any problem, but the context variable doesn't contain anything. How can I get the content of both and connect them into a string in Python Selenium?
Use context += thspan.text instead of context.join(thspan.text), just like @Rajagopalan said. str.join builds and returns a new string (strings are immutable), and your code discards that return value, which is why context stayed empty.
Hi! You were not pointing the browser at the page you actually want to scrape the data from, and you were misusing the function join. Here is code that will work for you:
from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()
# Put the absolute path to your html file if you are working locally, or
# the URL of the domain you want to scrape
browser.get('file:///your/absolute/path/to/the/html/code/index.html')
thspans = browser.find_elements(By.CLASS_NAME, "sentence")
context = ''
print('thspans', thspans, end='\n\n')
for thspan in thspans:
    context += thspan.text
print(context)
Good luck!
Use this line without the loop:
context = "".join([thspan.text for thspan in thspans])

Python Scraper Won't Complete

I am using this code to scrape emails from Google search results. However, it only scrapes the first 10 results, despite having 100 search results loaded.
Ideally, I would like it to scrape all the search results.
Is there a reason for this?
from selenium import webdriver
import time
import re
import pandas as pd
PATH = r'C:\Program Files (x86)\chromedriver.exe'
l=list()
o={}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver=webdriver.Chrome(PATH)
driver.get(target_url)
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
html = driver.page_source
emails = re.findall(email_pattern, html)
time.sleep(10)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_excel('email_addresses_.xlsx',index=False)
# print(emails)
driver.close()
The code is working as expected: it scrapes 10 results, which is the default for a Google search page. You can use methods like find_element_by_xpath to find the next button and click it.
This needs to be done in a loop until sufficient results have been collected. Refer to the Selenium documentation on locating elements for more details.
For how to use the Selenium commands, you can look them up on the web; there are similar questions which can provide some reference.
Following up on Bijendra's answer,
you could update the code as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
import re
import pandas as pd

PATH = r'C:\Program Files (x86)\chromedriver.exe'
l = list()
o = {}
target_url = "https://www.google.com/search?q=solicitors+wales+%27email%27+%40&rlz=1C1CHBD_en-GBIT1013IT1013&sxsrf=AJOqlzWC1oRbVtWcmcIgC4-3ZnGkQ8sP_A%3A1675764565222&ei=VSPiY6WeDYyXrwStyaTwAQ&ved=0ahUKEwjlnIy9lYP9AhWMy4sKHa0kCR4Q4dUDCA8&uact=5&oq=solicitors+wales+%27email%27+%40&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAzIFCAAQogQyBwgAEB4QogQyBQgAEKIESgQIQRgASgQIRhgAUABYAGD4AmgAcAF4AIABc4gBc5IBAzAuMZgBAKABAcABAQ&sclient=gws-wiz-serp"
driver = webdriver.Chrome(PATH)
driver.get(target_url)
emails = []
email_pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}"
for i in range(2):
    html = driver.page_source
    for e in re.findall(email_pattern, html):
        emails.append(e)
    a_attr = driver.find_element(By.ID, "pnnext")
    a_attr.click()
    time.sleep(2)
df = pd.DataFrame(emails, columns=['Email Addresses'])
df.to_csv('email_addresses_.csv', index=False)
driver.close()
You could either change the range value passed to the for loop, or replace the for loop with a while loop entirely, so instead of
for i in range(2):
You could do:
while len(emails) < 100:
Make sure to manage the timing as the driver navigates between pages: wait for the next page to load before extracting the available emails, and only then click the next button on the search result page.
Make sure to refer to the docs to get a clear idea of what you should do to achieve what you want. Happy Hacking!!
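Here is a minimal sketch of that while-loop variant, reusing driver, emails and email_pattern from the code above and assuming Google's next-page link keeps its pnnext id (the 100-email cutoff is arbitrary):
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

while len(emails) < 100:
    emails.extend(re.findall(email_pattern, driver.page_source))
    try:
        # block until the "next" link is clickable instead of using a fixed sleep
        next_link = WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.ID, "pnnext")))
    except TimeoutException:
        break  # no further result pages
    next_link.click()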
Selenium loads its own clean browser profile, so your Google setting for 100 results needs to be set in the code; the default is 10 results, which is what you're getting. You will have better luck using query parameters and adding the one for the number of results to the end of your URL.
If you need further information on query parameters to achieve this, it's the second method described below:
tldevtech.com/how-to-show-100-results-per-page-in-google-search
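That second method boils down to Google's num query parameter; a small sketch, assuming Google still honours it, appended to the target_url from the code above:
# &num=100 asks Google for 100 results per page (not guaranteed to be honoured)
driver.get(target_url + "&num=100")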

xpath returns more than one result, how to handle in python

I have started using Selenium with Python. I am able to change the message text using find_element_by_id. I want to do the same with find_element_by_xpath, which is not successful because the XPath matches two elements. I want to try this out to learn about XPath.
I want to do web scraping of a page using Python, for which I need clarity on using XPath, mainly for going to the next page.
# This code works:
import time
from selenium import webdriver

driver = webdriver.Chrome()
url = "http://www.seleniumeasy.com/test/basic-first-form-demo.html"
driver.get(url)
eleUserMessage = driver.find_element_by_id("user-message")
eleUserMessage.clear()
eleUserMessage.send_keys("Testing Python")
time.sleep(2)
driver.close()

# This works fine. I wish to do the same with XPath.
# I inspect the input box in Chrome and copy the XPath '//*[@id="user-message"]',
# which seems to refer to the other box as well.
# I wish to use the XPath method to write text in this box as follows, which does not work.
driver = webdriver.Chrome()
url = "http://www.seleniumeasy.com/test/basic-first-form-demo.html"
driver.get(url)
eleUserMessage = driver.find_elements_by_xpath('//*[@id="user-message"]')
eleUserMessage.clear()
eleUserMessage.send_keys("Test Python")
time.sleep(2)
driver.close()
To elaborate on my comment, you would use a list like this:
eleUserMessage_list = driver.find_elements_by_xpath('//*[#id="user-message"]')
my_desired_element = eleUserMessage_list[0] # or maybe [1]
my_desired_element.clear()
my_desired_element.send_keys("Test Python")
time.sleep(2)
The only real difference between find_elements_by_xpath and find_element_by_xpath is that the first option returns a list that needs to be indexed. Once it's indexed, it works the same as if you had run the second option!
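Alternatively, you can make the XPath itself select a single match, so find_element_by_xpath (singular) works directly; a small sketch, assuming the first match is the input box you want:
# Parentheses group the match set; XPath indexing is 1-based
eleUserMessage = driver.find_element_by_xpath('(//*[@id="user-message"])[1]')
eleUserMessage.clear()
eleUserMessage.send_keys("Test Python")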

python parse evernote shared notebook

I am trying to get data from evernote 'shared notebook'.
For example, from this one: https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c
I tried to use Beautiful Soup:
import requests
from bs4 import BeautifulSoup

url = 'https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c'
r = requests.get(url)
bs = BeautifulSoup(r.text, 'html.parser')
bs
The result doesn't contain any text information from the notebook, only some code.
I also saw advice to use Selenium and find elements by XPath.
For example, I want to find the heading of this note, 'Term 3 Week2'. In Google Chrome I found that its XPath is '/html/body/div[1]/div[1]/b/span/u/b'.
So I tried this:
driver = webdriver.PhantomJS()
driver.get(url)
t = driver.find_element_by_xpath('/html/body/div[1]/div[1]/b/span/u/b')
But it also didn't work; the result was 'NoSuchElementException: ...'.
I am a newbie in Python and especially in parsing, so I would be glad to receive any help.
I am using Python 3.6.2 and Jupyter Notebook.
Thanks in advance.
The easiest way to interface with Evernote is to use their official Python API.
After you've configured your API key and can generally connect, you can then download and reference Notes and Notebooks.
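As a minimal sketch of that setup, assuming the official evernote SDK is installed (evernote on PyPI, or evernote3 for Python 3) and the token string below is a placeholder for your own:
from evernote.api.client import EvernoteClient

dev_token = "your-developer-token"  # placeholder; get one from dev.evernote.com
client = EvernoteClient(token=dev_token, sandbox=True)
note_store = client.get_note_store()
# list the notebooks this token can see
for notebook in note_store.listNotebooks():
    print(notebook.guid, notebook.name)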
Evernote Notes use their own template language called ENML (EverNote Markup Language) which is a subset of HTML. You'll be able to use BeautifulSoup4 to parse the ENML and extract the elements you're looking for.
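Once you have a note's content string, extracting text is ordinary BeautifulSoup work; a sketch, where note_content stands in for the ENML of one note:
from bs4 import BeautifulSoup

# ENML wraps the note body in <en-note> rather than <html>/<body>
note_content = '<en-note><div><b>Term 3 Week2</b></div><div>Some body text.</div></en-note>'
soup = BeautifulSoup(note_content, 'html.parser')
print(soup.find('b').get_text())       # -> Term 3 Week2
print(soup.get_text(separator='\n'))   # all text in the note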
If you're trying to extract information from a local installation (instead of their web app), you may also be able to get what you need from the executable, by passing arguments to the local install to extract data. For this you're going to need the Python 3 subprocess module.
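The general shape of that is shown below; the executable name and flags are hypothetical stand-ins, since they depend on what your local client actually accepts:
import subprocess

# "evernote-client", "--export-note" and "--output" are hypothetical;
# substitute the real executable and arguments for your installation.
result = subprocess.run(
    ["evernote-client", "--export-note", "some-note-id", "--output", "note.enml"],
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
print(result.returncode)
print(result.stdout)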
HOWEVER
If you want to use selenium, this will get you started:
import selenium.webdriver.support.ui as ui
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# your example URL
URL = 'https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c'
# create the browser interface, and a generic "wait" that we can use
# to intelligently block while the driver looks for elements we expect.
# 10: maximum wait in seconds
# 0.5: polling interval in seconds
driver = Chrome()
wait = ui.WebDriverWait(driver, 10, 0.5)
driver.get(URL)
# Note contents are loaded in an iFrame element
find_iframe = By.CSS_SELECTOR, 'iframe.gwt-Frame'
find_html = By.TAG_NAME, 'html'
# .. so we have to wait for the iframe to exist, switch our driver context
# and then wait for that internal page to load.
wait.until(EC.frame_to_be_available_and_switch_to_it(find_iframe))
wait.until(EC.visibility_of_element_located(find_html))
# since ENML is "just" HTML we can select the top tag and get all the
# contents inside it.
doc = driver.find_element_by_tag_name('html')
print(doc.get_attribute('innerHTML')) # <-- this is what you want
# cleanup our browser instance
driver.quit()

extracting more information from webdriver

I have written code to extract the mobile models from the following website:
"http://www.kart123.com/mobiles/pr?p%5B%5D=sort%3Dfeatured&sid=tyy%2C4io&ref=659eb948-c365-492c-99ef-59bd9f0427c6"
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://www.kart123.com/mobiles/pr?p%5B%5D=sort%3Dfeatured&sid=tyy%2C4io&ref=659eb948-c365-492c-99ef-59bd9f0427c6")
elem = driver.find_elements_by_xpath('.//div[@class="pu-title fk-font-13"]')
for e in elem:
    print(e.text)
Everything is working fine, but the problem arises at the end of the page: it is showing the contents of the first page only. Please could you help me with what I can do in order to get all the models?
This will get you on your way. I would use while loops with sleep to get the whole page loaded before pulling the information from it.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

driver = webdriver.Firefox()
driver.get("http://www.flipkart.com/mobiles/pr?p%5B%5D=sort%3Dfeatured&sid=tyy%2C4io&ref=659eb948-c365-492c-99ef-59bd9f0427c6")
time.sleep(3)
for i in range(5):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")  # scroll to bottom of page
    time.sleep(2)
    driver.find_element_by_xpath('//*[@id="show-more-results"]').click()  # click load more button, needs to be done until you reach the end
elem = driver.find_elements_by_xpath('.//div[@class="pu-title fk-font-13"]')
for e in elem:
    print(e.text)
OK, this is going to be a major hack, but here goes... The site gets more phones as you scroll down by hitting an AJAX script that gives you 20 more each time. The script it's hitting is this:
http://www.flipkart.com/mobiles/pr?p[]=sort%3Dpopularity&sid=tyy%2C4io&start=1&ref=8aef4a5f-3429-45c9-8b0e-41b05a9e7d28&ajax=true
Notice the start parameter; you can hack this into what you want with:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
num = 1
while num <= 2450:
    """
    This condition will need to be updated to the maximum number
    of models you're interested in (or if you're feeling brave try to extract
    this from the top of the page)
    """
    driver.get("http://www.flipkart.com/mobiles/pr?p[]=sort%3Dpopularity&sid=tyy%2C4io&start=%d&ref=8aef4a5f-3429-45c9-8b0e-41b05a9e7d28&ajax=true" % num)
    elem = driver.find_elements_by_xpath('.//div[@class="pu-title fk-font-13"]')
    for e in elem:
        print(e.text)
    num += 20
You'll be making about 123 GET requests, so this will be quite slow...
You can get the full source of the page and do all the analysis based on it:
page_text = driver.page_source
The page will contain the current content, including whatever was generated by JavaScript. Be careful to grab this content only once all the rendering is completed (you may e.g. wait for the presence of some string which gets rendered at the end).
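A sketch of that wait-for-a-string idea; "pu-title" here is just the class from the code above, and the marker should be whatever your page renders last:
from selenium.webdriver.support.ui import WebDriverWait

# wait until a marker that appears only after JS rendering shows up
WebDriverWait(driver, 10).until(
    lambda d: "pu-title" in d.page_source)
page_text = driver.page_source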
