Scrape website data without opening the browser (Python)

I want to iteratively search for 30+ items through a search button in webpage and scrape the related data.
My search items are stored in a list: vol_list
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("driver path")
driver.get("web url")

for item in vol_list:
    search_box = driver.find_element_by_name("search_str")
    search_box.clear()
    search_box.send_keys(item)
    search_box.send_keys(Keys.RETURN)
After search is complete I will proceed to scrape the data for each item and store in array/list.
Is it possible to repeat this process without opening browser for every item in the loop?

You can't use Chrome or other regular browsers without opening them.
In your case, a headless browser should do the job. A headless browser behaves like a normal browser but has no GUI.
Try GhostDriver (PhantomJS), the HtmlUnit driver, or a NodeJS-based one. You will then have to modify at least this line to use the driver you choose:
driver = webdriver.Chrome("driver path")
Good luck!

If you're using Firefox, you can apply the headless option:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
driver.get('your url')

Related

I want to use Selenium to find specific text

I was going to use Selenium to crawl the web
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome('./chromedriver', options=options)
driver.get('https://steamdb.info/tag/1742/?all')
driver.implicitly_wait(3)

li = []
games = driver.find_elements_by_xpath('//*[@class="table-products.text-center.dataTable"]')
for i in games:
    time.sleep(5)
    li.append(i.get_attribute("href"))
print(li)
After accessing the Steam URL I was looking for, I tried to find something called an appid.
The picture below is the HTML I'm looking for.
I'm trying to find the number next to "data-appid=".
But when I run my code, nothing is saved in games.
Correct me if I'm wrong, but from what I can see this Steam page requires you to log in. Are you sure that when the webdriver opens the page, the same data is available to you?
Additionally, when using By the correct syntax would be games = driver.find_elements(By.XPATH, '//*[@class="table-products text-center dataTable"]'). Note that in the HTML the class attribute is space-separated, not dot-separated.
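To see why the dotted class value fails: in HTML the class attribute is stored as a space-separated string, so an XPath predicate comparing it to a dotted, CSS-style value never matches. A stdlib-only demonstration (using ElementTree's limited XPath support, not Selenium; the markup is a minimal stand-in for the steamdb table):

```python
import xml.etree.ElementTree as ET

# A minimal stand-in for the steamdb table markup.
html = '<div><table class="table-products text-center dataTable"></table></div>'
root = ET.fromstring(html)

# CSS-style dots never match the attribute value: no results.
assert root.findall(".//*[@class='table-products.text-center.dataTable']") == []

# The literal, space-separated value does match.
assert len(root.findall(".//*[@class='table-products text-center dataTable']")) == 1
```

With Selenium itself, `contains(@class, 'dataTable')` or a CSS selector such as `table.table-products.text-center.dataTable` (where the dots are correct syntax) is usually more robust than comparing the whole class string.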

Any faster way to find element by xpath in multiple pages?

I'm trying to find a directory element in a URL for multiple users. In the url variable, I change Userid={uid} each time to fetch the page and read the directory element's text.
I get the output this way, but it is still pretty slow.
Is there a way to make this faster, like reusing the same browser session or using parallel processing?
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

for uid in ['uid1', 'uid2', 'uid3']:
    url = f'https://domain-name.com/...?Userid={uid}&...'
    opt = Options()
    opt.headless = True
    driver = webdriver.Chrome(options=opt)
    driver.get(url)
    data = driver.find_element_by_xpath('//*[@id="directory"]').text
    print(data.strip())
    driver.close()

How to scrape the actual data from the website in headless mode chrome python

from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
opts = Options()
opts.set_headless()
assert opts.headless # Operating in headless mode
browser = Chrome(executable_path=r"C:\Users\taksh\AppData\Local\Programs\Python\Python37-32\chromedriver.exe", options=opts)
browser.implicitly_wait(3)
browser.get('https://ca.finance.yahoo.com/quote/AMZN/profile?p=AMZN')
results = browser.find_elements_by_xpath('//*[@id="quote-header-info"]/div[3]/div/div/span[1]')
print(results)
And I get back:
[<selenium.webdriver.remote.webelement.WebElement (session="b3f4e2760ffec62836828e62530f082e", element="3e2741ee-8e7e-4181-9b76-e3a731cefecf")>]
What I actually want Selenium to scrape is the price of the stock. I thought I was doing it correctly, because this XPath found the element when I used Selenium on Chrome without headless mode. How can I scrape the actual data from the website in headless mode?
You need to further extract the data after getting all the elements in a list.
results = browser.find_elements_by_xpath('//*[@id="quote-header-info"]/div[3]/div/div/span[1]')
for result in results:
    print(result.text)
This will display the text of every element in the list.
The same XPath/locator can match multiple elements in the HTML, so it helps to wrap the lookup in a try/except when checking in headless mode.
Headless mode still loads the same HTML, so to debug this, try a different version of the XPath, for example going to the span's parent element and traversing down from it.

Python- Getting link of new webpage using selenium

I'm new to Selenium. I wrote this code that takes user input and searches eBay, but I want to save the link of the search results page so I can pass it on to BeautifulSoup.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
search_ = input()
browser = webdriver.Chrome(r'C:\Users\Leila\Downloads\chromedriver_win32')
browser.get("https://www.ebay.com.au/sch/i.html?_from=R40&_trksid=p2499334.m570.l1311.R1.TR12.TRC2.A0.H0.Xphones.TRS0&_nkw=phones&_sacat=0")
Search = browser.find_element_by_id('kw')
Search.send_keys(search_)
Search.send_keys(Keys.ENTER)
#how do you write a code that gets the link of the new page it loads
To extract a link from a webpage, you need to make use of the HREF attribute and use the get_attribute() method.
This example from here illustrates how it would work.
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.w3.org/')
for a in driver.find_elements_by_xpath('.//a'):
    print(a.get_attribute('href'))
In your case, though, the search box is an input element and has no href. To get the link of the new page after the search submits, read it from the driver itself:
page_link = browser.current_url
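Since the goal is to hand the results page over to BeautifulSoup, here is a minimal sketch (assuming bs4 is installed; the helper name is mine): after Keys.ENTER has loaded the results, the driver's current_url gives the new link and page_source gives the HTML to parse.

```python
from bs4 import BeautifulSoup

def handoff_to_soup(browser):
    # Call this after the search has submitted and the results page loaded.
    page_link = browser.current_url                      # link of the new page
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    return page_link, soup
```

page_link can then be stored or logged, and soup queried with the usual BeautifulSoup methods.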

Trying to open a tab in my opened browser with selenium

I have written a small python script with selenium to search Google and open the first link but whenever I run this script, it opens a console and open a new Chrome window and run this script in that Chrome window.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pyautogui
def main():
    setup()

# open Chrome and open Google
def setup():
    driver = webdriver.Chrome(r'C:\\python_programs' +
                              '(Starting_out_python)' +
                              '\\chromedriver.exe')
    driver.get('https://www.google.com')
    assert 'Google' in driver.title
    mySearch(driver)

# Search keyword
def mySearch(driver):
    search = driver.find_element_by_id("lst-ib")
    search.clear()
    search.send_keys("Beautiful Islam")
    search.send_keys(Keys.RETURN)
    first_link(driver)

# click first link
def first_link(driver):
    link = driver.find_elements_by_class_name("r")
    link1 = link[0]
    link1.click()

main()
How can I open this in the same browser I am using?
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
def main():
    setup()

# open Chrome and open Google
def setup():
    driver = webdriver.Chrome()
    driver.get('https://www.google.com')
    assert 'Google' in driver.title
    mySearch(driver)

# Search keyword
def mySearch(driver):
    search = driver.find_element_by_id("lst-ib")
    search.clear()
    search.send_keys("test")
    search.send_keys(Keys.RETURN)
    first_link(driver)

# click first link
def first_link(driver):
    link = driver.find_elements_by_xpath("//a[@href]")
    # uncomment to see each href of the found links
    # for i in link:
    #     print(i.get_attribute("href"))
    first_link = link[0]
    url = first_link.get_attribute("href")
    driver.execute_script("window.open('about:blank', 'tab2');")
    driver.switch_to.window("tab2")
    driver.get(url)
    # Do something else with this new tab now

main()
A few observations: the first link you get might not be the first link you want. In my case, the first link is the login to a Google account. So you might want to do some more validation before you open it, like checking its href property, or checking its text to see if it matches something, etc.
Another observation is that there are easier ways of crawling Google search results, such as using Google's API directly or a third-party implementation like this: https://pypi.python.org/pypi/google or https://pypi.python.org/pypi/google-search
To my knowledge, there's no way to attach Selenium to an already-running browser.
More to the point, why do you want to do that? The only thing I can think of is if you're trying to set up something with the browser manually, and then having Selenium do things to it from that manually-set-up state. If you want your tests to run as consistently as possible, you shouldn't be relying on a human setting up the browser in a particular way; the script should do this itself.
