Any faster way to find element by xpath in multiple pages? - python

I'm trying to read a directory element from a URL for multiple users. In the url variable, I change Userid={uid} each time, load the page, and read the directory element's text.
This gives me the output I want, but it is still pretty slow.
Is there a way to make this faster, such as reusing the same browser session or using parallel processing?
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
for uid in ['uid1', 'uid2', 'uid3']:
    url = f'https://domain-name.com/...?Userid={uid}&...'
    opt = Options()
    opt.headless = True
    driver = webdriver.Chrome(options=opt)
    driver.get(url)
    data = driver.find_element_by_xpath('//*[@id="directory"]').text
    print(data.strip())
    driver.close()
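One way to act on the "reuse the same browser session" idea is to start Chrome once and call get for each user ID, since launching a browser is usually the most expensive step in a loop like this. The sketch below assumes Selenium 4 (hence find_element with a locator string and --headless passed as an argument). make_driver, read_directories, and main are hypothetical helper names, and the URL keeps the elisions from the question.

```python
def make_driver():
    """Start one headless Chrome instance (needs Selenium 4 and Chrome installed)."""
    # Imported here so read_directories can be exercised without a browser.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opt = Options()
    opt.add_argument("--headless")
    return webdriver.Chrome(options=opt)


def read_directories(driver, url_template, uids, xpath='//*[@id="directory"]'):
    """Load one page per user ID in the same browser and collect the element text."""
    results = {}
    for uid in uids:
        driver.get(url_template.format(uid=uid))
        # "xpath" is the string value of selenium.webdriver.common.by.By.XPATH.
        results[uid] = driver.find_element("xpath", xpath).text.strip()
    return results


def main():
    # Same elided URL as in the question; the real one has to be filled in.
    template = "https://domain-name.com/...?Userid={uid}&..."
    driver = make_driver()
    try:
        found = read_directories(driver, template, ["uid1", "uid2", "uid3"])
        for uid, text in found.items():
            print(uid, text)
    finally:
        # quit() ends the whole browser process, not just the current window.
        driver.quit()
```

Calling main() opens a single headless Chrome for all user IDs. If that is still too slow, the same function could be run in a few threads with one driver per thread, at the cost of more memory.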

Related

I want to use Selenium to find specific text

I was going to use Selenium to crawl the web
from selenium import webdriver
from selenium.webdriver.common.by import By
import time
options = webdriver.ChromeOptions()
options.add_argument('headless')
driver = webdriver.Chrome('./chromedriver', options=options)
driver.get('https://steamdb.info/tag/1742/?all')
driver.implicitly_wait(3)
li = []
games = driver.find_elements_by_xpath('//*[@class="table-products.text-center.dataTable"]')
for i in games:
    time.sleep(5)
    li.append(i.get_attribute("href"))
print(li)
After opening the Steam URL I was interested in, I tried to find a value called appid.
The picture below is the HTML I'm looking for.
I'm trying to find the number next to "data-appid=".
But when I run my code, nothing is saved in "games".
Correct me if I'm wrong, but from what I can see this Steam page requires you to log in. Are you sure the same data is available to you when webdriver opens the page?
Additionally, when using By, the locator strategy and the selector are two separate arguments, and a dotted class list is CSS syntax rather than XPath, so the call would be games = driver.find_elements(By.CSS_SELECTOR, '.table-products.text-center.dataTable')
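A minimal sketch of reading the data-appid values directly, assuming Selenium 4 and assuming the page actually serves those elements to an automated browser (the login concern above may mean it does not). collect_app_ids and main are hypothetical names. The attribute selector [data-appid] matches any element carrying that attribute, so it does not depend on the table's class names.

```python
def collect_app_ids(driver):
    """Return the data-appid value of every element on the current page that has one."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    rows = driver.find_elements("css selector", "[data-appid]")
    return [row.get_attribute("data-appid") for row in rows]


def main():
    # Imported here so collect_app_ids can be exercised without a browser.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://steamdb.info/tag/1742/?all")
        print(collect_app_ids(driver))
    finally:
        driver.quit()
```

An empty list from main() would suggest that the elements are not in the page the driver received, rather than that the locator is wrong.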

Python - saving issue while scraping (Selenium, Chrome Driver)

I've recently started learning Python.
I have a file of 1000 words for which I want to do direct searches on a website, and extract the number of results for each word on the search results page.
I'm using Selenium and Chrome Driver, on Mac.
My script runs well, manages to input the keyword, submit the search, retrieve the output and save it correctly, c.f. screenshot of the output.
However past a certain point, I have no idea why, it starts saving the same output for a bunch of keywords.
Could that be due to wifi, Chrome driver? I have tried: Running the script while not using the computer, restarting my whole setup in between script runs, changing network, running the code in several batches, into smaller lists of keywords... I have no idea why it would stop behaving correctly past a certain point, would really appreciate any idea to solve this!
Script and imports below:
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import os
import time
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
import selenium.webdriver.support.ui as ui
import selenium.webdriver.support.expected_conditions as EC
from tqdm.notebook import tqdm
import re
URL = "XXX"
# Add list of words i.e. pull from csv.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--ignore-ssl-errors')
dir_path= '/Users/XXXXX/'
chromedriver = dir_path + "/chromedriver"
header_list=['Word']
words_df = pd.read_csv('/Users/XXXXX/results-file.csv', names=header_list)
words_list = words_df.Word.tolist()
os.environ["webdriver.chrome.driver"] = chromedriver
# Start the Driver
driver = webdriver.Chrome(options=options, executable_path = chromedriver)
# Hit the url and wait for 2 seconds.
time.sleep(2)
driver.get(URL)
# Empty dictionnary is created to store the data
dictionnary_numberresults = {}
for element in words_list[3000:10000]:
    # Enter word in searchbox, and wait for 3 seconds.
    driver.find_element_by_xpath("//input[@id='twotabsearchtextbox']").clear()
    driver.find_element_by_xpath("//input[@id='twotabsearchtextbox']").send_keys(element)
    driver.find_element_by_xpath("//input[@id='twotabsearchtextbox']").submit()
    time.sleep(3)
    # Web page fetched from driver is parsed using Beautiful Soup.
    HTMLPage = BeautifulSoup(driver.page_source, 'html.parser')
    Pagination = HTMLPage.find_all(class_="a-section a-spacing-small a-spacing-top-small")
    Number = re.findall('<span dir="auto">(.*)</span><span dir="auto">', str(Pagination), re.DOTALL)
    # Fall back to a placeholder when the pattern matches nothing.
    if not Number:
        Number = ['No results']
    dictionnary_numberresults[element] = Number[0]
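One possible explanation for the repeated output (not confirmed by the question) is that the fixed time.sleep(3) sometimes ends before the new results page has loaded, so driver.page_source still holds the previous page. The sketch below waits until the old search box has gone stale before parsing, using Selenium's WebDriverWait and expected_conditions.staleness_of. It assumes Selenium 4 and that submitting the search triggers a full page load. extract_result_count and search_word are hypothetical names, and the regex is the one from the script above made non-greedy.

```python
import re

# Same pattern as in the script above, but non-greedy so it stops at the first match.
RESULT_PATTERN = re.compile(r'<span dir="auto">(.*?)</span><span dir="auto">', re.DOTALL)


def extract_result_count(html):
    """Return the text of the first result-count span, or 'No results' if absent."""
    match = RESULT_PATTERN.search(html)
    return match.group(1) if match else "No results"


def search_word(driver, word, timeout=10):
    """Submit one search and parse the page only after the old page is gone."""
    # Third-party imports live here so extract_result_count works on its own.
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    box = driver.find_element(By.ID, "twotabsearchtextbox")
    box.clear()
    box.send_keys(word)
    box.submit()
    # The old search box is detached from the DOM once the new page replaces it.
    WebDriverWait(driver, timeout).until(EC.staleness_of(box))

    page = BeautifulSoup(driver.page_source, "html.parser")
    pagination = page.find_all(class_="a-section a-spacing-small a-spacing-top-small")
    return extract_result_count(str(pagination))
```

If the wait times out, the TimeoutException identifies the keyword whose page never changed, which is more informative than silently storing the previous result. The site may also start serving a different page after many automated requests, which this sketch does not detect.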

Grab CSV file from pop-up windows with Python

I am currently working on a project where I need to extract many files from a database, for which there is no API.
I need to do it through a webpage by constructing URLs similar to this one:
https://bmsnet.cas.dtu.dk/Trendlogs/ExportCSV_TrendlogRecordData/1
The integer at the end of the URL (1 in the example above) ranges from 1 to 35000. When I open such a URL, I get a pop-up window for saving the file, such as:
Pop-up window for file download
My question is how to automate that process using Python. I am able to generate these URLs and to handle the data from the downloaded file (so far after downloading it manually). The step I am stuck at is writing Python code that clicks the save button. Eventually I want code that does the following:
1. Construct the URL
2. Save the file arising from the pop-up window
3. Load/read and process the data
EDIT :
I have now found a solution using Selenium.
import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pyautogui
import time
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
dl_path = "MY_LOCAL_DOWNLOAD_PATH"
profile = FirefoxProfile()
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference("browser.download.dir", dl_path)
profile.set_preference("browser.helperApps.neverAsk.saveToDisk",
                       "text/plain,text/x-csv,text/csv,application/vnd.ms-excel,application/csv,application/x-csv,text/csv,text/comma-separated-values,text/x-comma-separated-values,text/tab-separated-values,application/pdf")
driver = webdriver.Firefox(firefox_profile=profile)
URL = "https://bmsnet.cas.dtu.dk"
driver.get(URL)
# Let the page load
time.sleep(5)
username = driver.find_element_by_id("Email")
password = driver.find_element_by_id("Password")
username.send_keys("my_username")
password.send_keys("my_password")
elem = driver.find_element_by_xpath("/html/body/div[2]/div/div[1]/section/form/div[4]/div/input")
elem.click()
time.sleep(5)
start = 1
stop = 10
for file_integer in range(start, stop):
    URL = "https://bmsnet.cas.dtu.dk/Trendlogs/ExportCSV_TrendlogRecordData/{0}".format(file_integer)
    driver.get(URL)
    time.sleep(5)
    print('Done downloading integer: {0}'.format(file_integer))
The above code works but only once. For some reason the for loop gets stuck after the first iteration. Any clue on what I am doing wrong there?
Thank you for your time and help. Looking forward to hearing your ideas on that.
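One possible reason for the loop stalling (an assumption, not something the question confirms) is that driver.get waits for a page load that never finishes when the response is a file download. A way to sidestep this is to log in with Selenium as above and then fetch the CSV files with the requests library, reusing the browser's cookies. This sketch assumes the site authenticates through cookies alone. export_url, cookie_dict, and download_range are hypothetical names, as is the trendlog_N.csv file naming.

```python
from pathlib import Path

BASE_URL = "https://bmsnet.cas.dtu.dk/Trendlogs/ExportCSV_TrendlogRecordData/{0}"


def export_url(file_integer):
    """Build the export URL for one trend log."""
    return BASE_URL.format(file_integer)


def cookie_dict(selenium_cookies):
    """Convert the list of dicts from driver.get_cookies() into a name -> value mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def download_range(driver, start, stop, dl_path):
    """Download files start..stop-1 over HTTP using the logged-in browser's cookies."""
    # Third-party import kept local so the helpers above work without it.
    import requests

    session = requests.Session()
    session.cookies.update(cookie_dict(driver.get_cookies()))

    out_dir = Path(dl_path)
    out_dir.mkdir(parents=True, exist_ok=True)
    for file_integer in range(start, stop):
        response = session.get(export_url(file_integer), timeout=60)
        response.raise_for_status()
        # Hypothetical file name; the name suggested by the server is ignored here.
        (out_dir / "trendlog_{0}.csv".format(file_integer)).write_bytes(response.content)
        print("Done downloading integer: {0}".format(file_integer))
```

After the login step in the script above, download_range(driver, 1, 10, dl_path) would replace the for loop. Because no browser navigation is involved, there is no save dialog and no page load to wait for.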

Python- Getting link of new webpage using selenium

I'm new to selenium and I wrote this code that gets user input and searches in ebay but I want to save the new link of the search so I can pass it on to BeautifulSoup.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
search_ = input()
browser = webdriver.Chrome(r'C:\Users\Leila\Downloads\chromedriver_win32')
browser.get("https://www.ebay.com.au/sch/i.html?_from=R40&_trksid=p2499334.m570.l1311.R1.TR12.TRC2.A0.H0.Xphones.TRS0&_nkw=phones&_sacat=0")
Search = browser.find_element_by_id('kw')
Search.send_keys(search_)
Search.send_keys(Keys.ENTER)
#how do you write a code that gets the link of the new page it loads
To extract a link from a webpage, you need to make use of the HREF attribute and use the get_attribute() method.
This example illustrates how it would work:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
driver = webdriver.Chrome(chrome_options=options)
driver.get('https://www.w3.org/')
for a in driver.find_elements_by_xpath('.//a'):
    print(a.get_attribute('href'))
In your case, the search box is an input element and has no href attribute. The address of the page that loads after the search is available from the browser itself:
page_link = browser.current_url
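The address of the page the browser is currently showing is exposed as browser.current_url, so the link of the results page can be read once the search has navigated. The sketch below polls that attribute until it changes. wait_for_url_change and search_and_get_url are hypothetical helper names, and it uses submit() on the search box instead of sending Keys.ENTER. Selenium's own WebDriverWait with expected_conditions.url_changes does the same job and is the more usual choice when Selenium is already imported.

```python
import time


def wait_for_url_change(browser, old_url, timeout=10.0, poll=0.2):
    """Poll browser.current_url until it differs from old_url, then return it."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if browser.current_url != old_url:
            return browser.current_url
        time.sleep(poll)
    raise TimeoutError("URL did not change within {0} seconds".format(timeout))


def search_and_get_url(browser, search_box, query, timeout=10.0):
    """Type a query into search_box, submit it, and return the results page URL."""
    old_url = browser.current_url
    search_box.send_keys(query)
    search_box.submit()
    return wait_for_url_change(browser, old_url, timeout)
```

With the code from the question, page_link = search_and_get_url(browser, Search, search_) would give a URL that can be handed to whatever fetches the HTML for BeautifulSoup. browser.page_source is also available if the HTML of the already loaded page is enough.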

Scrape website data without opening the browser (python)

I want to iteratively search for 30+ items through a search button in webpage and scrape the related data.
My search items are stored in a list: vol_list
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome("driver path")
driver.get("web url")
for item in vol_list:
    mc_search_box = driver.find_element_by_name("search_str")
    mc_search_box.clear()
    mc_search_box.send_keys(item)
    mc_search_box.send_keys(Keys.RETURN)
After search is complete I will proceed to scrape the data for each item and store in array/list.
Is it possible to repeat this process without opening browser for every item in the loop?
You can't use Chrome or other browsers without opening them.
In your case, a headless browser should do the job. A headless browser behaves like a browser but has no GUI.
Try GhostDriver, HtmlUnitDriver, or NodeJS. You will then have to change at least this line to use the driver you choose:
driver = webdriver.Chrome("driver path")
Good luck!
If you're using Firefox, you can apply the headless option:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
driver.get('your url')
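The same headless flag also works with Chrome, and one driver can serve every item in the list, so no browser is started inside the loop. This is a sketch assuming Selenium 4. make_headless_chrome and search_all are hypothetical names, scrape stands for whatever function extracts the data from the results page, and submit() is used in place of sending Keys.RETURN.

```python
def make_headless_chrome():
    """Start Chrome without a visible window (needs Selenium 4 and Chrome installed)."""
    # Imported here so search_all can be exercised without a browser.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    return webdriver.Chrome(options=options)


def search_all(driver, vol_list, scrape):
    """Search for each item in the same browser session and collect scrape(driver)."""
    results = {}
    for item in vol_list:
        # "name" is the string value of selenium's By.NAME.
        box = driver.find_element("name", "search_str")
        box.clear()
        box.send_keys(item)
        box.submit()
        # scrape is supplied by the caller and reads the results page.
        results[item] = scrape(driver)
    return results
```

A typical call would be driver = make_headless_chrome(), then driver.get("web url"), then search_all(driver, vol_list, my_scraper), followed by driver.quit(). If the results load slowly, scrape should wait for them before reading the page.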
