I am scraping a webpage using Selenium in Python. I am able to locate the elements using this code:
from selenium import webdriver
import codecs
driver = webdriver.Chrome()
driver.get("url")
results_table=driver.find_elements_by_xpath('//*[@id="content"]/table[1]/tbody/tr')
Each element in results_table is in turn a set of sub-elements, with the number of sub-elements varying from element to element. My goal is to output each element, as a list or as a delimited string, into an output file. My code so far is this:
results_file=codecs.open(path+"results.txt","w","cp1252")
for i, element in enumerate(results_table):
    element_fields=element.find_elements_by_xpath(".//*[text()][count(*)=0]")
    element_list=[field.text for field in element_fields]
    stuff_to_write='#'.join(element_list)+"\r\n"
    results_file.write(stuff_to_write)
    #print(i)
results_file.close()
driver.quit()
This second part of the code takes about 2.5 minutes on a list of ~400 elements, each with about 10 sub-elements. I get the desired output, but it is too slow. What could I do to improve the performance?
Using Python 3.6.
Download the whole page in one shot, then use something like BeautifulSoup to process it. I haven't used Splinter or Selenium in a while, but in Splinter, .html will give you the page. I'm not sure what the syntax for that is in Selenium, but there should be a way to grab the whole page.
Selenium (and Splinter, which is layered on top of Selenium) are notoriously slow for randomly accessing web page content. Looks like .page_source may give the entire contents of the page in Selenium, which I found at stackoverflow.com/questions/35486374/…. If reading all the chunks on the page one at a time is killing your performance (and it probably is), reading the whole page once and processing it offline will be oodles faster.
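A rough sketch of that approach applied to the code in the question (the CSS selector and the leaf-field test are my approximations of the original XPaths; path is assumed to be defined as in the question):
from selenium import webdriver
from bs4 import BeautifulSoup
import codecs

driver = webdriver.Chrome()
driver.get("url")
# one round-trip to the browser, then everything else happens offline
soup = BeautifulSoup(driver.page_source, "lxml")
driver.quit()

rows = soup.select('#content > table:nth-of-type(1) > tbody > tr')
with codecs.open(path + "results.txt", "w", "cp1252") as results_file:
    for row in rows:
        # leaf elements that contain text, mirroring .//*[text()][count(*)=0]
        fields = [el.get_text(strip=True) for el in row.find_all()
                  if not el.find(True) and el.get_text(strip=True)]
        results_file.write("#".join(fields) + "\r\n")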
In terms of performance, it is obvious that web scraping with BeautifulSoup is much faster than using a webdriver with Selenium. However, I don't know of any other way to get content from a dynamic web page. I thought the difference came from the time the browser needs to load elements, but it is definitely more than that. Once the browser loads the page (5 seconds), all I had to do was extract some <tr> tags from a table. It took about 3-4 minutes to extract 1016 records, which is extremely slow in my opinion. I came to the conclusion that the webdriver methods for finding elements, such as find_elements_by_name, are slow. Is find_elements_by_... from the webdriver much slower than the find method in BeautifulSoup? And would it be faster if I got the whole HTML from the webdriver's browser and then parsed it with lxml and BeautifulSoup?
Yes, it would be much faster to use Selenium only to get the HTML, after waiting for the page to be ready, and then use BeautifulSoup or lxml to parse that HTML.
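A minimal sketch of that workflow with lxml (the URL, the wait, and the table XPath are placeholders, not taken from the question):
from selenium import webdriver
from lxml import html

driver = webdriver.Chrome()
driver.get("http://example.com/table-page")
# ... wait here until the table has actually been rendered ...

tree = html.fromstring(driver.page_source)  # single call to the browser
driver.quit()

# all parsing now happens in-process, with no further webdriver round-trips
rows = tree.xpath('//table[@id="results"]//tr')
records = [[cell.text_content().strip() for cell in row.xpath('./td')] for row in rows]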
Another option could be to use Puppeteer either only to get the HTML or to get the info that you want directly. It should also be faster than Selenium. There are some unofficial python bindings for it: pyppeteer
Look into 2 options:
1) Sometimes these dynamic pages actually have the data within <script> tags in a valid JSON format. You can use requests to get the HTML, BeautifulSoup will get the <script> tag, and then you can use json.loads() to parse it (a rough sketch follows after this answer).
2) Go directly to the source. Look at the dev tools and search the XHR requests to see if you can go directly to the URL/API that generates the data and get it that way (most likely, again, in JSON format). In my opinion, this is by far the better/faster option if available.
If you can provide the url, I can check to see if either of these options apply to your situation.
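For option 1, a rough sketch (the URL and the script tag's type attribute are placeholders; inspect the actual page source to find the tag that holds the embedded JSON):
import json
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://example.com/page")
soup = BeautifulSoup(resp.text, "lxml")
# find the <script> tag carrying the embedded data -- the attribute here is a guess
script = soup.find("script", {"type": "application/json"})
data = json.loads(script.string)
print(data)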
Web scraping with Python, whether with Selenium or BeautifulSoup, should be a part of the testing strategy. Putting it straight: if your intent is to scrape static content, BeautifulSoup is unmatched. But in case the website content is dynamically rendered, Selenium is the way to go.
Having said that, BeautifulSoup won't wait for dynamic content which isn't readily present in the DOM tree once page loading completes. Whereas, using Selenium, you have Implicit Wait and Explicit Wait at your disposal to locate the desired dynamic elements.
Finally, find_elements_by_name() may be marginally more expensive in terms of performance, as Selenium translates it into its equivalent find_element_by_css_selector(). You can find some more details in this discussion.
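For completeness, a minimal sketch of an Explicit Wait (the URL and locator below are placeholders, purely for illustration):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://example.com")
# block for up to 10 seconds until the dynamically rendered element shows up
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "table#results tr"))
)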
Outro
Official locator strategies for the webdriver
You could also try evaluating in JavaScript. For example, this:
item = driver.execute_script("""return {
    div: document.querySelector('div').innerText,
    h2: document.querySelector('h2').innerText
}""")
will be at least 10x faster than this:
item = {
    "div": driver.find_element_by_css_selector('div').text,
    "h2": driver.find_element_by_css_selector('h2').text
}
I wouldn't be surprised if it was faster than BS a lot of the time too.
I'm trying to scrape data from this website called Anhembi
But when I try all the options from Selenium to find elements, I get nothing. Does anyone know why this happens?
I've already tried:
driver.find_element_by_xpath('//*[@class="agenda_result_laco_box"]')
and then a for-loop through the results to click on every single one and get the info I need, which consists of the day, website and name of the events. How can I do that?
Clearly, there is an iframe involved; you need to switch the focus of your web driver in order to interact with elements that are inside an iframe/frameset/frame.
You can try this code:
driver.get("http://www.anhembi.com.br/agenda/")
# switch the driver's focus to the iframe that actually contains the agenda boxes
driver.switch_to.frame(driver.find_element_by_css_selector("iframe[src='http://intranet.spturis.com.br/intranet/modulos/booking/anhembisite_busca.php']"))
all_data = driver.find_elements_by_css_selector("div.agenda_result_laco_box")
print(len(all_data))
for data in all_data:
    print(data.text)
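If you later need to interact with elements outside the iframe, switch the focus back with driver.switch_to.default_content().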
I'm trying to scroll down a specific div in a webpage (ticker box in facebook) using selenium (with python).
I can find the element, but when I try to scroll it down using:
header = driver.find_element_by_class_name("tickerActivityStories")
driver.execute_script('arguments[0].scrollTop = arguments[0].scrollHeight', header)
Nothing happens.
I've found a lot of answers, but most are for Selenium in Java, so I would like to know if it's possible to do it from Python.
I've found a lot of answers, but most are for Selenium in Java
There is no difference; Selenium has the same API in Java and Python, you just need to find the equivalent function in Python Selenium.
First of all, check if your JS works in the browser console (if not, try scrolling the parent element).
You can also try:
from selenium.webdriver.common.keys import Keys
header.send_keys(Keys.PAGE_DOWN)
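If neither of those moves anything, a small loop that scrolls the element in steps is also worth trying (the step size and iteration count below are arbitrary; driver and the class name are taken from the question):
import time

header = driver.find_element_by_class_name("tickerActivityStories")
for _ in range(10):
    # nudge the element's own scroll position by 300px each pass
    driver.execute_script("arguments[0].scrollTop += 300;", header)
    time.sleep(0.5)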
I work with Selenium in Python 2.7. I get that loading a page and similar things take far longer than with raw requests, because it simulates everything, including JS, etc.
The thing I don't understand is why parsing an already loaded page takes so long.
Every time a page is loaded, I find all tags meeting some condition (about 30 div tags) and then pass each tag as an argument to a parsing function. For parsing I'm using css_selectors and similar methods, like: on.find_element_by_css_selector("div.carrier p").text
As far as I understand, once the page is loaded, its source code is saved in my RAM or somewhere else, so parsing should be done in milliseconds.
EDIT: I bet that parsing the same source code using BeautifulSoup would be more than 10 times faster but I don't understand why.
Do you have any explanation? Thanks
These are different tools for different purposes. Selenium is a browser automation tool that has a rich set of techniques to locate elements. BeautifulSoup is an HTML parser. When you find an element with Selenium - this is not an HTML parsing. In other words, driver.find_element_by_id("myid") and soup.find(id="myid") are very different things.
When you ask selenium to find an element, say, using find_element_by_css_selector(), there is an HTTP request being sent to /session/$sessionId/element endpoint by the JSON wire protocol. Then, your selenium python client would receive a response and return you a WebElement instance if everything went without errors. You can think of it as a real-time/dynamic thing, you are getting a real Web Element that is "living" in a browser, you can control and interact with it.
With BeautifulSoup, once you download the page source, there is no network component anymore, no real-time interaction with a page and the element, there is only HTML parsing involved.
In practice, if you are doing web-scraping and you need a real browser to execute javascript and handle AJAX, and you are doing a complex HTML parsing afterwards, it would make sense to get the desired .page_source and feed it to BeautifulSoup, or, even better in terms of speed - lxml.html.
Note that, in cases like this, usually there is no need for the complete HTML source of the page. To make the HTML parsing faster, you can feed an "inner" or "outer" HTML of the page block containing the desired data to the HTML parser of your choice. For example:
from bs4 import BeautifulSoup

container = driver.find_element_by_id("container").get_attribute("outerHTML")
driver.close()
soup = BeautifulSoup(container, "lxml")
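Or, with lxml.html instead of BeautifulSoup (the XPath assumes the class attribute is exactly "carrier", matching the div.carrier p example above):
from lxml import html

tree = html.fromstring(container)
texts = [p.text_content().strip() for p in tree.xpath('//div[@class="carrier"]/p')]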
http://www.vliz.be/vmdcdata/mangroves/aphia.php?p=browser&id=235056&expand=true#ct
(That's the information I am trying to scrape)
I want to scrape these detailed taxonomic trees so that I can manipulate them any way I like.
But there are a few problems in getting this tree data.
I can't fully expand the taxonomic tree: when some nodes are expanded, others collapse, as the instructions indicate.
So saving the full page as an HTML file cannot solve my problem.
Alternatively, I could repeat the process several times to get separate files and concatenate them, but that seems like an ugly way to do it.
I am tired of clicking; there are so many "plus" signs and I have to wait.
Is there a way to solve this using Python?
Use Selenium; this will expand the tree by clicking on the "plus" signs and then get the entire DOM with all the elements in it after it's done:
from selenium import webdriver
import time
browser=webdriver.Chrome()
browser.get('http://www.vliz.be/vmdcdata/mangroves/aphia.php?p=browser&id=235301&expand=true#ct')
while True:
    try:
        # click the next unexpanded "plus" icon; stop when none are left
        elem=browser.find_elements_by_xpath('.//*[@src="http://www.marinespecies.org/images/aphia/pnode.gif" or @src="http://www.marinespecies.org/images/aphia/plastnode.gif"]')[1]
        elem.click()
        time.sleep(2)
    except:
        break

content=browser.page_source
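Once the loop finishes, content can be parsed offline rather than with further webdriver calls, for example (the selector below is deliberately broad; narrow it once you know which elements actually hold the taxon names):
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "lxml")
names = [a.get_text(strip=True) for a in soup.find_all("a")]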