How to accelerate this process of JavaScript web-page scraping? - python

This Python function aims to scrape a specific identifier (called a PMID) from a JavaScript web page. When a URL is passed to the function, it loads the page using Selenium. The code then tries to find an <a> tag with the class "pubmedLink" in the HTML. If found, it returns the extracted PMID to another function.
This works fine, but it is really slow. Is there a way to accelerate the process, maybe by using another parser or a completely different method?
from selenium import webdriver

def _getPMIDfromURL_(url):
    driver = webdriver.Chrome('/usr/protoLivingSystematicReviews/drivers/chromedriver')
    driver.get(url)
    try:
        if driver.find_element_by_css_selector('a.pubmedLink').is_displayed():
            json_text = driver.find_element_by_css_selector('a.pubmedLink').text
            return json_text
    except:
        return "no_pmid"
    driver.quit()
Examples of URLs for the JS web page:
http://www.embase.com/search/results?subaction=viewrecord&from=export&id=L617434973
http://www.embase.com/search/results?subaction=viewrecord&from=export&id=L617388849
http://www.embase.com/search/results?subaction=viewrecord&from=export&id=L46141767

Well, Selenium is fast; that's why it is a favorite of many testers. On the other hand, you could improve your code by parsing the content once instead of twice.
The return value of the statement
driver.find_element_by_css_selector('a.pubmedLink')
can be stored in a variable, and that variable reused. This will improve your speed by about 1.5x.
try:
    elem = driver.find_element_by_css_selector('a.pubmedLink')
    if elem.is_displayed():
        return elem.text
except:
    return "no_pmid"

You can try PhantomJS; it's faster:
https://realpython.com/headless-selenium-testing-with-python-and-phantomjs/
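A minimal sketch of that swap, assuming an older Selenium release where webdriver.PhantomJS is still available and a locally installed PhantomJS binary (the executable path below is a placeholder):

from selenium import webdriver

# PhantomJS is headless, so no browser window has to be rendered while the page loads.
# The executable path is a placeholder; point it at your own PhantomJS install.
driver = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
try:
    driver.get(url)
    elem = driver.find_element_by_css_selector('a.pubmedLink')
    print(elem.text)
finally:
    driver.quit()

The rest of the function stays the same; only the driver construction changes.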

Related

Xpath returns empty array - lxml

I'm trying to write a program that scrapes https://www.tcgplayer.com/ to get a list of Pokemon TCG prices based on a specified list of cards.
from lxml import etree, html
import requests
import string

def clean_text(element):
    all_text = element.text_content()
    cleaned = ' '.join(all_text.split())
    return cleaned

page = requests.get("http://www.tcgplayer.com/product/231462/pokemon-first-partner-pack-pikachu?xid=pi731833d1-f2cc-4043-9551-4ca08506b43a&page=1&Language=English")
tree = html.fromstring(page.content)
price = tree.xpath("/html/body/div[2]/div/div/section[2]/section/div/div[2]/section[3]/div/section[1]/ul/li[1]/span[2]")
print(price)
However, when I run this code the output ends up just being an empty list: []
I have tried using selenium and the browser function that it has, however I would like it to not need to open a browser for 100+ cards to get the price data. I have tested this code on another website url and xpath (https://www.pricecharting.com/game/pokemon-promo/jolteon-v-swsh183, /html/body/div[1]/div[2]/div/div/table/tbody[1]/tr[1]/td[4]/span[1]) - so I wonder if it is just how https://www.tcgplayer.com/ is built.
The expected return value is around $5
Question answered by @Grismar:
When you test the XPath on a site, you probably do this in the Developer Console in the browser, after the page has loaded. At that point in time, any JavaScript will have already executed and completed, and the page may have been updated or even been constructed from scratch by it. When using requests, it just loads the basic page and no scripts get executed - you'll need something that can execute JavaScript to get the same result, like Selenium.
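A minimal sketch of that Selenium approach, reusing the URL and XPath from the question; the headless flag and the 10-second timeout are assumptions:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = ("http://www.tcgplayer.com/product/231462/pokemon-first-partner-pack-pikachu"
       "?xid=pi731833d1-f2cc-4043-9551-4ca08506b43a&page=1&Language=English")
xpath = "/html/body/div[2]/div/div/section[2]/section/div/div[2]/section[3]/div/section[1]/ul/li[1]/span[2]"

options = Options()
options.add_argument("--headless")  # avoids opening a visible browser window for 100+ cards
driver = webdriver.Chrome(options=options)
try:
    driver.get(url)
    # wait until the JavaScript-rendered price element is actually present in the DOM
    price = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.XPATH, xpath))
    )
    print(price.text)
finally:
    driver.quit()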

Get a page with Selenium but wait for unknown element value to not be empty

Context
This is a repost of Get a page with Selenium but wait for element value to not be empty, which was closed without any valid reason so far as I can tell.
The linked answers in the closure reasoning both rely on knowing what the expected text value will be. In each answer, the expected text is explicitly hardcoded into the WebDriverWait call. Furthermore, neither of the linked answers even remotely touches upon the final part of my question:
[whether the expected conditions] come before or after the page Get
"Duplicate" Questions
How to extract data from the following html?
Assert if text within an element contains specific partial text
Original Question
I'm grabbing a web page using Selenium, but I need to wait for a certain value to load. I don't know what the value will be, only what element it will be present in.
It seems that using the expected condition text_to_be_present_in_element_value or text_to_be_present_in_element is the most likely way forward, but I'm having difficulty finding any actual documentation on how to use these and I don't know if they come before or after the page Get:
webdriver.get(url)
Rephrase
How do I get a page using Selenium but wait for an unknown text value to populate an element's text or value before continuing?
I'm sure my answer is not the best one, but here is part of my own code, which helped me with a problem similar to yours.
In my case I had trouble with the loading time of the DOM. Sometimes it took 5 seconds, sometimes 1 second, and so on.
url = 'www.somesite.com'
browser.get(url)
Because browser.implicitly_wait(7) was not enough in my case, I made a simple for loop to check whether the content has loaded.
# some code...
for try_html in range(7):
    """Make 7 tries to check if the element is loaded."""
    browser.implicitly_wait(7)
    html = browser.page_source
    soup = BeautifulSoup(html, 'lxml')
    raw_data = soup.find_all('script', type='application/ld+json')
    # If 'sku' is not found in the html page we skip to
    # another loop iteration, else we break out of the
    # tries and scrape the page.
    if 'sku' not in html:
        continue
    else:
        scrape(raw_data)
        break
It's not perfect, but you can try it.
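For the original question (an unknown text value), a minimal sketch of the WebDriverWait route, assuming the element can be located by a CSS selector; "#target" is a hypothetical locator, and the wait goes after the get:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def non_empty_text(locator):
    """Custom wait condition: return the element once its text is non-empty."""
    def _predicate(driver):
        elem = driver.find_element(*locator)
        return elem if elem.text.strip() else False
    return _predicate

url = 'https://www.somesite.com'  # placeholder URL, as in the answer above
driver = webdriver.Chrome()
driver.get(url)  # the wait comes after the get
# wait up to 10 seconds for the (unknown) text to populate the element
elem = WebDriverWait(driver, 10).until(non_empty_text((By.CSS_SELECTOR, "#target")))
print(elem.text)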

Export data from dynamic website with BS4 + Python

I want to export all store data from the following website into an Excel file:
https://www.ybpn.de/ihre-parfuemerien
The problem: the map is "dynamic", so the needed data only loads once you enter a postal code.
The data I need is stored in div elements with the class "storefinder__list-item", each with a unique reference in its data-storefinder-reference attribute, for example data-storefinder-reference="132".
I tried:
soup.find("div", {"data-storefinder-reference": "132"})
But the output is: NONE
I think this problem is caused by the fact that the page is dynamic: the needed data only loads once you enter a postal code. So when I search for the reference id "132" it is "there", but not yet loaded into the page, and bs4 can't find this id.
Any ideas to improve the code?
For this you might need to look into tools like Selenium and/or headless Firefox.
Selenium in particular allows you to "remote-control" web pages with Python.
Here is a tutorial: https://realpython.com/modern-web-automation-with-python-and-selenium/
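A minimal sketch of that approach with headless Firefox, reusing the URL and the data-storefinder-reference lookup from the question; entering the postal code would still need a site-specific send_keys step, which is left as a comment:

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # run Firefox without a visible window
driver = webdriver.Firefox(options=options)
try:
    driver.get("https://www.ybpn.de/ihre-parfuemerien")
    # ... enter the postal code here with a site-specific find_element(...).send_keys(...) ...
    # once the JavaScript has rendered the store list, hand the live DOM to bs4
    soup = BeautifulSoup(driver.page_source, "html.parser")
    store = soup.find("div", {"data-storefinder-reference": "132"})
    print(store)
finally:
    driver.quit()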
If the problem is waiting for the page to load, you can do it with selenium.
result = driver.execute_script('var text = document.title ; return text')
If there is jQuery on the page, this will certainly work:
result = driver.execute_script("""
    $(document).ready(function () {
        var $text = $('yourselector').text();
        return $text;
    });
""")
Note: For selenium you can look here
You could just open the page in Chrome or Firefox, open the web debug console and query the elements. If you see them, they are in the DOM and thus queryable. But that will be done in JavaScript. If you're lucky they use jQuery.

Selenium Webdriver Timeout (Python 2.7)

When scraping data from NASDAQ, there are tickers like ACHC that have empty pages (screenshot: ACHC Empty Field).
My program iterates through all ticker symbols, and when I get to this one it times out because there is no data to grasp. I am trying to figure out a way to check if there is nothing and, if so, skip the ticker but continue the loop. The code is pretty long, so I'll post the most relevant part: the beginning of the loop, where it opens the page:
## navigate to income statement annual page
url = url_form.format(symbol, "income-statement")
browser.get(url)
company_xpath = "//h1[contains(text(), 'Company Financials')]"
company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
annuals_xpath = "//thead/tr[th[1][text() = 'Period Ending:']]/th[position()>=3]"
annuals = get_elements(browser, annuals_xpath)
Here is a pic of the error message
Selenium doesn't have a built-in method for determining whether an element exists or not, so the most common thing to do is use a try/except block.
from selenium.common.exceptions import TimeoutException
...
try:
    company = WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.XPATH, company_xpath))).text
except TimeoutException:
    continue
This should keep the loop going without crashing, assuming that continue works as expected with your loop.
You can use libraries like requests or urllib to scrape that web page and check whether what you need is there. These libraries are much faster than Selenium because they just fetch the source of the page. If there are particular tags or structures, like tables, that you're looking for, take a look at BeautifulSoup, which you can use with requests to identify very specific parts of the page.
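A minimal sketch of that pre-check, assuming the 'Company Financials' heading shows up in the raw HTML served to requests (if the page builds it with JavaScript, this check would need Selenium after all); page_has_financials is a hypothetical helper name:

import requests
from bs4 import BeautifulSoup

def page_has_financials(url):
    """Cheap pre-check with requests before starting the slow Selenium work."""
    resp = requests.get(url, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    # look for the same 'Company Financials' heading the XPath above waits on
    heading = soup.find("h1", string=lambda s: s and "Company Financials" in s)
    return heading is not None

Inside the ticker loop, a check like "if not page_has_financials(url): continue" would then skip the empty pages before the 10-second Selenium timeout ever triggers.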

Python Splinter clicking button CSS

I'm having trouble selecting a button in my Splinter script using the find_by_css method. The documentation is sparse at best, and I haven't found a lot of good articles out there with examples.
br.find_by_css('div#edit-field-download-files-und-0 a.button.launcher').first.click()
...where br is my browser instance.
I've tried a few different ways of writing it. I'm really not sure how I'm supposed to do it because the documentation doesn't give any hard examples of the syntax.
Here's a screenshot of the element.
Sorry the screenshot kind of sucks.
Does anyone have any experience with this?
The CSS selector looks alright; I'm just not sure where you got find_by_css as a method from.
How about this:
br.find_element_by_css_selector("div#edit-field-download-files-und-0 a.button.launcher").click()
Selenium provides the following methods to locate elements in a page:
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector
To find multiple elements (these methods will return a list):
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
I'm working on something similar where I'm trying to click stuff on a webpage. The documentation for find_by_css() is very poor and you need to type the css path to the element you want to click.
Say we want to go to the about tab on python.org
from splinter import Browser
from time import sleep

with Browser() as browser:  # <-- Create browser instance (firefox default driver)
    browser.visit('http://www.python.org')  # <-- Visits url string
    browser.find_by_css('#about > a').click()
    # ^-- Put css path here in quotes
    sleep(5)
If your connection is good you might not get the chance to see the about tab getting clicked but you should end up on the about page.
The hard part is figuring out the css path to an element. However once you have it, the find_by_css() method looks pretty easy
I like the W3Schools reference for CSS selection parameters: http://www.w3schools.com/cssref/css_selectors.asp
As for your code... I recommend breaking this down into a few steps, at least during debugging. The call to br.find_by_css('css_string') returns a list of elements. So you can grab that list and check the count.
elems = br.find_by_css('div#edit-field-download-files-und-0 a.button.launcher')
if len(elems) == 1:
    elems.first.click()
If you don't check the length of the returned list and call '.first' on an empty list, you'll get an exception. If len > 1, you're probably getting things you don't expect.
Each id on a page is unique, and you can daisy-chain searches, so you can use a few different statements to make this happen:
id_elems = br.find_by_id('edit-field-download-files-und-0')
if id_elems:
    id_elem = id_elems.first
    a_elems = id_elem.find_by_tag("a")
    for e in a_elems:
        if e.has_class("button launcher"):
            print('Found it!')
            e.click()
This is, of course, just one of many ways to do this.
Lastly, Splinter is a wrapper around Selenium and other webdrivers. It's possible that, even after you find the element to click, the actual click won't do anything. If this happens, you can also try clicking on the wrapped Selenium object, available as e._element. So you could try e._element.click() if necessary.
