Speed up Selenium, WebDriver, and BeautifulSoup? - Python

I am trying to scrape a Google Shopping page and have it work reliably. The page is full of JavaScript (which BeautifulSoup can't render, to my knowledge), so I am using Selenium and WebDriver to load the page first and then BeautifulSoup to parse the page content. The problem is that it is really slow. Parsing this one page takes about 9 seconds on average, and I need to parse multiple pages with the same method at once; 9 seconds each is just too long for my application. I have researched and tried various methods to speed up Selenium, WebDriver, and BeautifulSoup (such as cchardet), but with no significant or noticeable difference. To find out what was slowing the operation down, I put a print between each line and watched the output in the terminal to see where it got stuck. My code is below, and the slowest line by far, which causes 99% of the problem, is...
google_driver.get('https://www.google.com/search?q=desk&source=lmns&tbm=shop&bih=1043&biw=1866&hl=en&sa=X&ved=2ahUKEwjxh5DYj9T5AhVEsHIEHfpsA_0Q_AUoAXoECAEQAQ')
I can't tell whether the long pause on this line is just the time needed to fully load the page before extracting the contents, or whether the extraction of the content itself is what takes so long.
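To tell the two apart, the page load and the parsing can be timed separately. A minimal sketch, using the same driver and selectors as the function below (url stands in for the long search URL above):
import time

t0 = time.perf_counter()
google_driver.get(url)  # url = the Google Shopping search URL above
t1 = time.perf_counter()
google_soup = BeautifulSoup(google_driver.page_source, 'lxml')
google_parsed = google_soup.find_all('div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']})
t2 = time.perf_counter()
print(f"page load: {t1 - t0:.2f}s, source + parse: {t2 - t1:.2f}s")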
# Imports assumed by this view; SearchForm and chromedriver_path are defined elsewhere in the project
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

def google_initiate(request):
    form = SearchForm(request.POST or None)
    google_service = Service(chromedriver_path)
    google_options = webdriver.ChromeOptions()
    google_options.add_argument("--incognito")
    google_options.add_argument('headless')
    google_driver = webdriver.Chrome(service=google_service, options=google_options)
    google_driver.get('https://www.google.com/search?q=desk&source=lmns&tbm=shop&bih=1043&biw=1866&hl=en&sa=X&ved=2ahUKEwjxh5DYj9T5AhVEsHIEHfpsA_0Q_AUoAXoECAEQAQ')
    google_soup = BeautifulSoup(google_driver.page_source, 'lxml')
    google_parsed = google_soup.find_all('div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']})
    return google_parsed
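For reference, if most of the 9 seconds are spent inside driver.get(), a couple of Chrome settings usually shave time off the page load: an 'eager' page-load strategy (return as soon as the DOM is ready) and disabled images. This is only a sketch of those options; whether Google Shopping still renders its result grid with images disabled is an assumption to verify:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

def google_initiate_fast(url, chromedriver_path):
    options = webdriver.ChromeOptions()
    options.add_argument('--headless')
    options.add_argument('--incognito')
    # skip downloading images - often the biggest saving on shopping pages
    options.add_argument('--blink-settings=imagesEnabled=false')
    # return from get() once the DOM is ready instead of waiting for every subresource
    options.page_load_strategy = 'eager'
    driver = webdriver.Chrome(service=Service(chromedriver_path), options=options)
    try:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, 'lxml')
        return soup.find_all('div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']})
    finally:
        driver.quit()
Reusing one browser for several pages, instead of starting a new Chrome per request, also helps, since browser start-up alone can take a couple of seconds.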
If it is due to the page needing to load fully, and there is no fix for the current setup, is there an alternative way to do this that is faster? Can I do it with just BeautifulSoup, which is very fast on its own (again, the reason I'm not is the JavaScript on the page)? Thanks in advance!
P.S. I am new to Selenium and WebDriver and really only know enough to make this work and to make various small modifications.
UPDATE: still stuck
def home(request):
    form = SearchForm(request.POST or None)
    if form.is_valid():
        form.save()
    if request.POST:
        for google_post in google_initiate(request, self):
            # Do some stuff
            # Make a list
            # Append stuff to list
That's the calling function, at the top of my code.
def google_initiate(request, self):
    self.open(
        "https://www.google.com/"
        "search?q=desk&source=lmns&tbm=shop&bih=1043&biw=1866&hl=en&sa="
        "X&ved=2ahUKEwjxh5DYj9T5AhVEsHIEHfpsA_0Q_AUoAXoECAEQAQ")
    soup = self.get_beautiful_soup()
    parsed = soup.find_all(
        'div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']}
    )
    print(parsed)
    return parsed
That's the underlying function, at the bottom of my code.
I'm still working at it and trying different things; I'm just stuck on getting SeleniumBase to work with Django and its views. Thanks!

Below is a SeleniumBase pytest test that will do that in 5 seconds or less.
Add --headless --block-images as command-line options to speed it up:
from seleniumbase import BaseCase

class MyTestClass(BaseCase):
    def test_parse_shopping(self):
        self.open(
            "https://www.google.com/"
            "search?q=desk&source=lmns&tbm=shop&bih=1043&biw=1866&hl=en&sa="
            "X&ved=2ahUKEwjxh5DYj9T5AhVEsHIEHfpsA_0Q_AUoAXoECAEQAQ")
        soup = self.get_beautiful_soup()
        parsed = soup.find_all(
            'div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']}
        )
        print(len(parsed))
pytest test_NAME.py --headless --block-images
I ran that and it found 88 items.
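Since the update above is about calling this from a Django view rather than from pytest, SeleniumBase's SB() context manager is worth a look: it runs without the pytest runner. This is only a sketch; whether the headless and block_images keyword arguments are accepted depends on your SeleniumBase version, so treat them as assumptions to check against the SeleniumBase docs:
from seleniumbase import SB

def google_initiate(url):
    # headless / block_images are assumed SB() options - verify for your version
    with SB(headless=True, block_images=True) as sb:
        sb.open(url)
        soup = sb.get_beautiful_soup()
        return soup.find_all(
            'div', {'class': ['sh-dgr__gr-auto', 'sh-dgr__grid-result']}
        )
A plain function like this can be called directly from a Django view, with no pytest involved.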

Related

Selenium with proxy returns empty website

I am having trouble getting the page source HTML out of a site with Selenium through a proxy. Here is my code:
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import codecs
import time
import shutil
proxy_username = 'myProxyUser'
proxy_password = 'myProxyPW'
port = '1080'
hostname = 'myProxyIP'
PROXY = proxy_username+":"+proxy_password+"#"+hostname+":"+port
options = Options()
options.add_argument("--headless")
options.add_argument("--kiosk")
options.add_argument('--proxy-server=%s' %PROXY)
driver = webdriver.Chrome(r'C:\Users\kingOtto\Downloads\chromedriver\chromedriver.exe', options=options)
driver.get("https://www.whatismyip.com")
time.sleep(10)
html = driver.page_source
f = codecs.open('dummy.html', "w", "utf-8")
f.write(html)
driver.close()
This results in very incomplete HTML, showing only the empty head and body tags:
html
Out[3]: '<html><head></head><body></body></html>'
Also, the dummy.html file written to disk does not contain any content beyond what is displayed in the line above.
I am lost; here is what I have tried:
It does work when I run it without the options.add_argument('--proxy-server=%s' % PROXY) line, so I am sure it is the proxy. But the proxy connection itself seems to be OK (I do not get any proxy connection errors, plus I do get the outer frame from the website, right? So the driver request gets through and back to me).
Different URLs: not only whatismyip.com fails, but other pages too - I tried different news outlets such as CNN and even Google - and virtually nothing comes back from any website except the head and body tags. So it cannot be a JavaScript/iframe issue, right?
Different wait times (this article does not help: Make Selenium wait 10 seconds), up to 60 seconds - plus my connection is very fast; <1 second should be enough (in a browser).
What am I getting wrong about the connection?
driver.page_source does not always return what you expect via Selenium. It is likely NOT the full DOM. This is documented in the Selenium docs and in various SO answers, e.g.:
https://stackoverflow.com/a/45247539/1387701
Selenium makes a best effort to provide the page source as it is fetched. On highly dynamic pages, what it returns can often be limited.
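One way to reduce the chance of reading page_source before the page has rendered is to wait for a specific element rather than using a fixed time.sleep(). A sketch of that (it assumes the body does eventually receive content through the proxy; if Chrome is silently failing to authenticate against the proxy, no amount of waiting will help):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://www.whatismyip.com")
# wait until at least one element appears inside <body> before reading the source
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "body *"))
)
html = driver.page_source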

Python Selenium - Get everything and write to file

Noob here who just managed to be actively refused by the remote server. Too many connection attempts, I suspect.
...and really, I should not be connecting every time I want to try some new code, so that got me to this question:
So, how can I grab everything off the page and save it to a file... and then just load the file offline to search for the fields I need?
I was in the process of testing the code below when I was refused, so I don't know what works - there are probably typos below. :/
Could anyone please offer any suggestions or improvements?
print ("Get CSS elements from page")
parent_elements_css = driver.find_elements_by_css_selector("*")
driver.quit()
print ("Saving Parent_Elements to CSV")
with open('ReadingEggs_BookReviews_Dump.csv', 'w') as file:
file.write(parent_elements_css)
print ("Open CSV to Parents_Elements")
with open('ReadingEggs_BookReviews_Dump.csv', 'r') as file:
parent_elements_css = file
print ("Find the children of the Parent")
# Print stuff to screen to quickly find the css_selector 'codes'
# A bit brute force ish
for css in parent_elements_css:
print (css.text)
child_elements_span = parent_element.find_element_by_css_selector("span")
child_elements_class = parent_element.find_element_by_css_selector("class")
child_elements_table = parent_element.find_element_by_css_selector("table")
child_elements_tr = parent_element.find_element_by_css_selector("tr")
child_elements_td = parent_element.find_element_by_css_selector("td")
These other pages looked interesting:
python selenium xpath/css selector
Get all child elements
Locating Elements
xpath-partial-match-tr-id-with-python-selenium (ah cos I asked this one :D..but the answer by Sers is awesome)
My previous file saving used a dictionary and json... but I could not use that above because of this error: "TypeError: Object of type WebElement is not JSON serializable". I had not saved files before that.
You can get the HTML of the whole page via driver.page_source. You can then parse that HTML with BeautifulSoup:
from bs4 import BeautifulSoup
# navigate to page
html_doc = driver.page_source
soup = BeautifulSoup(html_doc, 'html.parser')
child_elements_span = soup.find_all('span')
child_elements_table = soup.find_all('table')
Here is good documentation for parsing HTML with BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
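To cover the "save once, work offline" part of the question, a minimal sketch along these lines (the file name is just an example) lets you hit the server a single time and then experiment locally as often as you like:
from bs4 import BeautifulSoup

# one-off: fetch the page and save the raw HTML to disk
with open('ReadingEggs_BookReviews_Dump.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)
driver.quit()

# later, offline: load the saved HTML and search it without touching the server
with open('ReadingEggs_BookReviews_Dump.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

for span in soup.find_all('span'):
    print(span.get_text(strip=True))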

BeautifulSoup not returning everything on Facebook

I'm trying to extract all the pages liked by a given person on Facebook. Therefore, I'm using Python with BeautifulSoup and Selenium to automate the connection.
However, even though my code works, it doesn't actually return all the results (on my own profile, for instance, it only returns about 20% of all pages).
I read that it might be the parser used in BeautifulSoup, but I tried a bunch of them (html.parser, lxml...) and it's always the same thing.
Could that be because Facebook is dynamically generating the pages with AJAX? But then I have Selenium, which should interpret it correctly...!
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
id_user = ""
driver = webdriver.Chrome()
driver.get('https://facebook.com')
driver.find_element_by_id('email').send_keys('')
driver.find_element_by_id('pass').send_keys('')
driver.find_element_by_id('loginbutton').click()
time.sleep(2)
pages_liked = "https://www.facebook.com/search/" + id_user + "/pages-liked"
driver.get(pages_liked)
soup = BeautifulSoup(driver.page_source, 'html.parser')
likes_divs = soup.find_all('a', class_="_32mo")
for div in likes_divs:
    print(div['href'].split("/?")[0])
    print(div.find('span').text)
Thank you very much,
Loïc
Facebook is famous for making web scrapers' lives difficult... That said, it looks like you did your homework correctly; the snippet looks right to the point.
Start by looking into driver.page_source, i.e. what Selenium actually gets... If the information is in there, the problem lies with BeautifulSoup; if it's not, Facebook has found a strategy to hide the page (by looking at the browser signature or fingerprint - yes, these are different concepts).
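A quick way to run that check is to compare the raw source Selenium received with what BeautifulSoup finds in it. A diagnostic sketch (it assumes the '_32mo' class from the question is still what Facebook uses for those links):
from bs4 import BeautifulSoup

html = driver.page_source

# how many times does the class name appear in the raw source?
print(html.count('_32mo'))

# how many matching anchors does BeautifulSoup find in that same source?
soup = BeautifulSoup(html, 'html.parser')
print(len(soup.find_all('a', class_='_32mo')))
If both numbers are low, the missing entries were never in the source Selenium received (for example because they load lazily while scrolling); if the first is high and the second is low, the parser is the problem.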

Retrieving scripted page urls via web scrape

I'm trying to get all of the article links from a web-scraped search query; however, I don't seem to get any results.
Web page in question: http://www.seek.com.au/jobs/in-australia/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&advertiserID=&advertiserGroup=&keywords=police+check&page=1&isAreaUnspecified=false&location=&area=&nation=3000&sortMode=Advertiser&searchFrom=quick&searchType=
My approach:
I'm trying to get the IDs of the articles and then append them to the already-known URL (http://www.seek.com.au/job/ + id); however, there are no IDs in what requests (the Python package from http://docs.python-requests.org/en/latest/) retrieves - in fact, there are no articles at all.
It seems that in this particular case I need to execute the scripts (which generate the IDs) in some way to get the full data. How could I do that?
Or maybe there are other ways to retrieve all of the results from this search query?
As mentioned, download Selenium. There are Python bindings.
Selenium is a web-testing automation framework. In effect, by using Selenium you are remote-controlling a web browser. This is necessary because web browsers have JavaScript engines and DOMs, allowing AJAX to occur.
Using this test script (it assumes you have Firefox installed; Selenium supports other browsers if needed):
# Import 3rd Party libraries
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

class requester_firefox(object):
    def __init__(self):
        self.selenium_browser = webdriver.Firefox()
        self.selenium_browser.set_page_load_timeout(30)

    def __del__(self):
        self.selenium_browser.quit()
        self.selenium_browser = None

    def __call__(self, url):
        try:
            self.selenium_browser.get(url)
            the_page = self.selenium_browser.page_source
        except Exception:
            the_page = ""
        return the_page
test = requester_firefox()
print test("http://www.seek.com.au/jobs/in-australia/#dateRange=999&workType=0&industry=&occupation=&graduateSearch=false&salaryFrom=0&salaryTo=999999&salaryType=annual&advertiserID=&advertiserGroup=&keywords=police+check&page=1&isAreaUnspecified=false&location=&area=&nation=3000&sortMode=Advertiser&searchFrom=quick&searchType=").encode("ascii", "ignore")
It will load SEEK and wait for AJAX pages. The encode method is necessary (for me at least) because SEEK returns a unicode string which the Windows console seemingly can't print.
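To get from the fetched page back to the article IDs the question asks about, the returned source can be handed to BeautifulSoup. This is a hypothetical sketch: the "/job/" substring is an assumption about how SEEK's result links looked at the time, not something confirmed from the site:
from bs4 import BeautifulSoup

page = test(search_url)  # search_url = the same SEEK query URL used above
soup = BeautifulSoup(page, "html.parser")

# collect links whose href contains "/job/" (assumed pattern for result articles)
job_links = {a["href"] for a in soup.find_all("a", href=True) if "/job/" in a["href"]}
for link in sorted(job_links):
    print(link)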

python selenium - how can I get all visible text on the page (i.e., not the page source) [duplicate]

I've been googling this all day without finding the answer, so apologies in advance if this has already been answered.
I'm trying to get all visible text from a large number of different websites. The reason is that I want to process the text to eventually categorize the websites.
After a couple of days of research, I decided that Selenium was my best chance. I've found a way to grab all the text with Selenium; unfortunately, the same text is being grabbed multiple times:
from selenium import webdriver
import codecs

filen = codecs.open('outoput.txt', encoding='utf-8', mode='w+')
driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")
allelements = driver.find_elements_by_xpath("//*")
ferdigtxt = []
for i in allelements:
    if i.text in ferdigtxt:
        pass
    else:
        ferdigtxt.append(i.text)
        filen.writelines(i.text)
filen.close()
driver.quit()
The if condition inside the for loop is an attempt at eliminating the problem of fetching the same text multiple times; however, it only works as planned on some webpages. (It also makes the script a LOT slower.)
I'm guessing the reason for my problem is that - when asking for the inner text of an element - I also get the inner text of the elements nested inside the element in question.
Is there any way around this? Is there some sort of master element whose inner text I can grab? Or a completely different way that would let me reach my goal? Any help would be greatly appreciated, as I'm out of ideas for this one.
Edit: the reason I used Selenium and not Mechanize and BeautifulSoup is that I wanted JavaScript-rendered text.
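On the "master element" idea: Selenium's .text property already returns the rendered, visible text of an element and its descendants exactly once, so taking it from the body element is one common workaround. A minimal sketch (how well the whitespace and ordering come out depends on the page):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://www.examplepage.com")

# .text on <body> gives the page's visible text once, descendants included
visible_text = driver.find_element_by_tag_name("body").text

with open('output.txt', 'w', encoding='utf-8') as f:
    f.write(visible_text)

driver.quit()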
Using lxml, you might try something like this:
import contextlib
import selenium.webdriver as webdriver
import lxml.html as LH
import lxml.html.clean as clean

url = "http://www.yahoo.com"
ignore_tags = ('script', 'noscript', 'style')
with contextlib.closing(webdriver.Firefox()) as browser:
    browser.get(url)  # Load page
    content = browser.page_source
    cleaner = clean.Cleaner()
    content = cleaner.clean_html(content)
    with open('/tmp/source.html', 'w') as f:
        f.write(content.encode('utf-8'))
    doc = LH.fromstring(content)
    with open('/tmp/result.txt', 'w') as f:
        for elt in doc.iterdescendants():
            if elt.tag in ignore_tags:
                continue
            text = elt.text or ''
            tail = elt.tail or ''
            words = ' '.join((text, tail)).strip()
            if words:
                words = words.encode('utf-8')
                f.write(words + '\n')
This seems to get almost all of the text on www.yahoo.com, except for text in images and some text that changes with time (done with javascript and refresh perhaps).
Here's a variation on @unutbu's answer:
#!/usr/bin/env python
import sys
from contextlib import closing

import lxml.html as html  # pip install 'lxml>=2.3.1'
from lxml.html.clean import Cleaner
from selenium.webdriver import Firefox  # pip install selenium
from werkzeug.contrib.cache import FileSystemCache  # pip install werkzeug

cache = FileSystemCache('.cachedir', threshold=100000)

url = sys.argv[1] if len(sys.argv) > 1 else "https://stackoverflow.com/q/7947579"

# get page
page_source = cache.get(url)
if page_source is None:
    # use firefox to get page with javascript generated content
    with closing(Firefox()) as browser:
        browser.get(url)
        page_source = browser.page_source
    cache.set(url, page_source, timeout=60*60*24*7)  # week in seconds

# extract text
root = html.document_fromstring(page_source)
# remove flash, images, <script>, <style>, etc.
Cleaner(kill_tags=['noscript'], style=True)(root)  # lxml >= 2.3.1
print root.text_content()  # extract text
I've separated your task into two parts:
get page (including elements generated by javascript)
extract text
The two parts are connected only through the cache. You can fetch pages in one process and extract text in another process, or defer the extraction to later using a different algorithm.
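As an illustration of that split, a small sketch of the "different algorithm" idea: reuse the cached page source and extract the text with BeautifulSoup instead of lxml. It assumes the same .cachedir cache and URL as above (note that FileSystemCache lived in werkzeug.contrib.cache in old Werkzeug releases and now ships in the separate cachelib package):
from bs4 import BeautifulSoup
from cachelib import FileSystemCache  # older Werkzeug: from werkzeug.contrib.cache import FileSystemCache

cache = FileSystemCache('.cachedir', threshold=100000)

url = "https://stackoverflow.com/q/7947579"
page_source = cache.get(url)  # previously stored by the fetch step above
if page_source is not None:
    soup = BeautifulSoup(page_source, 'html.parser')
    for tag in soup(['script', 'style', 'noscript']):
        tag.decompose()  # drop non-visible text
    print(soup.get_text(separator='\n', strip=True))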
