I'm going through Automate the Boring Stuff with Python and I'm stuck at the chapter about downloading data from the internet. One of the tasks is to download photos for a given keyword from Flickr.
I'm having a massive problem scraping this site. I've tried BeautifulSoup (which I think is not appropriate in this case, since the site relies on JavaScript) and Selenium. Looking at the HTML, I think I should locate the 'overlay' class. However, no matter which option I use (find_element_by_class_name, ...by_text, ...by_partial_text) I am not able to find these elements (I get: ".
Could you please help me clarify what I'm doing wrong? I'd also be grateful for any materials that could help me understand such cases better. Thanks!
Here's my simple code:
import sys
from selenium import webdriver

search_keywords = sys.argv[1]
browser = webdriver.Firefox()
browser.get(f'https://www.flickr.com/search/?text={search_keywords}')
elems = browser.find_element_by_class_name("overlay")
print(elems)
elems.click()
Sample keywords I type in shell: "industrial design interior"
Are you getting any error message? With Selenium it's useful to surround your code in try/except blocks.
What are you trying to do exactly, download the photos? With a bit of rewriting:
import time
from selenium import webdriver

try:
    options = webdriver.ChromeOptions()
    # options.add_argument('--headless')
    driver = webdriver.Chrome(chrome_options=options)
    search_keywords = "cars"
    driver.get(f'https://www.flickr.com/search/?text={search_keywords}')
    time.sleep(1)
except Exception as e:
    print("Error loading search results page: " + str(e))

try:
    elems = driver.find_element_by_class_name("overlay")
    print(elems)
    elems.click()
    time.sleep(5)
except Exception as e:
    print(str(e))
Loads the page as expected and then clicks on the photo, taking us to This Page
I would be able to help more if you could go into more detail of what you're wanting to accomplish.
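If the goal is to save the image files themselves, here is a rough sketch (not tested against Flickr's current markup; the flickr_photos folder name and the assumption that the thumbnails are plain img tags with a usable src are mine) that collects every image on the results page and downloads it with requests:

import os
import requests

# Rough sketch: grab every <img> on the results page and save whatever has a src.
# Flickr may lazy-load images, so scrolling first (not shown) might be needed.
os.makedirs('flickr_photos', exist_ok=True)
for i, img in enumerate(driver.find_elements_by_tag_name('img')):
    src = img.get_attribute('src')
    if not src:
        continue
    res = requests.get(src)
    res.raise_for_status()
    with open(os.path.join('flickr_photos', f'photo_{i}.jpg'), 'wb') as f:
        f.write(res.content)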
Noob here who just managed to get actively refused by the remote server. Too many connection attempts, I suspect.
...and really, I should not be trying to connect every time I want to try some new code, so that got me to this question:
How can I grab everything off the page and save it to a file, and then just load the file offline to search for the fields I need?
I was in the process of testing the code below when I was refused, so I don't know what works - there are probably typos below :/
Could anyone please offer any suggestions or improvements?
print ("Get CSS elements from page")
parent_elements_css = driver.find_elements_by_css_selector("*")
driver.quit()
print ("Saving Parent_Elements to CSV")
with open('ReadingEggs_BookReviews_Dump.csv', 'w') as file:
file.write(parent_elements_css)
print ("Open CSV to Parents_Elements")
with open('ReadingEggs_BookReviews_Dump.csv', 'r') as file:
parent_elements_css = file
print ("Find the children of the Parent")
# Print stuff to screen to quickly find the css_selector 'codes'
# A bit brute force ish
for css in parent_elements_css:
print (css.text)
child_elements_span = parent_element.find_element_by_css_selector("span")
child_elements_class = parent_element.find_element_by_css_selector("class")
child_elements_table = parent_element.find_element_by_css_selector("table")
child_elements_tr = parent_element.find_element_by_css_selector("tr")
child_elements_td = parent_element.find_element_by_css_selector("td")
These other pages looked interesting:
python selenium xpath/css selector
Get all child elements
Locating Elements
xpath-partial-match-tr-id-with-python-selenium (ah cos I asked this one :D..but the answer by Sers is awesome)
My previous file save used a dictionary and json... but I could not use that here because of this error: "TypeError: Object of type WebElement is not JSON serializable". I have not saved files any other way before that.
You can get the html of the whole page via driver.page_source. You can then read from the html using Beautiful Soup like so:
from bs4 import BeautifulSoup
# navigate to page
html_doc = driver.page_source
soup = BeautifulSoup(html_doc, 'html.parser')
child_elements_span = soup.find_all('span')
child_elements_table = soup.find_all('table')
Here is good documentation for parsing the html with BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
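To work offline, as you asked, one option (a small sketch; page_dump.html is just an example filename) is to dump driver.page_source to a file once, then parse the saved file with Beautiful Soup on later runs without reconnecting:

from bs4 import BeautifulSoup

# Save the rendered page once, while the driver session is open.
with open('page_dump.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)

# Later, parse the saved file offline -- no connection needed.
with open('page_dump.html', 'r', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'html.parser')

child_elements_span = soup.find_all('span')
child_elements_td = soup.find_all('td')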
Currently I'm trying to download historical stock prices from Yahoo Finance for personal research purposes. But when I used Selenium in Python to download the data, I ran into 2 issues:
1. It took too long to fully load the web page because it has a lot of external links to load. There was always a loading timeout exception.
2. When I used try/except to deal with the timeout exception, the button used to change the date didn't work. I guess this is because the web page hadn't been totally loaded.
I am a beginner with Python and Selenium, so could you please advise on this issue?
Find below 3 methods:
Checking page readyState (not reliable):
def page_has_loaded(self):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    page_state = self.driver.execute_script('return document.readyState;')
    return page_state == 'complete'
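As a rough usage sketch (the wait_until_loaded helper, its polling interval, and the timeout are my own additions, not part of the original answer), you could poll this check before interacting with the page:

import time

def wait_until_loaded(self, timeout=30, poll=0.5):
    # Poll document.readyState until it reports 'complete' or the timeout expires.
    deadline = time.time() + timeout
    while time.time() < deadline:
        if self.page_has_loaded():
            return True
        time.sleep(poll)
    return False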
Comparing new page ids with the old one:
def page_has_loaded2(self, old_page):
    self.log.info("Checking if {} page is loaded.".format(self.driver.current_url))
    try:
        new_page = self.driver.find_element_by_tag_name('html')
        return new_page.id != old_page.id
    except NoSuchElementException:
        return False
Using staleness_of method:
import contextlib

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.expected_conditions import staleness_of

@contextlib.contextmanager
def wait_for_page_load(self, timeout=10):
    self.log.debug("Waiting for page to load at {}.".format(self.driver.current_url))
    old_page = self.driver.find_element_by_tag_name('html')
    yield
    WebDriverWait(self.driver, timeout).until(staleness_of(old_page))
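As a quick usage sketch (assuming the method lives on the same wrapper class; the 'Historical Data' link is just a hypothetical trigger), wrap whatever causes the navigation in the context manager and it will block until the old page goes stale:

# Hypothetical locator -- replace with whatever actually triggers the navigation.
with self.wait_for_page_load(timeout=10):
    self.driver.find_element_by_link_text('Historical Data').click()
# Execution resumes here only once the old <html> element has gone stale.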
For more details, check Harry's blog.
Hope it will help you :)
Guys, I need to write a script that uses Selenium to go over the pages on a website and download each page to a file.
This is the website I need to go through, and I want to download all 10 pages of reviews.
This is my code:
import urllib2, os, sys, time
from selenium import webdriver

browser = urllib2.build_opener()
browser.addheaders = [('User-agent', 'Mozilla/5.0')]
url = 'http://www.imdb.com/title/tt2948356/reviews?ref_=tt_urv'

driver = webdriver.Chrome('chromedriver.exe')
driver.get(url)
time.sleep(2)

if not os.path.exists('reviewPages'): os.mkdir('reviewPages')

response = browser.open(url)
myHTML = response.read()
fwriter = open('reviewPages/' + str(1) + '.html', 'w')
fwriter.write(myHTML)
fwriter.close()
print 'page 1 done'

page = 2
while True:
    cssPath = '#tn15content > table:nth-child(4) > tbody > tr > td:nth-child(2) > a:nth-child(11) > img'
    try:
        button = driver.find_element_by_css_selector(cssPath)
    except:
        error_type, error_obj, error_info = sys.exc_info()
        print 'STOPPING - COULD NOT FIND THE LINK TO PAGE: ', page
        print error_type, 'Line:', error_info.tb_lineno
        break
    button.click()
    time.sleep(2)
    response = browser.open(url)
    myHTML = response.read()
    fwriter = open('reviewPages/' + str(page) + '.html', 'w')
    fwriter.write(myHTML)
    fwriter.close()
    time.sleep(2)
    print 'page', page, 'done'
    page += 1
But the program just stops after downloading the first page. Could someone help? Thanks.
So, a few things are causing this.
The first thing I think is causing you issues is:
table:nth-child(4)
When I go to that website, I think you just want:
table >
The second problem is the break statement in your except block. This says: when I get an error, stop looping.
So what's happening is that your try/except isn't doing what you want: your CSS selector is not quite correct, so you go to your exception, where you are telling it to stop looping.
Instead of that very complex CSS path, try this simpler XPath ('//a[child::img[@alt="[Next]"]]/@href'), which will return the URL associated with the little triangular 'next' button on each page.
Or notice that each page has 10 reviews and the URLs for pages 2 to 10 just give the starting review number, i.e. http://www.imdb.com/title/tt2948356/reviews?start=10, which is the URL for page 2. Simply calculate the URL for the next page and stop when it doesn't fetch anything.
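A rough sketch of that second approach (it assumes exactly 10 pages of 10 reviews each, that the driver from your question is already open, and that the reviewPages folder already exists):

import io
import time

# Sketch: fetch each review page by its start= offset and save the rendered source.
base = 'http://www.imdb.com/title/tt2948356/reviews?start={}'
for page in range(10):
    driver.get(base.format(page * 10))
    time.sleep(2)
    html = driver.page_source
    with io.open('reviewPages/{}.html'.format(page + 1), 'w', encoding='utf-8') as fwriter:
        fwriter.write(html)
    print('page {} done'.format(page + 1))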
I am scraping a website with a lot of JavaScript that is generated when the page is called. As a result, traditional web scraping methods (BeautifulSoup, etc.) are not working for my purposes (at least I have been unsuccessful in getting them to work; all of the important data is in the JavaScript parts). As a result, I have started using Selenium WebDriver. I need to scrape a few hundred pages, each of which has between 10 and 80 data points (each with about 12 fields), so it is important that this script (is that the right terminology?) can run for quite a while without me having to babysit it.
I have the code working for a single page, and I have a controlling section that tells the scraping section which page to scrape. The problem is that sometimes the JavaScript portions of the page load, and sometimes they don't. When they don't (~1/7 of the time), a refresh fixes things, but occasionally the refresh will freeze WebDriver, and thus the Python runtime environment as well. Annoyingly, when it freezes like this, the code fails to time out. What is going on?
Here is a stripped down version of my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import time, re, random, csv
from collections import namedtuple
def main(url_full):
    driver = webdriver.Firefox()
    driver.implicitly_wait(15)
    driver.set_page_load_timeout(30)

    # create HealthPlan namedtuple
    HealthPlan = namedtuple("HealthPlan", ("State, County, FamType, Provider, PlanType, Tier,") +
                            (" Premium, Deductible, OoPM, PrimaryCareVisitCoPay, ER, HospitalStay,") +
                            (" GenericRx, PreferredPrescription, RxOoPM, MedicalDeduct, BrandDrugDeduct"))

    # check whether the page has loaded and handle page load and time out errors
    pageNotLoaded = True
    while pageNotLoaded:
        try:
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))
        except TimeoutException:
            driver.quit()
            time.sleep(3 + abs(random.normalvariate(1.8, 3)))
            driver.get(url_full)
            time.sleep(6 + abs(random.normalvariate(1.8, 3)))

        # Handle page load error by testing presence of showAll,
        # an important feature of the page, which only appears if everything else loads
        try:
            driver.find_element_by_xpath('//*[@id="showAll"]').text
        # catch NoSuchElementException => refresh page
        except NoSuchElementException:
            try:
                driver.refresh()
            # catch TimeoutException => quit and load the page
            # in a new instance of firefox,
            # I don't think the code ever gets here, because it freezes in the refresh
            # and will not throw the timeout exception like I would like
            except TimeoutException:
                driver.quit()
                time.sleep(3 + abs(random.normalvariate(1.8, 3)))
                driver.get(url_full)
                time.sleep(6 + abs(random.normalvariate(1.8, 3)))

        pageNotLoaded = False

    scrapePage()  # this is a dummy function, everything from here down works fine
I have looked extensively for similar problems, and I do not think anyone else has posted about this on SO, or anywhere else that I have looked. I am using Python 2.7 and Selenium 2.39.0, and I am trying to scrape Healthcare.gov's get premium estimate pages.
EDIT: (as an example, this page) It may also be worth mentioning that the page fails to load completely more often when the computer has been on / doing this for a while (I'm guessing that the free RAM is getting full, and it glitches while loading). This is kind of beside the point though, because this should be handled by the try/except.
EDIT2: I should also mention that this is being run on Windows 7 64-bit, with Firefox 17 (which I believe is the newest supported version).
Dude, time.sleep is a fail!
What's this?
time.sleep(3 + abs(random.normalvariate(1.8, 3)))
Try this:
class TestPy(unittest.TestCase):
    def waits(self):
        self.implicit_wait = 30
Or this:
(self.)driver.implicitly_wait(10)
Or this:
WebDriverWait(driver, 10).until(lambda driver: driver.find_element_by_xpath('some_xpath'))
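For example, here is a sketch (the 30-second timeout is an arbitrary choice) using expected_conditions to wait for the showAll element mentioned in the question, which gives up with a TimeoutException after the deadline rather than sleeping a fixed amount:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 30 seconds until the showAll element is present in the DOM,
# then raise TimeoutException instead of guessing with time.sleep.
WebDriverWait(driver, 30).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="showAll"]'))
)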
Or, instead of driver.refresh() you can trick it:
driver.get(your_url)
Also, you can clear the cookies:
driver.delete_all_cookies()
As for scrapePage() (the dummy function where everything from there down works fine), have a look at:
http://scrapy.org
I'm wondering whether there is any way to open a URL in a browser and read the source of that opened URL?
I'm trying to check whether my XPath selector gets the right value of the captcha img src. I can't do this by making 2 connections to the URL, because the captcha reloads every single time I connect.
For reading the source I'm using:
url = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx"
sock = urllib.urlopen(url)
htmlSource = sock.read()
sock.close()
To open the URL in a browser I'm using:
if sys.platform == 'win32':
    os.startfile(url)
elif sys.platform == 'darwin':
    subprocess.Popen(['open', url])
else:
    try:
        subprocess.Popen(['xdg-open', url])
    except OSError:
        print 'Please open a browser on: ' + url
Does any of you guys know how to solve this?
Thanks
I found a solution. To see the URL in a browser and at the same time see the source code of the page, just use this code:
from selenium import webdriver
from lxml import etree, html

url = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/Search.aspx"
adres_prefix = "https://prod.ceidg.gov.pl/CEIDG/CEIDG.Public.UI/"
adres_sufix = etree.XPath('string(//img[@class="captcha"]/@src)')

browser = webdriver.Firefox()
browser.get(url)

html_source = browser.page_source  # getting the source code of the open url
root = etree.HTML(html_source)
result = etree.tostring(root, pretty_print=True, method="html")
result2 = adres_sufix(root)

www = adres_prefix + result2
print www  # now I can see whether XPath gives me the right value
Hope it will help others
Thanks anyway for any help
Most of the cross-platform Python GUI toolkits, such as wxPython, PySide, etc., have an HTML display widget that you can use to show the HTML source from within your Python code. I would recommend using one of those to display your content.
You probably are going to need to make more than one request to get the CAPTCHA. Get yourself a copy of Fiddler 2 (free): http://fiddler2.com/get-fiddler. It will allow you to see the "conversation" between the server and your browser. Once you see that, you will probably know what you need.
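If you do end up needing a second request for the image itself, one experiment (a sketch reusing the variable names from the Selenium code above; it may or may not defeat the captcha rotation) is to copy the Selenium browser's cookies into a requests session before fetching the captcha src:

import requests

# Copy the cookies from the Selenium session into requests,
# so the image request looks like it comes from the same browser session.
session = requests.Session()
for cookie in browser.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

captcha_url = adres_prefix + adres_sufix(root)  # same XPath result as above
response = session.get(captcha_url)
with open('captcha.png', 'wb') as f:
    f.write(response.content)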