I have a Python scraper that uses Selenium to scrape a dynamically loaded JavaScript website.
The scraper itself works fine, but pages sometimes fail to load with a 404 error.
The problem is that the public URL doesn't have the data I need but loads every time, while the JavaScript URL with the data I need sometimes won't load for a random period of time.
Even weirder, the same JavaScript URL loads in one browser but not in another, and vice versa.
I tried the webdrivers for Chrome, Firefox, Firefox Developer Edition and Opera. Not a single one loads all pages every time.
The public link that doesn't have the data I need looks like this: <https://www.sazka.cz/kurzove-sazky/fotbal/*League*/>.
The JavaScript link that has the data I need looks like this: <https://rsb.sazka.cz/fotbal/*League*/>.
On average, about 8 out of around 30 links fail to load, although the same link loads flawlessly in a different browser at the same time.
I searched the page source for clues but found nothing.
Can anyone help me find out where the problem might be? Thank you.
Edit: the code that I think is relevant is shown below.
Edit 2: You can reproduce this problem by right-clicking on a league and opening the link in a new tab. Even if the page loaded properly at first, opening it in a new tab changes the start of the URL from https://www.sazka.cz to https://rsb.sazka.cz, and it sometimes gives a 404 error that can last for an hour or more.
import random
import re
import time

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome(executable_path='chromedriver',
                          service_args=['--ssl-protocol=any',
                                        '--ignore-ssl-errors=true'])
driver.maximize_window()
for single_url in urls:
    # Random pause of 4 to 6 seconds between page loads
    randomLoadTime = random.randint(400, 600)/100
    time.sleep(randomLoadTime)
    driver1 = driver
    driver1.get(single_url)
    htmlSourceRedirectCheck = driver1.page_source
    # Redirect Check
    redirectCheck = re.findall('404 - Page not found', htmlSourceRedirectCheck)
    if '404 - Page not found' in redirectCheck:
        leaguer1 = single_url
        leagueFinal = re.findall('fotbal/(.*?)/', leaguer1)
        print(str(leagueFinal) + ' ' + '404 - Page not found')
    else:
        try:
            # Wait up to 25 seconds for the odds headers to become clickable
            loadedOddsCheck = WebDriverWait(driver1, 25)
            loadedOddsCheck.until(EC.element_to_be_clickable(
                (By.XPATH, ".//h3[contains(@data-params, 'hideShowEvents')]")))
        except TimeoutException:
            pass
        unloadedOdds = driver1.find_elements_by_xpath(
            ".//h3[contains(@data-params, 'loadExpandEvents')]")
        # Expand every collapsed section, pausing briefly after each click
        for clicking in unloadedOdds:
            clicking.click()
            randomLoadTime2 = random.randint(50, 100)/100
            time.sleep(randomLoadTime2)
        matchArr = []
        leaguer = single_url
        htmlSourceOrig = driver1.page_source
Noob here who just managed to be actively refused by the remote server. Too many connection attempts, I suspect.
...and really, I should not be connecting every time I want to try some new code, which got me to this question:
So, how can I grab everything off the page, save it to a file, and then just load the file offline to search for the fields I need?
I was in the process of testing the code below when I was refused, so I don't know what works. There are probably typos below :/
Could anyone please offer any suggestions or improvements?
print ("Get CSS elements from page")
parent_elements_css = driver.find_elements_by_css_selector("*")
driver.quit()
print ("Saving Parent_Elements to CSV")
with open('ReadingEggs_BookReviews_Dump.csv', 'w') as file:
    file.write(parent_elements_css)
print ("Open CSV to Parents_Elements")
with open('ReadingEggs_BookReviews_Dump.csv', 'r') as file:
    parent_elements_css = file
print ("Find the children of the Parent")
# Print stuff to screen to quickly find the css_selector 'codes'
# A bit brute force ish
for css in parent_elements_css:
    print (css.text)
    child_elements_span = parent_element.find_element_by_css_selector("span")
    child_elements_class = parent_element.find_element_by_css_selector("class")
    child_elements_table = parent_element.find_element_by_css_selector("table")
    child_elements_tr = parent_element.find_element_by_css_selector("tr")
    child_elements_td = parent_element.find_element_by_css_selector("td")
These other pages looked interesting:
python selenium xpath/css selector
Get all child elements
Locating Elements
xpath-partial-match-tr-id-with-python-selenium (ah cos I asked this one :D..but the answer by Sers is awesome)
My previous file save was using a dictionary and json...but I could not use it above because of this error: "TypeError: Object of type WebElement is not JSON serializable". I have not saved files before that.
You can get the HTML of the whole page via driver.page_source. You can then read from the HTML using Beautiful Soup, like so:
from bs4 import BeautifulSoup
# navigate to page
html_doc = driver.page_source
soup = BeautifulSoup(html_doc, 'html.parser')
child_elements_span = soup.find_all('span')
child_elements_table = soup.find_all('table')
Here is good documentation for parsing HTML with BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
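To cover the "save it to a file, then work offline" part of the question, here is a minimal sketch. The helper names and the file name page_dump.html are made up for illustration. It saves the string from driver.page_source once and parses the saved file on later runs, so no further connections to the server are needed. The BeautifulSoup import sits inside the function so that saving and loading work even without bs4 installed.

```python
from pathlib import Path


def save_page_source(html: str, path: str) -> None:
    """Write the rendered HTML to disk so later runs need no network access."""
    Path(path).write_text(html, encoding="utf-8")


def load_page_source(path: str) -> str:
    """Read a previously saved page back into a string."""
    return Path(path).read_text(encoding="utf-8")


def find_tags_offline(path: str, tag_name: str):
    """Parse a saved page with BeautifulSoup and return all matching tags."""
    # Imported here so that saving/loading works even if bs4 is not installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(load_page_source(path), "html.parser")
    return soup.find_all(tag_name)


# Online step, run once while the browser is open
# (driver is your existing Selenium driver):
#     save_page_source(driver.page_source, "page_dump.html")
#
# Offline step, repeat as often as you like:
#     spans = find_tags_offline("page_dump.html", "span")
```

Saving the HTML string avoids the "WebElement is not JSON serializable" problem, because a WebElement is only a handle into a live browser session, while page_source is plain text.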
I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:
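from urllib import request

url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url)
response = request.urlopen(url)
# Decode the bytes instead of calling str() on them, so newlines come out as real newlines
data = response.read(10000).decode("utf-8", errors="replace")
print(data)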
Where I'm trying to find a particular value, I'm finding a template instead, e.g. "{{formatPrice median}}" instead of "4.48".
How can I make it so that I can retrieve the value instead of the placeholder text?
Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}
Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.
The code I have now is:
from bs4 import BeautifulSoup
from selenium import webdriver
#...
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print("Finding...")
for tag in soup.find_all('formatPrice median'):
    print(tag.text)
Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.
Assuming you are trying to get values from a page that is rendered using javascript templates (for instance something like handlebars), then this is what you will get with any of the standard solutions (i.e. beautifulsoup or requests).
This is because the browser uses javascript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:
parse the ajax JSON directly
use an offline JavaScript interpreter to process the request (SpiderMonkey, crowbar)
use a browser automation tool (splinter)
This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and it's pretty handy.
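For option 1, a rough sketch of what "parse the ajax JSON directly" can look like. The payload shape {"sell": {"median": ...}} and both helper names are assumptions for illustration only. The real endpoint and field names have to be read from the browser's network tab for the site in question.

```python
import json
from urllib import request


def fetch_json(url: str, timeout: float = 10.0):
    """Download a JSON endpoint and decode it into Python objects."""
    with request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def extract_median(payload: dict) -> float:
    """Read the value from an ASSUMED payload shape: {"sell": {"median": ...}}."""
    return float(payload["sell"]["median"])


# Offline demonstration with a made-up response body:
sample = json.loads('{"sell": {"median": 4.48}}')
print(extract_median(sample))  # prints 4.48
```

When the site exposes such an endpoint, this is usually the lightest option, since no browser has to be started at all.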
EDIT
From your comments it looks like it is a handlebars driven site. I'd recommend selenium and beautiful soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')
html = driver.page_source
soup = BeautifulSoup(html)
# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)
Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)
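If you are unsure whether the HTML you got back was actually rendered, a quick heuristic is to look for leftover template placeholders such as {{formatPrice median}}. This is only a sketch, and the helper name is made up. A fully rendered page can still contain the raw template inside a script tag, so treat a hit as a hint, not proof.

```python
import re

# Matches Handlebars-style placeholders such as {{formatPrice median}}.
PLACEHOLDER = re.compile(r"\{\{\s*[^{}]+?\s*\}\}")


def unrendered_placeholders(html: str) -> list:
    """Return every {{...}} placeholder still present in the given HTML."""
    return PLACEHOLDER.findall(html)


raw = "<td>{{formatPrice median}}</td>"
rendered = "<td>4.48</td>"
print(unrendered_placeholders(raw))
print(unrendered_placeholders(rendered))
```

This also shows why soup.find_all('formatPrice median') finds nothing: the placeholder is text inside a tag, not a tag name, and once the page is rendered the text is replaced by the value.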
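I used selenium + chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://www.sitetotarget.com"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
# Assumed completion: start the browser, load the page, read the rendered HTML
driver = webdriver.Chrome(options=options)
driver.get(url)
html = driver.page_source
driver.quit()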
Building off another answer. I had a similar issue. wget and curl do not work well anymore to get the content of a web page. It's particularly broken with dynamic and lazy content. Using Chrome (or Firefox or Chromium version of Edge) allows you to deal with redirects and scripting.
The code below launches an instance of Chrome, sets the page-load timeout to 5 seconds, and resizes the browser window, ready to navigate to a URL. I ran this from Jupyter.
import time
from tqdm.notebook import trange, tqdm
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()
Example of grabbing dynamic content and taking screenshots of anchor elements (the "a" tag, i.e. hyperlinks):
url = 'http://www.example.org' ## Any website
driver.get(url)
pageSource = driver.page_source
print(driver.get_window_size())
locations = []
for element in driver.find_elements_by_tag_name("a"):
    location = element.location
    size = element.size
    # Collect coordinates of object: left/right, top/bottom
    x1 = location['x']
    y1 = location['y']
    x2 = location['x'] + size['width']
    y2 = location['y'] + size['height']
    locations.append([element, x1, y1, x2, y2, x2-x1, y2-y1])
# Largest elements (width + height) first
locations.sort(key = lambda x: -x[-2] - x[-1])
locations = [ (el,x1,y1,x2,y2, width,height)
    for el,x1,y1,x2,y2,width,height in locations
    if not (
        ## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
        x2 <= x1 or y2 <= y1 or x2<0 or y2<0
        ## Further restrict if you expect the objects to be around a specific size
        ## or width<200 or height<100
    )
]
for el,x1,y1,x2,y2,width,height in tqdm(locations[:10]):
    try:
        print('-'*100,f'({width},{height})')
        print(el.text[:100])
        element_png = el.screenshot_as_png
        with open('/tmp/_pageImage.png', 'wb') as f:
            f.write(element_png)
        img = Image.open('/tmp/_pageImage.png')
        display(img)
    except Exception as err:
        print(err)
Installation for mac+chrome:
pip install selenium
brew install --cask chromedriver
brew install --cask google-chrome
I was using a Mac for the original answer, and Ubuntu plus the Windows 11 preview via WSL2 after updating it. Chrome ran on the Linux side, with an X server on Windows rendering the UI.
Regarding responsibility, please respect robots.txt on each site.
I know this is an old question, but sometimes there is a better solution than using heavy selenium.
This request module for python comes with JS support (in the background it is still chromium) and you can still use beautifulsoup like normal.
Though sometimes, if you have to click elements or something similar, I guess selenium is the only option.
I wrote a program to take a screenshot of a chosen webpage. The user types a URL and my application takes a screenshot of that page. Is it possible to hide the browser window, and if so, how? I mean, not to open it at all but still take the screenshot. Thanks in advance :)
I use python 2.7 and splinter for this. Code below:
from splinter import Browser
import socket

url = raw_input('> ')
browser = None
try:
    browser = Browser('firefox')
    try:
        browser.visit(url)
        if browser.status_code.is_success():
            browser.driver.save_screenshot('picture.png')
    except socket.gaierror, e:
        print "URL not found: %s" % url
finally:
    if browser is not None:
        browser.quit()
For Ubuntu, I found this: Selenium-Python Client Library - Automating in Background but how about Windows?
You have a few options:
Use a "dumb" headless browser, like mechanize. This is fast and fine for a quick visit, but it doesn't render pages or understand JavaScript, so it can't take a screenshot on its own.
Use the zope.testbrowser browser in your splinter tests. It's a headless browser, so it won't appear on screen, but it doesn't execute JavaScript or render pages either, and it will take more investment to get up and going.
Just use urllib2 with some special headers.
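For the last option, a small sketch of a request with custom headers. It uses Python 3's urllib.request (the successor to Python 2's urllib2), and the header values and helper names are only examples. Note that this fetches the raw HTML only: nothing is rendered, so it does not produce a screenshot by itself.

```python
from urllib import request


def build_request(url, user_agent="Mozilla/5.0 (compatible; example-bot)"):
    """Create a Request that carries browser-like headers (values are examples)."""
    headers = {"User-Agent": user_agent, "Accept": "text/html"}
    return request.Request(url, headers=headers)


def fetch_html(url, timeout=10.0):
    """Fetch a page with the custom headers and return its decoded body."""
    with request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Example (needs network access):
#     html = fetch_html("http://www.example.org")
```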
I am working with a URL using Python.
If I click the URL, I get the Excel file,
but if I run the following code, it gives me weird output.
>>> import urllib2
>>> urllib2.urlopen('http://intranet.stats.gov.my/trade/download.php?id=4&var=2012/2012%20MALAYSIA%27S%20EXPORTS%20BY%20ECONOMIC%20GROUPING.xls').read()
output :
"<script language=javascript>window.location='2012/2012 MALAYSIA\\'S EXPORTS BY ECONOMIC GROUPING.xls'</script>"
Why isn't it able to read the content with urllib2?
Take a look using an HTTP listener (or even Google Chrome Developer Tools): there's a redirect using JavaScript when you get to the page.
You will need to access the initial URL, parse the result, and then fetch the actual URL.
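A minimal sketch of the "parse the result" step, assuming the page always redirects with a plain window.location='...' assignment like the one in the question. The function name is made up, and the regular expression is a heuristic, not a JavaScript parser.

```python
import re
from urllib.parse import quote, urljoin

# Captures the string literal assigned to window.location (or window.location.href).
REDIRECT = re.compile(
    r"""window\.location(?:\.href)?\s*=\s*(['"])(?P<target>(?:\\.|(?!\1).)*)\1"""
)


def extract_js_redirect(body: str, base_url: str):
    """Return the absolute URL a JavaScript redirect points to, or None."""
    match = REDIRECT.search(body)
    if match is None:
        return None
    # Undo JavaScript backslash escapes such as \' inside the string literal.
    target = re.sub(r"\\(.)", r"\1", match.group("target"))
    # Percent-encode spaces and quotes, then resolve relative to the page URL.
    return urljoin(base_url, quote(target, safe="/:?=&%"))


body = ("<script language=javascript>window.location="
        "'2012/2012 MALAYSIA\\'S EXPORTS BY ECONOMIC GROUPING.xls'</script>")
print(extract_js_redirect(body, "http://intranet.stats.gov.my/trade/download.php?id=4"))
```

The returned URL can then be passed to a second urlopen call to download the actual file.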
@Kai in this question seems to have found an answer to JavaScript redirects using the Selenium module:
import time
from selenium import webdriver

driver = webdriver.Firefox()
link = "http://yourlink.com"
driver.get(link)
# this waits for the new page to load
while link == driver.current_url:
    time.sleep(1)
redirected_url = driver.current_url
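One caveat with the loop above is that it never ends if the page does not redirect. A sketch of the same polling idea with a timeout follows. The function name is made up, and it takes a callable so it can be tried without a browser. With Selenium you would pass lambda: driver.current_url. Selenium's own WebDriverWait combined with expected_conditions.url_changes is another way to get a bounded wait.

```python
import time


def wait_for_url_change(get_current_url, original_url, timeout=10.0, poll=0.5):
    """Poll until the URL differs from original_url; return it, or None on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        current = get_current_url()
        if current != original_url:
            return current
        time.sleep(poll)
    return None


# With Selenium (driver and link as in the snippet above):
#     redirected_url = wait_for_url_change(lambda: driver.current_url, link)
```

Keep in mind that the browser may normalize the address (for example by adding a trailing slash), so comparing against the URL read from driver.current_url right after driver.get can be more reliable than comparing against the string you typed.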