phantomjs failed to load dynamic page - python

I use Python Selenium with the PhantomJS driver to scrape the landing pages of websites.
One site PhantomJS fails on is http://stance.com. It looks like the site uses AngularJS and Express.
My code is simply:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get('http://www.stance.com')
The response is basically empty. Can anyone provide hints on why it failed, and what I can do to make it work?
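As a quick sanity check before blaming the driver, you can confirm that a page really is client-rendered by looking for framework markers (such as Angular's `ng-` attributes) in the raw HTML shell. A minimal stdlib sketch; the markers and the sample HTML below are illustrative assumptions, not taken from stance.com:

```python
def looks_client_rendered(html: str) -> bool:
    """Heuristic: raw HTML carrying SPA framework markers is
    probably filled in by JavaScript after load."""
    markers = ("ng-app", "ng-controller", "data-reactroot", 'id="app"')
    return any(m in html for m in markers)

# A stripped-down Angular bootstrap page (made-up example):
raw = '<html ng-app="store"><body><div ng-controller="Main"></div></body></html>'
print(looks_client_rendered(raw))  # True: the markup is an Angular shell
```

If this returns True for the raw response, an ordinary HTTP fetch will never contain the rendered content; you need a JavaScript-capable engine and an explicit wait for the content to appear.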

Related

Selenium Python Authenticating issue

To open a page in Selenium where authentication is needed, the code below works:
driver = webdriver.Firefox()
driver.get("https://username:password@testwebsite.com/testpage.html")
driver.implicitly_wait(30)
But the above only works when the page is opened directly from Selenium as the initial page.
In my project, I click a link that opens a page where authentication as above is required.
How can I resolve this? I cannot use driver.get(...) directly because I cannot open this page directly...
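Note that the credentials-in-URL form uses `@` (not `#`) as the separator, and the username and password must be percent-encoded. If you do end up building such a URL yourself (for example, to `driver.get()` the link's target directly), the stdlib can assemble it; the host and credentials below are placeholders:

```python
from urllib.parse import quote, urlsplit, urlunsplit

def with_basic_auth(url: str, user: str, password: str) -> str:
    """Embed user:password into a URL's netloc, percent-encoding both."""
    parts = urlsplit(url)
    netloc = f"{quote(user, safe='')}:{quote(password, safe='')}@{parts.netloc}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(with_basic_auth("https://testwebsite.com/testpage.html", "username", "p@ss"))
# https://username:p%40ss@testwebsite.com/testpage.html
```

Whether the browser accepts credential-bearing URLs for in-page navigation varies by browser and version, so treat this as a sketch rather than a guaranteed fix.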

Is it possible to scrape HTML from Inspect Element in Python?

I am trying to scrape a site that attempts to block scraping. Viewing the source code through Chrome, or fetching it with requests or requests_html, does not show the correct source code.
Here is an example:
from requests_html import HTMLSession
session = HTMLSession()
content = session.get('website')
content.html.render()
print(content.html.html)
It gives this page:
It looks like JavaScript is disabled or not supported by your browser.
Even though JavaScript is enabled. The same thing happens in an actual browser.
However, in my actual browser, when I open Inspect Element, I can see the source code just fine. Is there a way to extract the HTML source that Inspect Element shows?
Thanks!
The issue you are facing is that the page is rendered by JavaScript on the front end. In this case, you need a JavaScript-enabled browser engine, and then you can easily read the HTML source.
Here's a working code of how I would do it (using selenium):
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
driver = webdriver.Chrome(options=chrome_options)
# Ensure that the full URL path is given
URL = 'https://proper_url'
# The following step will launch a browser.
driver.get(URL)
# Now you can easily read the source HTML
HTML = driver.page_source
You will have to figure out the details of installing and setting up Selenium and the matching webdriver; the official Selenium documentation is a good place to start.

Selenium is not loading TikTok pages

I'm implementing a TikTok crawler using selenium and scrapy
start_urls = ['https://www.tiktok.com/trending']
....
def parse(self, response):
    from fake_useragent import UserAgent
    ua = UserAgent()
    options = webdriver.ChromeOptions()
    options.add_argument(f'user-agent={ua.random}')
    options.add_argument('window-size=800x841')
    driver = webdriver.Chrome(options=options)
    driver.get(response.url)
The crawler opens Chrome, but the page's videos do not load (screenshot omitted).
The same problem happens using Firefox (screenshot omitted).
The same problem occurs with a simple standalone Selenium script:
from selenium import webdriver
import time
driver = webdriver.Firefox()
driver.get("https://www.tiktok.com/trending")
time.sleep(10)
driver.close()
driver = webdriver.Chrome()
driver.get("https://www.tiktok.com/trending")
time.sleep(10)
driver.close()
Did you try to navigate further within the Selenium browser window? If a 404 error appears on subsequent pages, I have a solution that worked for me:
I simply changed my user agent to "Naverbot", which is allowed by TikTok's robots.txt file.
After changing that, all pages and videos loaded properly.
Other user agents listed under the "Allow" sections should work too, if you want to add rotation.
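To check which user agents a robots.txt actually allows before hard-coding one, the stdlib's `urllib.robotparser` can evaluate the rules offline. The robots.txt content below is a made-up example for illustration, not TikTok's real file:

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt content (not TikTok's actual rules):
rules = """
User-agent: Naverbot
Allow: /

User-agent: *
Disallow: /trending
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)
print(rp.can_fetch("Naverbot", "https://www.tiktok.com/trending"))   # True
print(rp.can_fetch("MyScraper", "https://www.tiktok.com/trending"))  # False
```

In practice you would load the live file with `rp.set_url(...)` and `rp.read()`; parsing a string here keeps the sketch self-contained.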
You can use Windows IE instead of Chrome or Firefox.
Videos will load in IE, although IE lays out the feed somewhat differently from Chrome and Firefox.
As for reasons why your page is not loading: some advanced web apps check your browser history, profile data, and cache to verify the authenticity of the user.
One other thing you can try is running Selenium with your default browser profile; that can help.

Browser screenshot using python selenium webdriver + phantomjs for a session-managed page

I am trying to capture the output in the browser as an image. The piece of code below works fine:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_window_size(1024, 768)
driver.get('http://google.com')
driver.save_screenshot('screen.png')
But my HTML pages are managed under a browser session, so I need to capture the output of a session-managed page.
I tried supplying the session parameters in the URL passed to
driver.get()
but was not successful.
Is there a way to get past the session?
Thanks in advance.

Can I read the browser url using selenium webdriver?

I am using Python 2.7 with Beautiful Soup 4 and the Selenium webdriver. In my web-automation script, I open a link/URL and land on the home page. Then I click some anchor labels to navigate to other pages; that much works. When I arrive at a new page, I need to get its URL from the browser so I can pass it to Beautiful Soup 4 for scraping. How can I get such URLs dynamically?
Please advise!
You can read the current_url attribute on the driver:
from selenium import webdriver
browser = webdriver.Firefox()
browser.get('http://www.google.com')
print(browser.current_url)
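Once you have `browser.current_url`, you can also pass `browser.page_source` straight to Beautiful Soup instead of re-fetching the URL. To illustrate that hand-off without a live browser, here is a stdlib-only sketch that extracts anchor hrefs from an HTML string (the page content below is a made-up example):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, similar to what
    BeautifulSoup's soup.find_all('a') would return."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

page_source = '<html><body><a href="/page1">One</a> <a href="/page2">Two</a></body></html>'
collector = LinkCollector()
collector.feed(page_source)
print(collector.links)  # ['/page1', '/page2']
```

With Selenium in place, `page_source` would come from the driver after navigation, avoiding a second HTTP request that might lose the session.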
