I'm implementing a TikTok crawler using Selenium and Scrapy:
start_urls = ['https://www.tiktok.com/trending']
....
def parse(self, response):
    from fake_useragent import UserAgent  # better placed at module level
    ua = UserAgent()
    options = webdriver.ChromeOptions()
    options.add_argument(f'user-agent={ua.random}')  # randomize the user agent
    options.add_argument('window-size=800,841')  # Chrome expects width,height, not 800x841
    driver = webdriver.Chrome(options=options)  # chrome_options= is deprecated
    driver.get(response.url)
The crawler opens Chrome, but the page does not load any videos.
The same problem also happens with Firefox: the page does not load at all.
The same problem occurs with a plain standalone Selenium script:
from selenium import webdriver
import time

# Firefox
driver = webdriver.Firefox()
driver.get("https://www.tiktok.com/trending")
time.sleep(10)
driver.close()

# Chrome
driver = webdriver.Chrome()
driver.get("https://www.tiktok.com/trending")
time.sleep(10)
driver.close()
Did you try to navigate further within the Selenium browser window? If a 404 error appears on the pages that follow, I have a solution that worked for me:
I simply changed my user agent to "Naverbot", which is allowed by TikTok's robots.txt.
After changing that, all pages and videos loaded properly.
Other user agents listed under the "Allow" section should work too, if you want to add rotation; see the sketch below.
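For reference, a minimal sketch of that user-agent override (this assumes "Naverbot" is still listed under "Allow" in TikTok's robots.txt; check the current file before relying on it):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('user-agent=Naverbot')  # a crawler allowed by TikTok's robots.txt
driver = webdriver.Chrome(options=options)
driver.get('https://www.tiktok.com/trending')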
You can use Windows IE instead of Chrome or Firefox.
Videos will load in IE, but IE lays out the feed somewhat differently from Chrome and Firefox.
Possible reasons why your page is not loading:
Some advanced web apps check your browser history, profile data, and cache to verify that the user is genuine.
One other thing you can do is run Selenium with your default browser profile; that can help, as in the sketch below.
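A minimal sketch of that profile trick (both the path and the profile folder name are placeholders for your own installation):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('user-data-dir=<path to your Chrome user data directory>')
options.add_argument('profile-directory=Default')  # your everyday profile folder
driver = webdriver.Chrome(options=options)
driver.get('https://www.tiktok.com/trending')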
I'm trying to use Selenium (3.141.0) with ChromeDriver (87.0.4280) to access a page. When accessed manually, it brings me to a policy page (a different URL) where you have to hit 'OK' before continuing to the site. Edit: this is on Windows 10, and the folder containing chromedriver is on my PATH.
When using the following code, I'm able to get to the policy page with the --headless option, but without it I get a blank page with 'data:,' in the URL and nothing else loads. I've tried accessing both the policy page URL and the site URL directly, but both get stuck when the webdriver is created. Am I missing something? I'm open to any suggestions, thanks!
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
driver_path = r'D:\....\chromedriver.exe'  # raw string so the backslashes aren't treated as escapes
driver = webdriver.Chrome(executable_path=driver_path, options=chrome_options)
driver.get(...)  # left out the url
(Screenshot: the output page I get without --headless.)
Funnily enough, I realized it was because my Chrome developer tools had become disabled. I'm not sure how that happened, but when I re-enabled them, it worked perfectly again. Weird.
I developed an application that uses Selenium WebDriver to open some pages. It works perfectly locally, but I also need to launch the browser on the client side.
I deployed the application with Apache2 under Ubuntu 18.
from selenium import webdriver

chromeOptions = webdriver.ChromeOptions()
driver = webdriver.Chrome(executable_path="chromedriver", options=chromeOptions)  # chrome_options= is deprecated

# Specify the URL (include the scheme; a bare "www.google.com" raises InvalidArgumentException).
url = "https://www.google.com"
driver.get(url)
Download ChromeDriver from here,
then extract it to /usr/bin.
If you have problems importing Selenium, see this video:
from selenium import webdriver
driver = webdriver.Chrome()
driver.set_page_load_timeout(30)
driver.get("https://www.facebook.com/")
driver.maximize_window()
driver.quit()
I want to iteratively search for 30+ items through the search box on a web page and scrape the related data.
My search items are stored in a list: vol_list
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome("driver path")
driver.get("web url")

for item in vol_list:
    search_box = driver.find_element_by_name("search_str")  # re-find it each iteration so it isn't stale
    search_box.clear()
    search_box.send_keys(item)
    search_box.send_keys(Keys.RETURN)
After each search completes, I will scrape the data for that item and store it in an array/list.
Is it possible to repeat this process without opening a browser for every item in the loop?
You can't use Chrome or the other regular browsers without opening them.
In your case, a headless browser should do the job. A headless browser simulates a real browser but has no GUI.
Try GhostDriver (PhantomJS), the HtmlUnit driver, or a Node.js-based headless browser. Then you will have to modify at least this line to use the driver you choose:
driver = webdriver.Chrome("driver path")
Good luck!
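For example, a minimal PhantomJS sketch via GhostDriver (PhantomJS support is deprecated in newer Selenium releases, so treat this as illustrative only):

from selenium import webdriver

driver = webdriver.PhantomJS("driver path")  # path to the phantomjs executable
driver.get("web url")
# ...run your search loop here exactly as before...
driver.quit()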
If you're using Firefox, you can apply the headless option:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
options = Options()
options.add_argument("--headless")
driver = webdriver.Firefox(options=options)
driver.get('your url')
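Chrome accepts the same flag; a minimal sketch:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
driver.get('your url')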
When Selenium opens a web page, it starts with a fresh profile and none of the cookies you have saved in your browser, which is inconvenient.
I found a Java solution on this page,
but I don't know how to solve the problem using Python.
Point the browser at an existing profile directory (that's what the Java example in the linked page does):
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('user-data-dir=<path to chrome profile>')
browser = webdriver.Chrome(options=chrome_options)  # chrome_options= is deprecated
On Linux, the default <path to chrome profile> is /home/<user>/.config/google-chrome.
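If you use Firefox instead, the equivalent is to load an existing profile (a sketch; the path is a placeholder):

from selenium import webdriver

profile = webdriver.FirefoxProfile('<path to firefox profile>')
driver = webdriver.Firefox(firefox_profile=profile)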
I am trying to capture the browser output as an image. The piece of code below works fine:
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_window_size(1024, 768)
driver.get('http://google.com')
driver.save_screenshot('screen.png')
But my HTML pages are managed under a browser session, so I need to capture the output of a session-managed page.
I tried supplying the session parameters in the URL passed to driver.get(), but that was not successful.
Is there a way to bypass the session?
Thanks in advance.
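One common approach, sketched here assuming the session is cookie-based (the cookie name and value are placeholders for your real session cookie): load any page on the site first, inject the session cookie, then request the protected page.

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_window_size(1024, 768)

# Cookies can only be added for the domain that is currently loaded,
# so open any page on the target site first.
driver.get('http://your-site.example')

# Inject the existing session cookie, then load the session-managed page.
driver.add_cookie({'name': 'sessionid', 'value': '<your session id>'})
driver.get('http://your-site.example/session-managed-page')
driver.save_screenshot('screen.png')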