I am trying to write a Python script that changes YouTube's language and saves that choice, so that the next time I open the browser through WebDriver, YouTube shows the page in the language I saved. The problem is that my script is not working as expected and I am not sure why.
Selecting the language and saving the cookies in the following code:
import pickle
import time
from selenium import webdriver

url = 'https://www.youtube.com'
print(url)
# open web browser
browser = webdriver.Firefox()
# load the specific url
browser.get(url)
# wait for the js to load
time.sleep(5)
# find the language picker and click it
browser.find_element_by_xpath('//*[@id="yt-picker-language-button"]').click()
# wait for the language list to open
time.sleep(2)
# find and click the specific language
browser.find_element_by_xpath('//*[@id="yt-picker-language-footer"]/div[2]/form/div/div[1]/button[1]').click()
# save the cookies
pickle.dump(browser.get_cookies(), open("youtubeCookies.pkl", "wb"))
Loading the data from the cookies:
import pickle
import time
from selenium import webdriver

url = 'https://www.youtube.com'
print(url)
driver = webdriver.Firefox()
driver.get(url)
# restore the saved cookies
for cookie in pickle.load(open("youtubeCookies.pkl", "rb")):
    driver.add_cookie(cookie)
time.sleep(3)
driver.refresh()
Please guide me as to what I am doing wrong.
Thank you.
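As a side check that needs no browser: the pickle round-trip works on plain dicts, and `add_cookie()` only accepts cookies whose domain matches the page currently loaded, so filtering first is a common precaution. The cookie values below are made up for illustration:

```python
import pickle

# Made-up stand-ins for what browser.get_cookies() returns: a list of dicts.
cookies = [
    {"name": "PREF", "value": "hl=de", "domain": ".youtube.com"},
    {"name": "other", "value": "x", "domain": ".example.com"},
]

# Round-trip through pickle, mirroring the dump()/load() calls in the script
# (bytes are used here instead of a file for brevity).
restored = pickle.loads(pickle.dumps(cookies))
assert restored == cookies

# add_cookie() raises InvalidCookieDomainException for cookies whose domain
# does not match the currently loaded page, so filter before restoring:
def usable(cookies, domain=".youtube.com"):
    return [c for c in cookies if c.get("domain", "").endswith(domain)]

print([c["name"] for c in usable(restored)])  # ['PREF']
```

If `add_cookie` still fails after the filter, check whether the page was loaded on the matching domain before the cookies were added.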
I have been running into a lot of issues when trying to do some Python web scraping with BeautifulSoup. Since this particular web page is dynamic, I have been trying to use Selenium first to "open" the page before working with the dynamic content in BeautifulSoup.
The issue I am getting is that the dynamic content is only showing up in my HTML output when I manually scroll through the website upon running the program, otherwise those parts of the HTML remain empty as if I was just using BeautifulSoup by itself without Selenium.
Here is my code:
import time
from bs4 import BeautifulSoup
from selenium import webdriver

if __name__ == "__main__":
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    # options.add_argument('--headless')
    # raw string so the backslashes in the path are not treated as escapes
    driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe", options=options)
    driver.get('https://coinmarketcap.com/')
    time.sleep(5)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.tbody
    trs = tbody.contents
    for tr in trs:
        print(tr)
    driver.close()
Now if I have Selenium open Chrome with the headless option turned on, I get the same output I would normally get without having pre-loaded the page. The same thing happens if I'm not in headless mode and I simply let the page load by itself, without scrolling through the content manually.
Does anyone know why this is? Is there a way to get the dynamic content to load without manually scrolling through each time I run the code?
Actually, the data is loaded dynamically by JavaScript, so you can grab it easily from the JSON response of the underlying API call.
Here is a working example:
import requests

url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
r = requests.get(url)
for item in r.json()['data']['cryptoCurrencyList']:
    name = item['name']
    print('crypto_name:' + str(name))
Output:
crypto_name:Bitcoin
crypto_name:Ethereum
crypto_name:Binance Coin
crypto_name:Cardano
crypto_name:Tether
crypto_name:Solana
crypto_name:XRP
crypto_name:Polkadot
crypto_name:USD Coin
crypto_name:Dogecoin
crypto_name:Terra
crypto_name:Uniswap
crypto_name:Wrapped Bitcoin
crypto_name:Litecoin
crypto_name:Avalanche
crypto_name:Binance USD
crypto_name:Chainlink
crypto_name:Bitcoin Cash
crypto_name:Algorand
crypto_name:SHIBA INU
crypto_name:Polygon
crypto_name:Stellar
crypto_name:VeChain
crypto_name:Internet Computer
crypto_name:Cosmos
crypto_name:FTX Token
crypto_name:Filecoin
crypto_name:Axie Infinity
crypto_name:Ethereum Classic
crypto_name:TRON
crypto_name:Bitcoin BEP2
crypto_name:Dai
crypto_name:THETA
crypto_name:Tezos
crypto_name:Fantom
crypto_name:Hedera
crypto_name:NEAR Protocol
crypto_name:Elrond
crypto_name:Monero
crypto_name:Crypto.com Coin
crypto_name:PancakeSwap
crypto_name:EOS
crypto_name:The Graph
crypto_name:Flow
crypto_name:Aave
crypto_name:Klaytn
crypto_name:IOTA
crypto_name:eCash
crypto_name:Quant
crypto_name:Bitcoin SV
crypto_name:Neo
crypto_name:Kusama
crypto_name:UNUS SED LEO
crypto_name:Waves
crypto_name:Stacks
crypto_name:TerraUSD
crypto_name:Harmony
crypto_name:Maker
crypto_name:BitTorrent
crypto_name:Celo
crypto_name:Helium
crypto_name:OMG Network
crypto_name:THORChain
crypto_name:Dash
crypto_name:Amp
crypto_name:Zcash
crypto_name:Compound
crypto_name:Chiliz
crypto_name:Arweave
crypto_name:Holo
crypto_name:Decred
crypto_name:NEM
crypto_name:Theta Fuel
crypto_name:Enjin Coin
crypto_name:Revain
crypto_name:Huobi Token
crypto_name:OKB
crypto_name:Decentraland
crypto_name:SushiSwap
crypto_name:ICON
crypto_name:XDC Network
crypto_name:Qtum
crypto_name:TrueUSD
crypto_name:yearn.finance
crypto_name:Nexo
crypto_name:Celsius
crypto_name:Bitcoin Gold
crypto_name:Curve DAO Token
crypto_name:Mina
crypto_name:KuCoin Token
crypto_name:Zilliqa
crypto_name:Perpetual Protocol
crypto_name:Ren
crypto_name:dYdX
crypto_name:Ravencoin
crypto_name:Synthetix
crypto_name:renBTC
crypto_name:Telcoin
crypto_name:Basic Attention Token
crypto_name:Horizen
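Since the shape of the JSON payload is known, the extraction loop can also be checked offline against a stubbed response. The stub below contains only the name field; everything else in the real payload is omitted:

```python
# Stub shaped like the CoinMarketCap listing payload used above.
stub = {
    "data": {
        "cryptoCurrencyList": [
            {"name": "Bitcoin"},
            {"name": "Ethereum"},
            {"name": "Binance Coin"},
        ]
    }
}

def crypto_names(payload):
    # Same traversal as the loop above: data -> cryptoCurrencyList -> name
    return [item["name"] for item in payload["data"]["cryptoCurrencyList"]]

print(crypto_names(stub))  # ['Bitcoin', 'Ethereum', 'Binance Coin']
```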
I am trying to extract data from https://www.realestate.com.au/
First I create my url based on the type of property that I am looking for and then I open the url using selenium webdriver, but the page is blank!
Any idea why this happens? Is it because this website doesn't permit web scraping? Is there any way to scrape it?
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
PostCode = "2153"
propertyType = "house"
minBedrooms = "3"
maxBedrooms = "4"
page = "1"
url = "https://www.realestate.com.au/sold/property-{p}-with-{mib}-bedrooms-in-{po}/list-{pa}?maxBeds={mab}&includeSurrounding=false".format(p = propertyType, mib = minBedrooms, po = PostCode, pa = page, mab = maxBedrooms)
print(url)
# url should be "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
driver = webdriver.Edge("./msedgedriver.exe") # edit the address to where your driver is located
driver.get(url)
time.sleep(3)
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
print(soup)
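For what it's worth, the URL construction can be checked without a browser: an f-string version (equivalent to the .format() call above) produces exactly the URL in the comment, which suggests the blank page is not a URL problem:

```python
post_code = "2153"
property_type = "house"
min_bedrooms = "3"
max_bedrooms = "4"
page = "1"

# f-string equivalent of the .format() call in the question
url = (f"https://www.realestate.com.au/sold/property-{property_type}"
       f"-with-{min_bedrooms}-bedrooms-in-{post_code}/list-{page}"
       f"?maxBeds={max_bedrooms}&includeSurrounding=false")

expected = ("https://www.realestate.com.au/sold/property-house-with-3-bedrooms"
            "-in-2153/list-1?maxBeds=4&includeSurrounding=false")
assert url == expected
print(url)
```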
You are passing the link incorrectly. Try:
driver.get("your link")
API reference: https://selenium-python.readthedocs.io/api.html?highlight=get#:~:text=ef_driver.get(%22http%3A//www.google.co.in/%22)
I did try to access realestate.com.au through Selenium, and in a different use case through Scrapy.
I even got results from Scrapy crawling by using a proper user-agent and cookie, but after a few days realestate.com.au detects Selenium / Scrapy and blocks the requests.
Additionally, it is clearly written in their terms & conditions that indexing any content on their website is strictly prohibited.
You can find more information / analysis in these questions:
Chrome browser initiated through ChromeDriver gets detected
selenium isn't loading the page
Bottom line: you would have to bypass their security if you want to scrape the content.
I'm writing a crawler for McMaster-Carr. For example, the page https://www.mcmaster.com/98173A200 , if I open the page directly in browser, I can view all the product data.
Because the data is in dynamically loaded content, I'm using Selenium + bs4.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

if __name__ == "__main__":
    url = "https://www.mcmaster.com/98173A200"
    options = webdriver.ChromeOptions()
    options.add_argument("--enable-javascript")
    driver = webdriver.Chrome("C:/chromedriver/chromedriver.exe", options=options)
    driver.set_page_load_timeout(20)
    driver.get(url)
    delay = 20
    try:
        email_input = WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.ID, 'MainContent')))
    except TimeoutException:
        print("Timeout loading DOM!")
    # grab the source only after the wait, so the dynamic content is included
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup)
However, if I run the code I would get a login dialog, which I wouldn't get if I open this page directly in a browser like I mentioned.
I also tried logging in with the code below
try:
    email_input = WebDriverWait(driver, delay).until(
        EC.presence_of_element_located((By.ID, 'Email')))
    print("Page is ready!!")
    input("Press Enter to continue...")
except TimeoutException:
    print("Loading took too much time!")

email_input.send_keys(email)
password_input = driver.find_element_by_id('Password')
password_input.send_keys(password)
login_button = driver.find_element_by_class_name("FormButton_primaryButton__1kNXY")
login_button.click()
Then it shows access restricted.
I compared the request headers in the page opened by Selenium with those in my browser and couldn't find anything wrong. I also tried other webdrivers like PhantomJS and Firefox, with the same result.
I also tried using random user-agent using the code below
from random_user_agent.user_agent import UserAgent
from random_user_agent.params import SoftwareName, OperatingSystem
software_names = [SoftwareName.CHROME.value]
operating_systems = [OperatingSystem.WINDOWS.value, OperatingSystem.LINUX.value]
user_agent_rotator = UserAgent(software_names=software_names,
operating_systems=operating_systems,
limit=100)
user_agent = user_agent_rotator.get_random_user_agent()
chrome_options = Options()
chrome_options.add_argument('user-agent=' + user_agent)
Still same result.
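For reference, what the random_user_agent package does boils down to a random pick from a list of UA strings; a minimal stand-in looks like this (the UA strings below are examples only, not taken from the package):

```python
import random

# Example desktop Chrome user-agent strings (illustrative, not exhaustive).
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36",
]

# Same argument format that chrome_options.add_argument() receives above.
chrome_arg = "user-agent=" + random.choice(user_agents)
print(chrome_arg)
```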
The developer tools in the page opened by Selenium showed a bunch of errors. I guess the tokenauthorization one is the key to this issue, but I don't know what I should do with it.
Any help would be appreciated!
The reason you see a login window is that you are accessing McMaster-Carr via ChromeDriver. When the server recognizes this behaviour, it requires you to sign in.
A typical login won't work if you haven't been authorized by McMaster-Carr (you need to sign an NDA).
You should look into the McMaster-Carr API. With the API, you can access the data directly. However, you need to sign an NDA with McMaster-Carr before obtaining access to it: https://www.mcmaster.com/help/api/
I'm writing a Python crawler using the Selenium library and the PhantomJS browser. I trigger a click event on a page to open a new page, and then I use the browser.page_source method, but I get the original page's source instead of the newly opened page's. How can I get the source of the newly opened page?
Here's my code:
from selenium import webdriver

url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
html = browser.page_source
print(html)
browser.quit()
You need to switch to the new window first:
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
browser.switch_to.window(browser.window_handles[-1])
html = browser.page_source
I believe you need to add a wait before getting the page source.
I've used an implicit wait in the code below.
from selenium import webdriver

url = 'https://sf.taobao.com/list/50025969__2__%D5%E3%BD%AD.htm?auction_start_seg=-1&page=150'
browser = webdriver.PhantomJS(executable_path='C:\\ProgramData\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
browser.get(url)
browser.find_element_by_xpath("//*[@class='pai-item pai-status-done']").click()
browser.implicitly_wait(5)
html = browser.page_source
browser.quit()
It is better to use an explicit wait, but that requires a condition such as EC.element_to_be_clickable((By.ID, 'someid')).
I am quite new to Selenium; it would be great if you could point me in the right direction.
I'm trying to access the HTML code of a website AFTER the login sequence.
I've used Selenium to direct the browser to initiate the login sequence, the part of the HTML I need will show up after I login. But when I tried to call the HTML code after the login sequence with page_source, it just gave me the HTML code for the site before logging in.
def test_script(ticker):
    base_url = "http://amigobulls.com/stocks/%s/income-statement/quarterly" % ticker
    driver = webdriver.Firefox()
    verificationErrors = []
    accept_next_alert = True
    driver.get(base_url)
    driver.maximize_window()
    driver.implicitly_wait(30)
    driver.find_element_by_xpath("//header[@id='header_cont']/nav/div[4]/div/span[3]").click()
    driver.find_element_by_id("login_email").clear()
    driver.find_element_by_id("login_email").send_keys(email)
    driver.find_element_by_id("login_pswd").clear()
    driver.find_element_by_id("login_pswd").send_keys(pwd)
    driver.find_element_by_id("loginbtn").click()
    amigo_script = driver.page_source