I wrote the following hybrid Selenium + BeautifulSoup code.
import time
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome(executable_path=r"C:\chromedriver.exe")
driver.get("https://example.com/welcome")
for i in range(6):
    source = driver.page_source                 # grab the currently rendered HTML
    soup = BeautifulSoup(source, "html.parser")
    print(soup.prettify())
    time.sleep(5)
driver.quit()
This code gets me to the welcome page, where I manually enter my username and password. I can then read the HTML and, using bs4, capture a dynamic value. The value on the website is sent to my browser at about 2 Hz. If I reduce the sleep time and make the number of loop passes much larger than 6, am I limited to the refresh rate of the server?
In other words, does my loop (each call to driver.page_source) request data from the server, or does it only read what the server sends at its own rate? I do not want my code to tie up the server by repeatedly asking for updated data with multiple driver.page_source calls.
Here is the site I am trying to scrape data from:
https://www.onestopwineshop.com/collection/type/red-wines
import requests
from bs4 import BeautifulSoup
url = "https://www.onestopwineshop.com/collection/type/red-wines"
response = requests.get(url)
#print(response.text)
soup = BeautifulSoup(response.content,'lxml')
That is the code I have so far.
The HTML content I get from the browser inspector seems different from what I get via BeautifulSoup.
My guess is that the site detected I am not accessing it with a browser and is preventing me from getting the data. If so, is there any way to bypass that?
(Update) Attempt with selenium:
from selenium import webdriver
import time
path = "C:\Program Files (x86)\chromedriver.exe"
# start web browser
browser=webdriver.Chrome(path)
#navigate to the page
url = "https://www.onestopwineshop.com/collection/type/red-wines"
browser.get(url)
# sleep the required amount to let the page load
time.sleep(3)
# get source code
html = browser.page_source
# close web browser
browser.close()
Update 2: (page loaded with devtools)
Any website with content that is loaded after the initial page load is unavailable to BS4 with your current method. This is because that content is loaded with an AJAX call via JavaScript, and the requests library cannot parse or run JS code.
To handle this you will have to look at something like Selenium, which controls a real browser from Python (or other languages). There is a separate driver for each browser, i.e. Firefox, Chrome, etc.
Personally I use Chrome, so the drivers can be found here:
https://chromedriver.chromium.org/downloads
download the correct driver for your version of Chrome
install selenium via pip
create a scrape.py file and put the driver in the same folder.
Then, to get the HTML string to use with bs4:
from selenium import webdriver
import time
# start web browser
browser = webdriver.Chrome()
#navigate to the page
browser.get('http://selenium.dev/')
# sleep the required amount to let the page load
time.sleep(2)
# get source code
html = browser.page_source
# close web browser
browser.close()
You should then be able to use the html variable with BS4
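For instance, a minimal continuation of the snippet above (just a sanity check; the tags I pull out are only examples):
from bs4 import BeautifulSoup

# 'html' is the page source captured by the Selenium snippet above
soup = BeautifulSoup(html, "html.parser")

# Quick sanity check: print the page title and every link target
if soup.title:
    print(soup.title.get_text())
for a in soup.find_all("a", href=True):
    print(a["href"])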
I'll actually turn my comment into an answer, because it is a solution to your problem:
As others said, this page is loaded dynamically, but there are ways to retrieve the data without running JavaScript. In your case, you want to look at the "network" tab of your dev tools and filter for "fetch" requests.
One request in particular is interesting for you: you don't need selenium or beautifulsoup at all, you can just use requests and parse the JSON, if you are good enough ;)
Here is a working cURL request:
curl 'https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines' -H 'tenant: one-stop-wine-shop'
You get an error if you don't add the tenant header.
And that's it: no HTML parsing, no waiting for the page to load, no JavaScript running. Much more powerful than the selenium solution.
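For completeness, here is a rough Python translation of that cURL request using requests. The exact shape of the returned JSON is an assumption, so this sketch just dumps the top-level keys:
import requests

# Same endpoint and required 'tenant' header as the cURL request above
url = "https://api.commerce7.com/v1/product/for-web"
params = {"collectionSlug": "red-wines"}
headers = {"tenant": "one-stop-wine-shop"}

response = requests.get(url, params=params, headers=headers)
response.raise_for_status()

data = response.json()
# The JSON structure is not documented here, so inspect the top-level keys first
print(list(data.keys()))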
I have been running into a lot of issues trying to do some Python web scraping with BeautifulSoup. Since this particular web page is dynamic, I have been trying to use Selenium first to "open" the web page before working with the dynamic content in BeautifulSoup.
The issue is that the dynamic content only shows up in my HTML output when I manually scroll through the website while the program runs; otherwise those parts of the HTML remain empty, as if I were just using BeautifulSoup by itself without Selenium.
Here is my code:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
if __name__ == "__main__":
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
# options.add_argument('--headless')
driver = webdriver.Chrome("C:\Program Files (x86)\chromedriver.exe", chrome_options=options)
driver.get('https://coinmarketcap.com/')
time.sleep(5)
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
tbody = soup.tbody
trs = tbody.contents
for tr in trs:
print(tr)
driver.close()
Now if I have Selenium open Chrome with the headless option turned on, I get the same output I would normally get without having pre-loaded the page. The same thing happens if I'm not in headless mode and I simply let the page load by itself, without scrolling through the content manually.
Does anyone know why this is? Is there a way to get the dynamic content to load without manually scrolling through each time I run the code?
Actually, the data is loaded dynamically by JavaScript, so you can grab it easily from the API call's JSON response.
Here is the working example:
Code:
import requests

url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
r = requests.get(url)

for item in r.json()['data']['cryptoCurrencyList']:
    name = item['name']
    print('crypto_name:' + str(name))
Output:
crypto_name:Bitcoin
crypto_name:Ethereum
crypto_name:Binance Coin
crypto_name:Cardano
crypto_name:Tether
crypto_name:Solana
crypto_name:XRP
crypto_name:Polkadot
crypto_name:USD Coin
crypto_name:Dogecoin
crypto_name:Terra
crypto_name:Uniswap
crypto_name:Wrapped Bitcoin
crypto_name:Litecoin
crypto_name:Avalanche
crypto_name:Binance USD
crypto_name:Chainlink
crypto_name:Bitcoin Cash
crypto_name:Algorand
crypto_name:SHIBA INU
crypto_name:Polygon
crypto_name:Stellar
crypto_name:VeChain
crypto_name:Internet Computer
crypto_name:Cosmos
crypto_name:FTX Token
crypto_name:Filecoin
crypto_name:Axie Infinity
crypto_name:Ethereum Classic
crypto_name:TRON
crypto_name:Bitcoin BEP2
crypto_name:Dai
crypto_name:THETA
crypto_name:Tezos
crypto_name:Fantom
crypto_name:Hedera
crypto_name:NEAR Protocol
crypto_name:Elrond
crypto_name:Monero
crypto_name:Crypto.com Coin
crypto_name:PancakeSwap
crypto_name:EOS
crypto_name:The Graph
crypto_name:Flow
crypto_name:Aave
crypto_name:Klaytn
crypto_name:IOTA
crypto_name:eCash
crypto_name:Quant
crypto_name:Bitcoin SV
crypto_name:Neo
crypto_name:Kusama
crypto_name:UNUS SED LEO
crypto_name:Waves
crypto_name:Stacks
crypto_name:TerraUSD
crypto_name:Harmony
crypto_name:Maker
crypto_name:BitTorrent
crypto_name:Celo
crypto_name:Helium
crypto_name:OMG Network
crypto_name:THORChain
crypto_name:Dash
crypto_name:Amp
crypto_name:Zcash
crypto_name:Compound
crypto_name:Chiliz
crypto_name:Arweave
crypto_name:Holo
crypto_name:Decred
crypto_name:NEM
crypto_name:Theta Fuel
crypto_name:Enjin Coin
crypto_name:Revain
crypto_name:Huobi Token
crypto_name:OKB
crypto_name:Decentraland
crypto_name:SushiSwap
crypto_name:ICON
crypto_name:XDC Network
crypto_name:Qtum
crypto_name:TrueUSD
crypto_name:yearn.finance
crypto_name:Nexo
crypto_name:Celsius
crypto_name:Bitcoin Gold
crypto_name:Curve DAO Token
crypto_name:Mina
crypto_name:KuCoin Token
crypto_name:Zilliqa
crypto_name:Perpetual Protocol
crypto_name:Ren
crypto_name:dYdX
crypto_name:Ravencoin
crypto_name:Synthetix
crypto_name:renBTC
crypto_name:Telcoin
crypto_name:Basic Attention Token
crypto_name:Horizen
I am trying to extract data from https://www.realestate.com.au/
First I build my url based on the type of property I am looking for, and then I open the url with the Selenium webdriver, but the page is blank!
Any idea why it happens? Is it because this website doesn't provide web scraping permission? Is there any way to scrape this website?
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
PostCode = "2153"
propertyType = "house"
minBedrooms = "3"
maxBedrooms = "4"
page = "1"
url = "https://www.realestate.com.au/sold/property-{p}-with-{mib}-bedrooms-in-{po}/list-{pa}?maxBeds={mab}&includeSurrounding=false".format(p = propertyType, mib = minBedrooms, po = PostCode, pa = page, mab = maxBedrooms)
print(url)
# url should be "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
driver = webdriver.Edge("./msedgedriver.exe") # edit the address to where your driver is located
driver.get(url)
time.sleep(3)
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
print(soup)
You are passing the link incorrectly; try:
driver.get("your link")
API docs: https://selenium-python.readthedocs.io/api.html?highlight=get#:~:text=ef_driver.get(%22http%3A//www.google.co.in/%22)
I did try to access realestate.com.au through Selenium, and in a different use case through Scrapy.
I even got results from Scrapy crawling by using a proper user-agent and cookie, but after a few days realestate.com.au detects Selenium / Scrapy and blocks the requests.
Additionally, it is clearly written in their terms & conditions that indexing any content on their website is strictly prohibited.
You can find more information / analysis in these questions:
Chrome browser initiated through ChromeDriver gets detected
selenium isn't loading the page
Bottom line is, you have to get past their bot detection if you want to scrape the content.
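For what it's worth, the user-agent tweak mentioned above looks roughly like this with Selenium's Chrome options. The user-agent string is only an illustrative example, and as noted, the site may still detect and block automation:
from selenium import webdriver

options = webdriver.ChromeOptions()
# Illustrative user-agent string only; this does not guarantee the site won't block you
options.add_argument(
    "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
)

driver = webdriver.Chrome(options=options)
driver.get("https://www.realestate.com.au/")
print(driver.title)
driver.quit()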
I'm trying to scrape data from a paginated table. The table can only be accessed by logging in to a user account, so I've decided to log in with Selenium. I then hope to read the table into a Pandas DataFrame, with BeautifulSoup as a go-between.
Here is my code:
from selenium import webdriver
import time
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.example.com/userarea"
driver = webdriver.Chrome()
time.sleep(6)
driver.get(url)
time.sleep(6)
username = driver.find_element_by_id("user")
username.clear()
username.send_keys("xyz#email.com")
password = driver.find_element_by_id("password")
password.clear()
password.send_keys('password')
driver.find_element_by_xpath('//button[]').click()
driver.find_element_by_xpath('//button[text()="Log in"]').click()
time.sleep(6)
driver.find_element_by_xpath('//span[text()="Text"]').click()
driver.find_element_by_xpath('//span[text()="Text"]').click()
html = driver.page_source
soup = BeautifulSoup(html,'html.parser')
try:
    tables = soup.find_all('th')
    print(tables)  # Returns an empty list
    df = pd.read_html(str(tables))
    df.head()
except:
    driver.close()
driver.close()
Unfortunately, this only prints an empty list. I've tried using lxml too, but no joy.
Using the inspection tools it does seem that there aren't any table tags, so I tried to find all <th> tags instead (which definitely are present). Again, no joy. I've not yet tried to work through the individual pages; I only mention the pagination in case it offers a clue to the issue.
Any idea what I'm missing?
Thank you to those who offered suggestions. In the end furas' suggestion was the best fit, and it turned out the script was running too quickly. I paused Python for 6 seconds after clicking on the page containing the table. The table seems to be populated by JavaScript, and I can actually see the values pop into place now as the script works through the pagination.
import time
#Navigate to page, then let it load using:
time.sleep(6)
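As an aside, an explicit wait is a common alternative to a fixed sleep. A rough sketch, assuming the table cells really are <th> elements as described in the question (the URL is the placeholder from the question):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.example.com/userarea")  # placeholder URL from the question

# Wait up to 10 seconds for the first <th> to be present instead of sleeping blindly
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "th"))
)
html = driver.page_source
driver.quit()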
I'm using Python requests and BeautifulSoup to verify an HTML document. However, the server for the landing page has some backend code that delays several seconds before presenting the final HTML document. I've tried following redirects (allow_redirects=True) but I end up with the original document. When loading the url in a browser, there is a 2-3 second delay while the page is created by the server. I've tried various samples, like url2.geturl() after the page load, but all of these return the original url (and do so well before the 2-3 seconds elapse). I need something that emulates a browser and grabs the final document.
Btw, I am able to view the correct DOM elements in Chrome, just not programmatically in Python.
Figured this out after a few cycles. It requires a combination of two things: the Python selenium package and time.sleep. Set the background Chrome process to run headless, get the url, wait for the server-side code to complete, then load the document. Here, I'm using BeautifulSoup to parse the DOM.
from selenium import webdriver
from bs4 import BeautifulSoup
import time
def run():
    url = "http://192.168.1.55"

    options = webdriver.ChromeOptions()
    options.add_argument('headless')                      # run Chrome in the background
    browser = webdriver.Chrome(chrome_options=options)

    browser.get(url)
    time.sleep(5)                                         # wait for the server-side code to finish

    bs = BeautifulSoup(browser.page_source, 'html.parser')
    data = bs.find_all('h3')

if __name__ == "__main__":
    run()