https://www.realestate.com.au/ not permitting web scraping? - python

I am trying to extract data from https://www.realestate.com.au/
First I create my URL based on the type of property I am looking for, and then I open the URL with the Selenium webdriver, but the page is blank!
Any idea why this happens? Is it because this website doesn't allow web scraping? Is there any way to scrape this website?
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
PostCode = "2153"
propertyType = "house"
minBedrooms = "3"
maxBedrooms = "4"
page = "1"
url = "https://www.realestate.com.au/sold/property-{p}-with-{mib}-bedrooms-in-{po}/list-{pa}?maxBeds={mab}&includeSurrounding=false".format(p = propertyType, mib = minBedrooms, po = PostCode, pa = page, mab = maxBedrooms)
print(url)
# url should be "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
driver = webdriver.Edge("./msedgedriver.exe") # edit the address to where your driver is located
driver.get(url)
time.sleep(3)
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
print(soup)

You are passing the link incorrectly; try it like this:
driver.get("your link")
API reference: https://selenium-python.readthedocs.io/api.html?highlight=get

I did try to access realestate.com.au through selenium, and in a different use case through scrapy.
I even got results from Scrapy crawling by using a proper user-agent and cookie, but after a few days realestate.com.au detected Selenium/Scrapy and blocked the requests.
Additionally, it is clearly written in their terms & conditions that indexing any content on their website is strictly prohibited.
You can find more information / analysis in these questions:
Chrome browser initiated through ChromeDriver gets detected
selenium isn't loading the page
Bottom line: you have to get past their bot detection if you want to scrape the content.
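For reference, the user-agent/cookie tweaking mentioned above looks roughly like this with requests; a minimal sketch, and the header values are placeholders, not ones known to get past this site's detection:
import requests

# Placeholder values -- copy a real User-Agent and cookie from your own browser session.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36",
    "Cookie": "session=value-copied-from-browser",
}

url = "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
response = requests.get(url, headers=headers)
print(response.status_code)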

Related

Why is the html content I got from inspector different from what I got from Request?

Here is the site I am trying to scrape data from:
https://www.onestopwineshop.com/collection/type/red-wines
import requests
from bs4 import BeautifulSoup
url = "https://www.onestopwineshop.com/collection/type/red-wines"
response = requests.get(url)
#print(response.text)
soup = BeautifulSoup(response.content,'lxml')
That is the code I have so far.
It seems like the HTML content I get from the inspector is different from what I get with BeautifulSoup.
My guess is that they are preventing me from getting their data because they detected I am not accessing the site with a browser. If so, is there any way to bypass that?
(Update) Attempt with selenium:
from selenium import webdriver
import time
path = "C:\Program Files (x86)\chromedriver.exe"
# start web browser
browser=webdriver.Chrome(path)
#navigate to the page
url = "https://www.onestopwineshop.com/collection/type/red-wines"
browser.get(url)
# sleep the required amount to let the page load
time.sleep(3)
# get source code
html = browser.page_source
# close web browser
browser.close()
Update 2: (screenshot of the page as loaded with devtools)
Any content that is loaded after the initial page load is unavailable to BS4 with your current method. That is because such content is loaded by JavaScript via AJAX calls, and the requests library cannot parse and run JS code.
To get it, you will have to look at something like Selenium, which controls a real browser from Python (or other languages). There is a separate driver for each browser, i.e. Firefox, Chrome, etc.
Personally I use chrome so the drivers can be found here...
https://chromedriver.chromium.org/downloads
download the correct driver for your version of chrome
install selenium via pip
create a scrape.py file and put the driver in the same folder.
then to get the html string to use with bs4
from selenium import webdriver
import time
# start web browser
browser=webdriver.Chrome()
#navigate to the page
browser.get('http://selenium.dev/')
# sleep the required amount to let the page load
time.sleep(2)
# get source code
html = browser.page_source
# close web browser
browser.close()
You should then be able to use the html variable with BS4
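For example (a minimal sketch; it assumes the browser code above has already run so that html holds the page source):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
# e.g. list every link found in the rendered page
for a in soup.find_all('a'):
    print(a.get('href'))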
I'll actually turn my comment into an answer, because it is a solution to your problem:
As others said, this page is loaded dynamically, but there are ways to retrieve the data without running JavaScript. In your case, look at the "Network" tab of your dev tools and filter on "fetch" requests.
One of those requests could be particularly interesting for you:
You don't need Selenium or BeautifulSoup at all; you can just use requests and parse the JSON, if you are good enough ;)
Here is a working cURL request: curl 'https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines' -H 'tenant: one-stop-wine-shop'
You get an error if you don't add the tenant header.
And that's it: no HTML parsing, no waiting for the page to load, no JavaScript running. Much more powerful than the Selenium solution.
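The Python equivalent of that cURL call is only a few lines; a minimal sketch (the shape of the returned JSON, e.g. a 'products' list, is an assumption to verify by printing the response first):
import requests

url = "https://api.commerce7.com/v1/product/for-web?&collectionSlug=red-wines"
headers = {"tenant": "one-stop-wine-shop"}

response = requests.get(url, headers=headers)
data = response.json()

# Inspect the structure first, e.g. print(data.keys());
# the 'products' key below is an assumption about the payload layout.
for product in data.get("products", []):
    print(product.get("title"))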

Is there a way to make a dynamic web page automatically run its JavaScript when webscraping with Python?

I have been getting a lot of issues when trying to do some Python webscraping using BeautifulSoup. Since this particular web page is dynamic, I have been trying to use Selenium first in order to "open" the web page before trying to work with the dynamic content with BeautifulSoup.
The issue I am getting is that the dynamic content only shows up in my HTML output when I manually scroll through the website while the program runs; otherwise those parts of the HTML remain empty, as if I were just using BeautifulSoup by itself without Selenium.
Here is my code:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
if __name__ == "__main__":
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    # options.add_argument('--headless')
    driver = webdriver.Chrome(r"C:\Program Files (x86)\chromedriver.exe", chrome_options=options)
    driver.get('https://coinmarketcap.com/')
    time.sleep(5)
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.tbody
    trs = tbody.contents
    for tr in trs:
        print(tr)
    driver.close()
Now if I have Selenium open Chrome with the headless option turned on, I get the same output I would normally get without having pre-loaded the page. The same thing happens if I'm not in headless mode and I simply let the page load by itself, without scrolling through the content manually.
Does anyone know why this is? Is there a way to get the dynamic content to load without manually scrolling through each time I run the code?
Actually, the data is loaded dynamically by JavaScript, so you can grab it easily from the API call's JSON response.
Here is the working example:
Code:
import requests

url = 'https://api.coinmarketcap.com/data-api/v3/cryptocurrency/listing?start=1&limit=100&sortBy=market_cap&sortType=desc&convert=USD,BTC,ETH&cryptoType=all&tagType=all&audited=false&aux=ath,atl,high24h,low24h,num_market_pairs,cmc_rank,date_added,max_supply,circulating_supply,total_supply,volume_7d,volume_30d'
r = requests.get(url)
for item in r.json()['data']['cryptoCurrencyList']:
    name = item['name']
    print('crypto_name:' + str(name))
Output:
crypto_name:Bitcoin
crypto_name:Ethereum
crypto_name:Binance Coin
crypto_name:Cardano
crypto_name:Tether
crypto_name:Solana
crypto_name:XRP
crypto_name:Polkadot
crypto_name:USD Coin
crypto_name:Dogecoin
crypto_name:Terra
crypto_name:Uniswap
crypto_name:Wrapped Bitcoin
crypto_name:Litecoin
crypto_name:Avalanche
crypto_name:Binance USD
crypto_name:Chainlink
crypto_name:Bitcoin Cash
crypto_name:Algorand
crypto_name:SHIBA INU
crypto_name:Polygon
crypto_name:Stellar
crypto_name:VeChain
crypto_name:Internet Computer
crypto_name:Cosmos
crypto_name:FTX Token
crypto_name:Filecoin
crypto_name:Axie Infinity
crypto_name:Ethereum Classic
crypto_name:TRON
crypto_name:Bitcoin BEP2
crypto_name:Dai
crypto_name:THETA
crypto_name:Tezos
crypto_name:Fantom
crypto_name:Hedera
crypto_name:NEAR Protocol
crypto_name:Elrond
crypto_name:Monero
crypto_name:Crypto.com Coin
crypto_name:PancakeSwap
crypto_name:EOS
crypto_name:The Graph
crypto_name:Flow
crypto_name:Aave
crypto_name:Klaytn
crypto_name:IOTA
crypto_name:eCash
crypto_name:Quant
crypto_name:Bitcoin SV
crypto_name:Neo
crypto_name:Kusama
crypto_name:UNUS SED LEO
crypto_name:Waves
crypto_name:Stacks
crypto_name:TerraUSD
crypto_name:Harmony
crypto_name:Maker
crypto_name:BitTorrent
crypto_name:Celo
crypto_name:Helium
crypto_name:OMG Network
crypto_name:THORChain
crypto_name:Dash
crypto_name:Amp
crypto_name:Zcash
crypto_name:Compound
crypto_name:Chiliz
crypto_name:Arweave
crypto_name:Holo
crypto_name:Decred
crypto_name:NEM
crypto_name:Theta Fuel
crypto_name:Enjin Coin
crypto_name:Revain
crypto_name:Huobi Token
crypto_name:OKB
crypto_name:Decentraland
crypto_name:SushiSwap
crypto_name:ICON
crypto_name:XDC Network
crypto_name:Qtum
crypto_name:TrueUSD
crypto_name:yearn.finance
crypto_name:Nexo
crypto_name:Celsius
crypto_name:Bitcoin Gold
crypto_name:Curve DAO Token
crypto_name:Mina
crypto_name:KuCoin Token
crypto_name:Zilliqa
crypto_name:Perpetual Protocol
crypto_name:Ren
crypto_name:dYdX
crypto_name:Ravencoin
crypto_name:Synthetix
crypto_name:renBTC
crypto_name:Telcoin
crypto_name:Basic Attention Token
crypto_name:Horizen

How to select particular region and scrape all the Jobs from a website

I am trying to web scrape all the Jobs from a Job portal by selecting a particular country.
I am sorry to attach a picture, but the intent is to show you what the page looks like.
What I tried:
Below is what I tried, but I'm not getting anything; I just started learning web scraping.
import requests
from bs4 import BeautifulSoup
job_url = 'https://wd3.myworkdayjobs.com/careers/'
out_req = requests.get(job_url)
soup = BeautifulSoup(out_req.text, 'html.parser')
print(soup)
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
any help will be much appreciated.
Try the Selenium library: search based on attributes, and after the search results load, scrape them with BeautifulSoup.
from selenium import webdriver

# The browser is exposed through an executable file; through Selenium we invoke
# that executable, which in turn invokes the actual browser.
driver = webdriver.Chrome(executable_path="C:\\chromedriver.exe")
# Maximize the browser window
driver.maximize_window()
# get() launches the URL
driver.get("Website")
# Refresh the browser
driver.refresh()
# Identify the checkboxes by their type attribute, as a list
chk = driver.find_elements_by_xpath("//input[@type='checkbox']")
# len() gives the size of that list
print(len(chk))
# get_attribute() reads the value attribute
for i in chk:
    if i.get_attribute("value") == "United states of America":
        i.click()
# To close the browser
driver.close()
#############################
# Beautiful soup code here
#############################
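A rough sketch of what that BeautifulSoup part could look like (note that page_source has to be captured before driver.close(); the link extraction below is a guess at the markup, not the site's real structure):
from bs4 import BeautifulSoup

# Capture the rendered source before the driver is closed.
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')

# Hypothetical extraction -- inspect the page to find the real job-listing markup.
for link in soup.find_all('a'):
    href = link.get('href')
    if href:
        print(link.get_text(strip=True), href)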

Scrape 'li' tags from a data table that changes based on drop-down menu

I am trying to scrape data from a data table on this website: http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793
The site has multiple tabs, which changes the html (I am working in the 'matchup' tab). Within that matchup tab, there is a drop-down menu that changes the data table that I am trying to access. The items in the table that I am trying to access are 'li' tags within an unordered list. I just want to scrape the data from the "Overall" category of the drop-down menu.
I have been unable to access the data that I want. The item that I'm trying to access is coming back as a 'noneType'. Is there a way to do this?
url = "http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-
744793"
html_page = requests.get(url)
soup = BeautifulSoup(html_page.content, 'html.parser')
dataList = []
for ultag in soup.find_all('ul', {'class': 'base-list team-stats'}):
print(ultag)
for iltag in ultag.find_all('li'):
dataList.append(iltag.get_text())
The problem is that the content of the tab you are trying to pull data from is loaded dynamically using React. So you have to use the Selenium module in Python to open a browser, click the "Matchup" list element programmatically, and then get the source after clicking it.
On my mac I installed selenium and the chromewebdriver using these instructions:
https://gist.github.com/guylaor/3eb9e7ff2ac91b7559625262b8a6dd5f
Then I signed the Python file, so that the OS X firewall doesn't complain when trying to run it, using these instructions:
Add Python to OS X Firewall Options?
Then ran the following python3 code:
import os
import time
from selenium import webdriver
from bs4 import BeautifulSoup as soup
# Setup Selenium Chrome Web Driver
chromedriver = "/usr/local/bin/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
# Navigate in Chrome to specified page.
driver.get("http://www.oddsshark.com/ncaab/lsu-alabama-odds-february-18-2017-744793")
# Find the matchup list element using a css selector and click it.
link = driver.find_element_by_css_selector("li[id='react-tabs-0']").click()
# Wait for content to load
time.sleep(1)
# Get the current page source.
source = driver.page_source
# Parse into soup() the source of the page after the link is clicked and use "html.parser" as the Parser.
soupify = soup(source, 'html.parser')
dataList = []
for ultag in soupify.find_all('ul', {'class': 'base-list team-stats'}):
    print(ultag)
    for iltag in ultag.find_all('li'):
        dataList.append(iltag.get_text())
# We are done with the driver so quit.
driver.quit()
Hope this helps as I noticed this was a similar problem to the one I just solved here - Beautifulsoup doesn't reach a child element

How to retrieve the values of dynamic html content using Python

I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:
url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url);
response = request.urlopen(url)
data = str(response.read(10000))
data = data.replace("\\n", "\n")
print(data)
Where I'm trying to find a particular value, I'm finding a template instead e.g."{{formatPrice median}}" instead of "4.48".
How can I make it so that I can retrieve the value instead of the placeholder text?
Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}
Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.
The code I have now is:
from bs4 import BeautifulSoup
from selenium import webdriver
#...
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html)
print "Finding..."
for tag in soup.find_all('formatPrice median'):
print tag.text
Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.
Assuming you are trying to get values from a page that is rendered using javascript templates (for instance something like handlebars), then this is what you will get with any of the standard solutions (i.e. beautifulsoup or requests).
This is because the browser uses javascript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:
parse the AJAX JSON directly
use an offline JavaScript interpreter to process the request (SpiderMonkey, Crowbar)
use a browser automation tool (Splinter)
This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and its pretty handy.
EDIT
From your comments it looks like it is a handlebars driven site. I'd recommend selenium and beautiful soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')
html = driver.page_source
soup = BeautifulSoup(html)
# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)
Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)
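As a side note, instead of a fixed time.sleep() you can make Selenium wait explicitly for the element you care about; a minimal sketch reusing the placeholder class from the snippet above:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')

# Block for up to 10 seconds until at least one matching element is rendered.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "a.my_class"))
)
html = driver.page_source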
I used selenium + chrome
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "www.sitetotarget.com"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
Building off another answer. I had a similar issue. wget and curl do not work well anymore to get the content of a web page. It's particularly broken with dynamic and lazy content. Using Chrome (or Firefox or Chromium version of Edge) allows you to deal with redirects and scripting.
Below will launch an instance of Chrome, increase the timeout to 5 sec, and navigate this browser instance to a url. I ran this from Jupyter.
import time
from tqdm.notebook import trange, tqdm
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()
Example of grabbing dynamic content and taking screenshots of anchor HTML objects (hence the "a" tag), another name for hyperlinks:
url = 'http://www.example.org' ## Any website
driver.get(url)
pageSource = driver.page_source
print(driver.get_window_size())
locations = []
for element in driver.find_elements_by_tag_name("a"):
    location = element.location
    size = element.size
    # Collect coordinates of object: left/right, top/bottom
    x1 = location['x']
    y1 = location['y']
    x2 = location['x'] + size['width']
    y2 = location['y'] + size['height']
    locations.append([element, x1, y1, x2, y2, x2 - x1, y2 - y1])
locations.sort(key = lambda x: -x[-2] - x[-1])
locations = [ (el,x1,y1,x2,y2, width,height)
for el,x1,y1,x2,y2,width,height in locations
if not (
## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
x2 <= x1 or y2 <= y1 or x2<0 or y2<0
## Further restrict if you expect the objects to be around a specific size
## or width<200 or height<100
)
]
for el, x1, y1, x2, y2, width, height in tqdm(locations[:10]):
    try:
        print('-' * 100, f'({width},{height})')
        print(el.text[:100])
        element_png = el.screenshot_as_png
        with open('/tmp/_pageImage.png', 'wb') as f:
            f.write(element_png)
        img = Image.open('/tmp/_pageImage.png')
        display(img)
    except Exception as err:
        print(err)
Installation for mac+chrome:
pip install selenium
brew cask install chromedriver
brew cask install google-chrome
I was using Mac for the original answer and Ubuntu + Windows 11 preview via WSL2 after updating. Chrome ran from Linux side with X service on Windows to render the UI.
Regarding responsibility, please respect robots.txt on each site.
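If it helps, the standard library can check robots.txt for you; a minimal sketch (the URL and user-agent here are just examples):
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.org/robots.txt")
rp.read()

# True if the given user-agent is allowed to fetch the path.
print(rp.can_fetch("*", "http://www.example.org/some/page.html"))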
I know this is an old question, but sometimes there is a better solution than using heavy selenium.
This requests module for Python comes with JS support (in the background it is still Chromium), and you can still use BeautifulSoup as normal.
Though sometimes, if you have to click elements or something similar, I guess Selenium is the only option.
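A minimal sketch, assuming the module meant here is requests-html (it matches the description of a Chromium-backed renderer); the target URL is just a placeholder:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
r = session.get('http://www.example.org')

# First call downloads a Chromium build, then executes the page's JavaScript.
r.html.render()

# The rendered HTML can be handed to BeautifulSoup as usual.
soup = BeautifulSoup(r.html.html, 'html.parser')
print(soup.title)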
