Crawl a whole website with selenium in python recursive

Crawl a whole website with selenium in python recursive - python

I'm new in python and i try to crawl a whole website recursive with selenium.
I would like to do this with selenium because i want get all cookies which the website is used. I know that other tools can crawl a website easier and faster but other tools can't give me all cookies (first and third party).
Here my code:
from selenium import webdriver
import os, shutil
url = "http://example.com/"
links = set()
def crawl(start_link):
driver.get(start_link)
elements = driver.find_elements_by_tag_name("a")
urls_to_visit = set()
for el in elements:
urls_to_visit.add(el.get_attribute('href'))
for el in urls_to_visit:
if url in el:
if el not in links:
links.add(el)
crawl(el)
else:
return
dir_name = "userdir"
if os.path.isdir(dir_name):
shutil.rmtree(dir_name)
co = webdriver.ChromeOptions()
co.add_argument("--user-data-dir=userdir")
driver = webdriver.Chrome(options = co)
crawl(url)
print(links)
driver.close();
My problem is that the crawl function not open all pages from the website apparently. On some websites i can navigate to pages by hand that the function not reached. Why?

One thing I have noticed while using webdriver is that it needs time to load the page, the elements are not instantly available just like in a regular browser.
You may want to add some delays, or a loop to check for some type of footer to indicate that the page is loaded and you can start crawling.

Related

https://www.realestate.com.au/ not permitting web scraping?

I am trying to extract data from https://www.realestate.com.au/
First I create my url based on the type of property that I am looking for and then I open the url using selenium webdriver, but the page is blank!
Any idea why it happens? Is it because this website doesn't provide web scraping permission? Is there any way to scrape this website?
Here is my code:
from selenium import webdriver
from bs4 import BeautifulSoup
import time
PostCode = "2153"
propertyType = "house"
minBedrooms = "3"
maxBedrooms = "4"
page = "1"
url = "https://www.realestate.com.au/sold/property-{p}-with-{mib}-bedrooms-in-{po}/list-{pa}?maxBeds={mab}&includeSurrounding=false".format(p = propertyType, mib = minBedrooms, po = PostCode, pa = page, mab = maxBedrooms)
print(url)
# url should be "https://www.realestate.com.au/sold/property-house-with-3-bedrooms-in-2153/list-1?maxBeds=4&includeSurrounding=false"
driver = webdriver.Edge("./msedgedriver.exe") # edit the address to where your driver is located
driver.get(url)
time.sleep(3)
src = driver.page_source
soup = BeautifulSoup(src, 'html.parser')
print(soup)

you are passing the link incorrectly, try it
driver.get("your link")
api - https://selenium-python.readthedocs.io/api.html?highlight=get#:~:text=ef_driver.get(%22http%3A//www.google.co.in/%22)

I did try to access realestate.com.au through selenium, and in a different use case through scrapy.
I even got the results from scrapy crawling through use of proper user-agent and cookie but after a few days realestate.com.au detects selenium / scrapy and blocks the requests.
Additionally, it it clearly written in their terms & conditions that indexing any content in their website is strictly prohibited.
You can find more information / analysis in these questions:
Chrome browser initiated through ChromeDriver gets detected
selenium isn't loading the page
Bottom line is, you have to surpass their security if you want to scrape the content.

HTML acquired in Python code is not the same as displayed webpage

I have recently started learning web scraping with Scrapy and as a practice, I decided to scrape a weather data table from this url.
By inspecting the table element of the page, I copy its XPath into my code but I only get an empty list when running the code. I tried to check which tables are present in the HTML using this code:
from scrapy import Selector
import requests
import pandas as pd
url = 'https://www.wunderground.com/history/monthly/OIII/date/2000-5'
html = requests.get(url).content
sel = Selector(text=html)
table = sel.xpath('//table')
It only returns one table and it is not the one I wanted.
After some research, I found out that it might have something to do with JavaScript rendering in the page source code and that Python requests can't handle JavaScript.
After going through a number of SO Q&As, I came upon a certain requests-html library which can apparently handle JS execution so I tried acquiring the table using this code snippet:
from requests_html import HTMLSession
from scrapy import Selector
session = HTMLSession()
resp = session.get('https://www.wunderground.com/history/monthly/OIII/date/2000-5')
resp.html.render()
html = resp.html.html
sel = Selector(text=html)
tables = sel.xpath('//table')
print(tables)
But the result doesn't change. How can I acquire that table?

Problem
Multiple problems may be at play here—not only javascript execution, but HTML5 APIs, cookies, user agent, etc.
Solution
Consider using Selenium with headless Chrome or Firefox web driver. Using selenium with a web driver ensures that page will be loaded as intended. Headless mode ensures that you can run your code without spawning the GUI browser—you can, of course, disable headless mode to see what's being done to the page in realtime and even add a breakpoint so that you can debug beyond pdb in the browser's console.
Example Code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.wunderground.com/history/monthly/OIII/date/2000-5")
tables = driver.find_elements_by_xpath('//table') # There are several APIs to locate elements available.
print(tables)
References
Selenium Github: https://github.com/SeleniumHQ/selenium
Selenium (Python) Documentation: https://selenium-python.readthedocs.io/getting-started.html
Locating Elements: https://selenium-python.readthedocs.io/locating-elements.html

you can use scrapy-splash plugin to work scrapy with Splash (scrapinghub's javascript browser)
Using splash you can render javascript and also execute user events like mouse click

how to load multiple urls in driver.get()?

How to load multiple urls in driver.get() ?
I am trying to load 3 urls in below code, but how to load the other 2 urls?
And afterwards the next challenge is to pass authentication for all the urls as well which is same.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Chrome(executable_path=r"C:/Users/RYadav/AppData/Roaming/Microsoft/Windows/Start Menu/Programs/Python 3.8/chromedriver.exe")
driver.get("https://fleet.my.salesforce.com/reportbuilder/reportType.apexp")#put here the adress of your page
elem = driver.find_elements_by_xpath('//*[#id="ext-gen63"]')#put here the content you have put in Notepad, ie the XPath
button = driver.find_element_by_id('id="ext-gen63"')
print(elem.get_attribute("class"))
driver.close
submit_button.click()

Try below code :
def getUrls(targeturl):
driver = webdriver.Chrome(executable_path=r" path for chromedriver.exe")
driver.get("http://www."+targeturl+".com")
# perform your taks here
driver.quit()
for i in range(3):
webPage = ['google','facebook','gmail']
for i in webPage:
print i;
getUrls(i)

You can't load more than 1 url at a time for each Webdriver. If you want to do so, you maybe need some multiprocessing module. If you want to do an iterative solution, just create a list with every url you need and loop through it. With that you won't have the credential problem neither.

Clicking multiple items on one page using selenium

My main purpose is to go to this specific website, to click each of the products, have enough time to scrape the data from the clicked product, then go back to click another product from the page until all the products are clicked through and scraped (The scraping code I have not included).
My code opens up chrome to redirect to my desired website, generates a list of links to click by class_name. This is the part I am stuck on, I would believe I need a for-loop to iterate through the list of links to click and go back to the original. But, I can't figure out why this won't work.
Here is my code:
import csv
import time
from selenium import webdriver
import selenium.webdriver.chrome.service as service
import requests
from bs4 import BeautifulSoup
url = "https://www.vatainc.com/infusion/adult-infusion.html?limit=all"
service = service.Service('path to chromedriver')
service.start()
capabilities = {'chrome.binary': 'path to chrome'}
driver = webdriver.Remote(service.service_url, capabilities)
driver.get(url)
time.sleep(2)
links = driver.find_elements_by_class_name('product-name')
for link in links:
link.click()
driver.back()
link.click()

I have another solution to your problem.
When I tested your code it showed a strange behaviour. Fixed all problems that I had using xpath.
url = "https://www.vatainc.com/infusion/adult-infusion.html?limit=all"
driver.get(url)
links = [x.get_attribute('href') for x in driver.find_elements_by_xpath("//*[contains(#class, 'product-name')]/a")]
htmls = []
for link in links:
driver.get(link)
htmls.append(driver.page_source)
Instead of going back and forward I saved all links (named as links) and iterate over this list.

How to retrieve the values of dynamic html content using Python

I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:
url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url);
response = request.urlopen(url)
data = str(response.read(10000))
data = data.replace("\\n", "\n")
print(data)
Where I'm trying to find a particular value, I'm finding a template instead e.g."{{formatPrice median}}" instead of "4.48".
How can I make it so that I can retrieve the value instead of the placeholder text?
Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}
Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.
The code I have now is:
from bs4 import BeautifulSoup
from selenium import webdriver
#...
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html)
print "Finding..."
for tag in soup.find_all('formatPrice median'):
print tag.text
Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.

Assuming you are trying to get values from a page that is rendered using javascript templates (for instance something like handlebars), then this is what you will get with any of the standard solutions (i.e. beautifulsoup or requests).
This is because the browser uses javascript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:
parse the ajax JSON directly
use an offline Javascript interpreter to process the request SpiderMonkey, crowbar
use a browser automation tool splinter
This answer provides a few more suggestions for option 3, such as selenium or watir. I've used selenium for automated web testing and its pretty handy.
EDIT
From your comments it looks like it is a handlebars driven site. I'd recommend selenium and beautiful soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')
html = driver.page_source
soup = BeautifulSoup(html)
# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
print tag.text
Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)

I used selenium + chrome
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
url = "www.sitetotarget.com"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')`

Building off another answer. I had a similar issue. wget and curl do not work well anymore to get the content of a web page. It's particularly broken with dynamic and lazy content. Using Chrome (or Firefox or Chromium version of Edge) allows you to deal with redirects and scripting.
Below will launch an instance of Chrome, increase the timeout to 5 sec, and navigate this browser instance to a url. I ran this from Jupyter.
import time
from tqdm.notebook import trange, tqdm
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()
Example of grabbing dynamic content and screenshots of the anchored (hence the "a" tag) HTML object, another name for hyperlink:
url = 'http://www.example.org' ## Any website
driver.get(url)
pageSource = driver.page_source
print(driver.get_window_size())
locations = []
for element in driver.find_elements_by_tag_name("a"):
location = element.location;
size = element.size;
# Collect coordinates of object: left/right, top/bottom
x1 = location['x'];
y1 = location['y'];
x2 = location['x']+size['width'];
y2 = location['y']+size['height'];
locations.append([element,x1,y1,x2,y2, x2-x1, y2-y1])
locations.sort(key = lambda x: -x[-2] - x[-1])
locations = [ (el,x1,y1,x2,y2, width,height)
for el,x1,y1,x2,y2,width,height in locations
if not (
## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
x2 <= x1 or y2 <= y1 or x2<0 or y2<0
## Further restrict if you expect the objects to be around a specific size
## or width<200 or height<100
)
]
for el,x1,y1,x2,y2,width,height in tqdm(locations[:10]):
try:
print('-'*100,f'({width},{height})')
print(el.text[:100])
element_png = el.screenshot_as_png
with open('/tmp/_pageImage.png', 'wb') as f:
f.write(element_png)
img = Image.open('/tmp/_pageImage.png')
display(img)
except Exception as err:
print(err)
Installation for mac+chrome:
pip install selenium
brew cask install chromedriver
brew cask install google-chrome
I was using Mac for the original answer and Ubuntu + Windows 11 preview via WSL2 after updating. Chrome ran from Linux side with X service on Windows to render the UI.
Regarding responsibility, please respect robots.txt on each site.

I know this is an old question, but sometimes there is a better solution than using heavy selenium.
This request module for python comes with JS support (in the background it is still chromium) and you can still use beautifulsoup like normal.
Though, sometimes if you have to click elements or sth, I guess selenium is the only option.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Crawl a whole website with selenium in python recursive - python

One thing I have noticed while using webdriver is that it needs time to load the page, the elements are not instantly available just like in a regular browser. You may want to add some delays, or a loop to check for some type of footer to indicate that the page is loaded and you can start crawling.

Related

https://www.realestate.com.au/ not permitting web scraping?

HTML acquired in Python code is not the same as displayed webpage

how to load multiple urls in driver.get()?

Clicking multiple items on one page using selenium

How to retrieve the values of dynamic html content using Python

Categories

Resources