Response fails to update with selenium scroll - python

The script is supposed to collect all the links from the base_url, which displays a subset of results; scrolling down adds more results to the subset until the list is exhausted. I can scroll the page, but the issue is that I only retrieve the few links that load when the page first appears, before any scrolling. The response should update as the web driver scrolls. This is my code so far:
import re
import requests
import time
from bs4 import BeautifulSoup
from selenium import webdriver
mybrowser = webdriver.Chrome(r"E:\chromedriver.exe")
base_url = "https://genius.com/search?q="+"drake"
myheader = {'User-Agent':''}
mybrowser.get(base_url)
t_end = time.time() + 60 * 1
while time.time() < t_end:
    mybrowser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    response = requests.get(base_url, headers = myheader)
    soup = BeautifulSoup(response.content, "lxml")
    pattern = re.compile(r"[\S]+-lyrics$")
    for link in soup.find_all('a', href=True):
        if pattern.match(link['href']):
            print(link['href'])
This only displays the first few links; the links that load as Selenium scrolls the page are never retrieved.

You need to parse the HTML from Selenium itself (this changes when Selenium scrolls the webpage), and not use requests to download the page.
Change:
response = requests.get(base_url, headers = myheader)
soup = BeautifulSoup(response.content, "lxml")
to:
html = mybrowser.page_source
soup = BeautifulSoup(html, "lxml")
And it should work just fine.
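For reference, here is how the whole script looks with that change applied. This is a sketch: the short sleep after each scroll and the seen set that avoids reprinting duplicates are additions, not part of the original code.
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver

mybrowser = webdriver.Chrome(r"E:\chromedriver.exe")
mybrowser.get("https://genius.com/search?q=" + "drake")

pattern = re.compile(r"[\S]+-lyrics$")
seen = set()

t_end = time.time() + 60 * 1
while time.time() < t_end:
    mybrowser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)  # give the newly loaded results a moment to render
    # parse the browser's current DOM, which grows as the page scrolls
    soup = BeautifulSoup(mybrowser.page_source, "lxml")
    for link in soup.find_all('a', href=True):
        if pattern.match(link['href']) and link['href'] not in seen:
            seen.add(link['href'])
            print(link['href'])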

Related

Parsing data scraped from Javascript rendered webpage with python

I am trying to use .find on a soup variable, but even though I can see the right class in the web page's inspector, the lookup returns None.
from bs4 import BeautifulSoup
import time
import pandas as pd
import pickle
import html5lib
from requests_html import HTMLSession

s = HTMLSession()
url = "https://cryptoli.st/lists/fixed-supply"

def get_data(url):
    r = s.get(url)
    global soup
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

def get_next_page(soup):
    page = soup.find('div', {'class': 'dataTables_paginate paging_simple_numbers'})
    return page

get_data(url)
print(get_next_page(soup))
The "page" variable returns "None" even though I pulled it from the website element inspector. I suspect it has something to do with the fact that the website is rendered with javascript but can't figure out why. If I take away the {'class' : ''datatables_paginate paging_simple_numbers'} and just try to find 'div' then it works and returns the first div tag so I don't know what else to do.
So you want to scrape dynamic page content. You can use Beautiful Soup together with the Selenium webdriver. This answer is based on the explanation here: https://www.geeksforgeeks.org/scrape-content-from-dynamic-websites/
import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://cryptoli.st/lists/fixed-supply"

driver = webdriver.Chrome('./chromedriver')
driver.get(url)

# this is just to ensure that the page is loaded
time.sleep(5)

# page_source holds the JS-rendered page as static HTML,
# so we can simply apply bs4 to it
html = driver.page_source
soup = BeautifulSoup(html, "html.parser")
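With the rendered HTML in soup, the lookup from the question should now succeed, e.g. (reusing the asker's selector):
page = soup.find('div', {'class': 'dataTables_paginate paging_simple_numbers'})
print(page)  # no longer None once the JS-rendered source is parsed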

Parsing HTML using beautifulsoup gives "None"

I can clearly see the tag I need in order to get the data I want to scrape. According to multiple tutorials, I am doing it exactly the same way. So why does it give me None when I simply want to display the content of the li tag with that class?
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.governmentjobs.com/careers/sdcounty")
soup = BeautifulSoup(response.text,'html.parser')
job = soup.find('li', attrs = {'class':'list-item'})
print(job)
Whilst the page does update dynamically (the browser makes additional requests to fetch content, which you don't capture with your single request), you can find the source URI for the content of interest in the browser's network tab. You also need to add the expected header.
import requests
from bs4 import BeautifulSoup as bs
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = bs(r.content, 'lxml')
print(len(soup.select('.list-item')))
There is no such content in the original page. The search results you're referring to are loaded dynamically/asynchronously using JavaScript.
Print the variable response.text to verify that; I got the result using ReqBin. You'll find there's no text list-item inside.
Unfortunately, you can't run JavaScript with BeautifulSoup.
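A quick way to confirm this along the lines suggested above (a rough substring check, not a rigorous test):
import requests

r = requests.get("https://www.governmentjobs.com/careers/sdcounty")
# the job-list markup is added later by JavaScript, so it is absent
# from the raw response
print('list-item' in r.text)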
Another way to handle dynamically loaded data is to use selenium instead of requests to get the page source. This waits for the JavaScript to load the data and then gives you the corresponding HTML. It can be done like so:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

url = "<URL>"
chrome_options = Options()
chrome_options.add_argument("--headless")  # opens the browser in the background

with Chrome(options=chrome_options) as browser:
    browser.get(url)
    html = browser.page_source

soup = BeautifulSoup(html, 'html.parser')
job = soup.find('li', attrs={'class': 'list-item'})
print(job)
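Note that simply reading page_source can still race slow JavaScript. If the list renders late, an explicit wait for the element is more reliable; a sketch of the same with block, assuming the li.list-item selector from the question:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

with Chrome(options=chrome_options) as browser:
    browser.get(url)
    # block for up to 10 seconds until at least one list item appears
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "li.list-item"))
    )
    html = browser.page_source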

My python script does not print table from html

I am trying to get the table data from the code below, but surprisingly the script prints None for the table, though I can clearly see it in my HTML document.
I look forward to your help.
from urllib2 import urlopen, Request
from bs4 import BeautifulSoup
site = 'http://www.altrankarlstad.com/wisp'
hdr = {'User-Agent': 'Chrome/78.0.3904.108'}
req = Request(site, headers=hdr)
res = urlopen(req)
rawpage = res.read()
page = rawpage.replace("<!-->", "")
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table", {"class":"table workitems-table mt-2"})
print (table)
Here is the code with the Selenium script, as suggested:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
url = 'http://www.altrankarlstad.com/wisp'
driver = webdriver.Chrome('C:\\Users\\rugupta\\AppData\\Roaming\\Microsoft\\Windows\\Start Menu\\Programs\\Python 3.7\\chromedriver.exe')
driver.get(url)
driver.find_element_by_id('root').click()  # click on search button to fetch list of bus schedule
time.sleep(10)  # depends on how long it takes to load the next page after the button click
for i in range(1, 50):
    url = "http://www.altrankarlstad.com/wisp".format(pagenum=i)
    text_field = driver.find_elements_by_xpath('//*[@id="root"]/div/div/div/div[2]/table')
    for h3Tag in text_field:
        print(h3Tag.text)
The page wasn't fully loaded when you used Request; you can debug by printing res.
It seems the page uses JavaScript to load the table.
You should use selenium: load the page with a driver (e.g. chromedriver or the Firefox driver), sleep a while until the page is loaded (you decide how long; it takes quite a bit to load fully), then get the table using selenium.
import time
from bs4 import BeautifulSoup
from selenium import webdriver

url = 'http://www.altrankarlstad.com/wisp'
driver = webdriver.Chrome('/path/to/chromedriver')
driver.get(url)
# the button click from your script is omitted here; its purpose was unclear
time.sleep(100)  # wait for the JavaScript to render the table
text_field = driver.find_elements_by_xpath('//*[@id="root"]/div/div/div/div[2]/table')
print(text_field[0].text)
Your code worked fine with a bit of modification; this will print all the text from the table. You should learn to debug and adapt it to get what you want.
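If you want the table as structured rows rather than one block of text, you could hand the rendered source to BeautifulSoup, mirroring the original attempt (a sketch using the class name from the question):
soup = BeautifulSoup(driver.page_source, "html.parser")
table = soup.find("table", {"class": "table workitems-table mt-2"})
# print each row of the rendered table as a list of cell texts
for row in table.find_all("tr"):
    print([cell.get_text(strip=True) for cell in row.find_all(["th", "td"])])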

Can't fetch links connected to different exhibitors from a webpage

I've been trying to fetch the links connected to different exhibitors from this webpage using a Python script, but I get nothing as a result, and no error either. The class name m-exhibitors-list__items__item__name__link I've used within my script is available in the page source, so it is not generated dynamically.
What change should I bring about within my script to get the links?
This is what I've tried with:
from bs4 import BeautifulSoup
import requests

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("a.m-exhibitors-list__items__item__name__link"):
        print(item.get("href"))
One such link I'm after (the first one):
https://www.topdrawer.co.uk/exhibitors/alessi-1
@Life is complex is right: the site you are trying to scrape is protected by the Incapsula service, which guards sites against web scraping and other attacks by checking whether a request's headers come from a browser or from a robot (you or a bot). More likely the site has proprietary data, or it might be protecting itself from other threats.
However, there is a way to achieve what you want using Selenium and BS4.
The following is a code snippet for your reference:
from bs4 import BeautifulSoup
from selenium import webdriver

link = 'https://www.topdrawer.co.uk/exhibitors?page=1'
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"

wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get(link)
html_page = wd.page_source
soup = BeautifulSoup(html_page, "lxml")
results = soup.findAll("a", {"class": "m-exhibitors-list__items__item__name__link"})

# iterate over the anchor tags to get the href attribute
for item in results:
    print(item.get("href"))

wd.quit()
The site that you are attempting to scrape is protected with Incapsula.
import requests
from bs4 import BeautifulSoup
from pprint import pprint

# http_headers was not shown in the original answer; a browser-like
# User-Agent is assumed here
http_headers = {'User-Agent': 'Mozilla/5.0'}

target_url = 'https://www.topdrawer.co.uk/exhibitors?page=1'
response = requests.get(target_url,
                        headers=http_headers, allow_redirects=True, verify=True, timeout=30)
raw_html = response.text
soupParser = BeautifulSoup(raw_html, 'lxml')
pprint(soupParser.text)
**OUTPUT**
('Request unsuccessful. Incapsula incident ID: '
 '438002260604590346-1456586369751453219')
Read through this: https://www.quora.com/How-can-I-scrape-content-with-Python-from-a-website-protected-by-Incapsula
and these: https://stackoverflow.com/search?q=Incapsula

Selenium and BeautifulSoup: sharing and pulling session data resources to multiple libraries in python

I have problems combining two libraries in Python 3.6. I use the Selenium Firefox WebDriver to log into a website, but when I want BeautifulSoup or Requests to read that website, it reads the link differently (as if I had not logged in). How can I tell Requests that I have already logged in?
Below is the code I have written so far:
from selenium import webdriver
import config
import requests
from bs4 import BeautifulSoup

# choose webdriver
browser = webdriver.Firefox(executable_path="C:\\Users\\myUser\\geckodriver.exe")
browser.get("https://www.mylink.com/")

# log in
timeout = 1
login = browser.find_element_by_name("sf-login")
login.send_keys(config.USERNAME)
password = browser.find_element_by_name("sf-password")
password.send_keys(config.PASSWORD)
button_log = browser.find_element_by_xpath("/html/body/div[2]/div[1]/div/section/div/div[2]/form/p[2]/input")
button_log.click()

name = "https://www.policytracker.com/auctions/page/"
browser.get(name)
name2 = "/html/body/div[2]/div[1]/div/section/div/div[2]/div[3]/div[" + str(N) + "]/a"

# next page loaded
title1 = browser.find_element_by_xpath(name2)
title1.click()

# this saves the URL of the page whose content I want to download
# (I have already logged in on that page)
page = browser.current_url

# I want requests to go to this page; it goes, but without the
# logged-in session... WRONG
r = requests.get(page)
r.content
soup = BeautifulSoup(r.content, 'lxml')
print(soup)
If you simply want to pass the page source to BeautifulSoup, you can get the page source from selenium and then pass it to BeautifulSoup directly (no need for the requests module).
Instead of
page = browser.current_url
r = requests.get(page)
soup = BeautifulSoup(r.content, 'lxml')
you can do
page = browser.page_source
soup = BeautifulSoup(page, 'html.parser')
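If you do need requests itself (for example to download files), a common approach is to copy Selenium's session cookies into a requests.Session. This is a sketch, not part of the original answer, and assumes the site's login state is purely cookie-based:
s = requests.Session()
for cookie in browser.get_cookies():
    # carry the logged-in browser session over to requests
    s.cookies.set(cookie['name'], cookie['value'])

r = s.get(browser.current_url)  # now sent with the logged-in cookies
soup = BeautifulSoup(r.content, 'lxml')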
