Parsing HTML using beautifulsoup gives "None" - python

I can clearly see the tag I need in order to get the data I want to scrape.
According to multiple tutorials, I am doing it exactly the same way.
So why does it give me "None" when I simply want to display the content of the li element with class list-item?
from bs4 import BeautifulSoup
import requests
response = requests.get("https://www.governmentjobs.com/careers/sdcounty")
soup = BeautifulSoup(response.text,'html.parser')
job = soup.find('li', attrs = {'class':'list-item'})
print(job)

While the page does update dynamically (the browser makes additional requests to load content, which you don't capture with your single request), you can find the source URI for the content of interest in the network tab. You also need to add the expected header.
import requests
from bs4 import BeautifulSoup as bs
headers = {'X-Requested-With': 'XMLHttpRequest'}
r = requests.get('https://www.governmentjobs.com/careers/home/index?agency=sdcounty&sort=PositionTitle&isDescendingSort=false&_=', headers=headers)
soup = bs(r.content, 'lxml')
print(len(soup.select('.list-item')))
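As a quick sanity check, you can also pull a few job titles out of the returned HTML. The markup inside each .list-item is an assumption on my part (I'm guessing each item wraps the posting title in an anchor), so adjust the selector to whatever the response actually contains:
for item in soup.select('.list-item')[:5]:
    link = item.select_one('a')  # assumed: the title sits in the item's first anchor
    if link:
        print(link.get_text(strip=True), link.get('href'))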

There is no such content in the original page. The search results you're referring to are loaded dynamically/asynchronously using JavaScript.
Print the variable response.text to verify that. I got the same result using ReqBin. You'll find that there's no text list-item inside.
Unfortunately, you can't run JavaScript with BeautifulSoup.
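A minimal way to confirm this yourself is to check the raw response for the class name before parsing anything:
import requests
response = requests.get("https://www.governmentjobs.com/careers/sdcounty")
# If this prints False, the markup you saw in the inspector was added later by JavaScript
print('list-item' in response.text)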

Another way to handle dynamically loaded data is to use selenium instead of requests to get the page source. This waits for the JavaScript to load the data and then gives you the corresponding HTML. It can be done like so:
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options
url = "<URL>"
chrome_options = Options()
chrome_options.add_argument("--headless") # Opens the browser up in background
with Chrome(options=chrome_options) as browser:
    browser.get(url)
    html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
job = soup.find('li', attrs={'class': 'list-item'})
print(job)
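If page_source is grabbed before the job list has finished rendering, an explicit wait is more reliable than hoping the load is done. This is a sketch reusing the chrome_options and url from above, and it assumes the same list-item class:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
with Chrome(options=chrome_options) as browser:
    browser.get(url)
    # Block (up to 10 seconds) until at least one list item is present in the DOM
    WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'list-item'))
    )
    html = browser.page_source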

Related

How can I get information from a web site using BeautifulSoup in python?

I need to get the publication date displayed on the following web page with BeautifulSoup in Python:
https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410
The point is that when I search the html code from 'inspect' on the web page, I find the publication date quickly, but when I search the html obtained with Python, I cannot find it, even with the functions find() and find_all().
I tried this code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)
soup.find_all('span', id_= 'biblio-publication-number-content')
but it gives me [], while in the 'inspect' view of the live page the tag is there.
What am I doing wrong that makes the 'inspect' code differ from the one I get with BeautifulSoup?
How can I solve this issue and get the number?
The problem, I believe, is that the content you are looking for is loaded by JavaScript after the initial page load. requests only shows what the initial page content looked like before the DOM was modified by JavaScript.
For this you might try installing selenium and then downloading the Selenium web driver for your specific browser. Put the driver in some directory that is on your path and then (here I am using Chrome):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')
    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()
Prints:
<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
Umberto, if you are looking for the html element span, use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
for r in results:
    print(r)
If you are looking for an html element with the id 'biblio-publication-number-content', use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
print(soup.find_all(id='biblio-publication-number-content'))
In the first case you are fetching all span html elements.
In the second case you are fetching all elements with the id 'biblio-publication-number-content'.
I suggest you look into html tags and elements for a deeper understanding of how they work and the semantics behind them.

Parsing data scraped from Javascript rendered webpage with python

I am trying to use .find on a soup variable, but even though I can see the right class when I inspect the webpage, it returns None.
from bs4 import *
import time
import pandas as pd
import pickle
import html5lib
from requests_html import HTMLSession
s = HTMLSession()
url = "https://cryptoli.st/lists/fixed-supply"
def get_data(url):
    r = s.get(url)
    global soup
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup
def get_next_page(soup):
    page = soup.find('div', {'class': 'dataTables_paginate paging_simple_numbers'})
    return page
get_data(url)
print(get_next_page(soup))
The "page" variable returns "None" even though I pulled it from the website element inspector. I suspect it has something to do with the fact that the website is rendered with javascript but can't figure out why. If I take away the {'class' : ''datatables_paginate paging_simple_numbers'} and just try to find 'div' then it works and returns the first div tag so I don't know what else to do.
So you want to scrape dynamic page content. You can use Beautiful Soup with the Selenium webdriver. This answer is based on the explanation here: https://www.geeksforgeeks.org/scrape-content-from-dynamic-websites/
from bs4 import BeautifulSoup
from selenium import webdriver
import time
url = "https://cryptoli.st/lists/fixed-supply"
driver = webdriver.Chrome('./chromedriver')
driver.get(url)
# this is just to ensure that the page is loaded
time.sleep(5)
html = driver.page_source
# this renders the JS code and stores all
# of the information in static HTML code.
# Now, we can simply apply bs4 to the html variable
soup = BeautifulSoup(html, "html.parser")
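With the rendered soup in hand, the original lookup from the question should now have something to match:
page = soup.find('div', {'class': 'dataTables_paginate paging_simple_numbers'})
print(page)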

Scraping a table from a website returns an empty result

I am trying to scrape the main table with the tag:
<table _ngcontent-jna-c4="" class="rayanDynamicStatement">
from the following website using the 'BeautifulSoup' library, but the code returns an empty [] while printing soup returns an html string and the request status is 200. I found out that when I use the browser's 'inspect element' tool I can see the table tag, but in "view page source" the table tag, which is part of the "app-root" tag, is not shown (you see <app-root></app-root>, which is empty). Besides, there is no "json" file among the webpage's components to extract data from. Please help me: how can I scrape the table data?
import urllib.request
import pandas as pd
from urllib.parse import unquote
from bs4 import BeautifulSoup
yurl='https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
req=urllib.request.urlopen(yurl)
print(req.status)
#get response
response = req.read()
html = response.decode("utf-8")
#make html readable
soup = BeautifulSoup(html, features="html")
table_body=soup.find_all("table")
print(table_body)
The table is in the source HTML but kinda hidden and then rendered by JavaScript. It's in one of the <script> tags. This can be located with bs4 and then parsed with regex. Finally, the table data can be passed to json.loads, then to a pandas DataFrame, and then to a .csv file, but since I don't know any Persian, you'd have to see if it's of any use.
Just by looking at some values, I think it is.
Oh, and this can be done without selenium.
Here's how:
import pandas as pd
import json
import re
import requests
from bs4 import BeautifulSoup
url = "https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0"
scripts = BeautifulSoup(
    requests.get(url, verify=False).content,
    "lxml",
).find_all("script", {"type": "text/javascript"})
table_data = json.loads(
    re.search(r"var datasource = ({.*})", scripts[-5].string).group(1),
)
pd.DataFrame(
    table_data["sheets"][0]["tables"][0]["cells"],
).to_csv("huge_table.csv", index=False)
This outputs a huge .csv file with the table cells.
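If you want to eyeball the result before opening the file, a quick preview of the parsed cells works too (a convenience step on top of the code above):
df = pd.DataFrame(table_data["sheets"][0]["tables"][0]["cells"])
print(df.shape)   # how many rows and columns were recovered
print(df.head())  # first few rows, to sanity-check the parse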
It might not be the best solution, but with the webdriver in headless mode you can get everything you want:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
option = Options()
option.add_argument('--headless')
url = 'https://www.codal.ir/Reports/Decision.aspx?LetterSerial=T1hETjlDjOQQQaQQQfaL0Mb7uucg%3D%3D&rt=0&let=6&ct=0&ft=-1&sheetId=0'
driver = webdriver.Chrome(options=option)
driver.get(url)
bs = BeautifulSoup(driver.page_source, 'html.parser')
print(bs.find('table'))
driver.quit()
It looks like the elements you're trying to get are rendered by some JavaScript code. You will need to use something like Selenium instead in order to get the fully rendered HTML.

Not getting all the information when scraping bet365.com

I am having a problem when trying to scrape https://www.bet365.com/ using urllib.request and BeautifulSoup.
The problem is that the code below doesn't get all the information on the page; for example, players' names don't appear. Maybe I need another framework or configuration to extract the information?
My code is:
from bs4 import BeautifulSoup
import urllib.request
url = "https://www.bet365.com/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
soup = str(soup)
Looking at the source code for the page in question, it looks like essentially all of the data is populated by JavaScript. BeautifulSoup isn't a headless client; it's just something that downloads and parses HTML, so it can't see anything that's populated by JavaScript. You'd need a headless browser like selenium to scrape something like that.
You need to use selenium instead of requests, along with BeautifulSoup.
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://www.bet365.com"
driver = webdriver.Chrome(executable_path=r"the_path_of_driver")
driver.get(url)
driver.maximize_window()  # optional, if you want to maximize the browser
driver.implicitly_wait(60)  # optional, wait for loading in case of error
soup = BeautifulSoup(driver.page_source, 'html.parser')  # get the soup

Can't fetch links connected to different exhibitors from a webpage

I've been trying to fetch the links connected to different exhibitors from this webpage using a python script, but I get nothing as a result, no error either. The class name m-exhibitors-list__items__item__name__link I've used within my script is available in the page source, so it is not generated dynamically.
What change should I make within my script to get the links?
This is what I've tried with:
from bs4 import BeautifulSoup
import requests
link = 'https://www.topdrawer.co.uk/exhibitors?page=1'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    response = s.get(link)
    soup = BeautifulSoup(response.text, "lxml")
    for item in soup.select("a.m-exhibitors-list__items__item__name__link"):
        print(item.get("href"))
One such link I'm after (the first one):
https://www.topdrawer.co.uk/exhibitors/alessi-1
#Life is complex is right: the site you are trying to scrape is protected by the Incapsula service, which guards the site against web scraping and other attacks by checking the request headers to decide whether a request comes from a browser or from a robot. Most likely the site has proprietary data, or they may be protecting it from other threats.
However, there is an option to achieve what you want using Selenium and BS4.
The following is a code snippet for your reference:
from bs4 import BeautifulSoup
from selenium import webdriver
link = 'https://www.topdrawer.co.uk/exhibitors?page=1'
CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"  # raw string, so the backslashes are not treated as escapes
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get(link)
html_page = wd.page_source
soup = BeautifulSoup(html_page, "lxml")
results = soup.findAll("a", {"class": "m-exhibitors-list__items__item__name__link"})
# iterate the list of anchor tags to get the href attribute
for item in results:
    print(item.get("href"))
wd.quit()
The site that you are attempting to scrape is protected with Incapsula.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
# http_headers was not defined in the original snippet; a minimal User-Agent is assumed here
http_headers = {'User-Agent': 'Mozilla/5.0'}
target_url = 'https://www.topdrawer.co.uk/exhibitors?page=1'
response = requests.get(target_url,
                        headers=http_headers, allow_redirects=True, verify=True, timeout=30)
raw_html = response.text
soupParser = BeautifulSoup(raw_html, 'lxml')
pprint(soupParser.text)
**OUTPUTS**
('Request unsuccessful. Incapsula incident ID: '
 '438002260604590346-1456586369751453219')
Read through this: https://www.quora.com/How-can-I-scrape-content-with-Python-from-a-website-protected-by-Incapsula
and these: https://stackoverflow.com/search?q=Incapsula
