Trouble Scraping site with BS4 - python

I'm usually able to write a script that works for scraping, but I've been having difficulty scraping the table on this site for a research project I'm working on. I plan to verify that the script works on one state before entering the URLs of my targeted states.
import requests
import bs4 as bs
url = ("http://programs.dsireusa.org/system/program/detail/284")
dsire_get = requests.get(url)
soup = bs.BeautifulSoup(dsire_get.text,'lxml')
table = soup.findAll('div', {'data-ng-controller': 'DetailsPageCtrl'})
print(table)
# Printing table just to confirm that the information I'm looking for is within this section
I'm not sure whether the site is attempting to block people from scraping, but all the info I'm looking to grab is inside "&quot" entities if you look at what table outputs.

The text is rendered with JavaScript.
First, render the page with dryscrape.
(If you don't want to use dryscrape, see "Web-scraping JavaScript page with Python".)
The text can then be extracted, after it has been rendered, from a different position on the page, i.e. the place it has been rendered to.
As an example, this code will extract the HTML of the summary.
import bs4 as bs
import dryscrape
url = ("http://programs.dsireusa.org/system/program/detail/284")
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'html.parser')
table = soup.findAll('div', {'class': 'programSummary ng-binding'})
print(table[0])
Outputs:
<div class="programSummary ng-binding" data-ng-bind-html="program.summary"><p>
<strong>Eligibility and Availability</strong></p>
<p>
Net metering is available to all "qualifying facilities" (QFs), as defined by the federal <i>Public Utility Regulatory Policies Act of 1978</i> (PURPA), which pertains to renewable energy systems and combined heat and power systems up to 80 megawatts (MW) in capacity. There is no statewide cap on the aggregate capacity of net-metered systems.</p>
<p>
All utilities subject to Public ...
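On the "&quot" observation from the question: `&quot;` is just the HTML entity for a double quote, so seeing it in the printed div suggests the data sits in the raw source as entity-escaped text, possibly JSON, rather than being blocked. If that is the case (an assumption; the sample string and its keys below are made up, not taken from the DSIRE page), the standard library can decode it without rendering the page. A minimal sketch:

```python
import html
import json

# Hypothetical sample: JSON stored entity-escaped inside the raw page source.
raw = "{&quot;name&quot;: &quot;Net Metering&quot;, &quot;state&quot;: &quot;XX&quot;}"


def decode_embedded_json(escaped: str) -> dict:
    """Turn entity-escaped JSON text back into a Python dict."""
    return json.loads(html.unescape(escaped))


program = decode_embedded_json(raw)
print(program["name"])
```

Note that BeautifulSoup already unescapes entities when you read an attribute value from a tag; they only reappear when the tag is printed.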

So I finally managed to solve the issue and successfully grab the data from the JavaScript page. The following code worked for me, in case anyone encounters the same issue when using Python to scrape a JavaScript webpage on Windows (where dryscrape is incompatible).
import bs4 as bs
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
browser = webdriver.Chrome()
url = ("http://programs.dsireusa.org/system/program/detail/284")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "html.parser")
table = soup.find('div', {'class': 'programOverview'})
data = []
for n in table.findAll("div", {"class": "ng-binding"}):
    trip = str(n.text)
    data.append(trip)

How can I get information from a web site using BeautifulSoup in python?

I have to take the publication date displayed in the following web page with BeautifulSoup in python:
https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410
The point is that when I search the HTML shown by the browser's 'inspect' tool, I find the publication date quickly, but when I search the HTML retrieved with Python, I cannot find it, even with the functions find() and find_all().
I tried this code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content)
soup.find_all('span', id_= 'biblio-publication-number-content')
but it gives me [], even though this tag is present in the 'inspect' view of the online page.
What am I doing wrong that makes the 'inspect' code differ from the one I get with BeautifulSoup?
How can I solve this issue and get the number?
I believe the problem is that the content you are looking for is loaded by JavaScript after the initial page is loaded. requests will only show you what the initial page content looked like before the DOM was modified by JavaScript.
For this, you might try installing selenium and then downloading a Selenium web driver for your specific browser. Install the driver in some directory that is on your path and then (here I am using Chrome):
from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup as bs
options = webdriver.ChromeOptions()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    driver.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
    # Wait (for up to 10 seconds) for the element we want to appear:
    driver.implicitly_wait(10)
    elem = driver.find_element(By.ID, 'biblio-publication-number-content')
    # Now we can use soup:
    soup = bs(driver.page_source, "html.parser")
    print(soup.find("span", {"id": "biblio-publication-number-content"}))
finally:
    driver.quit()
Prints:
<span id="biblio-publication-number-content"><span class="search">CN105030410</span>A·2015-11-11</span>
Umberto, if you are looking for all span elements, use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
results = soup.find_all('span')
[r for r in results]
If you are looking for an HTML element with the id 'biblio-publication-number-content', use the following code:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://worldwide.espacenet.com/patent/search/family/054437790/publication/CN105030410A?q=CN105030410')
soup = bs(r.content, 'html.parser')
soup.find_all(id='biblio-publication-number-content')
In the first case, you are fetching all span elements.
In the second case, you are fetching all elements with the id 'biblio-publication-number-content'.
I suggest you read up on HTML tags and elements for a deeper understanding of how they work and the semantics behind them.
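Separately from the JavaScript issue, the `id_=` keyword in the question's code would return an empty list even on fully rendered HTML. As far as I know, Beautiful Soup only special-cases `class_` (because `class` is a Python keyword); any other keyword is used as an attribute name as written. A small sketch on a made-up one-line document that mimics the tag from the question:

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the rendered page; only the id matches the question.
html_doc = '<span id="biblio-publication-number-content">CN105030410A</span>'
soup = BeautifulSoup(html_doc, "html.parser")

# `id_` is not special: it filters on an attribute literally named "id_".
wrong = soup.find_all("span", id_="biblio-publication-number-content")

# `id` is the real attribute name, so this one matches.
right = soup.find_all("span", id="biblio-publication-number-content")

print(len(wrong), len(right))
```

So once the page has been rendered (for example with Selenium), the lookup still needs `id=` rather than `id_=`.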

Web scraping with Python and beautifulsoup: What is saved by the BeautifulSoup function?

This question follows this previous question. I want to scrape data from a betting site using Python. I first tried to follow this tutorial, but the problem is that the site tipico is not available from Switzerland. I thus chose another betting site: Winamax. In the tutorial, the tipico webpage is first inspected in order to find where the betting rates are located in the HTML file. On the tipico webpage, they were stored in buttons of class "c_but_base c_but". With the following lines, the rates could therefore be saved and printed using the Beautiful Soup module:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.tipico.de/de/live-wetten/"
try:
    page = urllib.request.urlopen(url)
except:
    print("An error occurred.")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('c_but_base c_but')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
I thus tried to do the same with the webpage Winamax. I inspected the page and found that the betting rates were stored in buttons of class "ui-touchlink-needsclick price odd-price". See the code below:
from bs4 import BeautifulSoup
import urllib.request
import re
url = "https://www.winamax.fr/paris-sportifs/sports/1/7/4"
try:
    page = urllib.request.urlopen(url)
except Exception as e:
    print(f"An error occurred: {e}")
soup = BeautifulSoup(page, 'html.parser')
regex = re.compile('ui-touchlink-needsclick price odd-price')
content_lis = soup.find_all('button', attrs={'class': regex})
print(content_lis)
The problem is that it prints nothing: Python does not find any elements of that class (right?). I thus tried to print the soup object in order to see what exactly the BeautifulSoup function was doing. I added this line:
print(soup)
When printing it (I do not show the output here because it is too long), I notice that this is not the same text as what appears when I right-click and choose "Inspect" on the Winamax webpage. So what exactly is the BeautifulSoup function doing? How can I store the betting rates from the Winamax website using BeautifulSoup?
EDIT: I have never coded in html and I'm a beginner in Python, so some terminology might be wrong, that's why some parts are in italics.
That's because the website uses JavaScript to display these details, and BeautifulSoup does not interact with JS on its own.
First, find out whether the element you want to scrape is present in the page source; if it is, you can scrape pretty much everything. In your case, the button/span tags were not in the page source (meaning they are hidden or pulled in through a script).
No <button> tag in the page source:
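One way to run that check from Python rather than by eye is to count the tags in the HTML exactly as the server sent it. The sketch below uses only the standard library; `TagCounter` and `count_tags` are names made up for this example, and the HTML string is an invented stand-in for a JavaScript-driven page, not the Winamax source.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Count how often each opening tag occurs in an HTML string."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        self.counts[tag] += 1


def count_tags(html_text: str) -> Counter:
    """Parse html_text and return a Counter of tag names."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Invented stand-in: an empty mount point plus the script that would fill it.
static_html = '<html><body><div id="app"></div><script src="bundle.js"></script></body></html>'
counts = count_tags(static_html)
print(counts["button"], counts["script"])
```

Feeding it the decoded result of `urllib.request.urlopen(url).read()` shows what BeautifulSoup will actually receive: a zero count for `button` there means the buttons are created later by a script.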
So I suggest using Selenium as the solution, and I tried a basic scrape of the website.
Here is the code I used:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.binary_location = r'Your chrome.exe file path'
# Selenium 4 style; older releases used executable_path= and find_elements_by_tag_name()
browser = webdriver.Chrome(service=Service(r'Your chromedriver.exe file path'), options=option)
browser.get(r"https://www.winamax.fr/paris-sportifs/sports/1/7/4")
span_tags = browser.find_elements(By.TAG_NAME, 'span')
for span_tag in span_tags:
    print(span_tag.text)
browser.quit()
This is the output:
There is some junk data in this output, but it's up to you to figure out what you need and what you don't!

Why div returns empty when web-scraping Steam Game List?

I'm new to web scraping and BeautifulSoup4, so I'm sorry if my question is obvious.
I'm trying to get the hours played from Steam, but <div id="games_list_rows" style="position: relative"> returns None when it should return many different <div class="gameListRow" id="game_730"> elements with content inside.
I've tried with a friend's profile that has only a few games, because I thought that working with a lot of data could make BS4 ignore the div, but the div still shows up empty.
Here's my code:
import bs4 as bs
import urllib.request
# Retrieve profile
profile = "chubaquin"#input("enter profile: >")
search = "https://steamcommunity.com/id/"+profile+"/games/?tab=all"
sauce = urllib.request.urlopen(search)
soup = bs.BeautifulSoup(sauce, "lxml")
a = soup.find("div", id="games_list_rows")
print(a)
Thanks for your help!
The page content is loaded dynamically with JavaScript, so a plain HTTP fetch with urllib (or requests) won't contain it. Try using Selenium to scrape the page instead.
Install it with: pip install selenium.
Download the correct ChromeDriver from here.
from time import sleep
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
url = "https://steamcommunity.com/id/chubaquin/games/?tab=all"
# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(r"c:\path\to\chromedriver.exe"))
driver.get(url)
# Wait for the page to fully render before parsing it
sleep(5)
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
print(soup.find("div", id="games_list_rows"))
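A fixed sleep(5) is either longer than needed or, on a slow connection, too short. Polling for the element until a deadline is usually more robust. Selenium ships its own WebDriverWait for this; the helper below is only a library-free sketch of the same idea, with `wait_until` and the stand-in `rendered` function being names invented for this example.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Call predicate() repeatedly until it returns something truthy.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for a page that finishes rendering on the third check.
attempts = []


def rendered():
    attempts.append(1)
    return "rows" if len(attempts) >= 3 else None


print(wait_until(rendered, timeout=2.0, interval=0.01))
```

With a live driver, the predicate could be something like `lambda: driver.find_elements(By.ID, "games_list_rows")`, which returns an empty (falsy) list until the div exists.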
Have you tried the official Steam Web API? (xPaw docs are better than their own)
You need an API key, but they're free, and it's much easier to process the JSON result than to scrape the page(s), especially because the page can change occasionally whereas the JSON is unlikely to do so often at all.
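To illustrate why the JSON route is easier, here is a minimal sketch of turning such a response into hours played. The payload is made up, and its shape (a `response` object with a `games` list whose entries carry `appid`, `name`, and `playtime_forever` in minutes) is my recollection of the GetOwnedGames method, so treat the field names as assumptions and check them against the API docs.

```python
import json

# Made-up payload in the shape assumed above; real responses may differ.
sample = """
{"response": {"game_count": 2, "games": [
  {"appid": 730, "name": "Example Game A", "playtime_forever": 1500},
  {"appid": 570, "name": "Example Game B", "playtime_forever": 90}
]}}
"""


def hours_played(payload: str) -> dict:
    """Map each game name (or its appid if no name is present) to hours played."""
    games = json.loads(payload).get("response", {}).get("games", [])
    return {
        game.get("name", str(game["appid"])): round(game["playtime_forever"] / 60, 1)
        for game in games
    }


print(hours_played(sample))
```

No HTML parsing or browser is involved, so changes to the profile page layout would not break it.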

Extract data from BSE website

How can I extract the values of Security ID, Security Code, Group / Index, Wtd.Avg Price, Trade Date, Quantity Traded, and % of Deliverable Quantity to Traded Quantity using Python 3 and save them to an XLS file? Below is the link.
https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/
PS: I am completely new to Python. I know there are a few libraries that make scraping easier, like BeautifulSoup, selenium, requests, lxml, etc., but I don't have much idea about them.
Edit 1:
I tried something
from bs4 import BeautifulSoup
import requests
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'id':'newheaddivgrey'})
print(table)
Its output is None. I was expecting to get all the tables in the webpage and then filter them further to get the required data.
import requests
import lxml.html
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
root = lxml.html.fromstring(r.content)
title = root.xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(title)
Tried another code. Same problem.
Edit 2:
Tried selenium. But I am not getting the table contents.
from selenium import webdriver
driver = webdriver.Chrome(r"C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\bin\chromedriver.exe")
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
table=driver.find_elements_by_xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(table)
driver.quit()
Output is [<selenium.webdriver.remote.webelement.WebElement (session="befdd4f01e6152942c9cfc7c563a6bf2", element="0.13124528538297953-1")>]
After loading the page with Selenium, you can get the JavaScript-modified page source using driver.page_source. You can then pass this page source to the BeautifulSoup constructor.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('div', id='SecuritywiseDeliveryPosition')
This code will give you the Securitywise Delivery Position table in the table variable. You can then parse this BeautifulSoup object to get the different values you want.
The soup object contains the full page source, including the elements that were added dynamically, so you can parse it to get all the things you mentioned.
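As a rough sketch of that last parsing step, the rows of the div can be collected into lists and written out as CSV, which spreadsheet programs open directly (a true .xls file would need a third-party library). The HTML fragment and its values are made up, the real page nests tables inside tables so the row selection will likely need tightening, and `table_rows` is a helper name invented for this example.

```python
import csv
import io

from bs4 import BeautifulSoup

# Made-up fragment; the real page's markup and values will differ.
html_doc = """
<div id="SecuritywiseDeliveryPosition">
  <table>
    <tr><td>Trade Date</td><td>01 Jan 2018</td></tr>
    <tr><td>Quantity Traded</td><td>1,234</td></tr>
  </table>
</div>
"""


def table_rows(html_text, div_id):
    """Return each table row inside the given div as a list of cell strings."""
    soup = BeautifulSoup(html_text, "html.parser")
    container = soup.find("div", id=div_id)
    if container is None:
        return []
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["td", "th"])]
        for row in container.find_all("tr")
    ]


rows = table_rows(html_doc, "SecuritywiseDeliveryPosition")

# In-memory buffer for the demo; open("delivery.csv", "w", newline="") writes a file.
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
print(buffer.getvalue())
```

In practice `html_text` would be the `driver.page_source` string captured after the page has rendered.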

Web Scraping Python (BeautifulSoup,Requests)

I am learning web scraping using python but I can't get the desired result. Below is my code and the output
code
import bs4,requests
url = "https://twitter.com/24x7chess"
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text,"html.parser")
soup.find_all("span",{"class":"account-group-inner"})
[]
Here is what I was trying to scrape
https://i.stack.imgur.com/tHo5S.png
I keep on getting an empty array. Please Help.
Sites like Twitter load their content dynamically, and what gets loaded can depend on the browser you are using. Some elements are lazily loaded, which means the DOM is inflated dynamically in response to user actions. The tag you see in your browser's Inspect Element view belongs to the fully inflated HTML, but the response you get from requests is the uninflated HTML: a simple DOM still waiting to load those elements. That is why the element you are searching for is missing when you fetch the page with the requests module.
I would suggest using selenium webdriver for scraping dynamic JavaScript web pages.
Try this. It should give you the items you are probably looking for. Selenium is easy to combine with BeautifulSoup, so I've written it that way.
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://twitter.com/24x7chess")
soup = BeautifulSoup(driver.page_source,"lxml")
driver.quit()
for title in soup.select("#page-container"):
    name = title.select(".ProfileHeaderCard-nameLink")[0].text.strip()
    location = title.select(".ProfileHeaderCard-locationText")[0].text.strip()
    tweets = title.select(".ProfileNav-value")[0].text.strip()
    following = title.select(".ProfileNav-value")[1].text.strip()
    followers = title.select(".ProfileNav-value")[2].text.strip()
    likes = title.select(".ProfileNav-value")[3].text.strip()
    print(name,location,tweets,following,followers,likes)
Output:
akul chhillar New Delhi, India 214 44 17 5
You could have done the whole thing with requests rather than selenium:
import requests
from bs4 import BeautifulSoup as bs
import re
r = requests.get('https://twitter.com/24x7chess')
soup = bs(r.content, 'lxml')
bio = re.sub(r'\n+',' ', soup.select_one('[name=description]')['content'])
stats_headers = ['Tweets', 'Following', 'Followers', 'Likes']
stats = [item['data-count'] for item in soup.select('[data-count]')]
data = dict(zip(stats_headers, stats))
print(bio, data)
