Scraping a table with Beautiful Soup - Python

I'm trying to scrape the price table (buy yes, prices and contracts available) from this site: https://www.predictit.org/Contract/7069/Will-the-Senate-pass-the-Better-Care-Reconciliation-Act-by-July-31#prices.
This is my (obviously very preliminary) code, structured now just to find the table:
from bs4 import BeautifulSoup
import requests

url = "https://www.predictit.org/Contract/7069/Will-the-Senate-pass-the-Better-Care-Reconciliation-Act-by-July-31#prices"
ret = requests.get(url).text
soup = BeautifulSoup(ret, "lxml")
try:
    table = soup.find('table')
    print(table)
except AttributeError:
    print('No tables found, exiting')
The code finds and parses a table; however, it's the wrong one (the data table from a different tab: https://www.predictit.org/Contract/7069/Will-the-Senate-pass-the-Better-Care-Reconciliation-Act-by-July-31#data).
How can I make sure the code identifies the correct table?

As @downshift mentioned in the comments, the table is generated by JavaScript via an XHR request.
So you can either use Selenium or make a direct request to the site's API.
Using the 2nd option:
url = "https://www.predictit.org/PrivateData/GetPriceListAjax?contractId=7069"
ret = requests.get(url).text
soup = BeautifulSoup(ret, "lxml")
table = soup.find('table')
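From there you can walk the rows of the returned fragment; a minimal sketch (the exact cell layout of the fragment is an assumption, so check it against the actual response):
for row in table.find_all('tr'):
    # collect the text of every cell in the row (layout assumed, verify it)
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:
        print(cells)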

Related

web scraping infogol.net AttributeError with Beautiful Soup

I am trying to scrape the information for Ajax matches from Infogol. When I inspect the webpage, I find that the table class is 'teamstats-summary-matches ng-scope', but when I try this I find nothing. So far I have come up with the following code:
import requests
from bs4 import BeautifulSoup

# Set the URL of the webpage you want to scrape
url = 'https://www.infogol.net/en/team/ajax/62'
# Make a request to the webpage
response = requests.get(url)
# Parse the HTML of the webpage
soup = BeautifulSoup(response.text, 'html.parser')
# Find the table containing the data
table = soup.find('table', class_='teamstats-summary-matches ng-scope')
if not table:
    print('Cannot find table')
Check that you have found what you are expecting before proceeding:
import sys

# Find the table containing the data
table = soup.find('table', class_='stats-table')
if not table:
    print('Cannot find table')
    sys.exit(1)

Python Beautiful Soup not pulling all the data

I'm currently looking to pull specific issuer data, marked by a specific class and ID, from the HTML of a Luxembourg Stock Exchange page using Beautiful Soup.
The example link I'm using is here: https://www.bourse.lu/security/XS1338503920/234821
And the data I'm trying to pull is the name under 'Issuer' stored as text; in this case it's 'BNP Paribas Issuance BV'.
I've tried using the class vignette-description-content-text, but it can't seem to find any data; when looking through the soup, not all of the HTML is there.
I've found that my current code only pulls some of the HTML, and I don't know how to expand the data it's pulling.
import requests
from bs4 import BeautifulSoup
URL = "https://www.bourse.lu/security/XS1338503920/234821"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='ResultsContainer', class_="vignette-description-content-text")
I have found similar problems and followed guides shown in link 1, link 2 and link 3, but the example html used seems very different to the webpage I'm looking to scrape.
Is there something I'm missing to pull and scrape the data?
Based on your code, I suspect you are trying to get the element which has class="vignette-description-content-text" and id="ResultsContainer".
Using class_ is the correct approach, but not together with the id like that.
Try this:
import requests
from bs4 import BeautifulSoup

URL = "https://www.bourse.lu/security/XS1338503920/234821"
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

def applyFilter(element):
    # match only elements carrying both the target class and the target id
    if element.has_attr('id') and element.has_attr('class'):
        if "vignette-description-content-text" in element['class'] and element['id'] == "ResultsContainer":
            return True
    return False

results = soup.find_all(applyFilter)
for result in results:
    # each result is a matching element
    print(result.get_text(strip=True))
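A shorter equivalent is a CSS selector requiring both the id and the class on the same element (a sketch of the same match, assuming the element is present in the fetched HTML):
# same filter expressed as a CSS selector
result = soup.select_one('#ResultsContainer.vignette-description-content-text')
if result:
    print(result.get_text(strip=True))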

Python Error: 'NoneType' object has no attribute 'find_all' using Beautiful Soup

I'm having a problem with some web-scraping code that I'm trying to run to scrape information from a series of links like the following:
http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument
I am trying to scrape certain elements from the table, but I received the following error:
Python Error: 'NoneType' object has no attribute 'find_all'
I know this has to do with the fact that it's not actually finding the table because when I run the following simplified code:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)
It returns None for the printed table, meaning the code cannot find the table at all. I've been running similar code on similar pages and was able to find the table just fine, so I'm not sure why this one isn't working. I'm new to web scraping, but I'd appreciate any help!
I think the HTML contains some flaws that make html.parser fail to parse it properly; you can verify that by printing page.text and then printing soup, and you will find that the parser has dropped parts of the document.
The lxml parser, however, parses it successfully despite the flaws, as lxml copes better with ill-formed HTML documents:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import time
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)
That should match the table tag correctly.
The soup doesn't parse the website content correctly because one tag is malformed and breaks the structure. You have to fix it before parsing:
url = 'http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument'
page = requests.get(url)
# repair the malformed closing tag so html.parser keeps the rest of the page
soup = BeautifulSoup(page.text.replace("</script\n", "</script>"), 'html.parser')
table = soup.find('table', {'bordercolor' : '#6583A0'})
print(table)
Alternatively, pandas can read the table directly:
import pandas as pd

df = pd.read_html(
    "http://www2.congreso.gob.pe/Sicr/TraDocEstProc/CLProLey2006.nsf/ec97fee42a2412d5052578bb001539ee/89045fe8ae896e2e0525751c005544cd?OpenDocument")[0]
print(df)
df.to_csv("Data.csv", index=False, header=None)

How to scrape a specific table from a website using Python (beautifulsoup4 and requests or any other library)?

https://en.wikipedia.org/wiki/Economy_of_the_European_Union
Above is the link to the website, and I want to scrape the table "Fortune top 10 E.U. corporations by revenue (2016)".
Please share the code for the same. Here is my attempt:
import requests
from bs4 import BeautifulSoup

def web_crawler(url):
    page = requests.get(url)
    plain_text = page.text
    soup = BeautifulSoup(plain_text, "html.parser")
    tables = soup.findAll("tbody")[1]
    print(tables)

soup = web_crawler("https://en.wikipedia.org/wiki/Economy_of_the_European_Union")
Following what @FanMan said, this is simple code to help you get started; keep in mind that you will need to clean it up and do the rest of the work on your own.
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Economy_of_the_European_Union'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
temp_datastore = list()
for text in soup.findAll('p'):
    w = text.findAll(text=True)
    if len(w) > 0:
        temp_datastore.append(w)
Some documentation:
Beautiful Soup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests: http://docs.python-requests.org/en/master/user/intro/
urllib: https://docs.python.org/2/library/urllib.html
Your first issue is that your url is not properly defined. After that you need to find the table to extract and its class. In this case the class was "wikitable" and it was the first table. I have started your code for you so it gives you the extracted data from the table. Web scraping is good to learn, but if you are just starting to program, practice with some simpler stuff first.
import requests
from bs4 import BeautifulSoup

def webcrawler():
    url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
    page = requests.get(url)
    soup = BeautifulSoup(page.text, "html.parser")
    tables = soup.findAll("table", class_='wikitable')[0]
    print(tables)

webcrawler()
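To turn the matched table into usable rows rather than raw HTML, you can iterate over its tr elements; a minimal sketch (whether the Fortune top-10 table is the first wikitable on the page is an assumption, so adjust the index if needed):
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Economy_of_the_European_Union"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
# assumption: the target is the first wikitable; use find_all and pick
# a different index if it is not
table = soup.find("table", class_="wikitable")
for row in table.find_all("tr"):
    # header rows use <th>, data rows use <td>
    cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    print(cells)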

Extract data from BSE website

How can I extract the values of Security ID, Security Code, Group / Index, Wtd.Avg Price, Trade Date, Quantity Traded, and % of Deliverable Quantity to Traded Quantity using Python 3 and save them to an XLS file? Below is the link.
https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/
PS: I am completely new to Python. I know there are a few libraries that make scraping easier, like BeautifulSoup, Selenium, requests, and lxml, but I don't have much of an idea about them.
Edit 1:
I tried something
from bs4 import BeautifulSoup
import requests
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'id':'newheaddivgrey'})
print(table)
Its output is None. I was expecting to get all the tables on the webpage and filter them further to get the required data.
import requests
import lxml.html

URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
root = lxml.html.fromstring(r.content)
title = root.xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(title)
I tried other code; same problem.
Edit 2:
I tried Selenium, but I am not getting the table contents.
from selenium import webdriver

driver = webdriver.Chrome(r"C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\bin\chromedriver.exe")
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
table = driver.find_elements_by_xpath('//*[@id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(table)
driver.quit()
Output is [<selenium.webdriver.remote.webelement.WebElement (session="befdd4f01e6152942c9cfc7c563a6bf2", element="0.13124528538297953-1")>]
After loading the page with Selenium, you can get the JavaScript-modified page source using driver.page_source, and then pass this page source to the BeautifulSoup constructor.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
html = driver.page_source
driver.quit()

soup = BeautifulSoup(html, 'lxml')
table = soup.find('div', id='SecuritywiseDeliveryPosition')
This code will give you the Securitywise Delivery Position table in the table variable. You can then parse this BeautifulSoup object to get the different values you want.
The soup object contains the full page source, including the elements that were added dynamically, so you can parse it to get all the things you mentioned.
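For instance, a minimal sketch of that parsing step plus the XLS export the question asks for (the table layout inside the div and the output file name are assumptions; pandas needs openpyxl installed for .xlsx output):
from io import StringIO

import pandas as pd

# pandas can parse the <table> markup inside the extracted div; which of
# the parsed tables holds the delivery data is an assumption -- inspect it
df = pd.read_html(StringIO(str(table)))[0]
print(df)

# hypothetical output file name; requires openpyxl
df.to_excel('bse_data.xlsx', index=False)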
