Getting a CSS table with beautiful soup

Getting a CSS table with beautiful soup - python

I have tried to get the data from this table but I have been unable to do so: https://datagolf.ca/player-trends
I have tried many things for the last few hours, below is my most recent when just returns an empty list.
import bs4
import requests
res = requests.get('https://datagolf.ca/player-trends')
soup = bs4.BeautifulSoup(res.text, 'lxml')
table = soup.find_all("div", class_ = "table")
table
Is the issue something similar to this:
Scrape of table with only 'div's

This page is java script rendered. Take a closer look at what requests.get('https://datagolf.ca/player-trends') actually returns. It does not contain the table.
Despite that it pulls the data from https://dl.dropboxusercontent.com/s/hprrfklqs5q0oge/player_trends_dev.csv?dl=1

Related

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from a jQuery. I have managed to write the below code which gets a large amount of text, where the index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but dont know how. what is the best way to solve this.
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print soup.find_all('script')[21]

It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)

Try simply:
soup.select_one('div.field-redshift > div.value>b').text

If you view the Page Source of the URL, you will find that there are two script elements that are having CDATA. But the script element in which you are interested has jQuery in it. So you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of CDATA tags and jQuery. Then with the help of json library, convert JSON data to Python Dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
if 'CDATA' in script.text and 'jQuery' in script.text:
scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])

How do I scrape a 'td' in web scraping

I am learning web scraping and I'm scraping in this following website: ivmp servers. I have trouble with scraping the number of players in the server, can someone help me? I will send the code of what I've done so far
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find('table')
summary = players.find('div', class_ ='players')
print(summary)

Looking at the page you provided, i can assume that the table you want to extract information from is the one with server names and ip adresses.
There are actually 4 "table" element on this page.
Luckily for you, this table has an id (serverlist). You can easily find it with right click > inspect on Chrome
players = soup.select_one('table#serverlist')
Now you want to get the td.
You can print all of them using :
for td in players.select("td"):
print(td)
Or you can select the one you are interested in :
players.select("td.hostname")
for example.
Hope this helps.

Looking at the structure of the page, there are a few table cells (td) with the class "players", it looks like two of them are for sorting the table, so we'll assume you don't want those.
In order to extract the one(s) you do want, I would first query for all the td elements with the class "players", and then loop through them adding only the ones we do want to an array.
Something like this:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find_all('td', class_='players')
summary = []
for cell in players:
# Exclude the cells which are for sorting
if cell.get_text() != 'Players':
summary.append(cell.get_text())
print(summary)

Unable to Access a div, using BeautifulSoup

I'm unable to parse past the div id= "id="divTradeHaltResults". When I try to return the table within this div I get None. Thanks in advance!
from bs4 import BeautifulSoup
import requests
my_url = "https://www.nasdaqtrader.com/Trader.aspx?id=TradeHalts"
r = requests.get(url=my_url)
page_text = r.text
soup = BeautifulSoup(page_text, "lxml")
table = soup.table
print(table)

If you select that tag inside the soup, you get the tag but it's empty. If you look on the webpage, you can see the table in the tag. My guess is that this table is generated with JS (in some form), thus it does not come with the HTML. My solution would be to turn to something like Selenium.
This is the code I ran to select that tag:
soup.find('div', {'id':'divTradeHaltResults'})
# <div id="divTradeHaltResults"></div>
If you look into the JS on the page, you can actually find the function that generates the table, as I mentioned above:
function GetTradeHalts()
{
document.getElementById('divTradeHaltResults').innerHTML = "updating....";
Server.BL_TradeHalt.GetTradeHalts(cb_GetTradeHalts);
setTimeout(GetTradeHalts, 1000 * 60);
}

python bs4 scrape table gets wrong results

I am trying to scrape this site : http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw and get the table at the bottom. When I try to scrape it, I get some elements of the first row, but nothing from the rest of the table. Here is my code
urlText = "http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw"
url = urlopen(urlText)
soup = bs.BeautifulSoup(url,"html.parser")
certificates = soup.find('table',class_='table table-bordered')
for row in certificates.find_all('tr'):
for td in row.find_all('td'):
print td.text
What I get as an output is:
22-20353
SHIP SECURITY OFFICER
Rather than the whole table.
What am I missing ?

It is yet another case of when an underlying parser makes a difference. Switch to lxml or html5lib to see the complete table parsed:
soup = bs.BeautifulSoup(url, "lxml")
soup = bs.BeautifulSoup(url, "html5lib")

Scraping a table with beautiful soup

I'm trying to scrape the price table (buy yes, prices and contracts available) from this site: https://www.predictit.org/Contract/7069/Will-the-Senate-pass-the-Better-Care-Reconciliation-Act-by-July-31#prices.
This is my (obviously very preliminary) code, structured now just to find the table:
from bs4 import BeautifulSoup
import requests
from lxml import html
import json, re
url = "https://www.predictit.org/Contract/7069/Will-the-Senate-pass-the-Better-Care-Reconciliation-Act-by-July-31#prices"
ret = requests.get(url).text
soup = BeautifulSoup(ret, "lxml")
try:
table = soup.find('table')
print table
except AttributeError as e:
print 'No tables found, exiting'
The code finds and parses a table; however, it's the wrong one (the data table on a different tab https://www.predictit.org/Contract/7069/Will-the-Senate-pass-the-Better-Care-Reconciliation-Act-by-July-31#data).
How do I resolve this error to ensure the code identifies the correct table?

As #downshift mentioned in the comments the table is js generated using xhr request.
So you can either use Selenium or make a direct request to the site's api.
Using the 2nd option:
url = "https://www.predictit.org/PrivateData/GetPriceListAjax?contractId=7069"
ret = requests.get(url).text
soup = BeautifulSoup(ret, "lxml")
table = soup.find('table')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Getting a CSS table with beautiful soup - python

This page is java script rendered. Take a closer look at what requests.get('https://datagolf.ca/player-trends') actually returns. It does not contain the table. Despite that it pulls the data from https://dl.dropboxusercontent.com/s/hprrfklqs5q0oge/player_trends_dev.csv?dl=1

Related

How to get CData from html using beautiful soup

How do I scrape a 'td' in web scraping

Unable to Access a div, using BeautifulSoup

python bs4 scrape table gets wrong results

Scraping a table with beautiful soup

Categories

Resources