python bs4 scrape table gets wrong results

python bs4 scrape table gets wrong results - python

I am trying to scrape this site : http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw and get the table at the bottom. When I try to scrape it, I get some elements of the first row, but nothing from the rest of the table. Here is my code
urlText = "http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw"
url = urlopen(urlText)
soup = bs.BeautifulSoup(url,"html.parser")
certificates = soup.find('table',class_='table table-bordered')
for row in certificates.find_all('tr'):
for td in row.find_all('td'):
print td.text
What I get as an output is:
22-20353
SHIP SECURITY OFFICER
Rather than the whole table.
What am I missing ?

It is yet another case of when an underlying parser makes a difference. Switch to lxml or html5lib to see the complete table parsed:
soup = bs.BeautifulSoup(url, "lxml")
soup = bs.BeautifulSoup(url, "html5lib")

Related

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from a jQuery. I have managed to write the below code which gets a large amount of text, where the index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and output "redshift":"0.06" but dont know how. what is the best way to solve this.
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print soup.find_all('script')[21]

It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)

Try simply:
soup.select_one('div.field-redshift > div.value>b').text

If you view the Page Source of the URL, you will find that there are two script elements that are having CDATA. But the script element in which you are interested has jQuery in it. So you have to select the script element based on this knowledge. After that, you need to do some cleaning to get rid of CDATA tags and jQuery. Then with the help of json library, convert JSON data to Python Dictionary.
import requests
from bs4 import BeautifulSoup
import json
page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
if 'CDATA' in script.text and 'jQuery' in script.text:
scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])

How can I scrape the contents inside the 'sorting_1' class with Python?

I've been given a project to make covid tracker. I decided to scrape some elements through the site (https://www.worldometers.info/coronavirus/). I'm very new to python so decided to go with BeautifulSoup. I was able to scrape the basic elements, like the total cases, active cases and so on. However, whenever I try to grab the country names or the numbers, it returns an empty list. Even though there exists a class 'sorting_1', it still returns an empty list. Could someone guide me where am I going wrong?
This is something which I am trying to grab:
<td style="font-weight: bold; text-align:right" class="sorting_1">4,918,420</td>
Here is my current code:
import requests
import bs4
#making a request and a soup
res = requests.get('https://www.worldometers.info/coronavirus/')
soup = bs4.BeautifulSoup(res.text, 'lxml')
#scraping starts here
total_cases = soup.select('.maincounter-number')[0].text
total_deaths = soup.select('.maincounter-number')[1].text
total_recovered = soup.select('.maincounter-number')[2].text
active_cases = soup.select('.number-table-main')[0].text
country_cases = soup.find_all('td', {'class': 'sorting_1'})

You can get sorting_1 class because it not present in page source.
You have found all rows from the table and then read information from the required columns.
So, to get total cases for each country, you can use following code:
import requests
import bs4
res = requests.get('https://www.worldometers.info/coronavirus/')
soup = bs4.BeautifulSoup(res.text, 'lxml')
country_cases = soup.find_all('td', {'class': 'sorting_1'})
rows = soup.select('table#main_table_countries_today tr')
for row in rows[8:18]:
tds = row.find_all('td')
print(tds[1].text.strip(), '=', tds[2].text.strip())

Welcome to SO!
Looking at their website, it seems that the sorting_X classes are added by javascript, so they don't exist in the raw html.
The table does exist, however, so i'd advise to loop over the table rows similar to this:
table_rows = soup.find("table", id="main_table_countries_today").find_all("tr")
for row in table_rows:
name = "unknown"
# Find country name
for td in row.find_all("td"):
if td.find("mt_a"): # This kind of link apparently only exists in the "name" column
name = td.find("a").text
# Do some more scraping
Warning, i didn't work with soup for a while so this may not be 100% correct. You get the idea.

How do I scrape a 'td' in web scraping

I am learning web scraping and I'm scraping in this following website: ivmp servers. I have trouble with scraping the number of players in the server, can someone help me? I will send the code of what I've done so far
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find('table')
summary = players.find('div', class_ ='players')
print(summary)

Looking at the page you provided, i can assume that the table you want to extract information from is the one with server names and ip adresses.
There are actually 4 "table" element on this page.
Luckily for you, this table has an id (serverlist). You can easily find it with right click > inspect on Chrome
players = soup.select_one('table#serverlist')
Now you want to get the td.
You can print all of them using :
for td in players.select("td"):
print(td)
Or you can select the one you are interested in :
players.select("td.hostname")
for example.
Hope this helps.

Looking at the structure of the page, there are a few table cells (td) with the class "players", it looks like two of them are for sorting the table, so we'll assume you don't want those.
In order to extract the one(s) you do want, I would first query for all the td elements with the class "players", and then loop through them adding only the ones we do want to an array.
Something like this:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find_all('td', class_='players')
summary = []
for cell in players:
# Exclude the cells which are for sorting
if cell.get_text() != 'Players':
summary.append(cell.get_text())
print(summary)

Getting a CSS table with beautiful soup

I have tried to get the data from this table but I have been unable to do so: https://datagolf.ca/player-trends
I have tried many things for the last few hours, below is my most recent when just returns an empty list.
import bs4
import requests
res = requests.get('https://datagolf.ca/player-trends')
soup = bs4.BeautifulSoup(res.text, 'lxml')
table = soup.find_all("div", class_ = "table")
table
Is the issue something similar to this:
Scrape of table with only 'div's

This page is java script rendered. Take a closer look at what requests.get('https://datagolf.ca/player-trends') actually returns. It does not contain the table.
Despite that it pulls the data from https://dl.dropboxusercontent.com/s/hprrfklqs5q0oge/player_trends_dev.csv?dl=1

Extract data from BSE website

How can I extract the value of Security ID, Security Code, Group / Index, Wtd.Avg Price, Trade Date, Quantity Traded, % of Deliverable Quantity to Traded Quantity using Python 3 and save it to an XLS file. Below is the link.
https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/
PS: I am completely new to the python. I know there are few libs which make scrapping easier like BeautifulSoup, selenium, requests, lxml etc. Don't have much idea about them.
Edit 1:
I tried something
from bs4 import BeautifulSoup
import requests
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
table = soup.find('div', attrs = {'id':'newheaddivgrey'})
print(table)
Its output is None. I was expecting all tables in the webpage and filter them further to get required data.
import requests
import lxml.html
URL = 'https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/'
r = requests.get(URL)
root = lxml.html.fromstring(r.content)
title = root.xpath('//*[#id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(title)
Tried another code. Same problem.
Edit 2:
Tried selenium. But I am not getting the table contents.
from selenium import webdriver
driver = webdriver.Chrome(r"C:\Program Files\JetBrains\PyCharm Community Edition 2017.3.3\bin\chromedriver.exe")
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
table=driver.find_elements_by_xpath('//*[#id="SecuritywiseDeliveryPosition"]/table/tbody/tr/td/table/tbody/tr[1]/td')
print(table)
driver.quit()
Output is [<selenium.webdriver.remote.webelement.WebElement (session="befdd4f01e6152942c9cfc7c563a6bf2", element="0.13124528538297953-1")>]

After loading the page with Selenium, you can get the Javascript modified page source using driver.page_source. You can then pass this page source in the BeautifulSoup object.
driver = webdriver.Chrome()
driver.get('https://www.bseindia.com/stock-share-price/smartlink-network-systems-ltd/smartlink/532419/')
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'lxml')
table = soup.find('div', id='SecuritywiseDeliveryPosition')
This code will give you the Securitywise Delivery Position table in the table variable. You can then parse this BeautifulSoup object to get the different values you want.
The soup object contains the full page source including the elements that were dynamically added. Now, you can parse this to get all the things you mentioned.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

python bs4 scrape table gets wrong results - python

It is yet another case of when an underlying parser makes a difference. Switch to lxml or html5lib to see the complete table parsed: soup = bs.BeautifulSoup(url, "lxml") soup = bs.BeautifulSoup(url, "html5lib")

Related

How to get CData from html using beautiful soup

How can I scrape the contents inside the 'sorting_1' class with Python?

How do I scrape a 'td' in web scraping

Getting a CSS table with beautiful soup

Extract data from BSE website

Categories

Resources