How do I scrape a 'td' in web scraping - python

I am learning web scraping and I'm scraping the following website: ivmp servers. I'm having trouble scraping the number of players on each server; can someone help me? Here is the code I have so far:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find('table')
summary = players.find('div', class_ ='players')
print(summary)

Looking at the page you provided, I can assume that the table you want to extract information from is the one with server names and IP addresses.
There are actually four "table" elements on this page.
Luckily for you, this table has an id (serverlist). You can easily find it with right click > Inspect in Chrome.
players = soup.select_one('table#serverlist')
Now you want to get the td cells.
You can print all of them using:
for td in players.select("td"):
    print(td)
Or you can select just the ones you are interested in:
players.select("td.hostname")
for example.
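Putting those pieces together, here is a minimal sketch for pulling the player counts; it assumes the player-count cells carry the class players, as in your own snippet:
import requests
from bs4 import BeautifulSoup

source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')

# The server list table has id="serverlist"
serverlist = soup.select_one('table#serverlist')

# Print the text of every "players" cell in that table
for td in serverlist.select('td.players'):
    print(td.get_text(strip=True))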
Hope this helps.

Looking at the structure of the page, there are a few table cells (td) with the class "players". It looks like two of them are for sorting the table, so we'll assume you don't want those.
In order to extract the one(s) you do want, I would first query for all the td elements with the class "players", and then loop through them, adding only the ones we do want to a list.
Something like this:
import requests
from bs4 import BeautifulSoup
source = requests.get('https://www.game-state.com/index.php?game=ivmp').text
soup = BeautifulSoup(source, 'html.parser')
players = soup.find_all('td', class_='players')
summary = []
for cell in players:
    # Exclude the cells which are for sorting
    if cell.get_text() != 'Players':
        summary.append(cell.get_text())
print(summary)

Related

How can I scrape the contents inside the 'sorting_1' class with Python?

I've been given a project to make a COVID tracker. I decided to scrape some elements from the site (https://www.worldometers.info/coronavirus/). I'm very new to Python, so I decided to go with BeautifulSoup. I was able to scrape the basic elements, like the total cases, active cases and so on. However, whenever I try to grab the country names or the numbers, it returns an empty list. Even though there exists a class 'sorting_1', it still returns an empty list. Could someone guide me on where I am going wrong?
This is something which I am trying to grab:
<td style="font-weight: bold; text-align:right" class="sorting_1">4,918,420</td>
Here is my current code:
import requests
import bs4
#making a request and a soup
res = requests.get('https://www.worldometers.info/coronavirus/')
soup = bs4.BeautifulSoup(res.text, 'lxml')
#scraping starts here
total_cases = soup.select('.maincounter-number')[0].text
total_deaths = soup.select('.maincounter-number')[1].text
total_recovered = soup.select('.maincounter-number')[2].text
active_cases = soup.select('.number-table-main')[0].text
country_cases = soup.find_all('td', {'class': 'sorting_1'})
You can't get the sorting_1 class because it is not present in the page source; it is added by JavaScript after the page loads.
Instead, find all the rows of the countries table and then read the information from the columns you need.
So, to get the total cases for each country, you can use the following code:
import requests
import bs4

res = requests.get('https://www.worldometers.info/coronavirus/')
soup = bs4.BeautifulSoup(res.text, 'lxml')

# This still returns an empty list, because sorting_1 is added by JavaScript
country_cases = soup.find_all('td', {'class': 'sorting_1'})

# Walk the rows of the countries table instead and read the columns you need
rows = soup.select('table#main_table_countries_today tr')
for row in rows[8:18]:  # skip the header/total rows at the top
    tds = row.find_all('td')
    print(tds[1].text.strip(), '=', tds[2].text.strip())
Welcome to SO!
Looking at their website, it seems that the sorting_X classes are added by JavaScript, so they don't exist in the raw HTML.
The table does exist, however, so I'd advise looping over the table rows, similar to this:
table_rows = soup.find("table", id="main_table_countries_today").find_all("tr")
for row in table_rows:
    name = "unknown"
    # Find the country name
    for td in row.find_all("td"):
        # A link with class "mt_a" apparently only exists in the "name" column
        if td.find("a", class_="mt_a"):
            name = td.find("a").text
    # Do some more scraping
Warning: I haven't worked with soup for a while, so this may not be 100% correct, but you get the idea.
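For illustration, a slightly fuller version of that loop as a sketch; the column indexes used for the name and case count follow the first answer above and are assumptions:
import requests
import bs4

res = requests.get('https://www.worldometers.info/coronavirus/')
soup = bs4.BeautifulSoup(res.text, 'lxml')

cases_by_country = {}
table_rows = soup.find("table", id="main_table_countries_today").find_all("tr")
for row in table_rows:
    tds = row.find_all("td")
    # Country rows have an <a class="mt_a"> link in the name column (index 1, assumed)
    if len(tds) > 2 and tds[1].find("a", class_="mt_a"):
        name = tds[1].find("a").text
        cases_by_country[name] = tds[2].text.strip()

# Show a few entries
for name, cases in list(cases_by_country.items())[:5]:
    print(name, "=", cases)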

how to scrape multiple pages in python with bs4

I have been scraping the website "https://www.zaubacorp.com/company-list" but I am not able to scrape the email IDs from the links in the table. I need to scrape the Name, Email and Directors from the link in the given table. Can anyone please resolve my issue, as I am a newbie to web scraping using Python with Beautiful Soup and requests?
Thank You
Dieksha
#Scraping the website
#Import a library to query a website
import requests
#Specify the URL
companies_list = "https://www.zaubacorp.com/company-list"
link = requests.get("https://www.zaubacorp.com/company-list").text
#Import BeautifulSoup
from bs4 import BeautifulSoup
soup = BeautifulSoup(link,'lxml')
soup.table.find_all('a')
all_links = soup.table.find_all('a')
for link in all_links:
    print(link.get("href"))
Well let's break down the website and see what we can do.
First off, I can see that this website is paginated. Pagination can be anything from the site using part of the GET query string to determine which page we are requesting, to an AJAX call that fills the table with new data when you click next. From clicking on the next page and subsequent pages, we are in luck: this website uses a GET query parameter.
Our URL for requesting the webpage to scrape is going to be
https://www.zaubacorp.com/company-list/p-<page_num>-company.html
We are going to write a bit of code that will fill that page num with values ranging from 1 to the last page you want to scrape. In this case, we do not need to do anything special to determine the last page of the table since we can skip to the end and find that it will be page 13,333. This means that we will be making 13,333 page requests to this website to fully collect all of its data.
As for gathering the data from the website we will need to find the table that holds the information and then iteratively select the elements to pull out the information.
In this case we can actually "cheat" a little, since there appears to be only a single tbody on the page. We want to iterate over all the tr elements and pull out the text. I'm going to go ahead and write the sample.
import requests
import bs4

def get_url(page_num):
    # Build the paginated URL: .../p-1-company.html, .../p-2-company.html, ...
    return "https://www.zaubacorp.com/company-list/p-" + str(page_num) + "-company.html"

def scrape_row(tr):
    # Pull the text out of every cell in a row
    return [td.text for td in tr.find_all("td")]

def scrape_table(table):
    table_data = []
    for tr in table.find_all("tr"):
        table_data.append(scrape_row(tr))
    return table_data

def scrape_page(page_num):
    req = requests.get(get_url(page_num))
    soup = bs4.BeautifulSoup(req.content, "lxml")
    # There is only one table of companies on the page, so scraping the whole soup works
    data = scrape_table(soup)
    for line in data:
        print(line)

for i in range(1, 3):
    scrape_page(i)
This code will scrape the first two pages of the website and by just changing the for loop range you can get all 13,333 pages. From here you should be able to just modify the printout logic to save to a CSV.
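As a rough sketch of that last step (the output filename and the empty-row check are illustrative choices, not part of the answer above):
import csv
import requests
import bs4

def get_url(page_num):
    return "https://www.zaubacorp.com/company-list/p-" + str(page_num) + "-company.html"

with open("companies.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for page in range(1, 3):  # widen the range to cover more pages
        soup = bs4.BeautifulSoup(requests.get(get_url(page)).content, "lxml")
        for tr in soup.find_all("tr"):
            cells = [td.text.strip() for td in tr.find_all("td")]
            if cells:  # header rows use th rather than td, so they produce an empty list here
                writer.writerow(cells)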

Getting a CSS table with beautiful soup

I have tried to get the data from this table but I have been unable to do so: https://datagolf.ca/player-trends
I have tried many things over the last few hours; below is my most recent attempt, which just returns an empty list.
import bs4
import requests
res = requests.get('https://datagolf.ca/player-trends')
soup = bs4.BeautifulSoup(res.text, 'lxml')
table = soup.find_all("div", class_ = "table")
table
Is the issue something similar to this:
Scrape of table with only 'div's
This page is JavaScript rendered. Take a closer look at what requests.get('https://datagolf.ca/player-trends') actually returns: it does not contain the table.
Despite that, the page pulls its data from https://dl.dropboxusercontent.com/s/hprrfklqs5q0oge/player_trends_dev.csv?dl=1
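If that CSV endpoint still works, a minimal sketch for reading it directly (the URL is the one from the line above; treating the first row as a header is an assumption):
import csv
import io
import requests

csv_url = "https://dl.dropboxusercontent.com/s/hprrfklqs5q0oge/player_trends_dev.csv?dl=1"
res = requests.get(csv_url)
res.raise_for_status()

# Parse the CSV text into rows; the first row is assumed to be the header
reader = csv.reader(io.StringIO(res.text))
header = next(reader)
print(header)
for row in list(reader)[:5]:  # show the first few data rows
    print(row)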

Scraping page with BS in python only captures first column of splitColumn

I'm trying to scrape the last part of this page through BeautifulSoup in python.
I want to retrieve all the companies listed in the bottom. Furthermore, the companies are ordered alphabetically, where the companies with titles starting with "A-F" appear under the first tab, then "G-N" under the second tab and so on. You have to click the tabs for the names to appear, so I'll loop through the different "name pages" and apply the same code.
I'm having trouble retrieving all the names of a single page, however.
When looking at the companies named "A-F" I can only retrieve the names of the first column of the table.
My code is:
from bs4 import BeautifulSoup as Soup
import requests
incl_page_url = "https://www.triodos.com/en/investment-management/socially-responsible-investment/sustainable-investment-universe/companies-atmf1/"
page = requests.get(incl_page_url)
soup = Soup(page.content, "html.parser")
for header in soup.find("h2").next_siblings:
    try:
        for a in header.childGenerator():
            if str(type(a)) == "<class 'bs4.element.NavigableString'>":
                print(str(a))
    except:
        pass
As can be seen by running this, I only get the names from the first column.
Any help is very much appreciated.
Give this a shot and tell me this is not what you wanted:
from bs4 import BeautifulSoup
import requests
incl_page_url = "https://www.triodos.com/en/investment-management/socially-responsible-investment/sustainable-investment-universe/companies-atmf1/"
page = requests.get(incl_page_url).text
soup = BeautifulSoup(page, "lxml")
for items in soup.select(".splitColumn p"):
    title = '\n'.join([item for item in items.strings])
    print(title)
Result:
3iGroup
8point3 Energy Partners 
A
ABN AMRO
Accell Group
Accsys Technologies
Achmea
Acuity Brands
Adecco
Adidas
Adobe Systems
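If you then loop over the other tabs as planned in the question, the outer loop could look something like this sketch; only the first URL comes from the question, the rest would have to be read off the tab links on the page:
from bs4 import BeautifulSoup
import requests

# Only the first URL is from the question; fill in the URLs behind the other tabs
# (G-N and so on) by inspecting the tab links on the page.
tab_urls = [
    "https://www.triodos.com/en/investment-management/socially-responsible-investment/sustainable-investment-universe/companies-atmf1/",
]

for url in tab_urls:
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    for items in soup.select(".splitColumn p"):
        print('\n'.join(items.strings))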

python bs4 scrape table gets wrong results

I am trying to scrape this site: http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw and get the table at the bottom. When I try to scrape it, I get some elements of the first row, but nothing from the rest of the table. Here is my code:
from urllib.request import urlopen
import bs4 as bs

urlText = "http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw"
url = urlopen(urlText)
soup = bs.BeautifulSoup(url, "html.parser")
certificates = soup.find('table', class_='table table-bordered')
for row in certificates.find_all('tr'):
    for td in row.find_all('td'):
        print(td.text)
What I get as an output is:
22-20353
SHIP SECURITY OFFICER
Rather than the whole table.
What am I missing ?
It is yet another case where the underlying parser makes a difference. Switch to lxml or html5lib to see the complete table parsed:
soup = bs.BeautifulSoup(url, "lxml")
soup = bs.BeautifulSoup(url, "html5lib")
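Putting that fix into the original snippet, a minimal sketch (assuming the certificate table still carries the class table table-bordered) would be:
from urllib.request import urlopen
import bs4 as bs

url = urlopen("http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw")
soup = bs.BeautifulSoup(url, "lxml")  # lxml parses the whole table

certificates = soup.find('table', class_='table table-bordered')
for row in certificates.find_all('tr'):
    # Print every cell, so all rows show up rather than just the first one
    print([td.text.strip() for td in row.find_all('td')])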
