Python - super simple scraping with requests and bs4

I am trying to get the data from the main table from this page:
https://www.interactivebrokers.com/en/index.php?f=2222&exch=globex&showcategories=FUTGRP#productbuffer
I tried:
import requests
from bs4 import BeautifulSoup
address="https://www.interactivebrokers.com/en/index.php?f=2222&exch=globex&showcategories=FUTGRP#productbuffer"
r=requests.get(address)
soup=(r.text,"html_parser")
I know this is super basic, but somehow I'm blocked here.
I tried soup.find_all('table') but couldn't correctly identify the table I'm looking for (it seems to have no ID or distinguishing attribute).
I tried soup.find_all('tr'), and I can see the rows I'm looking for, but there are also some undesired rows in the result that I don't know how to separate out.
Can anyone help me with my first steps with bs4?

It seems the problem is that the data you want actually resides outside the table tag, in a separate tbody tag; the site has three of these.
So working code to grab the tds would look like this:
import requests
from bs4 import BeautifulSoup
url = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=globex&showcategories=FUTGRP#productbuffer'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
table = soup.find_all('tbody')[2]
trs = table.find_all('tr')
Then you just have to iterate over the trs to get the content you're after. The tds come back in a list of four elements, and you are after numbers 0, 2 and 3. Normally you could simply select those by index, but since number 1 is the only one containing the link class 'linkexternal', I filtered on that instead.
outfile = r'C:\output_file.txt'
with open(outfile, 'a', encoding='utf-8') as fd:
    for tr in trs:
        try:
            tds = tr.find_all('td')
            print_elements = ",".join([td.text for td in tds if 'linkexternal' not in str(td)])
            fd.write(print_elements + '\n')
        except:
            # some exception handling, perhaps logging
            pass
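If you'd rather pick the columns by position, as noted above you normally could, an index-based variant might look like this (a sketch assuming the four-td row layout described above; the CSV output file name is just an example):

import csv
import requests
from bs4 import BeautifulSoup

url = 'https://www.interactivebrokers.com/en/index.php?f=2222&exch=globex&showcategories=FUTGRP#productbuffer'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
trs = soup.find_all('tbody')[2].find_all('tr')

with open('output_file.csv', 'w', newline='', encoding='utf-8') as fd:
    writer = csv.writer(fd)
    for tr in trs:
        tds = tr.find_all('td')
        # Keep only rows with the expected four-column layout,
        # then take columns 0, 2 and 3.
        if len(tds) == 4:
            writer.writerow([tds[i].text.strip() for i in (0, 2, 3)])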

Related

How to get CData from html using beautiful soup

I am trying to get a value from a webpage. In the source code of the webpage, the data is in CDATA format and also comes from jQuery. I have managed to write the code below, which gets a large amount of text, where index 21 contains the information I need. However, this output is large and not in a format I understand. Within the output I need to isolate and print "redshift":"0.06", but I don't know how. What is the best way to solve this?
import requests
from bs4 import BeautifulSoup
link = "https://wis-tns.weizmann.ac.il/object/2020aclx"
html = requests.get(link).text
soup = BeautifulSoup(html, "html.parser")
res = soup.findAll('b')
print(soup.find_all('script')[21])
It can be done using the current approach you have. However, I'd advise against it. There's a neater way to do it by observing that the redshift value is present in a few convenient places on the page itself.
The following approach should work for you. It looks for tables on the page with the class "atreps-results-table" -- of which there are two. We take the second such table and look for the table cell with the class "cell-redshift". Then, we just print out its text content.
from bs4 import BeautifulSoup
import requests
link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
html = requests.get(link).text
soup = BeautifulSoup(html, 'html.parser')
tab = soup.find_all('table', {'class': 'atreps-results-table'})[1]
redshift = tab.find('td', {'class': 'cell-redshift'})
print(redshift.text)
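Running this prints the redshift value shown on the page (0.06 at the time the question was asked).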
Try simply:
soup.select_one('div.field-redshift > div.value>b').text
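In context, that one-liner amounts to something like this (a minimal sketch against the same page as above):

import requests
from bs4 import BeautifulSoup

link = 'https://wis-tns.weizmann.ac.il/object/2020aclx'
soup = BeautifulSoup(requests.get(link).text, 'html.parser')

# The redshift also appears in a plain field div on the page itself
print(soup.select_one('div.field-redshift > div.value>b').text)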
If you view the page source of the URL, you will find that there are two script elements containing CDATA, but the one you are interested in also has jQuery in it, so you have to select the script element based on that. After that, you need to do some cleaning to get rid of the CDATA tags and the jQuery wrapper. Then, with the help of the json library, convert the JSON data to a Python dictionary.
import requests
from bs4 import BeautifulSoup
import json

page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
scriptelements = htmlpage.find_all('script')
for script in scriptelements:
    if 'CDATA' in script.text and 'jQuery' in script.text:
        scriptcontent = script.text.replace('<!--//--><![CDATA[//>', '').replace('<!--', '').replace('//--><!]]>', '').replace('jQuery.extend(Drupal.settings,', '').replace(');', '')
        break
jsondata = json.loads(scriptcontent)
print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
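A slightly sturdier take on the same cleanup is to pull the JSON payload out with a regular expression keyed on the jQuery.extend(Drupal.settings, ...) wrapper described above, rather than chained replace() calls (a sketch under that assumption):

import json
import re
import requests
from bs4 import BeautifulSoup

page = requests.get('https://wis-tns.weizmann.ac.il/object/2020aclx')
htmlpage = BeautifulSoup(page.text, 'html.parser')
for script in htmlpage.find_all('script'):
    # Capture the object literal passed to jQuery.extend(Drupal.settings, ...)
    match = re.search(r'jQuery\.extend\(Drupal\.settings,\s*(\{.*\})\s*\);',
                      script.text, re.DOTALL)
    if match:
        jsondata = json.loads(match.group(1))
        print(jsondata['objectFlot']['plotMain1']['params']['redshift'])
        break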

Web Scraping - Extract list of text from multiple pages

I want to extract a list of names from multiple pages of a website.
The website has over 200 pages and I want to save all the names to a text file. I have written some code, but it's giving me an index error.
CODE:
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

#for page in range(1, 203):
page = 1

req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
row = soup.find('div', attrs={'class': 'row'})
books = row.find_all('a')
for book in books:
    data = book.find_all('b')[0].get_text()
    print(data)
OUTPUT:
Aabbaz
Aabid
Aabideen
Aabinus
Aadam
Aadeel
Aadil
Aadroop
Aafandi
Aafaq
Aaki
Aakif
Aalah
Aalam
Aalamgeer
Aalif
Traceback (most recent call last):
File "C:\Users\Mujtaba\Documents\names.py", line 15, in <module>
data = book.find_all('b')[0].get_text()
IndexError: list index out of range
The reason for the error is that some of the matched elements don't contain a <b> tag.
Try this code to request each page and save the data to a file:
import requests
from bs4 import BeautifulSoup as bs

MAIN_URL = "https://hamariweb.com/names/muslim/boy/"
URL = "https://hamariweb.com/names/muslim/boy/page-{}"

with open("output.txt", "a", encoding="utf-8") as f:
    for page in range(203):
        if page == 0:
            req = requests.get(MAIN_URL.format(page))
        else:
            req = requests.get(URL.format(page))
        soup = bs(req.text, "html.parser")
        print(f"page # {page}, Getting: {req.url}")
        book_name = (
            tag.get_text(strip=True)
            for tag in soup.select(
                "tr.bottom-divider:nth-of-type(n+2) td:nth-of-type(1)"
            )
        )
        f.seek(0)
        f.write("\n".join(book_name) + "\n")
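The :nth-of-type(n+2) part of the selector is what skips the table's header row: it matches only the second and subsequent tr elements, and td:nth-of-type(1) then takes the first cell (the name) from each remaining row.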
I suggest changing your parser to html5lib (pip install html5lib); I just think it handles this page better. Second, it's better not to call .find() on your soup object directly, since tags and classes can be duplicated elsewhere on the page, so you might end up finding data in an HTML tag where your data isn't even there. It's better to inspect the elements first and work out which block of the page holds the tags you want; scraping is easier that way, and it avoids more errors.

What I did here is inspect the elements first to find the block of the page holding your data, and I found that all the names you're trying to get are inside a div whose class is mb-40 content-box. Luckily that class is unique, and no other element shares the same tag and class, so we can .find() it directly.

The value of trs is then simply the tr tags inside that block. (Note that those <tr> tags are inside a <table> tag, but conveniently they are the only <tr> tags that exist, so there's no risk of another table with the same class getting in the way.) Those <tr> tags contain the names you want. The [1:] is there to start at index 1, so the table header on the website is not included.

Then just loop through those tr tags and get the text. As for why your error is happening: the IndexError simply means you're accessing a .find_all() result at an index that is out of bounds, which happens when no matching data was found. That in turn can happen when you call .find() directly on your soup variable, because sometimes there are tags with the same class values but different content inside. You expect to be scraping one particular part of the website, but you're actually scraping a different part, which is why you get no data and wonder why.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

#for page in range(1, 203):
page = 1

req = requests.get(URL + str(page))
soup = bs(req.content, 'html5lib')

div_container = soup.find('div', class_='mb-40 content-box')
trs = div_container.find_all("tr", class_="bottom-divider")[1:]
for tr in trs:
    text = tr.find("td").find("a").text
    print(text)
The IndexError you're getting means that in this case the element you found doesn't contain the b-tag with the information you are looking for.
You can simply wrap that piece of code in a try-except clause.
for book in books:
    try:
        data = book.find_all('b')[0].get_text()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass  # There was no element available
This will catch the error and move on without breaking the code.
Below I have also added some extra lines to save your titles to a text file.
Take a look at the inline comments.
import requests
from bs4 import BeautifulSoup as bs

URL = 'https://hamariweb.com/names/muslim/boy/page-'

# This is where your titles will be saved. Change as needed.
PATH = '/tmp/title_file.txt'

page = 1

req = requests.get(URL + str(page))
soup = bs(req.text, 'html.parser')
row = soup.find('div', attrs={'class': 'row'})
books = row.find_all('a')

# Here your titles will be stored before writing to file
all_titles = []

for book in books:
    try:
        # Add strip() to clean up the input
        data = book.find_all('b')[0].get_text().strip()
        print(data)
        # Add data to the all_titles list
        all_titles.append(data)
    except IndexError:
        pass  # There was no element available

# Open path to write
with open(PATH, 'w') as f:
    # Write all titles, one per line
    f.write('\n'.join(all_titles))

How can I scrape the contents inside the 'sorting_1' class with Python?

I've been given a project to make a covid tracker. I decided to scrape some elements from the site (https://www.worldometers.info/coronavirus/). I'm very new to Python, so I decided to go with BeautifulSoup. I was able to scrape the basic elements, like total cases, active cases and so on. However, whenever I try to grab the country names or their numbers, it returns an empty list, even though the class 'sorting_1' exists on the page. Could someone tell me where I'm going wrong?
This is something which I am trying to grab:
<td style="font-weight: bold; text-align:right" class="sorting_1">4,918,420</td>
Here is my current code:
import requests
import bs4
#making a request and a soup
res = requests.get('https://www.worldometers.info/coronavirus/')
soup = bs4.BeautifulSoup(res.text, 'lxml')
#scraping starts here
total_cases = soup.select('.maincounter-number')[0].text
total_deaths = soup.select('.maincounter-number')[1].text
total_recovered = soup.select('.maincounter-number')[2].text
active_cases = soup.select('.number-table-main')[0].text
country_cases = soup.find_all('td', {'class': 'sorting_1'})
You can't get the sorting_1 class because it is not present in the page source.
Instead, find all the rows of the table and then read the information from the required columns.
So, to get the total cases for each country, you can use the following code:
import requests
import bs4

res = requests.get('https://www.worldometers.info/coronavirus/')
soup = bs4.BeautifulSoup(res.text, 'lxml')

rows = soup.select('table#main_table_countries_today tr')
for row in rows[8:18]:
    tds = row.find_all('td')
    print(tds[1].text.strip(), '=', tds[2].text.strip())
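(The rows[8:18] slice is there because the first several tr elements of that table are the header and the world/continent total rows; starting at index 8 appears to skip those, and stopping at 18 prints the first ten countries. Widen the slice to cover the whole table.)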
Welcome to SO!
Looking at their website, it seems that the sorting_X classes are added by JavaScript, so they don't exist in the raw HTML.
The table does exist, however, so I'd advise looping over the table rows, similar to this:
table_rows = soup.find("table", id="main_table_countries_today").find_all("tr")
for row in table_rows:
    name = "unknown"
    # Find country name
    for td in row.find_all("td"):
        if td.find("a", class_="mt_a"):  # This kind of link apparently only exists in the "name" column
            name = td.find("a").text
    # Do some more scraping
Warning: I haven't worked with soup for a while, so this may not be 100% correct, but you get the idea.

python bs4 scrape table gets wrong results

I am trying to scrape this site: http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw and get the table at the bottom. When I try to scrape it, I get some elements of the first row but nothing from the rest of the table. Here is my code:
import bs4 as bs
from urllib2 import urlopen

urlText = "http://stcw.marina.gov.ph/find/?c_n=14-111112&opt=stcw"
url = urlopen(urlText)
soup = bs.BeautifulSoup(url, "html.parser")

certificates = soup.find('table', class_='table table-bordered')
for row in certificates.find_all('tr'):
    for td in row.find_all('td'):
        print td.text
What I get as an output is:
22-20353
SHIP SECURITY OFFICER
Rather than the whole table.
What am I missing?
It is yet another case where the underlying parser makes a difference. Switch to lxml or html5lib to see the complete table parsed:
soup = bs.BeautifulSoup(url, "lxml")
soup = bs.BeautifulSoup(url, "html5lib")

Unable to iterate over the "tr" element of a table using beautiful soup

from bs4 import BeautifulSoup
import re
import urllib2

url = 'http://sports.yahoo.com/nfl/players/5228/gamelog'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page)

table = soup.find(id='player-game_log-season').find('tbody').find_all('tr')
for rows in tr:
    data = raws.find_all("td")
    print data
I'm trying to go through the table of a certain player's stats from last year and grab them; however, I get an AttributeError: 'NoneType' object has no attribute 'find_all' when I try to run this code. I'm new to Beautiful Soup, so I'm not really sure what the problem is.
Also, if anyone has any good tutorials to recommend, that would be awesome. Reading through the documentation is sort of confusing, as I am fairly new to programming.
There's no tbody in the table under div#player-game_log-season. And your code has some typos.
raws -> rows
table -> tr
...
tr = soup.find(id='player-game_log-season').find_all('tr')
for rows in tr:
    data = rows.find_all("td")
    print data
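For anyone on Python 3, the same fix might look like this (a sketch: urllib2 became urllib.request, print is a function, and an explicit parser is passed to silence the "no parser specified" warning):

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://sports.yahoo.com/nfl/players/5228/gamelog'
page = urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

# There is no tbody under div#player-game_log-season, so grab the tr tags directly
tr = soup.find(id='player-game_log-season').find_all('tr')
for row in tr:
    data = row.find_all('td')
    print(data)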
