Learning with BeautifulSoup - Python

I'm trying to pull data from a website and have been reading and trying to learn for weeks. This is what I'm trying:
import requests
from bs4 import BeautifulSoup as Soup
req = requests.get('http://www.rushmore.tv/schedule')
soup = Soup(req.text, "html.parser")
soup.find('home-section-wrap center', id="section-home")
print soup.find
but it's returning something to do with Steam, which is completely random considering that nothing I am doing is related to Steam.
<bound method BeautifulSoup.find of \n<td class="listtable_1" height="16">\n\n 76561198134729239\n \n</td>>
What I'm trying to do is scrape a div by ID and print its contents. I'm extremely new to this. Cheers

Your print soup.find prints the bound method object itself (that is where the <bound method BeautifulSoup.find of ...> output comes from) because find is never actually called, and the first argument to find() should be a tag name rather than a class string. Use this:
import requests
from bs4 import BeautifulSoup
r = requests.get('http://www.rushmore.tv/schedule')
soup = BeautifulSoup(r.text, "html.parser")
for row in soup.find('ul', id='myUL').findAll('li'):
    print(row.text)
Partial Output:
10:30 - 13:30 Olympics: Women's Curling, Canada vs China (CA Coverage) - Channel 21
10:30 - 11:30 Olympics: Freestyle, Men's Half Pipe (US Coverage) - Channel 34
11:30 - 14:45 Olympics: BBC Coverage - Channel 92
11:30 - 19:30 Olympics: BBC Red Button Coverage - Channel 103
11:30 - 13:30 Olympics: Women's Curling, Great Britain vs Japan - Channel 105
13:00 - 15:30 Olympics: Men's Ice Hockey: Slovenia vs Norway - Channel 11
13:30 - 15:30 Olympics: Men's Ice Hockey: Slovenia vs Norway (JIP) - Channel 21
13:30 - 21:30 Olympics: DE Coverage - Channel 88
14:45 - 18:30 Olympics: BBC Coverage - Channel 91
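And if you still want the specific div from your original attempt, a minimal sketch (assuming the page still has a div with id="section-home"):
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get('http://www.rushmore.tv/schedule').text, 'html.parser')
section = soup.find('div', id='section-home')  # ids are unique, so searching by id alone is enough
if section is not None:
    print(section.get_text(strip=True))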

Try running the following code:
import urllib2
from bs4 import BeautifulSoup
quote_page = 'http://www.rushmore.tv/schedule'
def page_scrapper(quote_page):
    print(quote_page + ' is being processed... ')
    page = urllib2.urlopen(quote_page)  # let's open the page...
    soup = BeautifulSoup(page, 'html.parser')  # ...and now we parse it with the BSoup parser
    box = soup.find('ul', attrs={'id': 'myUL'})  # save the contents of the 'ul' tag with id myUL (it contains the schedule)
    print(box)  # and print it!
page_scrapper(quote_page)
This should do the trick.
EDIT - added some lines of code
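Note that urllib2 only exists on Python 2. On Python 3 the same approach works with urllib.request; a sketch of the equivalent code (not tested against the live page):
from urllib.request import urlopen
from bs4 import BeautifulSoup

quote_page = 'http://www.rushmore.tv/schedule'

def page_scrapper(quote_page):
    print(quote_page + ' is being processed... ')
    page = urlopen(quote_page)                   # open the page
    soup = BeautifulSoup(page, 'html.parser')    # parse it
    box = soup.find('ul', attrs={'id': 'myUL'})  # the ul with id 'myUL' holds the schedule
    print(box)

page_scrapper(quote_page)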


How do you web-scrape past a "show more" button using BeautifulSoup Python?

I am using BeautifulSoup in Python to scrape football statistics from this website: https://www.skysports.com/premier-league-results/2020-21. Yet the site only shows the first 200 games of the season, and the remaining 180 games are behind a "show more" button. The button does not change the URL, so I can't just replace the URL.
This is my code:
from bs4 import BeautifulSoup
import requests
scores_html_text = requests.get('https://www.skysports.com/premier-league-results/2020-21').text
scores_soup = BeautifulSoup(scores_html_text, 'lxml')
fixtures = scores_soup.find_all('div', class_ = 'fixres__item')
This only gets the first 200 fixtures.
How would I access the html past the show more button?
The hidden results are inside a <script> tag, so to get all 380 results you need to parse it additionally:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.skysports.com/premier-league-results/2020-21"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
script = soup.select_one('[type="text/show-more"]')
script.replace_with(BeautifulSoup(script.contents[0], "html.parser"))
all_data = []
for item in soup.select(".fixres__item"):
    all_data.append(item.get_text(strip=True, separator="|").split("|")[:5])
    all_data[-1].append(
        item.find_previous(class_="fixres__header2").get_text(strip=True)
    )
df = pd.DataFrame(
    all_data, columns=["Team 1", "Score 1", "Score 2", "Time", "Team 2", "Date"]
)
print(df)
df.to_csv("data.csv", index=False)
Prints:
Team 1 Score 1 Score 2 Time Team 2 Date
0 Arsenal 2 0 16:00 Brighton and Hove Albion Sunday 23rd May
1 Aston Villa 2 1 16:00 Chelsea Sunday 23rd May
2 Fulham 0 2 16:00 Newcastle United Sunday 23rd May
3 Leeds United 3 1 16:00 West Bromwich Albion Sunday 23rd May
...
377 Crystal Palace 1 0 15:00 Southampton Saturday 12th September
378 Liverpool 4 3 17:30 Leeds United Saturday 12th September
379 West Ham United 0 2 20:00 Newcastle United Saturday 12th September
and saves data.csv.
I am not aware of how to do this with BeautifulSoup, but this is how I would do it using Selenium (note that I am very new to Selenium, so there are probably better ways of doing this).
The imports used are:
from selenium import webdriver
import time
You will also need to download the Chrome webdriver (assuming that you are on Chrome), and place it in the same directory as your script, or in your library path.
There will be a cookies popup which you have to work around:
# prepare the driver
URL = "https://www.skysports.com/premier-league-results/2020-21"
driver = webdriver.Chrome()
driver.get(URL)
# wait so that driver has loaded before we look for the cookies popup
time.sleep(2)
# accept cookies popup, which occurs in an iframe
# begin by locating the iframe and switching into it
frame = driver.find_element_by_id('sp_message_iframe_533903')
driver.switch_to.frame(frame)
# find the accept button (inspect element and copy the XPath of the button) and click it
driver.find_element_by_xpath('//*[@id="notice"]/div[3]/button[1]').click()
time.sleep(2)
driver.refresh()
# find "show more text" button and click
driver.find_element_by_class_name("plus-more__text").click()
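Once the button has been clicked, the expanded page can be handed back to BeautifulSoup, continuing from the code above (a sketch; the class name is the one from your original snippet):
from bs4 import BeautifulSoup

time.sleep(2)  # give the extra fixtures a moment to render
scores_soup = BeautifulSoup(driver.page_source, 'lxml')
fixtures = scores_soup.find_all('div', class_='fixres__item')
print(len(fixtures))  # should now be all 380 fixtures rather than the first 200
driver.quit()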
I tried going up a few levels and this worked; you might need to process it a wee bit more.
from bs4 import BeautifulSoup
import requests
scores_html_text = requests.get('https://www.skysports.com/premier-league-results/2020-21').text
scores_soup = BeautifulSoup(scores_html_text,'lxml')
fixtures = scores_soup.find(class_ = 'site-layout-secondary block page-nav__offset grid')
print(fixtures)
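One way to do that extra processing, continuing from the code above: the hidden fixtures inside this container still sit in a <script type="text/show-more"> tag, so a sketch that reuses the re-parsing trick from the first answer could look like this:
# turn the hidden markup inside the container into ordinary tags
hidden = fixtures.select_one('[type="text/show-more"]')
if hidden is not None:
    hidden.replace_with(BeautifulSoup(hidden.contents[0], 'html.parser'))

# now every fixture is an ordinary div
for item in fixtures.find_all('div', class_='fixres__item'):
    print(item.get_text(strip=True, separator=' '))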

None output while parsing inside a class

I'm kinda new to Python and I'm trying to parse the website https://rustavi2.ge/ka/schedule using the following code. The content is in Georgian, but I don't think that matters.
When you open the page you will see the text 07:15 ანიმაცია "სონიკ ბუმი" up front. Via inspect I can see the element's tag and class, but the following code returns only None. I know I'm doing something terribly wrong but can't really figure it out.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://rustavi2.ge/ka/schedule')
c = r.content
soup = BeautifulSoup(c,'html.parser')
a = soup.find("div", {"class": "sch_cont"}).find("div",{"class": "bade_line"})
print((a).encode("utf-8"))
The data is loaded via Ajax from an external URL:
import requests
import urllib3
requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'ALL:@SECLEVEL=1'
from bs4 import BeautifulSoup
url = 'https://rustavi2.ge/includes/bade_ajax.php?dt=2020-08-17&lang=ka'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tm, title in zip(soup.select('.b_time'), soup.select('.b_title')):
    print(tm.text, title.text)
Prints:
07:15 ანიმაცია "სონიკ ბუმი"
08:00 მხ/ფილმი
10:00 კურიერი
10:15 სერიალი "ქალური ბედნიერება"
12:00 კურიერი
12:30 სერიალი "ქალური ბედნიერება"
13:55 სერიალი "მე ვიცი რა გელის"
15:00 კურიერი
15:50 დღის კურიერი
16:30 სერიალი "უცხო მშობლიურ მხარეში"
18:00 კურიერი
18:50 სერიალი "მარიამ მაგდალინელი"
20:30 ლოტო
20:40 სერიალი "მარიამ მაგდალინელი"
21:00 კურიერი
22:00 ფარული კონვერტი
23:00 გააცინე და მოიგე
00:00 სერიალი "სენდიტონი"
00:30 მხ/ფილმი
01:00 მხ/ფილმი
03:30 კურიერის დაიჯესტი
04:00 სერიალი "ქალური ბედნიერება"
05:00 სერიალი "უცხო მშობლიურ მხარეში"
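The dt parameter in that Ajax URL appears to select the date, so other days can be requested the same way. A sketch (only the dt value is changed; nothing else is assumed about the endpoint):
import requests
from bs4 import BeautifulSoup

requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = 'ALL:@SECLEVEL=1'  # same TLS workaround as above

def schedule_for(date):  # date formatted as 'YYYY-MM-DD'
    url = f'https://rustavi2.ge/includes/bade_ajax.php?dt={date}&lang=ka'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    return [(tm.text, title.text) for tm, title in zip(soup.select('.b_time'), soup.select('.b_title'))]

for tm, title in schedule_for('2020-08-18'):
    print(tm, title)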

Navigating Through HTML of a website using beautiful soup on python to select specific tag

I am trying to get my Python code to return the tag after product-card__title and product-card__price, so that it returns the name and price of each shoe.
I have tried running the code below, however I'm not getting exactly what I want.
import requests
from bs4 import BeautifulSoup
url = 'https://kith.com/collections/mens-footwear-sneakers'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
ex = soup.find('ul', {'class': 'collection-products'})
for i in ex.find_all('a'):
    print(i.text)
This is what is being returned
Nike Air Force 1 '07 LV8
NY vs NY
$110.00
And so on. I just want to be able to use soup.select to get the very specific tag after the class "product-card__title" or "product-card__price", for example the adidas x Pharrell Williams Boost Slide and the $100.
This script will print the title and price of the products on the page:
import requests
from bs4 import BeautifulSoup
url = 'https://kith.com/collections/mens-footwear-sneakers'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for i in soup.select('.product-card__information'):
    title = i.select_one('.product-card__title').get_text(strip=True)
    price = i.select_one('.product-card__price').get_text(strip=True)
    print(title)
    print(price)
    print('-' * 80)
Prints:
Nike Air Force 1 '07 LV8
$110.00
--------------------------------------------------------------------------------
Nike Daybreak SP
$110.00
--------------------------------------------------------------------------------
Nike Killshot OG SP
$90.00
--------------------------------------------------------------------------------
Puma Roma '68 R. Dassler Legacy
$110.00
--------------------------------------------------------------------------------
Puma Oslo-City R. Dassler Legacy
$120.00
--------------------------------------------------------------------------------
Puma Ralph Sampson Mid R. Dassler Legacy
$110.00
--------------------------------------------------------------------------------
Puma Ralph Sampson Lo R. Dassler Legacy
$100.00
--------------------------------------------------------------------------------
Puma Mirage OG R. Dassler Legacy
$100.00
--------------------------------------------------------------------------------
Puma Fast Rider R. Dassler Legacy
$100.00
--------------------------------------------------------------------------------
Y-3 Shiku Run
$350.00
--------------------------------------------------------------------------------
Y-3 Runner 4D
$500.00
--------------------------------------------------------------------------------
Y-3 Runner 4D
$500.00
--------------------------------------------------------------------------------
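If you specifically want a single soup.select per field, as asked, the two classes can also be selected directly and zipped. A sketch equivalent to the loop above (it relies on each card having exactly one title and one price, which is the case here):
titles = [t.get_text(strip=True) for t in soup.select('.product-card__title')]
prices = [p.get_text(strip=True) for p in soup.select('.product-card__price')]
for title, price in zip(titles, prices):
    print(title, price)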

How to scrape 2nd <div> tag of same class without unique distinctive mark

I am trying to read the content of the second div of this class:
<div class="eds-event-card-content__sub eds-text-bm eds-text-color--ui-600 eds-l-mar-top-1 eds-event-card-content__sub--cropped">Starts at RM15.75</div>
using Python 3.
<div class="eds-event-card-content__sub-content">
  <div class="eds-event-card-content__sub eds-text-bm eds-text-color--ui-600 eds-l-mar-top-1 eds-event-card-content__sub--cropped">
    <div class="card-text--truncated__one">Found8 KL Sentral • Kuala Lumpur, Kuala Lumpur</div>
  </div>
  <div class="eds-event-card-content__sub eds-text-bm eds-text-color--ui-600 eds-l-mar-top-1 eds-event-card-content__sub--cropped">Starts at RM15.75</div>
</div>
My python code:
from requests import get
from bs4 import BeautifulSoup
url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=2'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
# Select all the 20 event containers from a single page
event_containers = html_soup.find_all('div', class_='search-event-card-square-image')
# Getting price of ticket
price = container.find_all('div', class_= "eds-event-card-content__sub eds-text-bm eds-text-color--ui-600 eds-l-mar-top-1 eds-event-card-content__sub--cropped").text
print("price: ", price[1])
However, my code does not work;
it gives me the output:
IndexError: list index out of range
but I wanted
Starts at RM15.75
Can anyone help me with this? Thank you
I can't see the price anywhere in the HTML source code; I guess it is generated by a JS script.
So for this case you need to use Selenium.
Code:
# import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
from webdriver_manager.chrome import ChromeDriverManager
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
driver.set_window_size(1024, 600)
driver.maximize_window()
url = 'https://www.eventbrite.com/d/malaysia--kuala-lumpur--85675181/all-events/?page=2'
# response = requests.get(url)
driver.get(url)
time.sleep(4)
html_soup = BeautifulSoup(driver.page_source, 'html.parser')
# Select all the 20 event containers from a single page
event_containers = html_soup.find('ul', class_='search-main-content__events-list')
for event in event_containers.find_all('li'):
    event_time = event.find('div', class_="eds-text-color--primary-brand eds-l-pad-bot-1 eds-text-weight--heavy eds-text-bs").text
    event_name = event.find('div', class_="eds-event-card__formatted-name--is-clamped eds-event-card__formatted-name--is-clamped-three eds-text-weight--heavy").text
    event_price_place = event.find('div', class_="eds-event-card-content__sub-content")
    event_pp = event_price_place.find_all('div')
    event_place = event_pp[0].text
    try:
        event_price = event_pp[2].text
    except IndexError:
        event_price = None
    print(f"{event_name}\n{event_time}\n{event_place}\n{event_price}\n\n")
Result:
KL International Flea Market 2020 / Bazaar Antarabangsa Kuala Lumpur
Mon, Oct 5, 10:00 AM
VIVA Shopping Mall • Kuala Lumpur, Wilayah Persekutuan Kuala Lumpur
Free
FGTSD Physical Church Service
Sun, Jul 19, 9:30 AM + 105 more events
Full Gospel Tabernacle Sri Damansara • Kuala Lumpur
Free
EFE 2020 - 16th Export Furniture Exhibition Malaysia
Thu, Aug 27, 9:00 AM
Kuala Lumpur Convention Centre • Kuala Lumpur, Kuala Lumpur
Free
International Beauty Expo (IBE) 2020
Sat, Sep 12, 11:00 AM
Malaysia International Trade and Exhibition Centre • Kuala Lumpur, Wilayah Persekutuan Kuala Lumpur
Free
Learn How To Earn USD3500 In 4 Week Using Your SmartPhone
Today at 8:00 PM + 2 more events
KL Online Event • Kuala Lumpur, Bangkok
None
Turn Customers into Raving Fans of Your Brand via Equity Crowdfunding
Thu, Aug 27, 4:00 PM
Found8 KL Sentral • Kuala Lumpur, Kuala Lumpur
Starts at RM15.75
...
Edit:
I have added an option for making it headless.
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(ChromeDriverManager().install(), chrome_options=chrome_options)
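As a side note, instead of the fixed time.sleep(4) in the code above, an explicit wait is usually more reliable. A sketch using standard Selenium waits (the class name is the one already used above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 15 seconds for the events list to be present before parsing the page source
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'search-main-content__events-list'))
)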

Scraping pagination via "page=" midway in url

I'm trying to scrape data from this webpage, and all 900+ pages that follow: https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page=1&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True
It's important that the scraper does not target the pagination links, but rather iterates through the "page=" number in the URL. This is because the data is loaded dynamically on the original webpage, which the pagination links point back to.
I've tried writing something that loops through the page numbers in the URL, via the "last" class of the pagination ul, to find the final page, but I am not sure how to target that specific part of the URL while keeping the search query the same for each result:
r = requests.get(url_pagination)
soup = BeautifulSoup(r.content, "html.parser")
page_url = "https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page={}" + "&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True"
last_page = soup.find('ul', class_='pagination').find('li', class_='last').a['href'].split('=')[1]
dept_page_url = [page_url.format(i) for i in range(1, int(last_page)+1)]
print(dept_page_url)
I would ideally like to scrape just the name from class "secondaryTitle", and the 2nd unnamed div that contains the date, per row.
I keep getting an error: ValueError: invalid literal for int() with base 10: '2019-07-11&searchTerm'
You could try this script, but beware, it goes from page 1 all the way to last page 966:
import requests
from bs4 import BeautifulSoup
next_page_url = 'https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page=1&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True'
# this goes to page '966'
while True:
    print('Scrapping {} ...'.format(next_page_url))
    r = requests.get(next_page_url)
    soup = BeautifulSoup(r.content, "html.parser")
    for secondary_title, date in zip(soup.select('.secondaryTitle'), soup.select('.secondaryTitle + *')):
        print('{: >20} - {}'.format(date.get_text(strip=True), secondary_title.get_text(strip=True)))
    next_link = soup.select_one('a:has(span:contains(Next))')
    if next_link:
        next_page_url = 'https://hansard.parliament.uk' + next_link['href'] + '&partial=True'
    else:
        break
Prints:
Scrapping https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page=1&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True ...
17 January 2007 - Ian Pearson
21 December 2017 - Baroness Vere of Norbiton
2 May 2019 - Lord Parekh
4 February 2013 - Baroness Hanham
21 December 2017 - Baroness Walmsley
9 February 2010 - Colin Challen
6 February 2002 - Baroness Farrington of Ribbleton
24 April 2007 - Barry Gardiner
17 January 2007 - Rob Marris
7 March 2002 - The Parliamentary Under-Secretary of State, Department for Environment, Food and Rural Affairs (Lord Whitty)
27 October 1999 - Mr. Tom Brake (Carshalton and Wallington)
9 February 2004 - Baroness Miller of Chilthorne Domer
7 March 2002 - The Secretary of State for Environment, Food and Rural Affairs (Margaret Beckett)
27 February 2007 -
8 October 2008 - Baroness Andrews
24 March 2011 - Lord Henley
21 December 2017 - Lord Krebs
21 December 2017 - Baroness Young of Old Scone
16 June 2009 - Mark Lazarowicz
14 July 2006 - Lord Rooker
Scrapping https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&searchTerm=%22climate+change%22&startDate=1800-01-01&page=2&partial=True ...
12 October 2006 - Lord Barker of Battle
29 January 2009 - Lord Giddens
... and so on.
Your error is because you are using the wrong number from your split. You want -1. Observe:
last_page = soup.find('ul', class_='pagination').find('li', class_='last').a['href']
print(last_page)
print(last_page.split('=')[1])
print(last_page.split('=')[-1])
Gives:
/search/Contributions?endDate=2019-07-11&searchTerm=%22climate+change%22&startDate=1800-01-01&page=966
when split using index 1:
2019-07-11&searchTerm
versus index -1:
966
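A slightly more robust alternative is to pull page out of the query string rather than relying on its position; a sketch using only the standard library's urllib.parse:
from urllib.parse import urlparse, parse_qs

href = soup.find('ul', class_='pagination').find('li', class_='last').a['href']
last_page = int(parse_qs(urlparse(href).query)['page'][0])
print(last_page)  # 966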
To get the info you want from each page, I would do pretty much what the other answer does in terms of CSS selectors and zipping. Below are some other looping constructs, using a Session object for efficient connection re-use given the number of requests.
You could make an initial request and extract the number of pages, then loop over those:
import requests
from bs4 import BeautifulSoup as bs
def make_soup(s, page):
    page_url = "https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page={}&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True"
    r = s.get(page_url.format(page))
    soup = bs(r.content, 'lxml')
    return soup
with requests.Session() as s:
    soup = make_soup(s, 1)
    pages = int(soup.select_one('.last a')['href'].split('page=')[1])
    for page in range(2, pages + 1):
        soup = make_soup(s, page)
        # do something with soup
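Where "# do something with soup" could be, for example, the name and date extraction asked for in the question; a sketch reusing the selectors from the first answer:
    for secondary_title, date in zip(soup.select('.secondaryTitle'), soup.select('.secondaryTitle + *')):
        print(date.get_text(strip=True), '-', secondary_title.get_text(strip=True))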
Alternatively, you could loop until the last class ceases to appear:
import requests
from bs4 import BeautifulSoup as bs
present = True
page = 1
#results = {}
def make_soup(s, page):
    page_url = "https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page={}&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True"
    r = s.get(page_url.format(page))
    soup = bs(r.content, 'lxml')
    return soup
with requests.Session() as s:
    while present:
        soup = make_soup(s, page)
        present = len(soup.select('.last')) > 0
        #results[page] = soup.select_one('.pagination-total').text
        #extract info
        page += 1
