I'm trying to scrape data from this webpage, and the 900+ pages that follow: https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page=1&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True
It's important that the scraper does not follow the pagination links, but instead iterates through the "page=" number in the URL. This is because the data is loaded dynamically in the original webpage, which the pagination links point back to.
I've tried writing something that loops through the page numbers in the URL, using the "last" class of the pagination ul to find the final page, but I am not sure how to target that specific part of the URL while keeping the search query the same for each request:
import requests
from bs4 import BeautifulSoup

# starting page (page=1) of the search results
url_pagination = "https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page=1&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True"

r = requests.get(url_pagination)
soup = BeautifulSoup(r.content, "html.parser")

page_url = "https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page={}" + "&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True"

last_page = soup.find('ul', class_='pagination').find('li', class_='last').a['href'].split('=')[1]
dept_page_url = [page_url.format(i) for i in range(1, int(last_page) + 1)]
print(dept_page_url)
I would ideally like to scrape just the name from the class "secondaryTitle", plus the date held in the 2nd unnamed div of each row.
I keep getting an error: ValueError: invalid literal for int() with base 10: '2019-07-11&searchTerm'
You could try this script, but beware, it goes from page 1 all the way to last page 966:
import requests
from bs4 import BeautifulSoup
next_page_url = 'https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page=1&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True'
# this goes to page '966'
while True:
    print('Scraping {} ...'.format(next_page_url))
    r = requests.get(next_page_url)
    soup = BeautifulSoup(r.content, "html.parser")

    # each .secondaryTitle holds the name; the element right after it holds the date
    for secondary_title, date in zip(soup.select('.secondaryTitle'), soup.select('.secondaryTitle + *')):
        print('{: >20} - {}'.format(date.get_text(strip=True), secondary_title.get_text(strip=True)))

    # follow the "Next" link until it disappears (newer soupsieve versions may need ':-soup-contains(Next)')
    next_link = soup.select_one('a:has(span:contains(Next))')
    if next_link:
        next_page_url = 'https://hansard.parliament.uk' + next_link['href'] + '&partial=True'
    else:
        break
Prints:
Scraping https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page=1&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True ...
17 January 2007 - Ian Pearson
21 December 2017 - Baroness Vere of Norbiton
2 May 2019 - Lord Parekh
4 February 2013 - Baroness Hanham
21 December 2017 - Baroness Walmsley
9 February 2010 - Colin Challen
6 February 2002 - Baroness Farrington of Ribbleton
24 April 2007 - Barry Gardiner
17 January 2007 - Rob Marris
7 March 2002 - The Parliamentary Under-Secretary of State, Department for Environment, Food and Rural Affairs (Lord Whitty)
27 October 1999 - Mr. Tom Brake (Carshalton and Wallington)
9 February 2004 - Baroness Miller of Chilthorne Domer
7 March 2002 - The Secretary of State for Environment, Food and Rural Affairs (Margaret Beckett)
27 February 2007 -
8 October 2008 - Baroness Andrews
24 March 2011 - Lord Henley
21 December 2017 - Lord Krebs
21 December 2017 - Baroness Young of Old Scone
16 June 2009 - Mark Lazarowicz
14 July 2006 - Lord Rooker
Scraping https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&searchTerm=%22climate+change%22&startDate=1800-01-01&page=2&partial=True ...
12 October 2006 - Lord Barker of Battle
29 January 2009 - Lord Giddens
... and so on.
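If you want to keep what it finds rather than just print it, you could collect each (date, name) pair in a list and write it out with the csv module when the loop ends; a rough sketch (the filename is just an example):
import csv

rows = []
# inside the while-loop, instead of the print, do:
#     rows.append([date.get_text(strip=True), secondary_title.get_text(strip=True)])

with open('contributions.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['date', 'name'])
    writer.writerows(rows)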
Your error is because you are taking the wrong element from your split. You want index -1. Observe:
last_page = soup.find('ul', class_='pagination').find('li', class_='last').a['href']
print(last_page)
print(last_page.split('=')[1])
print(last_page.split('=')[-1])
Gives:
/search/Contributions?endDate=2019-07-11&searchTerm=%22climate+change%22&startDate=1800-01-01&page=966
When split and indexed with 1:
2019-07-11&searchTerm
versus index -1:
966
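Splitting on '=' is also somewhat brittle, since the page number is not the only value in the query string. A more robust sketch would parse the query string instead of splitting:
from urllib.parse import urlparse, parse_qs

href = '/search/Contributions?endDate=2019-07-11&searchTerm=%22climate+change%22&startDate=1800-01-01&page=966'

# parse_qs returns a dict of query parameters, so the position of "page" no longer matters
params = parse_qs(urlparse(href).query)
last_page = int(params['page'][0])
print(last_page)  # 966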
To get the info you want from each page, I would do pretty much what the other answer does in terms of CSS selectors and zipping. Below are a couple of other looping constructs, using a Session for efficiency given the number of requests.
You could make an initial request, extract the number of pages, and then loop over that range. The Session object gives you connection re-use.
import requests
from bs4 import BeautifulSoup as bs
def make_soup(s, page):
    page_url = "https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page={}&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True"
    r = s.get(page_url.format(page))
    soup = bs(r.content, 'lxml')
    return soup

with requests.Session() as s:
    soup = make_soup(s, 1)
    pages = int(soup.select_one('.last a')['href'].split('page=')[1])
    for page in range(2, pages + 1):
        soup = make_soup(s, page)
        #do something with soup
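As a sketch of what the "do something with soup" step might look like here, assuming you want the same name and date fields as above, the adjacent-sibling selector from the other answer works:
def extract_rows(soup):
    # each .secondaryTitle element holds the name; the element immediately after it holds the date
    for title, date in zip(soup.select('.secondaryTitle'), soup.select('.secondaryTitle + *')):
        yield title.get_text(strip=True), date.get_text(strip=True)
Calling list(extract_rows(soup)) inside the loop would then give you the (name, date) pairs for that page.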
Alternatively, you could loop until the "last" class ceases to appear:
import requests
from bs4 import BeautifulSoup as bs
present = True
page = 1
#results = {}
def make_soup(s, page):
    page_url = "https://hansard.parliament.uk/search/Contributions?endDate=2019-07-11&page={}&searchTerm=%22climate+change%22&startDate=1800-01-01&partial=True"
    r = s.get(page_url.format(page))
    soup = bs(r.content, 'lxml')
    return soup

with requests.Session() as s:
    while present:
        soup = make_soup(s, page)
        present = len(soup.select('.last')) > 0
        #results[page] = soup.select_one('.pagination-total').text
        #extract info
        page += 1
I want to crawl this website: http://www.truellikon.ch/freizeit-kultur/anlaesse-agenda.html
I want to extract the date and time of each event.
You can see that the date is listed above the events. In order to extract the date and time I need to combine different divs, but the problem is that there is no container grouping the events that fall on the same date.
So the only thing I can do is extract all the events that sit between two divs that refer to a date.
This is the code for extracting the event info:
from bs4 import BeautifulSoup
import requests
domain = 'truellikon.ch'
url = 'http://www.truellikon.ch/freizeit-kultur/anlaesse-agenda.html'
def get_website_news_links_truellikonCh():
    response = requests.get(url, allow_redirects=True)
    print("Response for", url, response)
    soup = BeautifulSoup(response.content, 'html.parser')
    all_events = soup.select('div.eventItem')
    for i in all_events:
        print(i)
        print()
        input()

x = get_website_news_links_truellikonCh()
The class name for the date is 'listThumbnailMonthName'.
My question is: how can I combine these divs, and how can I write the selectors so that I get the exact date and time, title and body of each event?
You have one parent container, #tx_nezzoagenda_list, and then you have to read its children one by one:
import re
from bs4 import BeautifulSoup
import requests
url = 'http://www.truellikon.ch/freizeit-kultur/anlaesse-agenda.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
container = soup.select_one('#tx_nezzoagenda_list')
for child in container.children:
    if not child.name:
        continue
    # month-name divs and event divs are siblings, so remember the most recent date heading
    if 'listThumbnailMonthName' in (child.get('class') or []):
        base_date = child.text.strip()
    else:
        day = child.select_one('.dateDayNumber').text.strip()
        title = child.select_one('.titleText').text.strip()
        locationDate = child.select_one('.locationDateText').children
        time = list(locationDate)[-1].strip()
        time = re.sub(r'\s', '', time)
        print(title, day, base_date, time)
which outputs
Abendunterhaltung TV Trüllikon 10 Dezember 2021 19:00Uhr-3:00Uhr
Christbaum-Verkauf 18 Dezember 2021 9:30Uhr-11:00Uhr
Silvester Party 31 Dezember 2021 22:00Uhr
Neujahrsapéro 02 Januar 2022 16:00Uhr-18:00Uhr
Senioren-Zmittag 21 Januar 2022 12:00Uhr-15:00Uhr
Theatergruppe "Nume Hüür", Aufführung 23 Januar 2022 13:00Uhr-16:00Uhr
Elektroschrottsammlung 29 Januar 2022 9:00Uhr-12:00Uhr
Senioren Z'mittag 18 Februar 2022 12:00Uhr-15:00Uhr
Frühlingskonzert 10 April 2022 12:17Uhr
Weinländer Musiktag 22 Mai 2022 8:00Uhr
Auffahrtskonzert Altersheim 26 Mai 2022 10:30Uhr
Feierabendmusik und Jubilarenehrung 01 Juli 2022 19:00Uhr
Feierabendmusik 15 Juli 2022 12:24Uhr
Feierabendmusik 19 August 2022 19:00Uhr
Herbstanlass 19 November 2022 20:00Uhr
I am looking to use Beautiful Soup to scrape the Fujitsu news update page: https://www.fujitsu.com/uk/news/pr/2020/
I only want to extract the information under the headings of the current month and previous month.
For a particular month (e.g. November), I am trying to extract into a list, for each news briefing (so a list of lists):
- the Title
- the URL
- the text
My attempt so far is as follows (showing only the previous month for simplicity):
import calendar
import datetime
import requests
from bs4 import BeautifulSoup

today = datetime.datetime.today()
year_str = str(today.year)
current_m = today.month
previous_m = current_m - 1
current_m_str = calendar.month_name[current_m]
previous_m_str = calendar.month_name[previous_m]

URL = 'https://www.fujitsu.com/uk/news/pr/' + year_str + '/'
resp = requests.get(URL)
soup = BeautifulSoup(resp.text, 'lxml')

previous_m_body = soup.find('h3', text=previous_m_str)
if previous_m_body is not None:
    for sib in previous_m_body.find_next_siblings():
        if sib.name == "h3":
            break
        else:
            previous_m_text = str(sib.text)
            print(previous_m_text)
However, this generates one long string with newlines and no separation between title, text and URL:
Fujitsu signs major contract with Scottish Government to deliver election e-Counting solution London, United Kingdom, November 30, 2020 - Fujitsu, a leading digital transformation company, has today announced a major contract with the Scottish Government and Scottish Local...
Fujitsu Introduces Ultra-Compact, 50A PCB Relay for Medium-to-Heavy Automotive Loads Hoofddorp, EMEA, November 11, 2020 - Fujitsu Components Europe has expanded its automotive relay offering with a new 12VDC PCB relay featuring.......
I have attached an image of the page DOM.
Try this:
import requests
from bs4 import BeautifulSoup

html = requests.get("https://www.fujitsu.com/uk/news/pr/2020/").text
all_lists = BeautifulSoup(html, "html.parser").find_all("ul", class_="filterlist")

news = []
for unordered_list in all_lists:
    for list_item in unordered_list.find_all("li"):
        news.append(
            [
                list_item.find("a").getText(),
                f"https://www.fujitsu.com{list_item.find('a')['href']}",
                list_item.getText(strip=True)[len(list_item.find("a").getText()):],
            ]
        )

for news_item in news:
    print("\n".join(news_item))
    print("-" * 80)
Output (shortened for brevity):
Fujitsu signs major contract with Scottish Government to deliver election e-Counting solution
https://www.fujitsu.com/uk/news/pr/2020/fs-20201130.html
London, United Kingdom, November 30, 2020- Fujitsu, a leading digital transformation company, has today announced a major contract with the Scottish Government and Scottish Local Authorities to support the electronic counting (e-Counting) of ballot papers at the Scottish Local Government elections in May 2022.Fujitsu Introduces Ultra-Compact, 50A PCB Relay for Medium-to-Heavy Automotive LoadsHoofddorp, EMEA, November 11, 2020- Fujitsu Components Europe has expanded its automotive relay offering with a new 12VDC PCB relay featuring a switching capacity of 50A at 14VDC. The FBR53-HC offers a higher contact rating than its 40A FBR53-HW counterpart, yet occupies the same 12.1 x 15.5 x 13.7mm footprint and weighs the same 6g.
--------------------------------------------------------------------------------
and more ...
EDIT:
To get just the last two months, all you need is the first two ul items from the soup. So, add [:2] to the first for loop, like this:
for unordered_list in all_lists[:2]:
# the rest of the loop body goes here
Here I have modified your code, combining your bs4 code with Selenium. Selenium is very powerful for scraping dynamic or JavaScript-heavy websites, and you can use it together with BeautifulSoup to make your life easier. This version gives you output for all months.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.maximize_window()
url = "https://www.fujitsu.com/uk/news/pr/2020/"  # change the url if you want results for a different year
driver.get(url)

# now your bs4 code starts; it gives output from the current month back through all previous months
soup = BeautifulSoup(driver.page_source, "html.parser")

# here I am getting every month name, from January to November
months = soup.find_all('h3')
for month in months:
    month = month.text
    print(f"month_name : {month}\n")

# here we are getting all description text, from the current month back through all previous months
description_texts = soup.find_all('ul', class_='filterlist')
for description_text in description_texts:
    description_texts = description_text.text.replace('\n', '')
    print(f"description_text: {description_texts}")
output:
I wrote some code to get all the title URLs, but I have an issue: it displays None values. Could you please help me out?
Here is my code:
import requests
from bs4 import BeautifulSoup
import csv
def get_page(url):
    response = requests.get(url)
    if not response.ok:
        print('server responded:', response.status_code)
    else:
        soup = BeautifulSoup(response.text, 'html.parser')  # 1. html , 2. parser
        return soup

def get_index_data(soup):
    try:
        titles_link = soup.find_all('div', class_="marginTopTextAdjuster")
    except:
        titles_link = []
    urls = [item.get('href') for item in titles_link]
    print(urls)

def main():
    #url = "http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1"
    mainurl = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
    #get_page(url)
    get_index_data(get_page(mainurl))
    #write_csv(data,url)

if __name__ == '__main__':
    main()
You are trying to get the href attribute of the div tag. Instead, try selecting all the a tags. They seem to have a common class attribute, body_link_11.
Use titles_link = soup.find_all('a',class_="body_link_11") instead of titles_link = soup.find_all('div',class_="marginTopTextAdjuster")
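For example, a sketch of get_index_data with just that change applied (everything else left as in your code):
def get_index_data(soup):
    # the <a> tags carry the href; the wrapping divs do not
    titles_link = soup.find_all('a', class_="body_link_11")
    urls = [item.get('href') for item in titles_link]
    print(urls)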
url = "http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1"
req = requests.get(url)
soup = BeautifulSoup(req.text, "lxml")
titles_link = []
titles_div = soup.find_all('div', attrs={'class': 'marginTopTextAdjuster'})
for link in titles_div:
tag = link.find_all('a', href=True)
try:
if tag[0].attrs.get('item_id', None):
titles_link.append({tag[0].text: tag[0].attrs.get('href', None)})
except IndexError:
continue
print(titles_link)
output:
[{'Civil Affairs Handbook, Japan, section 1a: population statistics.': '/cdm/singleitem/collection/p4013coll8/id/2653/rec/1'}, {'Army Air Forces Program 1943.': '/cdm/singleitem/collection/p4013coll8/id/2385/rec/2'}, {'Casualty report number II.': '/cdm/singleitem/collection/p4013coll8/id/3309/rec/3'}, {'Light armored division, proposed March 1943.': '/cdm/singleitem/collection/p4013coll8/id/2425/rec/4'}, {'Tentative troop list by type units for Blacklist operations.': '/cdm/singleitem/collection/p4013coll8/id/150/rec/5'}, {'Chemical Warfare Service: history of training, part 2, schooling of commissioned officers.': '/cdm/compoundobject/collection/p4013coll8/id/2501/rec/6'}, {'Horses in the German Army (1941-1945).': '/cdm/compoundobject/collection/p4013coll8/id/2495/rec/7'}, {'Unit history: 38 (MECZ) cavalry rcn. sq.': '/cdm/singleitem/collection/p4013coll8/id/3672/rec/8'}, {'Operations in France: December 1944, 714th Tank Battalion.': '/cdm/singleitem/collection/p4013coll8/id/3407/rec/9'}, {'G-3 Reports : Third Infantry Division. (22 Jan- 30 Mar 44)': '/cdm/singleitem/collection/p4013coll8/id/4393/rec/10'}, {'Summary of operations, 1 July thru 31 July 1944.': '/cdm/singleitem/collection/p4013coll8/id/3445/rec/11'}, {'After action report 36th Armored Infantry Regiment, 3rd Armored Division, Nov 1944 thru April 1945.': '/cdm/singleitem/collection/p4013coll8/id/3668/rec/12'}, {'Unit history, 38th Mechanized Cavalry Reconnaissance Squadron, 9604 thru 9665.': '/cdm/singleitem/collection/p4013coll8/id/3703/rec/13'}, {'Redeployment: occupation forces in Europe series, 1945-1946.': '/cdm/singleitem/collection/p4013coll8/id/2952/rec/14'}, {'Twelfth US Army group directives. Annex no. 1.': '/cdm/singleitem/collection/p4013coll8/id/2898/rec/15'}, {'After action report, 749th Tank Battalion: Jan, Feb, Apr - 8 May 45.': '/cdm/singleitem/collection/p4013coll8/id/3502/rec/16'}, {'743rd Tank Battalion, S3 journal history.': '/cdm/singleitem/collection/p4013coll8/id/3553/rec/17'}, {'History of military training, WAAC / WAC training.': '/cdm/singleitem/collection/p4013coll8/id/4052/rec/18'}, {'After action report, 756th Tank Battalion.': '/cdm/singleitem/collection/p4013coll8/id/3440/rec/19'}, {'After action report 92nd Cavalry Recon Squadron Mechanized 12th Armored Division, Jan thru May 45.': '/cdm/singleitem/collection/p4013coll8/id/3583/rec/20'}]
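The hrefs above are relative paths; if you want absolute URLs, one option is urljoin, e.g. this sketch:
from urllib.parse import urljoin

base = "http://cgsc.cdmhost.com"
# rebuild each {title: href} dict with the href resolved against the site root
absolute_links = [{title: urljoin(base, href) for title, href in d.items()} for d in titles_link]
print(absolute_links)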
An easy way to do it with requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
req = requests.get(url) # url stands for the page's url you want to find
soup = BeautifulSoup(req.text, "html.parser") # req.text is the complete html of the page
print(soup.title.string) # soup.title will give you the title of the page but with the <title> tags so .string removes them
Try this.
from simplified_scrapy import SimplifiedDoc,req,utils
url = 'http://cgsc.cdmhost.com/cdm/search/collection/p4013coll8/searchterm/1/field/all/mode/all/conn/and/order/nosort/page/1'
html = req.get(url)
doc = SimplifiedDoc(html)
lst = doc.selects('div.marginTopTextAdjuster').select('a')
titles_link = [(utils.absoluteUrl(url,a.href),a.text) for a in lst if a]
print (titles_link)
Result:
[('http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2653/rec/1', 'Civil Affairs Handbook, Japan, section 1a: population statistics.'), ('http://cgsc.cdmhost.com/cdm/landingpage/collection/p4013coll8', 'World War II Operational Documents'), ('http://cgsc.cdmhost.com/cdm/singleitem/collection/p4013coll8/id/2385/rec/2', 'Army Air Forces Program 1943.'),...
I am trying to scrape the New York Times search page. My code runs fine (it exits with code 0) but gives no results.
import time
import requests
from bs4 import BeautifulSoup
url = 'https://www.nytimes.com/search?endDate=20190331&query=cybersecurity&sort=newest&startDate=20180401={}'
pages = [0]
for page in pages:
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    for item in soup.select("#search-results li > a"):
        resp = requests.get(item.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        date = sauce.select(".css-1vkm6nb ehdk2mb0 h1")
        date = date.text
        print(date)
        time.sleep(3)
with this code, I am hoping to get the publish date from each article.
Nice attempt--you're pretty close. The problem is the selectors:
#search-results asks for an id that doesn't exist. The element is an <ol data-testid="search-results">, so we'll need other means to grab this anchor tag.
.css-1vkm6nb ehdk2mb0 h1 doesn't make much sense. It asks for an h1 element inside an ehdk2mb0 element, which is in turn inside an element with the class .css-1vkm6nb. What's actually on the page is an <h1 class="css-1vkm6nb ehdk2mb0"> element. Select this with h1.css-1vkm6nb.ehdk2mb0.
Having said that, this is not the time data you're after--it's the title. We can get the time element (<time>) with a simple sauce.find("time").
Full example:
import requests
from bs4 import BeautifulSoup
base = "https://www.nytimes.com"
url = "https://www.nytimes.com/search?endDate=20190331&query=cybersecurity&sort=newest&startDate=20180401={}"
pages = [0]
for page in pages:
    res = requests.get(url.format(page))
    soup = BeautifulSoup(res.text, "lxml")
    for link in soup.select(".css-138we14 a"):
        resp = requests.get(base + link.get("href"))
        sauce = BeautifulSoup(resp.text, "lxml")
        title = sauce.select_one("h1.css-1j5ig2m.e1h9rw200")
        time = sauce.find("time")
        print(time.text, title.text.encode("utf-8"))
Output:
March 30, 2019 b'Bezos\xe2\x80\x99 Security Consultant Accuses Saudis of Hacking the Amazon C.E.O.\xe2\x80\x99s Phone'
March 29, 2019 b'In Ukraine, Russia Tests a New Facebook Tactic in Election Tampering'
March 28, 2019 b'Huawei Shrugs Off U.S. Clampdown With a $100 Billion Year'
March 28, 2019 b'N.S.A. Contractor Arrested in Biggest Breach of U.S. Secrets Pleads Guilty'
March 28, 2019 b'Grindr Is Owned by a Chinese Firm, and the U.S. Is Trying to Force It to Sell'
March 28, 2019 b'DealBook Briefing: Saudi Arabia Wanted Cash. Aramco Just Obliged.'
March 28, 2019 b'Huawei Security \xe2\x80\x98Defects\xe2\x80\x99 Are Found by British Authorities'
March 25, 2019 b'As Special Counsel, Mueller Kept Such a Low Profile He Seemed Almost Invisible'
March 21, 2019 b'Quotation of the Day: In New Age of Digital Warfare, Spies for Any Nation\xe2\x80\x99s Budget'
March 21, 2019 b'Coast Guard\xe2\x80\x99s Top Officer Pledges \xe2\x80\x98Dedicated Campaign\xe2\x80\x99 to Improve Diversity'
I am trying to collect the event date, time and venue. They come out successfully, but the output is not reader-friendly. How do I get the date, time and venue to appear separately, like this:
- event
Date:
Time:
Venue:
- event
Date:
Time:
Venue:
I was thinking of splitting, but I ended up with lots of [ ] which made it look even uglier. I also tried stripping with a regular expression, but it does not appear to do anything. Any suggestions?
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = urlopen(url_toscrape)
info_type = response.info()
responseData = response.read()
soup = BeautifulSoup(responseData, 'lxml')

events_absFirst = soup.find_all("div", {"class": "ntu_event_summary_title_first"})
date_absAll = tr.find_all("div", {"class": "ntu_event_summary_date"})
events_absAll = tr.find_all("div", {"class": "ntu_event_summary_title"})
for first in events_absFirst:
    print('-', first.text.strip())
    print(' ', date)

for tr in soup.find_all("div", {"class": "ntu_event_detail"}):
    date_absAll = tr.find_all("div", {"class": "ntu_event_summary_date"})
    events_absAll = tr.find_all("div", {"class": "ntu_event_summary_title"})
    for events in events_absAll:
        events = events.text.strip()
    for date in date_absAll:
        date = date.text.strip('^Time.*')
    print('-', events)
    print(' ', date)
You can iterate over the divs containing the event information, store the results, and then print each:
import requests, re
from bs4 import BeautifulSoup as soup
d = soup(requests.get('https://www.ntu.edu.sg/events/Pages/default.aspx').text, 'html.parser')

results = [
    [getattr(i.find('div', {'class': re.compile('ntu_event_summary_title_first|ntu_event_summary_title')}), 'text', 'N/A'),
     getattr(i.find('div', {'class': 'ntu_event_summary_detail'}), 'text', 'N/A')]
    for i in d.find_all('div', {'class': 'ntu_event_articles'})
]

new_results = [
    [a, re.findall(r'Date : .*?(?=\sTime)|Time : .*?(?=Venue)|Time : .*?(?=$)|Venue: [\w\W]+', b)]
    for a, b in results
]

print('\n\n'.join(
    '-{}\n{}'.format(a, '\n'.join(f' {h}:{i}' for h, i in zip(['Date', 'Time', 'Venue'], b)))
    for a, b in new_results
))
Output:
-7th ASEF Rectors' Conference and Students' Forum (ARC7)
Date:Date : 29 Nov 2018 to 14 May 2019
Time:Time : 9:00am to 5:00pm
-Be a Youth Corps Leader
Date:Date : 1 Dec 2018 to 31 Mar 2019
Time:Time : 9:00am to 5:00pm
-NIE Visiting Artist Programme January 2019
Date:Date : 14 Jan 2019 to 11 Apr 2019
Time:Time : 9:00am to 8:00pm
Venue:Venue: NIE Art gallery
-Exercise Classes for You: Healthy Campus#NTU
Date:Date : 21 Jan 2019 to 18 Apr 2019
Time:Time : 6:00pm to 7:00pm
Venue:Venue: The Wave # Sports & Recreation Centre
-[eLearning Course] Information & Media Literacy (From January 2019)
Date:Date : 23 Jan 2019 to 31 May 2019
Time:Time : 9:00am to 5:00pm
Venue:Venue: NTULearn
...
You could use requests and test the length of stripped_strings
import requests
from bs4 import BeautifulSoup
import pandas as pd
url_toscrape = "https://www.ntu.edu.sg/events/Pages/default.aspx"
response = requests.get(url_toscrape)
soup = BeautifulSoup(response.content, 'lxml')
events = [item.text for item in soup.select("[class^='ntu_event_summary_title']")]
data = soup.select('.ntu_event_summary_date')
dates = []
times = []
venues = []
for item in data:
    strings = [string for string in item.stripped_strings]
    if len(strings) == 3:
        dates.append(strings[0])
        times.append(strings[1])
        venues.append(strings[2])
    elif len(strings) == 2:
        dates.append(strings[0])
        times.append(strings[1])
        venues.append('N/A')
    elif len(strings) == 1:
        dates.append(strings[0])
        times.append('N/A')
        venues.append('N/A')
results = list(zip(events, dates, times, venues))
df = pd.DataFrame(results)
print(df)
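If you want the columns labelled rather than numbered, you could pass names when building the DataFrame, for example:
# give the frame descriptive column headers
df = pd.DataFrame(results, columns=['Event', 'Date', 'Time', 'Venue'])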