Using Python and BeautifulSoup to scrape a list from a URL

I am new to BeautifulSoup, so please excuse any beginner mistakes here. I am attempting to scrape a URL and want to store the list of movies released under each date.
Below is the code I have so far:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    mv=ul.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
I am getting "AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?"
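For context: find_all() returns a ResultSet, which is essentially a list of Tag objects, and only an individual Tag supports find_all(). A minimal sketch of the distinction, using made-up HTML rather than the IMDb page:

from bs4 import BeautifulSoup

# Hypothetical markup just to illustrate the error, not taken from IMDb.
soup = BeautifulSoup("<ul><li><a>One</a></li><li><a>Two</a></li></ul>", "html.parser")

uls = soup.find_all("ul")      # ResultSet (a list of Tag objects)
# uls.find_all("a")            # AttributeError: ResultSet has no find_all
links = uls[0].find_all("a")   # works: call find_all on a single Tag
print([a.get_text() for a in links])  # ['One', 'Two']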
Expected result, in a list or dataframe:
29th May 2020 Romantic
29th May 2020 Sohreyan Da Pind Aa Gaya
5th June 2020 Lakshmi Bomb
and so on
Thanks in advance for help.

This script will get all movies and their corresponding dates into a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/calendar?region=IN&ref_=rlm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out, last = [], ''
for tag in soup.select('#main h4, #main li'):
    if tag.name == 'h4':
        last = tag.get_text(strip=True)
    else:
        out.append({'Date': last, 'Movie': tag.get_text(strip=True).rsplit('(', maxsplit=1)[0]})
df = pd.DataFrame(out)
print(df)
Prints:
Date Movie
0 29 May 2020 Romantic
1 29 May 2020 Sohreyan Da Pind Aa Gaya
2 05 June 2020 Laxmmi Bomb
3 05 June 2020 Roohi Afzana
4 05 June 2020 Nikamma
.. ... ...
95 26 March 2021 Untitled Luv Ranjan Film
96 02 April 2021 F9
97 02 April 2021 Bell Bottom
98 02 April 2021 NTR Trivikiram Untitled Movie
99 09 April 2021 Manje Bistre 3
[100 rows x 2 columns]
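If you then want real dates for sorting or grouping (the scraped values are plain strings like "29 May 2020"), one possible follow-up, assuming every date keeps that "day month year" shape, is:

# Assumes the scraped dates all look like "29 May 2020" / "05 June 2020".
df['Date'] = pd.to_datetime(df['Date'], format='%d %B %Y')
df = df.sort_values('Date')
print(df.groupby('Date')['Movie'].count().head())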

I think you should replace "ul" with "h1" in the line mv=ul.find_all('a'), and add a definition of the variable "movielist" beforehand.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
# add code here
movielist = []
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    # replace ul with h1 here
    mv=h1.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
print(movielist)
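Since the question also asked for a dataframe, one possible follow-up (assuming pandas is available) is to build it straight from movielist:

import pandas as pd

# movielist holds (date, movie) tuples collected above.
df = pd.DataFrame(movielist, columns=['Date', 'Movie'])
print(df)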

The original code didn't define a list to receive the results, so I added movielist, and I changed the text capture to use h1 (the matched list element) instead of the ul ResultSet.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
movielist = []
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    mv=h1.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
The reason the dates don't match the titles in the output is that the retrieved 'date' values look like the following, so you need to fix the logic. There can be multiple titles under the same release date, so the number of dates and the number of titles don't match up. I can't help much more than this because I don't have the time. Have a good night.
29 May 2020
05 June 2020
07 June 2020
07 June 2020 Romantic
12 June 2020
12 June 2020 Sohreyan Da Pind Aa Gaya
18 June 2020
18 June 2020 Laxmmi Bomb
19 June 2020
19 June 2020 Roohi Afzana
25 June 2020
25 June 2020 Nikamma
26 June 2020
26 June 2020 Naandhi
02 July 2020
02 July 2020 Mandela
03 July 2020
03 July 2020 Medium Spicy
10 July 2020
10 July 2020 Out of the Blue
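One possible way to keep each date paired with only its own titles (a sketch, assuming every release-date <h4> on the page is followed by a <ul> of its movies) is to walk from each heading to the next list instead of zipping two separate ResultSets:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')

movielist = []
for h4 in soup.find_all("h4"):
    date_text = h4.get_text(strip=True)
    ul = h4.find_next("ul")          # the list that belongs to this date
    if ul is None:
        continue
    for a in ul.find_all("a"):
        movielist.append((date_text, a.get_text(strip=True)))
print(movielist)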

Related

Web Scraping ESPN NFL webpage with Python

I am trying to perform web scraping with Python on the ESPN website to extract historical NFL game results (scores only) into a CSV file. I'm unable to find a way to add the dates shown in the desired output. Could someone help me get from the current output to the desired output? The website I am scraping, my current output, and the desired output are below:
NFL Website:
https://www.espn.com/nfl/scoreboard/_/week/17/year/2022/seasontype/2
Current Output:
Week #, Away Team, Away Score, Home Team, Home Score
Week 17, Cowboys, 27, Titans, 13
Week 17, Cardinals, 19, Falcons, 20
Week 17, Bears, 10, Lions, 41
Desired Game Results Output:
Week #, Date, Away Team, Away Score, Home Team, Home Score
Week 17, 12/29/2022, Cowboys, 27, Titans, 13
Week 17, 1/1/2023, Cardinals, 19, Falcons, 20
Week 17, 1/1/2023, Bears, 10, Lions, 41
Code:
import bs4
from bs4 import BeautifulSoup
import requests
import pandas as pd
import numpy as np
from urllib.request import urlopen

daterange = 1
url_list = []
while daterange < 19:
    url = "https://www.espn.com/nfl/scoreboard/_/week/"+str(daterange)+"/year/2022/seasontype/2"
    url_list.append(url)
    daterange = daterange + 1

j = 1
away_team = []
home_team = []
away_team_score = []
home_team_score = []
week = []
for url in url_list:
    response = urlopen(url)
    urlname = requests.get(url)
    bs = bs4.BeautifulSoup(urlname.text,'lxml')
    print(response.url)
    i = 0
    while True:
        try:
            name = bs.findAll('div',{'class':'ScoreCell__TeamName ScoreCell__TeamName--shortDisplayName truncate db'})[i]
        except Exception:
            break
        name = name.get_text()
        try:
            score = bs.findAll('div',{'class':'ScoreCell__Score h4 clr-gray-01 fw-heavy tar ScoreCell_Score--scoreboard pl2'})[i]
        except Exception:
            break
        score = score.get_text()
        if i%2 == 0:
            away_team.append(name)
            away_team_score.append(score)
        else:
            home_team.append(name)
            home_team_score.append(score)
            week.append("week "+str(j))
        i = i + 1
    j = j + 1

web_scraping = list(zip(week, home_team, home_team_score, away_team, away_team_score))
web_scraping_df = pd.DataFrame(web_scraping, columns = ['week','home_team','home_team_score','away_team','away_team_score'])
web_scraping_df
Try:
import requests
import pandas as pd
from bs4 import BeautifulSoup
week = 17
url = f'https://www.espn.com/nfl/scoreboard/_/week/{week}/year/2022/seasontype/2'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
all_data = []
for board in soup.select('.ScoreboardScoreCell'):
    title = board.find_previous(class_='Card__Header__Title').text
    teams = [t.text for t in board.select('.ScoreCell__TeamName')]
    scores = [s.text for s in board.select('.ScoreCell__Score')] or ['-', '-']
    all_data.append((week, title, teams[0], scores[0], teams[1], scores[1]))
df = pd.DataFrame(all_data, columns=['Week', 'Date', 'Team 1', 'Score 1', 'Team 2', 'Score 2'])
print(df.to_markdown(index=False))
Prints:
| Week | Date                        | Team 1    | Score 1 | Team 2     | Score 2 |
|-----:|:----------------------------|:----------|:--------|:-----------|:--------|
|   17 | Thursday, December 29, 2022 | Cowboys   | 27      | Titans     | 13      |
|   17 | Sunday, January 1, 2023     | Cardinals | 19      | Falcons    | 20      |
|   17 | Sunday, January 1, 2023     | Bears     | 10      | Lions      | 41      |
|   17 | Sunday, January 1, 2023     | Broncos   | 24      | Chiefs     | 27      |
|   17 | Sunday, January 1, 2023     | Dolphins  | 21      | Patriots   | 23      |
|   17 | Sunday, January 1, 2023     | Colts     | 10      | Giants     | 38      |
|   17 | Sunday, January 1, 2023     | Saints    | 20      | Eagles     | 10      |
|   17 | Sunday, January 1, 2023     | Panthers  | 24      | Buccaneers | 30      |
|   17 | Sunday, January 1, 2023     | Browns    | 24      | Commanders | 10      |
|   17 | Sunday, January 1, 2023     | Jaguars   | 31      | Texans     | 3       |
|   17 | Sunday, January 1, 2023     | 49ers     | 37      | Raiders    | 34      |
|   17 | Sunday, January 1, 2023     | Jets      | 6       | Seahawks   | 23      |
|   17 | Sunday, January 1, 2023     | Vikings   | 17      | Packers    | 41      |
|   17 | Sunday, January 1, 2023     | Rams      | 10      | Chargers   | 31      |
|   17 | Sunday, January 1, 2023     | Steelers  | 16      | Ravens     | 13      |
|   17 | Monday, January 2, 2023     | Bills     | -       | Bengals    | -       |
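If the goal is the whole 2022 season rather than a single week (as in the original url_list loop), a possible extension is below. It assumes the same page structure holds for every week and that the first team listed in each scoreboard cell is the away team, as in the question's current output; the CSV filename is just an example.

import requests
import pandas as pd
from bs4 import BeautifulSoup

all_data = []
for week in range(1, 19):  # regular-season weeks 1-18
    url = f'https://www.espn.com/nfl/scoreboard/_/week/{week}/year/2022/seasontype/2'
    soup = BeautifulSoup(requests.get(url).content, 'html.parser')
    for board in soup.select('.ScoreboardScoreCell'):
        title = board.find_previous(class_='Card__Header__Title').text
        teams = [t.text for t in board.select('.ScoreCell__TeamName')]
        scores = [s.text for s in board.select('.ScoreCell__Score')] or ['-', '-']
        all_data.append((f'Week {week}', title, teams[0], scores[0], teams[1], scores[1]))

df = pd.DataFrame(all_data, columns=['Week', 'Date', 'Away Team', 'Away Score', 'Home Team', 'Home Score'])
df.to_csv('nfl_2022_results.csv', index=False)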

BeautifulSoup scraper doesn't retrieve any information

I am trying to retrieve football squad data from multiple Wikipedia pages and put it in a pandas DataFrame. One example of the source is this [link][1], but I want to do this for the pages from 1930 to 2018.
The code that I will show used to work in Python 2 and I'm trying to adapt it to Python 3. The information on every page consists of multiple tables with 7 columns. All of the tables have the same format.
The code used to crash but now runs. The only problem is that it produces an empty .csv file.
Just to put more context I made some specific changes :
Python 2
path = os.path.join('.cache', hashlib.md5(url).hexdigest() + '.html')
Python 3
path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
Python 2
with open(path, 'w') as fd:
Python 3
with open(path, 'wb') as fd:
Python 2
years = range(1930,1939,4) + range(1950,2015,4)
Python 3: Yes here I also changed the range so I could get World Cup 2018
years = list(range(1930,1939,4)) + list(range(1950,2019,4))
This is the whole chunk of code. If somebody can spot where the problem is and suggest a solution, I would be very thankful.
import os
import hashlib
import requests
from bs4 import BeautifulSoup
import pandas as pd

if not os.path.exists('.cache'):
    os.makedirs('.cache')

ua = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.116 Safari/15612.1.29.41.4'
session = requests.Session()

def get(url):
    '''Return cached lxml tree for url'''
    path = os.path.join('.cache', hashlib.sha256(url.encode('utf-8')).hexdigest() + '.html')
    if not os.path.exists(path):
        print(url)
        response = session.get(url, headers={'User-Agent': ua})
        with open(path, 'wb') as fd:
            fd.write(response.text.encode('utf-8'))
    return BeautifulSoup(open(path), 'html.parser')

def squads(url):
    result = []
    soup = get(url)
    year = url[29:33]
    for table in soup.find_all('table','sortable'):
        if "wikitable" not in table['class']:
            country = table.find_previous("span","mw-headline").text
            for tr in table.find_all('tr')[1:]:
                cells = [td.text.strip() for td in tr.find_all('td')]
                cells += [country, td.a.get('title') if td.a else 'none', year]
                result.append(cells)
    return result

years = list(range(1930,1939,4)) + list(range(1950,2019,4))
result = []
for year in years:
    url = "http://en.wikipedia.org/wiki/"+str(year)+"_FIFA_World_Cup_squads"
    result += squads(url)

Final_result = pd.DataFrame(result)
Final_result.to_csv('/Users/home/Downloads/data.csv', index=False, encoding='iso-8859-1')
[1]: https://en.wikipedia.org/wiki/2018_FIFA_World_Cup_squads
To get information about each team for the years 1930-2018, you can use the next example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/{}_FIFA_World_Cup_squads"
dfs = []
for year in range(1930, 2019):
    print(year)
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    tables = soup.find_all(
        lambda tag: tag.name == "table"
        and tag.select_one('th:-soup-contains("Pos.")')
    )
    for table in tables:
        for tag in table.select('[style="display:none"]'):
            tag.extract()
        df = pd.read_html(str(table))[0]
        df["Year"] = year
        df["Country"] = table.find_previous(["h3", "h2"]).span.text
        dfs.append(df)
df = pd.concat(dfs)
print(df)
df.to_csv("data.csv", index=False)
Prints:
...
13 14 FW Moussa Konaté 3 April 1993 (aged 25) 28 Amiens 2018 Senegal 10.0
14 15 FW Diafra Sakho 24 December 1989 (aged 28) 12 Rennes 2018 Senegal 3.0
15 16 GK Khadim N'Diaye 5 April 1985 (aged 33) 26 Horoya 2018 Senegal 0.0
16 17 MF Badou Ndiaye 27 October 1990 (aged 27) 20 Stoke City 2018 Senegal 1.0
17 18 FW Ismaïla Sarr 25 February 1998 (aged 20) 16 Rennes 2018 Senegal 3.0
18 19 FW M'Baye Niang 19 December 1994 (aged 23) 7 Torino 2018 Senegal 0.0
19 20 FW Keita Baldé 8 March 1995 (aged 23) 19 Monaco 2018 Senegal 3.0
20 21 DF Lamine Gassama 20 October 1989 (aged 28) 36 Alanyaspor 2018 Senegal 0.0
21 22 DF Moussa Wagué 4 October 1998 (aged 19) 10 Eupen 2018 Senegal 0.0
22 23 GK Alfred Gomis 5 September 1993 (aged 24) 1 SPAL 2018 Senegal 0.0
and saves data.csv (screenshot from LibreOffice):
Just tested: you get no data because the "wikitable" class is present in every table.
You can replace "not in" with "in":
if "wikitable" in table["class"]:
    ...
and your BeautifulSoup data will be there.
Once you change this condition, you will have a problem with this line:
cells += [country, td.a.get('title') if td.a else 'none', year]
because td is not defined at that point. I'm not quite sure what the aim of these lines is, but you can define the tds first and use them afterwards:
tds = tr.find_all('td')
cells += ...
In general, you can add breakpoints in your code to easily identify where the problem is.
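Putting both suggestions together, a possible corrected inner loop for squads() might look like the sketch below. It assumes the club link sits in the last cell of each row, which is a guess at what the original code intended:

def squads(url):
    result = []
    soup = get(url)
    year = url[29:33]
    for table in soup.find_all('table', 'sortable'):
        if "wikitable" in table['class']:                      # was "not in"
            country = table.find_previous("span", "mw-headline").text
            for tr in table.find_all('tr')[1:]:
                tds = tr.find_all('td')                        # define the cells first
                cells = [td.text.strip() for td in tds]
                # assumption: the club link is in the last cell of the row
                club = tds[-1].a.get('title') if tds and tds[-1].a else 'none'
                cells += [country, club, year]
                result.append(cells)
    return result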

Error "6 columns passed, passed data had 286 columns "

I am web-scraping the table found on this website: https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html
Everything was good, but I had a small issue with the "Price" column and was unable to fix it. I've been trying for the past few hours, and this is the last error that I ran into: "6 columns passed, passed data had 286 columns".
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import requests
page = requests.get("https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html")
soup = BeautifulSoup(page.content, "lxml")
gdp = soup.find_all("table", attrs={"class": "table flight-detail hidden-xs"})
print("Number of tables on site: ",len(gdp))
table1 = gdp[0]
# the head will form our column names
body = table1.find_all("tr")
print(len(body))
# Head values (Column names) are the first items of the body list
head = body[0] # 0th item is the header row
body_rows = body[1:] # All other items become the rest of the rows
# Lets now iterate through the head HTML code and make a list of clean headings
# Declare empty list to keep Column names
headings = []
for item in head.find_all("th"): # loop through all th elements
    # convert the th elements to text and strip "\n"
    item = (item.text).rstrip("\n")
    # append the clean column name to headings
    headings.append(item)
print(headings)

import re
all_rows = [] # will be a list of lists, one per row
for row_num in range(len(body_rows)): # A row at a time
    row = [] # this will hold entries for one row
    for row_item in body_rows[row_num].find_all("td")[:-1]: # loop through all row entries
        # row_item.text removes the tags from the entries
        # the following regex is to remove \xa0 and \n and comma from row_item.text
        # xa0 encodes the flag, \n is the newline and comma separates thousands in numbers
        aa = re.sub("(\xa0)|(\n)|(\t),","",row_item.text)
        # append aa to row - note one row entry is being appended
        row.append(aa)
    # append one row to all_rows
    all_rows.append(row)
    for row_item in body_rows[row_num].find_all("td")[-1].find("span").text: # loop through the last row entry, price.
        aa = re.sub("(\xa0)|(\n)|(\t),","",row_item)
        row.append(aa)
    all_rows.append(row)

# We can now use the data in all_rows and headings to make a table
# all_rows becomes our data and headings the column names
df = pd.DataFrame(data=all_rows,columns=headings)
#df.head()
#print(df)
df["Date"]=pd.to_datetime(df["Date"]).dt.strftime("%d/%m/%Y")
print(df)
If you could please run the code and tell me how to solve this issue so that I can print everything when using print(df), that would be great.
Previously, I was able to print everything except the price, which showed "\t\t\t\t\t\t\t" instead of the actual price.
Thank you.
To get the table into a pandas DataFrame, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = (
    "https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html"
)
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = []
for tr in soup.select("tr:has(td)"):
    row = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
    data.append(row)
df = pd.DataFrame(data, columns="From To Aircraft Seats Date Price".split())
print(df)
df.to_csv("data.csv", index=False)
Prints:
From To Aircraft Seats Date Price
0 Prague Vaclav Havel Airport Bratislava M R Stefanik Citation XLS+ 9 Thu Jun 03 00:00:00 UTC 2021 €3 300 (RRP €6 130)
1 Billund Odense Learjet 45 / 45XR 8 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €7 100)
2 La Roche/yon Les Ajoncs Nantes Atlantique Embraer Phenom 100 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €4 820)
3 London Biggin Hill Paris Le Bourget Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €6 980)
4 Prague Vaclav Havel Airport Salzburg (mozart) Gulfstream G200 9 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 800)
5 Palma De Mallorca Edinburgh Cessna C525 Citation CJ2 5 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €18 680)
6 Linz Blue Danube Linz Munich Munchen Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €3 600)
7 Geneva Cointrin Paris Le Bourget Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €9 240)
8 Vienna Schwechat Cologne-bonn Koln Bonn Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 590)
9 Cannes Mandelieu Geneva Cointrin Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 220)
10 Brussels National Cologne-bonn Koln Bonn Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €3 790)
11 Split Bari Palese Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 220)
12 Copenhagen Roskilde Aalborg Challenger 604 11 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €16 750)
13 Brussels National Leipzig Halle Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €6 690)
...
And saves data.csv (screenshot from LibreOffice):
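If you also want the Date column in dd/mm/YYYY form, as the original code attempted, a possible follow-up (untested against the live site, assuming the scraped values keep the "Thu Jun 03 00:00:00 UTC 2021" shape) is:

# Parse the verbose date strings and re-format them; dateutil (used by
# pandas under the hood) understands the "Thu Jun 03 00:00:00 UTC 2021" form.
df["Date"] = pd.to_datetime(df["Date"]).dt.strftime("%d/%m/%Y")
print(df.head())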

extract just date from beautifulsoup result

I am trying to scrape a date from a website using BeautifulSoup.
How do I extract only the date-time from this? I only want: May 21, 2021 19:47
You can use this example showing how to extract the date-time from the <ctag>s:
from bs4 import BeautifulSoup
html_doc = """
<ctag class="">May 21, 2021 19:47 Source: <span>BSE</span> </ctag>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for ctag in soup.find_all("ctag"):
    dt = ctag.get_text(strip=True).rsplit(maxsplit=1)[0]
    print(dt)
Prints:
May 21, 2021 19:47
Or:
for ctag in soup.find_all("ctag"):
    dt = ctag.contents[0].rsplit(maxsplit=1)[0]
    print(dt)
Or:
for ctag in soup.find_all("ctag"):
    dt = ctag.find_next(text=True).rsplit(maxsplit=1)[0]
    print(dt)
EDIT: To get dataframe of articles, you can do:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.moneycontrol.com/company-notices/reliance-industries/notices/RI"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = []
for ctag in soup.select("li ctag"):
    data.append(
        {
            "title": ctag.find_next("a").get_text(strip=True),
            "date": ctag.find_next(text=True).rsplit(maxsplit=1)[0],
            "desc": ctag.find_next("p", class_="MT2").get_text(strip=True),
        }
    )
df = pd.DataFrame(data)
print(df)
Prints:
title date desc
0 Reliance Industries - Compliances-Reg. 39 (3) ... May 21, 2021 19:47 Pursuant to Regulation 39(3) of the Securities...
1 Reliance Industries - Announcement under Regul... May 19, 2021 21:20 We refer to Regulation 5 of the SEBI (Prohibit...
2 Reliance Industries - Announcement under Regul... May 17, 2021 17:18 In continuation of our letter dated May 15, 20...
3 Reliance Industries - Announcement under Regul... May 17, 2021 16:06 Please find attached a media release by Relian...
4 Reliance Industries - Announcement under Regul... May 15, 2021 15:15 The Company has, on May 15, 2021, published in...
5 Reliance Industries - Compliances-Reg. 39 (3) ... May 14, 2021 19:44 Pursuant to Regulation 39(3) of the Securities...
6 Reliance Industries - Notice For Payment Of Fi... May 13, 2021 22:57 We refer to our letter dated May 01, 2021. A...
7 Reliance Industries - Announcement under Regul... May 12, 2021 21:20 We wish to inform you that the Company partici...
8 Reliance Industries - Compliances-Reg. 39 (3) ... May 12, 2021 19:39 Pursuant to Regulation 39(3) of the Securities...
9 Reliance Industries - Compliances-Reg. 39 (3) ... May 11, 2021 19:49 Pursuant to Regulation 39(3) of the Securities...
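If you then want the date column as real timestamps instead of strings (for sorting or filtering), a possible follow-up, assuming the "May 21, 2021 19:47" format holds for every row, is:

# Hypothetical follow-up: convert the "May 21, 2021 19:47" strings to datetimes.
df["date"] = pd.to_datetime(df["date"], format="%B %d, %Y %H:%M")
print(df.sort_values("date", ascending=False).head())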

Finding links with beautifulsoup in Python

I am having a hard time trying to extract the hyperlinks from a page with BeautifulSoup. I have tried many different tags and classes but can't seem to get them without a whole bunch of other HTML I don't want. Is anyone able to tell me where I'm going wrong? Code below:
from bs4 import BeautifulSoup
import requests
page_link = url
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")
pagecode = soup.find(class_='infinite-scroll-container')
title = pagecode.findAll('i')
artist = pagecode.find_all('h1', "exhibition-title")
links = pagecode.find_all('article', "teaser infinite-scroll-item")
printcount=0
while printcount < len(title):
    titlestring = title[printcount].text
    artiststring = artist[printcount].text
    artiststring = artiststring.replace(titlestring, '')
    artiststring = artiststring.strip()
    titlestring = titlestring.strip()
    print(artiststring)
    print(titlestring)
    print("----------------------------------------")
    printcount = printcount+1
You could directly target all the links on that page and then filter them to get only the links within an article. Note that this page is fully loaded only on scroll, so you may have to use selenium to get all the links (see the sketch at the end of this answer). For now I will answer how to filter the links.
from bs4 import BeautifulSoup
import requests
import re
page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")
links= soup.find_all('a')
for link in links:
    if link.parent.name=='article':  # only article links
        print(re.sub(r"\s\s+", " ", link.text).strip())  # replace multiple spaces with one
        print(link['href'])
        print()
Output
Nicola Farquhar A Holotype Heart 22 Nov – 21 Dec 2018 Wellington
https://hopkinsonmossman.com/exhibitions/nicola-farquhar-5/
Bill Culbert Desk Lamp, Crash 19 Oct – 17 Nov 2018 Wellington
https://hopkinsonmossman.com/exhibitions/bill-culbert-2/
Nick Austin, Ammon Ngakuru Many Happy Returns 18 Oct – 15 Nov 2018 Auckland
https://hopkinsonmossman.com/exhibitions/nick-austin-ammon-ngakuru/
Dane Mitchell Tuning 13 Sep – 13 Oct 2018 Wellington
https://hopkinsonmossman.com/exhibitions/dane-mitchell-4/
Shannon Te Ao my life as a tunnel 08 Sep – 13 Oct 2018 Auckland
https://hopkinsonmossman.com/exhibitions/shannon-te-ao/
Tilt Anoushka Akel, Ruth Buchanan, Meg Porteous 16 Aug – 08 Sep 2018 Wellington
https://hopkinsonmossman.com/exhibitions/anoushka-akel-ruth-buchanan-meg-porteous/
Shadow Work Fiona Connor, Oliver Perkins 02 Aug – 01 Sep 2018 Auckland
https://hopkinsonmossman.com/exhibitions/group-show/
Emma McIntyre Rose on red 13 Jul – 11 Aug 2018 Wellington
https://hopkinsonmossman.com/exhibitions/emma-mcintyre-2/
Tahi Moore Incomprehensible public fictions: Writer fights politician in car park 04 Jul – 28 Jul 2018 Auckland
https://hopkinsonmossman.com/exhibitions/tahi-moore-2/
Oliver Perkins Bleeding Edge 01 Jun – 07 Jul 2018 Wellington
https://hopkinsonmossman.com/exhibitions/oliver-perkins-2/
Spinning Phillip Lai, Peter Robinson 19 May – 23 Jun 2018 Auckland
https://hopkinsonmossman.com/exhibitions/1437/
Milli Jannides Cavewoman 19 Apr – 26 May 2018 Wellington
https://hopkinsonmossman.com/exhibitions/milli-jannides/
Oscar Enberg Taste & Power, a prologue 06 Apr – 12 May 2018 Auckland
https://hopkinsonmossman.com/exhibitions/oscar-enberg/
Fiona Connor Closed Down Clubs & Monochromes 09 Mar – 14 Apr 2018 Wellington
https://hopkinsonmossman.com/exhibitions/closed-down-clubs-and-monochromes/
Bill Culbert Colour Theory, Window Mobile 02 Mar – 29 Mar 2018 Auckland
https://hopkinsonmossman.com/exhibitions/colour-theory-window-mobile/
Role Models Curated by Rob McKenzie
Robert Bittenbender, Ellen Cantor, Jennifer McCamley, Josef Strau 26 Jan – 24 Feb 2018 Auckland
https://hopkinsonmossman.com/exhibitions/role-models/
Emma McIntyre Pink Square Sways 24 Nov – 23 Dec 2017 Auckland
https://hopkinsonmossman.com/exhibitions/emma-mcintyre/
My initial thought was to use the "ajax-link" class, but it turns out the 'HOPKINSON MOSSMAN' link also has that class. You could also use that approach and skip the first link in the find_all result, which will give you the same output.
from bs4 import BeautifulSoup
import requests
import re
page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")
links= soup.find_all('a',class_='ajax-link')
for link in links[1:]:
    print(re.sub(r"\s\s+", " ", link.text).strip())  # replace multiple spaces with one
    print(link['href'])
    print()
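As mentioned above, the page loads more exhibitions as you scroll, so requests alone only sees the first batch. A rough selenium sketch (assuming Chrome with a working chromedriver; the scroll count and delay are guesses to tune for the page) could look like this:

import re
import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is available on PATH
driver.get('https://hopkinsonmossman.com/exhibitions/past/')

# Scroll a few times so the infinite-scroll container loads more items.
for _ in range(10):
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)

soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for link in soup.find_all('a'):
    if link.parent.name == 'article':  # only article links, as above
        print(re.sub(r'\s\s+', ' ', link.text).strip())
        print(link['href'])
        print()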
