I am working on a web-scraping project and have been asked to scrape all the PDF links from a website:
https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=3&s .
The website has 397 pages, but every page has the same URL. Using the inspect-element tool I found that JavaScript code handles the navigation between pages, but I still cannot figure out how to run my script across all the pages.
Below is my code.
from bs4 import BeautifulSoup
import urllib2

url = 'https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=3&s'
conn = urllib2.urlopen(url)
html = conn.read()
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('a')
urls = []
for tag in links:
    link = tag.get('href', None)
    if link is not None and link.endswith('html'):
        #urls.append(link)
        # open each filing page and look for PDF links in <a> and <iframe> tags
        purl = link
        new = urllib2.urlopen(purl)
        htm = new.read()
        sp = BeautifulSoup(htm, 'lxml')
        nl = sp.find_all('a')
        nm = sp.find_all('iframe')
        for i in nl:
            q = i.get('href', None)
            title = i.get('title', None)
            if q is not None and q.endswith('pdf'):
                print(q)
                urls.append(q)
        for j in nm:
            z = j.get('src', None)
            title = j.get('title', None)
            if z is not None and z.endswith('pdf') and title is not None:
                print(z)
                print(title)
                urls.append(z)
print(len(urls))
You can use their API at https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp to load the data.
For example:
from bs4 import BeautifulSoup
import requests

api_url = 'https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp'

payload = {
    'nextValue': "1",
    'next': "n",
    'search': "",
    'fromDate': "",
    'toDate': "",
    'fromYear': "",
    'toYear': "",
    'deptId': "",
    'sid': "3",
    'ssid': "-1",
    'smid': "0",
    'intmid': "-1",
    'sText': "Filings",
    'ssText': "-- All Sub Section --",
    'smText': "",
    'doDirect': "1",
}

page = 0
while True:
    print('Page {}...'.format(page))
    payload['doDirect'] = page
    soup = BeautifulSoup(requests.post(api_url, data=payload).content, 'html.parser')
    rows = soup.select('tr:has(td)')
    if not rows:
        break
    for tr in rows:
        row = [td.get_text(strip=True) for td in tr.select('td')] + [tr.a['href']]
        print(*row, sep='\t')
    page += 1
Prints:
...
Page 1...
Jun 25, 2020 Mindspace Business Parks REIT – Addendum to Draft Prospectus https://www.sebi.gov.in/filings/reit-issues/jun-2020/mindspace-business-parks-reit-addendum-to-draft-prospectus_46928.html
Jun 25, 2020 Amrit Corp. Ltd. - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/amrit-corp-ltd-public-announcement_46927.html
Jun 24, 2020 NIIT Technologies Buyback - Post Buyback - Public Advertisement https://www.sebi.gov.in/filings/buybacks/jun-2020/niit-technologies-buyback-post-buyback-public-advertisement_46923.html
Jun 23, 2020 Addendum to Letter of Offer of Arvind Fashions Limited https://www.sebi.gov.in/filings/rights-issues/jun-2020/addendum-to-letter-of-offer-of-arvind-fashions-limited_46941.html
Jun 23, 2020 Genesis Exports Limited - Draft letter of Offer https://www.sebi.gov.in/filings/buybacks/jun-2020/genesis-exports-limited-draft-letter-of-offer_46911.html
Jun 23, 2020 Genesis Exports Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/genesis-exports-limited-public-announcement_46909.html
Jun 19, 2020 Coral India Finance and Housing Limited – Post Buy-back Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/coral-india-finance-and-housing-limited-post-buy-back-public-announcement_46900.html
Jun 19, 2020 Network Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/network-limited_46890.html
Jun 17, 2020 KSOLVES INDIA LIMITED https://www.sebi.gov.in/filings/public-issues/jun-2020/ksolves-india-limited_46996.html
Jun 10, 2020 Happiest Minds Technologies Limited https://www.sebi.gov.in/filings/public-issues/jun-2020/happiest-minds-technologies-limited_46843.html
Jun 08, 2020 IM+ Capitals Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/im-capitals-limited_46786.html
Jun 05, 2020 HealthCare Global Enterprises Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/healthcare-global-enterprises-limited_46773.html
Jun 02, 2020 Jaikumar Constructions Ltd. - DRHP https://www.sebi.gov.in/filings/public-issues/jun-2020/jaikumar-constructions-ltd-drhp_46774.html
Jun 02, 2020 Mahindra Focused Equity Yojana https://www.sebi.gov.in/filings/mutual-funds/jun-2020/mahindra-focused-equity-yojana_46767.html
Jun 02, 2020 GRANULES INDIA LIMITED - Dispatch advertisement https://www.sebi.gov.in/filings/buybacks/jun-2020/granules-india-limited-dispatch-advertisement_46765.html
Jun 02, 2020 GRANULES INDIA LIMITED - Letter of Offer https://www.sebi.gov.in/filings/buybacks/jun-2020/granules-india-limited-letter-of-offer_46764.html
Jun 02, 2020 Motilal Oswal Multi Asset Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/motilal-oswal-multi-asset-fund_46762.html
Jun 02, 2020 Principal Large Cap Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/principal-large-cap-fund_46761.html
Jun 02, 2020 Mahindra Arbitrage Yojana https://www.sebi.gov.in/filings/mutual-funds/jun-2020/mahindra-arbitrage-yojana_46760.html
Jun 02, 2020 HSBC Mid Cap Equity Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/hsbc-mid-cap-equity-fund_46759.html
Jun 01, 2020 Tanla Solutions Limited - DLOF https://www.sebi.gov.in/filings/buybacks/jun-2020/tanla-solutions-limited-dlof_46750.html
Jun 01, 2020 Axis Banking ETF https://www.sebi.gov.in/filings/mutual-funds/jun-2020/axis-banking-etf_46748.html
Jun 01, 2020 Kalpataru Power Transmission Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/kalpataru-power-transmission-limited-public-announcement_46746.html
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 22, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-22-2020_46745.html
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 19, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-19-2020_46744.html
Page 2...
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 18, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-18-2020_46743.html
May 29, 2020 Muthoottu Mini Financiers Limited- Prospectus https://www.sebi.gov.in/filings/debt-offer-document/may-2020/muthoottu-mini-financiers-limited-prospectus_46769.html
May 29, 2020 Coral India Housing and Finance Limited - Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/coral-india-housing-and-finance-limited-advertisement_46732.html
May 29, 2020 TANLA SOLUTIONS LIMITED - Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/tanla-solutions-limited-public-announcement_46731.html
May 28, 2020 Tips Industries Limited - Dispatch Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/tips-industries-limited-dispatch-advertisement_46723.html
May 27, 2020 KLM Axiva Finvest Limited - Prospectus https://www.sebi.gov.in/filings/debt-offer-document/may-2020/klm-axiva-finvest-limited-prospectus_46755.html
May 26, 2020 Tips Industries Limited - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/tips-industries-limited-letter-of-offer_46708.html
May 26, 2020 Axis Capital Protection Oriented Fund - Series 7-10 https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-capital-protection-oriented-fund-series-7-10_46707.html
May 26, 2020 ICICI Prudential Alpha Low Vol 30 ETF https://www.sebi.gov.in/filings/mutual-funds/may-2020/icici-prudential-alpha-low-vol-30-etf_46706.html
May 22, 2020 NIIT Technologies Ltd. - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/niit-technologies-ltd-letter-of-offer_46700.html
May 22, 2020 NIIT Technologies Ltd. - Dispatch Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/niit-technologies-ltd-dispatch-advertisement_46699.html
May 22, 2020 Coral India Finance and Housing Limited - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/coral-india-finance-and-housing-limited-letter-of-offer_46698.html
May 22, 2020 Jay Ushin Limited https://www.sebi.gov.in/filings/takeovers/may-2020/jay-ushin-limited_46697.html
May 22, 2020 Pennar Industries - Post Buyback Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/pennar-industries-post-buyback-public-announcement_46696.html
May 22, 2020 Axis Global Equity Alpha Fund of Fund. https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-global-equity-alpha-fund-of-fund-_46695.html
May 21, 2020 Axis Global Disruption Fund of Fund https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-global-disruption-fund-of-fund_46694.html
May 18, 2020 Reliance Industries Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/reliance-industries-limited_46675.html
May 14, 2020 Public Advertisement of Spencer's Retail Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/public-advertisement-of-spencer-s-retail-limited_46693.html
May 12, 2020 Spencer's Retail Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/spencer-s-retail-limited_46692.html
May 12, 2020 Sequent Scientific Limited https://www.sebi.gov.in/filings/takeovers/may-2020/sequent-scientific-limited_46662.html
May 11, 2020 Arvind Fashions Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/arvind-fashions-limited_46659.html
May 05, 2020 JK Paper Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/jk-paper-limited-public-announcement_46647.html
May 05, 2020 Aurionpro Solutions Limited - Post BuyBack Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/aurionpro-solutions-limited-post-buyback-advertisement_46646.html
May 04, 2020 KSOLVES INDIA LIMITED https://www.sebi.gov.in/filings/public-issues/may-2020/ksolves-india-limited_46644.html
May 04, 2020 SBI ETF Consumption https://www.sebi.gov.in/filings/mutual-funds/may-2020/sbi-etf-consumption_46639.html
Page 3...
... and so on.
It seems the website makes a POST request to getnewslistinfo.jsp and gets back the new table content as HTML. You can open the Network panel (Ctrl+Shift+E in Firefox), navigate to the next page, and watch the request being made and its parameters.
You can mimic that POST request and change the appropriate parameters for the next page (from what I saw it should be nextValue and doDirect) using urllib2 (or, preferably, requests). After you get the content you can simply parse it with BeautifulSoup and extract the a tags the way you already did.
Also a tip: separate your code into functions that each do one thing, such as getPage(pageNum), which returns the HTML content for a given page number, and getLinks(html), which takes an HTML page and returns all the links from the table as a list. This way your code will be more readable and easier to debug and reuse.
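A rough sketch of that layout (not tested against the live site; the function names and the final PDF-extraction step are only illustrative, reusing the payload keys from the example above):

import requests
from bs4 import BeautifulSoup

API_URL = 'https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp'
PAYLOAD = {
    'nextValue': "1", 'next': "n", 'search': "", 'fromDate': "", 'toDate': "",
    'fromYear': "", 'toYear': "", 'deptId': "", 'sid': "3", 'ssid': "-1",
    'smid': "0", 'intmid': "-1", 'sText': "Filings",
    'ssText': "-- All Sub Section --", 'smText': "", 'doDirect': "1",
}

def get_page(page_num):
    # fetch the HTML fragment for one listing page from the POST endpoint
    payload = dict(PAYLOAD, doDirect=page_num)
    return requests.post(API_URL, data=payload).text

def get_links(html):
    # return the filing-page URLs found in the table rows of one listing page
    soup = BeautifulSoup(html, 'html.parser')
    return [tr.a['href'] for tr in soup.select('tr:has(td)') if tr.a]

def get_pdf_links(filing_url):
    # return any .pdf links (from <a href> or <iframe src>) on a filing page
    soup = BeautifulSoup(requests.get(filing_url).content, 'html.parser')
    pdfs = [a['href'] for a in soup.find_all('a', href=True) if a['href'].endswith('.pdf')]
    pdfs += [f['src'] for f in soup.find_all('iframe', src=True) if f['src'].endswith('.pdf')]
    return pdfs

all_pdfs = []
page = 0
while True:
    links = get_links(get_page(page))
    if not links:
        break
    for link in links:
        all_pdfs.extend(get_pdf_links(link))
    page += 1
print(len(all_pdfs))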
I am trying to scrape a date from a website using BeautifulSoup.
How do I extract only the date-time from this? I only want: May 21, 2021 19:47
You can use this example to extract the date-time from the <ctag>s:
from bs4 import BeautifulSoup

html_doc = """
<ctag class="">May 21, 2021 19:47 Source: <span>BSE</span> </ctag>
"""

soup = BeautifulSoup(html_doc, "html.parser")

for ctag in soup.find_all("ctag"):
    dt = ctag.get_text(strip=True).rsplit(maxsplit=1)[0]
    print(dt)
Prints:
May 21, 2021 19:47
Or:
for ctag in soup.find_all("ctag"):
    dt = ctag.contents[0].rsplit(maxsplit=1)[0]
    print(dt)
Or:
for ctag in soup.find_all("ctag"):
    dt = ctag.find_next(text=True).rsplit(maxsplit=1)[0]
    print(dt)
EDIT: To get a DataFrame of the articles, you can do:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.moneycontrol.com/company-notices/reliance-industries/notices/RI"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = []
for ctag in soup.select("li ctag"):
    data.append(
        {
            "title": ctag.find_next("a").get_text(strip=True),
            "date": ctag.find_next(text=True).rsplit(maxsplit=1)[0],
            "desc": ctag.find_next("p", class_="MT2").get_text(strip=True),
        }
    )

df = pd.DataFrame(data)
print(df)
Prints:
title date desc
0 Reliance Industries - Compliances-Reg. 39 (3) ... May 21, 2021 19:47 Pursuant to Regulation 39(3) of the Securities...
1 Reliance Industries - Announcement under Regul... May 19, 2021 21:20 We refer to Regulation 5 of the SEBI (Prohibit...
2 Reliance Industries - Announcement under Regul... May 17, 2021 17:18 In continuation of our letter dated May 15, 20...
3 Reliance Industries - Announcement under Regul... May 17, 2021 16:06 Please find attached a media release by Relian...
4 Reliance Industries - Announcement under Regul... May 15, 2021 15:15 The Company has, on May 15, 2021, published in...
5 Reliance Industries - Compliances-Reg. 39 (3) ... May 14, 2021 19:44 Pursuant to Regulation 39(3) of the Securities...
6 Reliance Industries - Notice For Payment Of Fi... May 13, 2021 22:57 We refer to our letter dated May 01, 2021. A...
7 Reliance Industries - Announcement under Regul... May 12, 2021 21:20 We wish to inform you that the Company partici...
8 Reliance Industries - Compliances-Reg. 39 (3) ... May 12, 2021 19:39 Pursuant to Regulation 39(3) of the Securities...
9 Reliance Industries - Compliances-Reg. 39 (3) ... May 11, 2021 19:49 Pursuant to Regulation 39(3) of the Securities...
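If you later need real datetime values rather than strings, the date column can be parsed with pandas (a one-liner, assuming every value keeps the exact "May 21, 2021 19:47" style shown above):

df["date"] = pd.to_datetime(df["date"], format="%b %d, %Y %H:%M")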
I have a question regarding DataFrames. I have written code with Selenium to extract a table from a website; however, I am unsure how to transform the Selenium text into a DataFrame and export it to CSV. Below is my code.
import time

import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("Path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
table = driver.find_element_by_xpath('//table[@id="inlineSearchTable"]/tbody')
while True:
    try:
        print(table.text)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
driver.quit()
If you are using Selenium, you need to get the outerHTML of the table and then use pd.read_html() to get the DataFrame.
Then append it to an (initially empty) DataFrame and export that to CSV.
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
dfbase = pd.DataFrame()
while True:
    try:
        table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#inlineSearchTable"))).get_attribute("outerHTML")
        df = pd.read_html(str(table))[0]
        dfbase = dfbase.append(df, ignore_index=True)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break

print(dfbase)
dfbase.to_csv("TestResultsDF.csv")
driver.quit()
Output:
Name Date Added to the List
0 24option.com Aug 6, 2013
1 3storich Aug 20, 2020
2 4XP Investments & Trading and Forex Place Ltd. Mar 15, 2012
3 6149154 Canada Inc. d.b.a. Forexcanus Aug 25, 2011
4 72Option, owned and operated by Epic Ventures ... Dec 8, 2016
5 A&L Royal Finance Inc. May 6, 2015
6 Abler Finance Sep 26, 2014
7 Accredited International / Accredited FX Mar 15, 2013
8 Aidan Trading Jan 24, 2018
9 AlfaTrade, Nemesis Capital Limited (together, ... Mar 16, 2016
10 Alma Group Co Trading Ltd. Oct 7, 2020
11 Ameron Oil and Gas Ltd. Sep 23, 2010
12 Anchor Securities Limited Aug 29, 2011
13 Anyoption Jul 8, 2013
14 Arial Trading, LLC Nov 20, 2008
15 Asia & Pacific Holdings Inc. Dec 5, 2017
16 Astercap Ltd., doing business as Broker Official Aug 31, 2018
17 Astor Capital Fund Limited (Astor) Apr 9, 2020
18 Astrofx24 Nov 19, 2019
19 Atlantic Global Asset Management Sep 12, 2017
20 Ava FX, Ava Financial Ltd. and Ava Capital Mar... Mar 15, 2012
21 Ava Trade Ltd. May 30, 2016
22 Avariz Group Nov 4, 2020
23 B.I.S. Blueport Investment Services Ltd., doin... Sep 7, 2017
24 B4Option May 3, 2017
25 Banc de Binary Ltd. Jul 29, 2013
26 BCG Invest Apr 6, 2020
27 BeFaster.fit Limited (BeFaster) Jun 22, 2020
28 Beltway M&A Oct 6, 2009
29 Best Commodity Options Aug 1, 2012
.. ... ...
301 Trade12, owned and operated by Exo Capital Mar... Mar 1, 2017
302 TradeNix Jul 30, 2020
303 TradeQuicker May 21, 2014
304 TradeRush.com Aug 6, 2013
305 Trades Capital, operated by TTN Marketing Ltd.... May 18, 2016
306 Tradewell.io Jan 20, 2020
307 TradexOption Apr 20, 2020
308 Trinidad Oil & Gas Corporation Dec 6, 2011
309 Truevalue Investment International Limited May 11, 2018
310 UK Options Mar 3, 2015
311 United Financial Commodity Group, operating as... Nov 15, 2018
312 Up & Down Marketing Limited (dba OneTwoTrade) Apr 27, 2015
313 USI-TECH Limited Dec 15, 2017
314 uTrader and Day Dream Investments Ltd. (togeth... Nov 29, 2017
315 Vision Financial Partners, LLC Feb 18, 2016
316 Vision Trading Advisors Feb 18, 2016
317 Wallis Partridge LLC Apr 24, 2014
318 Waverly M&A Jan 19, 2010
319 Wealth Capital Corp. Sep 4, 2012
320 Wentworth & Wellesley Ltd. / Wentworth & Welle... Mar 13, 2012
321 West Golden Capital Dec 1, 2010
322 World Markets Sep 22, 2020
323 WorldWide CapitalFX Feb 8, 2019
324 XForex, owned and operated by XFR Financial Lt... Jul 19, 2016
325 Xtelus Profit Nov 30, 2020
326 You Trade Holdings Limited Jun 3, 2011
327 Zen Vybe Inc. Mar 27, 2020
328 ZenithOptions Feb 12, 2016
329 Ziptradex Limited (Ziptradex) May 21, 2020
330 Zulu Trade Inc. Mar 2, 2015
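Note that DataFrame.append was deprecated and has since been removed in pandas 2.0. On a newer pandas, the same page-by-page accumulation can be done by collecting each page's frame in a list and concatenating once at the end, for example:

frames = []
# inside the try block, instead of dfbase = dfbase.append(df, ignore_index=True):
frames.append(pd.read_html(str(table))[0])
# after the loop:
dfbase = pd.concat(frames, ignore_index=True)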
I am new to BeautifulSoup, so please excuse any beginner mistakes here. I am attempting to scrape a URL and want to store the list of movies under each date.
Below is the code I have so far:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    mv = ul.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))
I am getting "AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?"
Expected result (as a list or DataFrame):
29th May 2020 Romantic
29th May 2020 Sohreyan Da Pind Aa Gaya
5th June 2020 Lakshmi Bomb
and so on
Thanks in advance for your help.
This script will get all movies and corresponding dates to a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/calendar?region=IN&ref_=rlm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out, last = [], ''
for tag in soup.select('#main h4, #main li'):
    if tag.name == 'h4':
        last = tag.get_text(strip=True)
    else:
        out.append({'Date': last, 'Movie': tag.get_text(strip=True).rsplit('(', maxsplit=1)[0]})

df = pd.DataFrame(out)
print(df)
Prints:
Date Movie
0 29 May 2020 Romantic
1 29 May 2020 Sohreyan Da Pind Aa Gaya
2 05 June 2020 Laxmmi Bomb
3 05 June 2020 Roohi Afzana
4 05 June 2020 Nikamma
.. ... ...
95 26 March 2021 Untitled Luv Ranjan Film
96 02 April 2021 F9
97 02 April 2021 Bell Bottom
98 02 April 2021 NTR Trivikiram Untitled Movie
99 09 April 2021 Manje Bistre 3
[100 rows x 2 columns]
I think you should replace "ul" with "h1" in the line mv=ul.find_all('a'), and add a definition of the variable movielist before the loop.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")

# add code here
movielist = []

for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    # replace ul with h1 here
    mv = h1.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))

print(movielist)
In my first attempt I had not defined a list to receive the results; in the version below movielist is defined, and the links are taken from h1 (each <ul>) rather than from the ul ResultSet.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')

movielist = []
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    mv = h1.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))
The reason the dates don't match the titles in the output is that the retrieved 'date' values look like the sample below, so the logic needs fixing: several titles can share the same release date, so the number of dates and the number of titles don't line up. A sketch of one way to pair each date with its own titles is shown after the sample output. I can't help much more than that because I don't have the time. Have a good night.
29 May 2020
05 June 2020
07 June 2020
07 June 2020 Romantic
12 June 2020
12 June 2020 Sohreyan Da Pind Aa Gaya
18 June 2020
18 June 2020 Laxmmi Bomb
19 June 2020
19 June 2020 Roohi Afzana
25 June 2020
25 June 2020 Nikamma
26 June 2020
26 June 2020 Naandhi
02 July 2020
02 July 2020 Mandela
03 July 2020
03 July 2020 Medium Spicy
10 July 2020
10 July 2020 Out of the Blue
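One way to keep each date together with only its own titles (a rough sketch, assuming every release-date <h4> on the calendar page is followed by the <ul> of its movie links) is to walk from each h4 to the <ul> that follows it instead of zipping two separately collected lists:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')

movielist = []
for h4 in soup.find_all("h4"):
    ul = h4.find_next("ul")  # the list of movies released on that date
    if ul is None:
        continue
    dd_ = h4.get_text(strip=True)
    for movie in ul.find_all("a"):
        movielist.append((dd_, movie.get_text(strip=True)))

print(movielist)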
I am having a hard time trying to extract the hyperlinks from a page with BeautifulSoup. I have tried many different tags and classes but can't seem to get the links without a whole bunch of other HTML I don't want. Is anyone able to tell me where I'm going wrong? Code below:
from bs4 import BeautifulSoup
import requests

page_link = url
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")
pagecode = soup.find(class_='infinite-scroll-container')

title = pagecode.findAll('i')
artist = pagecode.find_all('h1', "exhibition-title")
links = pagecode.find_all('article', "teaser infinite-scroll-item")

printcount = 0
while printcount < len(title):
    titlestring = title[printcount].text
    artiststring = artist[printcount].text
    artiststring = artiststring.replace(titlestring, '')
    artiststring = artiststring.strip()
    titlestring = titlestring.strip()
    print(artiststring)
    print(titlestring)
    print("----------------------------------------")
    printcount = printcount + 1
You could directly target all the links on that page and then filter them to keep only the links inside an article. Note that this page fully loads only on scroll, so you may have to use Selenium to get all the links (a rough Selenium sketch is included at the end of this answer); for now I will answer how to filter the links.
from bs4 import BeautifulSoup
import requests
import re

page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")

links = soup.find_all('a')
for link in links:
    if link.parent.name == 'article':  # only article links
        print(re.sub(r"\s\s+", " ", link.text).strip())  # replace multiple spaces with one
        print(link['href'])
        print()
Output
Nicola Farquhar A Holotype Heart 22 Nov – 21 Dec 2018 Wellington
https://hopkinsonmossman.com/exhibitions/nicola-farquhar-5/
Bill Culbert Desk Lamp, Crash 19 Oct – 17 Nov 2018 Wellington
https://hopkinsonmossman.com/exhibitions/bill-culbert-2/
Nick Austin, Ammon Ngakuru Many Happy Returns 18 Oct – 15 Nov 2018 Auckland
https://hopkinsonmossman.com/exhibitions/nick-austin-ammon-ngakuru/
Dane Mitchell Tuning 13 Sep – 13 Oct 2018 Wellington
https://hopkinsonmossman.com/exhibitions/dane-mitchell-4/
Shannon Te Ao my life as a tunnel 08 Sep – 13 Oct 2018 Auckland
https://hopkinsonmossman.com/exhibitions/shannon-te-ao/
Tilt Anoushka Akel, Ruth Buchanan, Meg Porteous 16 Aug – 08 Sep 2018 Wellington
https://hopkinsonmossman.com/exhibitions/anoushka-akel-ruth-buchanan-meg-porteous/
Shadow Work Fiona Connor, Oliver Perkins 02 Aug – 01 Sep 2018 Auckland
https://hopkinsonmossman.com/exhibitions/group-show/
Emma McIntyre Rose on red 13 Jul – 11 Aug 2018 Wellington
https://hopkinsonmossman.com/exhibitions/emma-mcintyre-2/
Tahi Moore Incomprehensible public fictions: Writer fights politician in car park 04 Jul – 28 Jul 2018 Auckland
https://hopkinsonmossman.com/exhibitions/tahi-moore-2/
Oliver Perkins Bleeding Edge 01 Jun – 07 Jul 2018 Wellington
https://hopkinsonmossman.com/exhibitions/oliver-perkins-2/
Spinning Phillip Lai, Peter Robinson 19 May – 23 Jun 2018 Auckland
https://hopkinsonmossman.com/exhibitions/1437/
Milli Jannides Cavewoman 19 Apr – 26 May 2018 Wellington
https://hopkinsonmossman.com/exhibitions/milli-jannides/
Oscar Enberg Taste & Power, a prologue 06 Apr – 12 May 2018 Auckland
https://hopkinsonmossman.com/exhibitions/oscar-enberg/
Fiona Connor Closed Down Clubs & Monochromes 09 Mar – 14 Apr 2018 Wellington
https://hopkinsonmossman.com/exhibitions/closed-down-clubs-and-monochromes/
Bill Culbert Colour Theory, Window Mobile 02 Mar – 29 Mar 2018 Auckland
https://hopkinsonmossman.com/exhibitions/colour-theory-window-mobile/
Role Models Curated by Rob McKenzie
Robert Bittenbender, Ellen Cantor, Jennifer McCamley, Josef Strau 26 Jan – 24 Feb 2018 Auckland
https://hopkinsonmossman.com/exhibitions/role-models/
Emma McIntyre Pink Square Sways 24 Nov – 23 Dec 2017 Auckland
https://hopkinsonmossman.com/exhibitions/emma-mcintyre/
My initial thought was to use the "ajax-link" class, but it turns out the 'HOPKINSON MOSSMAN' link also has that class. You could still use that approach and filter out the first link from the find_all result, which gives the same output.
from bs4 import BeautifulSoup
import requests
import re

page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")

links = soup.find_all('a', class_='ajax-link')
for link in links[1:]:
    print(re.sub(r"\s\s+", " ", link.text).strip())  # replace multiple spaces with one
    print(link['href'])
    print()
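As for the infinite scroll, a rough Selenium sketch (untested against the live site; the scroll-until-stable loop is a generic pattern, not something specific to this page) would be to keep scrolling until the page height stops growing and then hand the rendered HTML to BeautifulSoup:

import time
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://hopkinsonmossman.com/exhibitions/past/')

last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the next batch of exhibitions time to load
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
# from here the same article-link filtering as above applies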
I am practicing some scraping on https://www.nytimes.com/section/politics.
My code so far is:
Dates = driver.find_elements_by_class_name("css-umh681")
len(Dates)

Date_M = []
for Date in Dates:
    print(Date.text)
    Date_M.append(Date.text)
Date_M

HeadLines = driver.find_elements_by_class_name("css-1dq8tca")
len(HeadLines)

HeadLine_M = []
for HeadLine in HeadLines:
    print(HeadLine.text)
    HeadLine_M.append(HeadLine.text)
HeadLine_M
How do I extract the text of the selected elements into a DataFrame with a Date column and a Headline column?
try this
import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome('chromedriver.exe')
driver.get('https://www.nytimes.com/section/politics')
class_ele = driver.find_element_by_class_name('css-13mho3u')

pos = 0
df = pd.DataFrame(columns=['Date', 'Headline'])
for ol in class_ele.find_elements_by_class_name('css-ye6x8s'):
    data = []
    h2 = ol.find_element_by_class_name('css-1dq8tca').text
    div_2 = ol.find_element_by_class_name('css-umh681').text
    data.append(div_2)
    data.append(h2)
    df.loc[pos] = data
    pos += 1
print(df)
Date Headline
0 Dec 27, 2018 LinkedIn Co-Founder Apologizes for Deception i...
1 Dec 27, 2018 Trump in Iraq: First Visit to U.S. Troops in C...
2 Dec 27, 2018 Federal Workers, Some in ‘Panic Mode,’ Share S...
3 Dec 26, 2018 Did a Queens Podiatrist Help Donald Trump Avoi...
4 Dec 26, 2018 Donald Trump’s Registration Card
5 Dec 26, 2018 Donald Trump’s Selective Service Records
6 Dec 26, 2018 Arms Sales to Saudis Leave American Fingerprin...
7 Dec 26, 2018 Black Voters, a Force in Democratic Politics, ...
8 Dec 25, 2018 How Did Rifles With an American Stamp End Up i...
9 Dec 25, 2018 Kids, Please Don’t Read This Article on What T...
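As a side note, if you keep the original approach of collecting Date_M and HeadLine_M as two separate lists, and the two lists line up one-to-one (which only holds when every headline on the page has a matching date element), they can be combined into a DataFrame directly:

import pandas as pd

df = pd.DataFrame({'Date': Date_M, 'Headline': HeadLine_M})
df.to_csv('headlines.csv', index=False)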