Finding links with BeautifulSoup in Python

I am having a hard time trying to extract the hyperlinks from a page with BeautifulSoup. I have tried many different tags and classes but can't seem to get the links without a whole bunch of other HTML I don't want. Is anyone able to tell me where I'm going wrong? Code below:
from bs4 import BeautifulSoup
import requests

page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")

pagecode = soup.find(class_='infinite-scroll-container')
title = pagecode.findAll('i')
artist = pagecode.find_all('h1', "exhibition-title")
links = pagecode.find_all('article', "teaser infinite-scroll-item")

printcount = 0
while printcount < len(title):
    titlestring = title[printcount].text
    artiststring = artist[printcount].text
    artiststring = artiststring.replace(titlestring, '')
    artiststring = artiststring.strip()
    titlestring = titlestring.strip()
    print(artiststring)
    print(titlestring)
    print("----------------------------------------")
    printcount = printcount + 1

You could directly target all the links on that page and then filter them to keep only the links inside an article. Note that this page is fully loaded only on scroll, so you may have to use Selenium to get all the links (a minimal scrolling sketch follows the second example below). For now I will answer how to filter the links.
from bs4 import BeautifulSoup
import requests
import re

page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")
links = soup.find_all('a')
for link in links:
    if link.parent.name == 'article':  # only article links
        print(re.sub(r"\s\s+", " ", link.text).strip())  # replace multiple spaces with one
        print(link['href'])
        print()
Output
Nicola Farquhar A Holotype Heart 22 Nov – 21 Dec 2018 Wellington
https://hopkinsonmossman.com/exhibitions/nicola-farquhar-5/
Bill Culbert Desk Lamp, Crash 19 Oct – 17 Nov 2018 Wellington
https://hopkinsonmossman.com/exhibitions/bill-culbert-2/
Nick Austin, Ammon Ngakuru Many Happy Returns 18 Oct – 15 Nov 2018 Auckland
https://hopkinsonmossman.com/exhibitions/nick-austin-ammon-ngakuru/
Dane Mitchell Tuning 13 Sep – 13 Oct 2018 Wellington
https://hopkinsonmossman.com/exhibitions/dane-mitchell-4/
Shannon Te Ao my life as a tunnel 08 Sep – 13 Oct 2018 Auckland
https://hopkinsonmossman.com/exhibitions/shannon-te-ao/
Tilt Anoushka Akel, Ruth Buchanan, Meg Porteous 16 Aug – 08 Sep 2018 Wellington
https://hopkinsonmossman.com/exhibitions/anoushka-akel-ruth-buchanan-meg-porteous/
Shadow Work Fiona Connor, Oliver Perkins 02 Aug – 01 Sep 2018 Auckland
https://hopkinsonmossman.com/exhibitions/group-show/
Emma McIntyre Rose on red 13 Jul – 11 Aug 2018 Wellington
https://hopkinsonmossman.com/exhibitions/emma-mcintyre-2/
Tahi Moore Incomprehensible public fictions: Writer fights politician in car park 04 Jul – 28 Jul 2018 Auckland
https://hopkinsonmossman.com/exhibitions/tahi-moore-2/
Oliver Perkins Bleeding Edge 01 Jun – 07 Jul 2018 Wellington
https://hopkinsonmossman.com/exhibitions/oliver-perkins-2/
Spinning Phillip Lai, Peter Robinson 19 May – 23 Jun 2018 Auckland
https://hopkinsonmossman.com/exhibitions/1437/
Milli Jannides Cavewoman 19 Apr – 26 May 2018 Wellington
https://hopkinsonmossman.com/exhibitions/milli-jannides/
Oscar Enberg Taste & Power, a prologue 06 Apr – 12 May 2018 Auckland
https://hopkinsonmossman.com/exhibitions/oscar-enberg/
Fiona Connor Closed Down Clubs & Monochromes 09 Mar – 14 Apr 2018 Wellington
https://hopkinsonmossman.com/exhibitions/closed-down-clubs-and-monochromes/
Bill Culbert Colour Theory, Window Mobile 02 Mar – 29 Mar 2018 Auckland
https://hopkinsonmossman.com/exhibitions/colour-theory-window-mobile/
Role Models Curated by Rob McKenzie
Robert Bittenbender, Ellen Cantor, Jennifer McCamley, Josef Strau 26 Jan – 24 Feb 2018 Auckland
https://hopkinsonmossman.com/exhibitions/role-models/
Emma McIntyre Pink Square Sways 24 Nov – 23 Dec 2017 Auckland
https://hopkinsonmossman.com/exhibitions/emma-mcintyre/
My initial thought was to use the "ajax-link" class, but it turns out the 'HOPKINSON MOSSMAN' link also has that class. You could still use that approach and filter out the first link from the find_all result, which will give you the same output.
from bs4 import BeautifulSoup
import requests
import re

page_link = 'https://hopkinsonmossman.com/exhibitions/past/'
page_response = requests.get(page_link, timeout=5)
soup = BeautifulSoup(page_response.content, "html.parser")
links = soup.find_all('a', class_='ajax-link')
for link in links[1:]:
    print(re.sub(r"\s\s+", " ", link.text).strip())  # replace multiple spaces with one
    print(link['href'])
    print()
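As noted above, the page keeps loading more exhibitions as you scroll, so plain requests only sees the first batch. A minimal Selenium sketch of the scroll-then-parse approach (the scroll loop and the two-second wait are assumptions, not tested against the site):
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import re

driver = webdriver.Chrome()  # assumes a chromedriver is available on PATH
driver.get('https://hopkinsonmossman.com/exhibitions/past/')

# keep scrolling until the page height stops growing, which is assumed
# to mean the infinite scroll has loaded everything
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # give the newly loaded items time to appear
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# same filtering as in the requests-based example above
for link in soup.find_all('a'):
    if link.parent.name == 'article':
        print(re.sub(r"\s\s+", " ", link.text).strip())
        print(link['href'])
        print()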

Related

Cannot get response.get() to load full webpage

When I go to scrape https://www.onthesnow.com/epic-pass/skireport for the names of all the ski resorts listed, I'm running into an issue where some of the ski resorts don't show up in my output. Here's my current code:
import requests
url = "https://www.onthesnow.com/epic-pass/skireport"
response = requests.get(url)
response.text
The current output gives all resorts up to Mont Sainte Anne, but then it skips to the resorts at the bottom of the webpage under "closed resorts". I notice that when you scroll down the page in a browser, the missing resort names only load once they are scrolled into view. How do I make requests.get() obtain all of the HTML, even the HTML that still needs to load?
The data you see is loaded from an external URL in JSON form. To load it, you can use this example:
import json
import requests

url = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/1?limit=999"
data = requests.get(url).json()

# uncomment to print all data:
# print(json.dumps(data, indent=4))

for i, d in enumerate(data["data"], 1):
    print(i, d["title"])
Prints:
1 Beaver Creek
2 Breckenridge
3 Brides les Bains
4 Courchevel
5 Crested Butte Mountain Resort
6 Fernie Alpine
7 Folgàrida - Marilléva
8 Heavenly
9 Keystone
10 Kicking Horse
11 Kimberley
12 Kirkwood
13 La Tania
14 Les Menuires
15 Madonna di Campiglio
16 Meribel
17 Mont Sainte Anne
18 Nakiska Ski Area
19 Nendaz
20 Northstar California
21 Okemo Mountain Resort
22 Orelle
23 Park City
24 Pontedilegno - Tonale
25 Saint Martin de Belleville
26 Snowbasin
27 Stevens Pass Resort
28 Stoneham
29 Stowe Mountain
30 Sun Valley
31 Thyon 2000
32 Vail
33 Val Thorens
34 Verbier
35 Veysonnaz
36 Whistler Blackcomb
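If more than the resort names are needed, the same JSON response can be loaded into pandas; a minimal sketch, assuming (as the loop above does) that data["data"] is a list with one dict per resort, and noting that the available columns depend on what the API actually returns:
import pandas as pd
import requests

url = "https://api.onthesnow.com/api/v2/region/1291/resorts/1/page/1?limit=999"
data = requests.get(url).json()

# json_normalize flattens each resort dict into one DataFrame row
df = pd.json_normalize(data["data"])
print(df["title"])
df.to_csv("resorts.csv", index=False)  # illustrative filename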

Error "6 columns passed, passed data had 286 columns "

I am web-scraping the table found on this website: https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html
Everything was good, but I had a small issue with the "Price" label and was unable to fix it. I've been trying for the past few hours and this is the last error that I ran into: "6 columns passed, passed data had 286 columns".
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import requests

page = requests.get("https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html")
soup = BeautifulSoup(page.content, "lxml")

gdp = soup.find_all("table", attrs={"class": "table flight-detail hidden-xs"})
print("Number of tables on site: ", len(gdp))

table1 = gdp[0]
# the head will form our column names
body = table1.find_all("tr")
print(len(body))

# Head values (Column names) are the first items of the body list
head = body[0]  # 0th item is the header row
body_rows = body[1:]  # All other items become the rest of the rows

# Let's now iterate through the head HTML code and make a list of clean headings
# Declare empty list to keep column names
headings = []
for item in head.find_all("th"):  # loop through all th elements
    # convert the th elements to text and strip "\n"
    item = (item.text).rstrip("\n")
    # append the clean column name to headings
    headings.append(item)
print(headings)

import re
all_rows = []  # will be a list of lists, one per row
for row_num in range(len(body_rows)):  # a row at a time
    row = []  # this will hold entries for one row
    for row_item in body_rows[row_num].find_all("td")[:-1]:  # loop through all row entries
        # row_item.text removes the tags from the entries
        # the following regex is to remove \xa0 and \n and comma from row_item.text
        # xa0 encodes the flag, \n is the newline and comma separates thousands in numbers
        aa = re.sub("(\xa0)|(\n)|(\t),", "", row_item.text)
        # append aa to row - note one row entry is being appended
        row.append(aa)
    # append one row to all_rows
    all_rows.append(row)
    for row_item in body_rows[row_num].find_all("td")[-1].find("span").text:  # loop through the last row entry, price.
        aa = re.sub("(\xa0)|(\n)|(\t),", "", row_item)
        row.append(aa)
    all_rows.append(row)

# We can now use the data in all_rows and headings to make a table
# all_rows becomes our data and headings the column names
df = pd.DataFrame(data=all_rows, columns=headings)
# df.head()
# print(df)
df["Date"] = pd.to_datetime(df["Date"]).dt.strftime("%d/%m/%Y")
print(df)
If you could please run the code and tell me how to solve this issue so that I can print everything when using print(df).
Previously, I was able to print everything except the price, which showed "\t\t\t\t\t\t\t" instead of the price.
Thank you.
To get the table into a pandas DataFrame, you can use this example:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = []
for tr in soup.select("tr:has(td)"):
    row = [td.get_text(strip=True, separator=" ") for td in tr.select("td")]
    data.append(row)

df = pd.DataFrame(data, columns="From To Aircraft Seats Date Price".split())
print(df)
df.to_csv("data.csv", index=False)
Prints:
From To Aircraft Seats Date Price
0 Prague Vaclav Havel Airport Bratislava M R Stefanik Citation XLS+ 9 Thu Jun 03 00:00:00 UTC 2021 €3 300 (RRP €6 130)
1 Billund Odense Learjet 45 / 45XR 8 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €7 100)
2 La Roche/yon Les Ajoncs Nantes Atlantique Embraer Phenom 100 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €4 820)
3 London Biggin Hill Paris Le Bourget Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €6 980)
4 Prague Vaclav Havel Airport Salzburg (mozart) Gulfstream G200 9 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 800)
5 Palma De Mallorca Edinburgh Cessna C525 Citation CJ2 5 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €18 680)
6 Linz Blue Danube Linz Munich Munchen Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €3 600)
7 Geneva Cointrin Paris Le Bourget Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €9 240)
8 Vienna Schwechat Cologne-bonn Koln Bonn Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 590)
9 Cannes Mandelieu Geneva Cointrin Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 220)
10 Brussels National Cologne-bonn Koln Bonn Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €3 790)
11 Split Bari Palese Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €8 220)
12 Copenhagen Roskilde Aalborg Challenger 604 11 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €16 750)
13 Brussels National Leipzig Halle Cessna 510 Mustang 4 Thu Jun 03 00:00:00 UTC 2021 Call for Price (RRP €6 690)
...
It also saves the same data to data.csv.
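Since the table is present in the static HTML (the requests-based parsing above works without a browser), pandas.read_html may also read it directly; this is a rough sketch rather than a guaranteed drop-in, because the parsed column names depend on the table headers the parser finds:
import requests
import pandas as pd

url = "https://www.privatefly.com/privatejet-services/private-jet-empty-legs.html"

# read_html returns one DataFrame per <table> it can parse in the page
tables = pd.read_html(requests.get(url).text)
print(len(tables))
print(tables[0].head())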

Python Selenium Text Convert into Data Frame

I have a question regarding DataFrames. I have written code with Selenium to extract a table from a website, but I am unsure how to transform the Selenium text into a DataFrame and export it as CSV. Below is my code.
import requests
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("Path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
table = driver.find_element_by_xpath('//table[@id="inlineSearchTable"]/tbody')
while True:
    try:
        print(table.text)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
driver.quit()
If you are using Selenium, you need to get the outerHTML of the table and then use pd.read_html() to get the DataFrame.
Then append each page to an empty DataFrame and export it to CSV.
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
dfbase = pd.DataFrame()
while True:
    try:
        table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#inlineSearchTable"))).get_attribute("outerHTML")
        df = pd.read_html(str(table))[0]
        dfbase = dfbase.append(df, ignore_index=True)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
print(dfbase)
dfbase.to_csv("TestResultsDF.csv")
driver.quit()
Output:
Name Date Added to the List
0 24option.com Aug 6, 2013
1 3storich Aug 20, 2020
2 4XP Investments & Trading and Forex Place Ltd. Mar 15, 2012
3 6149154 Canada Inc. d.b.a. Forexcanus Aug 25, 2011
4 72Option, owned and operated by Epic Ventures ... Dec 8, 2016
5 A&L Royal Finance Inc. May 6, 2015
6 Abler Finance Sep 26, 2014
7 Accredited International / Accredited FX Mar 15, 2013
8 Aidan Trading Jan 24, 2018
9 AlfaTrade, Nemesis Capital Limited (together, ... Mar 16, 2016
10 Alma Group Co Trading Ltd. Oct 7, 2020
11 Ameron Oil and Gas Ltd. Sep 23, 2010
12 Anchor Securities Limited Aug 29, 2011
13 Anyoption Jul 8, 2013
14 Arial Trading, LLC Nov 20, 2008
15 Asia & Pacific Holdings Inc. Dec 5, 2017
16 Astercap Ltd., doing business as Broker Official Aug 31, 2018
17 Astor Capital Fund Limited (Astor) Apr 9, 2020
18 Astrofx24 Nov 19, 2019
19 Atlantic Global Asset Management Sep 12, 2017
20 Ava FX, Ava Financial Ltd. and Ava Capital Mar... Mar 15, 2012
21 Ava Trade Ltd. May 30, 2016
22 Avariz Group Nov 4, 2020
23 B.I.S. Blueport Investment Services Ltd., doin... Sep 7, 2017
24 B4Option May 3, 2017
25 Banc de Binary Ltd. Jul 29, 2013
26 BCG Invest Apr 6, 2020
27 BeFaster.fit Limited (BeFaster) Jun 22, 2020
28 Beltway M&A Oct 6, 2009
29 Best Commodity Options Aug 1, 2012
.. ... ...
301 Trade12, owned and operated by Exo Capital Mar... Mar 1, 2017
302 TradeNix Jul 30, 2020
303 TradeQuicker May 21, 2014
304 TradeRush.com Aug 6, 2013
305 Trades Capital, operated by TTN Marketing Ltd.... May 18, 2016
306 Tradewell.io Jan 20, 2020
307 TradexOption Apr 20, 2020
308 Trinidad Oil & Gas Corporation Dec 6, 2011
309 Truevalue Investment International Limited May 11, 2018
310 UK Options Mar 3, 2015
311 United Financial Commodity Group, operating as... Nov 15, 2018
312 Up & Down Marketing Limited (dba OneTwoTrade) Apr 27, 2015
313 USI-TECH Limited Dec 15, 2017
314 uTrader and Day Dream Investments Ltd. (togeth... Nov 29, 2017
315 Vision Financial Partners, LLC Feb 18, 2016
316 Vision Trading Advisors Feb 18, 2016
317 Wallis Partridge LLC Apr 24, 2014
318 Waverly M&A Jan 19, 2010
319 Wealth Capital Corp. Sep 4, 2012
320 Wentworth & Wellesley Ltd. / Wentworth & Welle... Mar 13, 2012
321 West Golden Capital Dec 1, 2010
322 World Markets Sep 22, 2020
323 WorldWide CapitalFX Feb 8, 2019
324 XForex, owned and operated by XFR Financial Lt... Jul 19, 2016
325 Xtelus Profit Nov 30, 2020
326 You Trade Holdings Limited Jun 3, 2011
327 Zen Vybe Inc. Mar 27, 2020
328 ZenithOptions Feb 12, 2016
329 Ziptradex Limited (Ziptradex) May 21, 2020
330 Zulu Trade Inc. Mar 2, 2015
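Note that DataFrame.append, used above, has been removed in newer pandas releases; below is a sketch of the same loop that collects each page's frame in a list and concatenates once at the end (the exception handling is simplified to simply stop on the last page; otherwise the logic is unchanged and not re-tested against the site):
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a chromedriver is available on PATH
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")

frames = []
while True:
    try:
        # grab the current page's table HTML and parse it into a DataFrame
        table = WebDriverWait(driver, 10).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, "table#inlineSearchTable"))
        ).get_attribute("outerHTML")
        frames.append(pd.read_html(table)[0])
        # click through to the next page of results
        WebDriverWait(driver, 10).until(
            EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))
        ).click()
        time.sleep(1)
    except Exception:
        # no clickable "next" button left, so we are on the last page
        break

driver.quit()

# one concat at the end instead of repeated append calls
dfbase = pd.concat(frames, ignore_index=True)
dfbase.to_csv("TestResultsDF.csv", index=False)
print(dfbase)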

How to scrape pdf links from webpages having unchanging urls?

I am working on a web scraping project and have been asked to scrape all the PDF links from this website:
https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=3&s
The website has 397 pages, but every page has the same URL. I used the inspect-element tool and found that a piece of JavaScript handles navigation to the different pages, but I still cannot figure out how to run my script across all the pages.
Below is my code.
from bs4 import BeautifulSoup
import urllib2
import lxml

url = 'https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=3&s'
conn = urllib2.urlopen(url)
html = conn.read()
soup = BeautifulSoup(html)
links = soup.find_all('a')
urls = []
for tag in links:
    link = tag.get('href', None)
    if link is not None and link.endswith('html'):
        #urls.append(link)
        purl = link
        new = urllib2.urlopen(purl)
        htm = new.read()
        sp = BeautifulSoup(htm)
        nl = sp.find_all('a')
        nm = sp.find_all('iframe')
        for i in nl:
            q = i.get('href', None)
            title = i.get('title', None)
            if q is not None and q.endswith('pdf'):
                print(q)
                urls.append(q)
        for j in nm:
            z = j.get('src', None)
            title = j.get('title', None)
            if z is not None and z.endswith('pdf') and title is not None:
                print(z)
                print(title)
                urls.append(z)
print(len(urls))
You can use their API located at https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp to load the data.
For example:
from bs4 import BeautifulSoup
import requests

api_url = 'https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp'

payload = {
    'nextValue': "1",
    'next': "n",
    'search': "",
    'fromDate': "",
    'toDate': "",
    'fromYear': "",
    'toYear': "",
    'deptId': "",
    'sid': "3",
    'ssid': "-1",
    'smid': "0",
    'intmid': "-1",
    'sText': "Filings",
    'ssText': "-- All Sub Section --",
    'smText': "",
    'doDirect': "1",
}

page = 0
while True:
    print('Page {}...'.format(page))
    payload['doDirect'] = page
    soup = BeautifulSoup(requests.post(api_url, data=payload).content, 'html.parser')
    rows = soup.select('tr:has(td)')
    if not rows:
        break
    for tr in rows:
        row = [td.get_text(strip=True) for td in tr.select('td')] + [tr.a['href']]
        print(*row, sep='\t')
    page += 1
Prints:
...
Page 1...
Jun 25, 2020 Mindspace Business Parks REIT – Addendum to Draft Prospectus https://www.sebi.gov.in/filings/reit-issues/jun-2020/mindspace-business-parks-reit-addendum-to-draft-prospectus_46928.html
Jun 25, 2020 Amrit Corp. Ltd. - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/amrit-corp-ltd-public-announcement_46927.html
Jun 24, 2020 NIIT Technologies Buyback - Post Buyback - Public Advertisement https://www.sebi.gov.in/filings/buybacks/jun-2020/niit-technologies-buyback-post-buyback-public-advertisement_46923.html
Jun 23, 2020 Addendum to Letter of Offer of Arvind Fashions Limited https://www.sebi.gov.in/filings/rights-issues/jun-2020/addendum-to-letter-of-offer-of-arvind-fashions-limited_46941.html
Jun 23, 2020 Genesis Exports Limited - Draft letter of Offer https://www.sebi.gov.in/filings/buybacks/jun-2020/genesis-exports-limited-draft-letter-of-offer_46911.html
Jun 23, 2020 Genesis Exports Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/genesis-exports-limited-public-announcement_46909.html
Jun 19, 2020 Coral India Finance and Housing Limited – Post Buy-back Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/coral-india-finance-and-housing-limited-post-buy-back-public-announcement_46900.html
Jun 19, 2020 Network Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/network-limited_46890.html
Jun 17, 2020 KSOLVES INDIA LIMITED https://www.sebi.gov.in/filings/public-issues/jun-2020/ksolves-india-limited_46996.html
Jun 10, 2020 Happiest Minds Technologies Limited https://www.sebi.gov.in/filings/public-issues/jun-2020/happiest-minds-technologies-limited_46843.html
Jun 08, 2020 IM+ Capitals Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/im-capitals-limited_46786.html
Jun 05, 2020 HealthCare Global Enterprises Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/healthcare-global-enterprises-limited_46773.html
Jun 02, 2020 Jaikumar Constructions Ltd. - DRHP https://www.sebi.gov.in/filings/public-issues/jun-2020/jaikumar-constructions-ltd-drhp_46774.html
Jun 02, 2020 Mahindra Focused Equity Yojana https://www.sebi.gov.in/filings/mutual-funds/jun-2020/mahindra-focused-equity-yojana_46767.html
Jun 02, 2020 GRANULES INDIA LIMITED - Dispatch advertisement https://www.sebi.gov.in/filings/buybacks/jun-2020/granules-india-limited-dispatch-advertisement_46765.html
Jun 02, 2020 GRANULES INDIA LIMITED - Letter of Offer https://www.sebi.gov.in/filings/buybacks/jun-2020/granules-india-limited-letter-of-offer_46764.html
Jun 02, 2020 Motilal Oswal Multi Asset Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/motilal-oswal-multi-asset-fund_46762.html
Jun 02, 2020 Principal Large Cap Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/principal-large-cap-fund_46761.html
Jun 02, 2020 Mahindra Arbitrage Yojana https://www.sebi.gov.in/filings/mutual-funds/jun-2020/mahindra-arbitrage-yojana_46760.html
Jun 02, 2020 HSBC Mid Cap Equity Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/hsbc-mid-cap-equity-fund_46759.html
Jun 01, 2020 Tanla Solutions Limited - DLOF https://www.sebi.gov.in/filings/buybacks/jun-2020/tanla-solutions-limited-dlof_46750.html
Jun 01, 2020 Axis Banking ETF https://www.sebi.gov.in/filings/mutual-funds/jun-2020/axis-banking-etf_46748.html
Jun 01, 2020 Kalpataru Power Transmission Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/kalpataru-power-transmission-limited-public-announcement_46746.html
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 22, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-22-2020_46745.html
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 19, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-19-2020_46744.html
Page 2...
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 18, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-18-2020_46743.html
May 29, 2020 Muthoottu Mini Financiers Limited- Prospectus https://www.sebi.gov.in/filings/debt-offer-document/may-2020/muthoottu-mini-financiers-limited-prospectus_46769.html
May 29, 2020 Coral India Housing and Finance Limited - Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/coral-india-housing-and-finance-limited-advertisement_46732.html
May 29, 2020 TANLA SOLUTIONS LIMITED - Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/tanla-solutions-limited-public-announcement_46731.html
May 28, 2020 Tips Industries Limited - Dispatch Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/tips-industries-limited-dispatch-advertisement_46723.html
May 27, 2020 KLM Axiva Finvest Limited - Prospectus https://www.sebi.gov.in/filings/debt-offer-document/may-2020/klm-axiva-finvest-limited-prospectus_46755.html
May 26, 2020 Tips Industries Limited - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/tips-industries-limited-letter-of-offer_46708.html
May 26, 2020 Axis Capital Protection Oriented Fund - Series 7-10 https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-capital-protection-oriented-fund-series-7-10_46707.html
May 26, 2020 ICICI Prudential Alpha Low Vol 30 ETF https://www.sebi.gov.in/filings/mutual-funds/may-2020/icici-prudential-alpha-low-vol-30-etf_46706.html
May 22, 2020 NIIT Technologies Ltd. - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/niit-technologies-ltd-letter-of-offer_46700.html
May 22, 2020 NIIT Technologies Ltd. - Dispatch Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/niit-technologies-ltd-dispatch-advertisement_46699.html
May 22, 2020 Coral India Finance and Housing Limited - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/coral-india-finance-and-housing-limited-letter-of-offer_46698.html
May 22, 2020 Jay Ushin Limited https://www.sebi.gov.in/filings/takeovers/may-2020/jay-ushin-limited_46697.html
May 22, 2020 Pennar Industries - Post Buyback Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/pennar-industries-post-buyback-public-announcement_46696.html
May 22, 2020 Axis Global Equity Alpha Fund of Fund. https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-global-equity-alpha-fund-of-fund-_46695.html
May 21, 2020 Axis Global Disruption Fund of Fund https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-global-disruption-fund-of-fund_46694.html
May 18, 2020 Reliance Industries Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/reliance-industries-limited_46675.html
May 14, 2020 Public Advertisement of Spencer's Retail Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/public-advertisement-of-spencer-s-retail-limited_46693.html
May 12, 2020 Spencer's Retail Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/spencer-s-retail-limited_46692.html
May 12, 2020 Sequent Scientific Limited https://www.sebi.gov.in/filings/takeovers/may-2020/sequent-scientific-limited_46662.html
May 11, 2020 Arvind Fashions Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/arvind-fashions-limited_46659.html
May 05, 2020 JK Paper Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/jk-paper-limited-public-announcement_46647.html
May 05, 2020 Aurionpro Solutions Limited - Post BuyBack Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/aurionpro-solutions-limited-post-buyback-advertisement_46646.html
May 04, 2020 KSOLVES INDIA LIMITED https://www.sebi.gov.in/filings/public-issues/may-2020/ksolves-india-limited_46644.html
May 04, 2020 SBI ETF Consumption https://www.sebi.gov.in/filings/mutual-funds/may-2020/sbi-etf-consumption_46639.html
Page 3...
... and so on.
It seems the website makes a POST request to getnewslistinfo.jsp and gets back the new table content as HTML. You can open the Network monitor (Ctrl+Shift+E in Firefox), then navigate to the next page and see the request being made and its parameters.
You can mimic that POST request and change the appropriate parameters for the next page (from what I saw it should be nextValue and doDirect) using urllib2 (or preferably requests). After you get the content you can simply parse it with BeautifulSoup and extract the a tags the way you already did.
Also a tip: separate your code into functions that do different things, such as getPage(pageNum), which given a page number returns the HTML content, and getLinks(html), which given an HTML page gets all the links from the table and returns them as a list. This way your code will be more readable and easier to debug and reuse (a rough sketch follows).
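Following that tip, here is a rough sketch of how the POST approach could be split into such functions. The getPage/getLinks names come from the tip above; the trimmed payload fields are copied from the answer's example, so treat the exact set as an assumption rather than a verified minimum:
import requests
from bs4 import BeautifulSoup

API_URL = 'https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp'

def getPage(page_num):
    """Fetch one page of listings as HTML via the POST endpoint."""
    payload = {
        'nextValue': str(page_num),
        'next': 'n',
        'sid': '3',
        'ssid': '-1',
        'smid': '0',
        'intmid': '-1',
        'sText': 'Filings',
        'doDirect': str(page_num),
    }
    return requests.post(API_URL, data=payload).content

def getLinks(html):
    """Return the href of every anchor found in the table rows of one page."""
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.select('tr:has(td) a[href]')]

all_links = []
page = 0
while True:
    links = getLinks(getPage(page))
    if not links:
        break  # an empty page means we ran past the last listing page
    all_links.extend(links)
    page += 1

print(len(all_links))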

Using Python and BeautifulSoup to scrape list from an URL

I am new to BeautifulSoup so please excuse any beginner mistakes here. I am attempting to scrape a URL and want to store the list of movies under each release date.
Below is the code I have so far:
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    mv = ul.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))
I am getting "AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?"
Expected result in list or dataframe
29th May 2020 Romantic
29th May 2020 Sohreyan Da Pind Aa Gaya
5th June 2020 Lakshmi Bomb
and so on
Thanks in advance for help.
This script will get all movies and corresponding dates into a DataFrame:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/calendar?region=IN&ref_=rlm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

out, last = [], ''
for tag in soup.select('#main h4, #main li'):
    if tag.name == 'h4':
        last = tag.get_text(strip=True)
    else:
        out.append({'Date': last, 'Movie': tag.get_text(strip=True).rsplit('(', maxsplit=1)[0]})

df = pd.DataFrame(out)
print(df)
Prints:
Date Movie
0 29 May 2020 Romantic
1 29 May 2020 Sohreyan Da Pind Aa Gaya
2 05 June 2020 Laxmmi Bomb
3 05 June 2020 Roohi Afzana
4 05 June 2020 Nikamma
.. ... ...
95 26 March 2021 Untitled Luv Ranjan Film
96 02 April 2021 F9
97 02 April 2021 Bell Bottom
98 02 April 2021 NTR Trivikiram Untitled Movie
99 09 April 2021 Manje Bistre 3
[100 rows x 2 columns]
I think you should replace "ul" with "h1" in the line mv = ul.find_all('a'), and add a definition of the variable "movielist" before the loop.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")

# add code here
movielist = []

for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    # replace ul with h1 here
    mv = h1.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))

print(movielist)
I added the missing list to collect the results into, and changed the text capture to use 'h1' (each list in the zip) instead of the 'ul' ResultSet.
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')

movielist = []
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    mv = h1.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))
The reason the dates don't match up in the output is that the retrieved 'date' values look like the following, so you will need to fix the pairing logic (a rough sketch follows the sample output below).
There are multiple titles under the same release date, so the release dates and the number of titles don't line up. I can't help you much more than that because I don't have the time. Have a good night.
29 May 2020
05 June 2020
07 June 2020
07 June 2020 Romantic
12 June 2020
12 June 2020 Sohreyan Da Pind Aa Gaya
18 June 2020
18 June 2020 Laxmmi Bomb
19 June 2020
19 June 2020 Roohi Afzana
25 June 2020
25 June 2020 Nikamma
26 June 2020
26 June 2020 Naandhi
02 July 2020
02 July 2020 Mandela
03 July 2020
03 July 2020 Medium Spicy
10 July 2020
10 July 2020 Out of the Blue
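One way the pairing could be fixed is to walk the date headings and take only the list that immediately follows each one; this is a minimal sketch, assuming each h4 date heading on the calendar page is directly followed by a ul of its releases (which is also roughly how the first answer's selector treats the page):
import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')

movielist = []
for h4 in soup.find_all('h4'):
    dd_ = h4.get_text(strip=True)
    # pair each date heading with the <ul> that immediately follows it
    ul = h4.find_next_sibling('ul')
    if ul is None:
        continue
    for movie in ul.find_all('a'):
        movielist.append((dd_, movie.get_text(strip=True)))

print(movielist)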
