extract just date from beautifulsoup result - python

I am trying to scrape a date from a website using BeautifulSoup.
How do I extract only the date-time from this? I only want: May 21, 2021 19:47

You can use this example to extract the date-time from the <ctag> elements:
from bs4 import BeautifulSoup
html_doc = """
<ctag class="">May 21, 2021 19:47 Source: <span>BSE</span> </ctag>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for ctag in soup.find_all("ctag"):
    dt = ctag.get_text(strip=True).rsplit(maxsplit=1)[0]
    print(dt)
Prints:
May 21, 2021 19:47
Or:
for ctag in soup.find_all("ctag"):
    dt = ctag.contents[0].rsplit(maxsplit=1)[0]
    print(dt)
Or:
for ctag in soup.find_all("ctag"):
    dt = ctag.find_next(text=True).rsplit(maxsplit=1)[0]
    print(dt)
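If you need an actual datetime object rather than text, the extracted string can be parsed with the standard library. This is only a sketch: the regular expression and the format string are my assumptions about how the date is written on the page, and `%b` expects English month abbreviations (the default C locale).

```python
import re
from datetime import datetime

# Text as it comes out of the tag, before any trimming
raw = "May 21, 2021 19:47 Source: BSE"

# Match "<Mon> <day>, <year> <HH:MM>" at the start of the text
match = re.match(r"[A-Z][a-z]{2} \d{1,2}, \d{4} \d{1,2}:\d{2}", raw)

# Parse the matched text; dt stays None when the pattern is not found
dt = datetime.strptime(match.group(0), "%b %d, %Y %H:%M") if match else None
print(dt)
```

Matching the date pattern directly, instead of splitting off the last word, also keeps working if the trailing "Source: ..." part changes.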
EDIT: To get a dataframe of the articles, you can do:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.moneycontrol.com/company-notices/reliance-industries/notices/RI"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = []
for ctag in soup.select("li ctag"):
    data.append(
        {
            "title": ctag.find_next("a").get_text(strip=True),
            "date": ctag.find_next(text=True).rsplit(maxsplit=1)[0],
            "desc": ctag.find_next("p", class_="MT2").get_text(strip=True),
        }
    )
df = pd.DataFrame(data)
print(df)
Prints:
title date desc
0 Reliance Industries - Compliances-Reg. 39 (3) ... May 21, 2021 19:47 Pursuant to Regulation 39(3) of the Securities...
1 Reliance Industries - Announcement under Regul... May 19, 2021 21:20 We refer to Regulation 5 of the SEBI (Prohibit...
2 Reliance Industries - Announcement under Regul... May 17, 2021 17:18 In continuation of our letter dated May 15, 20...
3 Reliance Industries - Announcement under Regul... May 17, 2021 16:06 Please find attached a media release by Relian...
4 Reliance Industries - Announcement under Regul... May 15, 2021 15:15 The Company has, on May 15, 2021, published in...
5 Reliance Industries - Compliances-Reg. 39 (3) ... May 14, 2021 19:44 Pursuant to Regulation 39(3) of the Securities...
6 Reliance Industries - Notice For Payment Of Fi... May 13, 2021 22:57 We refer to our letter dated May 01, 2021. A...
7 Reliance Industries - Announcement under Regul... May 12, 2021 21:20 We wish to inform you that the Company partici...
8 Reliance Industries - Compliances-Reg. 39 (3) ... May 12, 2021 19:39 Pursuant to Regulation 39(3) of the Securities...
9 Reliance Industries - Compliances-Reg. 39 (3) ... May 11, 2021 19:49 Pursuant to Regulation 39(3) of the Securities...

Related

Python: Getting a table in CSV from a website without a table class

I'm a newbie seeking help.
I've tried without success with the following.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.canada.ca/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds.html"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []
# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))
Result:
['table']
None
Can anyone help me with how to get this data?
Thank you so much.
The data you see on the page is loaded from an external URL. To load the data, you can use the following example:
import requests
import pandas as pd
url = "https://www.canada.ca/content/dam/ircc/documents/json/ee_rounds_123_en.json"
data = requests.get(url).json()
df = pd.DataFrame(data["rounds"])
df = df.drop(columns=["drawNumberURL", "DrawText1", "mitext"])
print(df.head(10).to_markdown(index=False))
Prints:
| drawNumber | drawDate | drawDateFull | drawName | drawSize | drawCRS | drawText2 | drawDateTime | drawCutOff | drawDistributionAsOn | dd1 | dd2 | dd3 | dd4 | dd5 | dd6 | dd7 | dd8 | dd9 | dd10 | dd11 | dd12 | dd13 | dd14 | dd15 | dd16 | dd17 | dd18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 2022-09-14 | September 14, 2022 | No Program Specified | 3,250 | 510 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | September 14, 2022 at 13:29:26 UTC | January 08, 2022 at 10:24:52 UTC | September 12, 2022 | 408 | 6,228 | 63,860 | 5,845 | 9,505 | 19,156 | 16,541 | 12,813 | 58,019 | 12,245 | 12,635 | 9,767 | 11,186 | 12,186 | 68,857 | 35,833 | 5,068 | 238,273 |
| 230 | 2022-08-31 | August 31, 2022 | No Program Specified | 2,750 | 516 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 31, 2022 at 13:55:23 UTC | April 16, 2022 at 18:24:41 UTC | August 29, 2022 | 466 | 7,224 | 63,270 | 5,554 | 9,242 | 19,033 | 16,476 | 12,965 | 58,141 | 12,287 | 12,758 | 9,796 | 11,105 | 12,195 | 68,974 | 36,001 | 5,120 | 239,196 |
| 229 | 2022-08-17 | August 17, 2022 | No Program Specified | 2,250 | 525 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 17, 2022 at 13:43:47 UTC | December 28, 2021 at 11:03:15 UTC | August 15, 2022 | 538 | 8,221 | 62,753 | 5,435 | 9,129 | 18,831 | 16,465 | 12,893 | 58,113 | 12,200 | 12,721 | 9,801 | 11,138 | 12,253 | 68,440 | 35,745 | 5,137 | 238,947 |
| 228 | 2022-08-03 | August 3, 2022 | No Program Specified | 2,000 | 533 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 03, 2022 at 15:16:24 UTC | January 06, 2022 at 14:29:50 UTC | August 2, 2022 | 640 | 8,975 | 62,330 | 5,343 | 9,044 | 18,747 | 16,413 | 12,783 | 57,987 | 12,101 | 12,705 | 9,747 | 11,117 | 12,317 | 68,325 | 35,522 | 5,145 | 238,924 |
| 227 | 2022-07-20 | July 20, 2022 | No Program Specified | 1,750 | 542 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 20, 2022 at 16:32:49 UTC | December 30, 2021 at 15:29:35 UTC | July 18, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 226 | 2022-07-06 | July 6, 2022 | No Program Specified | 1,500 | 557 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 6, 2022 at 14:34:34 UTC | November 13, 2021 at 02:20:46 UTC | July 11, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 225 | 2022-06-22 | June 22, 2022 | Provincial Nominee Program | 636 | 752 | Provincial Nominee Program | June 22, 2022 at 14:13:57 UTC | April 19, 2022 at 13:45:45 UTC | June 20, 2022 | 664 | 8,017 | 55,917 | 4,246 | 7,845 | 16,969 | 15,123 | 11,734 | 53,094 | 10,951 | 11,621 | 8,800 | 10,325 | 11,397 | 64,478 | 33,585 | 4,919 | 220,674 |
| 224 | 2022-06-08 | June 8, 2022 | Provincial Nominee Program | 932 | 796 | Provincial Nominee Program | June 08, 2022 at 14:03:28 UTC | October 18, 2021 at 17:13:17 UTC | June 6, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 223 | 2022-05-25 | May 25, 2022 | Provincial Nominee Program | 590 | 741 | Provincial Nominee Program | May 25, 2022 at 13:21:23 UTC | February 02, 2022 at 12:29:53 UTC | May 23, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 222 | 2022-05-11 | May 11, 2022 | Provincial Nominee Program | 545 | 753 | Provincial Nominee Program | May 11, 2022 at 14:08:07 UTC | December 15, 2021 at 20:32:57 UTC | May 9, 2022 | 635 | 7,193 | 52,684 | 3,749 | 7,237 | 16,027 | 14,466 | 11,205 | 50,811 | 10,484 | 11,030 | 8,393 | 9,945 | 10,959 | 62,341 | 32,590 | 4,839 | 211,093 |
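Since the question asks for CSV, here is a small sketch of the last step. Note that the counts in the output above are text with thousands separators ("3,250"), so they probably need cleaning before any arithmetic. The sample dictionary below is hypothetical: it only mimics the shape of the JSON (`{"rounds": [...]}`) with a few values copied from the printed rows, and I am assuming those fields arrive as strings.

```python
import pandas as pd

# Hypothetical sample shaped like the JSON payload ({"rounds": [...]});
# the values are copied from the first two rows of the printed output.
data = {
    "rounds": [
        {"drawNumber": "231", "drawDate": "2022-09-14", "drawSize": "3,250", "drawCRS": "510"},
        {"drawNumber": "230", "drawDate": "2022-08-31", "drawSize": "2,750", "drawCRS": "516"},
    ]
}
df = pd.DataFrame(data["rounds"])

# Strip thousands separators, then convert the text columns to integers
for col in ["drawNumber", "drawSize", "drawCRS"]:
    df[col] = df[col].str.replace(",", "", regex=False).astype(int)

# Turn the ISO date strings into real timestamps
df["drawDate"] = pd.to_datetime(df["drawDate"])

# With no path argument, to_csv returns the CSV text;
# pass a filename instead to write a file.
csv_text = df.to_csv(index=False)
print(csv_text)
```

With the real data you would build `df` from `requests.get(url).json()` as in the answer and apply the same cleaning to whichever columns you need.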

Extract date and sort rows by date

I have a dataset that includes some strings in the following forms:
Text
Jun 28, 2021 — Brendan Moore is p...
Professor of Psychology at University
Aug 24, 2019 — Chemistry (Nobel prize...
by A Craig · 2019 · Cited by 1 — Authors. ...
... 2020 | Volume 8 | Article 330Edited by:
I would like to create a new column that numbers the rows containing a date, in ascending date order.
To do so, I need to extract the part of the string that holds the date from each row, if one exists.
Something like this:
Text Numbering
Jun 28, 2021 — Brendan Moore is p... 2
Professor of Psychology at University -1
Aug 24, 2019 — Chemistry (Nobel prize... 1
by A Craig · 2019 · Cited by 1 — Authors. ... -1
... 2020 | Volume 8 | Article 330Edited by: -1
All rows that do not start with a date (in the format Jun 28, 2021 —) are assigned -1.
The first step would be to identify the pattern: xxx xx, xxxx;
then transform the date into a datetime (yyyy-mm-dd).
Once I have the date, it needs to be converted to a number and then sorted.
I am having difficulty with the last point, specifically how to keep only the dates and sort them appropriately.
The expected output would be
Text Numbering (sort by date asc)
Jun 28, 2021 — Brendan Moore is p... 2
Professor of Psychology at University -1
Aug 24, 2019 — Chemistry (Nobel prize... 1
by A Craig · 2019 · Cited by 1 — Authors. ... -1
... 2020 | Volume 8 | Article 330Edited by: -1
Mission accomplished:
import numpy as np
import pandas as pd
# Find rows that start with a date
matches = df['Text'].str.match(r'^\w+ \d+, \d{4}')
# Parse dates out of date rows
df['date'] = pd.to_datetime(df[matches]['Text'], format='%b %d, %Y', exact=False, errors='coerce')
# Assign numbering for dates
df['Numbering'] = df['date'].sort_values().groupby(np.ones(df.shape[0])).cumcount() + 1
# -1 for the non-dates
df.loc[~matches, 'Numbering'] = -1
# Cleanup
df.drop('date', axis=1, inplace=True)
Output:
>>> df
Text Numbering
0 Jun 28, 2021 - Brendan Moore is p... 2
1 Professor of Psychology at University -1
2 Aug 24, 2019 - Chemistry (Nobel prize... 1
3 by A Craig - 2019 - Cited by 1 - Authors. ... -1
4 ... 2020 | Volume 8 | Article 330Edited by: -1
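For comparison, here is a self-contained variant that gets the same numbering with `Series.rank` instead of the groupby/cumcount step. It is a sketch under my own assumptions: the sample frame is rebuilt from the question, and the regular expression only accepts a three-letter month abbreviation at the start of the text.

```python
import pandas as pd

df = pd.DataFrame({"Text": [
    "Jun 28, 2021 — Brendan Moore is p...",
    "Professor of Psychology at University",
    "Aug 24, 2019 — Chemistry (Nobel prize...",
    "by A Craig · 2019 · Cited by 1 — Authors. ...",
    "... 2020 | Volume 8 | Article 330Edited by:",
]})

# Capture a leading "Mon DD, YYYY" date; rows without one become NaN
date_text = df["Text"].str.extract(r"^([A-Z][a-z]{2} \d{1,2}, \d{4})")[0]
dates = pd.to_datetime(date_text, format="%b %d, %Y", errors="coerce")

# rank() numbers the dates from oldest to newest and leaves NaT rows as NaN,
# which are then replaced by -1
df["Numbering"] = dates.rank(method="first").fillna(-1).astype(int)
print(df)
```

`method="first"` breaks ties by row order, so two rows with the same date still get distinct numbers.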

Webscrape - Fields of different length

The current code scrapes individual fields, but I would like to map the times and the titles together.
Since the webpage does not put the time and the titles in the same class, how can this mapping be done?
Piggy-backing off this question (link): my question uses an example where the times and titles are not of equal length.
Website for reference:
https://ash.confex.com/ash/2021/webprogram/WALKS.html
Sample Expected Output:
5:00 PM-6:00 PM, ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease
5:00 PM-6:00 PM, ASH Poster Walk on Healthcare Quality Improvement
etc
import requests
from bs4 import BeautifulSoup
url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')
productlist = soup.select('div.itemtitle > a')
times = soup.select('.time')
This could be an alternative:
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = 'https://ash.confex.com/ash/2021/webprogram/WALKS.html'
res = requests.get(url)
soup = BeautifulSoup(res.content,'html.parser')
productlist = soup.select('div.itemtitle > a')
#times = soup.select('.time')
for a in productlist:
    title = a.text.strip()
    time = a.find_previous('h3').text.strip()
    date = a.find_previous('h4').text.strip()
    # resolve the relative href against the page URL
    page = urljoin(url, a['href'].strip())
    # sep="\n" puts each field on its own line
    # end="\n\n" leaves a blank line after each entry
    print(title, date, time, page, sep = "\n", end = "\n\n")
OUTPUT
ASH Poster Walk on What's Hot in Sickle Cell Disease
Wednesday, December 15, 2021
10:00 AM-11:00 AM
https://ash.confex.com/ash/2021/webprogram/Session20816.html
ASH Poster Walk on Geriatric Hematology: Selecting the Right Treatment for the Patient, Not Just the Disease
Wednesday, December 15, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20695.html
ASH Poster Walk on Healthcare Quality Improvement
Wednesday, December 15, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session21143.html
ASH Poster Walk on Natural Killer Cell-Based Immunotherapy
Wednesday, December 15, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20655.html
ASH Poster Walk on Pediatric Non-malignant Hematology Highlights
Wednesday, December 15, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20721.html
ASH Poster Walk on Clinical Trials In Progress
Thursday, December 16, 2021
10:00 AM-11:00 AM
https://ash.confex.com/ash/2021/webprogram/Session20589.html
ASH Poster Walk on Financial Toxicity in Hematologic Malignancies
Thursday, December 16, 2021
10:00 AM-11:00 AM
https://ash.confex.com/ash/2021/webprogram/Session20663.html
ASH Poster Walk on Diversity, Equity, and Inclusion in Hematologic Malignancies and Cell Therapy
Thursday, December 16, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20809.html
ASH Poster Walk on Emerging Research in Immunotherapies
Thursday, December 16, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20805.html
ASH Poster Walk on the Spectrum of Hemostasis and Thrombosis Research
Thursday, December 16, 2021
5:00 PM-6:00 PM
https://ash.confex.com/ash/2021/webprogram/Session20821.html
Try this:
content = soup.find('div', {"class": "content"})
times = content.find_all("h3")
output = []
for i,h3 in enumerate(times):
    for j in h3.next_siblings:
        if j.name:
            if j.name == "h3":
                break
            j = j.text.replace('\n', '')
            output.append(f"{times[i].text}, {j}")
print(output)
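To get exactly the "time, title" lines from the question, one option is to start from each title link and look backwards for the nearest time heading. The HTML below is a made-up fragment, not the real page: I am assuming the times sit in `h3` tags that precede the `div.itemtitle` entries they apply to, and the session names and hrefs are placeholders.

```python
from bs4 import BeautifulSoup

# Made-up fragment: one time slot with a single title, one with two
html = """
<div class="content">
  <h4 class="date">Wednesday, December 15, 2021</h4>
  <h3 class="time">10:00 AM-11:00 AM</h3>
  <div class="item"><div class="itemtitle"><a href="Session1.html">Walk A</a></div></div>
  <h3 class="time">5:00 PM-6:00 PM</h3>
  <div class="item"><div class="itemtitle"><a href="Session2.html">Walk B</a></div></div>
  <div class="item"><div class="itemtitle"><a href="Session3.html">Walk C</a></div></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for a in soup.select("div.itemtitle > a"):
    # find_previous walks backwards through the document,
    # so each title picks up the closest h3 above it
    time_text = a.find_previous("h3").get_text(strip=True)
    rows.append(f"{time_text}, {a.get_text(strip=True)}")

print("\n".join(rows))
```

Because the lookup starts from the title, it does not matter that there are fewer time headings than titles.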

How to scrape pdf links from webpages having unchanging urls?

I am working on a project on web scraping and I am asked to scrape all the pdf links from a website:
https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=3&s .
The website has 397 pages, but every page has the same URL. Using the inspect element tool, I found that JavaScript code handles the navigation between pages, but I still cannot figure out how to run my script for all the pages.
Below is my code.
from bs4 import BeautifulSoup
import lxml
import urllib2
url = 'https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=3&s'
conn = urllib2.urlopen(url)
html = conn.read()
soup = BeautifulSoup(html)
links = soup.find_all('a')
urls=[]
for tag in links:
    link = tag.get('href',None)
    if link is not None and link.endswith('html'):
        #urls.append(link)
        purl=link
        new=urllib2.urlopen(purl)
        htm=new.read()
        sp=BeautifulSoup(htm)
        nl=sp.find_all('a')
        nm=sp.find_all('iframe')
        for i in nl:
            q=i.get('href',None)
            title=i.get('title',None)
            if q is not None and q.endswith('pdf'):
                print(q)
                urls.append(q)
        for j in nm:
            z=j.get('src',None)
            title=j.get('title',None)
            if z is not None and z.endswith('pdf')and title is not None:
                print(z)
                print(title)
                urls.append(z)
print(len(urls))
You can use their API located at https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp to load the data.
For example:
from bs4 import BeautifulSoup
import requests
api_url = 'https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp'
payload = {
'nextValue': "1",
'next': "n",
'search': "",
'fromDate': "",
'toDate': "",
'fromYear': "",
'toYear': "",
'deptId': "",
'sid': "3",
'ssid': "-1",
'smid': "0",
'intmid': "-1",
'sText': "Filings",
'ssText': "-- All Sub Section --",
'smText': "",
'doDirect': "1",
}
page = 0
while True:
    print('Page {}...'.format(page))
    payload['doDirect'] = page
    soup = BeautifulSoup(requests.post(api_url, data=payload).content, 'html.parser')
    rows = soup.select('tr:has(td)')
    if not rows:
        break
    for tr in rows:
        row = [td.get_text(strip=True) for td in tr.select('td')] + [tr.a['href']]
        print(*row, sep='\t')
    page += 1
Prints:
...
Page 1...
Jun 25, 2020 Mindspace Business Parks REIT – Addendum to Draft Prospectus https://www.sebi.gov.in/filings/reit-issues/jun-2020/mindspace-business-parks-reit-addendum-to-draft-prospectus_46928.html
Jun 25, 2020 Amrit Corp. Ltd. - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/amrit-corp-ltd-public-announcement_46927.html
Jun 24, 2020 NIIT Technologies Buyback - Post Buyback - Public Advertisement https://www.sebi.gov.in/filings/buybacks/jun-2020/niit-technologies-buyback-post-buyback-public-advertisement_46923.html
Jun 23, 2020 Addendum to Letter of Offer of Arvind Fashions Limited https://www.sebi.gov.in/filings/rights-issues/jun-2020/addendum-to-letter-of-offer-of-arvind-fashions-limited_46941.html
Jun 23, 2020 Genesis Exports Limited - Draft letter of Offer https://www.sebi.gov.in/filings/buybacks/jun-2020/genesis-exports-limited-draft-letter-of-offer_46911.html
Jun 23, 2020 Genesis Exports Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/genesis-exports-limited-public-announcement_46909.html
Jun 19, 2020 Coral India Finance and Housing Limited – Post Buy-back Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/coral-india-finance-and-housing-limited-post-buy-back-public-announcement_46900.html
Jun 19, 2020 Network Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/network-limited_46890.html
Jun 17, 2020 KSOLVES INDIA LIMITED https://www.sebi.gov.in/filings/public-issues/jun-2020/ksolves-india-limited_46996.html
Jun 10, 2020 Happiest Minds Technologies Limited https://www.sebi.gov.in/filings/public-issues/jun-2020/happiest-minds-technologies-limited_46843.html
Jun 08, 2020 IM+ Capitals Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/im-capitals-limited_46786.html
Jun 05, 2020 HealthCare Global Enterprises Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/healthcare-global-enterprises-limited_46773.html
Jun 02, 2020 Jaikumar Constructions Ltd. - DRHP https://www.sebi.gov.in/filings/public-issues/jun-2020/jaikumar-constructions-ltd-drhp_46774.html
Jun 02, 2020 Mahindra Focused Equity Yojana https://www.sebi.gov.in/filings/mutual-funds/jun-2020/mahindra-focused-equity-yojana_46767.html
Jun 02, 2020 GRANULES INDIA LIMITED - Dispatch advertisement https://www.sebi.gov.in/filings/buybacks/jun-2020/granules-india-limited-dispatch-advertisement_46765.html
Jun 02, 2020 GRANULES INDIA LIMITED - Letter of Offer https://www.sebi.gov.in/filings/buybacks/jun-2020/granules-india-limited-letter-of-offer_46764.html
Jun 02, 2020 Motilal Oswal Multi Asset Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/motilal-oswal-multi-asset-fund_46762.html
Jun 02, 2020 Principal Large Cap Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/principal-large-cap-fund_46761.html
Jun 02, 2020 Mahindra Arbitrage Yojana https://www.sebi.gov.in/filings/mutual-funds/jun-2020/mahindra-arbitrage-yojana_46760.html
Jun 02, 2020 HSBC Mid Cap Equity Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/hsbc-mid-cap-equity-fund_46759.html
Jun 01, 2020 Tanla Solutions Limited - DLOF https://www.sebi.gov.in/filings/buybacks/jun-2020/tanla-solutions-limited-dlof_46750.html
Jun 01, 2020 Axis Banking ETF https://www.sebi.gov.in/filings/mutual-funds/jun-2020/axis-banking-etf_46748.html
Jun 01, 2020 Kalpataru Power Transmission Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/kalpataru-power-transmission-limited-public-announcement_46746.html
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 22, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-22-2020_46745.html
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 19, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-19-2020_46744.html
Page 2...
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 18, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-18-2020_46743.html
May 29, 2020 Muthoottu Mini Financiers Limited- Prospectus https://www.sebi.gov.in/filings/debt-offer-document/may-2020/muthoottu-mini-financiers-limited-prospectus_46769.html
May 29, 2020 Coral India Housing and Finance Limited - Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/coral-india-housing-and-finance-limited-advertisement_46732.html
May 29, 2020 TANLA SOLUTIONS LIMITED - Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/tanla-solutions-limited-public-announcement_46731.html
May 28, 2020 Tips Industries Limited - Dispatch Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/tips-industries-limited-dispatch-advertisement_46723.html
May 27, 2020 KLM Axiva Finvest Limited - Prospectus https://www.sebi.gov.in/filings/debt-offer-document/may-2020/klm-axiva-finvest-limited-prospectus_46755.html
May 26, 2020 Tips Industries Limited - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/tips-industries-limited-letter-of-offer_46708.html
May 26, 2020 Axis Capital Protection Oriented Fund - Series 7-10 https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-capital-protection-oriented-fund-series-7-10_46707.html
May 26, 2020 ICICI Prudential Alpha Low Vol 30 ETF https://www.sebi.gov.in/filings/mutual-funds/may-2020/icici-prudential-alpha-low-vol-30-etf_46706.html
May 22, 2020 NIIT Technologies Ltd. - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/niit-technologies-ltd-letter-of-offer_46700.html
May 22, 2020 NIIT Technologies Ltd. - Dispatch Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/niit-technologies-ltd-dispatch-advertisement_46699.html
May 22, 2020 Coral India Finance and Housing Limited - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/coral-india-finance-and-housing-limited-letter-of-offer_46698.html
May 22, 2020 Jay Ushin Limited https://www.sebi.gov.in/filings/takeovers/may-2020/jay-ushin-limited_46697.html
May 22, 2020 Pennar Industries - Post Buyback Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/pennar-industries-post-buyback-public-announcement_46696.html
May 22, 2020 Axis Global Equity Alpha Fund of Fund. https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-global-equity-alpha-fund-of-fund-_46695.html
May 21, 2020 Axis Global Disruption Fund of Fund https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-global-disruption-fund-of-fund_46694.html
May 18, 2020 Reliance Industries Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/reliance-industries-limited_46675.html
May 14, 2020 Public Advertisement of Spencer's Retail Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/public-advertisement-of-spencer-s-retail-limited_46693.html
May 12, 2020 Spencer's Retail Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/spencer-s-retail-limited_46692.html
May 12, 2020 Sequent Scientific Limited https://www.sebi.gov.in/filings/takeovers/may-2020/sequent-scientific-limited_46662.html
May 11, 2020 Arvind Fashions Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/arvind-fashions-limited_46659.html
May 05, 2020 JK Paper Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/jk-paper-limited-public-announcement_46647.html
May 05, 2020 Aurionpro Solutions Limited - Post BuyBack Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/aurionpro-solutions-limited-post-buyback-advertisement_46646.html
May 04, 2020 KSOLVES INDIA LIMITED https://www.sebi.gov.in/filings/public-issues/may-2020/ksolves-india-limited_46644.html
May 04, 2020 SBI ETF Consumption https://www.sebi.gov.in/filings/mutual-funds/may-2020/sbi-etf-consumption_46639.html
Page 3...
... and so on.
It seems the website makes a POST request to getnewslistinfo.jsp and gets back the new table content as HTML. You can open the Network tab (Ctrl+Shift+E in Firefox), navigate to the next page, and see the request being made along with its parameters.
You can mimic that POST request and change the appropriate parameters for the next page (from what I saw, these should be nextValue and doDirect) using urllib2 (or preferably requests). After you get the content, you can parse it with BeautifulSoup and extract the a tags the way you already did.
Also a tip: separate your code into functions that each do one thing, such as getPage(pageNum), which returns the HTML content for a given page number, and getLinks(html), which takes an HTML page and returns all the links from the table as a list. This way your code will be more readable and easier to debug and reuse.
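The split into functions suggested above could look roughly like this. It is a sketch, not tested against the live site: `get_page` and `get_links` are just snake_case versions of the names in the tip, the form fields have to be supplied by the caller (for example the payload dictionary from the other answer), and both the choice of `doDirect` as the page parameter and the `tr a[href]` selector are assumptions.

```python
import requests
from bs4 import BeautifulSoup

API_URL = "https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp"


def get_page(page_num, base_payload):
    """POST the form data for one page and return the HTML fragment."""
    # Assumption: doDirect selects the page, as in the loop shown earlier
    payload = dict(base_payload, doDirect=page_num)
    response = requests.post(API_URL, data=payload, timeout=30)
    response.raise_for_status()
    return response.text


def get_links(html):
    """Return the href of every link found inside a table row."""
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.select("tr a[href]")]


# Offline check with a made-up fragment (no request is sent here)
sample = """
<table>
  <tr><th>Date</th><th>Title</th></tr>
  <tr><td>Jun 25, 2020</td><td><a href="https://example.com/a.html">A</a></td></tr>
  <tr><td>Jun 24, 2020</td><td><a href="https://example.com/b.html">B</a></td></tr>
</table>
"""
print(get_links(sample))
```

A driver loop would then call `get_links(get_page(n, payload))` for increasing `n` and stop when the returned list is empty.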

Using Python and BeautifulSoup to scrape list from an URL

I am new to BeautifulSoup so please excuse any beginner mistakes here. I am attempting to scrape an url and want to store list of movies under one date.
Below is the code I have so far:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    mv=ul.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
I am getting "AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?"
Expected result in list or dataframe
29th May 2020 Romantic
29th May 2020 Sohreyan Da Pind Aa Gaya
5th June 2020 Lakshmi Bomb
and so on
Thanks in advance for help.
This script will get all movies and corresponding dates to a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/calendar?region=IN&ref_=rlm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out, last = [], ''
for tag in soup.select('#main h4, #main li'):
    if tag.name == 'h4':
        last = tag.get_text(strip=True)
    else:
        out.append({'Date':last, 'Movie':tag.get_text(strip=True).rsplit('(', maxsplit=1)[0]})
df = pd.DataFrame(out)
print(df)
Prints:
Date Movie
0 29 May 2020 Romantic
1 29 May 2020 Sohreyan Da Pind Aa Gaya
2 05 June 2020 Laxmmi Bomb
3 05 June 2020 Roohi Afzana
4 05 June 2020 Nikamma
.. ... ...
95 26 March 2021 Untitled Luv Ranjan Film
96 02 April 2021 F9
97 02 April 2021 Bell Bottom
98 02 April 2021 NTR Trivikiram Untitled Movie
99 09 April 2021 Manje Bistre 3
[100 rows x 2 columns]
I think you should replace "ul" with "h1" on the 10th line, and define the variable "movielist" before the loop.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
# add code here
movielist = []
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    # replace ul with h1 here
    mv=h1.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
print(movielist)
You did not define a list to receive the results, and inside the loop I changed 'ul' to the loop variable 'h1' when looking up the links.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
movielist = []
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4,h1 in zip(date,ul):
    dd_=h4.get_text()
    mv=h1.find_all('a')
    for movie in mv:
        text=movie.get_text()
        print (dd_,text)
        movielist.append((dd_,text))
The dates still do not line up with the titles, as the output below shows, so the pairing logic needs fixing.
Several titles can share one release date, so the number of dates and the number of titles do not match up. I can't help much more because I don't have the time. Have a good night.
29 May 2020
05 June 2020
07 June 2020
07 June 2020 Romantic
12 June 2020
12 June 2020 Sohreyan Da Pind Aa Gaya
18 June 2020
18 June 2020 Laxmmi Bomb
19 June 2020
19 June 2020 Roohi Afzana
25 June 2020
25 June 2020 Nikamma
26 June 2020
26 June 2020 Naandhi
02 July 2020
02 July 2020 Mandela
03 July 2020
03 July 2020 Medium Spicy
10 July 2020
10 July 2020 Out of the Blue
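One way to avoid the mismatch is to stop zipping two independent lists and instead look up the list that belongs to each date heading. The fragment below is made up for illustration: I am assuming each `h4` date is followed by a sibling `ul` holding that day's titles, which may not be how the real page is laid out.

```python
from bs4 import BeautifulSoup

# Made-up fragment: each date heading is followed by its own list
html = """
<div id="main">
  <h4>29 May 2020</h4>
  <ul>
    <li><a href="/t1">Romantic</a> (2020)</li>
    <li><a href="/t2">Sohreyan Da Pind Aa Gaya</a> (2020)</li>
  </ul>
  <h4>05 June 2020</h4>
  <ul>
    <li><a href="/t3">Laxmmi Bomb</a> (2020)</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

movielist = []
for h4 in soup.select("#main h4"):
    # Take the list that directly follows this heading,
    # rather than the n-th <ul> of the whole page
    ul = h4.find_next_sibling("ul")
    if ul is None:
        continue
    for a in ul.find_all("a"):
        movielist.append((h4.get_text(strip=True), a.get_text(strip=True)))

for date, title in movielist:
    print(date, title)
```

Restricting the search to `#main` also keeps unrelated lists, such as navigation menus, out of the pairing.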
