Beautiful Soup: parse published date of a Medium article (Python)

I need to parse the "published date" of articles on Medium using Beautiful Soup.
In a loop I successfully parsed the author, title, and reading time, but for some reason the "published date" is not working for me.
Here is an example:
https://medium.com/interlay/archive/2020
So the output of parsing should be Jun 18, 2020; Mar 5, 2020; Feb 23, 2020, etc.

The date is present inside the <time> tag of every article <div>.
Select that <time> tag and print its text.
Here is the code.
import requests
from bs4 import BeautifulSoup
url = 'https://medium.com/interlay/archive/2020'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
t = [x.text.strip() for x in soup.find_all('time')]
print(t)
['Jun 18, 2020', 'Mar 4, 2020', 'Feb 23, 2020', 'Nov 30, 2020', 'Apr 15, 2020', 'Aug 21, 2020', 'Oct 27, 2020']

import requests
from bs4 import BeautifulSoup
url='https://medium.com/interlay/archive/2020'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
You can find the main div tags by their class and loop over them to get the data from each time tag:
main_div=soup.find_all("div",class_="streamItem streamItem--postPreview js-streamItem")
for div in main_div:
    print(div.find("time").text)
Output:
Jun 18, 2020
Mar 4, 2020
Feb 23, 2020
Nov 30, 2020
Apr 15, 2020
Aug 21, 2020
Oct 27, 2020
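If you need actual date objects rather than strings, the scraped text can be parsed with the standard library's datetime.strptime. A minimal sketch, assuming the dates keep Medium's 'Jun 18, 2020' shape:

```python
from datetime import datetime

raw_dates = ['Jun 18, 2020', 'Mar 4, 2020', 'Feb 23, 2020']
# '%b %d, %Y' matches abbreviated month name, day, and 4-digit year
parsed = [datetime.strptime(d, '%b %d, %Y').date() for d in raw_dates]
print(parsed[0])  # 2020-06-18
```

With real date objects you can then sort the articles chronologically instead of by page order.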

Related

Unable to get stock data from yahoo with pandas_datareader

This is my code:
from pandas_datareader import data
start = '2015-1-1'
end = '2020-12-31'
source = 'yahoo'
google = data.DataReader('GOOG', start=start, end=end, data_source=source).reset_index()
I was using this code until last month and it was working properly; after a month I tried it again and now it throws this error:
Unable to read URL: https://finance.yahoo.com/quote/GOOG/history?period1=1420065000&period2=1609453799&interval=1d&frequency=1d&filter=history
I am not able to figure it out, can you please make me understand, why is this happening?
Yahoo! Finance has changed their structure slightly. It now requires headers on the HTTP request for data retrieval. Once those are added, it works fine.
For pandas and pandas-datareader, you'll need to upgrade them (this has already been sorted upstream). The same probably applies to all other packages that pull data from Yahoo!, such as backtrader: either upgrade them or add headers in the script that retrieves the Yahoo! data.
pip install --upgrade pandas
pip install --upgrade pandas-datareader
Have a nice day ;).
Please upgrade pandas_datareader to a version >= 0.10.0. This bug is fixed in 0.10.0 as per the release notes:
Fixed Yahoo readers which now require headers
Yahoo! Finance is working fine with pandas without any issue.
Script:
import pandas as pd
import requests
link = 'https://finance.yahoo.com/quote/GOOG/history?period1=1420065000&period2=1609453799&interval=1d&frequency=1d&filter=history'
r = requests.get(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'})
data = pd.read_html(r.text)[0]
df = pd.DataFrame(data)
df = df.iloc[0:100]
print(df)
Output:
Date Open High Low Close AdjClose Volume
Dec 31, 2020 1735.42 1758.93 1735.42 1751.88 1751.88 1011900
Dec 30, 2020 1762.01 1765.09 1725.6 1739.52 1739.52 1306100
Dec 29, 2020 1787.79 1792.44 1756.09 1758.72 1758.72 1299400
Dec 28, 2020 1751.64 1790.73 1746.33 1776.09 1776.09 1393000
Dec 24, 2020 1735 1746 1729.11 1738.85 1738.85 346800
Dec 23, 2020 1728.11 1747.99 1725.04 1732.38 1732.38 1033800
Dec 22, 2020 1734.43 1737.41 1712.57 1723.5 1723.5 936700
Dec 21, 2020 1713.51 1740.85 1699 1739.37 1739.37 1828400
Dec 18, 2020 1754.18 1755.11 1720.22 1731.01 1731.01 4016400
Dec 17, 2020 1768.51 1771.78 1738.66 1747.9 1747.9 1624700
Dec 16, 2020 1772.88 1773 1756.08 1763 1763 1513500
Dec 15, 2020 1764.42 1771.42 1749.95 1767.77 1767.77 1482300
Dec 14, 2020 1775 1797.39 1757.21 1760.06 1760.06 1600200
Dec 11, 2020 1763.06 1784.45 1760 1781.77 1781.77 1220700
Dec 10, 2020 1769.8 1781.31 1740.32 1775.33 1775.33 1362800
Dec 09, 2020 1812.01 1834.27 1767.81 1784.13 1784.13 1507600
Dec 08, 2020 1810.1 1821.9 1796.2 1818.55 1818.55 1096300
Dec 07, 2020 1819 1832.37 1805.78 1819.48 1819.48 1320900
Dec 04, 2020 1824.52 1833.16 1816.99 1827.99 1827.99 1378200
Dec 03, 2020 1824.01 1847.2 1822.65 1826.77 1826.77 1227300
Dec 02, 2020 1798.1 1835.65 1789.47 1827.95 1827.95 1222000
Dec 01, 2020 1774.37 1824.83 1769.37 1798.1 1798.1 1736900
Nov 30, 2020 1781.18 1788.06 1755 1760.74 1760.74 1823800
Nov 27, 2020 1773.09 1804 1772.44 1793.19 1793.19 884900
Nov 25, 2020 1772.89 1778.54 1756.54 1771.43 1771.43 1045800
Nov 24, 2020 1730.5 1771.6 1727.69 1768.88 1768.88 1578000
Nov 23, 2020 1749.6 1753.9 1717.72 1734.86 1734.86 2161600
Nov 20, 2020 1765.21 1774 1741.86 1742.19 1742.19 2313500
Nov 19, 2020 1738.38 1769.59 1737.01 1763.92 1763.92 1249900
Nov 18, 2020 1765.23 1773.47 1746.14 1746.78 1746.78 1173500
Nov 17, 2020 1776.94 1785 1767 1770.15 1770.15 1147100
Nov 16, 2020 1771.7 1799.07 1767.69 1781.38 1781.38 1246800
Nov 13, 2020 1757.63 1781.04 1744.55 1777.02 1777.02 1499900
Nov 12, 2020 1747.63 1768.27 1745.6 1749.84 1749.84 1247500
Nov 11, 2020 1750 1764.22 1747.36 1752.71 1752.71 1264000
Nov 10, 2020 1731.09 1763 1717.3 1740.39 1740.39 2636100
Nov 09, 2020 1790.9 1818.06 1760.02 1763 1763 2268300
Nov 06, 2020 1753.95 1772.43 1740.35 1761.75 1761.75 1660900
Nov 05, 2020 1781 1793.64 1750.51 1763.37 1763.37 2065800
Nov 04, 2020 1710.28 1771.36 1706.03 1749.13 1749.13 3570900
Nov 03, 2020 1631.78 1661.7 1616.62 1650.21 1650.21 1661700
Nov 02, 2020 1628.16 1660.77 1616.03 1626.03 1626.03 2535400
Oct 30, 2020 1672.11 1687 1604.46 1621.01 1621.01 4329100
Oct 29, 2020 1522.36 1593.71 1522.24 1567.24 1567.24 2003100
Oct 28, 2020 1559.74 1561.35 1514.62 1516.62 1516.62 1834000
Oct 27, 2020 1595.67 1606.84 1582.78 1604.26 1604.26 1229000
Oct 26, 2020 1625.01 1638.24 1576.5 1590.45 1590.45 1853300
Oct 23, 2020 1626.07 1642.36 1620.51 1641 1641 1375800
Oct 22, 2020 1593.05 1621.99 1585 1615.33 1615.33 1433600
Oct 21, 2020 1573.33 1618.73 1571.63 1593.31 1593.31 2568300
Oct 20, 2020 1527.05 1577.5 1525.67 1555.93 1555.93 2241700
Oct 19, 2020 1580.46 1588.15 1528 1534.61 1534.61 1607100
Oct 16, 2020 1565.85 1581.13 1563 1573.01 1573.01 1434700
Oct 15, 2020 1547.15 1575.1 1545.03 1559.13 1559.13 1540000
Oct 14, 2020 1578.59 1587.68 1550.53 1568.08 1568.08 1929300
Oct 13, 2020 1583.73 1590 1563.2 1571.68 1571.68 1601000
Oct 12, 2020 1543 1593.86 1532.57 1569.15 1569.15 2482600
Oct 09, 2020 1494.7 1516.52 1489.45 1515.22 1515.22 1435300
Oct 08, 2020 1465.09 1490 1465.09 1485.93 1485.93 1187800
Oct 07, 2020 1464.29 1468.96 1436 1460.29 1460.29 1746200
Oct 06, 2020 1475.58 1486.76 1448.59 1453.44 1453.44 1245400
Oct 05, 2020 1466.21 1488.21 1464.27 1486.02 1486.02 1113300
Oct 02, 2020 1462.03 1483.2 1450.92 1458.42 1458.42 1284100
Oct 01, 2020 1484.27 1499.04 1479.21 1490.09 1490.09 1779500
Sep 30, 2020 1466.8 1489.75 1459.88 1469.6 1469.6 1701600
Sep 29, 2020 1470.39 1476.66 1458.81 1469.33 1469.33 978200
Sep 28, 2020 1474.21 1476.8 1449.3 1464.52 1464.52 2007900
Sep 25, 2020 1432.63 1450 1413.34 1444.96 1444.96 1323000
Sep 24, 2020 1411.03 1443.71 1409.85 1428.29 1428.29 1450200
Sep 23, 2020 1458.78 1460.96 1407.7 1415.21 1415.21 1657400
Sep 22, 2020 1450.09 1469.52 1434.53 1465.46 1465.46 1583200
Sep 21, 2020 1440.06 1448.36 1406.55 1431.16 1431.16 2888800
Sep 18, 2020 1498.01 1503 1437.13 1459.99 1459.99 3103900
Sep 17, 2020 1496 1508.3 1470 1495.53 1495.53 1879800
Sep 16, 2020 1555.54 1562 1519.82 1520.9 1520.9 1311700
Sep 15, 2020 1536 1559.57 1531.83 1541.44 1541.44 1331100
Sep 14, 2020 1539.01 1564 1515.74 1519.28 1519.28 1696600
Sep 11, 2020 1536 1575.2 1497.36 1520.72 1520.72 1597100
Sep 10, 2020 1560.64 1584.08 1525.81 1532.02 1532.02 1618600
Sep 09, 2020 1557.53 1569 1536.05 1556.96 1556.96 1774700
Sep 08, 2020 1533.51 1563.86 1528.01 1532.39 1532.39 2610900
Sep 04, 2020 1624.26 1645.11 1547.61 1591.04 1591.04 2608600
Sep 03, 2020 1709.71 1709.71 1615.06 1641.84 1641.84 3107800
Sep 02, 2020 1673.78 1733.18 1666.33 1728.28 1728.28 2511200
Sep 01, 2020 1636.63 1665.73 1632.22 1660.71 1660.71 1825300
Aug 31, 2020 1647.89 1647.96 1630.31 1634.18 1634.18 1823400
Aug 28, 2020 1633.49 1647.17 1630.75 1644.41 1644.41 1499000
Aug 27, 2020 1653.68 1655 1625.75 1634.33 1634.33 1861600
Aug 26, 2020 1608 1659.22 1603.6 1652.38 1652.38 3993400
Aug 25, 2020 1582.07 1611.62 1582.07 1608.22 1608.22 2247100
Aug 24, 2020 1593.98 1614.17 1580.57 1588.2 1588.2 1409900
Aug 21, 2020 1577.03 1597.72 1568.01 1580.42 1580.42 1446500
Aug 20, 2020 1543.45 1585.87 1538.2 1581.75 1581.75 1706900
Aug 19, 2020 1553.31 1573.68 1543.95 1547.53 1547.53 1660600
Aug 18, 2020 1526.18 1562.47 1523.71 1558.6 1558.6 2027100
Aug 17, 2020 1514.67 1525.61 1507.97 1517.98 1517.98 1378300
Aug 14, 2020 1515.66 1521.9 1502.88 1507.73 1507.73 1354800
Aug 13, 2020 1510.34 1537.25 1508.01 1518.45 1518.45 1455200
Aug 12, 2020 1485.58 1512.39 1485.25 1506.62 1506.62 1437000
Aug 11, 2020 1492.44 1510 1478 1480.32 1480.32 1454400
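The period1 and period2 values in the history URL are Unix epoch seconds. If you need to build such a link yourself, they can be computed like this (a sketch using UTC midnight; the URL in the question was generated with a local-time offset, hence its slightly different numbers):

```python
from datetime import datetime, timezone

# epoch seconds for the start and end of the requested range, in UTC
start = int(datetime(2015, 1, 1, tzinfo=timezone.utc).timestamp())
end = int(datetime(2020, 12, 31, 23, 59, 59, tzinfo=timezone.utc).timestamp())
link = ('https://finance.yahoo.com/quote/GOOG/history'
        f'?period1={start}&period2={end}&interval=1d&frequency=1d&filter=history')
print(start, end)  # 1420070400 1609459199
```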
Yahoo Finance has decommissioned their API. Try this Python library.

extract just date from beautifulsoup result

I am trying to scrape a date from a web-site using BeautifulSoup:
how do I extract only the date-time from this? I only want : May 21, 2021 19:47
You can use this example to see how to extract the date-time from the <ctag>s:
from bs4 import BeautifulSoup
html_doc = """
<ctag class="">May 21, 2021 19:47 Source: <span>BSE</span> </ctag>
"""
soup = BeautifulSoup(html_doc, "html.parser")
for ctag in soup.find_all("ctag"):
    dt = ctag.get_text(strip=True).rsplit(maxsplit=1)[0]
    print(dt)
Prints:
May 21, 2021 19:47
Or:
for ctag in soup.find_all("ctag"):
    dt = ctag.contents[0].rsplit(maxsplit=1)[0]
    print(dt)
Or:
for ctag in soup.find_all("ctag"):
    dt = ctag.find_next(text=True).rsplit(maxsplit=1)[0]
    print(dt)
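Another option, if the surrounding markup varies, is to pull the date-time out with a regular expression instead of splitting. A sketch, assuming the 'May 21, 2021 19:47' shape shown above:

```python
import re

text = 'May 21, 2021 19:47 Source: BSE'
# abbreviated month, day, year, then HH:MM
m = re.match(r'[A-Z][a-z]{2} \d{1,2}, \d{4} \d{2}:\d{2}', text)
print(m.group(0))  # May 21, 2021 19:47
```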
EDIT: To get a DataFrame of the articles, you can do:
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.moneycontrol.com/company-notices/reliance-industries/notices/RI"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = []
for ctag in soup.select("li ctag"):
    data.append(
        {
            "title": ctag.find_next("a").get_text(strip=True),
            "date": ctag.find_next(text=True).rsplit(maxsplit=1)[0],
            "desc": ctag.find_next("p", class_="MT2").get_text(strip=True),
        }
    )
df = pd.DataFrame(data)
print(df)
Prints:
title date desc
0 Reliance Industries - Compliances-Reg. 39 (3) ... May 21, 2021 19:47 Pursuant to Regulation 39(3) of the Securities...
1 Reliance Industries - Announcement under Regul... May 19, 2021 21:20 We refer to Regulation 5 of the SEBI (Prohibit...
2 Reliance Industries - Announcement under Regul... May 17, 2021 17:18 In continuation of our letter dated May 15, 20...
3 Reliance Industries - Announcement under Regul... May 17, 2021 16:06 Please find attached a media release by Relian...
4 Reliance Industries - Announcement under Regul... May 15, 2021 15:15 The Company has, on May 15, 2021, published in...
5 Reliance Industries - Compliances-Reg. 39 (3) ... May 14, 2021 19:44 Pursuant to Regulation 39(3) of the Securities...
6 Reliance Industries - Notice For Payment Of Fi... May 13, 2021 22:57 We refer to our letter dated May 01, 2021. A...
7 Reliance Industries - Announcement under Regul... May 12, 2021 21:20 We wish to inform you that the Company partici...
8 Reliance Industries - Compliances-Reg. 39 (3) ... May 12, 2021 19:39 Pursuant to Regulation 39(3) of the Securities...
9 Reliance Industries - Compliances-Reg. 39 (3) ... May 11, 2021 19:49 Pursuant to Regulation 39(3) of the Securities...
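If you then want the date column as real timestamps for sorting or filtering, pd.to_datetime handles this format. A sketch on illustrative rows:

```python
import pandas as pd

df = pd.DataFrame({'date': ['May 21, 2021 19:47', 'May 19, 2021 21:20']})
# same '%b %d, %Y' pattern as the scraped text, plus hour and minute
df['date'] = pd.to_datetime(df['date'], format='%b %d, %Y %H:%M')
print(df['date'].dt.day.tolist())  # [21, 19]
```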

How to scrape pdf links from webpages having unchanging urls?

I am working on a project on web scraping and I am asked to scrape all the pdf links from a website:
https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=3&s .
The website has 397 pages, but every page has the same URL. I used the inspect-element tool and found that JavaScript code handles navigation to different pages, but I am still not able to figure out how to run my script for all the pages.
Below is my code.
from bs4 import BeautifulSoup
import urllib2
import lxml
url = 'https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=3&s'
conn = urllib2.urlopen(url)
html = conn.read()
soup = BeautifulSoup(html)
links = soup.find_all('a')
urls = []
for tag in links:
    link = tag.get('href', None)
    if link is not None and link.endswith('html'):
        #urls.append(link)
        purl = link
        new = urllib2.urlopen(purl)
        htm = new.read()
        sp = BeautifulSoup(htm)
        nl = sp.find_all('a')
        nm = sp.find_all('iframe')
        for i in nl:
            q = i.get('href', None)
            title = i.get('title', None)
            if q is not None and q.endswith('pdf'):
                print(q)
                urls.append(q)
        for j in nm:
            z = j.get('src', None)
            title = j.get('title', None)
            if z is not None and z.endswith('pdf') and title is not None:
                print(z)
                print(title)
                urls.append(z)
print(len(urls))
You can use their API at https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp to load the data.
For example:
import requests
from bs4 import BeautifulSoup
api_url = 'https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp'
payload = {
    'nextValue': "1",
    'next': "n",
    'search': "",
    'fromDate': "",
    'toDate': "",
    'fromYear': "",
    'toYear': "",
    'deptId': "",
    'sid': "3",
    'ssid': "-1",
    'smid': "0",
    'intmid': "-1",
    'sText': "Filings",
    'ssText': "-- All Sub Section --",
    'smText': "",
    'doDirect': "1",
}
page = 0
while True:
    print('Page {}...'.format(page))
    payload['doDirect'] = page
    soup = BeautifulSoup(requests.post(api_url, data=payload).content, 'html.parser')
    rows = soup.select('tr:has(td)')
    if not rows:
        break
    for tr in rows:
        row = [td.get_text(strip=True) for td in tr.select('td')] + [tr.a['href']]
        print(*row, sep='\t')
    page += 1
Prints:
...
Page 1...
Jun 25, 2020 Mindspace Business Parks REIT – Addendum to Draft Prospectus https://www.sebi.gov.in/filings/reit-issues/jun-2020/mindspace-business-parks-reit-addendum-to-draft-prospectus_46928.html
Jun 25, 2020 Amrit Corp. Ltd. - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/amrit-corp-ltd-public-announcement_46927.html
Jun 24, 2020 NIIT Technologies Buyback - Post Buyback - Public Advertisement https://www.sebi.gov.in/filings/buybacks/jun-2020/niit-technologies-buyback-post-buyback-public-advertisement_46923.html
Jun 23, 2020 Addendum to Letter of Offer of Arvind Fashions Limited https://www.sebi.gov.in/filings/rights-issues/jun-2020/addendum-to-letter-of-offer-of-arvind-fashions-limited_46941.html
Jun 23, 2020 Genesis Exports Limited - Draft letter of Offer https://www.sebi.gov.in/filings/buybacks/jun-2020/genesis-exports-limited-draft-letter-of-offer_46911.html
Jun 23, 2020 Genesis Exports Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/genesis-exports-limited-public-announcement_46909.html
Jun 19, 2020 Coral India Finance and Housing Limited – Post Buy-back Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/coral-india-finance-and-housing-limited-post-buy-back-public-announcement_46900.html
Jun 19, 2020 Network Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/network-limited_46890.html
Jun 17, 2020 KSOLVES INDIA LIMITED https://www.sebi.gov.in/filings/public-issues/jun-2020/ksolves-india-limited_46996.html
Jun 10, 2020 Happiest Minds Technologies Limited https://www.sebi.gov.in/filings/public-issues/jun-2020/happiest-minds-technologies-limited_46843.html
Jun 08, 2020 IM+ Capitals Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/im-capitals-limited_46786.html
Jun 05, 2020 HealthCare Global Enterprises Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/healthcare-global-enterprises-limited_46773.html
Jun 02, 2020 Jaikumar Constructions Ltd. - DRHP https://www.sebi.gov.in/filings/public-issues/jun-2020/jaikumar-constructions-ltd-drhp_46774.html
Jun 02, 2020 Mahindra Focused Equity Yojana https://www.sebi.gov.in/filings/mutual-funds/jun-2020/mahindra-focused-equity-yojana_46767.html
Jun 02, 2020 GRANULES INDIA LIMITED - Dispatch advertisement https://www.sebi.gov.in/filings/buybacks/jun-2020/granules-india-limited-dispatch-advertisement_46765.html
Jun 02, 2020 GRANULES INDIA LIMITED - Letter of Offer https://www.sebi.gov.in/filings/buybacks/jun-2020/granules-india-limited-letter-of-offer_46764.html
Jun 02, 2020 Motilal Oswal Multi Asset Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/motilal-oswal-multi-asset-fund_46762.html
Jun 02, 2020 Principal Large Cap Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/principal-large-cap-fund_46761.html
Jun 02, 2020 Mahindra Arbitrage Yojana https://www.sebi.gov.in/filings/mutual-funds/jun-2020/mahindra-arbitrage-yojana_46760.html
Jun 02, 2020 HSBC Mid Cap Equity Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/hsbc-mid-cap-equity-fund_46759.html
Jun 01, 2020 Tanla Solutions Limited - DLOF https://www.sebi.gov.in/filings/buybacks/jun-2020/tanla-solutions-limited-dlof_46750.html
Jun 01, 2020 Axis Banking ETF https://www.sebi.gov.in/filings/mutual-funds/jun-2020/axis-banking-etf_46748.html
Jun 01, 2020 Kalpataru Power Transmission Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/kalpataru-power-transmission-limited-public-announcement_46746.html
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 22, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-22-2020_46745.html
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 19, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-19-2020_46744.html
Page 2...
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 18, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-18-2020_46743.html
May 29, 2020 Muthoottu Mini Financiers Limited- Prospectus https://www.sebi.gov.in/filings/debt-offer-document/may-2020/muthoottu-mini-financiers-limited-prospectus_46769.html
May 29, 2020 Coral India Housing and Finance Limited - Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/coral-india-housing-and-finance-limited-advertisement_46732.html
May 29, 2020 TANLA SOLUTIONS LIMITED - Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/tanla-solutions-limited-public-announcement_46731.html
May 28, 2020 Tips Industries Limited - Dispatch Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/tips-industries-limited-dispatch-advertisement_46723.html
May 27, 2020 KLM Axiva Finvest Limited - Prospectus https://www.sebi.gov.in/filings/debt-offer-document/may-2020/klm-axiva-finvest-limited-prospectus_46755.html
May 26, 2020 Tips Industries Limited - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/tips-industries-limited-letter-of-offer_46708.html
May 26, 2020 Axis Capital Protection Oriented Fund - Series 7-10 https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-capital-protection-oriented-fund-series-7-10_46707.html
May 26, 2020 ICICI Prudential Alpha Low Vol 30 ETF https://www.sebi.gov.in/filings/mutual-funds/may-2020/icici-prudential-alpha-low-vol-30-etf_46706.html
May 22, 2020 NIIT Technologies Ltd. - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/niit-technologies-ltd-letter-of-offer_46700.html
May 22, 2020 NIIT Technologies Ltd. - Dispatch Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/niit-technologies-ltd-dispatch-advertisement_46699.html
May 22, 2020 Coral India Finance and Housing Limited - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/coral-india-finance-and-housing-limited-letter-of-offer_46698.html
May 22, 2020 Jay Ushin Limited https://www.sebi.gov.in/filings/takeovers/may-2020/jay-ushin-limited_46697.html
May 22, 2020 Pennar Industries - Post Buyback Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/pennar-industries-post-buyback-public-announcement_46696.html
May 22, 2020 Axis Global Equity Alpha Fund of Fund. https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-global-equity-alpha-fund-of-fund-_46695.html
May 21, 2020 Axis Global Disruption Fund of Fund https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-global-disruption-fund-of-fund_46694.html
May 18, 2020 Reliance Industries Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/reliance-industries-limited_46675.html
May 14, 2020 Public Advertisement of Spencer's Retail Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/public-advertisement-of-spencer-s-retail-limited_46693.html
May 12, 2020 Spencer's Retail Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/spencer-s-retail-limited_46692.html
May 12, 2020 Sequent Scientific Limited https://www.sebi.gov.in/filings/takeovers/may-2020/sequent-scientific-limited_46662.html
May 11, 2020 Arvind Fashions Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/arvind-fashions-limited_46659.html
May 05, 2020 JK Paper Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/jk-paper-limited-public-announcement_46647.html
May 05, 2020 Aurionpro Solutions Limited - Post BuyBack Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/aurionpro-solutions-limited-post-buyback-advertisement_46646.html
May 04, 2020 KSOLVES INDIA LIMITED https://www.sebi.gov.in/filings/public-issues/may-2020/ksolves-india-limited_46644.html
May 04, 2020 SBI ETF Consumption https://www.sebi.gov.in/filings/mutual-funds/may-2020/sbi-etf-consumption_46639.html
Page 3...
... and so on.
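Each link from the listing still points to an HTML detail page; the PDF itself usually sits in an iframe there, which can be pulled out the way the question's code attempts. A sketch (the HTML fragment below is illustrative, not copied from a real SEBI page):

```python
from bs4 import BeautifulSoup

# illustrative detail-page fragment; real pages embed the document similarly
html = '<iframe src="https://www.sebi.gov.in/sebi_data/attachdocs/doc_46928.pdf"></iframe>'
soup = BeautifulSoup(html, 'html.parser')
# keep only iframe sources that end in .pdf
pdfs = [f['src'] for f in soup.find_all('iframe') if f.get('src', '').endswith('.pdf')]
print(pdfs)
```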
It seems the website makes a POST request to getnewslistinfo.jsp and gets back the new table content as HTML. You can open your Network tab (Ctrl+Shift+E on Firefox), navigate to the next page, and see the request being made and its parameters.
You can mimic that POST request and change the appropriate parameters for the next page (from what I saw it should be nextValue and doDirect) using urllib2 (or preferably requests). After you get the content you can simply parse it with BeautifulSoup and extract the a tags the way you already did.
Also, a tip: separate your code into functions that do different things, such as getPage(pageNum), which given a page number returns the HTML content, and getLinks(html), which given an HTML page extracts all the links from the table and returns them as a list. This way your code will be more readable and easier to debug and reuse.

Using Python and BeautifulSoup to scrape list from an URL

I am new to BeautifulSoup, so please excuse any beginner mistakes here. I am attempting to scrape a URL and want to store the list of movies under each date.
Below is the code I have so far:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    mv = ul.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))
I am getting "AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?"
Expected result in list or dataframe
29th May 2020 Romantic
29th May 2020 Sohreyan Da Pind Aa Gaya
5th June 2020 Lakshmi Bomb
and so on
Thanks in advance for help.
This script will get all movies and corresponding dates to a dataframe:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = 'https://www.imdb.com/calendar?region=IN&ref_=rlm'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
out, last = [], ''
for tag in soup.select('#main h4, #main li'):
    if tag.name == 'h4':
        last = tag.get_text(strip=True)
    else:
        out.append({'Date': last, 'Movie': tag.get_text(strip=True).rsplit('(', maxsplit=1)[0]})
df = pd.DataFrame(out)
print(df)
Prints:
Date Movie
0 29 May 2020 Romantic
1 29 May 2020 Sohreyan Da Pind Aa Gaya
2 05 June 2020 Laxmmi Bomb
3 05 June 2020 Roohi Afzana
4 05 June 2020 Nikamma
.. ... ...
95 26 March 2021 Untitled Luv Ranjan Film
96 02 April 2021 F9
97 02 April 2021 Bell Bottom
98 02 April 2021 NTR Trivikiram Untitled Movie
99 09 April 2021 Manje Bistre 3
[100 rows x 2 columns]
I think you should replace "ul" with "h1" on the 10th line, and add the definition of the variable "movielist" beforehand.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
date = soup.find_all("h4")
ul = soup.find_all("ul")
# add code here
movielist = []
for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    # replace ul with h1 here
    mv = h1.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))
print(movielist)
print(movielist)
The original code didn't define the list that receives the results, and the text capture in the loop has to call find_all on 'h1' instead of 'ul':
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/calendar?region=IN&ref_=rlm")
soup = BeautifulSoup(page.content, 'html.parser')
movielist = []
date = soup.find_all("h4")
ul = soup.find_all("ul")
for h4, h1 in zip(date, ul):
    dd_ = h4.get_text()
    mv = h1.find_all('a')
    for movie in mv:
        text = movie.get_text()
        print(dd_, text)
        movielist.append((dd_, text))
The reason the dates don't match in the output is that the 'date' values retrieved look like the following, so you need to fix the logic.
There are multiple titles on the same release date, so the release dates and the number of titles don't match up. I can't help you that much because I don't have the time. Have a good night.
29 May 2020
05 June 2020
07 June 2020
07 June 2020 Romantic
12 June 2020
12 June 2020 Sohreyan Da Pind Aa Gaya
18 June 2020
18 June 2020 Laxmmi Bomb
19 June 2020
19 June 2020 Roohi Afzana
25 June 2020
25 June 2020 Nikamma
26 June 2020
26 June 2020 Naandhi
02 July 2020
02 July 2020 Mandela
03 July 2020
03 July 2020 Medium Spicy
10 July 2020
10 July 2020 Out of the Blue
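One way to address that mismatch is to collect (date, title) pairs and group them afterwards, so several titles can share one release date. A sketch on illustrative pairs:

```python
from collections import defaultdict

# illustrative (date, title) pairs as produced by the scraping loop
pairs = [('29 May 2020', 'Romantic'),
         ('29 May 2020', 'Sohreyan Da Pind Aa Gaya'),
         ('05 June 2020', 'Laxmmi Bomb')]
by_date = defaultdict(list)
for d, title in pairs:
    by_date[d].append(title)
print(dict(by_date))
```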

Extract Columns from html using Python (Beautifulsoup)

I need to extract the info from this page -http://www.investing.com/currencies/usd-brl-historical-data. I need Date, Price, Open, High, Low,Change %.
I'm new to Python so I got stuck at this step:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup=BeautifulSoup(r.content,'lxml')
g_data = soup.find_all('table', {'class':'genTbl closedTbl historicalTbl'})
d=[]
for item in g_data:
    Table_Values = item.find_all('tr')
    N = len(Table_Values) - 1
    for n in range(N):
        k = item.find_all('td', {'class': 'first left bold noWrap'})[n].text
        print(item.find_all('td', {'class': 'first left bold noWrap'})[n].text)
Here I have several problems:
The Price column cell can be tagged with class 'redFont' or 'greenFont'. How can I specify that I want items tagged with class 'redFont' and/or 'greenFont'? Change % can also have class redFont or greenFont, while the other columns use plain td tags. How can I extract them?
Is there a way to extract whole columns from the table?
Ideally I would like to have a dataframe with columns Date, Price, Open, High, Low, Change %.
Thanks
I have already answered how to parse the table from that site here, but since you want a DataFrame, just use pandas.read_html:
import requests
import pandas as pd
url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
df = pd.read_html(r.content, attrs={'id': 'curr_table'})[0]
Which will give you:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3609 3.4411 3.4465 3.3584 -2.36%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
You can generally pass the URL directly to read_html, but we get a 403 error for this particular site using urllib2 (the lib read_html relies on), so we need requests to fetch the HTML.
Here's a way to convert the html table into a nested list
The solution is to find the specific table, then loop through each tr in the table, creating a sublist of the text of all the items inside that tr. The code to do this is a nested list comprehension.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
#first row is empty
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
pprint(tableRows)
This gets all the data from the table
[['Jun 08, 2016', '3.3614', '3.4411', '3.4465', '3.3584', '-2.34%'],
['Jun 07, 2016', '3.4421', '3.4885', '3.5141', '3.4401', '-1.36%'],
['Jun 06, 2016', '3.4896', '3.5265', '3.5295', '3.4840', '-1.09%'],
['Jun 05, 2016', '3.5280', '3.5280', '3.5280', '3.5280', '0.11%'],
['Jun 03, 2016', '3.5240', '3.5910', '3.5947', '3.5212', '-1.91%'],
['Jun 02, 2016', '3.5926', '3.6005', '3.6157', '3.5765', '-0.22%'],
['Jun 01, 2016', '3.6007', '3.6080', '3.6363', '3.5755', '-0.29%'],
['May 31, 2016', '3.6111', '3.5700', '3.6383', '3.5534', '1.11%'],
['May 30, 2016', '3.5713', '3.6110', '3.6167', '3.5675', '-1.11%'],
['May 27, 2016', '3.6115', '3.5824', '3.6303', '3.5792', '0.81%'],
['May 26, 2016', '3.5825', '3.5826', '3.5857', '3.5757', '-0.03%'],
['May 25, 2016', '3.5836', '3.5702', '3.6218', '3.5511', '0.34%'],
['May 24, 2016', '3.5713', '3.5717', '3.5903', '3.5417', '-0.04%'],
['May 23, 2016', '3.5728', '3.5195', '3.5894', '3.5121', '1.49%'],
['May 20, 2016', '3.5202', '3.5633', '3.5663', '3.5154', '-1.24%'],
['May 19, 2016', '3.5644', '3.5668', '3.6197', '3.5503', '-0.11%'],
['May 18, 2016', '3.5683', '3.4877', '3.5703', '3.4854', '2.28%'],
['May 17, 2016', '3.4888', '3.4990', '3.5300', '3.4812', '-0.32%'],
['May 16, 2016', '3.5001', '3.5309', '3.5366', '3.4944', '-0.96%'],
['May 13, 2016', '3.5340', '3.4845', '3.5345', '3.4630', '1.39%'],
['May 12, 2016', '3.4855', '3.4514', '3.5068', '3.4346', '0.95%'],
['May 11, 2016', '3.4528', '3.4755', '3.4835', '3.4389', '-0.66%'],
['May 10, 2016', '3.4758', '3.5155', '3.5173', '3.4623', '-1.15%'],
['May 09, 2016', '3.5164', '3.5010', '3.6766', '3.4906', '0.40%']]
If you want to convert it to a pandas dataframe you just need to also grab the table headings and add them
import requests
from bs4 import BeautifulSoup
import pandas
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]
#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)
print(df)
Then you'll get a dataframe that looks like this:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3596 3.4411 3.4465 3.3584 -2.40%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
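Since every scraped cell comes back as a string, a final cleanup step converts the prices to floats and strips the '%' from the change column. A sketch on one illustrative row in the shape produced above:

```python
# one row as scraped: date, four price strings, and a percentage string
row = ['Jun 08, 2016', '3.3609', '3.4411', '3.4465', '3.3584', '-2.36%']
# keep the date, cast the prices, and drop the trailing '%' before casting
cleaned = [row[0]] + [float(x) for x in row[1:5]] + [float(row[5].rstrip('%'))]
print(cleaned)  # ['Jun 08, 2016', 3.3609, 3.4411, 3.4465, 3.3584, -2.36]
```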
