I'm a newbie seeking help. I've tried the following without success.
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://www.canada.ca/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds.html"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []
# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))
Result:
['table']
None
Can anyone help me with how to get this data?
Thank you so much.
The data you see on the page is loaded from an external URL. To load the data you can use the following example:
import requests
import pandas as pd
url = "https://www.canada.ca/content/dam/ircc/documents/json/ee_rounds_123_en.json"
data = requests.get(url).json()
df = pd.DataFrame(data["rounds"])
df = df.drop(columns=["drawNumberURL", "DrawText1", "mitext"])
print(df.head(10).to_markdown(index=False))
Prints:
| drawNumber | drawDate | drawDateFull | drawName | drawSize | drawCRS | drawText2 | drawDateTime | drawCutOff | drawDistributionAsOn | dd1 | dd2 | dd3 | dd4 | dd5 | dd6 | dd7 | dd8 | dd9 | dd10 | dd11 | dd12 | dd13 | dd14 | dd15 | dd16 | dd17 | dd18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 2022-09-14 | September 14, 2022 | No Program Specified | 3,250 | 510 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | September 14, 2022 at 13:29:26 UTC | January 08, 2022 at 10:24:52 UTC | September 12, 2022 | 408 | 6,228 | 63,860 | 5,845 | 9,505 | 19,156 | 16,541 | 12,813 | 58,019 | 12,245 | 12,635 | 9,767 | 11,186 | 12,186 | 68,857 | 35,833 | 5,068 | 238,273 |
| 230 | 2022-08-31 | August 31, 2022 | No Program Specified | 2,750 | 516 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 31, 2022 at 13:55:23 UTC | April 16, 2022 at 18:24:41 UTC | August 29, 2022 | 466 | 7,224 | 63,270 | 5,554 | 9,242 | 19,033 | 16,476 | 12,965 | 58,141 | 12,287 | 12,758 | 9,796 | 11,105 | 12,195 | 68,974 | 36,001 | 5,120 | 239,196 |
| 229 | 2022-08-17 | August 17, 2022 | No Program Specified | 2,250 | 525 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 17, 2022 at 13:43:47 UTC | December 28, 2021 at 11:03:15 UTC | August 15, 2022 | 538 | 8,221 | 62,753 | 5,435 | 9,129 | 18,831 | 16,465 | 12,893 | 58,113 | 12,200 | 12,721 | 9,801 | 11,138 | 12,253 | 68,440 | 35,745 | 5,137 | 238,947 |
| 228 | 2022-08-03 | August 3, 2022 | No Program Specified | 2,000 | 533 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 03, 2022 at 15:16:24 UTC | January 06, 2022 at 14:29:50 UTC | August 2, 2022 | 640 | 8,975 | 62,330 | 5,343 | 9,044 | 18,747 | 16,413 | 12,783 | 57,987 | 12,101 | 12,705 | 9,747 | 11,117 | 12,317 | 68,325 | 35,522 | 5,145 | 238,924 |
| 227 | 2022-07-20 | July 20, 2022 | No Program Specified | 1,750 | 542 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 20, 2022 at 16:32:49 UTC | December 30, 2021 at 15:29:35 UTC | July 18, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 226 | 2022-07-06 | July 6, 2022 | No Program Specified | 1,500 | 557 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 6, 2022 at 14:34:34 UTC | November 13, 2021 at 02:20:46 UTC | July 11, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 225 | 2022-06-22 | June 22, 2022 | Provincial Nominee Program | 636 | 752 | Provincial Nominee Program | June 22, 2022 at 14:13:57 UTC | April 19, 2022 at 13:45:45 UTC | June 20, 2022 | 664 | 8,017 | 55,917 | 4,246 | 7,845 | 16,969 | 15,123 | 11,734 | 53,094 | 10,951 | 11,621 | 8,800 | 10,325 | 11,397 | 64,478 | 33,585 | 4,919 | 220,674 |
| 224 | 2022-06-08 | June 8, 2022 | Provincial Nominee Program | 932 | 796 | Provincial Nominee Program | June 08, 2022 at 14:03:28 UTC | October 18, 2021 at 17:13:17 UTC | June 6, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 223 | 2022-05-25 | May 25, 2022 | Provincial Nominee Program | 590 | 741 | Provincial Nominee Program | May 25, 2022 at 13:21:23 UTC | February 02, 2022 at 12:29:53 UTC | May 23, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 222 | 2022-05-11 | May 11, 2022 | Provincial Nominee Program | 545 | 753 | Provincial Nominee Program | May 11, 2022 at 14:08:07 UTC | December 15, 2021 at 20:32:57 UTC | May 9, 2022 | 635 | 7,193 | 52,684 | 3,749 | 7,237 | 16,027 | 14,466 | 11,205 | 50,811 | 10,484 | 11,030 | 8,393 | 9,945 | 10,959 | 62,341 | 32,590 | 4,839 | 211,093 |
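Note that the numeric-looking columns (drawSize, drawCRS and the dd1-dd18 distribution counts) come back from the JSON as strings with thousands separators. A minimal sketch (not part of the original answer, and assuming all of these fields arrive as strings) to convert them for numeric analysis:

# Sketch: strip thousands separators and cast the count columns to int,
# assuming the JSON delivers them as strings like "3,250".
num_cols = ['drawSize', 'drawCRS'] + ['dd{}'.format(i) for i in range(1, 19)]
for col in num_cols:
    df[col] = df[col].str.replace(',', '').astype(int)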
This is my code:
from pandas_datareader import data

start = '2015-1-1'
end = '2020-12-31'
source = 'yahoo'
google = data.DataReader('GOOG', start=start, end=end, data_source=source).reset_index()
I was using this code until last month and it was working properly; now it throws the following error:
Unable to read URL: https://finance.yahoo.com/quote/GOOG/history?period1=1420065000&period2=1609453799&interval=1d&frequency=1d&filter=history
I am not able to figure it out. Can anyone please help me understand why this is happening?
Yahoo! Finance has slightly changed their structure and now requires headers for data retrieval on the HTTP request. Once that is done, it works fine.
This has already been sorted in pandas and pandas-datareader, so upgrade them if you use them. For other packages that pull data from Yahoo!, such as backtrader, you will likewise need to either upgrade or add headers to the Yahoo! request yourself :).
pip install --upgrade pandas
pip install --upgrade pandas-datareader
Have a nice day ;).
Please upgrade pandas_datareader to a version >= 0.10.0. This bug is fixed in 0.10.0, as per the release notes:
Fixed Yahoo readers which now require headers
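After upgrading, the original DataReader call should work again unchanged. A minimal sketch to verify the installed version and retry it:

# Sketch: confirm the fixed version is installed, then retry the original call.
import pandas_datareader
from pandas_datareader import data

print(pandas_datareader.__version__)  # should print 0.10.0 or newer

google = data.DataReader('GOOG', start='2015-1-1', end='2020-12-31',
                         data_source='yahoo').reset_index()
print(google.head())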
Yahoo! Finance is working fine with pandas without any issue.
Script:
import pandas as pd
import requests
link = 'https://finance.yahoo.com/quote/GOOG/history?period1=1420065000&period2=1609453799&interval=1d&frequency=1d&filter=history'
r = requests.get(link, headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'})
data = pd.read_html(r.text)[0]
df = pd.DataFrame(data)
df = df.iloc[0:100]
print(df)
Output:
Date Open High Low Close AdjClose Volume
Dec 31, 2020 1735.42 1758.93 1735.42 1751.88 1751.88 1011900
Dec 30, 2020 1762.01 1765.09 1725.6 1739.52 1739.52 1306100
Dec 29, 2020 1787.79 1792.44 1756.09 1758.72 1758.72 1299400
Dec 28, 2020 1751.64 1790.73 1746.33 1776.09 1776.09 1393000
Dec 24, 2020 1735 1746 1729.11 1738.85 1738.85 346800
Dec 23, 2020 1728.11 1747.99 1725.04 1732.38 1732.38 1033800
Dec 22, 2020 1734.43 1737.41 1712.57 1723.5 1723.5 936700
Dec 21, 2020 1713.51 1740.85 1699 1739.37 1739.37 1828400
Dec 18, 2020 1754.18 1755.11 1720.22 1731.01 1731.01 4016400
Dec 17, 2020 1768.51 1771.78 1738.66 1747.9 1747.9 1624700
Dec 16, 2020 1772.88 1773 1756.08 1763 1763 1513500
Dec 15, 2020 1764.42 1771.42 1749.95 1767.77 1767.77 1482300
Dec 14, 2020 1775 1797.39 1757.21 1760.06 1760.06 1600200
Dec 11, 2020 1763.06 1784.45 1760 1781.77 1781.77 1220700
Dec 10, 2020 1769.8 1781.31 1740.32 1775.33 1775.33 1362800
Dec 09, 2020 1812.01 1834.27 1767.81 1784.13 1784.13 1507600
Dec 08, 2020 1810.1 1821.9 1796.2 1818.55 1818.55 1096300
Dec 07, 2020 1819 1832.37 1805.78 1819.48 1819.48 1320900
Dec 04, 2020 1824.52 1833.16 1816.99 1827.99 1827.99 1378200
Dec 03, 2020 1824.01 1847.2 1822.65 1826.77 1826.77 1227300
Dec 02, 2020 1798.1 1835.65 1789.47 1827.95 1827.95 1222000
Dec 01, 2020 1774.37 1824.83 1769.37 1798.1 1798.1 1736900
Nov 30, 2020 1781.18 1788.06 1755 1760.74 1760.74 1823800
Nov 27, 2020 1773.09 1804 1772.44 1793.19 1793.19 884900
Nov 25, 2020 1772.89 1778.54 1756.54 1771.43 1771.43 1045800
Nov 24, 2020 1730.5 1771.6 1727.69 1768.88 1768.88 1578000
Nov 23, 2020 1749.6 1753.9 1717.72 1734.86 1734.86 2161600
Nov 20, 2020 1765.21 1774 1741.86 1742.19 1742.19 2313500
Nov 19, 2020 1738.38 1769.59 1737.01 1763.92 1763.92 1249900
Nov 18, 2020 1765.23 1773.47 1746.14 1746.78 1746.78 1173500
Nov 17, 2020 1776.94 1785 1767 1770.15 1770.15 1147100
Nov 16, 2020 1771.7 1799.07 1767.69 1781.38 1781.38 1246800
Nov 13, 2020 1757.63 1781.04 1744.55 1777.02 1777.02 1499900
Nov 12, 2020 1747.63 1768.27 1745.6 1749.84 1749.84 1247500
Nov 11, 2020 1750 1764.22 1747.36 1752.71 1752.71 1264000
Nov 10, 2020 1731.09 1763 1717.3 1740.39 1740.39 2636100
Nov 09, 2020 1790.9 1818.06 1760.02 1763 1763 2268300
Nov 06, 2020 1753.95 1772.43 1740.35 1761.75 1761.75 1660900
Nov 05, 2020 1781 1793.64 1750.51 1763.37 1763.37 2065800
Nov 04, 2020 1710.28 1771.36 1706.03 1749.13 1749.13 3570900
Nov 03, 2020 1631.78 1661.7 1616.62 1650.21 1650.21 1661700
Nov 02, 2020 1628.16 1660.77 1616.03 1626.03 1626.03 2535400
Oct 30, 2020 1672.11 1687 1604.46 1621.01 1621.01 4329100
Oct 29, 2020 1522.36 1593.71 1522.24 1567.24 1567.24 2003100
Oct 28, 2020 1559.74 1561.35 1514.62 1516.62 1516.62 1834000
Oct 27, 2020 1595.67 1606.84 1582.78 1604.26 1604.26 1229000
Oct 26, 2020 1625.01 1638.24 1576.5 1590.45 1590.45 1853300
Oct 23, 2020 1626.07 1642.36 1620.51 1641 1641 1375800
Oct 22, 2020 1593.05 1621.99 1585 1615.33 1615.33 1433600
Oct 21, 2020 1573.33 1618.73 1571.63 1593.31 1593.31 2568300
Oct 20, 2020 1527.05 1577.5 1525.67 1555.93 1555.93 2241700
Oct 19, 2020 1580.46 1588.15 1528 1534.61 1534.61 1607100
Oct 16, 2020 1565.85 1581.13 1563 1573.01 1573.01 1434700
Oct 15, 2020 1547.15 1575.1 1545.03 1559.13 1559.13 1540000
Oct 14, 2020 1578.59 1587.68 1550.53 1568.08 1568.08 1929300
Oct 13, 2020 1583.73 1590 1563.2 1571.68 1571.68 1601000
Oct 12, 2020 1543 1593.86 1532.57 1569.15 1569.15 2482600
Oct 09, 2020 1494.7 1516.52 1489.45 1515.22 1515.22 1435300
Oct 08, 2020 1465.09 1490 1465.09 1485.93 1485.93 1187800
Oct 07, 2020 1464.29 1468.96 1436 1460.29 1460.29 1746200
Oct 06, 2020 1475.58 1486.76 1448.59 1453.44 1453.44 1245400
Oct 05, 2020 1466.21 1488.21 1464.27 1486.02 1486.02 1113300
Oct 02, 2020 1462.03 1483.2 1450.92 1458.42 1458.42 1284100
Oct 01, 2020 1484.27 1499.04 1479.21 1490.09 1490.09 1779500
Sep 30, 2020 1466.8 1489.75 1459.88 1469.6 1469.6 1701600
Sep 29, 2020 1470.39 1476.66 1458.81 1469.33 1469.33 978200
Sep 28, 2020 1474.21 1476.8 1449.3 1464.52 1464.52 2007900
Sep 25, 2020 1432.63 1450 1413.34 1444.96 1444.96 1323000
Sep 24, 2020 1411.03 1443.71 1409.85 1428.29 1428.29 1450200
Sep 23, 2020 1458.78 1460.96 1407.7 1415.21 1415.21 1657400
Sep 22, 2020 1450.09 1469.52 1434.53 1465.46 1465.46 1583200
Sep 21, 2020 1440.06 1448.36 1406.55 1431.16 1431.16 2888800
Sep 18, 2020 1498.01 1503 1437.13 1459.99 1459.99 3103900
Sep 17, 2020 1496 1508.3 1470 1495.53 1495.53 1879800
Sep 16, 2020 1555.54 1562 1519.82 1520.9 1520.9 1311700
Sep 15, 2020 1536 1559.57 1531.83 1541.44 1541.44 1331100
Sep 14, 2020 1539.01 1564 1515.74 1519.28 1519.28 1696600
Sep 11, 2020 1536 1575.2 1497.36 1520.72 1520.72 1597100
Sep 10, 2020 1560.64 1584.08 1525.81 1532.02 1532.02 1618600
Sep 09, 2020 1557.53 1569 1536.05 1556.96 1556.96 1774700
Sep 08, 2020 1533.51 1563.86 1528.01 1532.39 1532.39 2610900
Sep 04, 2020 1624.26 1645.11 1547.61 1591.04 1591.04 2608600
Sep 03, 2020 1709.71 1709.71 1615.06 1641.84 1641.84 3107800
Sep 02, 2020 1673.78 1733.18 1666.33 1728.28 1728.28 2511200
Sep 01, 2020 1636.63 1665.73 1632.22 1660.71 1660.71 1825300
Aug 31, 2020 1647.89 1647.96 1630.31 1634.18 1634.18 1823400
Aug 28, 2020 1633.49 1647.17 1630.75 1644.41 1644.41 1499000
Aug 27, 2020 1653.68 1655 1625.75 1634.33 1634.33 1861600
Aug 26, 2020 1608 1659.22 1603.6 1652.38 1652.38 3993400
Aug 25, 2020 1582.07 1611.62 1582.07 1608.22 1608.22 2247100
Aug 24, 2020 1593.98 1614.17 1580.57 1588.2 1588.2 1409900
Aug 21, 2020 1577.03 1597.72 1568.01 1580.42 1580.42 1446500
Aug 20, 2020 1543.45 1585.87 1538.2 1581.75 1581.75 1706900
Aug 19, 2020 1553.31 1573.68 1543.95 1547.53 1547.53 1660600
Aug 18, 2020 1526.18 1562.47 1523.71 1558.6 1558.6 2027100
Aug 17, 2020 1514.67 1525.61 1507.97 1517.98 1517.98 1378300
Aug 14, 2020 1515.66 1521.9 1502.88 1507.73 1507.73 1354800
Aug 13, 2020 1510.34 1537.25 1508.01 1518.45 1518.45 1455200
Aug 12, 2020 1485.58 1512.39 1485.25 1506.62 1506.62 1437000
Aug 11, 2020 1492.44 1510 1478 1480.32 1480.32 1454400
Yahoo Finance has decommissioned their API. Try a dedicated Python library instead.
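For example, the yfinance package (an assumption on my part, since the original answer's link is not preserved here) downloads historical prices directly:

# Sketch using yfinance; whether this is the library the answer linked to is an assumption.
import yfinance as yf

df = yf.download('GOOG', start='2015-01-01', end='2020-12-31').reset_index()
print(df.head())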
I am working on a web-scraping project and have been asked to scrape all the PDF links from a website:
https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=3&s
The website has 397 pages, but every page has the same URL. Using the inspect-element tool I found that JavaScript code handles navigation to the different pages, but I still cannot figure out how to run my script for all of the pages.
Below is my code.
import urllib2
from bs4 import BeautifulSoup

url = 'https://www.sebi.gov.in/sebiweb/home/HomeAction.do?doListing=yes&sid=3&s'
conn = urllib2.urlopen(url)
html = conn.read()
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('a')
urls = []
for tag in links:
    link = tag.get('href', None)
    if link is not None and link.endswith('html'):
        #urls.append(link)
        purl = link
        new = urllib2.urlopen(purl)
        htm = new.read()
        sp = BeautifulSoup(htm, 'lxml')
        nl = sp.find_all('a')
        nm = sp.find_all('iframe')
        for i in nl:
            q = i.get('href', None)
            title = i.get('title', None)
            if q is not None and q.endswith('pdf'):
                print(q)
                urls.append(q)
        for j in nm:
            z = j.get('src', None)
            title = j.get('title', None)
            if z is not None and z.endswith('pdf') and title is not None:
                print(z)
                print(title)
                urls.append(z)
print(len(urls))
You can use their API located at https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp to load the data.
For example:
from bs4 import BeautifulSoup
import requests

api_url = 'https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp'

payload = {
    'nextValue': "1",
    'next': "n",
    'search': "",
    'fromDate': "",
    'toDate': "",
    'fromYear': "",
    'toYear': "",
    'deptId': "",
    'sid': "3",
    'ssid': "-1",
    'smid': "0",
    'intmid': "-1",
    'sText': "Filings",
    'ssText': "-- All Sub Section --",
    'smText': "",
    'doDirect': "1",
}

page = 0
while True:
    print('Page {}...'.format(page))
    payload['doDirect'] = page
    soup = BeautifulSoup(requests.post(api_url, data=payload).content, 'html.parser')
    rows = soup.select('tr:has(td)')
    if not rows:
        break
    for tr in rows:
        row = [td.get_text(strip=True) for td in tr.select('td')] + [tr.a['href']]
        print(*row, sep='\t')
    page += 1
Prints:
...
Page 1...
Jun 25, 2020 Mindspace Business Parks REIT – Addendum to Draft Prospectus https://www.sebi.gov.in/filings/reit-issues/jun-2020/mindspace-business-parks-reit-addendum-to-draft-prospectus_46928.html
Jun 25, 2020 Amrit Corp. Ltd. - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/amrit-corp-ltd-public-announcement_46927.html
Jun 24, 2020 NIIT Technologies Buyback - Post Buyback - Public Advertisement https://www.sebi.gov.in/filings/buybacks/jun-2020/niit-technologies-buyback-post-buyback-public-advertisement_46923.html
Jun 23, 2020 Addendum to Letter of Offer of Arvind Fashions Limited https://www.sebi.gov.in/filings/rights-issues/jun-2020/addendum-to-letter-of-offer-of-arvind-fashions-limited_46941.html
Jun 23, 2020 Genesis Exports Limited - Draft letter of Offer https://www.sebi.gov.in/filings/buybacks/jun-2020/genesis-exports-limited-draft-letter-of-offer_46911.html
Jun 23, 2020 Genesis Exports Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/genesis-exports-limited-public-announcement_46909.html
Jun 19, 2020 Coral India Finance and Housing Limited – Post Buy-back Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/coral-india-finance-and-housing-limited-post-buy-back-public-announcement_46900.html
Jun 19, 2020 Network Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/network-limited_46890.html
Jun 17, 2020 KSOLVES INDIA LIMITED https://www.sebi.gov.in/filings/public-issues/jun-2020/ksolves-india-limited_46996.html
Jun 10, 2020 Happiest Minds Technologies Limited https://www.sebi.gov.in/filings/public-issues/jun-2020/happiest-minds-technologies-limited_46843.html
Jun 08, 2020 IM+ Capitals Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/im-capitals-limited_46786.html
Jun 05, 2020 HealthCare Global Enterprises Limited https://www.sebi.gov.in/filings/takeovers/jun-2020/healthcare-global-enterprises-limited_46773.html
Jun 02, 2020 Jaikumar Constructions Ltd. - DRHP https://www.sebi.gov.in/filings/public-issues/jun-2020/jaikumar-constructions-ltd-drhp_46774.html
Jun 02, 2020 Mahindra Focused Equity Yojana https://www.sebi.gov.in/filings/mutual-funds/jun-2020/mahindra-focused-equity-yojana_46767.html
Jun 02, 2020 GRANULES INDIA LIMITED - Dispatch advertisement https://www.sebi.gov.in/filings/buybacks/jun-2020/granules-india-limited-dispatch-advertisement_46765.html
Jun 02, 2020 GRANULES INDIA LIMITED - Letter of Offer https://www.sebi.gov.in/filings/buybacks/jun-2020/granules-india-limited-letter-of-offer_46764.html
Jun 02, 2020 Motilal Oswal Multi Asset Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/motilal-oswal-multi-asset-fund_46762.html
Jun 02, 2020 Principal Large Cap Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/principal-large-cap-fund_46761.html
Jun 02, 2020 Mahindra Arbitrage Yojana https://www.sebi.gov.in/filings/mutual-funds/jun-2020/mahindra-arbitrage-yojana_46760.html
Jun 02, 2020 HSBC Mid Cap Equity Fund https://www.sebi.gov.in/filings/mutual-funds/jun-2020/hsbc-mid-cap-equity-fund_46759.html
Jun 01, 2020 Tanla Solutions Limited - DLOF https://www.sebi.gov.in/filings/buybacks/jun-2020/tanla-solutions-limited-dlof_46750.html
Jun 01, 2020 Axis Banking ETF https://www.sebi.gov.in/filings/mutual-funds/jun-2020/axis-banking-etf_46748.html
Jun 01, 2020 Kalpataru Power Transmission Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/jun-2020/kalpataru-power-transmission-limited-public-announcement_46746.html
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 22, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-22-2020_46745.html
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 19, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-19-2020_46744.html
Page 2...
Jun 01, 2020 Reliance Industries Limited - Addendum dated May 18, 2020 https://www.sebi.gov.in/filings/rights-issues/jun-2020/reliance-industries-limited-addendum-dated-may-18-2020_46743.html
May 29, 2020 Muthoottu Mini Financiers Limited- Prospectus https://www.sebi.gov.in/filings/debt-offer-document/may-2020/muthoottu-mini-financiers-limited-prospectus_46769.html
May 29, 2020 Coral India Housing and Finance Limited - Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/coral-india-housing-and-finance-limited-advertisement_46732.html
May 29, 2020 TANLA SOLUTIONS LIMITED - Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/tanla-solutions-limited-public-announcement_46731.html
May 28, 2020 Tips Industries Limited - Dispatch Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/tips-industries-limited-dispatch-advertisement_46723.html
May 27, 2020 KLM Axiva Finvest Limited - Prospectus https://www.sebi.gov.in/filings/debt-offer-document/may-2020/klm-axiva-finvest-limited-prospectus_46755.html
May 26, 2020 Tips Industries Limited - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/tips-industries-limited-letter-of-offer_46708.html
May 26, 2020 Axis Capital Protection Oriented Fund - Series 7-10 https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-capital-protection-oriented-fund-series-7-10_46707.html
May 26, 2020 ICICI Prudential Alpha Low Vol 30 ETF https://www.sebi.gov.in/filings/mutual-funds/may-2020/icici-prudential-alpha-low-vol-30-etf_46706.html
May 22, 2020 NIIT Technologies Ltd. - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/niit-technologies-ltd-letter-of-offer_46700.html
May 22, 2020 NIIT Technologies Ltd. - Dispatch Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/niit-technologies-ltd-dispatch-advertisement_46699.html
May 22, 2020 Coral India Finance and Housing Limited - Letter of Offer https://www.sebi.gov.in/filings/buybacks/may-2020/coral-india-finance-and-housing-limited-letter-of-offer_46698.html
May 22, 2020 Jay Ushin Limited https://www.sebi.gov.in/filings/takeovers/may-2020/jay-ushin-limited_46697.html
May 22, 2020 Pennar Industries - Post Buyback Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/pennar-industries-post-buyback-public-announcement_46696.html
May 22, 2020 Axis Global Equity Alpha Fund of Fund. https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-global-equity-alpha-fund-of-fund-_46695.html
May 21, 2020 Axis Global Disruption Fund of Fund https://www.sebi.gov.in/filings/mutual-funds/may-2020/axis-global-disruption-fund-of-fund_46694.html
May 18, 2020 Reliance Industries Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/reliance-industries-limited_46675.html
May 14, 2020 Public Advertisement of Spencer's Retail Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/public-advertisement-of-spencer-s-retail-limited_46693.html
May 12, 2020 Spencer's Retail Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/spencer-s-retail-limited_46692.html
May 12, 2020 Sequent Scientific Limited https://www.sebi.gov.in/filings/takeovers/may-2020/sequent-scientific-limited_46662.html
May 11, 2020 Arvind Fashions Limited https://www.sebi.gov.in/filings/rights-issues/may-2020/arvind-fashions-limited_46659.html
May 05, 2020 JK Paper Limited - Public Announcement https://www.sebi.gov.in/filings/buybacks/may-2020/jk-paper-limited-public-announcement_46647.html
May 05, 2020 Aurionpro Solutions Limited - Post BuyBack Advertisement https://www.sebi.gov.in/filings/buybacks/may-2020/aurionpro-solutions-limited-post-buyback-advertisement_46646.html
May 04, 2020 KSOLVES INDIA LIMITED https://www.sebi.gov.in/filings/public-issues/may-2020/ksolves-india-limited_46644.html
May 04, 2020 SBI ETF Consumption https://www.sebi.gov.in/filings/mutual-funds/may-2020/sbi-etf-consumption_46639.html
Page 3...
... and so on.
It seems the website is making a POST request to getnewslistinfo.jsp and getting back the new table content as HTML. You can open the Network tab (Ctrl+Shift+E in Firefox), navigate to the next page, and see the request being made and its parameters.
You can mimic that POST request and change the appropriate parameters for the next page (from what I saw it should be nextValue and doDirect) using urllib2 (or preferably requests). After you get the content you can simply parse it with BeautifulSoup and extract the a tags the way you already did.
Also a tip: you should separate your code into functions that do different things, such as getPage(pageNum), which returns the HTML content for a given page number, and getLinks(html), which takes an HTML page and returns all the links from the table as a list. This way your code will be more readable and easier to debug and reuse; a minimal sketch follows below.
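A sketch of that structure, using requests (the payload fields here are assumptions based on the getnewslistinfo.jsp endpoint shown in the other answer):

import requests
from bs4 import BeautifulSoup

API_URL = 'https://www.sebi.gov.in/sebiweb/ajax/home/getnewslistinfo.jsp'

def get_page(page_num):
    # Fetch one page of listings; the payload keys mirror the other answer and are assumptions.
    payload = {'doDirect': page_num, 'sid': '3', 'next': 'n'}
    return requests.post(API_URL, data=payload).text

def get_links(html):
    # Collect every href from the listing table rows as a list.
    soup = BeautifulSoup(html, 'html.parser')
    return [a['href'] for a in soup.select('tr td a[href]')]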
I am scraping lists of US presidents using Beautiful Soup and requests. I want to scrape both dates, i.e. the start and the end of each presidency, and for some reason it's showing a "list index out of range" error. I'll provide the link so you can understand better.
Website link: https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States
from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
page_soup = BeautifulSoup(page_html, 'html.parser')
containers = page_soup.find_all('table', class_='wikitable')
#print(containers[0])
#print(len(containers))
#print(soup.prettify(containers[0]))
container = containers[0]
date = container.find_all('span', attrs={'class': 'date'})
#print(len(date))
#print(date[0].text)
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    print(date_container[0].text)
The find_all function can return an empty list, which is what leads to your error: not every table contains a span with class "date", so date_container[0] raises IndexError. You can simply guard against this:
all_dates = []
for container in containers:
    date_container = container.find_all('span', attrs={'class': 'date'})
    all_dates.extend([date.text for date in date_container])
Since your last lines of code already store all the date spans from the first "wikitable" table, you can use a list comprehension:
date = [x.text for x in container.find_all('span', attrs={'class': 'date'})]
print(date)
Which will print:
['April 30, 1789', 'March 4, 1797', 'March 4, 1797', 'March 4, 1801', 'March 4, 1801'...
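Since each presidency contributes its start date followed by its end date, you can pair them up by zipping alternating elements (a sketch that assumes the spans keep that strict order):

# Sketch: pair alternating dates into (start, end) tuples, assuming strict start/end order.
terms = list(zip(date[::2], date[1::2]))
print(terms[:2])  # [('April 30, 1789', 'March 4, 1797'), ('March 4, 1797', 'March 4, 1801')]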
Since it has <table> tags, have you considered using pandas' .read_html()? It uses BeautifulSoup under the hood and takes a lot of the work out, putting the data straight into a dataframe for you. The only work then needed is any manipulation or cleanup/filtering:
import pandas as pd
import re
my_url = 'https://en.wikipedia.org/wiki/List_of_presidents_of_the_United_States'
# Returns a list of dataframes
dfs = pd.read_html(my_url)
# Get the specific dataframe with the desired columns
df = dfs[1].iloc[:,[1,3]]
# Rename the columns
df.columns = ['Date','Name']
# Split the date column into start and end dates and drop the date column
df[['Start','End']] = df.Date.str.split('–', expand=True)
df = df.drop('Date',axis=1)
# Clean up the name column using regex to pull out the name
df['Name'] = [re.match(r'.+?(?=\d)', x)[0].strip().split('Born')[0] for x in df['Name']]
# Drop duplicate rows
df.drop_duplicates(inplace = True)
print (df)
Output:
print (df.to_string())
Name Start End
0 George Washington April 30, 1789[d] March 4, 1797
1 John Adams March 4, 1797 March 4, 1801
2 Thomas Jefferson March 4, 1801 March 4, 1809
3 James Madison March 4, 1809 March 4, 1817
4 James Monroe March 4, 1817 March 4, 1825
5 John Quincy Adams March 4, 1825 March 4, 1829
6 Andrew Jackson March 4, 1829 March 4, 1837
7 Martin Van Buren March 4, 1837 March 4, 1841
8 William Henry Harrison March 4, 1841 April 4, 1841(Died in office)
9 John Tyler April 4, 1841[i] March 4, 1845
10 James K. Polk March 4, 1845 March 4, 1849
11 Zachary Taylor March 4, 1849 July 9, 1850(Died in office)
12 Millard Fillmore July 9, 1850[k] March 4, 1853
13 Franklin Pierce March 4, 1853 March 4, 1857
14 James Buchanan March 4, 1857 March 4, 1861
15 Abraham Lincoln March 4, 1861 April 15, 1865(Assassinated)
16 Andrew Johnson April 15, 1865 March 4, 1869
17 Ulysses S. Grant March 4, 1869 March 4, 1877
18 Rutherford B. Hayes March 4, 1877 March 4, 1881
19 James A. Garfield March 4, 1881 September 19, 1881(Assassinated)
20 Chester A. Arthur September 19, 1881[n] March 4, 1885
21 Grover Cleveland March 4, 1885 March 4, 1889
22 Benjamin Harrison March 4, 1889 March 4, 1893
23 Grover Cleveland March 4, 1893 March 4, 1897
24 William McKinley March 4, 1897 September 14, 1901(Assassinated)
25 Theodore Roosevelt September 14, 1901 March 4, 1909
26 William Howard Taft March 4, 1909 March 4, 1913
27 Woodrow Wilson March 4, 1913 March 4, 1921
28 Warren G. Harding March 4, 1921 August 2, 1923(Died in office)
29 Calvin Coolidge August 2, 1923[o] March 4, 1929
30 Herbert Hoover March 4, 1929 March 4, 1933
31 Franklin D. Roosevelt March 4, 1933 April 12, 1945(Died in office)
32 Harry S. Truman April 12, 1945 January 20, 1953
33 Dwight D. Eisenhower January 20, 1953 January 20, 1961
34 John F. Kennedy January 20, 1961 November 22, 1963(Assassinated)
35 Lyndon B. Johnson November 22, 1963 January 20, 1969
36 Richard Nixon January 20, 1969 August 9, 1974(Resigned)
37 Gerald Ford August 9, 1974 January 20, 1977
38 Jimmy Carter January 20, 1977 January 20, 1981
39 Ronald Reagan January 20, 1981 January 20, 1989
40 George H. W. Bush January 20, 1989 January 20, 1993
41 Bill Clinton January 20, 1993 January 20, 2001
42 George W. Bush January 20, 2001 January 20, 2009
43 Barack Obama January 20, 2009 January 20, 2017
44 Donald Trump January 20, 2017 Incumbent
I need to extract the info from this page: http://www.investing.com/currencies/usd-brl-historical-data. I need Date, Price, Open, High, Low, Change %.
I'm new to Python, so I got stuck at this step:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup=BeautifulSoup(r.content,'lxml')
g_data = soup.find_all('table', {'class':'genTbl closedTbl historicalTbl'})
d=[]
for item in g_data:
Table_Values = item.find_all('tr')
N=len(Table_Values)-1
for n in range(N):
k = (item.find_all('td', {'class':'first left bold noWrap'})[n].text)
print(item.find_all('td', {'class':'first left bold noWrap'})[n].text)
Here I have several problems:
The Price column can be tagged with class 'redFont' or 'greenFont'. How can I specify that I want items tagged with either class? The Change % column can also have class 'redFont' or 'greenFont', while the other columns are plain td tags. How can I extract them?
Is there a way to extract whole columns from the table?
Ideally I would like to have a dataframe with the columns Date, Price, Open, High, Low, Change %.
Thanks
I have already answered how to parse the table from that site here, but since you want a DataFrame, just use pandas.read_html:
import requests
import pandas as pd

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
df = pd.read_html(r.content, attrs={'id': 'curr_table'})[0]
Which will give you:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3609 3.4411 3.4465 3.3584 -2.36%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
You can generally pass the URL directly to read_html, but for this particular site we get a 403 error with urllib2, which is the lib read_html uses internally, so we need requests to fetch the HTML first.
Here's a way to convert the HTML table into a nested list: find the specific table, then loop through each tr in it, creating a sublist of the text of all the td cells inside that tr. The code to do this is a nested list comprehension.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
#first row is empty
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
pprint(tableRows)
This gets all the data from the table
[['Jun 08, 2016', '3.3614', '3.4411', '3.4465', '3.3584', '-2.34%'],
['Jun 07, 2016', '3.4421', '3.4885', '3.5141', '3.4401', '-1.36%'],
['Jun 06, 2016', '3.4896', '3.5265', '3.5295', '3.4840', '-1.09%'],
['Jun 05, 2016', '3.5280', '3.5280', '3.5280', '3.5280', '0.11%'],
['Jun 03, 2016', '3.5240', '3.5910', '3.5947', '3.5212', '-1.91%'],
['Jun 02, 2016', '3.5926', '3.6005', '3.6157', '3.5765', '-0.22%'],
['Jun 01, 2016', '3.6007', '3.6080', '3.6363', '3.5755', '-0.29%'],
['May 31, 2016', '3.6111', '3.5700', '3.6383', '3.5534', '1.11%'],
['May 30, 2016', '3.5713', '3.6110', '3.6167', '3.5675', '-1.11%'],
['May 27, 2016', '3.6115', '3.5824', '3.6303', '3.5792', '0.81%'],
['May 26, 2016', '3.5825', '3.5826', '3.5857', '3.5757', '-0.03%'],
['May 25, 2016', '3.5836', '3.5702', '3.6218', '3.5511', '0.34%'],
['May 24, 2016', '3.5713', '3.5717', '3.5903', '3.5417', '-0.04%'],
['May 23, 2016', '3.5728', '3.5195', '3.5894', '3.5121', '1.49%'],
['May 20, 2016', '3.5202', '3.5633', '3.5663', '3.5154', '-1.24%'],
['May 19, 2016', '3.5644', '3.5668', '3.6197', '3.5503', '-0.11%'],
['May 18, 2016', '3.5683', '3.4877', '3.5703', '3.4854', '2.28%'],
['May 17, 2016', '3.4888', '3.4990', '3.5300', '3.4812', '-0.32%'],
['May 16, 2016', '3.5001', '3.5309', '3.5366', '3.4944', '-0.96%'],
['May 13, 2016', '3.5340', '3.4845', '3.5345', '3.4630', '1.39%'],
['May 12, 2016', '3.4855', '3.4514', '3.5068', '3.4346', '0.95%'],
['May 11, 2016', '3.4528', '3.4755', '3.4835', '3.4389', '-0.66%'],
['May 10, 2016', '3.4758', '3.5155', '3.5173', '3.4623', '-1.15%'],
['May 09, 2016', '3.5164', '3.5010', '3.6766', '3.4906', '0.40%']]
If you want to convert it to a pandas dataframe, you just need to also grab the table headings and add them:
import requests
from bs4 import BeautifulSoup
import pandas
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]
#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)
print(df)
Then you'll get a dataframe that looks like this:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3596 3.4411 3.4465 3.3584 -2.40%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%