Extract columns from HTML using Python (BeautifulSoup)

I need to extract the info from this page - http://www.investing.com/currencies/usd-brl-historical-data. I need Date, Price, Open, High, Low, Change %.
I'm new to Python, so I got stuck at this step:
import requests
from bs4 import BeautifulSoup
from datetime import datetime

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
g_data = soup.find_all('table', {'class': 'genTbl closedTbl historicalTbl'})
d = []
for item in g_data:
    Table_Values = item.find_all('tr')
    N = len(Table_Values) - 1
    for n in range(N):
        k = item.find_all('td', {'class': 'first left bold noWrap'})[n].text
        print(item.find_all('td', {'class': 'first left bold noWrap'})[n].text)
Here I have several problems:
The Price column can be tagged as <td class="redFont"> or <td class="greenFont">. How can I specify that I want items tagged with class 'redFont' and/or 'greenFont'? Change % can also have class redFont or greenFont. The other columns are tagged with a plain <td>. How can I extract them?
Is there a way to extract whole columns from the table?
Ideally I would like to have a dataframe with columns Date, Price, Open, High, Low, Change %.
Thanks
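On the class question: find_all accepts a list for class_, and a tag matches if it carries any class in the list, so the red and green cells can be grabbed in one call. A minimal sketch using the class names from the question:
import requests
from bs4 import BeautifulSoup

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# class_ accepts a list: a td matches if it has EITHER class,
# covering both red (negative) and green (positive) cells
price_cells = soup.find_all('td', class_=['redFont', 'greenFont'])
print([td.text for td in price_cells])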

I have already answered how to parse the table from that site here, but since you want a DataFrame, just use pandas.read_html:
import requests
import pandas as pd

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
df = pd.read_html(r.content, attrs={'id': 'curr_table'})[0]
Which will give you:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3609 3.4411 3.4465 3.3584 -2.36%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
You can generally pass the URL directly to read_html, but this particular site returns a 403 error to urllib2, the library read_html uses internally, so we need requests to fetch the HTML first.
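If a site blocks the default agent string even through requests, sending a browser-like User-Agent header usually helps; a hedged sketch of the same fetch with a header added:
import requests
import pandas as pd

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
# a browser-like User-Agent often gets past servers that 403 the
# default urllib/requests agent string
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
df = pd.read_html(r.content, attrs={'id': 'curr_table'})[0]
print(df.head())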

Here's a way to convert the HTML table into a nested list.
The solution is to find the specific table, then loop through each tr in the table, creating a sublist of the text of all the td items inside that tr. The code to do this is a nested list comprehension.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
#first row is empty
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
pprint(tableRows)
This gets all the data from the table
[['Jun 08, 2016', '3.3614', '3.4411', '3.4465', '3.3584', '-2.34%'],
['Jun 07, 2016', '3.4421', '3.4885', '3.5141', '3.4401', '-1.36%'],
['Jun 06, 2016', '3.4896', '3.5265', '3.5295', '3.4840', '-1.09%'],
['Jun 05, 2016', '3.5280', '3.5280', '3.5280', '3.5280', '0.11%'],
['Jun 03, 2016', '3.5240', '3.5910', '3.5947', '3.5212', '-1.91%'],
['Jun 02, 2016', '3.5926', '3.6005', '3.6157', '3.5765', '-0.22%'],
['Jun 01, 2016', '3.6007', '3.6080', '3.6363', '3.5755', '-0.29%'],
['May 31, 2016', '3.6111', '3.5700', '3.6383', '3.5534', '1.11%'],
['May 30, 2016', '3.5713', '3.6110', '3.6167', '3.5675', '-1.11%'],
['May 27, 2016', '3.6115', '3.5824', '3.6303', '3.5792', '0.81%'],
['May 26, 2016', '3.5825', '3.5826', '3.5857', '3.5757', '-0.03%'],
['May 25, 2016', '3.5836', '3.5702', '3.6218', '3.5511', '0.34%'],
['May 24, 2016', '3.5713', '3.5717', '3.5903', '3.5417', '-0.04%'],
['May 23, 2016', '3.5728', '3.5195', '3.5894', '3.5121', '1.49%'],
['May 20, 2016', '3.5202', '3.5633', '3.5663', '3.5154', '-1.24%'],
['May 19, 2016', '3.5644', '3.5668', '3.6197', '3.5503', '-0.11%'],
['May 18, 2016', '3.5683', '3.4877', '3.5703', '3.4854', '2.28%'],
['May 17, 2016', '3.4888', '3.4990', '3.5300', '3.4812', '-0.32%'],
['May 16, 2016', '3.5001', '3.5309', '3.5366', '3.4944', '-0.96%'],
['May 13, 2016', '3.5340', '3.4845', '3.5345', '3.4630', '1.39%'],
['May 12, 2016', '3.4855', '3.4514', '3.5068', '3.4346', '0.95%'],
['May 11, 2016', '3.4528', '3.4755', '3.4835', '3.4389', '-0.66%'],
['May 10, 2016', '3.4758', '3.5155', '3.5173', '3.4623', '-1.15%'],
['May 09, 2016', '3.5164', '3.5010', '3.6766', '3.4906', '0.40%']]
If you want to convert it to a pandas dataframe, you just need to also grab the table headings and add them:
import requests
from bs4 import BeautifulSoup
import pandas
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]
#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)
print(df)
Then you'll get a dataframe that looks like this:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3596 3.4411 3.4465 3.3584 -2.40%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
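Note that a frame built this way holds every column as a string (read_html may already infer the numeric ones); if you need real dtypes, a conversion pass along these lines works. A minimal sketch on one row shaped like the output above:
import pandas as pd

df = pd.DataFrame({'Date': ['Jun 08, 2016'], 'Price': ['3.3596'],
                   'Open': ['3.4411'], 'High': ['3.4465'],
                   'Low': ['3.3584'], 'Change %': ['-2.40%']})

# parse 'Mon DD, YYYY' dates and cast the price columns to float
df['Date'] = pd.to_datetime(df['Date'], format='%b %d, %Y')
for col in ['Price', 'Open', 'High', 'Low']:
    df[col] = df[col].astype(float)
# drop the trailing '%' so the change is a plain float percentage
df['Change %'] = df['Change %'].str.rstrip('%').astype(float)
print(df.dtypes)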

Related

Beautiful Soup: parse published date of article on Medium (Python)

I need to parse the "published date" of articles on Medium using Beautiful Soup.
I successfully parsed the author, title, and reading time in a loop, but for some reason the "published date" is not working for me.
Here is an example:
https://medium.com/interlay/archive/2020
The output of parsing should be Jun 18, 2020; Mar 5, 2020; Feb 23, 2020; etc.
The date is present inside the <time> tag of every article <div>.
Select that <time> tag and print its text.
Here is the code:
import requests
from bs4 import BeautifulSoup
url = 'https://medium.com/interlay/archive/2020'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
t = [x.text.strip() for x in soup.find_all('time')]
print(t)
['Jun 18, 2020', 'Mar 4, 2020', 'Feb 23, 2020', 'Nov 30, 2020', 'Apr 15, 2020', 'Aug 21, 2020', 'Oct 27, 2020']
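If you need actual date objects rather than strings, datetime.strptime handles the 'Mon D, YYYY' format shown above; a small sketch on a few of those values:
from datetime import datetime

dates = ['Jun 18, 2020', 'Mar 4, 2020', 'Feb 23, 2020']
# %b = abbreviated month, %d also accepts non-padded days, %Y = 4-digit year
parsed = [datetime.strptime(d, '%b %d, %Y') for d in dates]
print(parsed)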
import requests
from bs4 import BeautifulSoup

url = 'https://medium.com/interlay/archive/2020'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
You can find the main div tag by its class and loop over it to get the data from each time tag:
main_div = soup.find_all("div", class_="streamItem streamItem--postPreview js-streamItem")
for div in main_div:
    print(div.find("time").text)
Output:
Jun 18, 2020
Mar 4, 2020
Feb 23, 2020
Nov 30, 2020
Apr 15, 2020
Aug 21, 2020
Oct 27, 2020

Unable to get stock data from yahoo with pandas_datareader

This is my code:
from pandas_datareader import data

start = '2015-1-1'
end = '2020-12-31'
source = 'yahoo'
google = data.DataReader('GOOG', start=start, end=end, data_source=source).reset_index()
I was using this code until last month and it was working properly; now it throws this error:
Unable to read URL: https://finance.yahoo.com/quote/GOOG/history?period1=1420065000&period2=1609453799&interval=1d&frequency=1d&filter=history
I am not able to figure it out; can you please help me understand why this is happening?
Yahoo! Finance has slightly changed their structure: it now requires headers on the HTTP request for data retrieval. Once that's done it works fine.
Upgrade pandas and pandas-datareader if you use them (this has already been sorted there). Other packages that pull data from Yahoo!, such as backtrader, will likewise need either an upgrade or headers added to their Yahoo! scripts :).
pip install --upgrade pandas
pip install --upgrade pandas-datareader
Have a nice day ;).
Please upgrade pandas_datareader to a version >= 0.10.0. This bug is fixed in 0.10.0, as per the release notes:
Fixed Yahoo readers which now require headers
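After the upgrade, the original DataReader call from the question should work unchanged; a quick check, assuming pandas-datareader >= 0.10.0 is installed:
from pandas_datareader import data

# works again once the reader sends the headers Yahoo now requires
google = data.DataReader('GOOG', start='2015-1-1', end='2020-12-31',
                         data_source='yahoo').reset_index()
print(google.head())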
Yahoo! Finance is working fine with pandas without any issue.
Script:
import pandas as pd
import requests

link = 'https://finance.yahoo.com/quote/GOOG/history?period1=1420065000&period2=1609453799&interval=1d&frequency=1d&filter=history'
r = requests.get(link, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'})
data = pd.read_html(r.text)[0]
df = pd.DataFrame(data)
df = df.iloc[0:100]
print(df)
Output:
Date Open High Low Close AdjClose Volume
Dec 31, 2020 1735.42 1758.93 1735.42 1751.88 1751.88 1011900
Dec 30, 2020 1762.01 1765.09 1725.6 1739.52 1739.52 1306100
Dec 29, 2020 1787.79 1792.44 1756.09 1758.72 1758.72 1299400
Dec 28, 2020 1751.64 1790.73 1746.33 1776.09 1776.09 1393000
Dec 24, 2020 1735 1746 1729.11 1738.85 1738.85 346800
Dec 23, 2020 1728.11 1747.99 1725.04 1732.38 1732.38 1033800
Dec 22, 2020 1734.43 1737.41 1712.57 1723.5 1723.5 936700
Dec 21, 2020 1713.51 1740.85 1699 1739.37 1739.37 1828400
Dec 18, 2020 1754.18 1755.11 1720.22 1731.01 1731.01 4016400
Dec 17, 2020 1768.51 1771.78 1738.66 1747.9 1747.9 1624700
Dec 16, 2020 1772.88 1773 1756.08 1763 1763 1513500
Dec 15, 2020 1764.42 1771.42 1749.95 1767.77 1767.77 1482300
Dec 14, 2020 1775 1797.39 1757.21 1760.06 1760.06 1600200
Dec 11, 2020 1763.06 1784.45 1760 1781.77 1781.77 1220700
Dec 10, 2020 1769.8 1781.31 1740.32 1775.33 1775.33 1362800
Dec 09, 2020 1812.01 1834.27 1767.81 1784.13 1784.13 1507600
Dec 08, 2020 1810.1 1821.9 1796.2 1818.55 1818.55 1096300
Dec 07, 2020 1819 1832.37 1805.78 1819.48 1819.48 1320900
Dec 04, 2020 1824.52 1833.16 1816.99 1827.99 1827.99 1378200
Dec 03, 2020 1824.01 1847.2 1822.65 1826.77 1826.77 1227300
Dec 02, 2020 1798.1 1835.65 1789.47 1827.95 1827.95 1222000
Dec 01, 2020 1774.37 1824.83 1769.37 1798.1 1798.1 1736900
Nov 30, 2020 1781.18 1788.06 1755 1760.74 1760.74 1823800
Nov 27, 2020 1773.09 1804 1772.44 1793.19 1793.19 884900
Nov 25, 2020 1772.89 1778.54 1756.54 1771.43 1771.43 1045800
Nov 24, 2020 1730.5 1771.6 1727.69 1768.88 1768.88 1578000
Nov 23, 2020 1749.6 1753.9 1717.72 1734.86 1734.86 2161600
Nov 20, 2020 1765.21 1774 1741.86 1742.19 1742.19 2313500
Nov 19, 2020 1738.38 1769.59 1737.01 1763.92 1763.92 1249900
Nov 18, 2020 1765.23 1773.47 1746.14 1746.78 1746.78 1173500
Nov 17, 2020 1776.94 1785 1767 1770.15 1770.15 1147100
Nov 16, 2020 1771.7 1799.07 1767.69 1781.38 1781.38 1246800
Nov 13, 2020 1757.63 1781.04 1744.55 1777.02 1777.02 1499900
Nov 12, 2020 1747.63 1768.27 1745.6 1749.84 1749.84 1247500
Nov 11, 2020 1750 1764.22 1747.36 1752.71 1752.71 1264000
Nov 10, 2020 1731.09 1763 1717.3 1740.39 1740.39 2636100
Nov 09, 2020 1790.9 1818.06 1760.02 1763 1763 2268300
Nov 06, 2020 1753.95 1772.43 1740.35 1761.75 1761.75 1660900
Nov 05, 2020 1781 1793.64 1750.51 1763.37 1763.37 2065800
Nov 04, 2020 1710.28 1771.36 1706.03 1749.13 1749.13 3570900
Nov 03, 2020 1631.78 1661.7 1616.62 1650.21 1650.21 1661700
Nov 02, 2020 1628.16 1660.77 1616.03 1626.03 1626.03 2535400
Oct 30, 2020 1672.11 1687 1604.46 1621.01 1621.01 4329100
Oct 29, 2020 1522.36 1593.71 1522.24 1567.24 1567.24 2003100
Oct 28, 2020 1559.74 1561.35 1514.62 1516.62 1516.62 1834000
Oct 27, 2020 1595.67 1606.84 1582.78 1604.26 1604.26 1229000
Oct 26, 2020 1625.01 1638.24 1576.5 1590.45 1590.45 1853300
Oct 23, 2020 1626.07 1642.36 1620.51 1641 1641 1375800
Oct 22, 2020 1593.05 1621.99 1585 1615.33 1615.33 1433600
Oct 21, 2020 1573.33 1618.73 1571.63 1593.31 1593.31 2568300
Oct 20, 2020 1527.05 1577.5 1525.67 1555.93 1555.93 2241700
Oct 19, 2020 1580.46 1588.15 1528 1534.61 1534.61 1607100
Oct 16, 2020 1565.85 1581.13 1563 1573.01 1573.01 1434700
Oct 15, 2020 1547.15 1575.1 1545.03 1559.13 1559.13 1540000
Oct 14, 2020 1578.59 1587.68 1550.53 1568.08 1568.08 1929300
Oct 13, 2020 1583.73 1590 1563.2 1571.68 1571.68 1601000
Oct 12, 2020 1543 1593.86 1532.57 1569.15 1569.15 2482600
Oct 09, 2020 1494.7 1516.52 1489.45 1515.22 1515.22 1435300
Oct 08, 2020 1465.09 1490 1465.09 1485.93 1485.93 1187800
Oct 07, 2020 1464.29 1468.96 1436 1460.29 1460.29 1746200
Oct 06, 2020 1475.58 1486.76 1448.59 1453.44 1453.44 1245400
Oct 05, 2020 1466.21 1488.21 1464.27 1486.02 1486.02 1113300
Oct 02, 2020 1462.03 1483.2 1450.92 1458.42 1458.42 1284100
Oct 01, 2020 1484.27 1499.04 1479.21 1490.09 1490.09 1779500
Sep 30, 2020 1466.8 1489.75 1459.88 1469.6 1469.6 1701600
Sep 29, 2020 1470.39 1476.66 1458.81 1469.33 1469.33 978200
Sep 28, 2020 1474.21 1476.8 1449.3 1464.52 1464.52 2007900
Sep 25, 2020 1432.63 1450 1413.34 1444.96 1444.96 1323000
Sep 24, 2020 1411.03 1443.71 1409.85 1428.29 1428.29 1450200
Sep 23, 2020 1458.78 1460.96 1407.7 1415.21 1415.21 1657400
Sep 22, 2020 1450.09 1469.52 1434.53 1465.46 1465.46 1583200
Sep 21, 2020 1440.06 1448.36 1406.55 1431.16 1431.16 2888800
Sep 18, 2020 1498.01 1503 1437.13 1459.99 1459.99 3103900
Sep 17, 2020 1496 1508.3 1470 1495.53 1495.53 1879800
Sep 16, 2020 1555.54 1562 1519.82 1520.9 1520.9 1311700
Sep 15, 2020 1536 1559.57 1531.83 1541.44 1541.44 1331100
Sep 14, 2020 1539.01 1564 1515.74 1519.28 1519.28 1696600
Sep 11, 2020 1536 1575.2 1497.36 1520.72 1520.72 1597100
Sep 10, 2020 1560.64 1584.08 1525.81 1532.02 1532.02 1618600
Sep 09, 2020 1557.53 1569 1536.05 1556.96 1556.96 1774700
Sep 08, 2020 1533.51 1563.86 1528.01 1532.39 1532.39 2610900
Sep 04, 2020 1624.26 1645.11 1547.61 1591.04 1591.04 2608600
Sep 03, 2020 1709.71 1709.71 1615.06 1641.84 1641.84 3107800
Sep 02, 2020 1673.78 1733.18 1666.33 1728.28 1728.28 2511200
Sep 01, 2020 1636.63 1665.73 1632.22 1660.71 1660.71 1825300
Aug 31, 2020 1647.89 1647.96 1630.31 1634.18 1634.18 1823400
Aug 28, 2020 1633.49 1647.17 1630.75 1644.41 1644.41 1499000
Aug 27, 2020 1653.68 1655 1625.75 1634.33 1634.33 1861600
Aug 26, 2020 1608 1659.22 1603.6 1652.38 1652.38 3993400
Aug 25, 2020 1582.07 1611.62 1582.07 1608.22 1608.22 2247100
Aug 24, 2020 1593.98 1614.17 1580.57 1588.2 1588.2 1409900
Aug 21, 2020 1577.03 1597.72 1568.01 1580.42 1580.42 1446500
Aug 20, 2020 1543.45 1585.87 1538.2 1581.75 1581.75 1706900
Aug 19, 2020 1553.31 1573.68 1543.95 1547.53 1547.53 1660600
Aug 18, 2020 1526.18 1562.47 1523.71 1558.6 1558.6 2027100
Aug 17, 2020 1514.67 1525.61 1507.97 1517.98 1517.98 1378300
Aug 14, 2020 1515.66 1521.9 1502.88 1507.73 1507.73 1354800
Aug 13, 2020 1510.34 1537.25 1508.01 1518.45 1518.45 1455200
Aug 12, 2020 1485.58 1512.39 1485.25 1506.62 1506.62 1437000
Aug 11, 2020 1492.44 1510 1478 1480.32 1480.32 1454400
Yahoo Finance has decommissioned its API. Try this Python library.

Convert Python Selenium text into a DataFrame

I have a question in regards to DataFrames. I have written code with Selenium to extract a table from a website, but I am unsure how to transform the Selenium text into a DataFrame and export it to CSV. Below is my code.
import time
import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("Path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
table = driver.find_element_by_xpath('//table[@id="inlineSearchTable"]/tbody')
while True:
    try:
        print(table.text)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
driver.quit()
If you're using Selenium, you need to get the outerHTML of the table and then use pd.read_html() to get the DataFrame.
Then append it to an empty DataFrame and export to CSV.
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
dfbase = pd.DataFrame()
while True:
    try:
        table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#inlineSearchTable"))).get_attribute("outerHTML")
        df = pd.read_html(str(table))[0]
        dfbase = dfbase.append(df, ignore_index=True)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
print(dfbase)
dfbase.to_csv("TestResultsDF.csv")
driver.quit()
Output:
Name Date Added to the List
0 24option.com Aug 6, 2013
1 3storich Aug 20, 2020
2 4XP Investments & Trading and Forex Place Ltd. Mar 15, 2012
3 6149154 Canada Inc. d.b.a. Forexcanus Aug 25, 2011
4 72Option, owned and operated by Epic Ventures ... Dec 8, 2016
5 A&L Royal Finance Inc. May 6, 2015
6 Abler Finance Sep 26, 2014
7 Accredited International / Accredited FX Mar 15, 2013
8 Aidan Trading Jan 24, 2018
9 AlfaTrade, Nemesis Capital Limited (together, ... Mar 16, 2016
10 Alma Group Co Trading Ltd. Oct 7, 2020
11 Ameron Oil and Gas Ltd. Sep 23, 2010
12 Anchor Securities Limited Aug 29, 2011
13 Anyoption Jul 8, 2013
14 Arial Trading, LLC Nov 20, 2008
15 Asia & Pacific Holdings Inc. Dec 5, 2017
16 Astercap Ltd., doing business as Broker Official Aug 31, 2018
17 Astor Capital Fund Limited (Astor) Apr 9, 2020
18 Astrofx24 Nov 19, 2019
19 Atlantic Global Asset Management Sep 12, 2017
20 Ava FX, Ava Financial Ltd. and Ava Capital Mar... Mar 15, 2012
21 Ava Trade Ltd. May 30, 2016
22 Avariz Group Nov 4, 2020
23 B.I.S. Blueport Investment Services Ltd., doin... Sep 7, 2017
24 B4Option May 3, 2017
25 Banc de Binary Ltd. Jul 29, 2013
26 BCG Invest Apr 6, 2020
27 BeFaster.fit Limited (BeFaster) Jun 22, 2020
28 Beltway M&A Oct 6, 2009
29 Best Commodity Options Aug 1, 2012
.. ... ...
301 Trade12, owned and operated by Exo Capital Mar... Mar 1, 2017
302 TradeNix Jul 30, 2020
303 TradeQuicker May 21, 2014
304 TradeRush.com Aug 6, 2013
305 Trades Capital, operated by TTN Marketing Ltd.... May 18, 2016
306 Tradewell.io Jan 20, 2020
307 TradexOption Apr 20, 2020
308 Trinidad Oil & Gas Corporation Dec 6, 2011
309 Truevalue Investment International Limited May 11, 2018
310 UK Options Mar 3, 2015
311 United Financial Commodity Group, operating as... Nov 15, 2018
312 Up & Down Marketing Limited (dba OneTwoTrade) Apr 27, 2015
313 USI-TECH Limited Dec 15, 2017
314 uTrader and Day Dream Investments Ltd. (togeth... Nov 29, 2017
315 Vision Financial Partners, LLC Feb 18, 2016
316 Vision Trading Advisors Feb 18, 2016
317 Wallis Partridge LLC Apr 24, 2014
318 Waverly M&A Jan 19, 2010
319 Wealth Capital Corp. Sep 4, 2012
320 Wentworth & Wellesley Ltd. / Wentworth & Welle... Mar 13, 2012
321 West Golden Capital Dec 1, 2010
322 World Markets Sep 22, 2020
323 WorldWide CapitalFX Feb 8, 2019
324 XForex, owned and operated by XFR Financial Lt... Jul 19, 2016
325 Xtelus Profit Nov 30, 2020
326 You Trade Holdings Limited Jun 3, 2011
327 Zen Vybe Inc. Mar 27, 2020
328 ZenithOptions Feb 12, 2016
329 Ziptradex Limited (Ziptradex) May 21, 2020
330 Zulu Trade Inc. Mar 2, 2015
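One caveat on the code above: DataFrame.append was deprecated and later removed in pandas 2.x, so on recent versions collect the per-page frames in a list and concatenate once after the loop. A sketch of the change:
import pandas as pd

frames = []
# inside the pagination loop, replace dfbase.append(df, ignore_index=True) with:
#     frames.append(df)
# then build the full table once at the end:
dfbase = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
dfbase.to_csv("TestResultsDF.csv")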

Scrape data into a dataframe with BeautifulSoup

I'm working on a project to scrape and parse data from the California lottery into a dataframe.
Here's my code so far; it produces no error but also no output:
import requests
from bs4 import BeautifulSoup as bs4

draw = 'http://www.calottery.com/play/draw-games/superlotto-plus/winning-numbers/?page=1'
page = requests.get(draw)
soup = bs4(page.text)
drawing_list = []
for table_row in soup.select("table.tag_even_numbers tr"):
    cells = table_row.findAll('td')
    if len(cells) > 0:
        draw_date = cells[0].text.strip()
        numbers = cells[1].text.strip()
        mega = cells[2].text.strip()
        drawings = {'dates': draw_date, 'winning_numbers': numbers, 'mega_number': mega}
        drawing_list.append(drawings)
        print "added {0} {1} {2}, to the list".format(draw_date, numbers, mega)
Expected Output: I'd love to scrape the table rows into a dataframe
draw_date | numbers | mega
-----------|----------------|-----
12/06/2017 | 12 24 07 01 02 | 23
12/02/2017 | 33 18 07 42 40 | 7
Thanks for any revision or pointers in the right direction.
This expression "table.tag_even_numbers tr" selects nothing because the table has no 'tag_even_numbers' class, but has a 'tag_even' class and a 'numbers' class.
So if you change this:
soup.select("table.tag_even_numbers tr")
to:
soup.select("table.tag_even.numbers tr")
you should have 20 items in drawing_list.
Also, by using .text to get the numbers, you get them all joined side by side in one string.
If you want a list of numbers, you should use .stripped_strings instead, e.g.:
numbers = list(cells[1].stripped_strings)
Then you can create a dataframe from drawing_list, e.g.:
df = pd.DataFrame(drawing_list)
print(df.head())
dates mega_number winning_numbers
0 Dec 6, 2017 - 3201 23 [12, 24, 07, 01, 02]
1 Dec 2, 2017 - 3200 7 [33, 18, 07, 42, 40]
2 Nov 29, 2017 - 3199 6 [03, 33, 26, 27, 07]
3 Nov 25, 2017 - 3198 19 [21, 46, 13, 25, 17]
4 Nov 22, 2017 - 3197 3 [32, 40, 27, 42, 08]
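If you'd rather have one numeric column per ball instead of a list column, the list can be expanded with pandas; a sketch built on one row shaped like the frame above:
import pandas as pd

df = pd.DataFrame({'dates': ['Dec 6, 2017 - 3201'],
                   'mega_number': ['23'],
                   'winning_numbers': [['12', '24', '07', '01', '02']]})

# expand the list column into one integer column per ball (n1..n5)
balls = pd.DataFrame(df['winning_numbers'].tolist(),
                     columns=['n%d' % i for i in range(1, 6)]).astype(int)
df = pd.concat([df.drop(columns='winning_numbers'), balls], axis=1)
print(df)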

Output not showing entirely in Python

The output of a simple script breaks and is not shown entirely. What are the options to avoid the truncation?
22 December 23, 1989, Saturday, Late Edition - Final
23 December 22, 1989, Friday, Late Edition - Final
24 December 21, 1989, Thursday, Late Edition - Final
25 December 21, 1989, Thursday, Late Edition - Final
26 December 20, 1989, Wednesday, Late Edition - F...
27 December 20, 1989, Wednesday, Late Edition - F...
28 December 19, 1989, Tuesday, Late Edition - Final
29 December 18, 1989, Monday, Late Edition - Final
...
605 January 12, 2016 Tuesday
606 January 12, 2016 Tuesday 10:58 PM EST
607 January 12, 2016 Tuesday 8:28 PM EST
608 January 12, 2016 Tuesday 9:43 AM EST
Thanks!
PS: this is the code used to produce the output:
import json
import nltk
import re
import pandas

appended_data = []
for i in range(1989, 2017):
    df0 = pandas.DataFrame([json.loads(l) for l in open('NYT_%d.json' % i)])
    df1 = pandas.DataFrame([json.loads(l) for l in open('USAT_%d.json' % i)])
    df2 = pandas.DataFrame([json.loads(l) for l in open('WP_%d.json' % i)])
    appended_data.append(df0)
    appended_data.append(df1)
    appended_data.append(df2)
appended_data = pandas.concat(appended_data)
print(appended_data.date)
You need to change the display width for pandas. The option you are looking for is pd.set_option('display.width', 2000), but you may also find some other pandas options helpful, which I use regularly:
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 2000)
Detailed description can be found here.
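If you only need the untruncated output for a single print rather than globally, pandas also offers a context manager that restores the options afterwards:
import pandas as pd

df = pd.DataFrame({'date': range(1000)})
# temporarily lift the limits for this one print, then restore them
with pd.option_context('display.max_rows', None, 'display.width', 2000):
    print(df)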
