This is my code:
from pandas_datareader import data

start = '2015-1-1'
end = '2020-12-31'
source = 'yahoo'
google = data.DataReader('GOOG', start=start, end=end, data_source=source).reset_index()
I was using this code until last month and it was working properly. This month I tried it again, and now it throws the following error:
Unable to read URL: https://finance.yahoo.com/quote/GOOG/history?period1=1420065000&period2=1609453799&interval=1d&frequency=1d&filter=history
I am not able to figure it out. Can you please help me understand why this is happening?
Yahoo! Finance has changed their structure slightly: data retrieval now requires headers on the HTTP request. Once that is done, it works fine.
For pandas and pandas-datareader, you'll need to upgrade if you use them (the fix has already landed upstream). For other packages that pull data from Yahoo!, such as backtrader, you'll probably need to either upgrade or add headers to their Yahoo! retrieval code :).
pip install --upgrade pandas
pip install --upgrade pandas-datareader
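If you cannot upgrade right away, here is a minimal sketch of the "add headers" workaround mentioned above. It assumes your pandas-datareader version routes its Yahoo! requests through the session you pass in (which may not hold for every version); it simply attaches a browser-like User-Agent:

import requests
from pandas_datareader import data

session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})  # any browser-like UA string

google = data.DataReader('GOOG', data_source='yahoo',
                         start='2015-1-1', end='2020-12-31',
                         session=session).reset_index()
print(google.head())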
Have a nice day ;).
Please upgrade pandas_datareader to a version >= 0.10.0. This bug is fixed in 0.10.0, as per the release notes:
Fixed Yahoo readers which now require headers
Yahoo! Finance is working fine with pandas without any issue.
Script:
import pandas as pd
import requests
link = 'https://finance.yahoo.com/quote/GOOG/history?period1=1420065000&period2=1609453799&interval=1d&frequency=1d&filter=history'
r = requests.get(link, headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'})
data = pd.read_html(r.text)[0]
df = pd.DataFrame(data)
df = df.iloc[0:100]
print(df)
Output:
Date Open High Low Close AdjClose Volume
Dec 31, 2020 1735.42 1758.93 1735.42 1751.88 1751.88 1011900
Dec 30, 2020 1762.01 1765.09 1725.6 1739.52 1739.52 1306100
Dec 29, 2020 1787.79 1792.44 1756.09 1758.72 1758.72 1299400
Dec 28, 2020 1751.64 1790.73 1746.33 1776.09 1776.09 1393000
Dec 24, 2020 1735 1746 1729.11 1738.85 1738.85 346800
Dec 23, 2020 1728.11 1747.99 1725.04 1732.38 1732.38 1033800
Dec 22, 2020 1734.43 1737.41 1712.57 1723.5 1723.5 936700
Dec 21, 2020 1713.51 1740.85 1699 1739.37 1739.37 1828400
Dec 18, 2020 1754.18 1755.11 1720.22 1731.01 1731.01 4016400
Dec 17, 2020 1768.51 1771.78 1738.66 1747.9 1747.9 1624700
Dec 16, 2020 1772.88 1773 1756.08 1763 1763 1513500
Dec 15, 2020 1764.42 1771.42 1749.95 1767.77 1767.77 1482300
Dec 14, 2020 1775 1797.39 1757.21 1760.06 1760.06 1600200
Dec 11, 2020 1763.06 1784.45 1760 1781.77 1781.77 1220700
Dec 10, 2020 1769.8 1781.31 1740.32 1775.33 1775.33 1362800
Dec 09, 2020 1812.01 1834.27 1767.81 1784.13 1784.13 1507600
Dec 08, 2020 1810.1 1821.9 1796.2 1818.55 1818.55 1096300
Dec 07, 2020 1819 1832.37 1805.78 1819.48 1819.48 1320900
Dec 04, 2020 1824.52 1833.16 1816.99 1827.99 1827.99 1378200
Dec 03, 2020 1824.01 1847.2 1822.65 1826.77 1826.77 1227300
Dec 02, 2020 1798.1 1835.65 1789.47 1827.95 1827.95 1222000
Dec 01, 2020 1774.37 1824.83 1769.37 1798.1 1798.1 1736900
Nov 30, 2020 1781.18 1788.06 1755 1760.74 1760.74 1823800
Nov 27, 2020 1773.09 1804 1772.44 1793.19 1793.19 884900
Nov 25, 2020 1772.89 1778.54 1756.54 1771.43 1771.43 1045800
Nov 24, 2020 1730.5 1771.6 1727.69 1768.88 1768.88 1578000
Nov 23, 2020 1749.6 1753.9 1717.72 1734.86 1734.86 2161600
Nov 20, 2020 1765.21 1774 1741.86 1742.19 1742.19 2313500
Nov 19, 2020 1738.38 1769.59 1737.01 1763.92 1763.92 1249900
Nov 18, 2020 1765.23 1773.47 1746.14 1746.78 1746.78 1173500
Nov 17, 2020 1776.94 1785 1767 1770.15 1770.15 1147100
Nov 16, 2020 1771.7 1799.07 1767.69 1781.38 1781.38 1246800
Nov 13, 2020 1757.63 1781.04 1744.55 1777.02 1777.02 1499900
Nov 12, 2020 1747.63 1768.27 1745.6 1749.84 1749.84 1247500
Nov 11, 2020 1750 1764.22 1747.36 1752.71 1752.71 1264000
Nov 10, 2020 1731.09 1763 1717.3 1740.39 1740.39 2636100
Nov 09, 2020 1790.9 1818.06 1760.02 1763 1763 2268300
Nov 06, 2020 1753.95 1772.43 1740.35 1761.75 1761.75 1660900
Nov 05, 2020 1781 1793.64 1750.51 1763.37 1763.37 2065800
Nov 04, 2020 1710.28 1771.36 1706.03 1749.13 1749.13 3570900
Nov 03, 2020 1631.78 1661.7 1616.62 1650.21 1650.21 1661700
Nov 02, 2020 1628.16 1660.77 1616.03 1626.03 1626.03 2535400
Oct 30, 2020 1672.11 1687 1604.46 1621.01 1621.01 4329100
Oct 29, 2020 1522.36 1593.71 1522.24 1567.24 1567.24 2003100
Oct 28, 2020 1559.74 1561.35 1514.62 1516.62 1516.62 1834000
Oct 27, 2020 1595.67 1606.84 1582.78 1604.26 1604.26 1229000
Oct 26, 2020 1625.01 1638.24 1576.5 1590.45 1590.45 1853300
Oct 23, 2020 1626.07 1642.36 1620.51 1641 1641 1375800
Oct 22, 2020 1593.05 1621.99 1585 1615.33 1615.33 1433600
Oct 21, 2020 1573.33 1618.73 1571.63 1593.31 1593.31 2568300
Oct 20, 2020 1527.05 1577.5 1525.67 1555.93 1555.93 2241700
Oct 19, 2020 1580.46 1588.15 1528 1534.61 1534.61 1607100
Oct 16, 2020 1565.85 1581.13 1563 1573.01 1573.01 1434700
Oct 15, 2020 1547.15 1575.1 1545.03 1559.13 1559.13 1540000
Oct 14, 2020 1578.59 1587.68 1550.53 1568.08 1568.08 1929300
Oct 13, 2020 1583.73 1590 1563.2 1571.68 1571.68 1601000
Oct 12, 2020 1543 1593.86 1532.57 1569.15 1569.15 2482600
Oct 09, 2020 1494.7 1516.52 1489.45 1515.22 1515.22 1435300
Oct 08, 2020 1465.09 1490 1465.09 1485.93 1485.93 1187800
Oct 07, 2020 1464.29 1468.96 1436 1460.29 1460.29 1746200
Oct 06, 2020 1475.58 1486.76 1448.59 1453.44 1453.44 1245400
Oct 05, 2020 1466.21 1488.21 1464.27 1486.02 1486.02 1113300
Oct 02, 2020 1462.03 1483.2 1450.92 1458.42 1458.42 1284100
Oct 01, 2020 1484.27 1499.04 1479.21 1490.09 1490.09 1779500
Sep 30, 2020 1466.8 1489.75 1459.88 1469.6 1469.6 1701600
Sep 29, 2020 1470.39 1476.66 1458.81 1469.33 1469.33 978200
Sep 28, 2020 1474.21 1476.8 1449.3 1464.52 1464.52 2007900
Sep 25, 2020 1432.63 1450 1413.34 1444.96 1444.96 1323000
Sep 24, 2020 1411.03 1443.71 1409.85 1428.29 1428.29 1450200
Sep 23, 2020 1458.78 1460.96 1407.7 1415.21 1415.21 1657400
Sep 22, 2020 1450.09 1469.52 1434.53 1465.46 1465.46 1583200
Sep 21, 2020 1440.06 1448.36 1406.55 1431.16 1431.16 2888800
Sep 18, 2020 1498.01 1503 1437.13 1459.99 1459.99 3103900
Sep 17, 2020 1496 1508.3 1470 1495.53 1495.53 1879800
Sep 16, 2020 1555.54 1562 1519.82 1520.9 1520.9 1311700
Sep 15, 2020 1536 1559.57 1531.83 1541.44 1541.44 1331100
Sep 14, 2020 1539.01 1564 1515.74 1519.28 1519.28 1696600
Sep 11, 2020 1536 1575.2 1497.36 1520.72 1520.72 1597100
Sep 10, 2020 1560.64 1584.08 1525.81 1532.02 1532.02 1618600
Sep 09, 2020 1557.53 1569 1536.05 1556.96 1556.96 1774700
Sep 08, 2020 1533.51 1563.86 1528.01 1532.39 1532.39 2610900
Sep 04, 2020 1624.26 1645.11 1547.61 1591.04 1591.04 2608600
Sep 03, 2020 1709.71 1709.71 1615.06 1641.84 1641.84 3107800
Sep 02, 2020 1673.78 1733.18 1666.33 1728.28 1728.28 2511200
Sep 01, 2020 1636.63 1665.73 1632.22 1660.71 1660.71 1825300
Aug 31, 2020 1647.89 1647.96 1630.31 1634.18 1634.18 1823400
Aug 28, 2020 1633.49 1647.17 1630.75 1644.41 1644.41 1499000
Aug 27, 2020 1653.68 1655 1625.75 1634.33 1634.33 1861600
Aug 26, 2020 1608 1659.22 1603.6 1652.38 1652.38 3993400
Aug 25, 2020 1582.07 1611.62 1582.07 1608.22 1608.22 2247100
Aug 24, 2020 1593.98 1614.17 1580.57 1588.2 1588.2 1409900
Aug 21, 2020 1577.03 1597.72 1568.01 1580.42 1580.42 1446500
Aug 20, 2020 1543.45 1585.87 1538.2 1581.75 1581.75 1706900
Aug 19, 2020 1553.31 1573.68 1543.95 1547.53 1547.53 1660600
Aug 18, 2020 1526.18 1562.47 1523.71 1558.6 1558.6 2027100
Aug 17, 2020 1514.67 1525.61 1507.97 1517.98 1517.98 1378300
Aug 14, 2020 1515.66 1521.9 1502.88 1507.73 1507.73 1354800
Aug 13, 2020 1510.34 1537.25 1508.01 1518.45 1518.45 1455200
Aug 12, 2020 1485.58 1512.39 1485.25 1506.62 1506.62 1437000
Aug 11, 2020 1492.44 1510 1478 1480.32 1480.32 1454400
Yahoo! Finance has decommissioned their API. Try this Python library.
I'm a newbie seeking help.
I've tried the following without success.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.canada.ca/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds.html"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []

# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))
Result:
['table']
None
Can anyone help me with how to get this data?
Thank you so much.
The data you see on the page is loaded from an external URL. To load it you can use the next example:
import requests
import pandas as pd
url = "https://www.canada.ca/content/dam/ircc/documents/json/ee_rounds_123_en.json"
data = requests.get(url).json()
df = pd.DataFrame(data["rounds"])
df = df.drop(columns=["drawNumberURL", "DrawText1", "mitext"])
print(df.head(10).to_markdown(index=False))
Prints:
| drawNumber | drawDate | drawDateFull | drawName | drawSize | drawCRS | drawText2 | drawDateTime | drawCutOff | drawDistributionAsOn | dd1 | dd2 | dd3 | dd4 | dd5 | dd6 | dd7 | dd8 | dd9 | dd10 | dd11 | dd12 | dd13 | dd14 | dd15 | dd16 | dd17 | dd18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 231 | 2022-09-14 | September 14, 2022 | No Program Specified | 3,250 | 510 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | September 14, 2022 at 13:29:26 UTC | January 08, 2022 at 10:24:52 UTC | September 12, 2022 | 408 | 6,228 | 63,860 | 5,845 | 9,505 | 19,156 | 16,541 | 12,813 | 58,019 | 12,245 | 12,635 | 9,767 | 11,186 | 12,186 | 68,857 | 35,833 | 5,068 | 238,273 |
| 230 | 2022-08-31 | August 31, 2022 | No Program Specified | 2,750 | 516 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 31, 2022 at 13:55:23 UTC | April 16, 2022 at 18:24:41 UTC | August 29, 2022 | 466 | 7,224 | 63,270 | 5,554 | 9,242 | 19,033 | 16,476 | 12,965 | 58,141 | 12,287 | 12,758 | 9,796 | 11,105 | 12,195 | 68,974 | 36,001 | 5,120 | 239,196 |
| 229 | 2022-08-17 | August 17, 2022 | No Program Specified | 2,250 | 525 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 17, 2022 at 13:43:47 UTC | December 28, 2021 at 11:03:15 UTC | August 15, 2022 | 538 | 8,221 | 62,753 | 5,435 | 9,129 | 18,831 | 16,465 | 12,893 | 58,113 | 12,200 | 12,721 | 9,801 | 11,138 | 12,253 | 68,440 | 35,745 | 5,137 | 238,947 |
| 228 | 2022-08-03 | August 3, 2022 | No Program Specified | 2,000 | 533 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 03, 2022 at 15:16:24 UTC | January 06, 2022 at 14:29:50 UTC | August 2, 2022 | 640 | 8,975 | 62,330 | 5,343 | 9,044 | 18,747 | 16,413 | 12,783 | 57,987 | 12,101 | 12,705 | 9,747 | 11,117 | 12,317 | 68,325 | 35,522 | 5,145 | 238,924 |
| 227 | 2022-07-20 | July 20, 2022 | No Program Specified | 1,750 | 542 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 20, 2022 at 16:32:49 UTC | December 30, 2021 at 15:29:35 UTC | July 18, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 226 | 2022-07-06 | July 6, 2022 | No Program Specified | 1,500 | 557 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 6, 2022 at 14:34:34 UTC | November 13, 2021 at 02:20:46 UTC | July 11, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 225 | 2022-06-22 | June 22, 2022 | Provincial Nominee Program | 636 | 752 | Provincial Nominee Program | June 22, 2022 at 14:13:57 UTC | April 19, 2022 at 13:45:45 UTC | June 20, 2022 | 664 | 8,017 | 55,917 | 4,246 | 7,845 | 16,969 | 15,123 | 11,734 | 53,094 | 10,951 | 11,621 | 8,800 | 10,325 | 11,397 | 64,478 | 33,585 | 4,919 | 220,674 |
| 224 | 2022-06-08 | June 8, 2022 | Provincial Nominee Program | 932 | 796 | Provincial Nominee Program | June 08, 2022 at 14:03:28 UTC | October 18, 2021 at 17:13:17 UTC | June 6, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 223 | 2022-05-25 | May 25, 2022 | Provincial Nominee Program | 590 | 741 | Provincial Nominee Program | May 25, 2022 at 13:21:23 UTC | February 02, 2022 at 12:29:53 UTC | May 23, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 222 | 2022-05-11 | May 11, 2022 | Provincial Nominee Program | 545 | 753 | Provincial Nominee Program | May 11, 2022 at 14:08:07 UTC | December 15, 2021 at 20:32:57 UTC | May 9, 2022 | 635 | 7,193 | 52,684 | 3,749 | 7,237 | 16,027 | 14,466 | 11,205 | 50,811 | 10,484 | 11,030 | 8,393 | 9,945 | 10,959 | 62,341 | 32,590 | 4,839 | 211,093 |
I have the following dataset
id      date
7510    15 Jun 2020
7510    16 Jun 2020
7512    15 Jun 2020
7512    07 Jul 2020
7520    15 Jun 2020
7520    16 Aug 2020
I need to convert this to a dictionary, which is quite straightforward, but I need each unique id as a key and all of its corresponding dates as the values for that key.
for example;
dictionary = {7510: ["15 Jun 2020", "16 Jun 2020"], 7512: ["15 Jun 2020", "07 Jul 2020"],
7520: ["15 Jun 2020", "16 Aug 2020"] }
Try this:
df.groupby('id')['date'].agg(list).to_dict()
Output:
{7510: ['15 Jun 2020', '16 Jun 2020'],
7512: ['15 Jun 2020', '07 Jul 2020'],
7520: ['15 Jun 2020', '16 Aug 2020']}
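For completeness, a minimal self-contained sketch (the frame is rebuilt here from the sample rows in the question) that applies the same groupby:

import pandas as pd

# Example data taken from the question.
df = pd.DataFrame({
    'id':   [7510, 7510, 7512, 7512, 7520, 7520],
    'date': ['15 Jun 2020', '16 Jun 2020', '15 Jun 2020',
             '07 Jul 2020', '15 Jun 2020', '16 Aug 2020'],
})

dictionary = df.groupby('id')['date'].agg(list).to_dict()
print(dictionary)
# {7510: ['15 Jun 2020', '16 Jun 2020'], 7512: ['15 Jun 2020', '07 Jul 2020'], 7520: ['15 Jun 2020', '16 Aug 2020']}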
I have a question regarding DataFrames. I have written code with Selenium to extract a table from a website. However, I am unsure how to transform the Selenium text into a DataFrame and export it as CSV. Below is my code.
import time

import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("Path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
table = driver.find_element_by_xpath('//table[@id="inlineSearchTable"]/tbody')
while True:
    try:
        print(table.text)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
driver.quit()
If you're using Selenium, you need to get the outerHTML of the table and then use pd.read_html() to get the DataFrame.
Then append it to an empty DataFrame and export to CSV.
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
dfbase = pd.DataFrame()
while True:
    try:
        table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#inlineSearchTable"))).get_attribute("outerHTML")
        df = pd.read_html(str(table))[0]
        dfbase = dfbase.append(df, ignore_index=True)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break

print(dfbase)
dfbase.to_csv("TestResultsDF.csv")
driver.quit()
Output:
Name Date Added to the List
0 24option.com Aug 6, 2013
1 3storich Aug 20, 2020
2 4XP Investments & Trading and Forex Place Ltd. Mar 15, 2012
3 6149154 Canada Inc. d.b.a. Forexcanus Aug 25, 2011
4 72Option, owned and operated by Epic Ventures ... Dec 8, 2016
5 A&L Royal Finance Inc. May 6, 2015
6 Abler Finance Sep 26, 2014
7 Accredited International / Accredited FX Mar 15, 2013
8 Aidan Trading Jan 24, 2018
9 AlfaTrade, Nemesis Capital Limited (together, ... Mar 16, 2016
10 Alma Group Co Trading Ltd. Oct 7, 2020
11 Ameron Oil and Gas Ltd. Sep 23, 2010
12 Anchor Securities Limited Aug 29, 2011
13 Anyoption Jul 8, 2013
14 Arial Trading, LLC Nov 20, 2008
15 Asia & Pacific Holdings Inc. Dec 5, 2017
16 Astercap Ltd., doing business as Broker Official Aug 31, 2018
17 Astor Capital Fund Limited (Astor) Apr 9, 2020
18 Astrofx24 Nov 19, 2019
19 Atlantic Global Asset Management Sep 12, 2017
20 Ava FX, Ava Financial Ltd. and Ava Capital Mar... Mar 15, 2012
21 Ava Trade Ltd. May 30, 2016
22 Avariz Group Nov 4, 2020
23 B.I.S. Blueport Investment Services Ltd., doin... Sep 7, 2017
24 B4Option May 3, 2017
25 Banc de Binary Ltd. Jul 29, 2013
26 BCG Invest Apr 6, 2020
27 BeFaster.fit Limited (BeFaster) Jun 22, 2020
28 Beltway M&A Oct 6, 2009
29 Best Commodity Options Aug 1, 2012
.. ... ...
301 Trade12, owned and operated by Exo Capital Mar... Mar 1, 2017
302 TradeNix Jul 30, 2020
303 TradeQuicker May 21, 2014
304 TradeRush.com Aug 6, 2013
305 Trades Capital, operated by TTN Marketing Ltd.... May 18, 2016
306 Tradewell.io Jan 20, 2020
307 TradexOption Apr 20, 2020
308 Trinidad Oil & Gas Corporation Dec 6, 2011
309 Truevalue Investment International Limited May 11, 2018
310 UK Options Mar 3, 2015
311 United Financial Commodity Group, operating as... Nov 15, 2018
312 Up & Down Marketing Limited (dba OneTwoTrade) Apr 27, 2015
313 USI-TECH Limited Dec 15, 2017
314 uTrader and Day Dream Investments Ltd. (togeth... Nov 29, 2017
315 Vision Financial Partners, LLC Feb 18, 2016
316 Vision Trading Advisors Feb 18, 2016
317 Wallis Partridge LLC Apr 24, 2014
318 Waverly M&A Jan 19, 2010
319 Wealth Capital Corp. Sep 4, 2012
320 Wentworth & Wellesley Ltd. / Wentworth & Welle... Mar 13, 2012
321 West Golden Capital Dec 1, 2010
322 World Markets Sep 22, 2020
323 WorldWide CapitalFX Feb 8, 2019
324 XForex, owned and operated by XFR Financial Lt... Jul 19, 2016
325 Xtelus Profit Nov 30, 2020
326 You Trade Holdings Limited Jun 3, 2011
327 Zen Vybe Inc. Mar 27, 2020
328 ZenithOptions Feb 12, 2016
329 Ziptradex Limited (Ziptradex) May 21, 2020
330 Zulu Trade Inc. Mar 2, 2015
For example, if I have the following event data, and want to find clusters of at least 2 events that are within 1 minute of each other in which id_1, id_2, and id_3 are all the same. For reference, I have the epoch timestamp (in microseconds) in addition to the date-time timestamp.
event_id timestamp id_1 id_2 id_3
9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
9442813 Jun 15, 2015 10:22 PM PDT A 2 34567
9442810 Jun 15, 2015 10:22 PM PDT A 3 12345
9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
9442840 Jun 15, 2015 10:23 PM PDT C 3 12345
9442839 Jun 15, 2015 10:23 PM PDT C 1 34567
9442838 Jun 15, 2015 10:23 PM PDT C 2 12345
9442907 Jun 15, 2015 10:24 PM PDT C 3 34567
9442886 Jun 15, 2015 10:24 PM PDT C 1 12345
9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
9442934 Jun 15, 2015 10:25 PM PDT C 3 12345
For each cluster found, I want to return a set of (id_1, id_2, id_3, [list of event_ids], min_timestamp_of_cluster, max_timestamp_of_cluster). Additionally, if there's a cluster with (e.g.) 6 events, I'd only want to return a single result with all events, not one for each grouping of 3 events.
If I understood your problem correctly, you can make use of scikit-learn's DBSCAN clustering algorithm with a custom distance (or metric) function. Your distance function should return a very large number if any of id_1, id_2, or id_3 differ; otherwise it should return the time difference.
With this method, though, the number of clusters is determined by the algorithm rather than given as an input. If you do want to pass the number of clusters as an input, k-means is the clustering algorithm you may need to look into.
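As a rough illustration of that idea (not the asker's data pipeline; the numeric encoding of id_1 and the epoch timestamps below are made up for the example), events can be encoded as [epoch_ts, id_1, id_2, id_3] rows and clustered with a callable metric:

import numpy as np
from sklearn.cluster import DBSCAN

# Each row: [epoch timestamp (s), id_1 encoded as a number (A=0, C=1), id_2, id_3]
events = np.array([
    [1434432120, 0, 1, 34567],
    [1434432125, 0, 1, 34567],
    [1434432130, 0, 2, 12345],
    [1434432185, 0, 2, 12345],
    [1434432300, 1, 3, 12345],
], dtype=float)

def event_distance(a, b):
    # Different ids -> effectively infinite distance, so they never cluster together.
    if not np.array_equal(a[1:], b[1:]):
        return 1e12
    return abs(a[0] - b[0])  # same ids -> distance is the time gap in seconds

# eps=60 seconds, and a cluster needs at least 2 events.
labels = DBSCAN(eps=60, min_samples=2, metric=event_distance).fit_predict(events)
print(labels)  # -1 marks noise; events sharing a non-negative label form one cluster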
In pure Python, use a "sliding window" that encompasses all the events in a 1-minute range.
The premise is simple: maintain a queue of events that is a subsequence of the total list, in order. The "window" (queue) should hold all the events you care about; in this case, that is determined by the 60-second time-gap requirement.
As you process events, add one event to the end of the queue. If the first event in the queue is more than 60 seconds from the newly-added last event, slide the window forward by dropping the first event from the front of the queue.
This is Python 3:
import collections
import operator
import itertools
from datetime import datetime

#### FROM HERE: vvv is just faking events. Delete or replace.
class Event(collections.namedtuple('Event', 'event_id timestamp id_1 id_2 id_3 epoch_ts')):
    def __str__(self):
        return ('{e.event_id} {e.timestamp} {e.id_1} {e.id_2} {e.id_3}'
                .format(e=self))

def get_events():
    event_list = map(operator.methodcaller('strip'), '''
        9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
        9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
        9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
        9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
        9442813 Jun 15, 2015 10:22 PM PDT A 2 34567
        9442810 Jun 15, 2015 10:22 PM PDT A 3 12345
        9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
        9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
        9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
        9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
        9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
        9442840 Jun 15, 2015 10:23 PM PDT C 3 12345
        9442839 Jun 15, 2015 10:23 PM PDT C 1 34567
        9442838 Jun 15, 2015 10:23 PM PDT C 2 12345
        9442907 Jun 15, 2015 10:24 PM PDT C 3 34567
        9442886 Jun 15, 2015 10:24 PM PDT C 1 12345
        9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
        9442934 Jun 15, 2015 10:25 PM PDT C 3 12345
        '''.strip().splitlines())
    for line in event_list:
        idev, *rest = line.split()
        ts = rest[:6]
        id1, id2, id3 = rest[6:]
        id2 = int(id2)  # faster when sorting (see find_clustered_events)
        id3 = int(id3)  # faster when sorting (see find_clustered_events)
        ts_str = ' '.join(ts)
        dt = datetime.strptime(ts_str.replace('PDT', '-0700'), '%b %d, %Y %I:%M %p %z')
        epoch = dt.timestamp()
        ev = Event(idev, ts_str, id1, id2, id3, epoch)
        yield ev
#### TO HERE: ^^^ was just faking up your events. Delete or replace.

def add_cluster(key, group):
    '''Do whatever you want with the clusters. I'll print them.'''
    print('Cluster:', key)
    print(' ', '\n '.join(map(str, group)), sep='')

def find_clustered_events(events, cluster=3, gap_secs=60):
    '''Call add_cluster on clusters of events within a maximum time gap.

    Args:
        events (iterable): series of events, in chronological order
        cluster (int): minimum number of events in a cluster
        gap_secs (float): maximum time-gap from start to end of cluster

    Returns:
        None.
    '''
    window = collections.deque()
    evkey = lambda e: (e.id_1, e.id_2, e.id_3)

    for ev in events:
        window.append(ev)
        t0 = window[0].epoch_ts
        tn = window[-1].epoch_ts

        if tn - t0 < gap_secs:
            continue

        window.pop()
        for k, g in itertools.groupby(sorted(window, key=evkey), key=evkey):
            group = tuple(g)
            if len(group) >= cluster:
                add_cluster(k, group)

        window.append(ev)
        window.popleft()

# Call find_clustered_events with the event generator and cluster args.
# Note that your data doesn't have any 3-clusters within 60 seconds. :-(
find_clustered_events(get_events(), cluster=2)
The output looks like this:
$ python test.py
Cluster: ('A', 1, 34567)
9442823 Jun 15, 2015 10:22 PM PDT A 1 34567
9442805 Jun 15, 2015 10:22 PM PDT A 1 34567
Cluster: ('A', 2, 12345)
9442821 Jun 15, 2015 10:22 PM PDT A 2 12345
9442876 Jun 15, 2015 10:23 PM PDT A 2 12345
Cluster: ('A', 3, 34567)
9442817 Jun 15, 2015 10:22 PM PDT A 3 34567
9442866 Jun 15, 2015 10:23 PM PDT A 3 34567
Cluster: ('A', 1, 12345)
9442814 Jun 15, 2015 10:22 PM PDT A 1 12345
9442858 Jun 15, 2015 10:23 PM PDT A 1 12345
Cluster: ('C', 2, 34567)
9442845 Jun 15, 2015 10:23 PM PDT C 2 34567
9442949 Jun 15, 2015 10:25 PM PDT C 2 34567
Please note: the code above doesn't try to keep track of events already in a cluster. So if you have, for example, an event type that occurs every fifteen seconds, you will have a sequence like this:
event1 t=0:00
event2 t=0:15
event3 t=0:30
event4 t=0:45
event5 t=1:00
And you will get overlapping clusters:
event1, event2, event3 (t=0:00 .. 0:30)
event2, event3, event4 (t=0:15 .. 0:45)
event3, event4, event5 (t=0:30 .. 1:00)
Technically, those are valid clusters, each slightly different. But you may wish to expunge previously-clustered events from the window, if you want the events to only appear in a single cluster.
Alternatively, if the chance of clustering and repetition is low, it might improve performance to implement repeat-checking in the add_cluster() function, to reduce the work done by the main loop.
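A tiny sketch of that repeat-checking idea (illustrative only; it filters out events already reported by an earlier call, which can shrink a group below the cluster threshold):

seen_event_ids = set()

def add_cluster(key, group):
    '''Print the cluster, skipping events that were already reported earlier.'''
    group = tuple(ev for ev in group if ev.event_id not in seen_event_ids)
    if not group:
        return
    seen_event_ids.update(ev.event_id for ev in group)
    print('Cluster:', key)
    print(' ', '\n '.join(map(str, group)), sep='')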
A final note: this does a LOT of sorting. And the sorting is not efficient, since it gets repeated every time a new event appears. If you have a large data set, the performance will probably be bad. If your event keys are relatively few - that is, if the id1,2,3 values tend to repeat over and over again - you would be better off dynamically creating separate deques for each distinct key (id1+id2+id3) and dispatching the event to the appropriate deque, applying the same window logic, and then checking the length of the deque.
On the other hand, if you are processing something like web-server logs, where the requester is always changing, that might tend to create a memory problem with all the useless deques. So this is a memory vs. time trade-off you'll have to be aware of.
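And a rough sketch of the per-key deque idea (illustrative only; it reuses Event, add_cluster and get_events from the code above, and still reports overlapping clusters each time a key's window reaches the threshold):

import collections

def find_clustered_events_by_key(events, cluster=3, gap_secs=60):
    # One window per distinct (id_1, id_2, id_3) key: no sorting or groupby needed.
    windows = collections.defaultdict(collections.deque)
    for ev in events:
        key = (ev.id_1, ev.id_2, ev.id_3)
        window = windows[key]
        window.append(ev)
        # Slide this key's window forward: drop events older than gap_secs.
        while ev.epoch_ts - window[0].epoch_ts > gap_secs:
            window.popleft()
        if len(window) >= cluster:
            add_cluster(key, tuple(window))

find_clustered_events_by_key(get_events(), cluster=2)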
I need to extract the info from this page: http://www.investing.com/currencies/usd-brl-historical-data. I need Date, Price, Open, High, Low, Change %.
I'm new to Python, so I got stuck at this step:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup=BeautifulSoup(r.content,'lxml')
g_data = soup.find_all('table', {'class':'genTbl closedTbl historicalTbl'})
d=[]
for item in g_data:
    Table_Values = item.find_all('tr')
    N = len(Table_Values) - 1
    for n in range(N):
        k = (item.find_all('td', {'class': 'first left bold noWrap'})[n].text)
        print(item.find_all('td', {'class': 'first left bold noWrap'})[n].text)
Here I have several problems:
The Price column can be tagged with the class 'redFont' or 'greenFont'. How can I specify that I want items tagged with class 'redFont' and/or 'greenFont'? Change % can also have the class redFont or greenFont. The other columns are tagged differently; how can I extract them?
Is there a way to extract whole columns from the table?
Ideally I would like to have a DataFrame with the columns Date, Price, Open, High, Low, Change %.
Thanks
I have already answered how to parse the table from that site here, but since you want a DataFrame, just use pandas.read_html:
import requests
import pandas as pd

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
df = pd.read_html(r.content, attrs={'id': 'curr_table'})[0]
Which will give you:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3609 3.4411 3.4465 3.3584 -2.36%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
You can generally pass the URL directly, but we get a 403 error for this particular site using urllib2 (the lib used by read_html), so we need to use requests to get that HTML.
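As a side note on the redFont/greenFont part of the question, here is a small BeautifulSoup sketch (the class names come from the question; the surrounding table markup is assumed). Passing a list to class_ matches cells tagged with either class:

import requests
from bs4 import BeautifulSoup

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Matches td elements whose class is either 'redFont' or 'greenFont'.
price_cells = soup.find_all('td', class_=['redFont', 'greenFont'])
print([td.text for td in price_cells])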
Here's a way to convert the html table into a nested list
The solution is to find the specific table, then loop through each tr in the table, creating a sublist of the text of all the items inside that tr. The code to do this is a nested list comprehension.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
#first row is empty
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
pprint(tableRows)
This gets all the data from the table
[['Jun 08, 2016', '3.3614', '3.4411', '3.4465', '3.3584', '-2.34%'],
['Jun 07, 2016', '3.4421', '3.4885', '3.5141', '3.4401', '-1.36%'],
['Jun 06, 2016', '3.4896', '3.5265', '3.5295', '3.4840', '-1.09%'],
['Jun 05, 2016', '3.5280', '3.5280', '3.5280', '3.5280', '0.11%'],
['Jun 03, 2016', '3.5240', '3.5910', '3.5947', '3.5212', '-1.91%'],
['Jun 02, 2016', '3.5926', '3.6005', '3.6157', '3.5765', '-0.22%'],
['Jun 01, 2016', '3.6007', '3.6080', '3.6363', '3.5755', '-0.29%'],
['May 31, 2016', '3.6111', '3.5700', '3.6383', '3.5534', '1.11%'],
['May 30, 2016', '3.5713', '3.6110', '3.6167', '3.5675', '-1.11%'],
['May 27, 2016', '3.6115', '3.5824', '3.6303', '3.5792', '0.81%'],
['May 26, 2016', '3.5825', '3.5826', '3.5857', '3.5757', '-0.03%'],
['May 25, 2016', '3.5836', '3.5702', '3.6218', '3.5511', '0.34%'],
['May 24, 2016', '3.5713', '3.5717', '3.5903', '3.5417', '-0.04%'],
['May 23, 2016', '3.5728', '3.5195', '3.5894', '3.5121', '1.49%'],
['May 20, 2016', '3.5202', '3.5633', '3.5663', '3.5154', '-1.24%'],
['May 19, 2016', '3.5644', '3.5668', '3.6197', '3.5503', '-0.11%'],
['May 18, 2016', '3.5683', '3.4877', '3.5703', '3.4854', '2.28%'],
['May 17, 2016', '3.4888', '3.4990', '3.5300', '3.4812', '-0.32%'],
['May 16, 2016', '3.5001', '3.5309', '3.5366', '3.4944', '-0.96%'],
['May 13, 2016', '3.5340', '3.4845', '3.5345', '3.4630', '1.39%'],
['May 12, 2016', '3.4855', '3.4514', '3.5068', '3.4346', '0.95%'],
['May 11, 2016', '3.4528', '3.4755', '3.4835', '3.4389', '-0.66%'],
['May 10, 2016', '3.4758', '3.5155', '3.5173', '3.4623', '-1.15%'],
['May 09, 2016', '3.5164', '3.5010', '3.6766', '3.4906', '0.40%']]
If you want to convert it to a pandas DataFrame, you just need to also grab the table headings and add them:
import requests
from bs4 import BeautifulSoup
import pandas
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]
#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)
print(df)
Then you'll get a DataFrame that looks like this:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3596 3.4411 3.4465 3.3584 -2.40%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%