I'm working on scraping some products from StockX. There is a popup element called sales history: I click its text link and then loop through the whole sales history by clicking the "Load More" button.
My problem is that for the most part this works fine as I loop through URLs, but occasionally it gets hung up for a really long time when the button is present but not clickable (and the bottom hasn't been reached either), so I believe it just stays in the loop. Any help with either breaking this loop or some workaround in Selenium would be awesome, thank you!
This is the function that I use to open the sales history information:
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://stockx.com/adidas-ultra-boost-royal-blue"
driver = webdriver.Firefox()
driver.get(url)
content = driver.page_source
soup = BeautifulSoup(content, 'lxml')

def get_sales_history():
    """ get sales history data from sales history table interaction """
    sales_hist_data = []
    try:
        # click 'View All Sales' text link
        View_all_sales_button = driver.find_element_by_xpath(".//div[@class='market-history-sales']/a[@class='all']")
        View_all_sales_button.click()
        # log in
        login_button = driver.find_element_by_id("nav-signup")
        login_button.click()
        # add username
        username = driver.find_element_by_id("email-login")
        username.clear()
        username.send_keys("email@email.com")
        # add password
        password = driver.find_element_by_name("password-login")
        password.clear()
        password.send_keys("password")
    except:
        pass
    while True:
        try:
            # If 'Load More' Appears Click Button
            sales_hist_load_more_button = driver.find_element_by_xpath(
                ".//div[@class='latest-sales-container']/button[@class='button button-block button-white']")
            sales_hist_load_more_button.click()
        except:
            #print("Reached bottom of page")
            break
    content = driver.page_source
    soup = BeautifulSoup(content, 'lxml')
    div = soup.find('div', class_='latest-sales-container')
    for td in div.find_all('td'):
        sales_hist_data.append(td.text)
    return sales_hist_data
You can wait for the button to become clickable using an explicit wait.
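from selenium.common.exceptions import StaleElementReferenceException, TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

while True:
    try:
        # If 'Load More' appears, click the button
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, ".//div[@class='latest-sales-container']/button[@class='button button-block button-white']"))).click()
    except StaleElementReferenceException:
        # The button was re-rendered; look it up again on the next pass
        pass
    except TimeoutException:
        # No clickable button within 20 seconds: stop looping
        break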
Also, note that I have used two different exception handlers. If you get a stale element (which is possible, since you are clicking the same button after the page content refreshes), the loop ignores the error and tries to click the button again. When the button does not become clickable within 20 seconds, a TimeoutException is raised and the loop breaks.
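The stale-element branch above retries without any limit, so a hard cap on attempts and on total time may be worth adding. Below is a minimal, driver-agnostic sketch of that idea. The names `click_until_done`, `click_once`, `stop_on`, `retry_on`, `max_attempts` and `max_seconds` are hypothetical and are not part of Selenium.

```python
import time


def click_until_done(click_once, stop_on, retry_on=(), max_attempts=200, max_seconds=120.0):
    """Call click_once() until it raises stop_on or a budget runs out.

    Returns the number of successful clicks. All names here are
    illustrative helpers, not Selenium API.
    """
    deadline = time.monotonic() + max_seconds
    clicks = 0
    for _ in range(max_attempts):
        if time.monotonic() >= deadline:
            break  # time budget exhausted: give up instead of hanging
        try:
            click_once()
        except retry_on:
            continue  # e.g. stale element: try again on the next pass
        except stop_on:
            break  # e.g. timeout: the button is gone, everything is loaded
        clicks += 1
    return clicks
```

With Selenium, `click_once` would wrap the `WebDriverWait(...).until(...).click()` call, with `stop_on=TimeoutException` and `retry_on=StaleElementReferenceException`. The default limits are arbitrary and should be tuned per site.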
To click the element with the text View All Sales in the Last Sale block, and then click Load More repeatedly to scrape the whole sales history, you need to induce WebDriverWait for element_to_be_clickable() on each click and for visibility_of_all_elements_located() when reading the rows. You can use the following XPath-based locator strategies:
Code Block:
import time
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver.get('https://stockx.com/adidas-ultra-boost-royal-blue')
time.sleep(20)  # to interact with the location popup
WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//div[@class='last-sale-block']//a[text()='View All Sales']"))).click()
while True:
    try:
        WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//button[@class='button button-block button-white' and text()='Load More']"))).click()
        print("Clicked on Load More")
        time.sleep(3)
    except TimeoutException:
        print("No more Load More")
        break
print([my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='modal-body']//tbody//tr")))])
Console Output:
Clicked on Load More
Clicked on Load More
Clicked on Load More
No more Load More
['Sunday, August 2, 2020 2:16 AM EST 11 $236', 'Tuesday, June 2, 2020 7:34 AM EST 11 $262', 'Monday, April 27, 2020 11:03 AM EST 9 $143', 'Tuesday, January 7, 2020 8:54 AM EST 12.5 $137', 'Friday, December 27, 2019 12:30 PM EST 10 $307', 'Sunday, December 1, 2019 3:09 PM EST 8.5 $290', 'Tuesday, November 12, 2019 1:05 AM EST 12 $275', 'Tuesday, May 7, 2019 2:26 PM EST 8.5 $181', 'Saturday, April 27, 2019 1:04 PM EST 10 $228', 'Tuesday, March 5, 2019 12:25 AM EST 8.5 $230', 'Monday, November 5, 2018 1:35 AM EST 8 $320', 'Tuesday, August 28, 2018 7:29 PM EST 8.5 $240', 'Friday, August 24, 2018 10:26 PM EST 11 $580', 'Monday, July 16, 2018 10:02 PM EST 10.5 $255', 'Friday, July 6, 2018 2:44 PM EST 9 $260', 'Saturday, June 30, 2018 8:14 AM EST 9.5 $300', 'Tuesday, June 5, 2018 11:06 PM EST 10 $299', 'Saturday, May 12, 2018 10:48 AM EST 12 $371', 'Tuesday, March 20, 2018 1:09 AM EST 7.5 $279', 'Tuesday, March 20, 2018 11:17 PM EST 8 $250', 'Saturday, February 24, 2018 2:18 AM EST 7.5 $250', 'Monday, February 19, 2018 6:11 PM EST 7 $300', 'Sunday, February 18, 2018 2:05 PM EST 10 $400', 'Saturday, February 3, 2018 3:24 PM EST 7.5 $299', 'Thursday, January 25, 2018 11:13 PM EST 7 $190', 'Wednesday, December 27, 2017 11:09 PM EST 9 $355', 'Thursday, October 12, 2017 8:37 PM EST 8 $300', 'Friday, September 1, 2017 2:05 AM EST 12.5 $333', 'Friday, September 1, 2017 10:38 PM EST 12 $495', 'Saturday, August 5, 2017 10:53 AM EST 8 $355', 'Friday, August 4, 2017 3:28 AM EST 9.5 $325', 'Thursday, July 6, 2017 7:31 AM EST 10 $350', 'Tuesday, June 13, 2017 11:42 PM EST 9 $350', 'Monday, May 15, 2017 4:19 AM EST 11.5 $200', 'Sunday, May 14, 2017 3:42 PM EST 13 $370', 'Sunday, March 26, 2017 1:49 PM EST 11 $347', 'Sunday, August 21, 2016 7:33 PM EST 11 $250']
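Each string in the list above packs the sale date, time, time zone, size and price into one line. As a rough sketch, assuming every row keeps exactly that layout (size and price as the last two space-separated fields), the rows could be split into fields like this. `parse_sale_row` is a hypothetical helper name, and the size is kept as a string because it may not always be numeric.

```python
from datetime import datetime


def parse_sale_row(row):
    """Split one scraped row such as
    'Sunday, August 2, 2020 2:16 AM EST 11 $236' into fields."""
    # The last two space-separated fields are the size and the price.
    stamp, size, price = row.rsplit(" ", 2)
    # The time zone label (e.g. 'EST') trails the timestamp.
    stamp, tz = stamp.rsplit(" ", 1)
    sold_at = datetime.strptime(stamp, "%A, %B %d, %Y %I:%M %p")
    return {
        "sold_at": sold_at,  # naive datetime; tz is kept separately
        "tz": tz,
        "size": size,
        "price": int(price.lstrip("$").replace(",", "")),
    }
```

The month and weekday names are parsed with the default C locale, so this assumes English output from the page.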
I'm a newbie seeking help.
I've tried the following without success.
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://www.canada.ca/en/immigration-refugees-citizenship/corporate/mandate/policies-operational-instructions-agreements/ministerial-instructions/express-entry-rounds.html"
html_text = requests.get(url).text
soup = BeautifulSoup(html_text, 'html.parser')
data = []
# Verifying tables and their classes
print('Classes of each table:')
for table in soup.find_all('table'):
    print(table.get('class'))
Result:
['table']
None
Can anyone help me with how to get this data?
Thank you so much.
The data you see on the page is loaded from an external URL. To load the data, you can use the following example:
import requests
import pandas as pd
url = "https://www.canada.ca/content/dam/ircc/documents/json/ee_rounds_123_en.json"
data = requests.get(url).json()
df = pd.DataFrame(data["rounds"])
df = df.drop(columns=["drawNumberURL", "DrawText1", "mitext"])
print(df.head(10).to_markdown(index=False))
Prints:
| drawNumber | drawDate | drawDateFull | drawName | drawSize | drawCRS | drawText2 | drawDateTime | drawCutOff | drawDistributionAsOn | dd1 | dd2 | dd3 | dd4 | dd5 | dd6 | dd7 | dd8 | dd9 | dd10 | dd11 | dd12 | dd13 | dd14 | dd15 | dd16 | dd17 | dd18 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 231 | 2022-09-14 | September 14, 2022 | No Program Specified | 3,250 | 510 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | September 14, 2022 at 13:29:26 UTC | January 08, 2022 at 10:24:52 UTC | September 12, 2022 | 408 | 6,228 | 63,860 | 5,845 | 9,505 | 19,156 | 16,541 | 12,813 | 58,019 | 12,245 | 12,635 | 9,767 | 11,186 | 12,186 | 68,857 | 35,833 | 5,068 | 238,273 |
| 230 | 2022-08-31 | August 31, 2022 | No Program Specified | 2,750 | 516 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 31, 2022 at 13:55:23 UTC | April 16, 2022 at 18:24:41 UTC | August 29, 2022 | 466 | 7,224 | 63,270 | 5,554 | 9,242 | 19,033 | 16,476 | 12,965 | 58,141 | 12,287 | 12,758 | 9,796 | 11,105 | 12,195 | 68,974 | 36,001 | 5,120 | 239,196 |
| 229 | 2022-08-17 | August 17, 2022 | No Program Specified | 2,250 | 525 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 17, 2022 at 13:43:47 UTC | December 28, 2021 at 11:03:15 UTC | August 15, 2022 | 538 | 8,221 | 62,753 | 5,435 | 9,129 | 18,831 | 16,465 | 12,893 | 58,113 | 12,200 | 12,721 | 9,801 | 11,138 | 12,253 | 68,440 | 35,745 | 5,137 | 238,947 |
| 228 | 2022-08-03 | August 3, 2022 | No Program Specified | 2,000 | 533 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | August 03, 2022 at 15:16:24 UTC | January 06, 2022 at 14:29:50 UTC | August 2, 2022 | 640 | 8,975 | 62,330 | 5,343 | 9,044 | 18,747 | 16,413 | 12,783 | 57,987 | 12,101 | 12,705 | 9,747 | 11,117 | 12,317 | 68,325 | 35,522 | 5,145 | 238,924 |
| 227 | 2022-07-20 | July 20, 2022 | No Program Specified | 1,750 | 542 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 20, 2022 at 16:32:49 UTC | December 30, 2021 at 15:29:35 UTC | July 18, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 226 | 2022-07-06 | July 6, 2022 | No Program Specified | 1,500 | 557 | Federal Skilled Worker, Canadian Experience Class, Federal Skilled Trades and Provincial Nominee Program | July 6, 2022 at 14:34:34 UTC | November 13, 2021 at 02:20:46 UTC | July 11, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 225 | 2022-06-22 | June 22, 2022 | Provincial Nominee Program | 636 | 752 | Provincial Nominee Program | June 22, 2022 at 14:13:57 UTC | April 19, 2022 at 13:45:45 UTC | June 20, 2022 | 664 | 8,017 | 55,917 | 4,246 | 7,845 | 16,969 | 15,123 | 11,734 | 53,094 | 10,951 | 11,621 | 8,800 | 10,325 | 11,397 | 64,478 | 33,585 | 4,919 | 220,674 |
| 224 | 2022-06-08 | June 8, 2022 | Provincial Nominee Program | 932 | 796 | Provincial Nominee Program | June 08, 2022 at 14:03:28 UTC | October 18, 2021 at 17:13:17 UTC | June 6, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 223 | 2022-05-25 | May 25, 2022 | Provincial Nominee Program | 590 | 741 | Provincial Nominee Program | May 25, 2022 at 13:21:23 UTC | February 02, 2022 at 12:29:53 UTC | May 23, 2022 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 222 | 2022-05-11 | May 11, 2022 | Provincial Nominee Program | 545 | 753 | Provincial Nominee Program | May 11, 2022 at 14:08:07 UTC | December 15, 2021 at 20:32:57 UTC | May 9, 2022 | 635 | 7,193 | 52,684 | 3,749 | 7,237 | 16,027 | 14,466 | 11,205 | 50,811 | 10,484 | 11,030 | 8,393 | 9,945 | 10,959 | 62,341 | 32,590 | 4,839 | 211,093 |
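Note that the numeric columns in this output, such as drawSize, are strings with thousands separators ("3,250"), so they cannot be summed or compared as numbers directly. A minimal plain-Python sketch of cleaning one record follows, assuming records shaped like the entries of data["rounds"]. The helper names `to_int` and `clean_round` are hypothetical.

```python
def to_int(text):
    """Convert a string such as '3,250' to the integer 3250."""
    return int(text.replace(",", ""))


def clean_round(record, numeric_fields=("drawSize", "drawCRS")):
    """Return a copy of one round record with the given fields as ints.

    The field names are taken from the printed table; which fields to
    convert is an assumption left to the caller.
    """
    cleaned = dict(record)  # leave the original record untouched
    for field in numeric_fields:
        cleaned[field] = to_int(cleaned[field])
    return cleaned
```

In pandas, the same conversion would usually be done per column once the DataFrame is built; the version above only illustrates the transformation itself.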
In my personal scraping project I cannot locate any job cards on https://unjobs.org, neither with requests / requests_html nor with Selenium. Job titles are the only fields that I can print in the console. Company names and deadlines seem to be located in iframes, but there is no src, and somehow the hrefs are not scrapeable either. I am not sure whether the site is an SPA. Plus, DevTools shows no XHR of interest. Please advise: which selector or script tag contains all the data?
You are dealing with the Cloudflare firewall. You have to inject the cookies. I can't share an answer that injects the cookies, since Cloudflare is quick to pick up on such threads and then improve its security.
Anyway, below is a solution using Selenium:
import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

mainurl = "https://unjobs.org/"

def main(driver):
    driver.get(mainurl)
    try:
        # wait until the job cards (div elements with an id) are present
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_all_elements_located(
                (By.XPATH, "//article/div[@id]"))
        )
        data = (
            (
                x.find_element_by_class_name('jtitle').text,
                x.find_element_by_class_name('jtitle').get_attribute("href"),
                x.find_element_by_tag_name('br').text,
                x.find_element_by_css_selector('.upd.timeago').text,
                x.find_element_by_tag_name('span').text
            )
            for x in element
        )
        df = pd.DataFrame(data)
        print(df)
    except TimeoutException:
        exit('Unable to locate element')
    finally:
        driver.quit()

if __name__ == "__main__":
    driver = webdriver.Firefox()
    data = main(driver)
Note: you can use a headless browser as well.
Output:
0 1 2 3 4
0 Republication : Une consultance internationale... https://unjobs.org/vacancies/1627733212329 about 9 hours ago Closing date: Friday, 13 August 2021
1 Project Management Support Associate (Informat... https://unjobs.org/vacancies/1627734534127 about 9 hours ago Closing date: Tuesday, 17 August 2021
2 Finance Assistant - Retainer, Nairobi, Kenya https://unjobs.org/vacancies/1627734537201 about 10 hours ago Closing date: Saturday, 14 August 2021
3 Procurement Officer, Sana'a, Yemen https://unjobs.org/vacancies/1627734545575 about 10 hours ago Closing date: Wednesday, 4 August 2021
4 ICT Specialist (Geospatial Information Systems... https://unjobs.org/vacancies/1627734547681 about 10 hours ago Closing date: Saturday, 14 August 2021
5 Programme Management - Senior Assistant (Grant... https://unjobs.org/vacancies/1627734550335 about 10 hours ago Closing date: Thursday, 5 August 2021
6 Especialista en Normas Internacionales de Cont... https://unjobs.org/vacancies/1627734552666 about 10 hours ago Closing date: Saturday, 14 August 2021
7 Administration Assistant, Juba, South Sudan https://unjobs.org/vacancies/1627734561330 about 10 hours ago Closing date: Wednesday, 11 August 2021
8 Project Management Support - Senior Assistant,... https://unjobs.org/vacancies/1627734570991 about 10 hours ago Closing date: Saturday, 14 August 2021
9 Administration Senior Assistant [Administrativ... https://unjobs.org/vacancies/1627734572868 about 10 hours ago Closing date: Wednesday, 11 August 2021
10 Project Management Support Officer, Juba, Sout... https://unjobs.org/vacancies/1627734574639 about 10 hours ago Closing date: Wednesday, 11 August 2021
11 Information Management Senior Associate, Bamak... https://unjobs.org/vacancies/1627734576597 about 10 hours ago Closing date: Saturday, 7 August 2021
12 Regional Health & Safety Specialists (French a... https://unjobs.org/vacancies/1627734578207 about 10 hours ago Closing date: Friday, 6 August 2021
13 Project Management Support - Associate, Bonn, ... https://unjobs.org/vacancies/1627734587268 about 10 hours ago Closing date: Tuesday, 10 August 2021
14 Associate Education Officer, Goré, Chad https://unjobs.org/vacancies/1627247597092 a day ago Closing date: Tuesday, 3 August 2021
15 Senior Program Officer, High Impact Africa 2 D... https://unjobs.org/vacancies/1627597499846 a day ago Closing date: Thursday, 12 August 2021
16 Specialist, Supply Chain, Geneva https://unjobs.org/vacancies/1627597509615 a day ago Closing date: Thursday, 12 August 2021
17 Project Manager, Procurement and Supply Manage... https://unjobs.org/vacancies/1627597494487 a day ago Closing date: Thursday, 12 August 2021
18 WCO Drug Programme: Analyst for AIRCOP Project... https://unjobs.org/vacancies/1627594132743 a day ago Closing date: Tuesday, 31 August 2021
19 Regional Desk Assistant, Geneva https://unjobs.org/vacancies/1627594929351 a day ago Closing date: Thursday, 26 August 2021
20 Programme Associate, Zambia https://unjobs.org/vacancies/1627586510917 a day ago Closing date: Wednesday, 11 August 2021
21 Associate Programme Management Officer, Entebb... https://unjobs.org/vacancies/1627512175261 a day ago Closing date: Saturday, 14 August 2021
22 Expert in Transport Facilitation and Connectiv... https://unjobs.org/vacancies/1627594978539 a day ago Closing date: Sunday, 15 August 2021
23 Content Developer for COP Trainings (two posit... https://unjobs.org/vacancies/1627594862178 a day ago
24 Consultant (e) en appui aux Secteurs, Haiti https://unjobs.org/vacancies/1627585454029 a day ago Closing date: Sunday, 8 August 2021
It looks like either Cloudflare knows your request is not coming from an actual browser and is serving a captcha instead of the actual site, or the site needs JavaScript to run (or both).
I would try using something like Puppeteer and see whether the response you get is valid.
I have a question regarding DataFrames. I have written code with Selenium to extract a table from a website. However, I am unsure how to transform the Selenium text into a DataFrame and export it to CSV. Below is my code.
import time
import requests
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("Path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
table = driver.find_element_by_xpath('//table[@id="inlineSearchTable"]/tbody')
while True:
    try:
        print(table.text)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
driver.quit()
If you are using Selenium, you need to get the outerHTML of the table and then use pd.read_html() to get the DataFrame.
Then append each page's DataFrame to an initially empty DataFrame and export it to CSV.
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

driver = webdriver.Chrome("path")
driver.get("https://www.bcsc.bc.ca/enforcement/early-intervention/investment-caution-list")
dfbase = pd.DataFrame()
while True:
    try:
        # grab the table's HTML for the current page and parse it
        table = WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, "table#inlineSearchTable"))).get_attribute("outerHTML")
        df = pd.read_html(str(table))[0]
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        dfbase = pd.concat([dfbase, df], ignore_index=True)
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next']"))).click()
        time.sleep(1)
    except:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "//a[@class='paginate_button next disabled']"))).click()
        break
print(dfbase)
dfbase.to_csv("TestResultsDF.csv")
driver.quit()
Output:
Name Date Added to the List
0 24option.com Aug 6, 2013
1 3storich Aug 20, 2020
2 4XP Investments & Trading and Forex Place Ltd. Mar 15, 2012
3 6149154 Canada Inc. d.b.a. Forexcanus Aug 25, 2011
4 72Option, owned and operated by Epic Ventures ... Dec 8, 2016
5 A&L Royal Finance Inc. May 6, 2015
6 Abler Finance Sep 26, 2014
7 Accredited International / Accredited FX Mar 15, 2013
8 Aidan Trading Jan 24, 2018
9 AlfaTrade, Nemesis Capital Limited (together, ... Mar 16, 2016
10 Alma Group Co Trading Ltd. Oct 7, 2020
11 Ameron Oil and Gas Ltd. Sep 23, 2010
12 Anchor Securities Limited Aug 29, 2011
13 Anyoption Jul 8, 2013
14 Arial Trading, LLC Nov 20, 2008
15 Asia & Pacific Holdings Inc. Dec 5, 2017
16 Astercap Ltd., doing business as Broker Official Aug 31, 2018
17 Astor Capital Fund Limited (Astor) Apr 9, 2020
18 Astrofx24 Nov 19, 2019
19 Atlantic Global Asset Management Sep 12, 2017
20 Ava FX, Ava Financial Ltd. and Ava Capital Mar... Mar 15, 2012
21 Ava Trade Ltd. May 30, 2016
22 Avariz Group Nov 4, 2020
23 B.I.S. Blueport Investment Services Ltd., doin... Sep 7, 2017
24 B4Option May 3, 2017
25 Banc de Binary Ltd. Jul 29, 2013
26 BCG Invest Apr 6, 2020
27 BeFaster.fit Limited (BeFaster) Jun 22, 2020
28 Beltway M&A Oct 6, 2009
29 Best Commodity Options Aug 1, 2012
.. ... ...
301 Trade12, owned and operated by Exo Capital Mar... Mar 1, 2017
302 TradeNix Jul 30, 2020
303 TradeQuicker May 21, 2014
304 TradeRush.com Aug 6, 2013
305 Trades Capital, operated by TTN Marketing Ltd.... May 18, 2016
306 Tradewell.io Jan 20, 2020
307 TradexOption Apr 20, 2020
308 Trinidad Oil & Gas Corporation Dec 6, 2011
309 Truevalue Investment International Limited May 11, 2018
310 UK Options Mar 3, 2015
311 United Financial Commodity Group, operating as... Nov 15, 2018
312 Up & Down Marketing Limited (dba OneTwoTrade) Apr 27, 2015
313 USI-TECH Limited Dec 15, 2017
314 uTrader and Day Dream Investments Ltd. (togeth... Nov 29, 2017
315 Vision Financial Partners, LLC Feb 18, 2016
316 Vision Trading Advisors Feb 18, 2016
317 Wallis Partridge LLC Apr 24, 2014
318 Waverly M&A Jan 19, 2010
319 Wealth Capital Corp. Sep 4, 2012
320 Wentworth & Wellesley Ltd. / Wentworth & Welle... Mar 13, 2012
321 West Golden Capital Dec 1, 2010
322 World Markets Sep 22, 2020
323 WorldWide CapitalFX Feb 8, 2019
324 XForex, owned and operated by XFR Financial Lt... Jul 19, 2016
325 Xtelus Profit Nov 30, 2020
326 You Trade Holdings Limited Jun 3, 2011
327 Zen Vybe Inc. Mar 27, 2020
328 ZenithOptions Feb 12, 2016
329 Ziptradex Limited (Ziptradex) May 21, 2020
330 Zulu Trade Inc. Mar 2, 2015
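To make the "outerHTML to rows to CSV" step concrete without a browser, here is a dependency-free sketch. It deliberately swaps in the standard library's html.parser and csv modules in place of pd.read_html() and to_csv(), so it shows the idea rather than the pandas approach from the answer. The names `TableRows`, `table_html_to_rows` and `rows_to_csv` are hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_html_to_rows(outer_html):
    """Parse one table's outerHTML into a list of rows (lists of strings)."""
    parser = TableRows()
    parser.feed(outer_html)
    parser.close()
    return parser.rows


def rows_to_csv(rows):
    """Serialize rows to CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()
```

When paginating, the header row repeats on every page, so one would keep all rows from the first page and skip the first row of each later page before writing the CSV.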
First, obligatory advance apologies - almost newbie here, and this is my first question; please be kind...
I'm struggling to scrape JavaScript-generated pages, in particular those of the Metropolitan Opera schedule. For any given month, I would like to create a calendar with just the name of the production and the date and time of each performance. I threw BeautifulSoup and Selenium at it, and I can get tons of info about the composer's love life, etc., but not these 3 elements. Any help would be greatly appreciated.
Link to a random month in their schedule
One thing that you should look for (in the future) on websites are calls to an API. I opened up Chrome Dev Tools (F12) and reloaded the page while in the Network tab.
I found two API calls, one for "productions" and one for "events". The "events" response has much more information. The code below makes a call to the "events" endpoint and then returns a subset of that data (specifically the title, date and time, according to your description).
I wasn't sure what you wanted to do with that data so I just printed it out. Let me know if the code needs to be updated/modified and I will do my best to help!
I wrote this code using Python 3.6.4
from datetime import datetime
import requests

BASE_URL = 'http://www.metopera.org/api/v1/calendar'
EVENT = """\
Title: {title}
Date: {date}
Time: {time}
---------------\
"""

def get_events(*, month, year):
    params = {
        'month': month,
        'year': year
    }
    r = requests.get('{}/events'.format(BASE_URL), params=params)
    r.raise_for_status()
    return r.json()

def get_name_date_time(*, events):
    result = []
    for event in events:
        d = datetime.strptime(event['eventDateTime'], '%Y-%m-%dT%H:%M:%S')
        result.append({
            'title': event['title'],
            'date': d.strftime('%A, %B %d, %Y'),
            'time': d.strftime('%I:%M %p')
        })
    return result

if __name__ == '__main__':
    events = get_events(month=11, year=2018)
    names_dates_times = get_name_date_time(events=events)
    for event in names_dates_times:
        print(EVENT.format(**event))
Console:
Title: Tosca
Date: Friday, November 02, 2018
Time: 08:00 PM
---------------
Title: Carmen
Date: Saturday, November 03, 2018
Time: 01:00 PM
---------------
Title: Marnie
Date: Saturday, November 03, 2018
Time: 08:00 PM
---------------
Title: Tosca
Date: Monday, November 05, 2018
Time: 08:00 PM
---------------
Title: Carmen
Date: Tuesday, November 06, 2018
Time: 07:30 PM
---------------
Title: Marnie
Date: Wednesday, November 07, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Thursday, November 08, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Friday, November 09, 2018
Time: 08:00 PM
---------------
Title: Marnie
Date: Saturday, November 10, 2018
Time: 01:00 PM
---------------
Title: Carmen
Date: Saturday, November 10, 2018
Time: 08:00 PM
---------------
Title: Mefistofele
Date: Monday, November 12, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Tuesday, November 13, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Wednesday, November 14, 2018
Time: 07:30 PM
---------------
Title: Carmen
Date: Thursday, November 15, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Friday, November 16, 2018
Time: 07:30 PM
---------------
Title: Tosca
Date: Saturday, November 17, 2018
Time: 01:00 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Saturday, November 17, 2018
Time: 08:00 PM
---------------
Title: Mefistofele
Date: Monday, November 19, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Tuesday, November 20, 2018
Time: 08:00 PM
---------------
Title: Il Trittico
Date: Friday, November 23, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Saturday, November 24, 2018
Time: 01:00 PM
---------------
Title: Mefistofele
Date: Saturday, November 24, 2018
Time: 08:00 PM
---------------
Title: Il Trittico
Date: Monday, November 26, 2018
Time: 07:30 PM
---------------
Title: Mefistofele
Date: Tuesday, November 27, 2018
Time: 07:30 PM
---------------
Title: Les Pêcheurs de Perles (The Pearl Fishers)
Date: Wednesday, November 28, 2018
Time: 07:30 PM
---------------
Title: La Bohème
Date: Thursday, November 29, 2018
Time: 07:30 PM
---------------
Title: Il Trittico
Date: Friday, November 30, 2018
Time: 07:30 PM
---------------
For reference, here is a link to the full JSON response from the events endpoint. There is a bunch more potentially interesting information you may want but I just grabbed the subset of what you asked for in the description.
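Since the goal was a calendar, one possible next step is to group the events by day. The sketch below assumes each event is a dict with the 'title' and 'eventDateTime' keys used in the code above; `group_by_day` is a hypothetical helper name.

```python
from collections import defaultdict
from datetime import datetime


def group_by_day(events):
    """Map each performance date to a time-sorted list of (time, title)."""
    calendar = defaultdict(list)
    for event in events:
        d = datetime.strptime(event["eventDateTime"], "%Y-%m-%dT%H:%M:%S")
        calendar[d.date()].append((d.time(), event["title"]))
    # Sort the days, and the performances within each day, chronologically.
    return {day: sorted(shows) for day, shows in sorted(calendar.items())}
```

Keeping real date and time objects as keys and values (instead of formatted strings) keeps the sorting correct; formatting with strftime can then happen at print time.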
I need to extract the info from this page: http://www.investing.com/currencies/usd-brl-historical-data. I need Date, Price, Open, High, Low, and Change %.
I'm new to Python, so I got stuck at this step:
import requests
from bs4 import BeautifulSoup
from datetime import datetime

url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup=BeautifulSoup(r.content,'lxml')
g_data = soup.find_all('table', {'class':'genTbl closedTbl historicalTbl'})
d=[]
for item in g_data:
    Table_Values = item.find_all('tr')
    N=len(Table_Values)-1
    for n in range(N):
        k = (item.find_all('td', {'class':'first left bold noWrap'})[n].text)
        print(item.find_all('td', {'class':'first left bold noWrap'})[n].text)
Here I have several problems:
The Price cells can be tagged with class 'redFont' or class 'greenFont'. How can I specify that I want items with either class? The Change % cells can also have class redFont or greenFont. The other columns are tagged differently. How can I extract them?
Is there a way to extract whole columns from the table?
Ideally I would like to have a DataFrame with the columns Date, Price, Open, High, Low, Change %.
Thanks
I have already answered here how to parse the table from that site, but since you want a DataFrame, just use pandas.read_html:
import pandas as pd
import requests

url = 'http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
df = pd.read_html(r.content, attrs={'id': 'curr_table'})[0]
Which will give you:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3609 3.4411 3.4465 3.3584 -2.36%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%
You can generally pass the URL directly to read_html, but for this particular site we get a 403 error using urllib2, which is the library used by read_html, so we need to use requests to get the HTML.
Here's a way to convert the HTML table into a nested list.
The solution is to find the specific table, then loop through each tr in the table, creating a sublist of the text of all the items inside that tr. The code to do this is a nested list comprehension.
import requests
from bs4 import BeautifulSoup
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
#first row is empty
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
pprint(tableRows)
This gets all the data from the table
[['Jun 08, 2016', '3.3614', '3.4411', '3.4465', '3.3584', '-2.34%'],
['Jun 07, 2016', '3.4421', '3.4885', '3.5141', '3.4401', '-1.36%'],
['Jun 06, 2016', '3.4896', '3.5265', '3.5295', '3.4840', '-1.09%'],
['Jun 05, 2016', '3.5280', '3.5280', '3.5280', '3.5280', '0.11%'],
['Jun 03, 2016', '3.5240', '3.5910', '3.5947', '3.5212', '-1.91%'],
['Jun 02, 2016', '3.5926', '3.6005', '3.6157', '3.5765', '-0.22%'],
['Jun 01, 2016', '3.6007', '3.6080', '3.6363', '3.5755', '-0.29%'],
['May 31, 2016', '3.6111', '3.5700', '3.6383', '3.5534', '1.11%'],
['May 30, 2016', '3.5713', '3.6110', '3.6167', '3.5675', '-1.11%'],
['May 27, 2016', '3.6115', '3.5824', '3.6303', '3.5792', '0.81%'],
['May 26, 2016', '3.5825', '3.5826', '3.5857', '3.5757', '-0.03%'],
['May 25, 2016', '3.5836', '3.5702', '3.6218', '3.5511', '0.34%'],
['May 24, 2016', '3.5713', '3.5717', '3.5903', '3.5417', '-0.04%'],
['May 23, 2016', '3.5728', '3.5195', '3.5894', '3.5121', '1.49%'],
['May 20, 2016', '3.5202', '3.5633', '3.5663', '3.5154', '-1.24%'],
['May 19, 2016', '3.5644', '3.5668', '3.6197', '3.5503', '-0.11%'],
['May 18, 2016', '3.5683', '3.4877', '3.5703', '3.4854', '2.28%'],
['May 17, 2016', '3.4888', '3.4990', '3.5300', '3.4812', '-0.32%'],
['May 16, 2016', '3.5001', '3.5309', '3.5366', '3.4944', '-0.96%'],
['May 13, 2016', '3.5340', '3.4845', '3.5345', '3.4630', '1.39%'],
['May 12, 2016', '3.4855', '3.4514', '3.5068', '3.4346', '0.95%'],
['May 11, 2016', '3.4528', '3.4755', '3.4835', '3.4389', '-0.66%'],
['May 10, 2016', '3.4758', '3.5155', '3.5173', '3.4623', '-1.15%'],
['May 09, 2016', '3.5164', '3.5010', '3.6766', '3.4906', '0.40%']]
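Every value in this nested list is still a string. As a small sketch, assuming each row has exactly the six fields shown above in that order, a row could be converted to typed values as follows. `parse_history_row` is a hypothetical helper name, and the month abbreviations are assumed to be English.

```python
from datetime import datetime


def parse_history_row(row):
    """Convert one row of strings into a date, four floats and a percentage."""
    date_text, price, open_, high, low, change = row
    return {
        "Date": datetime.strptime(date_text, "%b %d, %Y").date(),
        "Price": float(price),
        "Open": float(open_),
        "High": float(high),
        "Low": float(low),
        # '-2.34%' becomes -2.34 (percent points, not a fraction)
        "Change %": float(change.rstrip("%")),
    }
```

Mapping this function over tableRows gives a list of dicts, which can also be passed straight to a DataFrame constructor.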
If you want to convert it to a pandas dataframe you just need to also grab the table headings and add them
import requests
from bs4 import BeautifulSoup
import pandas
from pprint import pprint
url='http://www.investing.com/currencies/usd-brl-historical-data'
r = requests.get(url)
soup = BeautifulSoup(r.content,'html.parser')
table = soup.find("table", {"id" : "curr_table"})
tableRows = [[td.text for td in row.find_all("td")] for row in table.find_all("tr")[1:]]
#get headers for dataframe
tableHeaders = [th.text for th in table.find_all("th")]
#build df from tableRows and headers
df = pandas.DataFrame(tableRows, columns=tableHeaders)
print(df)
Then you'll get a dataframe that looks like this:
Date Price Open High Low Change %
0 Jun 08, 2016 3.3596 3.4411 3.4465 3.3584 -2.40%
1 Jun 07, 2016 3.4421 3.4885 3.5141 3.4401 -1.36%
2 Jun 06, 2016 3.4896 3.5265 3.5295 3.4840 -1.09%
3 Jun 05, 2016 3.5280 3.5280 3.5280 3.5280 0.11%
4 Jun 03, 2016 3.5240 3.5910 3.5947 3.5212 -1.91%
5 Jun 02, 2016 3.5926 3.6005 3.6157 3.5765 -0.22%
6 Jun 01, 2016 3.6007 3.6080 3.6363 3.5755 -0.29%
7 May 31, 2016 3.6111 3.5700 3.6383 3.5534 1.11%
8 May 30, 2016 3.5713 3.6110 3.6167 3.5675 -1.11%
9 May 27, 2016 3.6115 3.5824 3.6303 3.5792 0.81%
10 May 26, 2016 3.5825 3.5826 3.5857 3.5757 -0.03%
11 May 25, 2016 3.5836 3.5702 3.6218 3.5511 0.34%
12 May 24, 2016 3.5713 3.5717 3.5903 3.5417 -0.04%
13 May 23, 2016 3.5728 3.5195 3.5894 3.5121 1.49%
14 May 20, 2016 3.5202 3.5633 3.5663 3.5154 -1.24%
15 May 19, 2016 3.5644 3.5668 3.6197 3.5503 -0.11%
16 May 18, 2016 3.5683 3.4877 3.5703 3.4854 2.28%
17 May 17, 2016 3.4888 3.4990 3.5300 3.4812 -0.32%
18 May 16, 2016 3.5001 3.5309 3.5366 3.4944 -0.96%
19 May 13, 2016 3.5340 3.4845 3.5345 3.4630 1.39%
20 May 12, 2016 3.4855 3.4514 3.5068 3.4346 0.95%
21 May 11, 2016 3.4528 3.4755 3.4835 3.4389 -0.66%
22 May 10, 2016 3.4758 3.5155 3.5173 3.4623 -1.15%
23 May 09, 2016 3.5164 3.5010 3.6766 3.4906 0.40%