Fetch dynamically generated website content and write as text file? - python

I want to store this site as text file:
https://www.tradingview.com/symbols/BTCUSD/technicals/
But I have a hard time since its content depends on "buttons" that generate the content dynamically.
I wrote this code:
import requests
from bs4 import BeautifulSoup

results = requests.get("https://www.tradingview.com/symbols/BTCUSD/technicals/")
src = results.content
soup = BeautifulSoup(src, "html.parser")  # specify a parser explicitly
var = soup.find_all()
with open('readme.txt', 'w') as f:
    f.write(str(var))
But it appears it only fetches the source code and not the dynamically generated content. E.g. when clicking the button "1 minute" the content changes. I want some sort of "snapshot" of what "1 minute" generates as content and then sorta look for keywords "buy" or "sell" somehow later.
Really stuck with the first step of fetching the website's dynamic content... Can anyone help?

If you want to be able to get different content based on which buttons you click, you'll want to use Selenium instead of BeautifulSoup.
Here's a rough example which uses the page you provided:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
import os

driver_path = os.path.abspath('chromedriver')
driver = webdriver.Chrome(service=Service(driver_path))  # Selenium 4 style
driver.get('https://www.tradingview.com/symbols/BTCUSD/technicals/')
driver.implicitly_wait(10)
# Get the '1 minute' button element
btn_1min = driver.find_element(By.XPATH, '//*[@id="1m"]')
# Get the '15 minutes' button element
btn_15min = driver.find_element(By.XPATH, '//*[@id="15m"]')
# Decide which button to click on based on user input
choice = ''
while True:
    time_input = input('Choose which button to click:\n> ')
    if time_input == '1':
        btn_1min.click()  # Click the '1 minute' button
        choice = '1 min'
        break
    elif time_input == '15':
        btn_15min.click()  # Click the '15 minute' button
        choice = '15 min'
        break
    else:
        print("Invalid input - try again")
# Get the 'Summary' section of the selected page
summary = driver.find_element(By.XPATH, '//*[@id="technicals-root"]/div/div/div[2]/div[2]')
# Get the element containing the signal ("Buy", "Sell", etc.)
signal = summary.find_element(By.XPATH, '//*[@id="technicals-root"]/div/div/div[2]/div[2]/span[2]')
# Get the current text of the signal element
signal_txt = signal.text
print(f'{choice} Summary: {signal_txt}')
driver.quit()
This script gives us the option to choose which button to click via user input, and prints different content based on which button was chosen.
For example: If we input '1' when prompted, thereby telling the script to click the "1 minute" button, we get the signal currently displayed in the Summary section of this button's page, which in this case happens to be "BUY":
Choose which button to click:
> 1
1 min Summary: BUY
If you want to read up on how to use Selenium with Python, here's the documentation: https://selenium-python.readthedocs.io
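For the second half of the question (saving the snapshot and later scanning it for "buy" or "sell"), a plain keyword scan over the signal text or the saved file is enough. A minimal sketch; the keyword list is an assumption based on TradingView's usual signal labels:

```python
def find_signal(page_text):
    """Return the first trading-signal keyword found in page_text, or None.

    Two-word labels are checked first so "STRONG BUY" isn't reported as "BUY".
    """
    for keyword in ("STRONG BUY", "STRONG SELL", "BUY", "SELL", "NEUTRAL"):
        if keyword in page_text.upper():
            return keyword
    return None

# e.g. applied to the Selenium result and written out, as the question asks:
# with open('readme.txt', 'w') as f:
#     f.write(f'{choice} Summary: {find_signal(signal_txt)}')
```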

Related

Selenium - HTML doesn't always update after a click. Content in the browser changes, but I often get the same HTML from prior to the click

I'm trying to set up a simple webscraping script to pull every hyperlink from the discover cards on Bandcamp.
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.select import Select
import time

browser = webdriver.Chrome()
all_links = []
page = 1
url = "https://bandcamp.com/?g=all&s=new&p=0&gn=0&f=digital&w=-1"
browser.get(url)
while page < 6:
    page += 1
    # wait until discover cards are loaded
    test = WebDriverWait(browser, 20).until(EC.element_to_be_clickable(
        (By.XPATH, '//*[@id="discover"]/div[9]/div[2]/div/div[1]/div/table/tbody/tr[1]/td[1]/a/div')))
    # scrape hyperlinks for each of the 8 albums shown
    titles = browser.find_elements(By.CLASS_NAME, "item-title")
    links = [title.get_attribute('href') for title in titles[-8:]]
    all_links = all_links + links
    print(links)
    # pagination - click through the page buttons as the links are scraped
    page_nums = browser.find_elements(By.CLASS_NAME, 'item-page')
    for page_num in page_nums:
        if page_num.text.isnumeric():
            if int(page_num.text) == page:
                page_num.click()
                time.sleep(20)  # tried long waits and WebDriverWaits on various elements; the HTML still doesn't update
                break
I'm using print(links) to see where this is going wrong. In the Selenium browser, it clicks through the pages fine. Note that pagination via the URL parameters doesn't seem possible, as the discover cards often won't load unless you click the page buttons towards the bottom of the page. BeautifulSoup and Requests don't work either, for the same reason. The print calls return the following:
['https://neubauten.bandcamp.com/album/stimmen-reste-musterhaus-7?from=discover-new', 'https://cirka1.bandcamp.com/album/time?from=discover-new', 'https://futuramusicsound.bandcamp.com/album/yoga-meditation?from=discover-new', 'https://deathsoundbatrecordings.bandcamp.com/album/real-mushrooms-dsbep092?from=discover-new', 'https://riacurley.bandcamp.com/album/take-me-album?from=discover-new', 'https://terracuna.bandcamp.com/album/el-origen-del-viento?from=discover-new', 'https://hyper-music.bandcamp.com/album/hypermusic-vol-4?from=discover-new', 'https://defisis1.bandcamp.com/album/priceless?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://jarnosalo.bandcamp.com/album/here-lies-ancient-blob?from=discover-new', 'https://andreneitzel.bandcamp.com/album/allegasi-gold-2?from=discover-new', 'https://moonraccoon.bandcamp.com/album/prequels?from=discover-new', 'https://lolivone.bandcamp.com/album/live-at-the-berklee-performance-center?from=discover-new', 'https://nilswrasse.bandcamp.com/album/a-calling-from-the-desert-to-the-sea-original-motion-picture-soundtrack?from=discover-new', 'https://whitereaperaskingride.bandcamp.com/album/asking-for-a-ride?from=discover-new', 'https://collageeffect.bandcamp.com/album/emerald-network?from=discover-new', 'https://foxteethnj.bandcamp.com/album/through-the-blue?from=discover-new']
['https://finitysounds.bandcamp.com/album/kreme?from=discover-new', 'https://mylittlerobotfriend.bandcamp.com/album/amen-break?from=discover-new', 'https://electrinityband.bandcamp.com/album/rise?from=discover-new', 'https://abyssal-void.bandcamp.com/album/ritualist?from=discover-new', 'https://plataformarecs.bandcamp.com/album/v-a-david-lynch-experience?from=discover-new', 'https://hurricaneturtles.bandcamp.com/album/industrial-synth?from=discover-new', 'https://blackwashband.bandcamp.com/album/2?from=discover-new', 'https://worldwide-bitchin-records.bandcamp.com/album/wack?from=discover-new']
Each time it correctly pulls the first 8 albums on page 1, then for pages 2-4 it repeats the 8 albums on page 2, for pages 5-7 it repeats the 8 albums on page 5, and so on. Even though the page is updating (and the url changes) in the selenium browser, for some reason selenium is not recognizing any changes to the html so it repeats the same titles. Any idea where I've gone wrong?
Your definition of titles, i.e.
titles = browser.find_elements(By.CLASS_NAME, "item-title")
is a bad idea, because item-title is the class of many elements on the page. Picking titles[-8:] is another bad idea. It may sound good, because you think "each time I click a page, the new elements are added at the end", but that is not always the case. Yours is one of those cases where elements are not added sequentially.
So let's start by considering a class exclusive to the cards, for example discover-item. Open the DevTools, press CTRL+F and enter .discover-item. When the url is first loaded, it finds 8 results. Click next page and it finds 16 results; click again and it finds 24. To better see what's going on, I suggest running the following each time you click the "next" button.
el = driver.find_elements(By.CSS_SELECTOR, '.discover-item')
for i, e in enumerate(el):
    print(i, e.get_attribute('innerText').replace('\n', ' - '))
In particular, when arriving to page 3, you will see that the first item shown in page 1 (which in my case is 0 and friends - Jacob Stanley - alternative), is now printed at a different position (in my case 8 and friends - Jacob Stanley - alternative). What happened is that the items of page 3 were added at the beginning of the list, and so you can see why titles[-8:] was a bad choice.
So a better choice is to consider all cards each time you go to the next page, instead of the last 8 only (notice that the HTML of this site can contain no more than 24 cards), and then add all current cards to a set (since a set cannot contain duplicates, only new elements will be added).
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# scroll to cards
cards = WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".discover-item")))
driver.execute_script('arguments[0].scrollIntoView({block: "center", behavior: "smooth"});', cards[0])
time.sleep(1)
items = set()
while 1:
    links = driver.find_elements(By.CSS_SELECTOR, '.discover-item .item-title')
    # extract info and store it
    for idx, card in enumerate(cards):
        tit_art_gen = card.get_attribute('innerText').replace('\n', ' - ')
        href = links[idx].get_attribute('href')
        # print(idx, tit_art_gen)
        items.add(tit_art_gen + ' - ' + href)
    # click 'next' button if it is not disabled
    next_button = driver.find_element(By.XPATH, "//a[.='next']")
    if 'disabled' in next_button.get_attribute('class'):
        print('last page reached')
        break
    else:
        next_button.click()
        # wait until new elements are loaded
        cards_new = cards.copy()
        while cards_new == cards:
            cards_new = driver.find_elements(By.CSS_SELECTOR, '.discover-item')
            time.sleep(.5)
        cards = cards_new.copy()
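The set-based dedup is the key idea here, and it is easy to check in isolation: even when a stale click returns the same batch again, the set only grows when genuinely new links appear. A toy illustration (the album names are made up):

```python
items = set()
page1 = ["album-a", "album-b"]
page2_stale = ["album-a", "album-b"]  # HTML not yet refreshed after the click
page2_fresh = ["album-c", "album-d"]
for batch in (page1, page2_stale, page2_fresh):
    items.update(batch)  # duplicates are silently ignored
print(len(items))  # 4 unique links despite the stale repeat
```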

Trouble with accessing dropdown values while scraping using Python Selenium

I am new to web scraping. I am trying to scrape a website (for flight prices) using Python Selenium; it has drop-down menus for fields like departure city and arrival city. The first option listed should be selected and used.
Let's say the departure-city field is a button: on clicking it, a new HTML view loads with an input box and a list, where every list element contains a button.
While debugging I found that once my keyword is entered, the options loading screen appears, but the options never load, even after increasing the sleep time (I have attached the image for the same).
image with error
Manually there are no issues with the drop-down menu. As soon as I enter my departure city, the options load and I choose the respective location.
I also tried using ActionChains. Unfortunately, no luck. I am attaching my code below. Any help is appreciated.
# Manually trying to access the first element:
# Accessing the input field:
fly_from = browser.find_element("xpath", "//button[contains(., 'Departing')]").click()
element = WebDriverWait(browser, 10).until(ec.presence_of_element_located((By.ID, "location")))
# Sending the keyword:
element.send_keys("Los Angeles")
# Selecting the first element:
first_elem = browser.find_element("xpath", "//div[@class='classname default-padding']//div//ul//li[1]//button")
Using ActionChains:
fly_from = browser.find_element("xpath", "//button[contains(., 'Departing')]").click()
time.sleep(10)
element = WebDriverWait(browser, 10).until(ec.presence_of_element_located((By.ID, "location")))
element.send_keys("Los Angeles")
time.sleep(3)
list_elem = browser.find_element("xpath", "//div[@class='class name default-padding']//div//ul//li[1]//button")
size = element.size
action = ActionChains(browser)
action.move_to_element_with_offset(to_element=list_elem, xoffset=0.5 * size['width'], yoffset=70).click().perform()
action.click(on_element=list_elem)
action.perform()
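One concrete bug worth calling out in both snippets: XPath positions are 1-based, so `li[0]` (besides the stray extra bracket) can never match anything. A tiny helper that builds the locator for the n-th suggestion; the surrounding div/ul structure is taken from the question's snippets and may need adjusting to the real page:

```python
def nth_suggestion_xpath(n):
    """XPath for the n-th suggestion button; XPath indexing starts at 1."""
    if n < 1:
        raise ValueError("XPath positions start at 1, not 0")
    return f"//div[@class='classname default-padding']//ul/li[{n}]//button"

# usage sketch: wait until the suggestion is clickable, then click it
# first = WebDriverWait(browser, 10).until(
#     ec.element_to_be_clickable(("xpath", nth_suggestion_xpath(1))))
# first.click()
```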

Selenium Python Pass Date parameter using Calendar

I am trying to pull data from the following page:
https://www.lifeinscouncil.org/industry%20information/ListOfFundNAVs
I have to select the name of the company, select a date from the calendar, and click the Get Data button.
I am trying to achieve this with Selenium WebDriver using Chrome in Python, and I am stuck on how to pass the date parameter to the page.
It seems the page does a postback after a date is selected from the calendar.
The date needs to be selected from the calendar, otherwise the data is not returned by the webpage.
I have tried using the requests POST method as well, but I am not able to get the NAV data.
I need to iterate this over a period of 5 years on a daily (trading days) basis.
PS: I am bad at understanding DOM elements and have basic knowledge of Python and coding; by profession I am a data analyst.
Thanks in advance.
Kiran Jain
edit: adding current code below:
from selenium import webdriver

url = 'https://www.lifeinscouncil.org/industry%20information/ListOfFundNAVs'
opt = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
opt.add_argument("--start-maximized")
# opt.add_argument("--headless")
opt.add_argument("--disable-notifications")
opt.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=opt)
driver.get(url)
insurer = driver.find_element_by_id("MainContent_drpselectinscompany")
nav_date = driver.find_element_by_id('MainContent_txtdateselect')
get_data_btn = driver.find_element_by_id('MainContent_btngetdetails')
options = insurer.find_elements_by_tag_name("option")
data = []
row = {'FirmName', 'SFIN', 'fundName', 'NAVDate', 'NAV'}
for option in options:
    print('here')
    print(option.get_attribute("value") + ' ' + option.text)
    if option.text != '--Select Insurer--':
        option.click()
        driver.find_element_by_id("MainContent_imgbtncalender").click()  # calendar icon
        driver.find_element_by_link_text("June").click()  # month
        driver.find_element_by_link_text("25").click()  # date
        get_data_btn = driver.find_element_by_id('MainContent_btngetdetails')  # re-find: the page reloads after picking a date
        get_data_btn.click()
        print('clicked')
driver.quit()
The date is in an "a" tag, so you can select it by link text.
driver.find_element_by_id("MainContent_imgbtncalender").click()  # calendar icon
driver.find_element_by_link_text("27").click()  # date
As per your comment, I tried to traverse through the dates, but it only worked for that particular month. I also tried send_keys() on the date text box, and it's not working. Below is the code to traverse one month.
import time

driver.get("https://www.lifeinscouncil.org/industry%20information/ListOfFundNAVs")
driver.find_element_by_id("MainContent_drpselectinscompany").click()
driver.find_element_by_xpath("//option[starts-with(text(),'Aditya')]").click()
driver.find_element_by_id("MainContent_imgbtncalender").click()
driver.find_element_by_link_text("1").click()
driver.find_element_by_id("MainContent_btngetdetails").click()
dateval = 2
while dateval < 32:
    try:
        driver.find_element_by_id("MainContent_imgbtncalender").click()
        driver.find_element_by_link_text(str(dateval)).click()
        driver.find_element_by_id("MainContent_btngetdetails").click()
    except:
        driver.switch_to.default_content()
    dateval += 1
    time.sleep(2)
time.sleep(5)
driver.quit()
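For the 5-year iteration mentioned in the question, generating the candidate dates in Python first keeps the Selenium loop simple: only weekdays need to be visited. A sketch (exchange holidays are not excluded, so some clicks will still return no data):

```python
from datetime import date, timedelta

def weekdays(start, end):
    """Yield Monday-Friday dates from start to end inclusive."""
    d = start
    while d <= end:
        if d.weekday() < 5:  # 0=Mon .. 4=Fri
            yield d
        d += timedelta(days=1)

# e.g. for each d in weekdays(date(2017, 1, 1), date(2021, 12, 31)):
# open the calendar, navigate to d.month / d.year, click str(d.day),
# then click the Get Data button as above.
```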

radio button in google form with selenium python

I was writing a web scraper in Python using Selenium on a Google Form page, but I want to be able to select one particular radio button. I tried selecting all the buttons at once, and it ended up selecting the last one, as expected. Going through a multiple of a number is also not possible, as each group contains a different number of buttons.
So I want a way to get the number of radio buttons in a group and a way to select the desired one.
Test site
If there are any other suggestions, I will be happy to listen.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import Resources
from selenium.webdriver.support.select import Select

driver = webdriver.Chrome(executable_path=Resources.driverChrome)
driver.maximize_window()
driver.get(Resources.linkTest)
time.sleep(3)
try:
    email = driver.find_element_by_xpath("//input[@type='email']")
    if email.is_displayed() & email.is_enabled():
        email.send_keys(Resources.emailTest)
except:
    print("Email box was not found")
containers = driver.find_elements_by_xpath("//div[@class='freebirdFormviewerViewNumberedItemContainer']")
sNoBoxes = driver.find_elements_by_xpath("//input[@type='text']")
time.sleep(2)
radios = driver.find_elements_by_xpath("//div[@class='appsMaterialWizToggleRadiogroupOffRadio exportOuterCircle']")
for radio in radios:
    radio.click()
for container in containers:
    sNo = int(containers.index(container))
    print("\n\n" + str(sNo) + " " + container.text)
button = driver.find_element_by_xpath("//span[contains(text(),'Submit')]")
button.click()
time.sleep(3)
try:
    viewScore = driver.find_element_by_xpath("//span[contains(text(),'View score')]").click()
except:
    pass
First, you can find all radio buttons on the webpage with the following code:
radiobuttons = driver.find_elements_by_class_name("appsMaterialWizToggleRadiogroupElContainer")
Result: a list of radio button elements
Next, you are going to count the radio buttons. Note that the list is global across the whole form: the radio button "Option 2" is the 11th button, not the 2nd. If you want to click the very first radio button, use:
radiobuttons[0].click()
Result: the radio button is clicked
To click the n-th radio button, use:
radiobuttons[n-1].click()
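Since `find_elements` returns one flat list across all questions, clicking option j of question i means offsetting by the sizes of the earlier groups. A small helper sketch; the group sizes are whatever your particular form has, and both indices below are 0-based:

```python
def global_radio_index(group_sizes, question, option):
    """Map a 0-based (question, option) pair to an index into the flat
    list of radio buttons returned by find_elements."""
    if option >= group_sizes[question]:
        raise IndexError("option out of range for this question")
    return sum(group_sizes[:question]) + option

# e.g. if the first question has 9 options, "Option 2" of the second question
# is radiobuttons[global_radio_index([9, 4], 1, 1)] - the 11th button overall.
```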

python selenium: cannot click invisible element

I am trying to scrape the Google News page in the following way:
from selenium import webdriver
import time
from pprint import pprint
base_url = 'https://www.google.com/'
driver = webdriver.Chrome('/home/vincent/wintergreen/chromedriver') ## change here to your location of the chromedriver
driver.implicitly_wait(30)
driver.get(base_url)
input = driver.find_element_by_id('lst-ib')
input.send_keys("brexit key dates timetable schedule briefing")
click = driver.find_element_by_name('btnK')
click.click()
news = driver.find_element_by_link_text('News')
news.click()
tools = driver.find_element_by_link_text('Tools')
tools.click()
time.sleep(1)
recent = driver.find_element_by_css_selector('div.hdtb-mn-hd[aria-label=Recent]')
recent.click()
# custom = driver.find_element_by_link_text('Custom range...')
custom = driver.find_element_by_css_selector('li#cdr_opt span')
custom.click()
from_ = driver.find_element_by_css_selector('input#cdr_min')
from_.send_keys("9/1/2018")
to_ = driver.find_element_by_css_selector('input#cdr_max')
to_.send_keys("9/2/2018")
time.sleep(1)
go_ = driver.find_element_by_css_selector('form input[type="submit"]')
print(go_)
pprint(dir(go_))
pprint(go_.__dict__)
go_.click()
This script manages to enter the search terms, switch to the News tab, open the custom time period tab, and fill in the start and end dates, but it fails to click the 'Go' button after that point.
From the print and pprint statements at the end of the script, I can deduce that it does find the 'go' button successfully, but it is somehow unable to click on it. The error displays as selenium.common.exceptions.ElementNotVisibleException: Message: element not visible
Could anyone experienced with Selenium have a quick run at it and give me hints as to why it returns such an error?
Thx!
Evaluating the CSS selector using the developer tools in Chrome yields 4 matching elements.
use the following css instead:
go_ = driver.find_element_by_css_selector('#cdr_frm > input.ksb.mini.cdr_go')
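An alternative that doesn't depend on guessing which of the 4 matches is the right one: keep the original selector, but click whichever match is actually displayed (`find_element` returns the first match, which here is a hidden one, hence the ElementNotVisibleException). The helper below is plain Python, so it works with any list of elements exposing Selenium's `is_displayed`/`click`:

```python
def click_visible(elements):
    """Click and return the first displayed element from a list of matches;
    Google renders several identical submit inputs, only one visible at a time."""
    for el in elements:
        if el.is_displayed():
            el.click()
            return el
    raise RuntimeError("no visible element among the matches")

# usage sketch:
# click_visible(driver.find_elements_by_css_selector('form input[type="submit"]'))
```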
