How can I reach the following webpage using Python Requests?
https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page?FundId=10306
The page redirects to a notice until I click the two "Accept" buttons.
This is what I do:
import requests
s = requests.Session()
r = s.post("https://www.fidelity.com.hk/investor/en/important-notice.page?submit=true&componentID=1298599783876")
r = s.get("https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page?&FundId=10306")
How do I handle the first "Accept" button? I have checked and there is a cookie called "Accepted" - am I correct? Here is the button:
<a id="terms_use_accept" class="btn btn-default standard-btn smallBtn" title="Accept" href="javascript:void(0);">Accept</a>
First of all, requests is not a browser and there is no JavaScript engine built-in.
But you can mimic the underlying logic by inspecting what goes on in the browser when you click "Accept". This is where the Browser Developer Tools come in handy.
If you click "Accept" in the first Accept/Decline "popup", an "accepted=true" cookie is set. As for the second "Accept", here is how the button link looks in the source code:
<a href="javascript:agree()">
<img src="/static/images/investor/en/buttons/accept_Btn.jpg" alt="Accept" title="Accept">
</a>
When you click the button, the agree() function is called. Here is what it does:
function agree() {
    $("form[name='agreeFrom']").submit();
}
In other words, the agreeFrom form is submitted. The form is hidden, but you can find it in the source code:
<form name="agreeFrom" action="/investor/en/important-notice.page?submit=true&componentID=1298599783876" method="post">
<input value="Agree" name="iwPreActions" type="hidden">
<input name="TargetPageName" type="hidden" value="en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends">
<input type="hidden" name="FundId" value="10306">
</form>
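For completeness, here is a rough sketch of submitting that hidden form directly with requests - the action URL and field names are copied from the form above, but whether the server expects anything else (extra cookies or headers) has not been verified:

import requests

fund_id = '10306'
base = 'https://www.fidelity.com.hk'

with requests.Session() as session:
    # POST the hidden agreeFrom form, mimicking what agree() does in the browser
    session.post(
        base + '/investor/en/important-notice.page',
        params={'submit': 'true', 'componentID': '1298599783876'},
        data={
            'iwPreActions': 'Agree',
            'TargetPageName': 'en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends',
            'FundId': fund_id,
        },
    )
    # then request the target page within the same session
    response = session.get(
        base + '/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page',
        params={'FundId': fund_id},
    )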
That is the form-submission route. There is an easier option, though. If you click "Accept" and inspect which cookies are set, you'll notice that besides "accepted" there are 4 new cookies:
"irdFundId" with a "FundId" value from the "FundId" form input or a value from the requested URL (see "?FundId=10306")
"isAgreed=yes"
"isExpand=true"
"lastAgreedTime" with a timestamp
Let's use this information to build a solution with requests + BeautifulSoup (for the HTML parsing part):
import time
from bs4 import BeautifulSoup
import requests
from requests.cookies import cookiejar_from_dict
fund_id = '10306'
last_agreed_time = str(int(time.time() * 1000))
url = 'https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page'
with requests.Session() as session:
    session.headers = {'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30'}
    session.cookies = cookiejar_from_dict({
        'accepted': 'true',
        'irdFundId': fund_id,
        'isAgreed': 'yes',
        'isExpand': 'true',
        'lastAgreedTime': last_agreed_time
    })

    response = session.get(url, params={'FundId': fund_id})
    soup = BeautifulSoup(response.content, 'html.parser')

    print(soup.title.get_text())
It prints:
Fidelity Funds - America Fund A-USD| Fidelity
which means we are seeing the desired page.
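From there you can parse whatever you need out of the page. As a rough, unverified sketch (the exact markup of the NAV/dividend tables is not shown here, so treat this as a starting point rather than a working extractor):

# continue from the `soup` object above; this simply dumps every table row it finds
for table in soup.find_all('table'):
    for row in table.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        if cells:
            print(cells)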
You can also approach it with a browser automation tool called selenium:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox() # could also be headless: webdriver.PhantomJS()
driver.get('https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page?FundId=10306')
# switch to the popup
frame = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "iframe.cboxIframe")))
driver.switch_to.frame(frame)
# click accept
accept = driver.find_element_by_link_text('Accept')
accept.click()
# switch back to the main window
driver.switch_to.default_content()
# click accept
accept = driver.find_element_by_xpath('//a[img[@title="Accept"]]')
accept.click()
# wait for the page title to load
WebDriverWait(driver, 10).until(EC.title_is("Fidelity Funds - America Fund A-USD| Fidelity"))
# TODO: extract the data from the page
You can't handle JavaScript using requests or the urllib modules. But based on my knowledge (which is not much) I'll tell you how I would solve this problem.
This site uses a specific cookie to know whether you have already accepted its policy. If you haven't, the server redirects you to the notice page. Look for that cookie using some browser add-on and set it manually so that the website shows you the content you're looking for.
Another way is to use Qt's built-in web browser (which uses WebKit), which lets you execute JavaScript code. Simply call evaluateJavaScript("agree();") and there you go.
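A very rough sketch of that idea with PyQt4's QtWebKit is below. It is untested, and it glosses over the fact that agree() triggers a new page load which you would also have to wait for:

import sys
from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebPage

app = QApplication(sys.argv)
page = QWebPage()

def on_load_finished(ok):
    # call the page's own agree() function, as the button click would
    page.mainFrame().evaluateJavaScript("agree();")
    # in a real script you would wait for the next loadFinished before reading the HTML
    print(page.mainFrame().toHtml())
    app.quit()

page.loadFinished.connect(on_load_finished)
page.mainFrame().load(QUrl('https://www.fidelity.com.hk/investor/en/fund-prices-performance/fund-price-details/factsheet-historical-nav-dividends.page?FundId=10306'))
sys.exit(app.exec_())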
Hope it helps.
Related
I want to extract the chat-id from Telegram web version z. I have done it on Telegram web version k, but it is not present in the z version. I looked everywhere but could not find any element containing the chat-id. I know I can get the chat-id from the URL after opening the chat, but I cannot open the chat due to some reason.
The following is the basic code to open the telegram.
from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.firefox import GeckoDriverManager
import sys
URL = 'https://web.telegram.org/'
firefox_options = webdriver.FirefoxOptions()
path_to_firefox_profile = "output_files\\firefox\\ghr2wgpa.default-release"
profile = webdriver.FirefoxProfile(path_to_firefox_profile)
profile.set_preference("dom.webdriver.enabled", False)
profile.set_preference('useAutomationExtension', False)
firefox_options.set_preference("general.useragent.override", 'user-agent=Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0')
profile.set_preference("security.mixed_content.block_active_content", False)
profile.update_preferences()
firefox_options.add_argument("--width=1400")
firefox_options.add_argument("--height=1000")
driver_installation = GeckoDriverManager().install()
service = Service(driver_installation)
if sys.platform == 'win32':
    from subprocess import CREATE_NO_WINDOW
    service.creationflags = CREATE_NO_WINDOW

driver = webdriver.Firefox(options=firefox_options, firefox_profile=profile, service=service)
driver.get(URL)
driver.close()
driver.quit()
All I know is that the chat-id is passed to some function in common.js when we click on the chat to open it.
The chat-id is available as the following attribute.
data-peer-id="777000"
Thanks in advance.
If the goal is to open a chat first, then I think you can locate it not by its id but by its name. This is what I see in the DOM tree of this version:
<div class="ListItem Chat chat-item-clickable group selected no-selection has-ripple"....
<div class="ListItem-button"
<div class="info"
<div class="info-row"
<h3 class="fullName" > Unique name is here>
So, as an option, you find the unique chat via its name, open it, and grab the id from the URL, because the chat will already be open - see the sketch below.
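A rough Selenium sketch along those lines (the selectors come from the snippet above and the chat name is a placeholder, so adjust both; the URL format is an assumption as well):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chat_name = "My Chat"  # placeholder: the visible name of the chat you want to open

wait = WebDriverWait(driver, 30)
# find the list item whose h3.fullName matches the chat name, then click it
chat = wait.until(EC.element_to_be_clickable((
    By.XPATH,
    f'//div[contains(@class, "ListItem")][.//h3[@class="fullName" and normalize-space()="{chat_name}"]]'
)))
chat.click()

# once the chat is open, the id should appear in the URL fragment
wait.until(lambda d: "#" in d.current_url)
chat_id = driver.current_url.split("#")[-1]
print(chat_id)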
I am using the following code to attempt to keep clicking a "Load More" button until all page results are shown on the website:
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def startWebDriver():
    global driver
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument("--incognito")
    chrome_options.add_argument("--window-size=1920x1080")
    driver = webdriver.Chrome(options=chrome_options)

startWebDriver()
driver.get("https://together.bunq.com/all")
time.sleep(4)

while True:
    try:
        wait = WebDriverWait(driver, 10, 10)
        element = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "[title='Load More']")))
        element.click()
        print("Loading more page results")
    except:
        print("All page results displayed")
        break
However, since the button click does not change the URL, no new data is loaded into chromedriver and the while loop will break on the second iteration.
Selenium is overkill for this. You only need requests. Logging one's network traffic reveals that at some point JavaScript makes an XHR HTTP GET request to a REST API endpoint, the response of which is JSON and contains all the information you're likely to want to scrape.
One of the query-string parameters for that endpoint URL is page[offset], which is used to offset the query results for pagination (in this case the "load more button"). A value of 0 corresponds to no offset, or "start at the beginning". Increment this value to suit your needs - in a loop would probably be a good place to do this.
Simply imitate that XHR HTTP GET request - copy the API endpoint URL and query-string parameters and request headers, then parse the JSON response:
def get_discussions():
    import requests

    url = "https://together.bunq.com/api/discussions"
    params = {
        "include": "user,lastPostedUser,tags,firstPost",
        "page[offset]": 0
    }
    headers = {
        "user-agent": "Mozilla/5.0"
    }

    response = requests.get(url, params=params, headers=headers)
    response.raise_for_status()

    yield from response.json()["data"]

def main():
    for discussion in get_discussions():
        print(discussion["attributes"]["title"])
    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
What's new in App Update 18.8.0
Local Currencies Accounts Fees
Local Currencies
Spanish IBANs Service Coverage
bunq Update 18 FAQ
Phishing and Spoofing - The new ways of scamming, explained
Easily rent a car with your bunq credit card
Giveaway - Hallo Deutschland!
Giveaway - Hello Germany!
True Name: Feedback
True Name
What plans are available?
Everything about refunds
Identity verification explained!
When will I receive my payment?
Together Community Guidelines
What is my Tax Identification Number (TIN)?
How do I export a bank statement?
How do I change my contact info?
Large cash withdrael
If this is a new concept for you, I would suggest you look up tutorials on how to use your browser's developer tools (Google Chrome's Devtools, for example), how to log your browser's network traffic, REST APIs, HTTP, etc.
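If you want more than the first batch of results, a sketch of paginating by bumping page[offset] might look like the following (the stop condition - an empty "data" list - is an assumption; verify it against the real responses):

import requests

def get_all_discussions():
    url = "https://together.bunq.com/api/discussions"
    headers = {"user-agent": "Mozilla/5.0"}
    offset = 0
    while True:
        params = {
            "include": "user,lastPostedUser,tags,firstPost",
            "page[offset]": offset,
        }
        response = requests.get(url, params=params, headers=headers)
        response.raise_for_status()
        data = response.json()["data"]
        if not data:  # assumed stop condition: an empty page means we have everything
            break
        yield from data
        offset += len(data)  # advance by however many results this page returned

for discussion in get_all_discussions():
    print(discussion["attributes"]["title"])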
I am currently using Python to scrape a site with thousands of pages, and it works fine, but it takes a couple of hours to go through all the pages (partly because I add a short delay between pages, which I believe is fair to the provider of the site).
However on the real site there is a dropdown menu with an option to display more results on the page. In the HTML that looks like this:
<div class="page-sizer">
<select id="itemsPerPage" class="form-control input-sm">
<option value="10" selected>10</option>
<option value="50" >50</option>
<option value="200" >200</option>
</select>
</div>
<script>
$(document).on('bb:ready', function () {
var pageSizeOptions = {
setPageSizeUrl: '/Pager/SetPageSize'
};
ScrapeThisWebsite.PageSize.init(pageSizeOptions);
});
</script>
Is there any way for me to automatically display 200 results per page instead of only 10, and save some time for the provider and me? The selection is not reflected in the URL, so if I copy the page link into another browser it returns to the default.
I'm going through the pages using the following simple steps:
myheaders = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36', "Upgrade-Insecure-Requests": "1","DNT": "1","Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Language": "en-US,en;q=0.5","Accept-Encoding": "gzip, deflate"}
page = requests.get(url,headers=myheaders)
Is it linked to how the page is loaded?
You can use the selenium library to interact with the dropdown. It might also be worth checking whether there is an API from which you could fetch the data directly: inspect the page, go to the Network tab and look at Fetch/XHR; if an API is there, you can fetch the data with the requests library.
Here is how to select the value in the dropdown using selenium. More on select in the docs.
from selenium import webdriver
from selenium.webdriver.support.ui import Select
driver = webdriver.Chrome('/Users/username/chromedriver') # here is the path where your web driver is
#get the website
driver.get('https://yourwebsite.com')
# Get the element by ID
dropdown= Select(driver.find_element_by_id('itemsPerPage'))
#Click on the dropdown
dropdown.select_by_value('200')
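Alternatively, if you want to stay with requests: the page's script posts the new size to /Pager/SetPageSize, and the server presumably remembers it per session. A hedged sketch of that idea follows - the parameter name "pageSize" and both URLs are guesses, so check the actual request in the Network tab first:

import requests

base = "https://www.example-site.com"  # placeholder for the real site root

with requests.Session() as s:
    s.headers.update({"User-Agent": "Mozilla/5.0"})
    # tell the server to use 200 items per page; the field name "pageSize" is an assumption
    s.post(base + "/Pager/SetPageSize", data={"pageSize": 200})
    # subsequent requests in the same session should then return 200 items per page
    page = s.get(base + "/your/listing/page")  # placeholder listing URL
    print(len(page.text))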
I was trying to scrape some image data from some stores. For example, I was looking at some images from Nordstrom (tested with 'https://www.nordstrom.com/browse/men/clothing/sweaters').
I had initially used requests.get() to get the code, but I noticed that I was getting some JavaScript -- and upon further research I found that this occurred because the HTML is dynamically loaded using JavaScript.
To remedy this issue, following this post (Python requests.get(url) returning javascript code instead of the page html), I tried to use Selenium to get the HTML code. However, I still ran into issues trying to access all the HTML: it was still returning a lot of JavaScript. Finally, I added in some time delay, as I thought maybe it needed some time to load in all of the HTML, but this still failed. Is there a way to get all the HTML using Selenium? I have attached the current code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time
def create_browser(webdriver_path):
    # create a selenium object that mimics the browser
    browser_options = Options()
    # headless tag creates an invisible browser
    browser_options.add_argument("--headless")
    browser_options.add_argument('--no-sandbox')
    browser = webdriver.Chrome(webdriver_path, chrome_options=browser_options)
    print("Done Creating Browser")
    return browser
url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('path/to/chromedriver_win32/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)
time.sleep(10)
html_source = browser.page_source
print(html_source)
Is there something that I am not doing properly to load in all of the html code?
browser.page_source returns the initial HTML source, not the current DOM state. Try:
time.sleep(10)
html_source = browser.find_element_by_tag_name('html').get_attribute('outerHTML')
I would recommend reading "Test-Driven Development with Python", you'll get an answer for your question and so many more. You can read it for free here: https://www.obeythetestinggoat.com/ (and then you can also buy it ;-) )
Regarding your question, you have to wait that the element you're looking for is effectively loaded. You may use time.sleep but you'll get strange behavior depending on the speed of your internet connection and browser.
A better solution is explained here in depth: https://www.obeythetestinggoat.com/book/chapter_organising_test_files.html#self.wait-for
You can use the proposed solution:
import time
from selenium.common.exceptions import WebDriverException

MAX_WAIT = 10  # seconds

def wait_for(self, fn):
    start_time = time.time()
    while True:
        try:
            return fn()
        except (AssertionError, WebDriverException) as e:
            if time.time() - start_time > MAX_WAIT:
                raise e
            time.sleep(0.5)
fn is just a function finding the element in the page.
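A hypothetical usage, assuming a Selenium driver on self.browser and a CSS selector of your own choosing:

# keep retrying until the element exists (or raise after MAX_WAIT seconds)
row = self.wait_for(
    lambda: self.browser.find_element_by_css_selector('#results-table tr')
)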
Just add a user agent. Chrome's headless user agent string contains "HeadlessChrome", and that is the problem.
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
def create_browser(webdriver_path):
    # create a selenium object that mimics the browser
    browser_options = Options()
    browser_options.add_argument('--headless')
    browser_options.add_argument('--user-agent="Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"')
    browser = webdriver.Chrome(webdriver_path, options=browser_options)
    print("Done Creating Browser")
    return browser
url = 'https://www.nordstrom.com/browse/men/clothing/sweaters'
browser = create_browser('C:/bin/chromedriver.exe')
browser.implicitly_wait(10)
browser.get(url)
divs = browser.find_elements_by_tag_name('a')
for div in divs:
    print(div.text)
Output:-
Displays all links on the page..
Patagonia Better Sweaterยฎ Quarter Zip Pullover
(56)
Nordstrom Men's Shop Regular Fit Cashmere Quarter Zip Pullover (Regular & Tall)
(73)
Nordstrom Cashmere Crewneck Sweater
(51)
Cutter & Buck Lakemont Half Zip Sweater
(22)
Nordstrom Washable Merino Quarter Zip Sweater
(2)
ALLSAINTS Mode Slim Fit Merino Wool Sweater
Process finished with exit code -1
I'm using the Selenium library in Python to scrape a website written in JS. My strategy is to move through the website using Selenium and, at the right time, scrape with BeautifulSoup. This works just fine in simple tests, except when I need to click the "<" (previous) button of the date picker.
The "class" of the button changes on hover, so I'm using ActionChains to move to the element and click on it (I'm also using sleep to give the browser enough time to load the page). Python is not throwing any exception, but the click is not working (i.e. the calendar is not moving backwards).
Below I provide the mentioned website and the code I wrote with an example. Do you have any idea why this is happening and/or how can I overcome this issue? Thank you very very much.
Website = https://burocomercial.profeco.gob.mx/index.jsp
Code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
import time
driver = webdriver.Chrome(path_to_webdriver)
driver.get('https://burocomercial.profeco.gob.mx/index.jsp') #access website
# Search bar and search button
search_bar = driver.find_elements_by_xpath('//*[@id="txtbuscar"]')
search_button = driver.find_element_by_xpath('//*[@id="contenido"]/div[2]/div[2]/div[2]/div/div[2]/div/button')
# Perform search
search_bar[0].send_keys("inmobiliaria")
search_button.click()
# Select result
time.sleep(2)
xpath='//*[@id="resultados"]/div[4]/table/tbody/tr[1]/td[5]/button'
driver.find_elements_by_xpath(xpath)[0].click()
# Open calendar
time.sleep(5)
driver.find_element_by_xpath('//*[#id="calI"]').click() #opens calendar
time.sleep(2)
# Hover-and-click on "<" (Here's the problem!!!)
cal_button = driver.find_element_by_xpath('//div[@id="ui-datepicker-div"]/div/a')
time.sleep(4)
ActionChains(driver).move_to_element(cal_button).perform() #hover
prev_button = driver.find_element_by_class_name('ui-datepicker-prev') #catch element whose class was changed by the hover
ActionChains(driver).click(prev_button).perform() #click
time.sleep(1)
print('clicked on it a second ago. No exception was raised, but the click was not performed')
time.sleep(1)
This is a different approach, using requests. I think Selenium should be the last option when doing web scraping; usually it is possible to retrieve the data from a webpage by emulating the requests made by the web application.
import requests
from bs4 import BeautifulSoup as BS
headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.101 Safari/537.36'}
## Starts session
s = requests.Session()
s.headers = headers
url_base = 'https://burocomercial.profeco.gob.mx/'
ind = 'index.jsp'
resp0 = s.get(url_base+ind) ## First request, to get the 'name' parameter that is dynamic
soup0 = BS(resp0.text, 'lxml')
param_name = soup0.select_one('input[id="txtbuscar"]')['name']
action = 'BusGeneral' ### The action when submit the form
keyword = 'inmobiliaria' # Word to search
data_buscar = {param_name:keyword,'yy':'2017'} ### Data submitted
resp1 = s.post(url_base+action,data=data_buscar) ## second request: make the search
resp2 = s.get(url_base+ind) # Third request: retrieve the results
print(resp2.text)
queja = 'Detalle_Queja.jsp' ## Action when Quejas selected
data_queja = {'Lookup':'2','Val':'1','Bus':'2','FI':'28-Nov-2016','FF':'28-Feb-2017','UA':'0'} # Data for queja form
## Lookup is the number of the row in the table, FI is the initial date and FF, the final date, UA is Unidad Administrativa
## You can change these parameters to obtain different queries.
resp3 = s.post(url_base+queja,data=data_queja) # retrieve Quejas results
print(resp3.text)
With this I got:
'\r\n\r\n\r\n\r\n\r\n\r\n1|<h2>ABITARE PROMOTORA E INMOBILIARIA, SA DE CV</h2>|0|0|0|0.00|0.00|0|0.00|0.00|0.00|0.00|0 % |0 % ||2'
Which contains the data that is used in the webpage.
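If you go this route, here is a small sketch for pulling fields out of that pipe-delimited payload (the field order is only inferred from the sample above, so verify it before relying on it):

from bs4 import BeautifulSoup as BS

fields = resp3.text.strip().split('|')
# the second field carries the provider name wrapped in <h2>...</h2>
name = BS(fields[1], 'lxml').get_text(strip=True)
print(name)        # e.g. ABITARE PROMOTORA E INMOBILIARIA, SA DE CV
print(fields[2:])  # the remaining numeric fields (counts and percentages)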
Maybe this answer is not exactly what you are looking for, but I think it could be easier for you to use requests.
You don't need to hover the <, just click it.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome(path_to_webdriver)
driver.get('https://burocomercial.profeco.gob.mx/index.jsp') #access website
# set up wait
wait = WebDriverWait(driver, 10)
# Perform search
driver.find_element_by_id('txtbuscar').send_keys("inmobiliaria")
driver.find_element_by_css_selector('button[alt="buscar"]').click()
# Select result
xpath='//*[#id="resultados"]/div[4]/table/tbody/tr[1]/td[5]/button'
wait.until(EC.element_to_be_clickable((By.XPATH, xpath))).click()
# Open calendar
wait.until(EC.element_to_be_clickable((By.ID, 'calI'))).click() #opens calendar
wait.until(EC.visibility_of_element_located((By.ID, 'ui-datepicker-div')))
# Click on "<"
wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'a[title="Ant"]'))).click()
A few things
If your XPath consists only of an ID, just use .find_element_by_id(). It's faster and easier to read.
If you are only using the first element in a collection, e.g. search_bar, just use .find_element_* instead of .find_elements_* and search_bar[0].
Don't use sleeps. Sleeps are a bad practice and result in unreliable tests. Instead use expected conditions, e.g. to wait for an element to be clickable.