I am using playwright for test automation.
Every test run creates a new instance of chromium.
When I pass --auto-open-devtools-for-tabs it opens devtools as expected.
But I need to go one step further and have the Preserve log checkbox enabled.
Tests are fast and I need to see requests before redirect.
Based on this answer, one trick would be to launch the browser with a persistent context, close it, and then edit the preferences file to set the Preserve log value.
import json
from playwright.sync_api import sync_playwright

playwright = sync_playwright().start()

user_data_dir = './prefs'
pref_file_path = user_data_dir + '/Default/Preferences'

# First launch just creates the profile; close it so the Preferences file can be edited
browser = playwright.chromium.launch_persistent_context(
    user_data_dir, headless=False, args=['--auto-open-devtools-for-tabs'])
browser.close()

with open(pref_file_path, 'r') as pref_file:
    data = json.load(pref_file)
data['devtools'] = {
    'preferences': {
        'network_log.preserve-log': '"true"'
    }
}
with open(pref_file_path, 'w') as pref_file:
    json.dump(data, pref_file)

# Relaunch with the edited profile
browser = playwright.chromium.launch_persistent_context(
    user_data_dir, headless=False, args=['--auto-open-devtools-for-tabs'])
page = browser.new_page()
page.goto('https://stackoverflow.com/questions/63661366/puppeteer-launch-chromium-with-preserve-log-enabled')
I'm trying to automate downloading the holdings of Vanguard funds from the web. The links resolve through JavaScript, so I'm using Pyppeteer, but I'm not getting the file. Note: the link says CSV, but it provides an Excel file.
From my browser it works like this:
Go to the fund URL, e.g.
https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio
Follow the link, "See count total holdings"
Click the link, "Export to CSV"
My attempt to replicate this in Python follows. The first link click seems to work because I get different HTML, but the second click gives me the same page, not a download.
import asyncio
from pyppeteer import launch
import os

async def get_page(browser, url):
    page = await browser.newPage()
    await page.goto(url)
    return page

async def fetch(url):
    browser = await launch(options={'args': ['--no-sandbox']})  # headless=True,
    page = await get_page(browser, url)
    await page.waitFor(2000)

    # save the page so we can see the source
    wkg_dir = 'vanguard'
    t_file = os.path.join(wkg_dir, '8225.htm')
    with open(t_file, 'w', encoding="utf-8") as ef:
        ef.write(await page.content())

    accept = await page.xpath('//a[contains(., "See count total holdings")]')
    print(f'Found {len(accept)} "See count total holdings" links')
    if accept:
        await accept[0].click()
        await page.waitFor(2000)
    else:
        print('DID NOT FIND THE LINK')
        return False

    # save the pop-up page for debug
    t_file = os.path.join(wkg_dir, 'first_page.htm')
    with open(t_file, 'w', encoding="utf-8") as ef:
        ef.write(await page.content())

    links = await page.xpath('//a[contains(., "Export to CSV")]')
    print(f'Found {len(links)} "Export to CSV" links')  # 3 of these
    for i, link in enumerate(links):
        print(f'Trying link {i}')
        await link.click()
        await page.waitFor(2000)
        t_file = os.path.join(wkg_dir, f'csv_page{i}.csv')
        with open(t_file, 'w', encoding="utf-8") as ef:
            ef.write(await page.content())
    return True
#---------- Main ------------
# Set constants and global variables
url = 'https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio'
loop = asyncio.get_event_loop()
status = loop.run_until_complete(fetch(url))
Would love to hear suggestions from anyone who knows Puppeteer / Pyppeteer well.
First of all, page.waitFor(2000) should be a last resort. That's a race condition that can lead to a false negative at worst and slows your scrape down at best. I recommend page.waitForXPath, which spawns a tight polling loop so your code continues as soon as the XPath becomes available.
Also, on the topic of element selection, I'd use text() in your XPath instead of ., which is more precise.
I'm not sure how ef.write(await page.content()) is working for you -- that should only give you page HTML, not the XLSX download. The link click triggers the download via a dialog. Accepting this download involves enabling Chrome downloads with:
await page._client.send("Page.setDownloadBehavior", {
    "behavior": "allow",
    "downloadPath": r"C:\Users\you\Desktop"  # TODO set your path
})
The next hurdle is bypassing or suppressing the "multiple downloads" permission prompt Chrome displays when you try to download multiple files on the same page. I wasn't able to figure out how to stop this, so my code just navigates to the page for each link as a workaround. I'll leave my solution as suboptimal but functional and let others (or my future self) improve on it.
By the way, two of the XLSX files, at indices 1 and 2, seem to be identical. This code downloads all 3 anyway, but you can probably skip the last one depending on whether the page changes over time -- I'm not familiar with it.
I'm using a trick for clicking non-visible elements, using the browser console's click rather than Puppeteer: page.evaluate("e => e.click()", csv_link)
Here's my attempt:
import asyncio
from pyppeteer import launch

async def get_csv_links(page):
    await page.goto(url)
    xp = '//a[contains(text(), "See count total holdings")]'
    await page.waitForXPath(xp)
    accept, = await page.xpath(xp)
    await accept.click()
    xp = '//a[contains(text(), "Export to CSV")]'
    await page.waitForXPath(xp)
    return await page.xpath(xp)

async def fetch(url):
    browser = await launch(headless=False)
    page, = await browser.pages()
    await page._client.send("Page.setDownloadBehavior", {
        "behavior": "allow",
        "downloadPath": r"C:\Users\you\Desktop"  # TODO set your path
    })
    csv_links = await get_csv_links(page)

    for i in range(len(csv_links)):
        # open a fresh page each time as a hack to avoid multiple file prompts
        csv_link = (await get_csv_links(page))[i]
        await page.evaluate("e => e.click()", csv_link)

        # let download finish; this is a race condition
        await page.waitFor(3000)

if __name__ == "__main__":
    url = "https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio"
    asyncio.get_event_loop().run_until_complete(fetch(url))
Notes for improvement:
Try an arg like --enable-parallel-downloading or a setting like 'profile.default_content_setting_values.automatic_downloads': 1 to suppress the "multiple downloads" warning.
Figure out how to wait for all downloads to complete so the final waitFor(3000) can be removed. Another option here might involve polling for the files you expect (see the sketch after this list); you can visit the linked thread for ideas.
Figure out how to download headlessly.
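For the second note, here's a rough, untested sketch of what polling for download completion could look like. Chrome writes in-progress downloads with a .crdownload suffix, so one option is to wait until the expected number of files exists and no partial files remain (the download directory and expected count are assumptions, not part of the original answer):
import glob
import os
import time

def wait_for_downloads(download_dir, expected_count, timeout=60):
    """Poll the download directory until `expected_count` finished files exist
    and no .crdownload partials remain, or until `timeout` seconds pass."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        partial = glob.glob(os.path.join(download_dir, "*.crdownload"))
        finished = [f for f in os.listdir(download_dir)
                    if not f.endswith(".crdownload")]
        if not partial and len(finished) >= expected_count:
            return True
        time.sleep(0.5)
    return False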
Other resources for posterity:
How do I get puppeteer to download a file?
Download content using Pyppeteer (will show a 404 unless you have 10k+ reputation)
I'm writing a Python program which will test some functions on a website. It will log in to the site, check its version, and run some tests depending on that version. I want to write a few tests for this site, but a few things will repeat, for example logging in to the site.
I tried to split my code into functions, like hue_login(), and use it in every test that needs to log in to the site. To log in, I use Selenium WebDriver. But if I split the code into small functions and call one from another function that also uses Selenium WebDriver, I end up with two browser windows: one from my hue_login() function, where the function logs me in, and a second browser window where it tries to open the URL I want to go to after logging in. Of course, because I am not logged in in the second browser window, the site won't show and the other tests will fail (tests from this second function).
Example:
def hue_version():
    url = global_var.domain + global_var.about
    response = urllib.request.urlopen(url)
    htmlparser = etree.HTMLParser()
    xpath = etree.parse(response, htmlparser).xpath('/html/body/div[4]/div/div/h2/text()')
    string = "".join(xpath)
    pattern = re.compile(r'(\d{1,2}).(\d{1,2}).(\d{1,2})')
    return pattern.search(string).group()

hue_ver = hue_version()
print(hue_ver)

if hue_ver == '3.9.0':
    do something
elif hue_ver == '3.7.0':
    do something else
else:
    print("Hue version not recognized!")
def hue_login():
    driver = webdriver.Chrome(global_var.chromeDriverPath)
    driver.get(global_var.domain + global_var.loginPath)
    input_username = driver.find_element_by_name('username')
    input_password = driver.find_element_by_name('password')
    input_username.send_keys(username)
    input_password.send_keys(password)
    input_password.submit()
    sleep(1)
    driver.find_element_by_id('jHueTourModalClose').click()

def file_browser():
    hue_login()
    click_file_browser_link = global_var.domain + global_var.fileBrowserLink
    driver = webdriver.Chrome(global_var.chromeDriverPath)
    driver.get(click_file_browser_link)
How can I call hue_login() from the file_browser() function so that the rest of the code in file_browser() is executed in the same window opened by hue_login()?
Here you go: create the driver once at module level so both functions share the same browser session.
driver = webdriver.Chrome(global_var.chromeDriverPath)

def hue_login():
    driver.get(global_var.domain + global_var.loginPath)
    input_username = driver.find_element_by_name('username')
    input_password = driver.find_element_by_name('password')
    input_username.send_keys(username)
    input_password.send_keys(password)
    input_password.submit()
    sleep(1)
    driver.find_element_by_id('jHueTourModalClose').click()

def file_browser():
    hue_login()
    click_file_browser_link = global_var.domain + global_var.fileBrowserLink
    driver.get(click_file_browser_link)
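If a module-level global feels too implicit, an alternative (my own suggestion, not part of the original answer) is to pass the driver around explicitly. This reuses the global_var, username, password, and sleep names from the question:
# Hedged variant: create the driver once and pass it to each helper explicitly.
def hue_login(driver):
    driver.get(global_var.domain + global_var.loginPath)
    driver.find_element_by_name('username').send_keys(username)
    driver.find_element_by_name('password').send_keys(password)
    driver.find_element_by_name('password').submit()
    sleep(1)
    driver.find_element_by_id('jHueTourModalClose').click()

def file_browser(driver):
    hue_login(driver)
    driver.get(global_var.domain + global_var.fileBrowserLink)

driver = webdriver.Chrome(global_var.chromeDriverPath)
file_browser(driver)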
I am working with Robot Framework and stuck at the login screen because my application uses self-signed certificates. How can I set the flag to accept all certificates programmatically? I tried to create a Firefox profile and set
fp.set_preference("webdriver_accept_untrusted_certs",True)
But it didn't work.
I know that there is a workaround of manually accepting the certificates, but I don't want to do it manually.
My Python code looks like this:
def create_profile(path):
    from selenium import webdriver
    fp = webdriver.FirefoxProfile(profile_directory=path)
    fp.set_preference("browser.download.folderList", 2)
    fp.set_preference("browser.download.manager.showWhenStarting", False)
    fp.set_preference("browser.helperApps.alwaysAsk.force", False)
    fp.set_preference("webdriver_accept_untrusted_certs", True)
    fp.set_preference("webdriver_assume_untrusted_issuer", False)
    fp.update_preferences()
    return fp.path
And the Robot test case looks like this:
Launch Infoscale Access
    ${random_string} =    Generate Random String    8
    ${path}    Catenate    SEPARATOR=\\    ${EXECDIR}    ${random_string}
    Create Directory    ${path}
    ${profile} =    create_profile    ${path}
    Log    ${path}
    Open Browser    ${LOGIN URL}    ${BROWSER}    ff_profile_dir=${profile}
Below are the versions of the libraries I am using:
Python 2.7
Selenium 2.32.0
Robot Framework 2.6.0
geckodriver v0.11.1 (win64)
How can I print a webpage as a PDF with headers and footers so it looks like it has been printed from Google Chrome?
This is what I have tried so far using PhantomJS, but it doesn't have the headers and footers.
from selenium import webdriver
import selenium

def execute(script, args):
    driver.execute('executePhantomScript', {'script': script, 'args': args})

driver = webdriver.PhantomJS('phantomjs')

# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')

driver.get('http://stackoverflow.com')

# set page format
# inside the execution script, webpage is "this"
pageFormat = '''this.paperSize = {format: "A4", orientation: "portrait" };'''
execute(pageFormat, [])

# render current page
render = '''this.render("test.pdf")'''
execute(render, [])
This can be done with Selenium v4.1.3 by following pyppeteer examples.
You will need to send commands to Chromium and call the Page.printToPDF DevTools API, as shown in the following snippet:
# ...
result = send_cmd(driver, "Page.printToPDF", params={
    'landscape': True
    , 'margin': {'top': '1cm', 'right': '1cm', 'bottom': '1cm', 'left': '1cm'}
    , 'format': 'A4'
    , 'displayHeaderFooter': True
    , 'headerTemplate': '<span style="font-size:8px; margin-left: 5px">Page 1 of 1</span>'
    , 'footerTemplate': f'<span style="font-size:8px; margin-left: 5px">Generated on {datetime.now().strftime("%-m/%-d/%Y at %H:%M")}</span>'
    , 'scale': 1.65
    , 'pageRanges': '1'
})

with open(out_path_full, 'wb') as file:
    file.write(base64.b64decode(result['data']))
# ...

def send_cmd(driver, cmd, params={}):
    response = driver.command_executor._request(
        'POST'
        , f"{driver.command_executor._url}/session/{driver.session_id}/chromium/send_command_and_get_result"
        , json.dumps({'cmd': cmd, 'params': params}))
    if response.get('status'):
        raise Exception(response.get('value'))
    return response.get('value')
I've included a full example in my GitHub Repo.
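For context, here's a minimal, hedged sketch of the surrounding setup the snippet assumes. The imports, driver construction, target URL, and output path are my own placeholders, not taken from the linked repo:
import base64
import json
from datetime import datetime

from selenium import webdriver

driver = webdriver.Chrome()              # assumes chromedriver is on PATH
driver.get("https://stackoverflow.com")  # placeholder page to print
out_path_full = "page.pdf"               # placeholder output path

# ...then call send_cmd(driver, "Page.printToPDF", ...) and write the
# base64-decoded result to out_path_full, exactly as in the snippet above.

driver.quit()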
pdfkit may help you out. It will render a PDF from an HTML page.
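pdfkit wraps the wkhtmltopdf binary, which has to be installed separately. A minimal, hedged example follows; the header/footer option names are standard wkhtmltopdf flags, and the URL and output filename are placeholders:
import pdfkit

# wkhtmltopdf replaces [page] and [topage] with the current and total page numbers
options = {
    'page-size': 'A4',
    'header-center': 'My header text',
    'footer-right': 'Page [page] of [topage]',
}
pdfkit.from_url('https://stackoverflow.com', 'out.pdf', options=options)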
I am trying to download the CSV files from this page, via a python script.
But when I try to access a CSV file directly by its link in my browser, an agreement form is displayed. I have to agree to this form before I am allowed to download the file.
The exact URLs of the CSV files can't be retrieved; a value is sent to the backend database, which fetches the file, e.g. PERIOD_ID=2013-0:
https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0
I've tried urllib2.urlopen() and .read(), but they return the HTML content of the agreement form, not the file content.
How do I write Python code which handles this redirect, then fetches the CSV file and lets me save it to disk?
You need to set the ASP.NET_SessionId cookie. You can find this by using Chrome's Inspect element option in the context menu, or by using Firefox and the Firebug extension.
With Chrome:
Right-click on the webpage (after you've agreed to the terms) and select Inspect element
Click Resources -> Cookies
Select the only element in the list
Copy the Value of the ASP.NET_SessionId element
With Firebug:
Right-click on the webpage (after you've agreed to the terms), and click Inspect Element with Firebug
Click Cookies
Copy the Value of the ASP.NET_SessionId element
In my case, I got ihbjzynwfcfvq4nzkncbviou. It might work for you; if not, you need to perform the above procedure yourself.
Add the cookie to your request, and download the file using the requests module (based on an answer by eladc):
import requests

cookies = {'ASP.NET_SessionId': 'ihbjzynwfcfvq4nzkncbviou'}

r = requests.get(
    url=('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
         'DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0'),
    cookies=cookies
)

with open('2013-0.csv', 'wb') as ofile:
    for chunk in r.iter_content(chunk_size=1024):
        ofile.write(chunk)
        ofile.flush()
Here's my suggestion for automatically applying the server cookies and basically mimicking standard client session behavior.
(Shamelessly inspired by #pope's answer 554580.)
import urllib2
import urllib
from lxml import etree

_TARGET_URL = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0'
_AGREEMENT_URL = 'https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/Welcome/Agreement.aspx'
_CSV_OUTPUT = 'urllib2_ProdExport2013-0.csv'


class _MyHTTPRedirectHandler(urllib2.HTTPRedirectHandler):

    def http_error_302(self, req, fp, code, msg, headers):
        print 'Follow redirect...'  # Any cookie manipulation in-between redirects should be implemented here.
        return urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)

    http_error_301 = http_error_303 = http_error_307 = http_error_302

cookie_processor = urllib2.HTTPCookieProcessor()

opener = urllib2.build_opener(_MyHTTPRedirectHandler, cookie_processor)
urllib2.install_opener(opener)

response_html = urllib2.urlopen(_TARGET_URL).read()

print 'Cookies collected:', cookie_processor.cookiejar

page_node, submit_form = etree.HTML(response_html), {}  # ElementTree node + dict for storing hidden input fields.

for input_name in ['ctl00$MainContent$AgreeButton', '__EVENTVALIDATION', '__VIEWSTATE']:  # Form `input` fields used on the ``Agreement.aspx`` page.
    submit_form[input_name] = page_node.xpath('//input[@name="%s"][1]' % input_name)[0].attrib['value']
    print 'Form input \'%s\' found (value: \'%s\')' % (input_name, submit_form[input_name])

# Submits the agreement form back to ``_AGREEMENT_URL``, which redirects to the CSV download at ``_TARGET_URL``.
csv_output = opener.open(_AGREEMENT_URL, data=urllib.urlencode(submit_form)).read()
print csv_output

with file(_CSV_OUTPUT, 'wb') as f:  # Dumps the CSV output to ``_CSV_OUTPUT``.
    f.write(csv_output)
Good luck!
[Edit]
On the why of things, I think @Steinar Lima is correct with respect to requiring a session cookie. However, unless you've already visited the Agreement.aspx page and submitted a response via the provider's website, the cookie you copy from the browser's web inspector will only result in another redirect to the "Welcome to the PA DEP Oil & Gas Reporting Website" page, which of course defeats the whole point of having a Python script do the job for you.
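For anyone on Python 3, here is a rough, untested sketch of the same approach using requests and lxml. A requests.Session collects the cookies automatically; the form field names are taken from the urllib2 version above, and the output filename is a placeholder:
import requests
from lxml import etree

TARGET_URL = ('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
              'DataExports/ExportProductionData.aspx?PERIOD_ID=2013-0')
AGREEMENT_URL = ('https://www.paoilandgasreporting.state.pa.us/publicreports/Modules/'
                 'Welcome/Agreement.aspx')

session = requests.Session()  # keeps the ASP.NET_SessionId cookie between requests

# The first request gets redirected to the agreement page and sets the session cookie.
agreement_html = session.get(TARGET_URL).text
page_node = etree.HTML(agreement_html)

# Collect the hidden form fields and the "agree" button value, then post them back.
form = {}
for name in ['ctl00$MainContent$AgreeButton', '__EVENTVALIDATION', '__VIEWSTATE']:
    form[name] = page_node.xpath('//input[@name="%s"][1]' % name)[0].attrib['value']

csv_response = session.post(AGREEMENT_URL, data=form)

with open('requests_ProdExport2013-0.csv', 'wb') as f:
    f.write(csv_response.content)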