Using Pyppeteer to download CSV / Excel file from Vanguard via JavaScript - python

I'm trying to automate downloading the holdings of Vanguard funds from the web. The links resolve through JavaScript, so I'm using Pyppeteer, but I'm not getting the file. Note: the link says CSV, but it provides an Excel file.
From my browser it works like this:
Go to the fund URL, e.g.
https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio
Follow the link, "See count total holdings"
Click the link, "Export to CSV"
My attempt to replicate this in Python follows. The first link-follow seems to work, because I get different HTML, but the second click gives me the same page, not a download.
import asyncio
from pyppeteer import launch
import os

async def get_page(browser, url):
    page = await browser.newPage()
    await page.goto(url)
    return page

async def fetch(url):
    browser = await launch(options={'args': ['--no-sandbox']})  # headless=True,
    page = await get_page(browser, url)
    await page.waitFor(2000)

    # save the page so we can see the source
    wkg_dir = 'vanguard'
    t_file = os.path.join(wkg_dir, '8225.htm')
    with open(t_file, 'w', encoding="utf-8") as ef:
        ef.write(await page.content())

    accept = await page.xpath('//a[contains(., "See count total holdings")]')
    print(f'Found {len(accept)} "See count total holdings" links')
    if accept:
        await accept[0].click()
        await page.waitFor(2000)
    else:
        print('DID NOT FIND THE LINK')
        return False

    # save the pop-up page for debug
    t_file = os.path.join(wkg_dir, 'first_page.htm')
    with open(t_file, 'w', encoding="utf-8") as ef:
        ef.write(await page.content())

    links = await page.xpath('//a[contains(., "Export to CSV")]')
    print(f'Found {len(links)} "Export to CSV" links')  # 3 of these
    for i, link in enumerate(links):
        print(f'Trying link {i}')
        await link.click()
        await page.waitFor(2000)
        t_file = os.path.join(wkg_dir, f'csv_page{i}.csv')
        with open(t_file, 'w', encoding="utf-8") as ef:
            ef.write(await page.content())
    return True

#---------- Main ------------
# Set constants and global variables
url = 'https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio'
loop = asyncio.get_event_loop()
status = loop.run_until_complete(fetch(url))
I would love to hear suggestions from anyone who knows Puppeteer / Pyppeteer well.

First of all, page.waitFor(2000) should be a last resort. It's a race condition that can lead to a false negative at worst and slows your scrape down at best. I recommend page.waitForXPath, which spawns a tight polling loop and continues your code as soon as the xpath becomes available.
Also on the topic of element selection, I'd use text() in your xpath instead of ., which is more precise.
I'm not sure how ef.write(await page.content()) is working for you -- that should only give you the page HTML, not the XLSX download. The link click triggers the download via a dialog. Accepting this download involves enabling Chrome downloads with:
await page._client.send("Page.setDownloadBehavior", {
    "behavior": "allow",
    "downloadPath": r"C:\Users\you\Desktop"  # TODO set your path
})
The next hurdle is bypassing or suppressing the "multiple downloads" permission prompt Chrome displays when you try to download multiple files on the same page. I wasn't able to figure out how to stop this, so my code just navigates to the page for each link as a workaround. I'll leave my solution as sub-optimal but functional and let others (or my future self) improve on it.
By the way, the two XLSX files at indices 1 and 2 seem to be identical. This code downloads all 3 anyway, but you can probably skip the last one, depending on whether the page changes over time -- I'm not familiar with it.
I'm using a trick for clicking non-visible elements: invoking the browser's native click from the console rather than Puppeteer's, with page.evaluate("e => e.click()", csv_link).
Here's my attempt:
import asyncio
from pyppeteer import launch

async def get_csv_links(page):
    await page.goto(url)
    xp = '//a[contains(text(), "See count total holdings")]'
    await page.waitForXPath(xp)
    accept, = await page.xpath(xp)
    await accept.click()
    xp = '//a[contains(text(), "Export to CSV")]'
    await page.waitForXPath(xp)
    return await page.xpath(xp)

async def fetch(url):
    browser = await launch(headless=False)
    page, = await browser.pages()
    await page._client.send("Page.setDownloadBehavior", {
        "behavior": "allow",
        "downloadPath": r"C:\Users\you\Desktop"  # TODO set your path
    })
    csv_links = await get_csv_links(page)
    for i in range(len(csv_links)):
        # open a fresh page each time as a hack to avoid multiple file prompts
        csv_link = (await get_csv_links(page))[i]
        await page.evaluate("e => e.click()", csv_link)
        # let download finish; this is a race condition
        await page.waitFor(3000)

if __name__ == "__main__":
    url = "https://www.vanguard.com.au/personal/products/en/detail/8225/portfolio"
    asyncio.get_event_loop().run_until_complete(fetch(url))
Notes for improvement:
Try an arg like --enable-parallel-downloading or a setting like 'profile.default_content_setting_values.automatic_downloads': 1 to suppress the "multiple downloads" warning.
Figure out how to wait for all downloads to complete so the final waitFor(3000) can be removed. Another option here might involve polling for the files you expect (see the sketch after this list); you can also visit the linked thread for ideas.
Figure out how to download headlessly.
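On that second note, here's a minimal polling sketch (my addition, not from the original answer), assuming Chrome's convention of suffixing in-progress downloads with .crdownload:

import os
import time

def wait_for_downloads(download_dir, expected_count, timeout=30):
    # Chrome writes in-progress downloads with a .crdownload suffix, so the
    # directory is "done" when enough files exist and none are .crdownload.
    deadline = time.time() + timeout
    while time.time() < deadline:
        files = os.listdir(download_dir)
        in_progress = [f for f in files if f.endswith('.crdownload')]
        if not in_progress and len(files) >= expected_count:
            return files
        time.sleep(0.25)
    raise TimeoutError(f'downloads did not finish within {timeout}s')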
Other resources for posterity:
How do I get puppeteer to download a file?
Download content using Pyppeteer (will show a 404 unless you have 10k+ reputation)

Related

How do I read whatsapp messages from a contact using python?

I'm building a bot that logs into Zoom at specified times, and the links are being obtained from WhatsApp. So I was wondering if it is possible to retrieve those links from WhatsApp directly instead of having to copy-paste them into Python. Google is filled with guides to send messages, but is there any way to READ and RETRIEVE those messages and then manipulate them?
You can, at most, try to read WhatsApp messages with Python using Selenium WebDriver, since I strongly doubt that you can access WhatsApp APIs.
Selenium is basically an automation tool that lets you automate tasks in your browser, so perhaps you could write a Python script using Selenium that automatically opens WhatsApp and parses the HTML of your WhatsApp Web client.
First of all, we mentioned Selenium, but we will use it only to automate opening and closing WhatsApp; now we have to find a way to read what's inside the WhatsApp client, and that's where the magic of web scraping comes in handy.
Web scraping is the process of extracting data from a website; in this case, the data is the Zoom link you need to obtain automatically, while the website is your WhatsApp client. To perform this process you need a way to extract (parse) information from the website; to do so I suggest you use Beautiful Soup, though I advise you that a minimum knowledge of how HTML works is required.
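For instance, a minimal sketch (my illustration, not tested against WhatsApp Web's real markup), assuming driver is a Selenium WebDriver already logged in at web.whatsapp.com:

from bs4 import BeautifulSoup

# Parse the rendered page and collect any anchors that point at Zoom.
soup = BeautifulSoup(driver.page_source, "html.parser")
zoom_links = [a["href"] for a in soup.find_all("a", href=True)
              if "zoom.us" in a["href"]]
print(zoom_links)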
Sorry if this may not completely answer your question but this is all the knowledge I have on this specific topic.
You can open WhatsApp in the browser using Selenium (https://selenium-python.readthedocs.io/) in Python.
I learned from and use code from this guide: https://towardsdatascience.com/complete-beginners-guide-to-processing-whatsapp-data-with-python-781c156b5f0b. Go through the details written at that link.
You have to install the external Python library "whatsapp-web" from https://pypi.org/project/whatsapp-web/. Just type "python -m pip install whatsapp-web" at a command prompt / Windows terminal.
It will show this result:
python -m pip install whatsapp-web
Collecting whatsapp-web
  Downloading whatsapp_web-0.0.1-py3-none-any.whl (21 kB)
Installing collected packages: whatsapp-web
Successfully installed whatsapp-web-0.0.1
You can read all the cookies from WhatsApp Web, add them to your headers, and use the requests module; or you can also use Selenium for that.
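A minimal sketch of that cookie hand-off (an illustration of the idea only; whether WhatsApp Web serves anything useful to a plain HTTP client is not guaranteed):

import requests
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://web.whatsapp.com")
# ... scan the QR code and wait for login ...

# Copy the browser session's cookies into a requests session.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"])

response = session.get("https://web.whatsapp.com")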
Update:
Please change the xpath class names in the code below to the current class names used by WhatsApp Web (find them with the inspect-element tool), because WhatsApp has changed its elements' class names.
I have tried this while creating a WhatsApp bot using Python, but there are still many bugs, because I am also a beginner.
Steps based on my research:
Open the browser using Selenium WebDriver.
Log in to WhatsApp using the QR code.
If you know which number you will receive the meeting link from, use this step; otherwise check the process mentioned after this one.
Find and open the chat room where you will receive the Zoom meeting link.
To get a message from a known chat room and act on it:
# user_name = "Name of the meeting-link sender as in your contact list"
# Example:
user_name = "Anurag Kushwaha"
# In the variable above, in place of `Anurag Kushwaha`, pass the name or number
# of whoever will send you the Zoom meeting link, exactly as saved in your contact list.
user = webdriver.find_element_by_xpath('//span[@title="{}"]'.format(user_name))
user.click()

# For getting messages to act on
message = webdriver.find_elements_by_xpath("//span[@class='_3-8er selectable-text copyable-text']")
# In the line above, change the xpath's class name to the current class name
# (inspect a span element containing a received text message in any chat room).
for i in message:
    try:
        if "zoom.us" in str(i.text):
            # Here you can add your code to perform an action as needed
            print("Perform Your Action")
    except:
        pass
If you do not know which number the link will come from, you can get the div class of any unread-contact block and open every chat room in the list that contains that unread div class.
Then check all the unread messages of each open chat and get the message from the div class.
When you don't know whom you will receive the Zoom meeting link from:
# For getting unread chats you can use
unread_chats = webdriver.find_elements_by_xpath("//span[@class='_38M1B']")
# In the line above, change the xpath's class name to the current class name
# (inspect the span element showing the unread-message count in the green
# circle on a contact card before opening the chat room).
# Open each chat in a loop and read the messages.
for chat in unread_chats:
    chat.click()
    # For getting messages to act on
    message = webdriver.find_elements_by_xpath("//span[@class='_3-8er selectable-text copyable-text']")
    # In the line above, change the xpath's class name to the current class name
    # (inspect a span element containing a received text message in any chat room).
    for i in message:
        try:
            if "zoom.us" in str(i.text):
                # Here you can add your code to perform an action as needed
                print("Perform Your Action")
        except:
            pass
Note: In the code above, 'webdriver' is the driver with which you open web.whatsapp.com.
Example:
from selenium import webdriver

webdriver = webdriver.Chrome("ChromePath/chromedriver.exe")
webdriver.get("https://web.whatsapp.com")
# This webdriver variable is used in the code above.
# If you used any other name, please rename it in my code, or assign your
# variable to the name used there, as in the following line:
webdriver = your_webdriver_variable
A complete reference example:
from selenium import webdriver
import time

webdriver = webdriver.Chrome("ChromePath/chromedriver.exe")
webdriver.get("https://web.whatsapp.com")
time.sleep(25)  # time to scan the QR code
# Please make sure that the QR code scan was successful.
confirm = int(input("Press 1 to proceed if successfully logged in, or press 0 to retry: "))
if confirm == 1:
    print("Continuing...")
elif confirm == 0:
    webdriver.close()
    exit()
else:
    print("Sorry, please try again")
    webdriver.close()
    exit()

while True:
    unread_chats = webdriver.find_elements_by_xpath("//span[@class='_38M1B']")
    # In the line above, change the xpath's class name to the current class name
    # (inspect the span element showing the unread-message count in the green
    # circle on a contact card before opening the chat room).
    # Open each chat in a loop and read the messages.
    for chat in unread_chats:
        chat.click()
        time.sleep(2)
        # For getting messages to act on
        message = webdriver.find_elements_by_xpath("//span[@class='_3-8er selectable-text copyable-text']")
        # In the line above, change the xpath's class name to the current class
        # name (inspect a span element containing a received text message).
        for i in message:
            try:
                if "zoom.us" in str(i.text):
                    # Here you can add your code to perform an action as needed
                    print("Perform Your Action")
            except:
                pass
Please make sure that the indentation is preserved in the code blocks if you are copying them.
You can read another answer of mine at the following link for more info about WhatsApp Web with Python:
Line breaks in WhatsApp messages sent with Python
I am developing a WhatsApp bot using Python.
For contributions you can contact me at: anurag.cse016@gmail.com
Please give a star on my GitHub, https://github.com/4NUR46, if this answer helps you.
Try this. It's a bit of a hassle, but it might work.
import time
import webbrowser

import pyautogui
import pyperclip

# Take a snip of the group or contact name/profile photo and save it;
# "#group/contact" below stands in for the path to that image.
grouporcontact = pyautogui.locateOnScreen("#group/contact", confidence=.6)
link = pyperclip.paste()

def searchforgroup():
    global link
    time.sleep(5)
    webbrowser.open("https://web.whatsapp.com")
    time.sleep(30)  # time to scan the QR code; if you've already done it, you can cut this to 10 or so
    grouporcontact = pyautogui.locateOnScreen("#group/contact", confidence=.6)
    if grouporcontact is None:
        # Do any other option; in my case I just fell back to my usual link
        link = "mymeetlink"
    else:
        x = grouporcontact[0]
        y = grouporcontact[1]
        pyautogui.moveTo(x, y, duration=1)
        pyautogui.click()
# end of searching for the group

def findlink():
    global link
    meetlink = pyautogui.locateOnScreen("#", confidence=.6)  # take another snip of a meet link without the code after the "/"
    if meetlink is None:
        # Do any other option; in my case I just fell back to my usual link
        link = "mymeetlink"
    else:
        f = meetlink[0]
        v = meetlink[1]
        pyautogui.moveTo(f, v, duration=.6)
        pyautogui.rightClick()
        pyautogui.moveRel(0, 0, duration=2)  # You have to play with this; it depends on your screen size, so adjust until it reaches "Copy Link Address"
        pyautogui.click()
        link = pyperclip.paste()
        webbrowser.open(link)  # to test it out
So now you have it. You have to install pyautogui and pyperclip, then just follow the comments in the snippet and everything should work :)

Python requests-html session GET correct usage

I'm working on a web scraper that needs to open several thousand pages and get some data.
Since one of the data fields I need most is only loaded after all the site's JavaScript has run, I'm using requests-html to render the page and then get the data I need.
I want to know: what's the best way to do this?
1- Open a session at the start of the script, do my whole scraping, and then close the session when the script finishes after thousands of "clicks" and several hours?
2- Or should I open a session every time I open a link, render the page, get the data, and then close the session, repeating n times in a cycle?
Currently I'm doing the 2nd option, but I'm running into a problem. This is the code I'm using:
def getSellerName(listingItems):
    for item in listingItems:
        builtURL = item['href']
        try:
            session = HTMLSession()
            r = session.get(builtURL, timeout=5)
            r.html.render()
            sleep(1)
            sellerInfo = r.html.search("<ul class=\"seller_name\"></ul></div><a href=\"{user}\" target=")["user"]
            ##
            ## Do some stuff with sellerInfo
            ##
            session.close()
        except requests.exceptions.Timeout:
            log.exception("TimeOut Ex: ")
            continue
        except:
            log.exception("Gen Ex")
            continue
        finally:
            session.close()
        break
This works pretty well and is quite fast. However, after about 1.5 or 2 hours I start getting OS exceptions like this one:
OSError: [Errno 24] Too many open files
And then that's it: I just get this exception over and over again until I kill the script.
I'm guessing I need to close something else after every get and render, but I'm not sure what, or whether I'm doing it correctly.
Any help and/or suggestions, please?
Thanks!
You should create the session object outside the loop:
def getSellerName(listingItems):
    session = HTMLSession()
    for item in listingItems:
        # ... rest of your code ...
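Fleshed out, that restructuring might look like this (a sketch reusing the names from the question: HTMLSession, sleep, and a configured log), with one session reused for every request and closed exactly once:

import requests
from time import sleep
from requests_html import HTMLSession

def getSellerName(listingItems):
    session = HTMLSession()
    try:
        for item in listingItems:
            builtURL = item['href']
            try:
                r = session.get(builtURL, timeout=5)
                r.html.render()
                sleep(1)
                sellerInfo = r.html.search("<ul class=\"seller_name\"></ul></div><a href=\"{user}\" target=")["user"]
                # ... do some stuff with sellerInfo ...
            except requests.exceptions.Timeout:
                log.exception("TimeOut Ex: ")
            except Exception:
                log.exception("Gen Ex")
    finally:
        session.close()  # one close for the whole scrape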

Opening the redirected link

Note: New to Python.
I'm working on a bot that, whenever the prefix and command are given, inserts a random Wikipedia article. Wikipedia has a URL for this:
'https://en.wikipedia.org/wiki/Special:Random'
Instead of displaying /wiki/Special:Random, I want to display the random article it has redirected to and have it displayed at XYZ. How would I go about properly resolving this redirect?
elif message.content.startswith(config.prefix + 'edo'):
    await client.send_message(message.channel, content = 'I like XYZ')
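One way to resolve the final article URL is to let the requests library follow the redirect and read response.url (a sketch of the idea, not an answer from the original thread):

import requests

# Special:Random responds with a redirect to a random article. requests
# follows it automatically, and response.url is the address it landed on.
response = requests.get('https://en.wikipedia.org/wiki/Special:Random')
article_url = response.url

print(article_url)  # e.g. https://en.wikipedia.org/wiki/<some random article>

The bot handler could then send article_url in place of 'XYZ'.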

Web Scraping with Python in combination with asyncio

I've written a script in Python to get some information from a webpage. The code itself runs flawlessly when taken out of asyncio. However, since my script runs synchronously, I wanted to move it to an asynchronous process so that it accomplishes the task in the shortest possible time, with optimum performance and obviously without blocking. As I've never worked with the asyncio library, I'm seriously confused about how to go about it. I've tried to fit my script into the asyncio flow, but it doesn't seem right. If somebody lends a helping hand to complete this, I would be really grateful. Thanks in advance. Here is my erroneous code:
import asyncio

import requests
from lxml import html

link = "http://quotes.toscrape.com/"

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(base_link + titles.attrib['href'])

async def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = link + next_page
        processing_docs(page_link)

loop = asyncio.get_event_loop()
loop.run_until_complete(quotes_scraper(link))
loop.close()
Upon execution what I see in the console is:
RuntimeWarning: coroutine 'processing_docs' was never awaited
processing_docs(base_link + titles.attrib['href'])
You need to call processing_docs() with await.
Replace:
processing_docs(base_link + titles.attrib['href'])
with:
await processing_docs(base_link + titles.attrib['href'])
And replace:
processing_docs(page_link)
with:
await processing_docs(page_link)
Otherwise it tries to run an asynchronous function synchronously and gets upset!
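With both changes applied, the first coroutine from the question reads:

async def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        await processing_docs(base_link + titles.attrib['href'])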

Selenium Python Wait until all the HTML of a page is Load [duplicate]

I don't really have any idea about this, so I'd like you to give me some advice if you can.
Generally when I use Selenium I search for the element I'm interested in, but now I'm thinking of developing a kind of performance test to check how much time a specific webpage (HTML, scripts, etc.) takes to load.
Do you have any idea how to measure the load time of the HTML, scripts, etc. without searching for a specific element of the page?
PS I use IE or Firefox
You could check the underlying JavaScript framework for active connections. When there are no active connections, you can assume the page has finished loading.
That, however, requires that you either know which framework the page uses, or that you systematically check for different frameworks and then check for connections.
import logging
import time

def get_js_framework(driver):
    frameworks = [
        'return jQuery.active',
        'return Ajax.activeRequestCount',
        'return dojo.io.XMLHTTPTransport.inFlight.length'
    ]
    for f in frameworks:
        try:
            driver.execute_script(f)
        except Exception:
            logging.debug("{0} didn't work, trying next js framework".format(f))
            continue
        else:
            return f
    return None

def load_page(driver, link):
    timeout = 5
    begin = time.time()
    driver.get(link)
    js = get_js_framework(driver)
    if js:
        while driver.execute_script(js) and time.time() < begin + timeout:
            time.sleep(0.25)
    else:
        time.sleep(timeout)
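As a simpler alternative (my addition, not part of the original answer), if you only care about the initial page load rather than in-flight AJAX, you can poll document.readyState, which the browser sets to "complete" once the document and its sub-resources have loaded:

from selenium.webdriver.support.ui import WebDriverWait

def wait_for_document_ready(driver, timeout=10):
    # Blocks until the browser reports document.readyState == "complete".
    WebDriverWait(driver, timeout).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
    )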
