I am trying to get data from an Evernote shared notebook.
For example, from this one: https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c
I tried to use Beautiful Soup:
url = 'https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c'
r = requests.get(url)
bs = BeautifulSoup(r.text, 'html.parser')
bs
The result doesn't contain any text information from the notebook, only some code.
I have also seen advice to use Selenium and find elements by XPath.
For example, I want to find the heading of this note - 'Term 3 Week2'. In Google Chrome I found that its XPath is '/html/body/div[1]/div[1]/b/span/u/b'.
So I tried this:
driver = webdriver.PhantomJS()
driver.get(url)
t = driver.find_element_by_xpath('/html/body/div[1]/div[1]/b/span/u/b')
But it also didn't work; the result was 'NoSuchElementException: ...'.
I am a newbie in Python and especially in parsing, so I would be glad to receive any help.
I am using Python 3.6.2 and Jupyter Notebook.
Thanks in advance.
The easiest way to interface with Evernote is to use their official Python API.
After you've configured your API key and can generally connect, you can then download and reference Notes and Notebooks.
Evernote Notes use their own template language called ENML (EverNote Markup Language) which is a subset of HTML. You'll be able to use BeautifulSoup4 to parse the ENML and extract the elements you're looking for.
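For example, once you have a note's content string from the API, a minimal sketch of pulling text out of the ENML with BeautifulSoup4 might look like this (the ENML snippet below is made up for illustration, not real API output):
from bs4 import BeautifulSoup

# made-up ENML for illustration; in practice this string comes from the note content via the API
enml = '<en-note><div><b>Term 3 Week2</b></div><div>Some note text here.</div></en-note>'

soup = BeautifulSoup(enml, 'html.parser')
print(soup.get_text(separator='\n'))  # all of the note's visible text
print(soup.find('b').get_text())      # a single element, e.g. the heading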
If you're trying to extract information from a local installation (instead of their web app), you may also be able to get what you need from the Evernote executable by passing it command-line arguments. For this you're going to need the Python 3 subprocess module.
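As a rough sketch of the subprocess route (the executable path and arguments below are assumptions based on Evernote's Windows ENScript.exe helper, if your install provides it; check your own installation for the actual location and supported flags):
import subprocess

# assumed path and flags for Evernote's Windows command-line helper; adjust for your install
cmd = [r'C:\Program Files (x86)\Evernote\Evernote\ENScript.exe',
       'exportNotes', '/q', 'notebook:evernoteexamples', '/f', 'notes.enex']

result = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        universal_newlines=True)
print(result.returncode)
print(result.stdout or result.stderr)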
HOWEVER
If you want to use selenium, this will get you started:
import selenium.webdriver.support.ui as ui
from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
# your example URL
URL = 'https://www.evernote.com/pub/missrspink/evernoteexamples#st=p&n=56b67555-158e-4d10-96e2-3b2c57ee372c'
# create the browser interface, and a generic "wait" that we can use
# to intelligently block while the driver looks for elements we expect.
# 10: maximum wait in seconds
# 0.5: polling interval in seconds
driver = Chrome()
wait = ui.WebDriverWait(driver, 10, 0.5)
driver.get(URL)
# Note contents are loaded in an iFrame element
find_iframe = By.CSS_SELECTOR, 'iframe.gwt-Frame'
find_html = By.TAG_NAME, 'html'
# .. so we have to wait for the iframe to exist, switch our driver context
# and then wait for that internal page to load.
wait.until(EC.frame_to_be_available_and_switch_to_it(find_iframe))
wait.until(EC.visibility_of_element_located(find_html))
# since ENML is "just" HTML we can select the top tag and get all the
# contents inside it.
doc = driver.find_element_by_tag_name('html')
print(doc.get_attribute('innerHTML')) # <-- this is what you want
# cleanup our browser instance
driver.quit()
Related
I am looking to extract content from a page that requires a list node to be selected. I have retrieved the page HTML using Python and Selenium. Passing the page source to BS4, I can parse out the content that I am looking for using
open_li = soup.select('div#tree ul.jstree-container-ul li')
Each list item returned has an
aria-expanded = "false" and class="jstree-node jstree-closed"
Looking at the element in the inspector, the content is loaded when these attributes are set to
aria-expanded = "true" and class="jstree-node jstree-open"
I have tried using the .click() method on the content:
driver.find_element_by_id('tree').click()
But that only changes other content on the page. I think the list nodes themselves have to be expanded when making the request.
Does someone know how to change the aria-expanded attribute of these elements on a page before returning the content?
Thanks
You can use the requests package to get all the information as JSON.
Here is an example of how you can get all the information on the page:
import requests
if __name__ == '__main__':
    url = "https://app.updateimpact.com/api/singleArtifact?artifactId=commons-lang3&groupId=org.apache.commons&version=3.7"
    req_params = requests.get(url).json()
    response = requests.get(
        'https://app.updateimpact.com/api/builds/%s/%s' % (req_params["userIdStr"], req_params["buildId"]))
    print(response.json())
There could be multiple reasons for not getting the output
a) You are clicking on the wrong element
b) You are not waiting for the element to be loaded before clicking on it
c) You are not waiting for the content to be loaded after clicking on the element
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome('/path/to/chromedriver')
url="https://app.updateimpact.com/treeof/org.apache.commons/commons-lang3/3.7"
driver.get(url)
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="org.apache.commons:commons-lang3:3.7:jar_anchor"]/span')))
element.click()
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="tree-detail"]/div[2]/span[1]')))
print(driver.find_element_by_xpath('//*[@id="detail_div"]').text)
Output
org.apache.commons:commons-lang3:3.7:jar (back)
Project module (browse only dependencies of this module)
Group id org.apache.commons
Artifact id commons-lang3
Version 3.7
Type jar
This dependency isn't a dependency of any other dependencies.
I was trying to save a PDF from a link via PhantomJS (Selenium). So, I referred to this code that turns webpages into PDFs, and it worked just fine when I ran the exact same code.
So, I have this PDF I wanted to save from a direct URL and I tried that script... it didn't work. It just saves a PDF with 1 white page. That's all...
My Code :
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
def execute(script, args):
    driver.execute('executePhantomScript', {'script': script, 'args': args})

driver = webdriver.PhantomJS('phantomjs')
# hack while the python interface lags
driver.command_executor._commands['executePhantomScript'] = ('POST', '/session/$sessionId/phantom/execute')
driver.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')

try:
    WebDriverWait(driver, 40).until(EC.presence_of_element_located((By.ID, 'plugin')))
except Exception as TimeoutException:
    print("I waited for far too long and still couldn't find the view.")
    pass
# set page format
# inside the execution script, webpage is "this"
pageFormat = '''this.paperSize = {format: "A4", orientation: "portrait" };'''
execute(pageFormat, [])
# render current page
render = '''this.render("test2.pdf")'''
execute(render, [])
I'm not sure what's happening or why it's happening. I need some assistance.
EDIT: This is just the test PDF that I was trying to get via Selenium. There are some other PDFs which I need to get and that website is checking god-knows-what to decide whether it's a human or a bot. So, Selenium is the only way.
EDIT 2 : So, here's the website I was practicing on : http://services.ecourts.gov.in/ecourtindia/cases/case_no.php?state_cd=26&dist_cd=8&appFlag=web
Select "Cr Rev - Criminal Revision" from "Case Type" drop down and input any number in case number and year. Click on "Go".
This will show a little table, click on "view" and it should show a table on full page.
Scroll down to the "orders" table and you should see "Copy of order". That's the PDF I'm trying to get. I have tried requests as well and it did not work.
Currently, PhantomJS and headless Chrome don't support downloading a file. If you are OK with the Chrome browser, please see my example below. It finds the a elements and then adds a download attribute to each. Finally, it clicks on the link to download the file to the default Downloads folder.
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.planetpublish.com/free-ebooks/93/heart-of-darkness/')
pdfLinks = driver.find_elements_by_css_selector(".entry-content ul > li > a")

for pdfLink in pdfLinks:
    # adding a download attribute makes the browser save the target instead of navigating to it
    script = "arguments[0].setAttribute('download', arguments[1]);"
    driver.execute_script(script, pdfLink, pdfLink.text)
    time.sleep(1)
    pdfLink.click()
    time.sleep(3)

driver.quit()
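As a side note, if you want the files saved somewhere other than the default Downloads folder, Chrome's download preferences can be set before creating the driver. A sketch (the target directory here is just an example):
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
    'download.default_directory': '/tmp/pdfs',    # example target folder
    'download.prompt_for_download': False,        # don't ask where to save
    'plugins.always_open_pdf_externally': True,   # save PDFs instead of opening the viewer
})
driver = webdriver.Chrome(options=options)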
If you're just looking to download PDFs which aren't protected behind some JavaScript (essentially straightforward downloads), I suggest using the requests library instead.
import requests

url = 'http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf'
r = requests.get(url)
with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
    f.write(r.content)

# If large file
with requests.get(url, stream=True) as r:
    with open('The_Scarlet_Letter_T.pdf', 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
I recommend you look at the pdfkit library.
import pdfkit
pdfkit.from_url('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf', 'out.pdf')
It makes downloading PDFs very simple with Python. You will also need to download wkhtmltopdf for the library to work.
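If wkhtmltopdf is not on your PATH, you can point pdfkit at the binary explicitly; a sketch, with an assumed install location:
import pdfkit

# assumed location of the wkhtmltopdf binary; use the path from your own install
config = pdfkit.configuration(wkhtmltopdf='/usr/local/bin/wkhtmltopdf')
pdfkit.from_url('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf',
                'out.pdf', configuration=config)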
You could also try the code from this link shown below
#!/usr/bin/env python
from contextlib import closing
from selenium.webdriver import Firefox # pip install selenium
from selenium.webdriver.support.ui import WebDriverWait
# use firefox to get page with javascript generated content
with closing(Firefox()) as browser:
    browser.get('http://www.planetpublish.com/wp-content/uploads/2011/11/The_Scarlet_Letter_T.pdf')
    button = browser.find_element_by_name('button')
    button.click()
    # wait for the page to load
    WebDriverWait(browser, timeout=10).until(
        lambda x: x.find_element_by_id('someId_that_must_be_on_new_page'))
    # store it to string variable
    page_source = browser.page_source
    print(page_source)
which you will need to edit to make it work for your PDF.
I'm just learning Python and decided to play with some website scraping.
I created one scraper that works, and a second, almost identical as far as I can tell, that doesn't work, and I can't figure out why.
from lxml import html
import requests
page = requests.get('https://thronesdb.com/set/Core')
tree = html.fromstring(page.content)
cards = [tree.xpath('//a[@class = "card-tip"]/text()'), tree.xpath('//td[@data-th = "Faction"]/text()'),
         tree.xpath('//td[@data-th = "Cost"]/text()'), tree.xpath('//td[@data-th = "Type"]/text()'),
         tree.xpath('//td[@data-th = "STR"]/text()'), tree.xpath('//td[@data-th = "Traits"]/text()'),
         tree.xpath('//td[@data-th = "Set"]/text()'), tree.xpath('//a[@class = "card-tip"]/@data-code')]
print(cards)
That one does what I expect (I know it's not pretty). It grabs certain elements from a table on the site.
This one returns [[]]:
from lxml import html
import requests
page = requests.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
tree = html.fromstring(page.content)
offers = [tree.xpath('//a[@class = "offer_title"]/text()')]
print(offers)
What I expect it to do is give me a list that has the text from each offer_title element on the page.
The XPath I'm aiming at, which I grabbed from Firebug, is:
/html/body/div[1]/div/div/div/section/div[2]/ul[1]/li[2]/div/h3/a
Here's the actual string from the site:
Angelcare Digital Video And Sound Monitor - $89.99 ($90.00 Off)
I have also read a few other questions, but they didn't explain how this could work the first way but not the second. I can't post them because of the link restrictions on new accounts.
Titles:
Python - Unable to Retrieve Data From Webpage Table Using Beautiful Soup or lxml xpath
Python lxml xpath no output
Trouble with scraping text from site using lxml / xpath()
Any help would be appreciated. I did some reading on the lxml website about xpath, but I may be missing something in the way I'm building a query.
Thanks!
The reason the first code works is that the required data is initially present in the DOM, while on the second page the data is generated dynamically by JavaScript. You cannot scrape it with requests alone because requests doesn't handle dynamically generated content.
You can try to use, for example, Selenium + PhantomJS to get the required data, as below:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = webdriver.PhantomJS(executable_path='/path/to/phantomJS')
driver.get('http://www.redflagdeals.com/search/#!/q=baby%20monitor')
xpath = '//a[@class = "offer_title"]'
wait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, xpath)))
offers = [link.get_attribute('textContent') for link in driver.find_elements_by_xpath(xpath)]
I want to retrieve all the visible content of a web page. Let's say, for example, this webpage. I am using a headless Firefox browser remotely with Selenium.
The script I am using looks like this
driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
    fe.write(dom.encode('utf-8'))
This is supposed to load the page, parse the DOM, and then replace the iframe with id dsq-app1 with its visible content. If I execute those commands one by one via my Python command line, it works as expected; I can then see the paragraphs with all the visible content. When instead I execute all those commands at once, either by executing the script or by pasting the whole snippet into my interpreter, it behaves differently: the paragraphs are missing, the content still exists in JSON format, but it's not what I want.
Any idea why this may be happening? Something to do with replace_with, maybe?
Sounds like the DOM elements are not yet loaded when your code tries to reach them.
Try to wait for the elements to be fully loaded and only then replace.
This works when you run it command by command because then you give the driver time to load all the elements before you execute more commands.
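A minimal sketch of that idea, reusing the driver and url from the question with an explicit wait (pick whatever element reliably signals that the frame has finished loading for your page):
from bs4 import BeautifulSoup
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get(url)
# wait until the iframe itself is in the DOM before parsing the outer page
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'dsq-app1')))
dom = BeautifulSoup(driver.page_source, 'html.parser')
f = dom.find('iframe', id='dsq-app1')

# switch into the frame and wait for its body to be visible before grabbing its source
driver.switch_to.frame('dsq-app1')
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))
f.replace_with(BeautifulSoup(driver.page_source, 'html.parser'))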
To add to Or Duan's answer, here is what I ended up doing. The problem of determining whether a page or parts of a page have loaded completely is an intricate one. I tried to use implicit and explicit waits, but again I ended up receiving half-loaded frames. My workaround is to check the readyState of the original document and the readyState of the iframes.
Here is a sample function
from time import sleep

def _check_if_load_complete(driver, timeout=10):
    elapsed_time = 1
    while True:
        if (driver.execute_script('return document.readyState') == 'complete' or
                elapsed_time == timeout):
            break
        else:
            sleep(0.0001)
            elapsed_time += 1
Then I used that function right after I changed the focus of the driver to the iframe:
driver.switch_to_frame('dsq-app1')
_check_if_load_complete(driver, timeout=10)
Try to get the page source after detecting the required ID/CSS selector/class or link.
You can always use an explicit wait in Selenium WebDriver.
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Remote('http://0.0.0.0:xxxx/wd/hub', desired_capabilities)
driver.get(url)
f = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, idName)))
# here 10 is the time (in seconds) for which the script will try to find the given id
# provide the id name
dom = BeautifulSoup(driver.page_source, parser)
f = dom.find('iframe', id='dsq-app1')
driver.switch_to_frame('dsq-app1')
s = driver.page_source
f.replace_with(BeautifulSoup(s, 'html.parser'))
with open('out.html', 'w') as fe:
    fe.write(dom.encode('utf-8'))
Correct me if this does not work.
I'm using Python 3 and I'm trying to retrieve data from a website. However, this data is dynamically loaded and the code I have right now doesn't work:
url = eveCentralBaseURL + str(mineral)
print("URL : %s" % url);
response = request.urlopen(url)
data = str(response.read(10000))
data = data.replace("\\n", "\n")
print(data)
Where I'm expecting to find a particular value, I'm finding a template instead, e.g. "{{formatPrice median}}" instead of "4.48".
How can I make it so that I can retrieve the value instead of the placeholder text?
Edit: This is the specific page I'm trying to extract information from. I'm trying to get the "median" value, which uses the template {{formatPrice median}}
Edit 2: I've installed and set up my program to use Selenium and BeautifulSoup.
The code I have now is:
from bs4 import BeautifulSoup
from selenium import webdriver
#...
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = BeautifulSoup(html)
print("Finding...")
for tag in soup.find_all('formatPrice median'):
    print(tag.text)
Here is a screenshot of the program as it's executing. Unfortunately, it doesn't seem to be finding anything with "formatPrice median" specified.
Assuming you are trying to get values from a page that is rendered using JavaScript templates (for instance something like Handlebars), then this is what you will get with any of the standard solutions (i.e. BeautifulSoup or requests).
This is because the browser uses JavaScript to alter what it received and create new DOM elements. urllib will do the requesting part like a browser but not the template rendering part. A good description of the issues can be found here. This article discusses three main solutions:
parse the AJAX JSON directly
use an offline JavaScript interpreter to process the request (SpiderMonkey, Crowbar)
use a browser automation tool (Splinter)
This answer provides a few more suggestions for option 3, such as Selenium or Watir. I've used Selenium for automated web testing and it's pretty handy.
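For option 1, the usual workflow is to open your browser's developer tools, watch the network tab for the XHR/JSON request the page makes, and call that endpoint directly with requests. A sketch with a purely hypothetical endpoint (substitute whatever URL the page actually requests):
import requests

# hypothetical endpoint spotted in the browser's network tab; not a documented API
resp = requests.get('https://example.com/api/quicklook', params={'typeid': 34})
resp.raise_for_status()
data = resp.json()
print(data.get('median'))  # assumes the JSON carries the value the template renders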
EDIT
From your comments it looks like it is a Handlebars-driven site. I'd recommend Selenium and Beautiful Soup. This answer gives a good code example which may be useful:
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://eve-central.com/home/quicklook.html?typeid=34')
html = driver.page_source
soup = BeautifulSoup(html)
# check out the docs for the kinds of things you can do with 'find_all'
# this (untested) snippet should find tags with a specific class ID
# see: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-by-css-class
for tag in soup.find_all("a", class_="my_class"):
    print(tag.text)
Basically selenium gets the rendered HTML from your browser and then you can parse it using BeautifulSoup from the page_source property. Good luck :)
I used selenium + chrome
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "www.sitetotarget.com"  # placeholder; use a full URL including the scheme, e.g. https://...

options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

# launch headless Chrome with the options above and grab the rendered page
driver = webdriver.Chrome(options=options)
driver.get(url)
html = driver.page_source
driver.quit()
Building off another answer: I had a similar issue. wget and curl no longer work well for getting the content of a web page; they break in particular on dynamic and lazily loaded content. Using Chrome (or Firefox, or the Chromium version of Edge) allows you to deal with redirects and scripting.
The code below will launch an instance of Chrome, increase the page-load timeout to 5 seconds, and navigate the browser instance to a URL. I ran this from Jupyter.
import time
from tqdm.notebook import trange, tqdm
from PIL import Image, ImageFont, ImageDraw, ImageEnhance
from selenium import webdriver
driver = webdriver.Chrome('/usr/bin/chromedriver')
driver.set_page_load_timeout(5)
time.sleep(1)
driver.set_window_size(2100, 9000)
time.sleep(1)
driver.set_window_size(2100, 9000)
## You can manually adjust the browser, but don't move it after this.
## Do stuff ...
driver.quit()
Example of grabbing dynamic content and screenshots of the anchored (hence the "a" tag) HTML object, another name for hyperlink:
url = 'http://www.example.org' ## Any website
driver.get(url)
pageSource = driver.page_source
print(driver.get_window_size())
locations = []
for element in driver.find_elements_by_tag_name("a"):
    location = element.location
    size = element.size
    # Collect coordinates of object: left/right, top/bottom
    x1 = location['x']
    y1 = location['y']
    x2 = location['x'] + size['width']
    y2 = location['y'] + size['height']
    locations.append([element, x1, y1, x2, y2, x2 - x1, y2 - y1])
locations.sort(key=lambda x: -x[-2] - x[-1])
locations = [
    (el, x1, y1, x2, y2, width, height)
    for el, x1, y1, x2, y2, width, height in locations
    if not (
        ## First, filter links that are not visible (located offscreen or zero pixels in any dimension)
        x2 <= x1 or y2 <= y1 or x2 < 0 or y2 < 0
        ## Further restrict if you expect the objects to be around a specific size
        ## or width < 200 or height < 100
    )
]
for el, x1, y1, x2, y2, width, height in tqdm(locations[:10]):
    try:
        print('-' * 100, f'({width},{height})')
        print(el.text[:100])
        element_png = el.screenshot_as_png
        with open('/tmp/_pageImage.png', 'wb') as f:
            f.write(element_png)
        img = Image.open('/tmp/_pageImage.png')
        display(img)
    except Exception as err:
        print(err)
Installation for mac+chrome:
pip install selenium
brew cask install chromedriver
brew cask install google-chrome
I was using a Mac for the original answer, and Ubuntu + the Windows 11 preview via WSL2 after updating. Chrome ran from the Linux side, with an X service on Windows rendering the UI.
Regarding responsibility, please respect robots.txt on each site.
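On that last point, the standard library can check robots.txt for you before you scrape; a small sketch using a placeholder site:
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://www.example.org/robots.txt')  # placeholder site
rp.read()
print(rp.can_fetch('*', 'https://www.example.org/some/page'))  # True if fetching is allowed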
I know this is an old question, but sometimes there is a better solution than using heavyweight Selenium.
This requests module for Python comes with JS support (in the background it is still Chromium), and you can still use BeautifulSoup as normal.
Though, sometimes if you have to click elements or something similar, I guess Selenium is the only option.
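Assuming the module meant here is requests-html, a minimal sketch (the URL and selector are just examples) looks like this:
from requests_html import HTMLSession  # pip install requests-html

session = HTMLSession()
r = session.get('http://www.example.org')  # placeholder URL
r.html.render()                            # downloads Chromium on first use and runs the page's JS
links = r.html.find('a')                   # example CSS selector
print([link.text for link in links[:10]])
# r.html.html now holds the rendered markup, so you can also hand it to BeautifulSoup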