I've written a Python script to get the last trade price from a JavaScript-rendered webpage. I can get the content if I use Selenium, but my goal is to avoid browser simulators like Selenium, because the latest release of Requests-HTML is supposed to be able to parse JavaScript-rendered content. However, I haven't been able to make it work. When I run the script, I get the error below. Any help will be highly appreciated.
Site address : webpage_link
The script I've tried with:
import requests_html

with requests_html.HTMLSession() as session:
    r = session.get('https://www.gdax.com/trade/LTC-EUR')
    js = r.html.render()
    item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
    print(item)
This is the complete traceback:
Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
self._callback(*self._args)
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
self._timeout)
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
Traceback (most recent call last):
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\experiment.py", line 6, in <module>
item = js.find('.MarketInfo_market-num_1lAXs',first=True).text
AttributeError: 'NoneType' object has no attribute 'find'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\shutil.py", line 387, in _rmtree_unsafe
os.unlink(fullname)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\ar\\.pyppeteer\\.dev_profile\\tmp1gng46sw\\CrashpadMetrics-active.pma'
The price I'm after appears at the top of the page, displayed like this: 177.59 EUR Last trade price. I want to get 177.59, or whatever the current price is.
You have several errors. The first is a 'navigation' timeout, showing that the page didn’t complete rendering:
Exception in callback NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49
handle: <Handle NavigatorWatcher.waitForNavigation.<locals>.watchdog_cb(<Task finishe...> result=None>) at C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py:49>
Traceback (most recent call last):
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
self._callback(*self._args)
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 52, in watchdog_cb
self._timeout)
File "C:\Users\ar\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pyppeteer\navigator_watcher.py", line 40, in _raise_error
raise error
concurrent.futures._base.TimeoutError: Navigation Timeout Exceeded: 3000 ms exceeded
This traceback is not raised in the main thread, so your code was not aborted because of it. Your page may or may not be complete; you may want to set a longer timeout or introduce a sleep cycle so the browser has time to process AJAX responses.
Next, the response.html.render() method returns None. It loads the HTML into a headless Chromium browser, leaves JavaScript rendering to that browser, then copies the page HTML back into the response.html data structure in place, so nothing needs to be returned. So js is set to None, not a new HTML instance, causing your next traceback.
Use the existing response.html object to search, after rendering:
r.html.render()
item = r.html.find('.MarketInfo_market-num_1lAXs', first=True)
There is most likely no such CSS class, because the last 5 characters are generated on each page render, after JSON data is loaded over AJAX. This makes it hard to use CSS to find the element in question.
Moreover, I found that without a sleep cycle, the browser has no time to fetch AJAX resources and render the information you wanted to load. Give it, say, 10 seconds of sleep to do some work before copying back the HTML. Set a longer timeout (the default is 8 seconds) if you see network timeouts:
r.html.render(timeout=10, sleep=10)
You could set the timeout to 0 too, to remove the timeout and just wait indefinitely until the page has loaded.
Hopefully a future API update also provides features to wait for network activity to cease.
You can use the included parse library to find the matching CSS classes:
# search for CSS suffixes
suffixes = [r[0] for r in r.html.search_all('MarketInfo_market-num_{:w}')]
for suffix in suffixes:
    # for each suffix, find all matching elements with that class
    items = r.html.find('.MarketInfo_market-num_{}'.format(suffix))
    for item in items:
        print(item.text)
Now we get output produced:
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
169.81 EUR
+
1.01 %
18,420 LTC
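As a rough illustration of what that suffix search does, here is a standard-library sketch using re on a hypothetical HTML snippet. The markup and the second class name are made up for the example, and it does not call Requests-HTML at all. It also deduplicates the suffixes, which may be why the output above repeats: a search like this returns one match per occurrence of the class name, so the same elements get printed once per occurrence.

```python
import re

# Hypothetical page fragment; the real markup may differ.
html = '''
<span class="MarketInfo_market-num_1lAXs">169.81 EUR</span>
<span class="MarketInfo_market-num_1lAXs">18,420 LTC</span>
<span class="MarketInfo_market-descr_21yH2">Last trade price</span>
'''

# \w+ plays the role of the {:w} field: letters, digits and underscores.
pattern = re.compile(r'MarketInfo_market-num_(\w+)')

# dict.fromkeys keeps the first occurrence of each suffix, in order.
suffixes = list(dict.fromkeys(pattern.findall(html)))

# Build one CSS selector per distinct generated class name.
selectors = ['.MarketInfo_market-num_{}'.format(s) for s in suffixes]
print(selectors)
```

Each selector could then be passed to r.html.find() once, instead of once per match.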
Your last traceback shows that the Chromium user data path could not be cleaned up. The underlying Pyppeteer library configures the headless Chromium browser with a temporary user data path, and in your case the directory contains some still-locked resource. You can ignore the error, although you may want to try and remove any remaining files in the .pyppeteer folder at a later time.
Do you need it to go through Requests-HTML? On the day you posted, the repo was 4 days old and in the 3 days that have passed there have been 50 commits. It's not going to be completely stable for some time.
See here:
https://github.com/kennethreitz/requests-html/graphs/commit-activity
OTOH, there is an API for gdax.
https://docs.gdax.com/#market-data
Now if you're dead set on using Py3, there is a python client listed on the GDAX website. Upfront I'll mention that it's the unofficial client; however, if you use this you'd be able to quickly and easily get responses from the official GDAX api.
https://github.com/danpaquin/gdax-python
In case you want another approach, here is how to scrape the page with Selenium:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import TimeoutException

chrome_path = r"C:\Users\Mike\Desktop\chromedriver.exe"
driver = webdriver.Chrome(chrome_path)
driver.get("https://www.gdax.com/trade/LTC-EUR")
# XPath tests an attribute with @, e.g. @class
item = driver.find_element_by_xpath('''//span[@class='MarketInfo_market-num_1lAXs']''')
item = item.text
print(item)
driver.close()
result:177.60 EUR
I'm using Playwright to access and interact with a website, and it was going perfectly until I reached a page where I can't interact with any button or search bar to apply a filter. I can use .locator('xpath') to find the element, but when I try .click('xpath'), .fill('xpath') or even .locator('xpath').click(), I receive the error below:
Traceback (most recent call last):
File "c:\Users\Usuario\Desktop\Python Files\join\necessidades\join.py", line 24, in <module>
pagina.locator('//*[@id="jrhFrm:barFiltro:filtros:nomeDoCurso_hinput"]').click()
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\sync_api\_generated.py", line 13670, in click
self._sync(
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_sync_base.py", line 104, in _sync
return task.result()
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_locator.py", line 146, in click
return await self._frame.click(self._selector, strict=True, **params)
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_frame.py", line 489, in click
await self._channel.send("click", locals_to_params(locals()))
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_connection.py", line 44, in send
return await self._connection.wrap_api_call(
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_connection.py", line 419, in wrap_api_call
return await cb()
File "C:\Users\Usuario\AppData\Local\Programs\Python\Python39\lib\site-packages\playwright\_impl\_connection.py", line 79, in inner_send
result = next(iter(done)).result()
playwright._impl._api_types.TimeoutError: Timeout 30000ms exceeded.
=========================== logs ===========================
waiting for locator("xpath=//*[@id=\"jrhFrm:barFiltro:filtros:nomeDoCurso_hinput\"]")
Here's the inspection of the page to ~maybe~ help understand the context. I don't know why the search bar is inside a tag.
Example of my code so far, with the Codegen "suggestion"
from playwright.sync_api import sync_playwright
from time import sleep

with sync_playwright() as p:
    navegador = p.chromium.launch(headless=False)
    pagina = navegador.new_page()
    pagina.goto("page_url")
    pagina.fill('full_xpath from Username input', 'USERNAME')
    pagina.fill('full_xpath from Password input', 'Password')
    pagina.click('full_xpath from Enter button')
    try:
        pagina.click('full_xpath from a boring pop-up that sometimes shows up')
    except:
        pass
    sleep(10)  # waiting for the page to fully load
    pagina.click('full_xpath from the title of a Menu Item called Trainings')
    pagina.click('full_xpath from an Item called Course List that appeared from the Menu List')
    # HERE'S WHERE I'M HAVING PROBLEM
    sleep(5)  # waiting for the page to fully load
    pagina.locator('full_xpath from the search bar that I want to fill').fill('text I need to insert to search the Training')
    # THE BELOW CODE WAS GENERATED BY codegen
    pagina.frame_locator("#embedJoinRhJsf").locator("[id=\"jrhFrm\\:barFiltro\\:filtros\\:cursoPesquisa\"]").fill("TEXT")  # raises the exception that I posted above
Have you tried using a general CSS selector?
I'm unsure why your XPath selector doesn't work, and you may have a specific reason for using XPaths, but by default I use general CSS selectors rather than XPaths when I can. Right-click the element and choose Copy > Selector.
From your pic it looks like it will be "#jrhFRM:barFiltro:filtros:cursoPesquisa"
Thus
pagina.locator("#jrhFRM:barFiltro:filtros:cursoPesquisa").click()
Hope this works.
Also, check that you have copied the correct xpath, as the id in your error message...
"jrhFrm:barFiltro:filtros:nomeDoCurso_hinput"
...does not look like it matches the id in your screenshot...
"#jrhFRM:barFiltro:filtros:cursoPesquisa"
Let us know.
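One caveat worth checking, sketched below under assumptions: in CSS a colon normally introduces a pseudo-class, so an id containing colons generally has to be escaped or matched with an attribute selector (which is what the Codegen line in the question does). The helper name id_selector is hypothetical, and the id and frame name are simply copied from the question. Since the asker reports that the Codegen line also timed out, this may not be the whole story.

```python
def id_selector(element_id):
    """Build a CSS attribute selector matching an element by its exact id.

    An attribute selector avoids having to escape ':' characters, which a
    plain '#id' selector would otherwise treat as pseudo-class markers.
    """
    # Escape backslashes and double quotes so the value stays a valid CSS string.
    escaped = element_id.replace("\\", "\\\\").replace('"', '\\"')
    return '[id="{}"]'.format(escaped)


selector = id_selector("jrhFrm:barFiltro:filtros:cursoPesquisa")
print(selector)

# Hypothetical use with the names from the question. The field appears to
# live inside a frame, so the lookup goes through frame_locator:
# pagina.frame_locator("#embedJoinRhJsf").locator(selector).fill("TEXT")
```

The id's capitalisation also matters here: the answer writes jrhFRM while the error message and Codegen output show jrhFrm, and ids are matched case-sensitively.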
I am trying to extract all the comments on a movie from this page https://www.imdb.com/title/tt0114709/reviews?ref_=tt_ql_3 but some of them are hidden behind a "Load More" button. I have tried using Selenium to click this button, but it doesn't seem to work. Here are my code and the error message, in case someone has an idea of how to achieve this.
h = httplib2.Http("./docs/.cache")
resp, content = h.request(url, "GET")
soup = bs4.BeautifulSoup(content, "html.parser")
divs = soup.find_all("div")
driver = webdriver.Chrome(executable_path='C:\Program Files\Intel\iCLS Client\chromedriver.exe')
driver.get(url)
html = driver.page_source.encode('utf-8')
while driver.find_elements_by_class_name("load-more-data"):
    driver.find_elements_by_name("Load More").click()
Traceback (most recent call last):
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 567, in <module>
Mat()
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 518, in Mat
dicoCam =testC.extract_data()
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 368, in extract_data
self.extract_comment(movie, url)
File "C:/Users/demo/PycharmProjects/untitled/Extraction.py", line 469, in extract_comment
driver.find_elements_by_name("Load More").click()
AttributeError: 'list' object has no attribute 'click'
As you can see in the error message, a list is returned when doing:
driver.find_elements_by_name("Load More")
That's why I suggest doing this:
driver.find_elements_by_name("Load More")[0].click()
You have to make sure that there is only one element named Load More. If there are several, change the list index [0] to pick the one you want.
Hope that helped.
EDIT: If you still get error messages such as list index out of range, then driver.find_elements_by_name() isn't finding the button the way you want (it matches on the element's name attribute, not its visible text). I'm not an expert at web automation with Python, but you should look for functions like driver.find_elements_by_innerhtml() or driver.find_elements_by_text(). Is there any function like that?
The error occurs because you search with find_elements_by_name (note the plural "elements"), which returns a list since you are asking it to find multiple elements. If you want to keep clicking the "Load More" button until it is gone, I suggest:
from selenium.common.exceptions import NoSuchElementException

while True:
    try:
        driver.find_element_by_class_name("load-more-data").click()
    except NoSuchElementException:
        # the button is no longer on the page
        break
I'm not sure whether the class names are correct, though, since they are based on your example; I didn't inspect the web page you gave. You can adapt my code to your situation if it doesn't work.
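A loop like this can spin forever if the button never disappears, so here is a minimal, driver-agnostic sketch of the same idea with an upper bound on clicks. The helper click_until_gone and its parameters are hypothetical names, not part of Selenium; the lookup and the "not found" exception type are passed in so the function makes no assumptions about the page.

```python
def click_until_gone(find_button, not_found_errors, max_clicks=100):
    """Click the element returned by find_button() until it cannot be found.

    find_button      -- callable returning an object with a .click() method
    not_found_errors -- exception type (or tuple of types) meaning "no button"
    max_clicks       -- safety bound so the loop always terminates
    Returns the number of clicks performed.
    """
    clicks = 0
    while clicks < max_clicks:
        try:
            button = find_button()
        except not_found_errors:
            break  # nothing left to click
        button.click()
        clicks += 1
    return clicks


# Hypothetical use with Selenium, reusing the class name from the question:
# from selenium.common.exceptions import NoSuchElementException
# click_until_gone(
#     lambda: driver.find_element_by_class_name("load-more-data"),
#     NoSuchElementException,
# )
```

In practice a short wait between clicks may also be needed so the next batch of reviews has time to load.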
I am currently building a website using CherryPy and have run into an error when attempting to open a page with a query string in the URL. When I open http://localhost:8080/protected/my_page.html?pk=3e4e285c-ed33-403e-a7a1-6b79b5f8356d I get this error:
Traceback (most recent call last):
File "C:\Python27\lib\site-packages\cherrypy\_cprequest.py", line 670, in respond
response.body = self.handler()
File "C:\Python27\lib\site-packages\cherrypy\lib\encoding.py", line 221, in __call__
self.body = self.oldhandler(*args, **kwargs)
File "C:\Python27\lib\site-packages\cherrypy\_cpdispatch.py", line 66, in __call__
raise sys.exc_info()[1]
HTTPError: (404, 'Unexpected query string parameters: pk')
I currently host this page in a /protected/ directory, which requires authentication. But if I move a copy of the page back to my root directory (/wwwroot) and then load the exact same URL (minus the /protected/), http://localhost:8080/my_page.html?pk=3e4e285c-ed33-403e-a7a1-6b79b5f8356d, the page loads successfully and all the fields are populated with the data matching my query. That, however, leaves the page available for anyone to access rather than requiring authentication.
The page currently should load and if the query string is in the url it should get the value from this line of python code pk = context.get("pk", default = "") and then there is a DB lookup using the cx_Oracle python module and the pk variable to find the matching record to display on the page. I tried commenting out the pk variable assignment from my page to see if it would work but I am greeted with the same error.
Besides just moving the page to my base directory to make it work, does anyone have any idea how to resolve this error?
I'm trying to set up a Google Compute Engine server to pull options data using Python Pandas. When I make this request from my Mac at home, I only have problems late at night when Yahoo! is resetting its servers (the data is being pulled from Yahoo! Finance). But when I try doing the same thing from my Compute Engine server, the request always fails for some of the stocks I'm interested in, although it typically works for options on larger companies, such as 'aapl' or 'ge'. On my computer at home, running it at the same time, the same requests succeed for both small and large companies.
The requests do typically take a few seconds, maybe as many as 15. Is there a way to get to more extensive logs as to what is going on when I make these requests on the Google servers? The only things I can think of would be that there are permissions issues for some reason with these specific http requests or that there is a timeout configured that's interfering. But as far as I can tell, the general timeout should be 75 seconds for that kind of request, and there's no way it's taking that long.
Here's a sample of what I see from the python shell:
>>> from pandas.io.data import Options
>>> spwr = Options('spwr', 'yahoo')
>>> data = spwr.get_all_data()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/disk1/venv/optbot/local/lib/python2.7/site-packages/pandas/io/data.py", line 1090, in get_all_data
return self._get_data_in_date_range(dates=expiry_dates, call=call, put=put)
File "/mnt/disk1/venv/optbot/local/lib/python2.7/site-packages/pandas/io/data.py", line 1104, in _get_data_in_date_range
frame = self._get_option_data(expiry=expiry_date, name=name)
File "/mnt/disk1/venv/optbot/local/lib/python2.7/site-packages/pandas/io/data.py", line 723, in _get_option_data
frames = self._get_option_frames_from_yahoo(expiry)
File "/mnt/disk1/venv/optbot/local/lib/python2.7/site-packages/pandas/io/data.py", line 655, in _get_option_frames_from_yahoo
option_frames = self._option_frames_from_url(url)
File "/mnt/disk1/venv/optbot/local/lib/python2.7/site-packages/pandas/io/data.py", line 692, in _option_frames_from_url
raise RemoteDataError('Received no data from Yahoo at url: %s' % url)
pandas.io.data.RemoteDataError: Received no data from Yahoo at url: http://finance.yahoo.com/q/op?s=SPWR&date=1430438400
>>> aapl = Options('aapl', 'yahoo')
>>> data = aapl.get_all_data()
>>>
I've never yet been successful in getting the options data for 'spwr', but usually it will work for larger companies.
Any ideas how I might fix the issue? Or get to logs that will tell me more about what's happening here?
This is caused by an issue in Pandas 0.15.2. When I reverted back to Pandas 0.15.1, it started working again. The issue has been filed with Pandas. Check there to see if it has been resolved in later releases.
I'm working on a scraper using Mechanize and Beautiful Soup in Python and for some reason redirects aren't working. Here's my code (I apologize for naming my variables "thing" and "stuff"; I don't normally do that, trust me):
stuff = soup.find('div', attrs={'class' : 'paging'}).ul.findAll('a', href=True)
for thing in stuff:
    pageUrl = thing['href']
    print pageUrl
    req = mechanize.Request(pageUrl)
    response = browser.open(req)
    searchPage = response.read()
    soup = BeautifulSoup(searchPage)
    soupString = soup.prettify()
    print soupString
Anyway, products on Kraft's website that have more than one page of search results display a link to go to the next page(s). The source code lists, for example, this as the next page for Kraft's line of steak sauces and marinades, which redirects to this
Anyway, thing['href'] has the old link in it because it scrapes the web page for it; one would think that doing browser.open() on that link would cause mechanize to go to the new link and return that as a response. However, running the code gives this result:
http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2
Traceback (most recent call last):
File "C:\Development\eclipse\mobile development\Crawler\src\Kraft.py", line 58, in <module>
response = browser.open(req)
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 203, in open
File "build\bdist.win-amd64\egg\mechanize\_mechanize.py", line 255, in _mech_open
mechanize._response.httperror_seek_wrapper: HTTP Error 408: Request Time-out
I get a time-out; I imagine it's because, for some reason, mechanize is looking for the old URL and isn't being redirected to the new one (I also tried this with urllib2 and received the same result). What's going on here?
Thanks for the help and let me know if you need any more information.
Update: Alright, I enabled logging; now my code reads:
req = mechanize.Request(pageUrl)
print logging.INFO
When I run it I get this:
url argument is not a URI (contains illegal characters) u'http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2'
20
Update 2 (which occurred while writing the first update): It turns out that it was the spaces in my string! All I had to do was this: pageUrl = thing['href'].replace(' ', "+") and it works perfectly.
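For reference, a more general way to handle this is to percent-encode the illegal characters rather than replacing only spaces. The following is a minimal Python 3 sketch using the standard library's urllib.parse; the helper name escape_url is hypothetical and the URL is the one printed above. It encodes spaces as %20 instead of +, which should be equivalent for a query string.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url(url):
    """Percent-encode characters that are illegal in a URL, keeping its structure."""
    parts = urlsplit(url)
    # '%' stays in the safe set so already-encoded sequences are not encoded twice.
    path = quote(parts.path, safe="/%")
    # Keep the query-string delimiters intact; encode spaces and the like.
    query = quote(parts.query, safe="=&%+")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


raw = ("http://www.kraftrecipes.com/products/pages/productinfosearchresults.aspx"
       "?catalogtype=1&brandid=1&searchtext=a.1. steak sauces and marinades&pageno=2")
fixed = escape_url(raw)
print(fixed)
```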
Both urllib2 and mechanize openers include a handler for redirect responses by default (you can check by looking at the handlers attribute), so I don't think the problem is that a redirect response isn't being followed correctly.
To troubleshoot the problem, you should capture the traffic in your web browser (in Firefox, Live HTTP Headers and HttpFox are useful for this) and compare it with the logs from your script. I'd recommend subclassing urllib2.BaseHandler to create your own handler that logs all the information you need for every request, and adding it to your opener object with the add_handler method.
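A minimal sketch of such a logging handler follows, written for Python 3, where urllib2 became urllib.request. The class name LoggingHandler and its log attribute are hypothetical; the hooks rely on the opener calling methods named http_request and http_response on each registered handler. It only records the method, URL and status code, and could be extended to capture headers.

```python
import urllib.request


class LoggingHandler(urllib.request.BaseHandler):
    """Record every outgoing request and incoming response seen by the opener."""

    def __init__(self):
        self.log = []

    def http_request(self, request):
        # Called before the request is sent; must return the request.
        self.log.append(("request", request.get_method(), request.full_url))
        return request

    def http_response(self, request, response):
        # Called when a response arrives; must return the response.
        self.log.append(("response", response.getcode(), response.geturl()))
        return response

    # Reuse the same hooks for HTTPS traffic.
    https_request = http_request
    https_response = http_response


logger = LoggingHandler()
opener = urllib.request.build_opener()
opener.add_handler(logger)

# Hypothetical use: opener.open(page_url), then inspect logger.log to see
# which URLs were requested and which status codes came back.
```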