Python 3.X Extract Source Code ONLY when page is done loading

I submit a query on a web page. The query takes several seconds before it is done. Only when it is done does it display an HTML table that I would like to get the information from. Let's say this query takes a maximum of 4 seconds to load. While I would prefer to get the data as soon as it is loaded, it would be acceptable to wait 4 seconds then get the data from the table.
The issue I have is that when I make my urlopen request, the page hasn't finished loading yet. I tried loading the page, then issuing a sleep command, then loading it again, but that does not work either.
My code is
import urllib.request
import time
uf = urllib.request.urlopen(urlname)
time.sleep(3)
# decode() belongs on the bytes returned by read(), not on the response object
text = uf.read().decode('utf-8')
print(text)
The webpage I am looking at is http://bookscouter.com/prices.php?isbn=9781111835811 (feel free to ignore the interesting textbook haha)
And I am using Python 3.X on a Raspberry Pi

The prices you want are not in the page you're retrieving, so no amount of waiting will make them appear. Instead, the prices are retrieved by JavaScript in that page after it has loaded. The urllib module is not a browser, so it won't run that script for you. You'll want to figure out the URL of the AJAX request (a quick look at the page source gives a pretty big hint) and retrieve that instead. It's probably in JSON format, so you can just use Python's json module to parse it.
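A minimal sketch of that approach, assuming the AJAX endpoint returns a JSON object mapping vendor names to prices. The endpoint URL and the response shape below are guesses; find the real ones in the page source or in your browser's network tab:

```python
import json
import urllib.request

def fetch_json(url):
    """Download a URL and parse the body as JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def best_offer(prices):
    """Given a dict mapping vendor name -> buyback price,
    return the (vendor, price) pair with the highest offer."""
    vendor = max(prices, key=prices.get)
    return vendor, prices[vendor]

def main():
    # Hypothetical endpoint -- substitute the real AJAX URL from the page.
    url = "http://bookscouter.com/prices.php?isbn=9781111835811&format=json"
    prices = fetch_json(url)
    print(best_offer(prices))

# main()  # uncomment to run against the live endpoint
```

Because the data arrives as a plain dict once parsed, no waiting or sleeping is needed at all.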

Related

Impossible to recover some information with Beautifulsoup on a site

I need your help because, for the first time, I'm having trouble getting some information with BeautifulSoup.
I have two problems on this page:
The green GET COUPON CODE button appears after a few moments (see the GIF capture).
When we inspect the button link, we find a simple href attribute that calls an out.php script, which opens the destination link I am trying to capture.
Thank you for your help
Your problem is a little unclear, but if I understand correctly, your first problem is that the 'get coupon code' button has no usable link when you render the HTML that you receive from the original page request.
The markup for a lot of this page is rendered dynamically using JavaScript, so that button is missing its href value until it gets filled in later. You would also need to run the JavaScript on that page to render this after the initial request. You can't easily get this using just the Python requests library and BeautifulSoup. It will be a lot easier if you use Selenium, which lets you control a browser so it runs all that JavaScript for you; then you can grab the button info a couple of seconds after loading the page.
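A rough Selenium sketch of that approach. The URL is a placeholder, the 3-second sleep is arbitrary, and the href-extraction helper is just an illustration using the standard library:

```python
from html.parser import HTMLParser

class _AnchorParser(HTMLParser):
    """Tiny stdlib helper: pull the href out of an anchor tag's HTML."""
    def __init__(self):
        super().__init__()
        self.href = None
    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")

def extract_href(anchor_html):
    parser = _AnchorParser()
    parser.feed(anchor_html)
    return parser.href

def main():
    # third-party: pip install selenium (plus a matching chromedriver)
    import time
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://example.com/coupons")  # placeholder URL
    time.sleep(3)  # give the page's JavaScript time to fill in the href
    button = driver.find_element(By.LINK_TEXT, "GET COUPON CODE")
    print(extract_href(button.get_attribute("outerHTML")))
    driver.quit()

# main()  # uncomment to run with a real browser
```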
There is a way to do this all with plain requests, but it's a bit tedious. You would need to read through the requests the page makes and figure out which one fetches the link for the button. The upside is that it would cut both the number of steps and the time it takes to get the info you need. You could then use this new request every time to get the right PHP link and pull the info from there.
On your second point, maybe you're also trying to get the redirect link from that PHP link. From inspecting the network requests, it looks like the info will be found in the response headers; there is no body to inspect.
(I know it says 'from cache' but the point is that the redirect is being caused by the header info)
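To make that concrete: request the out.php URL without following redirects and read the destination from the Location response header. A sketch (the out.php URL is a placeholder, and requests is a third-party library):

```python
def redirect_target(headers):
    """Return the redirect destination from a mapping of response
    headers, or None. Header names are case-insensitive."""
    lowered = {k.lower(): v for k, v in headers.items()}
    return lowered.get("location")

def main():
    import requests  # third-party: pip install requests

    # Placeholder URL -- substitute the real href from the coupon button.
    resp = requests.get("https://example.com/out.php?id=123",
                        allow_redirects=False)
    print(resp.status_code)            # typically 301/302 for a redirect
    print(redirect_target(resp.headers))

# main()  # uncomment to run against the real URL
```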

Script cannot fetch data from a web page

I am trying to write a program in Python that can take the name of a stock and its price and print it. However, when I run it, nothing is printed. It seems like the data is having a problem being fetched from the website. I double-checked that the path from the web page is correct, but for some reason the text does not show up.
from lxml import html
import requests
page = requests.get('https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes')
tree = html.fromstring(page.content)
# note: XPath attribute tests use @class, not #class
Prices = tree.xpath('//span[@class="priceText__1853e8a5"]/text()')
print('Prices:', Prices)
here is the website I am trying to get the data from
I have tried BeautifulSoup, but it has the same problem.
If you print the string page.content, you'll see that the website code it captures is actually for a captcha test, not the "real" destination page you see when you visit the site manually. The website was smart enough to see that your request to this URL came from a script rather than a human, and it effectively prevented your script from scraping any real content. So Prices is empty because there simply isn't a span tag of class "priceText__1853e8a5" on this captcha page. I get the same result when I try scraping with urllib2.
As others have suggested, Selenium (actual web automation) might be able to launch the page and get you what you need. The ID looks dynamically generated, though I do get the same one when I manually look at the page. Another alternative is to simply find a different site that can give you the quote you need without blocking your script. I tried it with https://tradingeconomics.com/ukx:ind and that works. Though of course you'll need a different xpath to find the cell you need.
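One way to confirm you are being served the bot-check page rather than the quote page is a quick heuristic on the returned HTML. The marker strings here are guesses; print page.content and adjust them to what you actually see:

```python
import urllib.request

CAPTCHA_MARKERS = ("captcha", "are you a robot", "unusual activity")

def looks_like_captcha(page_text):
    """Heuristic: does the returned HTML look like a bot-check page?"""
    text = page_text.lower()
    return any(marker in text for marker in CAPTCHA_MARKERS)

def main():
    url = "https://www.bloomberg.com/quote/UKX:IND?in_source=topQuotes"
    with urllib.request.urlopen(url) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    if looks_like_captcha(html):
        print("Blocked: this is a captcha page, so the price span is absent.")

# main()  # uncomment to test against the live site
```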

How to optimize web site / make it load faster?

I have a webpage which does web scraping and displays news in a slideshow. It also extracts tweets from Twitter using tweepy.
The code sequence is below:
class extract_news:
    def bcnews(self):
        # code to extract news
    def func2(self):
        # code to extract news
    ...
    ...
    def extractfromtwitter(self):
        # code to extract using tweepy
I have multiple such functions to extract from different websites using BS4 and to display the news and tweets. I am using Flask to run this code.
But the page takes about 20 seconds to load, and if someone tries to access it remotely, it takes so long that the browser gives the error "Connection Timed Out" or just doesn't load.
How can I make this page load faster? Say, in under 5 seconds.
Thanks!
You need to identify the bottlenecks in your code and then figure out how to reduce them. It's difficult to help you with the minimal amount of code that you have provided, but the most likely cause is that each HTTP request takes most of the time, and the parsing is probably negligible in comparison.
See if you can figure out a way to parallelise the HTTP requests, e.g. using the multiprocessing or threading modules.
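A sketch of that parallelisation with the standard library's concurrent.futures; the slow_source functions below are stand-ins for the real scraper methods (bcnews, extractfromtwitter, etc.):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_all(fetchers, max_workers=8):
    """Run each no-argument fetch function in its own thread and
    return the results in the original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(f) for f in fetchers]
        return [future.result() for future in futures]

def slow_source(name, delay):
    """Simulate one scraper whose HTTP round trip takes `delay` seconds."""
    def fetch():
        time.sleep(delay)
        return name
    return fetch

sources = [slow_source("bbc", 0.5), slow_source("cnn", 0.5),
           slow_source("twitter", 0.5)]
start = time.time()
results = fetch_all(sources)
# The 0.5 s waits overlap, so this takes ~0.5 s instead of ~1.5 s.
print(results, round(time.time() - start, 1))
```

Because scraping is I/O-bound, threads are usually enough here; multiprocessing only pays off if the parsing itself becomes the bottleneck.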
I agree with the others. To give a concrete answer/solution we will NEED to see the code.
However, in a nutshell, what you will need to do is profile the application with your browser's DevTools. That will typically lead you to push synchronous JavaScript loading below the CSS and markup so the visible content renders first.
Also create a routine to load an initial chunk of content (approximately one page or slide worth) so that the user will have something to look at. The rest can load in the background and they will never know the difference. It will almost certainly be available before they are able to click to scroll to the next slide. Even if it does take 10 seconds or so.
Perceived performance is what I am describing here. Yes, I agree, you can and should find ways to improve the overall loading. But arguably more important is improving the "perceived performance". This is done, as I said, by loading some initial content and then streaming in the rest immediately afterwards.

Trying to read data from War Thunder local host with python

Basically I'm using Python to send serial data to an Arduino so that I can make moving dials using data from the game. This should work because you can use the URL localhost:8111 to get a list of these stats when in-game. The problem is that I'm using urllib and BeautifulSoup, but they seem to be blindly reading the source code, not giving me the data I need.
The data I need comes up when I inspect the element of that page. Other pages seem to suggest that running the page's JavaScript in Python would fix this, but I have found no way of doing that. Any help here would be great, thanks.
Not the poster, but I have been working on this with him. We managed to get it working. In case anyone else is having this problem, here is the code that got it to display our speed:
from selenium import webdriver
import time
driver = webdriver.Chrome()
driver.get("http://localhost:8111")
time.sleep(1)
while True:
    elements = driver.find_element_by_id("stt-IAS, km/h")
    print(elements.text)
Don't know why the time.sleep is needed but the code doesn't seem to work without it.
Your problem might be that the page elements are Dynamic. (Revealed by JavaScript for example)
Why is this a problem? A: You can't access those tags or data with a plain HTTP request. You'll have to use a headless or automated browser (learn more about Selenium).
Then make a session through selenium and keep feeding the data the way you wanted to the Arduino.
Summary: if you inspect elements you can see the tag, but if you go to view source you can't see it. This can't be solved using bs4 or requests alone. You'll have to use a module called Selenium or something similar.
Here is a Python module that you can use to get all air-vehicle telemetry data from the War Thunder localhost server pages "indicators" and "status". The contents of each of these pages are static JSON descriptions of the vehicle's current telemetry values.
The Python package uses the requests module to query the localhost server for the data, converts the returned JSON into dictionaries, and then consolidates everything into a single telemetry dictionary. That data can then be used in other Python processes, such as data logging or graphing.
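A minimal sketch of that pattern using only the standard library. The /indicators and /status paths come from the description above, but the exact key names in the JSON (like the airspeed key below) are guesses:

```python
import json
import urllib.request

def fetch_json(url):
    """Download a URL and parse the body as JSON."""
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))

def merge_telemetry(indicators, status):
    """Consolidate the two JSON payloads into one telemetry dict.
    Keys present in both pages take the value from `status`."""
    telemetry = dict(indicators)
    telemetry.update(status)
    return telemetry

def main():
    base = "http://localhost:8111"
    telemetry = merge_telemetry(fetch_json(base + "/indicators"),
                                fetch_json(base + "/status"))
    print(telemetry.get("IAS, km/h"))  # guessed key; feed this to the Arduino

# main()  # uncomment while War Thunder is running
```

This avoids a browser entirely, since the JSON pages need no JavaScript to render.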

How can I make Python wait until a webpage loads some data I'm trying to get?

I need to get some numbers from this website
http://www.preciodolar.com/
But the data I need takes a little time to load, and the page shows a 'wait' message until it loads completely.
I used findall and some regular expressions to get the data I need, but when I execute the script, Python gives me the 'wait' message that appears before the data loads.
Is there a way to make Python 'wait' until all the data is loaded?
My code looks like this:
import urllib.request
from re import findall

def divisas():
    pag = urllib.request.urlopen('http://www.preciodolar.com/')
    html = pag.read().decode('utf-8')
    brasil = findall('<td class="usdbrl_buy">(.*?)</td>', html)
    return brasil
This is because the page is generated with JavaScript. You're getting the full HTML, but the JavaScript handles changing the DOM and showing the information.
You have two options:
Try to interpret the JavaScript (not easy). There are a lot of questions about this on Stack Overflow already.
Find the URL the page is hitting with AJAX to get the actual data and use that.
It really just depends on what you need the page for. It looks like you are trying to parse the data and so the second option allows you to make a single request to just get the raw data.
You should find the AJAX or JSONP request instead.
In this case, it's JSONP: http://api.preciodolar.com/api/crossdata.php?callback=jQuery1112024555979575961828_1442466073980&_=1442466073981
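Since JSONP is just JSON wrapped in a callback call, you can strip the padding and parse the rest with the json module. A sketch (the callback parameter value is arbitrary, and the response shape is whatever the API returns):

```python
import json
import re
import urllib.request

def jsonp_to_json(payload):
    """Strip a JSONP wrapper such as 'cb({...});' and parse the JSON inside."""
    match = re.search(r'^[^(]+\((.*)\)\s*;?\s*$', payload, re.DOTALL)
    if match is None:
        raise ValueError("not a JSONP payload")
    return json.loads(match.group(1))

def main():
    url = ("http://api.preciodolar.com/api/crossdata.php"
           "?callback=cb&_=1442466073981")
    with urllib.request.urlopen(url) as resp:
        data = jsonp_to_json(resp.read().decode("utf-8"))
    print(data)

# main()  # uncomment to hit the live endpoint
```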
